
(1)

Non-Convex Non-IID Non-Stochastic Non-Serial

Stochastic Variance-Reduced Optimization for Machine Learning

Part 2: Weakening the Assumptions

Presenters: Francis Bach and Mark Schmidt

2017 SIAM Conference on Optimization

May 23, 2017

(2)


Outline

1 Non-Convex

2 Non-IID

3 Non-Stochastic

4 Non-Serial

(3)

Linear Convergence of Gradient-Based Methods

We've seen a variety of results of the form:

Smoothness + Strong Convexity ⇒ Linear Convergence

Error on iteration t is O(ρ^t), or we need O(log(1/ε)) iterations.

But even simple models are often not strongly convex:
least squares, logistic regression, SVMs with bias, etc.

How much can we relax strong convexity?

Smoothness + ??? ⇒ Linear Convergence

(6)

Polyak-Łojasiewicz (PL) Inequality

For example, in 1963 Polyak showed linear convergence of GD assuming only

(1/2)‖∇f(x)‖² ≥ µ(f(x) − f*),

that is, the gradient grows as a quadratic function of the sub-optimality.

This holds for strongly-convex problems, but also for problems of the form f(x) = g(Ax) with strongly-convex g.
Includes least squares, logistic regression (on a compact set), etc.

A special case of the Łojasiewicz inequality [1963].

We'll call this the Polyak-Łojasiewicz (PL) inequality. Using the PL inequality we can show

Smoothness + PL Inequality ⇒ Linear Convergence

(10)

PL Inequality and Invexity

The PL inequality doesn't require uniqueness or convexity.

However, it implies invexity.

For smooth f, invexity ⇔ all stationary points are global optima.

Example of an invex but non-convex function satisfying PL:

f(x) = x² + 3 sin²(x).

Gradient descent converges linearly on this non-convex problem.
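As a quick numerical sanity check (our own sketch, not from the slides), gradient descent with step size 1/L converges to the global minimum of this function; here L = 8, since f''(x) = 2 + 6 cos(2x) ≤ 8.

```python
import math

def f(x):
    return x**2 + 3 * math.sin(x)**2

def grad(x):
    # f'(x) = 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x)
    return 2 * x + 3 * math.sin(2 * x)

# Gradient descent with step 1/L, where L = 8 bounds f''.
x = 3.0
for _ in range(100):
    x = x - (1.0 / 8.0) * grad(x)

print(f(x))  # approaches the global minimum f(0) = 0
```

Despite the non-convexity (f has inflection regions from the sin² term), the iterates reach the unique stationary point x = 0.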

(13)

Weaker Conditions than Strong Convexity (SC)

How does the PL inequality [1963] relate to more recent conditions?

EB: error bounds [Luo & Tseng, 1993].
QG: quadratic growth [Anitescu, 2000].
ESC: essential strong convexity [Liu et al., 2013].
RSI: restricted secant inequality [Zhang & Yin, 2013].
(RSI plus convexity is "restricted strong convexity".)
Semi-strong convexity [Gong & Ye, 2014].
(Equivalent to QG plus convexity.)
Optimal strong convexity [Liu & Wright, 2015].
(Equivalent to QG plus convexity.)
WSC: weak strong convexity [Necoara et al., 2015].

Proofs are often more complicated under these conditions.
Are they more general?
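For reference, these conditions have standard one-line forms (following the notation of Karimi et al. [2016]; this summary is ours, with x_p the projection of x onto the solution set and f* the optimal value):

```latex
\begin{align*}
\text{(PL)}:  &\quad \tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f^*\bigr) \\
\text{(EB)}:  &\quad \|\nabla f(x)\| \ge c\,\|x_p - x\| \\
\text{(QG)}:  &\quad f(x) - f^* \ge \tfrac{\mu}{2}\,\|x_p - x\|^2 \\
\text{(RSI)}: &\quad \langle \nabla f(x),\, x - x_p \rangle \ge \mu\,\|x_p - x\|^2
\end{align*}
```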

(22)

Relationships Between Conditions

For a function f with a Lipschitz-continuous gradient, we have:

(SC) → (ESC) → (WSC) → (RSI) → (EB) ≡ (PL) → (QG).

If we further assume that f is convex, then

(RSI) ≡ (EB) ≡ (PL) ≡ (QG).

QG is the weakest condition, but it allows non-global local minima.
PL ≡ EB are the most general conditions that still guarantee a global minimum.

(26)

Convergence of Huge-Scale Methods

For large datasets, we typically don't use GD.
But the PL inequality can be used to analyze other algorithms.

It has now been used to analyze:

Classic stochastic gradient methods [Karimi et al., 2016]:
O(1/k) without strong convexity using the basic method.
Coordinate descent methods [Karimi et al., 2016].
Frank-Wolfe [Garber & Hazan, 2015].
Variance-reduced stochastic gradient methods (like SAGA and SVRG) [Reddi et al., 2016]:
linear convergence without strong convexity.

(29)

Relevant Problems for Proximal-PL

Proximal-PL is a generalization of PL to non-smooth composite problems F(x) = f(x) + g(x).
Reddi et al. [2016] analyze proximal-SVRG and proximal-SAGA.

Proximal-PL is satisfied when:

f is strongly convex.
f satisfies PL and g is constant.
f = h(Ax) for strongly-convex h and g is the indicator of a polyhedral set.
F is convex and satisfies QG (SVM and LASSO).
Any problem satisfying the KL inequality or error bounds (equivalent to these):
group L1-regularization, nuclear-norm regularization.

Another important problem class: principal component analysis (PCA).
Non-convex and doesn't satisfy PL, but we can find the global optimum.
It satisfies PL on a Riemannian manifold [Zhang et al., 2016].
New faster methods based on SVRG [Shamir, 2015, Garber & Hazan, 2016].
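To make the composite setting concrete, here is a minimal proximal-gradient sketch (our illustration, not from the slides) for F(x) = f(x) + λ‖x‖₁, using the well-known soft-thresholding proximal operator of the L1 norm; the LASSO instance below is an assumption for demonstration.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t*||.||_1: shrink each coordinate toward 0 by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_gradient(grad_f, x0, step, lam, iters):
    """Proximal gradient for min_x f(x) + lam*||x||_1."""
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - step * grad_f(x), step * lam)
    return x

# LASSO example: f(x) = 0.5*||Ax - b||^2 with a sparse true signal
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = A @ np.array([1.0, -2.0] + [0.0] * 8)
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L with L = ||A||_2^2
x = prox_gradient(lambda v: A.T @ (A @ v - b), np.zeros(10), step, 0.1, 500)
```

The soft-thresholding step zeros out inactive coordinates exactly, which is why the L1 term induces sparsity.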

(33)

But can we say anything about general non-convex functions?

What if all we know is that ∇f is Lipschitz and f is bounded below?

(34)

Non-Convex Rates for Gradient Descent

For strongly-convex functions, GD satisfies

‖x_t − x*‖² = O(ρ^t).

For convex functions, GD still satisfies

f(x_t) − f(x*) = O(1/t).

For non-convex functions that are bounded below, GD still satisfies

min_{k≤t} ‖∇f(x_k)‖² = O(1/t),

a convergence rate in terms of getting to a critical point [Nesterov, 2003].

(36)

Non-Convex Rates for Stochastic Gradient

For stochastic gradient methods, Ghadimi & Lan [2013] show a similar result,

E[‖∇f(x_k)‖²] = O(1/√t),

for a randomly-chosen k ≤ t.

For variance-reduced methods, Reddi et al. [2016] show we get the faster rate

E[‖∇f(x_k)‖²] = O(1/t),

for a randomly-chosen k ≤ t.

(39)

Non-Convex Rates for Stochastic Gradient

Number of gradient evaluations to guarantee being ε-close to a critical point:

Gradient descent:     O(n/ε)
Stochastic gradient:  O(1/ε²)
Variance-reduced:     O(n + n^(2/3)/ε)

We have analogous results for variance-reduced proximal+stochastic methods [Reddi et al., 2016].

We cannot show analogous results for classic proximal stochastic methods;
all existing proximal+stochastic results require the noise to go to zero.

Open problem that needs to be resolved: are analogous results possible?
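The variance-reduced methods in the table can be sketched in a few lines; this is our own minimal SVRG illustration (the least-squares instance and parameter values are assumptions, not from the slides).

```python
import numpy as np

def svrg(grad_i, x0, n, step, outer, inner, rng):
    """SVRG sketch: each outer epoch recomputes the full gradient at a
    snapshot; inner steps use variance-reduced stochastic gradients."""
    x = x0.copy()
    for _ in range(outer):
        y = x.copy()                                   # snapshot
        mu = sum(grad_i(y, i) for i in range(n)) / n   # full gradient at y
        for _ in range(inner):
            i = rng.integers(n)
            # unbiased estimate whose variance shrinks near the snapshot
            g = grad_i(x, i) - grad_i(y, i) + mu
            x = x - step * g
    return x

# Least-squares example: f_i(x) = 0.5*(a_i^T x - b_i)^2
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 5))
b = A @ np.ones(5)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])
x = svrg(grad_i, np.zeros(5), 40, 0.01, 50, 200, rng)
```

The occasional full-gradient pass is what makes the per-iteration stochastic gradient "variance-reduced": as x and the snapshot both approach the solution, the noise in g vanishes.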

(43)


Outline

1 Non-Convex

2 Non-IID

3 Non-Stochastic

4 Non-Serial

(44)

Non-IID Setting

We discussed stochastic minimization problems,

argmin_x E[f_i(x)],

where we have the ability to generate IID samples f_i(x).

Using IID samples is justified by the law of large numbers.
But it's almost never true.

What if we can't get IID samples?

A classic non-IID sampling scheme [Bertsekas & Tsitsiklis, 1996]:
samples follow a Markov chain whose stationary distribution gives E[f_i(x)].

We obtain the standard guarantees if the Markov chain mixes fast enough [Duchi et al., 2012].

(47)

General Sampling

What about general non-IID sampling schemes?

What if our samples f_i come from an adversary?
Can we say anything in this case?

The optimization error can be arbitrarily bad, but we can bound the regret...

(50)

Online Convex Optimization

Consider the online convex optimization (OCO) framework [Zinkevich, 2003]:

At time t, make a prediction x_t.
Receive the next arbitrary convex loss f_t.
Pay a penalty of f_t(x_t).

The regret at time t is given by

∑_{k=1}^{t} [f_k(x_k) − f_k(x*)],

the total error compared to the best x* we could have chosen for the first t functions.

The x* is not the solution to the problem; it's just the best we could have done.
The x* depends on t: the "solution" is changing over time.
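A minimal sketch of this protocol (our illustration; the quadratic losses and the step sizes α_t = 1/√t, as in Zinkevich [2003], are assumptions):

```python
import numpy as np

# Online gradient descent on quadratic losses f_t(x) = 0.5*(x - z_t)^2,
# with step sizes alpha_t = 1/sqrt(t).
rng = np.random.default_rng(0)
z = rng.uniform(-1, 1, size=200)        # the adversary's (here random) targets

x, losses = 0.0, []
for t, zt in enumerate(z, start=1):
    losses.append(0.5 * (x - zt) ** 2)  # predict x_t, then pay f_t(x_t)
    x -= (x - zt) / np.sqrt(t)          # gradient step with alpha_t

# Regret vs. the best fixed point in hindsight (the mean, for quadratics)
x_star = z.mean()
regret = sum(losses) - 0.5 * np.sum((x_star - z) ** 2)
```

Note that the comparator x_star is computed with full knowledge of all 200 losses, which is exactly what makes sublinear regret a meaningful guarantee.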

(55)

Online Convex Optimization

Assuming everything is bounded, doing nothing has a regret of O(t).

Consider applying stochastic gradient, treating the f_t as the samples:

For convex functions, it has a regret of O(√t) [Zinkevich, 2003].
For strongly-convex functions, it has a regret of O(log(t)) [Hazan et al., 2006].
These rates are optimal.

Key idea: x* isn't moving faster than stochastic gradient is converging.

(58)

Online Convex Optimization

AdaGrad is a very popular online method [Duchi et al., 2011].
It improves the constants in regret bounds using a diagonal scaling,

x_{t+1} = x_t − α_t D_t^{−1} ∇f_t(x_t),

with diagonal entries

(D_t)_{ii} = δ + √(∑_{k=1}^{t} (∇_i f_k(x_k))²).

Adam is a generalization that is incredibly popular for deep learning [Kingma & Ba, 2015].
Though the trend is returning to variations on accelerated stochastic gradient.

Online learning remains an active area, and many variations exist:
Bandit methods only receive the evaluation f_t(x_t) rather than the function f_t.
Main application: internet advertising and recommender systems.
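The AdaGrad update above can be sketched directly (our illustration; the toy quadratic loss is an assumption):

```python
import numpy as np

def adagrad_step(x, g, hist, alpha=0.1, delta=1e-8):
    """One AdaGrad update: per-coordinate steps scaled by the square
    root of the accumulated squared gradients [Duchi et al., 2011]."""
    hist = hist + g ** 2                      # running sum of squared grads
    return x - alpha * g / (delta + np.sqrt(hist)), hist

# Toy run on the fixed loss f(x) = 0.5*||x - z||^2, z = (1, -1)
z = np.array([1.0, -1.0])
x, hist = np.zeros(2), np.zeros(2)
for _ in range(500):
    x, hist = adagrad_step(x, x - z, hist)    # gradient of the current loss
```

Coordinates with consistently large gradients get smaller effective step sizes, which is the source of the improved regret constants.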

(60)


Outline

1 Non-Convex

2 Non-IID

3 Non-Stochastic

4 Non-Serial

(61)

Graph-Structured Optimization

Another structure arising in machine learning is graph-structured problems,

argmin_x ∑_{(i,j)∈E} f_ij(x_i, x_j) + ∑_{i=1}^{n} f_i(x_i),

where E is the set of edges in a graph.

This includes quadratic functions,

xᵀAx + bᵀx = ∑_{i=1}^{n} ∑_{j=1}^{n} a_ij x_i x_j + ∑_{i=1}^{n} b_i x_i,

and other models like label propagation for semi-supervised learning.
The graph is the sparsity pattern of A.

(63)

Coordinate Descent for Graph-Structured Optimization

Coordinate descent seems well-suited to this problem structure:

argmin_x ∑_{(i,j)∈E} f_ij(x_i, x_j) + ∑_{i=1}^{n} f_i(x_i).

To update x_i, we only need to consider f_i and the f_ij for each neighbour.

With random selection of coordinates, the expected iteration cost is O(|E|/n).
This is n-times faster than a GD iteration, which costs O(|E|).
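A sketch of this O(degree) update (our illustration; the chain-graph quadratic below is an assumption):

```python
import numpy as np

def coord_descent(A_rows, diag, b, iters, rng):
    """Random coordinate descent on f(x) = 0.5*x^T A x - b^T x, with the
    off-diagonal of A stored as rows of (neighbour, value) pairs, so
    updating coordinate i costs O(degree(i))."""
    x = np.zeros(len(b))
    for _ in range(iters):
        i = rng.integers(len(b))
        # partial gradient: only i's neighbours in the graph contribute
        g = sum(v * x[j] for j, v in A_rows[i]) + diag[i] * x[i] - b[i]
        x[i] -= g / diag[i]   # exact minimization along coordinate i
    return x

# Chain graph: A = 2*I + path-graph Laplacian (positive definite)
n = 10
diag = np.array([2.0 + (2.0 if 0 < i < n - 1 else 1.0) for i in range(n)])
A_rows = [[(j, -1.0) for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
b = np.ones(n)
x = coord_descent(A_rows, diag, b, 2000, np.random.default_rng(0))
```

Each iteration touches only a node and its neighbours, which is what makes the per-iteration cost O(|E|/n) in expectation.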

(65)

Coordinate Descent for Graph-Structured Optimization

But for many problems, randomized coordinate descent doesn't work well...

[Figure: objective vs. epochs on graph-based label propagation, comparing cyclic, random, Lipschitz, Gauss-Southwell (GS), and Gauss-Southwell-Lipschitz (GSL) selection rules.]

It is often outperformed by the greedy Gauss-Southwell rule.
But is plotting "epochs" cheating, because Gauss-Southwell is more expensive?

(67)

Non-Convex Non-IID Non-Stochastic Non-Serial

Greedy Coordinate Descent

Gauss-Southwell greedy rule for picking a coordinate to update:
\[
\operatorname*{argmax}_{i} |\nabla_i f(x)|.
\]
[Figure: Gauss-Southwell selecting among coordinates $x_1$, $x_2$, $x_3$.]

Looks expensive because computing the gradient costs $O(|E|)$.

(72)
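A naive version of the rule makes the cost concern concrete. This sketch reuses the same hypothetical quadratic setup as before ($A$ as an adjacency dict, gradient maintained incrementally); the full scan over the gradient in the `max` is the seemingly expensive step, even though here the gradient itself is kept up to date cheaply.

```python
def gauss_southwell_quadratic(A, b, n_iters=500):
    """Greedy (Gauss-Southwell) coordinate descent on
    f(x) = 0.5 x^T A x + b^T x, with A an adjacency dict.
    The naive rule scans all n gradient entries each iteration."""
    n = len(b)
    x = [0.0] * n
    g = list(b)  # gradient A x + b at x = 0
    for _ in range(n_iters):
        i = max(range(n), key=lambda j: abs(g[j]))  # argmax_i |grad_i f(x)|
        delta = -g[i] / A[i][i]
        x[i] += delta
        for j, a_ij in A[i].items():  # incremental gradient update
            g[j] += delta * a_ij
    return x
```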

Non-Convex Non-IID Non-Stochastic Non-Serial

Cost of Greedy Coordinate Descent

Gauss-Southwell cost depends on graph structure.

Same is true of Lipschitz sampling.

Consider problems where maximum degree and average degree are similar:

Lattice graphs (max is 4, average is 4).

Complete graphs (max and average degrees are $n-1$).

Facebook graph (max is 7000, average is 200).

Here we can efficiently track the gradient and its max [Meshi et al., 2012].

Updating $x_i$ only changes $|\nabla_j f(x^k)|$ for $i$ and its neighbours.

We can use a max-heap to track the maximum.

(78)
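The heap-based bookkeeping can be sketched as follows, again on the hypothetical quadratic setup used above. Python's `heapq` is a min-heap, so magnitudes are negated, and stale entries are handled by lazy deletion: every time a gradient entry changes, its new magnitude is pushed, and popped entries that no longer match the current gradient are discarded. This is one standard way to realize the tracking idea attributed to Meshi et al. [2012], not their exact implementation.

```python
import heapq

def gs_heap_quadratic(A, b, n_iters=500):
    """Gauss-Southwell with a max-heap over |grad_i f|.
    Stale heap entries are skipped lazily by checking them
    against the current gradient value."""
    n = len(b)
    x = [0.0] * n
    g = list(b)  # gradient A x + b at x = 0
    heap = [(-abs(g[i]), i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(n_iters):
        # Pop until the top entry matches the current gradient magnitude.
        while True:
            neg_mag, i = heapq.heappop(heap)
            if -neg_mag == abs(g[i]):
                break
        delta = -g[i] / A[i][i]
        x[i] += delta
        for j, a_ij in A[i].items():  # only i and its neighbours change
            g[j] += delta * a_ij
            heapq.heappush(heap, (-abs(g[j]), j))
    return x
```

Each iteration pushes $O(\deg(i))$ entries at $O(\log n)$ cost apiece, so the per-iteration cost is governed by the degree of the updated node, which is why the max-vs-average degree comparison on the slide matters.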

Non-Convex Non-IID Non-Stochastic Non-Serial

Convergence Rate of Greedy Coordinate Descent

But don’t random and greedy have the same rate?

Nutini et al. [2015] show that the rate for Gauss-Southwell is
\[
f(x^k) - f^* \le \left(1 - \frac{\mu_1}{L}\right)^k [f(x^0) - f^*],
\]
where $\mu_1$ is the strong-convexity constant in the 1-norm.

The constant $\mu_1$ satisfies
\[
\underbrace{\frac{\mu}{n}}_{\text{random}} \;\le\; \underbrace{\mu_1}_{\text{greedy}} \;\le\; \underbrace{\mu}_{\text{gradient}},
\]
so we should expect more progress under Gauss-Southwell.

(81)

Non-Convex Non-IID Non-Stochastic Non-Serial

Gauss-Southwell-Lipschitz Rule

Nutini et al. [2015] also give a rule with a faster rate by incorporating the $L_i$,
\[
i_k = \operatorname*{argmax}_{i} \frac{|\nabla_i f(x^k)|}{\sqrt{L_i}},
\]
which is called the Gauss-Southwell-Lipschitz rule.

At least as fast as the GS and Lipschitz-sampling rules.

Intuition: if gradients are similar, more progress if $L_i$ is small.

[Figure: Gauss-Southwell vs. Gauss-Southwell-Lipschitz selection between coordinates $x_1$ and $x_2$.]

Greedy rules have led to new methods for computing leading eigenvectors.

Coordinate-wise power methods [Wei et al., 2016, Wang et al., 2017].
