Academic year: 2022

(1)

Large-scale machine learning and convex optimization

Francis Bach

INRIA - Ecole Normale Supérieure, Paris, France

IFCAM, Bangalore - July 2014

(2)

“Big data” revolution?

A new scientific context

• Data everywhere: size does not (always) matter

• Science and industry

• Size and variety

• Learning from examples

– n observations in dimension d

(3)

Search engines - advertising

(5)

Marketing - Personalized recommendation

(6)

Visual object recognition

(7)

Personal photos

(8)

Bioinformatics

• Protein: Crucial elements of cell life

• Massive data: 2 million for humans

• Complex data

(9)

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n
– d: dimension of each observation (input)
– n: number of observations

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(dn)

• Going back to simple methods
– Stochastic gradient methods (Robbins and Monro, 1951)
– Mixing statistics and optimization
– Using smoothness to go beyond stochastic gradient descent

(12)

Outline

1. Large-scale machine learning and optimization
• Traditional statistical analysis
• Classical methods for convex optimization

2. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex

3. Smooth stochastic approximation algorithms
• Asymptotic and non-asymptotic results

4. Beyond decaying step-sizes

5. Finite data sets

(13)

Supervised machine learning

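• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^d$

• (regularized) empirical risk minimization: find $\hat\theta$ solution of

$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$

convex data fitting term + regularizer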

(14)

Usual losses

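• Regression: $y \in \mathbb{R}$, prediction $\hat y = \theta^\top \Phi(x)$
– quadratic loss $\frac{1}{2}(y - \hat y)^2 = \frac{1}{2}(y - \theta^\top \Phi(x))^2$

• Classification: $y \in \{-1, 1\}$, prediction $\hat y = \mathrm{sign}(\theta^\top \Phi(x))$
– loss of the form $\ell(y\, \theta^\top \Phi(x))$
– “True” 0-1 loss: $\ell(y\, \theta^\top \Phi(x)) = 1_{y\, \theta^\top \Phi(x) < 0}$
– Usual convex losses:

[Figure: plot comparing the 0−1, hinge, square and logistic losses]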

(16)

Main motivating examples

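• Support vector machine (hinge loss)

$$\ell(Y, \theta^\top \Phi(X)) = \max\{1 - Y \theta^\top \Phi(X), 0\}$$

• Logistic regression

$$\ell(Y, \theta^\top \Phi(X)) = \log\big(1 + \exp(-Y \theta^\top \Phi(X))\big)$$

• Least-squares regression

$$\ell(Y, \theta^\top \Phi(X)) = \frac{1}{2}\big(Y - \theta^\top \Phi(X)\big)^2$$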
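The losses listed above can be sketched in a few lines of plain Python. This is only an illustration: the function names (`dot`, `hinge_loss`, `logistic_loss`, `square_loss`, `zero_one_loss`) and the toy vectors are assumptions made for the example, not part of the slides.

```python
import math


def dot(theta, phi):
    """Linear prediction theta^T Phi(x) for two equal-length sequences."""
    return sum(t * p for t, p in zip(theta, phi))


def hinge_loss(y, score):
    """Hinge loss max{1 - y * score, 0}, with y in {-1, +1}."""
    return max(1.0 - y * score, 0.0)


def logistic_loss(y, score):
    """Logistic loss log(1 + exp(-y * score)), written to avoid overflow."""
    margin = y * score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def square_loss(y, score):
    """Least-squares loss (1/2) * (y - score)^2."""
    return 0.5 * (y - score) ** 2


def zero_one_loss(y, score):
    """'True' 0-1 loss: 1 if y * score < 0, else 0."""
    return 1.0 if y * score < 0 else 0.0


# Hypothetical parameter and feature vectors, chosen only for illustration.
theta = [0.5, -1.0]
phi_x = [2.0, 0.5]
score = dot(theta, phi_x)
for name, loss in [("hinge", hinge_loss), ("logistic", logistic_loss),
                   ("square", square_loss), ("0-1", zero_one_loss)]:
    print(name, loss(1, score))
```

The hinge, logistic and square losses are convex in the score, which is what makes the empirical risk a convex function of θ; the 0-1 loss is not.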

(17)

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: $\|\theta\|_2^2 = \sum_{j=1}^d |\theta_j|^2$
– Numerically well-behaved
– Representer theorem and kernel methods: $\theta = \sum_{i=1}^n \alpha_i \Phi(x_i)$
– See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

• Sparsity-inducing norms
– Main example: ℓ1-norm $\|\theta\|_1 = \sum_{j=1}^d |\theta_j|$
– Perform model selection as well as regularization
– Non-smooth optimization and structured sparsity
– See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

(19)

Supervised machine learning

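• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^d$

• (regularized) empirical risk minimization: find $\hat\theta$ solution of

$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) + \mu \Omega(\theta)$$

convex data fitting term + regularizer

• Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$ (training cost)

• Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \ell(y, \theta^\top \Phi(x))$ (testing cost)

• Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$
– May be tackled simultaneously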
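As a minimal sketch of the objective above, the following plain-Python snippet evaluates the empirical risk and its regularized version. The choice of the logistic loss, of Ω(θ) = ‖θ‖₂², and the toy data are assumptions made for the example; `empirical_risk` and `regularized_objective` are hypothetical names.

```python
import math


def logistic_loss(y, score):
    """Logistic loss log(1 + exp(-y * score)), written to avoid overflow."""
    margin = y * score
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def empirical_risk(theta, features, labels, loss):
    """Training cost (1/n) * sum_i loss(y_i, theta^T Phi(x_i))."""
    total = 0.0
    for phi, y in zip(features, labels):
        score = sum(t * p for t, p in zip(theta, phi))
        total += loss(y, score)
    return total / len(labels)


def regularized_objective(theta, features, labels, loss, mu):
    """Empirical risk plus mu * Omega(theta), here with Omega = squared l2 norm."""
    penalty = sum(t * t for t in theta)
    return empirical_risk(theta, features, labels, loss) + mu * penalty


# Toy data set (n = 3, d = 2), assumed only for illustration.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1, -1, 1]
theta = [1.0, -1.0]
print(empirical_risk(theta, features, labels, logistic_loss))
print(regularized_objective(theta, features, labels, logistic_loss, mu=0.1))
```

The expected risk f(θ) cannot be computed this way since it involves the unknown distribution of (x, y); only the empirical risk is available to the optimizer.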

(21)

Supervised machine learning

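• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Prediction as a linear function $\theta^\top \Phi(x)$ of features $\Phi(x) \in \mathbb{R}^d$

• (regularized) empirical risk minimization: find $\hat\theta$ solution of

$$\min_{\theta \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, \theta^\top \Phi(x_i)\big) \quad \text{such that } \Omega(\theta) \le D$$

convex data fitting term + constraint

• Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$ (training cost)

• Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \ell(y, \theta^\top \Phi(x))$ (testing cost)

• Two fundamental questions: (1) computing $\hat\theta$ and (2) analyzing $\hat\theta$
– May be tackled simultaneously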

(22)

General assumptions

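• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d.

• Bounded features $\Phi(x) \in \mathbb{R}^d$: $\|\Phi(x)\|_2 \le R$

• Empirical risk: $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$ (training cost)

• Expected risk: $f(\theta) = \mathbb{E}_{(x,y)} \ell(y, \theta^\top \Phi(x))$ (testing cost)

• Loss for a single observation: $f_i(\theta) = \ell(y_i, \theta^\top \Phi(x_i))$
⇒ $\forall i,\ f(\theta) = \mathbb{E} f_i(\theta)$

• Properties of $f_i$, $f$, $\hat f$
– Convex on $\mathbb{R}^d$
– Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity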

(23)

Lipschitz continuity

• Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

$$\forall \theta \in \mathbb{R}^d,\ \|\theta\|_2 \le D \ \Rightarrow\ \|f'(\theta)\|_2 \le B$$

• Machine learning
– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$
– G-Lipschitz loss and R-bounded data: B = GR

(24)

Smoothness and strong convexity

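• A function $f : \mathbb{R}^d \to \mathbb{R}$ is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d,\ \|f'(\theta_1) - f'(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2$$

• If f is twice differentiable: $\forall \theta \in \mathbb{R}^d,\ f''(\theta) \preccurlyeq L \cdot \mathrm{Id}$

[Figure: a smooth function next to a non-smooth function]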

• Machine learning
– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$
– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$
– ℓ-smooth loss and R-bounded data: $L = \ell R^2$

(26)

Smoothness and strong convexity

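• A function $f : \mathbb{R}^d \to \mathbb{R}$ is µ-strongly convex if and only if

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d,\ f(\theta_1) \ge f(\theta_2) + f'(\theta_2)^\top (\theta_1 - \theta_2) + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$$

• If f is twice differentiable: $\forall \theta \in \mathbb{R}^d,\ f''(\theta) \succcurlyeq \mu \cdot \mathrm{Id}$

[Figure: a convex function next to a strongly convex function]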

• Machine learning
– with $f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \theta^\top \Phi(x_i))$
– Hessian ≈ covariance matrix $\frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top$
– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by $\frac{\mu}{2} \|\theta\|^2$
– creates additional bias unless µ is small

(29)

Summary of smoothness/convexity assumptions

• Bounded gradients of f (Lipschitz-continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

$$\forall \theta \in \mathbb{R}^d,\ \|\theta\|_2 \le D \ \Rightarrow\ \|f'(\theta)\|_2 \le B$$

• Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient $f'$:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d,\ \|f'(\theta_1) - f'(\theta_2)\|_2 \le L \|\theta_1 - \theta_2\|_2$$

• Strong convexity of f: the function f is strongly convex with respect to the norm $\|\cdot\|_2$, with convexity constant µ > 0:

$$\forall \theta_1, \theta_2 \in \mathbb{R}^d,\ f(\theta_1) \ge f(\theta_2) + f'(\theta_2)^\top (\theta_1 - \theta_2) + \frac{\mu}{2} \|\theta_1 - \theta_2\|_2^2$$

(30)

Analysis of empirical risk minimization

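• Approximation and estimation errors: $C = \{\theta \in \mathbb{R}^d,\ \Omega(\theta) \le D\}$

$$f(\hat\theta) - \min_{\theta \in \mathbb{R}^d} f(\theta) = \Big[ f(\hat\theta) - \min_{\theta \in C} f(\theta) \Big] + \Big[ \min_{\theta \in C} f(\theta) - \min_{\theta \in \mathbb{R}^d} f(\theta) \Big]$$

– NB: may replace $\min_{\theta \in \mathbb{R}^d} f(\theta)$ by best (non-linear) predictions

1. Uniform deviation bounds, with $\hat\theta \in \arg\min_{\theta \in C} \hat f(\theta)$

$$f(\hat\theta) - \min_{\theta \in C} f(\theta) \le 2 \sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \quad \text{(proof)}$$

– Typically slow rate $O\big(\frac{1}{\sqrt{n}}\big)$

2. More refined concentration results with faster rates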

(31)

Motivation from least-squares

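• For least-squares, we have $\ell(y, \theta^\top \Phi(x)) = \frac{1}{2}(y - \theta^\top \Phi(x))^2$, and

$$\hat f(\theta) - f(\theta) = \frac{1}{2} \theta^\top \Big( \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top - \mathbb{E}\, \Phi(X) \Phi(X)^\top \Big) \theta - \theta^\top \Big( \frac{1}{n} \sum_{i=1}^n y_i \Phi(x_i) - \mathbb{E}\, Y \Phi(X) \Big) + \frac{1}{2} \Big( \frac{1}{n} \sum_{i=1}^n y_i^2 - \mathbb{E}\, Y^2 \Big),$$

$$\sup_{\|\theta\|_2 \le D} |f(\theta) - \hat f(\theta)| \le \frac{D^2}{2} \Big\| \frac{1}{n} \sum_{i=1}^n \Phi(x_i) \Phi(x_i)^\top - \mathbb{E}\, \Phi(X) \Phi(X)^\top \Big\|_{\mathrm{op}} + D \Big\| \frac{1}{n} \sum_{i=1}^n y_i \Phi(x_i) - \mathbb{E}\, Y \Phi(X) \Big\|_2 + \frac{1}{2} \Big| \frac{1}{n} \sum_{i=1}^n y_i^2 - \mathbb{E}\, Y^2 \Big|,$$

$$\sup_{\|\theta\|_2 \le D} |f(\theta) - \hat f(\theta)| \le O(1/\sqrt{n}) \quad \text{with high probability}$$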

(32)

Slow rate for supervised learning

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)
– $\Omega(\theta) = \|\theta\|_2$ (Euclidean norm)
– “Linear” predictors: $\theta(x) = \theta^\top \Phi(x)$, with $\|\Phi(x)\|_2 \le R$ a.s.
– G-Lipschitz loss: f and $\hat f$ are GR-Lipschitz on $C = \{\|\theta\|_2 \le D\}$
– No assumptions regarding convexity

• With probability greater than 1 − δ

$$\sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \le \frac{GRD}{\sqrt{n}} \Big[ 2 + \sqrt{2 \log \frac{2}{\delta}} \Big]$$

• Expected estimation error:

$$\mathbb{E} \Big[ \sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \Big] \le \frac{4GRD}{\sqrt{n}}$$

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

(34)

Symmetrization with Rademacher variables

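• Let $D' = \{x_1', y_1', \dots, x_n', y_n'\}$ be an independent copy of the data $D = \{x_1, y_1, \dots, x_n, y_n\}$, with corresponding loss functions $f_i'(\theta)$

$$\begin{aligned}
\mathbb{E} \Big[ \sup_{\theta \in \Theta} \big( f(\theta) - \hat f(\theta) \big) \Big]
&= \mathbb{E} \Big[ \sup_{\theta \in \Theta} \Big( f(\theta) - \frac{1}{n} \sum_{i=1}^n f_i(\theta) \Big) \Big] \\
&= \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \mathbb{E} \big( f_i'(\theta) - f_i(\theta) \,\big|\, D \big) \Big] \\
&\le \mathbb{E} \Big[ \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \big( f_i'(\theta) - f_i(\theta) \big) \,\Big|\, D \Big] \Big] \\
&= \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \big( f_i'(\theta) - f_i(\theta) \big) \Big] \\
&= \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big( f_i'(\theta) - f_i(\theta) \big) \Big] \quad \text{with } \varepsilon_i \text{ uniform in } \{-1, 1\} \\
&\le 2\, \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_i(\theta) \Big] = \text{Rademacher complexity}
\end{aligned}$$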

(35)

Rademacher complexity

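• Define the Rademacher complexity of the class of functions $(X, Y) \mapsto \ell(Y, \theta^\top \Phi(X))$ as

$$R_n = \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_i(\theta) \Big].$$

• Note two expectations, with respect to D and with respect to ε

• Main property:

$$\mathbb{E} \Big[ \sup_{\theta \in \Theta} \big( f(\theta) - \hat f(\theta) \big) \Big] \le 2 R_n$$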

(36)

From Rademacher complexity to uniform bound

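• Let $Z = \sup_{\theta \in \Theta} \big( f(\theta) - \hat f(\theta) \big)$

• By changing the pair $(x_i, y_i)$, Z may only change by

$$\frac{2}{n} \sup |\ell(Y, \theta^\top \Phi(X))| \le \frac{2}{n} \big( \sup |\ell(Y, 0)| + GRD \big) \le \frac{2}{n} \big( \ell_0 + GRD \big) = c$$

with $\sup |\ell(Y, 0)| = \ell_0$

• McDiarmid inequality: with probability greater than 1 − δ,

$$Z \le \mathbb{E} Z + \sqrt{\frac{n}{2}}\, c \cdot \sqrt{\log \frac{1}{\delta}} \le 2 R_n + \frac{\sqrt{2}}{\sqrt{n}} \big( \ell_0 + GRD \big) \sqrt{\log \frac{1}{\delta}}$$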

(37)

Bounding the Rademacher average - I

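• With $\varphi_i(u) = \ell(y_i, u) - \ell(y_i, 0)$, which is almost surely G-Lipschitz, we have:

$$\begin{aligned}
R_n &= \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_i(\theta) \Big] \\
&\le \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i f_i(0) \Big] + \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big( f_i(\theta) - f_i(0) \big) \Big] \\
&\le \frac{\ell_0}{\sqrt{n}} + \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big( f_i(\theta) - f_i(0) \big) \Big] \\
&= \frac{\ell_0}{\sqrt{n}} + \mathbb{E} \Big[ \sup_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \varphi_i\big( \theta^\top \Phi(x_i) \big) \Big]
\end{aligned}$$

• Using Ledoux-Talagrand concentration results for Rademacher averages (since $\varphi_i$ is G-Lipschitz), we get:

$$R_n \le \frac{\ell_0}{\sqrt{n}} + 2G \cdot \mathbb{E} \Big[ \sup_{\|\theta\|_2 \le D} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \theta^\top \Phi(x_i) \Big]$$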

(38)

Bounding the Rademacher average - II

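• We have:

$$\begin{aligned}
R_n &\le \frac{\ell_0}{\sqrt{n}} + 2G\, \mathbb{E} \Big[ \sup_{\|\theta\|_2 \le D} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \theta^\top \Phi(x_i) \Big] \\
&= \frac{\ell_0}{\sqrt{n}} + 2G\, \mathbb{E} \Big\| D\, \frac{1}{n} \sum_{i=1}^n \varepsilon_i \Phi(x_i) \Big\|_2 \\
&\le \frac{\ell_0}{\sqrt{n}} + 2GD \sqrt{ \mathbb{E} \Big\| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \Phi(x_i) \Big\|_2^2 } \\
&\le \frac{2(\ell_0 + GRD)}{\sqrt{n}}
\end{aligned}$$

• Overall, we get, with probability 1 − δ:

$$\sup_{\theta \in \Theta} \big( f(\theta) - \hat f(\theta) \big) \le \frac{1}{\sqrt{n}} \big( \ell_0 + GRD \big) \Big( 4 + \sqrt{2 \log \frac{1}{\delta}} \Big)$$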
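The key step in the bound above is that the expected norm of the signed average (1/n) Σ εᵢ Φ(xᵢ) is at most R/√n. A small Monte Carlo sketch can make this concrete. The data (points drawn on the sphere of radius R), the sample sizes and the function name `rademacher_norm_estimate` are all assumptions made for the illustration.

```python
import math
import random


def rademacher_norm_estimate(features, num_draws, rng):
    """Monte Carlo estimate of E || (1/n) sum_i eps_i Phi(x_i) ||_2
    over i.i.d. signs eps_i uniform in {-1, +1}, for fixed features."""
    n = len(features)
    d = len(features[0])
    total = 0.0
    for _ in range(num_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        avg = [sum(s * phi[j] for s, phi in zip(signs, features)) / n
               for j in range(d)]
        total += math.sqrt(sum(a * a for a in avg))
    return total / num_draws


rng = random.Random(0)
n, d, R = 50, 3, 1.0

# Hypothetical features: random directions rescaled to have norm exactly R.
features = []
for _ in range(n):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    features.append([R * c / norm for c in v])

estimate = rademacher_norm_estimate(features, 2000, rng)
bound = R / math.sqrt(n)
print("estimate:", estimate, "bound R/sqrt(n):", bound)
```

Multiplying this quantity by 2GD and adding ℓ₀/√n gives the bound on Rₙ used in the derivation.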

(39)

Putting it all together

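• We have, with probability 1 − δ, for all θ ∈ Θ, with $\theta_\ast$ a minimizer of f over Θ:

$$\begin{aligned}
f(\theta) - f(\theta_\ast) &\le \big[ f(\theta) - \hat f(\theta) \big] + \big[ \hat f(\theta) - \min_{\theta' \in \Theta} \hat f(\theta') \big] + \big[ \min_{\theta' \in \Theta} \hat f(\theta') - f(\theta_\ast) \big] \\
&\le \frac{2}{\sqrt{n}} \big( \ell_0 + GRD \big) \Big( 4 + \sqrt{2 \log \frac{1}{\delta}} \Big) + \hat f(\theta) - \min_{\theta' \in \Theta} \hat f(\theta')
\end{aligned}$$

• Only need to optimize with precision $\frac{2}{\sqrt{n}} (\ell_0 + GRD)$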

(40)

Slow rate for supervised learning (summary)

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)
– $\Omega(\theta) = \|\theta\|_2$ (Euclidean norm)
– “Linear” predictors: $\theta(x) = \theta^\top \Phi(x)$, with $\|\Phi(x)\|_2 \le R$ a.s.
– G-Lipschitz loss: f and $\hat f$ are GR-Lipschitz on $C = \{\|\theta\|_2 \le D\}$
– No assumptions regarding convexity

• With probability greater than 1 − δ

$$\sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \le \frac{\ell_0 + GRD}{\sqrt{n}} \Big[ 2 + \sqrt{2 \log \frac{2}{\delta}} \Big]$$

• Expected estimation error:

$$\mathbb{E} \Big[ \sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \Big] \le \frac{4(\ell_0 + GRD)}{\sqrt{n}}$$

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

(41)

Motivation from mean estimation

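• Estimator $\hat\theta = \frac{1}{n} \sum_{i=1}^n z_i = \arg\min_{\theta \in \mathbb{R}} \hat f(\theta)$, with $\hat f(\theta) = \frac{1}{2n} \sum_{i=1}^n (\theta - z_i)^2$

• From before:
– $f(\theta) = \frac{1}{2} \mathbb{E}(\theta - z)^2 = \frac{1}{2} (\theta - \mathbb{E} z)^2 + \frac{1}{2} \mathrm{var}(z) = \hat f(\theta) + O(1/\sqrt{n})$
– $f(\hat\theta) = \frac{1}{2} (\hat\theta - \mathbb{E} z)^2 + \frac{1}{2} \mathrm{var}(z) = f(\mathbb{E} z) + O(1/\sqrt{n})$

• More refined/direct bound:

$$f(\hat\theta) - f(\mathbb{E} z) = \frac{1}{2} (\hat\theta - \mathbb{E} z)^2$$

$$\mathbb{E} \big[ f(\hat\theta) - f(\mathbb{E} z) \big] = \frac{1}{2} \mathbb{E} \Big( \frac{1}{n} \sum_{i=1}^n z_i - \mathbb{E} z \Big)^2 = \frac{1}{2n} \mathrm{var}(z)$$

• Bound only at $\hat\theta$ + strong convexity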
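The 1/n rate for the sample mean can be checked numerically. The sketch below assumes z uniform on [0, 1] (so E z = 1/2 and var(z) = 1/12); this distribution, the sample sizes and the helper name `excess_risk_of_sample_mean` are illustrative choices, not taken from the slides.

```python
import random


def excess_risk_of_sample_mean(sample, true_mean):
    """f(theta_hat) - f(E z) = (1/2) * (theta_hat - E z)^2 for the sample mean."""
    theta_hat = sum(sample) / len(sample)
    return 0.5 * (theta_hat - true_mean) ** 2


rng = random.Random(0)
n, num_trials = 20, 20000
true_mean, variance = 0.5, 1.0 / 12.0  # uniform distribution on [0, 1]

# Average the excess risk over many independent data sets of size n.
average_excess = sum(
    excess_risk_of_sample_mean([rng.random() for _ in range(n)], true_mean)
    for _ in range(num_trials)
) / num_trials

predicted = variance / (2 * n)  # var(z) / (2n) from the direct computation
print("simulated:", average_excess, "predicted:", predicted)
```

Doubling n should roughly halve the simulated excess risk, in contrast with the 1/√n behavior of the uniform deviation bound.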

(43)

Fast rate for supervised learning

• Assumptions (f is the expected risk, $\hat f$ the empirical risk)
– Same as before (bounded features, Lipschitz loss)
– Regularized risks: $f_\mu(\theta) = f(\theta) + \frac{\mu}{2} \|\theta\|_2^2$ and $\hat f_\mu(\theta) = \hat f(\theta) + \frac{\mu}{2} \|\theta\|_2^2$
– Convexity

• For any a > 0, with probability greater than 1 − δ, for all $\theta \in \mathbb{R}^d$,

$$f_\mu(\theta) - \min_{\eta \in \mathbb{R}^d} f_\mu(\eta) \le (1 + a) \Big( \hat f_\mu(\theta) - \min_{\eta \in \mathbb{R}^d} \hat f_\mu(\eta) \Big) + \frac{8 \big( 1 + \frac{1}{a} \big) G^2 R^2 \big( 32 + \log \frac{1}{\delta} \big)}{\mu n}$$

• Results from Sridharan, Srebro, and Shalev-Shwartz (2008)
– see also Boucheron and Massart (2011) and references therein

• Strongly convex functions ⇒ fast rate
– Warning: µ should decrease with n to reduce approximation error

(44)

Outline

1. Large-scale machine learning and optimization
• Traditional statistical analysis
• Classical methods for convex optimization

2. Non-smooth stochastic approximation
• Stochastic (sub)gradient and averaging
• Non-asymptotic results and lower bounds
• Strongly convex vs. non-strongly convex

3. Smooth stochastic approximation algorithms
• Asymptotic and non-asymptotic results

4. Beyond decaying step-sizes

5. Finite data sets

(45)

Complexity results in convex optimization

• Assumption: f convex on $\mathbb{R}^d$

• Classical generic algorithms
– (sub)gradient method/descent
– Accelerated gradient descent
– Newton method

• Key additional properties of f

– Lipschitz continuity, smoothness or strong convexity

• Key insight from Bottou and Bousquet (2008)

– In machine learning, no need to optimize below estimation error

• Key reference: Nesterov (2004)

(46)

Subgradient method/descent

• Assumptions

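– f convex and B-Lipschitz-continuous on $\{\|\theta\|_2 \le D\}$

• Algorithm: $\theta_t = \Pi_D \Big( \theta_{t-1} - \frac{2D}{B\sqrt{t}} f'(\theta_{t-1}) \Big)$
– $\Pi_D$: orthogonal projection onto $\{\|\theta\|_2 \le D\}$

• Bound:

$$f\Big( \frac{1}{t} \sum_{k=0}^{t-1} \theta_k \Big) - f(\theta_\ast) \le \frac{2DB}{\sqrt{t}}$$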

• Three-line proof

• Best possible convergence rate after O(d) iterations
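The projected subgradient method with averaging described above can be sketched in plain Python as follows. The test function f(θ) = Σⱼ |θⱼ − cⱼ|, the vector c, the constants and the helper names are assumptions chosen for illustration; only the step-size 2D/(B√t), the projection and the averaging come from the slide.

```python
import math


def project_l2_ball(theta, radius):
    """Orthogonal projection onto the Euclidean ball {||theta||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= radius:
        return list(theta)
    return [v * radius / norm for v in theta]


def subgradient_method(subgrad, theta0, D, B, num_iters):
    """Projected subgradient steps with step-size 2D / (B sqrt(t)).
    Returns the average of theta_0, ..., theta_{t-1}."""
    theta = list(theta0)
    running_sum = [0.0] * len(theta)
    for t in range(1, num_iters + 1):
        running_sum = [s + v for s, v in zip(running_sum, theta)]
        g = subgrad(theta)
        step = 2.0 * D / (B * math.sqrt(t))
        theta = project_l2_ball([v - step * gv for v, gv in zip(theta, g)], D)
    return [s / num_iters for s in running_sum]


# Hypothetical non-smooth objective: f(theta) = sum_j |theta_j - c_j|, minimum 0 at c.
c = [1.0, -0.5]


def objective(theta):
    return sum(abs(v - cv) for v, cv in zip(theta, c))


def subgrad(theta):
    return [0.0 if v == cv else math.copysign(1.0, v - cv)
            for v, cv in zip(theta, c)]


D = 2.0              # c lies inside the ball of radius D
B = math.sqrt(2.0)   # subgradients have entries in {-1, 0, 1} in dimension 2
num_iters = 400
averaged = subgradient_method(subgrad, [0.0, 0.0], D, B, num_iters)
gap = objective(averaged) - 0.0
print("gap:", gap, "slide bound 2DB/sqrt(t):", 2.0 * D * B / math.sqrt(num_iters))
```

Averaging matters here: the individual iterates keep oscillating around the minimizer because the subgradient norm does not vanish near it.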

(47)

Subgradient method/descent - proof - I

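Iteration: $\theta_t = \Pi_D\big( \theta_{t-1} - \gamma_t f'(\theta_{t-1}) \big)$ with $\gamma_t = \frac{2D}{B\sqrt{t}}$

Assumption: $\|f'(\theta)\|_2 \le B$ and $\|\theta\|_2 \le D$

$$\begin{aligned}
\|\theta_t - \theta_\ast\|_2^2 &\le \|\theta_{t-1} - \theta_\ast - \gamma_t f'(\theta_{t-1})\|_2^2 \quad \text{by contractivity of projections} \\
&\le \|\theta_{t-1} - \theta_\ast\|_2^2 + B^2 \gamma_t^2 - 2 \gamma_t (\theta_{t-1} - \theta_\ast)^\top f'(\theta_{t-1}) \quad \text{because } \|f'(\theta_{t-1})\|_2 \le B \\
&\le \|\theta_{t-1} - \theta_\ast\|_2^2 + B^2 \gamma_t^2 - 2 \gamma_t \big[ f(\theta_{t-1}) - f(\theta_\ast) \big] \quad \text{(property of subgradients)}
\end{aligned}$$

leading to

$$f(\theta_{t-1}) - f(\theta_\ast) \le \frac{B^2 \gamma_t}{2} + \frac{1}{2\gamma_t} \big[ \|\theta_{t-1} - \theta_\ast\|_2^2 - \|\theta_t - \theta_\ast\|_2^2 \big]$$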

(48)

Subgradient method/descent - proof - II

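Starting from $f(\theta_{t-1}) - f(\theta_\ast) \le \frac{B^2 \gamma_t}{2} + \frac{1}{2\gamma_t} \big[ \|\theta_{t-1} - \theta_\ast\|_2^2 - \|\theta_t - \theta_\ast\|_2^2 \big]$

$$\begin{aligned}
\sum_{u=1}^t \big[ f(\theta_{u-1}) - f(\theta_\ast) \big]
&\le \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^t \frac{1}{2\gamma_u} \big[ \|\theta_{u-1} - \theta_\ast\|_2^2 - \|\theta_u - \theta_\ast\|_2^2 \big] \\
&= \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^{t-1} \|\theta_u - \theta_\ast\|_2^2 \Big( \frac{1}{2\gamma_{u+1}} - \frac{1}{2\gamma_u} \Big) + \frac{\|\theta_0 - \theta_\ast\|_2^2}{2\gamma_1} - \frac{\|\theta_t - \theta_\ast\|_2^2}{2\gamma_t} \\
&\le \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \sum_{u=1}^{t-1} 4D^2 \Big( \frac{1}{2\gamma_{u+1}} - \frac{1}{2\gamma_u} \Big) + \frac{4D^2}{2\gamma_1} \\
&= \sum_{u=1}^t \frac{B^2 \gamma_u}{2} + \frac{4D^2}{2\gamma_t} \le 2DB\sqrt{t} \quad \text{with } \gamma_t = \frac{2D}{B\sqrt{t}}
\end{aligned}$$

Using convexity:

$$f\Big( \frac{1}{t} \sum_{k=0}^{t-1} \theta_k \Big) - f(\theta_\ast) \le \frac{2DB}{\sqrt{t}}$$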

(49)

Subgradient descent for machine learning

• Assumptions (f is the expected risk, fˆ the empirical risk)

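– “Linear” predictors: $\theta(x) = \theta^\top \Phi(x)$, with $\|\Phi(x)\|_2 \le R$ a.s.
– $\hat f(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \Phi(x_i)^\top \theta)$
– G-Lipschitz loss: f and $\hat f$ are GR-Lipschitz on $C = \{\|\theta\|_2 \le D\}$

• Statistics: with probability greater than 1 − δ

$$\sup_{\theta \in C} |\hat f(\theta) - f(\theta)| \le \frac{GRD}{\sqrt{n}} \Big[ 2 + \sqrt{2 \log \frac{2}{\delta}} \Big]$$

• Optimization: after t iterations of subgradient method

$$\hat f(\hat\theta) - \min_{\eta \in C} \hat f(\eta) \le \frac{GRD}{\sqrt{t}}$$

• t = n iterations, with total running-time complexity of $O(n^2 d)$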

(50)

Subgradient descent - strong convexity

• Assumptions

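– f convex and B-Lipschitz-continuous on $\{\|\theta\|_2 \le D\}$
– f µ-strongly convex

• Algorithm: $\theta_t = \Pi_D \Big( \theta_{t-1} - \frac{2}{\mu(t+1)} f'(\theta_{t-1}) \Big)$

• Bound:

$$f\Big( \frac{2}{t(t+1)} \sum_{k=1}^{t} k\, \theta_{k-1} \Big) - f(\theta_\ast) \le \frac{2B^2}{\mu(t+1)}$$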

• Three-line proof

• Best possible convergence rate after O(d) iterations
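A minimal sketch of the strongly convex variant, with step-size 2/(µ(t+1)) and the weighted average with weights proportional to k, is given below. The objective f(θ) = Σⱼ |θⱼ − cⱼ| + (µ/2)‖θ‖₂², the vector c and the constants are assumptions chosen so that the minimizer is known in closed form (it equals c whenever µ|cⱼ| ≤ 1).

```python
import math


def project_l2_ball(theta, radius):
    """Orthogonal projection onto the Euclidean ball {||theta||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= radius:
        return list(theta)
    return [v * radius / norm for v in theta]


def strongly_convex_subgradient_method(subgrad, theta0, D, mu, num_iters):
    """Projected subgradient steps with step-size 2 / (mu (t + 1)).
    Returns (2 / (t (t + 1))) * sum_{k=1}^t k * theta_{k-1}."""
    theta = list(theta0)
    weighted_sum = [0.0] * len(theta)
    for t in range(1, num_iters + 1):
        weighted_sum = [s + t * v for s, v in zip(weighted_sum, theta)]
        g = subgrad(theta)
        step = 2.0 / (mu * (t + 1))
        theta = project_l2_ball([v - step * gv for v, gv in zip(theta, g)], D)
    total_weight = num_iters * (num_iters + 1) / 2.0
    return [s / total_weight for s in weighted_sum]


# Hypothetical objective: sum_j |theta_j - c_j| + (mu / 2) ||theta||_2^2.
mu = 1.0
c = [0.8, -0.5]


def objective(theta):
    return (sum(abs(v - cv) for v, cv in zip(theta, c))
            + 0.5 * mu * sum(v * v for v in theta))


def subgrad(theta):
    return [(0.0 if v == cv else math.copysign(1.0, v - cv)) + mu * v
            for v, cv in zip(theta, c)]


D = 2.0
B = math.sqrt(2.0) + mu * D  # bound on subgradient norms over the ball
num_iters = 500
f_star = objective(c)        # c is the minimizer because mu * |c_j| <= 1
averaged = strongly_convex_subgradient_method(subgrad, [0.0, 0.0], D, mu, num_iters)
gap = objective(averaged) - f_star
bound = 2.0 * B ** 2 / (mu * (num_iters + 1))
print("gap:", gap, "bound 2B^2/(mu(t+1)):", bound)
```

Giving later iterates more weight is what turns the 1/√t rate of uniform averaging into a 1/t rate under strong convexity.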

(51)

Subgradient method - strong convexity - proof - I

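Iteration: $\theta_t = \Pi_D\big( \theta_{t-1} - \gamma_t f'(\theta_{t-1}) \big)$ with $\gamma_t = \frac{2}{\mu(t+1)}$

Assumption: $\|f'(\theta)\|_2 \le B$ and $\|\theta\|_2 \le D$ and µ-strong convexity of f

$$\begin{aligned}
\|\theta_t - \theta_\ast\|_2^2 &\le \|\theta_{t-1} - \theta_\ast - \gamma_t f'(\theta_{t-1})\|_2^2 \quad \text{by contractivity of projections} \\
&\le \|\theta_{t-1} - \theta_\ast\|_2^2 + B^2 \gamma_t^2 - 2 \gamma_t (\theta_{t-1} - \theta_\ast)^\top f'(\theta_{t-1}) \quad \text{because } \|f'(\theta_{t-1})\|_2 \le B \\
&\le \|\theta_{t-1} - \theta_\ast\|_2^2 + B^2 \gamma_t^2 - 2 \gamma_t \Big[ f(\theta_{t-1}) - f(\theta_\ast) + \frac{\mu}{2} \|\theta_{t-1} - \theta_\ast\|_2^2 \Big]
\end{aligned}$$

(property of subgradients and strong convexity)

leading to

$$\begin{aligned}
f(\theta_{t-1}) - f(\theta_\ast) &\le \frac{B^2 \gamma_t}{2} + \frac{1}{2} \Big[ \frac{1}{\gamma_t} - \mu \Big] \|\theta_{t-1} - \theta_\ast\|_2^2 - \frac{1}{2\gamma_t} \|\theta_t - \theta_\ast\|_2^2 \\
&\le \frac{B^2}{\mu(t+1)} + \frac{\mu}{2} \Big[ \frac{t-1}{2} \Big] \|\theta_{t-1} - \theta_\ast\|_2^2 - \frac{\mu(t+1)}{4} \|\theta_t - \theta_\ast\|_2^2
\end{aligned}$$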

(52)

Subgradient method - strong convexity - proof - II

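From $f(\theta_{t-1}) - f(\theta_\ast) \le \frac{B^2}{\mu(t+1)} + \frac{\mu}{2} \Big[ \frac{t-1}{2} \Big] \|\theta_{t-1} - \theta_\ast\|_2^2 - \frac{\mu(t+1)}{4} \|\theta_t - \theta_\ast\|_2^2$

$$\begin{aligned}
\sum_{u=1}^t u \big[ f(\theta_{u-1}) - f(\theta_\ast) \big]
&\le \sum_{u=1}^t \frac{B^2 u}{\mu(u+1)} + \frac{\mu}{4} \sum_{u=1}^t \big[ u(u-1) \|\theta_{u-1} - \theta_\ast\|_2^2 - u(u+1) \|\theta_u - \theta_\ast\|_2^2 \big] \\
&\le \frac{B^2 t}{\mu} + \frac{\mu}{4} \big[ 0 - t(t+1) \|\theta_t - \theta_\ast\|_2^2 \big] \\
&\le \frac{B^2 t}{\mu}
\end{aligned}$$

Using convexity:

$$f\Big( \frac{2}{t(t+1)} \sum_{u=1}^{t} u\, \theta_{u-1} \Big) - f(\theta_\ast) \le \frac{2B^2}{\mu(t+1)}$$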

(53)

(smooth) gradient descent

• Assumptions

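– f convex with L-Lipschitz-continuous gradient
– Minimum attained at $\theta_\ast$

• Algorithm:

$$\theta_t = \theta_{t-1} - \frac{1}{L} f'(\theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_\ast) \le \frac{2L \|\theta_0 - \theta_\ast\|^2}{t + 4}$$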

• Three-line proof

• Not best possible convergence rate after O(d) iterations

(54)

(smooth) gradient descent - strong convexity

• Assumptions

– f convex with L-Lipschitz-continuous gradient
– f µ-strongly convex

• Algorithm:

$$\theta_t = \theta_{t-1} - \frac{1}{L} f'(\theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_\ast) \le (1 - \mu/L)^t \big[ f(\theta_0) - f(\theta_\ast) \big]$$

• Three-line proof

• Adaptivity of gradient descent to problem difficulty

• Line search
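The linear rate above can be observed on a small quadratic. The sketch below assumes a diagonal quadratic f(θ) = Σⱼ (aⱼ θⱼ²/2 − bⱼ θⱼ), for which µ and L are simply the smallest and largest aⱼ; the function and the constants are illustrative assumptions, while the update θₜ = θₜ₋₁ − f′(θₜ₋₁)/L is the one on the slide.

```python
def gradient_descent(grad, theta0, L, num_iters):
    """Gradient descent with constant step-size 1/L; returns all iterates."""
    theta = list(theta0)
    iterates = [list(theta)]
    for _ in range(num_iters):
        g = grad(theta)
        theta = [v - gv / L for v, gv in zip(theta, g)]
        iterates.append(list(theta))
    return iterates


# Hypothetical diagonal quadratic with curvatures a_j and linear term b_j.
a = [1.0, 10.0]
b = [1.0, 10.0]


def f(theta):
    return sum(0.5 * aj * v * v - bj * v for aj, bj, v in zip(a, b, theta))


def grad(theta):
    return [aj * v - bj for aj, bj, v in zip(a, b, theta)]


mu, L = min(a), max(a)
theta_star = [bj / aj for aj, bj in zip(a, b)]
f_star = f(theta_star)

iterates = gradient_descent(grad, [0.0, 0.0], L, 30)
initial_gap = f(iterates[0]) - f_star
for t in (1, 10, 30):
    print(t, f(iterates[t]) - f_star, (1.0 - mu / L) ** t * initial_gap)
```

The same code runs unchanged on a badly conditioned problem; only the observed rate degrades, which is the adaptivity mentioned above.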

(55)

Accelerated gradient methods (Nesterov, 1983)

• Assumptions

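– f convex with L-Lipschitz-continuous gradient, minimum attained at $\theta_\ast$

• Algorithm:

$$\theta_t = \eta_{t-1} - \frac{1}{L} f'(\eta_{t-1})$$

$$\eta_t = \theta_t + \frac{t - 1}{t + 2} (\theta_t - \theta_{t-1})$$

• Bound:

$$f(\theta_t) - f(\theta_\ast) \le \frac{2L \|\theta_0 - \theta_\ast\|^2}{(t + 1)^2}$$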

• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

• Not improvable

• Extension to strongly convex functions
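The two-sequence recursion above translates directly into code. This sketch assumes the initialization η₀ = θ₀ (not stated on the slide) and reuses an illustrative diagonal quadratic; the function name `accelerated_gradient` and all constants are hypothetical.

```python
def accelerated_gradient(grad, theta0, L, num_iters):
    """Accelerated gradient method:
    theta_t = eta_{t-1} - grad(eta_{t-1}) / L,
    eta_t = theta_t + (t - 1) / (t + 2) * (theta_t - theta_{t-1}).
    Assumes eta_0 = theta_0. Returns all theta iterates."""
    theta_prev = list(theta0)
    eta = list(theta0)
    iterates = [list(theta0)]
    for t in range(1, num_iters + 1):
        g = grad(eta)
        theta = [e - gv / L for e, gv in zip(eta, g)]
        momentum = (t - 1.0) / (t + 2.0)
        eta = [v + momentum * (v - p) for v, p in zip(theta, theta_prev)]
        theta_prev = theta
        iterates.append(list(theta))
    return iterates


# Hypothetical diagonal quadratic f(theta) = sum_j (a_j theta_j^2 / 2 - b_j theta_j).
a = [0.1, 1.0]
b = [0.1, 1.0]


def f(theta):
    return sum(0.5 * aj * v * v - bj * v for aj, bj, v in zip(a, b, theta))


def grad(theta):
    return [aj * v - bj for aj, bj, v in zip(a, b, theta)]


L = max(a)
theta_star = [bj / aj for aj, bj in zip(a, b)]
f_star = f(theta_star)
theta0 = [0.0, 0.0]
dist_sq = sum((v - s) ** 2 for v, s in zip(theta0, theta_star))

iterates = accelerated_gradient(grad, theta0, L, 50)
for t in (1, 10, 50):
    print(t, f(iterates[t]) - f_star, 2.0 * L * dist_sq / (t + 1) ** 2)
```

Compared with plain gradient descent, the only extra cost per iteration is one vector combination, while the guaranteed rate improves from O(1/t) to O(1/t²).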

(56)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2011)

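• Gradient descent as a proximal method (differentiable functions)
– $\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d} f(\theta_t) + (\theta - \theta_t)^\top \nabla f(\theta_t) + \frac{L}{2} \|\theta - \theta_t\|_2^2$
– $\theta_{t+1} = \theta_t - \frac{1}{L} \nabla f(\theta_t)$

• Problems of the form: $\min_{\theta \in \mathbb{R}^d} f(\theta) + \mu \Omega(\theta)$
– $\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d} f(\theta_t) + (\theta - \theta_t)^\top \nabla f(\theta_t) + \mu \Omega(\theta) + \frac{L}{2} \|\theta - \theta_t\|_2^2$
– $\Omega(\theta) = \|\theta\|_1$ ⇒ Thresholded gradient descent

• Convergence rates similar to those of smooth optimization
– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)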
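For Ω(θ) = ‖θ‖₁, the proximal step above has a closed form: a gradient step followed by coordinate-wise soft-thresholding at level µ/L. The sketch below illustrates this on a separable quadratic whose solution is known in closed form; the objective, the constants and the function names are assumptions made for the example.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.|: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient_l1(grad, theta0, L, mu, num_iters):
    """Thresholded gradient descent for min f(theta) + mu * ||theta||_1:
    theta <- soft_threshold(theta - grad(theta) / L, mu / L), coordinate-wise."""
    theta = list(theta0)
    for _ in range(num_iters):
        g = grad(theta)
        theta = [soft_threshold(v - gv / L, mu / L) for v, gv in zip(theta, g)]
    return theta


# Hypothetical smooth part: f(theta) = sum_j (a_j / 2) * (theta_j - z_j)^2.
a = [1.0, 4.0]
z = [3.0, 0.1]


def grad(theta):
    return [aj * (v - zj) for aj, zj, v in zip(a, z, theta)]


L, mu = max(a), 1.0
solution = proximal_gradient_l1(grad, [0.0, 0.0], L, mu, 100)

# For this separable problem the minimizer is soft_threshold(z_j, mu / a_j).
closed_form = [soft_threshold(zj, mu / aj) for aj, zj in zip(a, z)]
print("iterative:", solution, "closed form:", closed_form)
```

The second coordinate is set exactly to zero, which illustrates how the ℓ1-norm performs model selection as well as regularization.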
