slides

(1)

Large-scale machine learning and convex optimization

Francis Bach

INRIA - Ecole Normale Sup´erieure, Paris, France

É C O L E N O R M A L E S U P É R I E U R E

IFCAM, Bangalore - July 2014

(2)

“Big data” revolution?

A new scientific context

• Data everywhere: size does not (always) matter

• Science and industry

• Size and variety

• Learning from examples

– n observations in dimension d

(3)

Search engines - advertising

(4)

Search engines - Advertising

(5)

Marketing - Personalized recommendation

(6)

Visual object recognition

(7)

Personal photos

(8)

Bioinformatics

• Protein: Crucial elements of cell life

• Massive data: 2 millions for humans

• Complex data

(9)

Context

Machine learning for “big data”

• Large-scale machine learning: large d, large n – d : dimension of each observation (input)

– n : number of observations

• Examples: computer vision, bioinformatics, advertising – Ideal running-time complexity: O(dn)

– Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951) – Mixing statistics and optimization

– Using smoothness to go beyond stochastic gradient descent

(10)

Context

Machine learning for “big data”

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(dn) – Going back to simple methods

(11)

Context

Machine learning for “big data”

• Examples: computer vision, bioinformatics, advertising

• Ideal running-time complexity: O(dn)

• Going back to simple methods

(12)

Outline

1. Large-scale machine learning and optimization

• Traditional statistical analysis

• Classical methods for convex optimization 2. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

• Strongly convex vs. non-strongly convex

3. Smooth stochastic approximation algorithms

• Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes

5. Finite data sets

(13)

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

• Prediction as a linear function θ^⊤Φ(x) of features Φ(x) ∈ R^d

• (regularized) empirical risk minimization: find θˆ solution of min

θ∈R^d

1 n

n

X

i=1

ℓ y_i, θ^⊤Φ(x_i)

+ µΩ(θ)

convex data fitting term + regularizer

(14)

Usual losses

• Regression: y ∈ R, prediction yˆ = θ^⊤Φ(x) – quadratic loss ¹₂(y − y)ˆ ² = ¹₂(y − θ^⊤Φ(x))²

(15)

Usual losses

• Regression: y ∈ R, prediction yˆ = θ^⊤Φ(x) – quadratic loss ¹₂(y − y)ˆ ² = ¹₂(y − θ^⊤Φ(x))²

• Classification : y ∈ {−1, 1}, prediction yˆ = sign(θ^⊤Φ(x)) – loss of the form ℓ(y θ^⊤Φ(x))

– “True” 0-1 loss: ℓ(y θ^⊤Φ(x)) = 1_{y θ}_⊤_Φ(x)<0 – Usual convex losses:

−30 −2 −1 0 1 2 3 4

1 2 3 4 5

0−1 hinge square logistic

(16)

Main motivating examples

• Support vector machine (hinge loss)

ℓ(Y, θ^⊤Φ(X)) = max{1 − Y θ^⊤Φ(X), 0}

• Logistic regression

ℓ(Y, θ^⊤Φ(X)) = log(1 + exp(−Y θ^⊤Φ(X)))

• Least-squares regression

ℓ(Y, θ^⊤Φ(X)) = 1

2(Y − θ^⊤Φ(X))²

(17)

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: kθk²2 = Pd

j=1 |θ_j|² – Numerically well-behaved

– Representer theorem and kernel methods : θ = Pn

i=1 α_iΦ(x_i)

– See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

• Sparsity-inducing norms

– Main example: ℓ₁-norm kθk¹ = Pd

j=1 |θ_j|

– Perform model selection as well as regularization – Non-smooth optimization and structured sparsity

– See, e.g., Bach, Jenatton, Mairal, and Obozinski (2011, 2012)

(18)

Supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ µΩ(θ)

(19)

Supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ µΩ(θ)

• Empirical risk: fˆ(θ) = _n¹ Pn

i=1 ℓ(y_i, θ^⊤Φ(x_i)) training cost

• Expected risk: f(θ) = E(x,y)ℓ(y, θ^⊤Φ(x)) testing cost

• Two fundamental questions: (1) computing θˆ and (2) analyzing θˆ – May be tackled simultaneously

(20)

Supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ µΩ(θ)

(21)

Supervised machine learning

θ∈R^d

1 n

n

X

i=1

such that Ω(θ) 6 D

convex data fitting term + constraint

(22)

General assumptions

• Bounded features Φ(x) ∈ R^d: kΦ(x)k² 6 R

• Loss for a single observation: f_i(θ) = ℓ(y_i, θ^⊤Φ(x_i))

⇒ ∀i, f(θ) = Ef_i(θ)

• Properties of f_i, f, fˆ – Convex on R^d

– Additional regularity assumptions: Lipschitz-continuity, smoothness and strong convexity

(23)

Lipschitz continuity

• Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

∀θ ∈ R^d, kθk² 6 D ⇒ kf^′(θ)k² 6 B

• Machine learning – with f(θ) = _n¹ Pn

i=1 ℓ(y_i, θ^⊤Φ(x_i))

– G-Lipschitz loss and R-bounded data: B = GR

(24)

Smoothness and strong convexity

• A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous

∀θ₁, θ₂ ∈ R^d, kf^′(θ₁) − f^′(θ₂)k² 6 Lkθ₁ − θ₂k²

• If f is twice differentiable: ∀θ ∈ R^d, f^′′(θ) 4 L · Id

smooth non−smooth

(25)

Smoothness and strong convexity

• A function f : R^d → R is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous

• If f is twice differentiable: ∀θ ∈ R^d, f^′′(θ) 4 L · Id

i=1 ℓ(y_i, θ^⊤Φ(x_i)) – Hessian ≈ covariance matrix _n¹ Pn

i=1 Φ(x_i)Φ(x_i)^⊤ – ℓ-smooth loss and R-bounded data: L = ℓR²

(26)

Smoothness and strong convexity

• A function f : R^d → R is µ-strongly convex if and only if

∀θ₁, θ₂ ∈ R^d, f(θ₁) > f(θ₂) + f^′(θ₂)^⊤(θ₁ − θ₂) + ^µ₂kθ₁ − θ₂k²2

• If f is twice differentiable: ∀θ ∈ R^d, f^′′(θ) < µ · Id

convex strongly

convex

(27)

Smoothness and strong convexity

i=1 Φ(x_i)Φ(x_i)^⊤

– Data with invertible covariance matrix (low correlation/dimension)

(28)

Smoothness and strong convexity

i=1 Φ(x_i)Φ(x_i)^⊤

– Data with invertible covariance matrix (low correlation/dimension)

• Adding regularization by ^µ₂kθk²

– creates additional bias unless µ is small

(29)

Summary of smoothness/convexity assumptions

• Bounded gradients of f (Lipschitz-continuity): the function f if convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:

∀θ ∈ R^d, kθk² 6 D ⇒ kf^′(θ)k² 6 B

• Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f^′:

• Strong convexity of f: The function f is strongly convex with respect to the norm k · k, with convexity constant µ > 0:

(30)

Analysis of empirical risk minimization

• Approximation and estimation errors: C = {θ ∈ R^d, Ω(θ) 6 D} f(ˆθ) − min

θ∈R^d f(θ) =

f(ˆθ) − min

θ∈C f(θ)

+

minθ∈C f(θ) − min

θ∈R^d f(θ)

– NB: may replace min

θ∈R^d f(θ) by best (non-linear) predictions 1. Uniform deviation bounds, with θˆ ∈ arg min

θ∈C

fˆ(θ)

f(ˆθ) − min

θ∈C f(θ) 6 2 sup

θ∈C |fˆ(θ) − f(θ)| (proof) – Typically slow rate O 1

√n

2. More refined concentration results with faster rates

(31)

Motivation from least-squares

• For least-squares, we have ℓ(y, θ^⊤Φ(x)) = ¹₂(y − θ^⊤Φ(x))², and

f(θ) − fˆ(θ) = 1 2θ^⊤

1 n

n

X

i=1

Φ(x_i)Φ(x_i)^⊤ − EΦ(X)Φ(X)^⊤

θ

−θ^⊤ 1

n

X

i=1

y_iΦ(x_i) − EY Φ(X)

+ 1 2

1 n

n

X

i=1

y_i² − EY ²

,

sup

kθk26D |f(θ) − fˆ(θ)| 6 D² 2

1 n

n

X

i=1

Φ(x_i)Φ(x_i)^⊤ − EΦ(X)Φ(X)^⊤ op

+D

1 n

n

X

i=1

y_iΦ(x_i) − EY Φ(X) 2

+ 1 2

1 n

n

X

i=1

y_i² − EY ² ,

sup

kθk26D |f(θ) − fˆ(θ)| 6 O(1/√

n) with high probability

(32)

Slow rate for supervised learning

• Assumptions (f is the expected risk, fˆ the empirical risk) – Ω(θ) = kθk² (Euclidean norm)

– “Linear” predictors: θ(x) = θ^⊤Φ(x), with kΦ(x)k² 6 R a.s.

– G-Lipschitz loss: f and fˆ are GR-Lipschitz on C = {kθk² 6 D} – No assumptions regarding convexity

(33)

Slow rate for supervised learning

• With probability greater than 1 − δ sup

θ∈C |fˆ(θ) − f(θ)| 6 GRD

√n

2 + r

2 log 2 δ

• Expectated estimation error: E

sup

θ∈C |fˆ(θ) − f(θ)|

6 4GRD

√n

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

(34)

Symmetrization with Rademacher variables

• Let D^′ = {x^′₁, y₁^′ , . . . , x^′_n, y_n^′ } an independent copy of the data D = {x₁, y₁, . . . , x_n, y_n}, with corresponding loss functions f_i^′(θ)

E

sup

θ∈Θ

f(θ) − fˆ(θ)

= E

sup

θ∈Θ

f(θ) − 1 n

n

X

i=1

f_i(θ)

= E

sup

θ∈Θ

1 n

n

X

i=1

E f_i^′(θ) − f_i(θ)|D

6 E

E

sup

θ∈Θ

1 n

n

X

i=1

f_i^′(θ) − f_i(θ)

D

= E

sup

θ∈Θ

1 n

n

X

i=1

f_i^′(θ) − f_i(θ)

= E

sup

θ∈Θ

1 n

n

X

i=1

ε_i f_i^′(θ) − f_i(θ)

with ε_i uniform in {−1,1}

6 2E

sup

θ∈Θ

1 n

n

X

i=1

ε_if_i(θ)

= Rademacher complexity

(35)

Rademacher complexity

• Define the Rademacher complexity of the class of functions (X, Y ) 7→

ℓ(Y, θ^⊤Φ(X)) as

R_n = E

sup

θ∈Θ

1 n

n

X

i=1

ε_if_i(θ)

.

• Note two expectations, with respect to D and with respect to ε

• Main property:

E

sup

θ∈Θ

f(θ) − fˆ(θ)

6 2R_n

(36)

From Rademacher complexity to uniform bound

• Let Z = sup_θ_∈_Θ

f(θ) − fˆ(θ)

• By changing the pair (x_i, y_i), Z may only change by 2

n sup |ℓ(Y, θ^⊤Φ(X))| 6 2

n sup |ℓ(Y, 0)|+GRD

6 2

n ℓ₀+GRD

= c

with sup |ℓ(Y, 0)| = ℓ₀

• MacDiarmid inequality: with probability greater than 1 − δ, Z 6 EZ +

rn 2c ·

r

log 1

δ 6 2R_n +

√2

√n ℓ₀ + GRD r

log 1 δ

(37)

Bounding the Rademacher average - I

• We have, with ϕ_i(u) = ℓ(y_i, u)−ℓ(y_i, 0) is almost surely B-Lipschitz:

R_n = E

sup

θ∈Θ

1 n

n

X

i=1

ε_if_i(θ)

6 E

sup

θ∈Θ

1 n

n

X

i=1

ε_if_i(0)

+ E

sup

θ∈Θ

1 n

n

X

i=1

ε_i

f_i(θ) − f_i(0)

6 ℓ₀

√n + E

sup

θ∈Θ

1 n

n

X

i=1

ε_i

f_i(θ) − f_i(0)

= ℓ₀

√n + E

sup

θ∈Θ

1 n

n

X

i=1

ε_iϕ_i(θ^⊤Φ(x_i))

• Using Ledoux-Talagrand concentration results for Rademacher averages (since ϕ_i is G-Lipschitz, we get:

R_n 6 ℓ₀

√n + 2G · E

sup

kθk26D

1 n

n

X

i=1

ε_iθ^⊤Φ(x_i)

(38)

Bounding the Rademacher average - II

• We have:

R_n 6 ℓ₀

√n + 2GE

sup

kθk26D

1 n

n

X

i=1

ε_iθ^⊤Φ(x_i)

= ℓ₀

√n + 2GE

D1 n

n

X

i=1

ε_iΦ(x_i) 2

6 ℓ₀

√n + 2GD v u u tE

1 n

n

X

i=1

ε_iΦ(x_i)

2

6 2(ℓ₀ + GRD)

√n

• Overall, we get, with probability 1 − δ:

sup

θ∈Θ

f(θ) − fˆ(θ)

6 1

√n ℓ₀ + GRD)(4 + r

2 log 1 δ

(39)

Putting it all together

• We have, with probability 1 − δ, for all θ ∈ Θ:

f(θ) − f(θ_∗) 6

f(θ) − fˆ(θ)

+ fˆ(θ) − min

θ^′∈Θ

fˆ(θ^′)

+

θmin^′∈Θ

fˆ(θ^′) − fˆ(θ_∗) 6 2

√n(ℓ₀ + GRD)(4 + r

2 log 1

δ) + fˆ(θ) − min

θ^′∈Θ

fˆ(θ^′)

• Only need to optimize with precision ^√²_n(ℓ₀ + GRD)

(40)

Slow rate for supervised learning (summary)

• With probability greater than 1 − δ sup

θ∈C |fˆ(θ) − f(θ)| 6 (ℓ₀ + GRD)

√n

2 + r

2 log 2 δ

• Expectated estimation error: E

sup

θ∈C |fˆ(θ) − f(θ)|

6 4(ℓ₀ + GRD)

√n

• Using Rademacher averages (see, e.g., Boucheron et al., 2005)

• Lipschitz functions ⇒ slow rate

(41)

Motivation from mean estimation

• Estimator θˆ = _n¹ Pn

i=1 z_i = arg min_θ_∈_R _2n¹ Pn

i=1(θ − z_i)² = ˆf(θ)

• From before:

– f(θ) = ¹₂E(θ − z)² = ¹₂(θ − Ez)² + ¹₂ var(z) = ˆf(θ) + O(1/√ n) – f(ˆθ) = ¹₂(ˆθ − Ez)² + ¹₂ var(z) = f(Ez) + O(1/√

n)

(42)

Motivation from mean estimation

• Estimator θˆ = _n¹ Pn

i=1 z_i = arg min_θ_∈_R _2n¹ Pn

i=1(θ − z_i)² = ˆf(θ)

• From before:

– f(θ) = ¹₂E(θ − z)² = ¹₂(θ − Ez)² + ¹₂ var(z) = ˆf(θ) + O(1/√ n) – f(ˆθ) = ¹₂(ˆθ − Ez)² + ¹₂ var(z) = f(Ez) + O(1/√

n)

• More refined/direct bound:

f(ˆθ) − f(Ez) = 1

2(ˆθ − Ez)² E

f(ˆθ) − f(Ez)

= 1 2E

1 n

n

X

i=1

z_i − Ez 2

= 1

2n var(z)

• Bound only at θˆ + strong convexity

(43)

Fast rate for supervised learning

• Assumptions (f is the expected risk, fˆ the empirical risk) – Same as before (bounded features, Lipschitz loss)

– Regularized risks: f^µ(θ) = f(θ)+^µ₂kθk²2 and fˆ^µ(θ) = ˆf(θ)+^µ₂kθk²2

– Convexity

• For any a > 0, with probability greater than 1 − δ, for all θ ∈ R^d, f^µ(θ)−min

η∈R^d

f^µ(η) 6 (1+a)( ˆf^µ(θ)−min

η∈R^d

fˆ^µ(η))+8(1 + ¹_a)G²R²(32 + log ¹_δ) µn

• Results from Sridharan, Srebro, and Shalev-Shwartz (2008)

– see also Boucheron and Massart (2011) and references therein

• Strongly convex functions ⇒ fast rate

– Warning: µ should decrease with n to reduce approximation error

(44)

Outline

1. Large-scale machine learning and optimization

• Traditional statistical analysis

• Classical methods for convex optimization 2. Non-smooth stochastic approximation

• Stochastic (sub)gradient and averaging

• Non-asymptotic results and lower bounds

• Strongly convex vs. non-strongly convex

3. Smooth stochastic approximation algorithms

• Asymptotic and non-asymptotic results 4. Beyond decaying step-sizes

5. Finite data sets

(45)

Complexity results in convex optimization

• Assumption: f convex on R^d

• Classical generic algorithms – (sub)gradient method/descent – Accelerated gradient descent – Newton method

• Key additional properties of f

– Lipschitz continuity, smoothness or strong convexity

• Key insight from Bottou and Bousquet (2008)

– In machine learning, no need to optimize below estimation error

• Key reference: Nesterov (2004)

(46)

Subgradient method/descent

• Assumptions

– f convex and B-Lipschitz-continuous on {kθk² 6 D}

• Algorithm: θ_t = Π_D

θ_t₋₁ − 2D B√

tf^′(θ_t₋₁)

– Π_D : orthogonal projection onto {kθk² 6 D}

• Bound:

f

1 t

t−1

X

k=0

θ_k

− f(θ_∗) 6 2DB

√t

• Three-line proof

• Best possible convergence rate after O(d) iterations

(47)

Subgradient method/descent - proof - I

• ^Iteration: ^θ^t ^{= Π}^D^(θ^t⁻¹ ⁻ ^γ^t^f^′^(θ^t⁻¹⁾⁾ ^with ^γ^t ⁼ _B^2D^√_t

• Assumption: kf^′(θ)k² 6 B and kθk² 6 D

kθ_t − θ_∗k²2 6 kθ_t₋₁ − θ_∗ − γ_tf^′(θ_t₋₁)k²2 by contractivity of projections

6 kθ_t₋₁ − θ_∗k²2 + B²γ_t² − 2γ_t(θ_t₋₁ − θ_∗)^⊤f^′(θ_t₋₁) because kf^′(θ_t₋₁)k² 6 B 6 kθ_t₋₁ − θ_∗k²2 + B²γ_t² − 2γ_t

f(θ_t₋₁) − f(θ_∗)

(property of subgradients)

• ^{leading to}

f(θ_t₋₁) − f(θ_∗) 6 B²γ_t

2 + 1 2γ_t

kθ_t₋₁ − θ_∗k²2 − kθ_t − θ_∗k²2

(48)

Subgradient method/descent - proof - II

• Starting from f(θ_t₋₁) − f(θ_∗) 6 B²γ_t

2 + 1 2γ_t

kθ_t₋₁ − θ_∗k²2 − kθ_t − θ_∗k²2

t

X

u=1

f(θ_u₋₁) − f(θ_∗) 6

t

X

u=1

B²γ_u

2 +

t

X

u=1

1 2γ_u

kθ_u₋₁ − θ_∗k²2 − kθ_u − θ_∗k²2

=

t

X

u=1

B²γ_u

2 +

t−1

X

u=1

kθ_u − θ_∗k²2

1

2γ_u+1 − 1 2γ_u

+ kθ₀ − θ_∗k²2

2γ₁ − kθ_t − θ_∗k²2

2γ_t

6

t

X

u=1

B²γ_u

2 +

t−1

X

u=1

4D² 1

2γ_u+1 − 1 2γ_u

+ 4D² 2γ₁

=

t

X

u=1

B²γ_u

2 + 4D²

2γ_t 6 2DB√

t with γ_t = 2D B√ t

• Using convexity: f

1 t

t−1

X

k=0

θ_k

− f(θ_∗) 6 2DB

√t

(49)

Subgradient descent for machine learning

• Assumptions (f is the expected risk, fˆ the empirical risk)

– fˆ(θ) = _n¹ Pn

i=1 ℓ(y_i, Φ(x_i)^⊤θ)

– G-Lipschitz loss: f and fˆ are GR-Lipschitz on C = {kθk² 6 D}

• Statistics: with probability greater than 1 − δ sup

θ∈C |fˆ(θ) − f(θ)| 6 GRD

√n

2 + r

2 log 2 δ

• Optimization: after t iterations of subgradient method fˆ(ˆθ) − min

η∈C

fˆ(η) 6 GRD

√t

• t = n iterations, with total running-time complexity of O(n²d)

(50)

Subgradient descent - strong convexity

• Assumptions

– f convex and B-Lipschitz-continuous on {kθk² 6 D} – f µ-strongly convex

• Algorithm: θ_t = Π_D

θ_t₋₁ − 2

µ(t + 1)f^′(θ_t₋₁)

• Bound:

f

2 t(t + 1)

t

X

k=1

kθ_k₋₁

− f(θ_∗) 6 2B² µ(t + 1)

• Best possible convergence rate after O(d) iterations

(51)

Subgradient method - strong convexity - proof - I

• ^Iteration: ^θ^t ^{= Π}^D^(θ^t⁻¹ ⁻ ^γ^t^f^′^(θ^t⁻¹⁾⁾ ^with ^γ^t ⁼ µ(t+1)²

• Assumption: kf^′(θ)k² 6 B and kθk² 6 D and µ-strong convexity of f kθ_t − θ_∗k²2 6 kθ_t₋₁ − θ_∗ − γ_tf^′(θ_t₋₁)k²2 by contractivity of projections

6 kθ_t₋₁ − θ_∗k²2 + B²γ_t² − 2γ_t(θ_t₋₁ − θ_∗)^⊤f^′(θ_t₋₁) because kf^′(θ_t₋₁)k² 6 B 6 kθ_t₋₁ − θ_∗k²2 + B²γ_t² − 2γ_t

f(θ_t₋₁) − f(θ_∗) + µ

2kθ_t₋₁ − θ_∗k²2

(property of subgradients and strong convexity)

• ^{leading to}

f(θ_t₋₁) − f(θ_∗) 6 B²γ_t

2 + 1 2

1

γ_t − µ

kθ_t₋₁ − θ_∗k²2 − 1

2γ_tkθ_t − θ_∗k²2

6 B²

µ(t + 1) + µ 2

t − 1 2

kθ_t₋₁ − θ_∗k²2 − µ(t + 1)

4 kθ_t − θ_∗k²2

(52)

Subgradient method - strong convexity - proof - II

• ^From ^f^(θ^t⁻¹⁾ ⁻ ^f^(θ^∗⁾ ⁶ _µ(t^B_{+ 1)}² ⁺ ^µ₂^t ⁻₂ ¹^k^θ^t⁻¹ ⁻ ^θ^∗^k²² ⁻ ^µ(t₄^{+ 1)}^k^θ^t ⁻ ^θ^∗^k²²

t

X

u=1

u

f(θ_u₋₁) − f(θ_∗) 6

u

X

t=1

B²u

µ(u + 1) + 1 4

t

X

u=1

u(u − 1)kθ_u₋₁ − θ_∗k²2 − u(u + 1)kθ_u − θ_∗k²2

6 B²t

µ + 1 4

0 − t(t + 1)kθ_t − θ_∗k²2

6 B²t µ

• Using convexity: f

2 t(t + 1)

t

X

u=1

uθ_u₋₁

− f(θ_∗) 6 2B² t + 1

(53)

(smooth) gradient descent

• Assumptions

– f convex with L-Lipschitz-continuous gradient – Minimum attained at θ_∗

• Algorithm:

θ_t = θ_t₋₁ − 1

Lf^′(θ_t₋₁)

• Bound:

f(θ_t) − f(θ_∗) 6 2Lkθ₀ − θ_∗k² t + 4

• Not best possible convergence rate after O(d) iterations

(54)

(smooth) gradient descent - strong convexity

• Assumptions

– f convex with L-Lipschitz-continuous gradient – f µ-strongly convex

• Algorithm:

θ_t = θ_t₋₁ − 1

Lf^′(θ_t₋₁)

• Bound:

f(θ_t) − f(θ_∗) 6 (1 − µ/L)^t

f(θ₀) − f(θ_∗)

• Adaptivity of gradient descent to problem difficulty

• Line search

(55)

Accelerated gradient methods (Nesterov, 1983)

• Assumptions

– f convex with L-Lipschitz-cont. gradient , min. attained at θ_∗

• Algorithm:

θ_t = η_t₋₁ − 1

Lf^′(η_t₋₁) η_t = θ_t + t − 1

t + 2(θ_t − θ_t₋₁)

• Bound:

f(θ_t) − f(θ_∗) 6 2Lkθ₀ − θ_∗k² (t + 1)²

• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

• Not improvable

• Extension to strongly convex functions

(56)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2011)

• Gradient descent as a proximal method (differentiable functions) – θ_t+1 = arg min

θ∈R^d f(θ_t) + (θ − θ_t)^⊤∇f(θ_t)+L

2kθ − θ_tk²2

– θ_t+1 = θ_t − _L¹∇f(θ_t)

(57)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2011)

• Gradient descent as a proximal method (differentiable functions) – θ_t+1 = arg min

θ∈R^d

f(θ_t) + (θ − θ_t)^⊤∇f(θ_t)+L

2kθ − θ_tk²2

– θ_t+1 = θ_t − _L¹∇f(θ_t)

• Problems of the form: min

θ∈R^d

f(θ) + µΩ(θ)

– θ_t+1 = arg min

θ∈R^d

f(θ_t) + (θ − θ_t)^⊤∇f(θ_t)+µΩ(θ)+L

2kθ − θ_tk²2

– Ω(θ) = kθk¹ ⇒ Thresholded gradient descent

• Similar convergence rates than smooth optimization

– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)