slides

(1)

Optimization for Large Scale Machine Learning

Francis Bach

INRIA - Ecole Normale Sup´erieure, Paris, France

É C O L E N O R M A L E S U P É R I E U R E

Hausdorff School - September 2020

Slides available at

www.di.ens.fr/~fbach/hausdorff2020.pdf

(2)

Scientific context

• Proliferation of digital data – Personal data

– Industry

– Scientific: from bioinformatics to humanities

• Need for automated processing of massive data

(3)

Scientific context

– Industry

• Series of “hypes”

Big data → Data science → Machine Learning

→ Deep Learning → Artificial Intelligence

(4)

Scientific context

– Industry

• Series of “hypes”

Big data → Data science → Machine Learning

→ Deep Learning → Artificial Intelligence

• Healthy interactions between theory, applications, and hype?

(5)

Recent progress in perception (vision, audio, text)

person ride dog

From translate.google.fr From Peyr´e et al. (2017)

(6)

Recent progress in perception (vision, audio, text)

person ride dog

(1) Massive data

(2) Computing power

(3) Methodological and scientific progress

(7)

Recent progress in perception (vision, audio, text)

person ride dog

(1) Massive data

(2) Computing power

“Intelligence” = models + algorithms + data

+ computing power

(8)

Recent progress in perception (vision, audio, text)

person ride dog

(1) Massive data

(2) Computing power

“Intelligence” = models + algorithms + data

+ computing power

(9)

Machine learning for large-scale data

• Large-scale supervised machine learning: large d, large n

– d : dimension of each observation (input) or number of parameters – n : number of observations

• Examples: computer vision, advertising, bioinformatics, etc.

– Ideal running-time complexity: O(dn)

• Going back to simple methods

− Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present recent progress

(10)

Advertising

(11)

Object / action recognition in images

car under elephant person in cart

person ride dog person on top of traffic light

From Peyr´e, Laptev, Schmid and Sivic (2017)

(12)

Bioinformatics

• Predicting multiple functions and interactions of proteins

• Massive data: up to 1 millions for humans!

• Complex data

– Amino-acid sequence – Link with DNA

– Tri-dimensional molecule

(13)

Machine learning for large-scale data

– d : dimension of each observation (input), or number of parameters – n : number of observations

• Ideal running-time complexity: O(dn)

− Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present recent progress

(14)

Machine learning for large-scale data

– d : dimension of each observation (input), or number of parameters – n : number of observations

• Ideal running-time complexity: O(dn)

– Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present classical algorithms and some recent progress

(15)

Outline

1. Introduction/motivation: Supervised machine learning

− Machine learning ≈ optimization of finite sums

− Batch optimization methods

2. Fast stochastic gradient methods for convex problems – Variance reduction: for training error

– Constant step-sizes: for testing error 3. Beyond convex problems

– Generic algorithms with generic “guarantees”

– Global convergence for over-parameterized neural networks

(16)

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

(17)

Parametric supervised machine learning

• Advertising: n > 10⁹

– Φ(x) ∈ {0, 1}^d, d > 10⁹ – Navigation history + ad - Linear predictions

- h(x, θ) = θ^⊤Φ(x)

(18)

Parametric supervised machine learning

• Advertising: n > 10⁹

– Φ(x) ∈ {0, 1}^d, d > 10⁹ – Navigation history + ad

• Linear predictions – h(x, θ) = θ^⊤Φ(x)

(19)

Parametric supervised machine learning

x₁ x₂ x₃ x₄ x₅ x₆

y₁ = 1 y₂ = 1 y₃ = 1 y₄ = −1 y₅ = −1 y₆ = −1

(20)

Parametric supervised machine learning

x₁ x₂ x₃ x₄ x₅ x₆

y₁ = 1 y₂ = 1 y₃ = 1 y₄ = −1 y₅ = −1 y₆ = −1

– Neural networks (n, d > 10⁶): h(x, θ) = θ_m^⊤σ(θ_m^⊤₋₁σ(· · · θ₂^⊤σ(θ₁^⊤x))

x y

θ¹

θ³ θ²

(21)

Parametric supervised machine learning

• (regularized) empirical risk minimization: find θˆ solution of min

θ∈R^d

1 n

n

X

i=1

nℓ y_i, h(x_i, θ)

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

data fitting term + regularizer

(22)

Usual losses

• Regression: y ∈ R, prediction yˆ = h(x, θ) – quadratic loss ¹₂(y − yˆ)² = ¹₂(y − h(x, θ))²

(23)

Usual losses

• Regression: y ∈ R, prediction yˆ = h(x, θ) – quadratic loss ¹₂(y − yˆ)² = ¹₂(y − h(x, θ))²

• Classification : y ∈ {−1, 1}, prediction yˆ = sign(h(x, θ)) – loss of the form ℓ(y h(x, θ))

– “True” 0-1 loss: ℓ(y h(x, θ)) = 1y h(x,θ)<0

– Usual convex losses:

−30 −2 −1 0 1 2 3 4

1 2 3 4 5

0−1 hinge square logistic

(24)

Main motivating examples

• Support vector machine (hinge loss): non-smooth ℓ(Y, h(Xθ)) = max{1 − Y h(X, θ), 0}

• Logistic regression: smooth

ℓ(Y, h(Xθ)) = log(1 + exp(−Y h(X, θ)))

• Least-squares regression

ℓ(Y, h(Xθ)) = 1

2(Y − h(X, θ))²

• Structured output regression

– See Tsochantaridis et al. (2005); Lacoste-Julien et al. (2013)

(25)

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: kθk²2 = Pd

j=1 |θ_j|² – Numerically well-behaved if h(x, θ) = θ^⊤Φ(x)

– Representer theorem and kernel methods : θ = Pn

i=1 α_iΦ(x_i)

– See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

(26)

Usual regularizers

• Main goal: avoid overfitting

• (squared) Euclidean norm: kθk²2 = Pd

j=1 |θ_j|² – Numerically well-behaved if h(x, θ) = θ^⊤Φ(x)

– Representer theorem and kernel methods : θ = Pn

i=1 α_iΦ(x_i)

– See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and Cristianini (2004)

• Sparsity-inducing norms

– Main example: ℓ₁-norm kθk¹ = Pd

j=1 |θ_j|

– Perform model selection as well as regularization – Non-smooth optimization and structured sparsity

– See, e.g., Bach, Jenatton, Mairal, and Obozinski (2012a,b)

(27)

Parametric supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

(28)

Parametric supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

(29)

Parametric supervised machine learning

θ∈R^d

1 n

n

X

i=1

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

• Optimization: optimization of regularized risk training cost

(30)

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

θ∈R^d

1 n

n

X

i=1

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

• Statistics: guarantees on Ep(x,y)ℓ(y, h(x, θ)) testing cost

(31)

Smoothness and (strong) convexity

• A function g : R^d → R is L-smooth if and only if it is twice differentiable and

∀θ ∈ R^d,

eigenvalues

g^′′(θ)

6 L

smooth non-smooth

(32)

Smoothness and (strong) convexity

• A function g : R^d → R is L-smooth if and only if it is twice differentiable and

∀θ ∈ R^d,

eigenvalues

g^′′(θ)

6 L

• Machine learning – with g(θ) = _n¹ Pn

i=1 ℓ(y_i, h(x_i, θ))

– Smooth prediction function θ 7→ h(x_i, θ) + smooth loss – (see board)

(33)

Board

• Function g(θ) = 1 n

n

X

i=1

ℓ(y_i, θ^⊤Φ(x_i))

(34)

Board

n

X

i=1

• Gradient g^′(θ) = 1 n

n

X

i=1

ℓ^′(y_i, θ^⊤Φ(x_i))Φ(x_i)

(35)

Board

n

X

i=1

n

X

i=1

• Hessian g^′′(θ) = 1 n

n

X

i=1

ℓ^′′(y_i, θ^⊤Φ(x_i))Φ(x_i)Φ(x_i)^⊤ – Smooth loss ⇒ ℓ^′′(y_i, θ^⊤Φ(x_i)) bounded

(36)

Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is convex if and only if

∀θ ∈ R^d, eigenvalues

g^′′(θ)

> 0

convex

(37)

Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is µ-strongly convex if and only if

g^′′(θ)

> µ

convex

strongly

convex

(38)

Smoothness and (strong) convexity

g^′′(θ)

> µ – Condition number κ = L/µ > 1

(small κ = L/µ) (large κ = L/µ)

(39)

Smoothness and (strong) convexity

g^′′(θ)

> µ

• Convexity in machine learning – With g(θ) = _n¹ Pn

i=1 ℓ(y_i, h(x_i, θ))

– Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

(40)

Smoothness and (strong) convexity

g^′′(θ)

> µ

• Convexity in machine learning – With g(θ) = _n¹ Pn

i=1 ℓ(y_i, h(x_i, θ))

– Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

• Relevance of convex optimization

– Easier design and analysis of algorithms

– Global minimum vs. local minimum vs. stationary points

– Gradient-based algorithms only need convexity for their analysis

(41)

Smoothness and (strong) convexity

g^′′(θ)

> µ

• Strong convexity in machine learning – With g(θ) = _n¹ Pn

i=1 ℓ(y_i, h(x_i, θ))

– Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

(42)

Smoothness and (strong) convexity

g^′′(θ)

> µ

i=1 ℓ(y_i, h(x_i, θ))

– Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x) – Invertible covariance matrix _n¹ P_n

i=1 Φ(x_i)Φ(x_i)^⊤ ⇒ n > d (board) – Even when µ > 0, µ may be arbitrarily small!

(43)

Board

n

X

i=1

n

X

i=1

• Hessian g^′′(θ) = 1 n

n

X

i=1

ℓ^′′(y_i, θ^⊤Φ(x_i))Φ(x_i)Φ(x_i)^⊤ – Smooth loss ⇒ ℓ^′′(y_i, θ^⊤Φ(x_i)) bounded

• Square loss ⇒ ℓ^′′(y_i, θ^⊤Φ(x_i)) = 1 – Hessian proportional to _n¹ Pn

i=1 Φ(x_i)Φ(x_i)^⊤

(44)

Smoothness and (strong) convexity

g^′′(θ)

> µ

i=1 ℓ(y_i, h(x_i, θ))

– Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x) – Invertible covariance matrix _n¹ P_n

i=1 Φ(x_i)Φ(x_i)^⊤ ⇒ n > d (board) – Even when µ > 0, µ may be arbitrarily small!

• Adding regularization by ^µ₂kθk²

– creates additional bias unless µ is small, but reduces variance – Typically L/√

n > µ > L/n

(45)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_t₋₁ − γ_t g^′(θ_t₋₁) (line search+adaptivity) g(θ_t) − g(θ_∗) 6 O(1/t)

g(θ_t) − g(θ_∗) 6 O(e⁻^t(µ/L)) = O(e⁻^t/κ)

(46)

Iterative methods for minimizing smooth functions

• Gradient descent: θ_t = θ_t₋₁ − γ_t g^′(θ_t₋₁) (line search+adaptivity) g(θ_t) − g(θ_∗) 6 O(1/t)

g(θ_t) − g(θ_∗) 6 O((1−µ/L)^t) = O(e⁻^t(µ/L)) if µ-strongly convex

(47)

Gradient descent - Proof for quadratic functions

• Quadratic convex function: g(θ) = ¹₂θ^⊤Hθ − c^⊤θ

– µ and L are the smallest and largest eigenvalues of H – Global optimum θ_∗ = H⁻¹c (or H^†c) such that Hθ_∗ = c

(48)

Gradient descent - Proof for quadratic functions

• Gradient descent with γ = 1/L:

θ_t = θ_t₋₁ − 1

L(Hθ_t₋₁ − c) = θ_t₋₁ − 1

L(Hθ_t₋₁ − Hθ_∗) θ_t − θ_∗ = (I − 1

LH)(θ_t₋₁ − θ_∗) = (I − 1

LH)^t(θ₀ − θ_∗)

(49)

Gradient descent - Proof for quadratic functions

θ_t = θ_t₋₁ − 1

L(Hθ_t₋₁ − c) = θ_t₋₁ − 1

L(Hθ_t₋₁ − Hθ_∗) θ_t − θ_∗ = (I − 1

LH)(θ_t₋₁ − θ_∗) = (I − 1

LH)^t(θ₀ − θ_∗)

• Strong convexity µ > 0: eigenvalues of (I − _L¹H)^t in [0, (1 − _L^µ)^t] – Convergence of iterates: kθ_t − θ_∗k² 6 (1 − µ/L)^2tkθ₀ − θ_∗k²

– Function values: g(θ_t) − g(θ_∗) 6 (1 − µ/L)^2t

g(θ₀) − g(θ_∗) – Function values: g(θ_t) − g(θ_∗) 6 ^L

t kθ₀ − θ_∗k²

(50)

Gradient descent - Proof for quadratic functions

θ_t = θ_t₋₁ − 1

L(Hθ_t₋₁ − c) = θ_t₋₁ − 1

L(Hθ_t₋₁ − Hθ_∗) θ_t − θ_∗ = (I − 1

LH)(θ_t₋₁ − θ_∗) = (I − 1

LH)^t(θ₀ − θ_∗)

• Convexity µ = 0: eigenvalues of (I − _L¹H)^t in [0, 1]

– No convergence of iterates: kθ_t − θ_∗k² 6 kθ₀ − θ_∗k²

– Function values: g(θ_t)−g(θ_∗) 6 max_v_∈_[0,L] v(1−v/L)^2tkθ₀−θ_∗k² – Function values: g(θ_t) − g(θ_∗) 6 ^L

t kθ₀ − θ_∗k² (board)

(51)

Board

• No convergence of iterates: kθ_t − θ_∗k² 6 kθ₀ − θ_∗k²

• g(θ_t) − g(θ_∗) = ¹₂(θ_t − θ_∗)^⊤H(θ_t − θ_∗), which is equal to 1

2(θ₀ − θ_∗)^⊤H(I − 1

LH)^2t(θ₀ − θ_∗)

• Function values: g(θ_t) − g(θ_∗) 6 max_v_∈_[0,L] v(1 − v/L)^2tkθ₀ − θ_∗k²

(52)

Board

• No convergence of iterates: kθ_t − θ_∗k² 6 kθ₀ − θ_∗k²

• g(θ_t) − g(θ_∗) = ¹₂(θ_t − θ_∗)^⊤H(θ_t − θ_∗), which is equal to 1

2(θ₀ − θ_∗)^⊤H(I − 1

LH)^2t(θ₀ − θ_∗)

• Function values: g(θ_t) − g(θ_∗) 6 max_v_∈_[0,L] v(1 − v/L)^2tkθ₀ − θ_∗k² v(1 − v/L)^2t 6 v exp(−v/L)^2t = v exp(−2tv/L)

6 (2tv/L) exp(−2tv/L) × L 2t

6 max

α>0 α exp(−α) × L

2t = O( L 2t)

(53)

Proof for strongly-convex functions

• Using the correct inequality is key! Here Kurdyka-Lojasiewicz:

g(θ) − g(θ_∗) 6 1

2µkg^′(θ)k²

(54)

Proof for strongly-convex functions

• Using the correct inequality is key! Here Kurdyka-Lojasiewicz:

g(θ) − g(θ_∗) 6 1

2µkg^′(θ)k²

• Assuming γ 6 1/L

g(θ_t) 6 g(θ_t₋₁) + g^′(θ_t₋₁)^⊤(−γg^′(θ_t₋₁)) + L

2k − γg^′(θ_t₋₁)k² g(θ_t) 6 g(θ_t₋₁) − γ(1 − Lγ/2)kg^′(θ_t₋₁)k²

g(θ_t) 6 g(θ_t₋₁) − γ

2kg^′(θ_t₋₁)k² 6 g(θ_t₋₁) − γµ(g(θ_t₋₁) − g(θ_∗)) leading to g(θ_t) − g(θ_∗) 6 (1 − γµ)(g(θ_t₋₁) − g(θ_∗))

• See automatic proofs by Taylor et al. (2017); Taylor and Bach (2019)

(55)

Accelerated gradient methods (Nesterov, 1983)

• Assumptions

– g convex with L-Lipschitz-cont. gradient , min. attained at θ_∗

• Algorithm:

θ_t = η_t₋₁ − 1

Lg^′(η_t₋₁) η_t = θ_t + t − 1

t + 2(θ_t − θ_t₋₁)

• Bound:

g(θ_t) − g(θ_∗) 6 2Lkθ₀ − θ_∗k² (t + 1)²

• Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011)

• Not improvable

• Extension to strongly-convex functions

(56)

Accelerated gradient methods - strong convexity

• Assumptions

– g convex with L-Lipschitz-cont. gradient , min. attained at θ_∗ – g µ-strongly convex

• Algorithm:

θ_t = η_t₋₁ − 1

Lg^′(η_t₋₁) η_t = θ_t + 1 − p

µ/L 1 + p

µ/L(θ_t − θ_t₋₁)

• Bound: g(θ_t) − g(θ_∗) 6 Lkθ₀ − θ_∗k²(1 − p

µ/L)^t

– Ten-line proof (see, e.g., Schmidt, Le Roux, and Bach, 2011) – Not improvable

– Relationship with conjugate gradient for quadratic functions

(57)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012a)

• Gradient descent as a proximal method (differentiable functions) – θ_t+1 = arg min

θ∈R^d f(θ_t) + (θ − θ_t)^⊤f^′(θ_t)+L

2kθ − θ_tk²2

– θ_t+1 = θ_t − _L¹f^′(θ_t)

(58)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012a)

• Gradient descent as a proximal method (differentiable functions) – θ_t+1 = arg min

θ∈R^d f(θ_t) + (θ − θ_t)^⊤f^′(θ_t)+L

2kθ − θ_tk²2

– θ_t+1 = θ_t − _L¹f^′(θ_t)

• Problems of the form: min

θ∈R^d f(θ) + µΩ(θ) – θ_t+1 = arg min

θ∈R^d f(θ_t) + (θ − θ_t)^⊤f^′(θ_t)+µΩ(θ)+L

2kθ − θ_tk²2

– Ω(θ) = kθk¹ ⇒ Thresholded gradient descent

• Similar convergence rates than smooth optimization

– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

(59)

Soft-thresholding for the ℓ

₁

-norm

• Example 1: quadratic problem in 1D, i.e. min

x∈R

1

2x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero

– Derivative at 0+: g₊ = λ − y and 0−: g₋ = −λ − y

– x = 0 is the solution iff g₊ > 0 and g₋ 6 0 (i.e., |y| 6 λ) – x > 0 is the solution iff g₊ 6 0 (i.e., y > λ) ⇒ x^∗ = y − λ – x 6 0 is the solution iff g₋ > 0 (i.e., y 6 −λ) ⇒ x^∗ = y + λ

• Solution x^∗ = sign(y)(|y| − λ)₊ = soft thresholding

(60)

Soft-thresholding for the ℓ

₁

-norm

• Example 1: quadratic problem in 1D, i.e. min

x∈R

1

2x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero

• Solution x^∗ = sign(y)(|y| − λ)₊ = soft thresholding

x

−λ

x(y)*

λ y

(61)

Projected gradient descent

• Problems of the form: min

θ∈K f(θ) – θ_t+1 = arg min

θ∈K f(θ_t) + (θ − θ_t)^⊤f^′(θ_t)+L

2kθ − θ_tk²2

– θ_t+1 = arg min

θ∈K

1 2

θ − θ_t − 1

Lf^′(θ_t)

2

– Projected gradient descent 2

• Similar convergence rates than smooth optimization

– Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)

(62)

Iterative methods for minimizing smooth functions

• Gradient descent: θ_t = θ_t₋₁ − γ_t g^′(θ_t₋₁)

– O(1/t) convergence rate for convex functions – O(e⁻^t/κ) linear if strongly-convex

(63)

Iterative methods for minimizing smooth functions

– O(1/t) convergence rate for convex functions – O(e⁻^t/κ) linear if strongly-convex

• Newton method: θ_t = θ_t₋₁ − g^′′(θ_t₋₁)⁻¹g^′(θ_t₋₁) – O e⁻^ρ2^t

quadratic rate (see board)

(64)

Board

• Second-order Taylor expansion

g(θ) ≈ g(θ_t₋₁)+g^′(θ_t₋₁)^⊤(θ−θ_t₋₁)+1

2(θ−θ_t₋₁)^⊤g^′′(θ_t₋₁)(θ−θ_t₋₁) – Minimization by zeroing gradient:

g^′(θ_t₋₁) + g^′′(θ_t₋₁)(θ − θ_t₋₁) = 0 – Iteration: θ_t = θ_t₋₁ − g^′′(θ_t₋₁)⁻¹g^′(θ_t₋₁)

• Local quadratic convergence: kθ_t − θ_∗k = O(kθ_t₋₁ − θ_∗k²)

(65)

Iterative methods for minimizing smooth functions

– O(1/t) convergence rate for convex functions

– O(e⁻^t/κ) linear if strongly-convex ⇔ O(κ log ¹_ε) iterations

quadratic rate ⇔ O(log log ¹_ε) iterations

(66)

Iterative methods for minimizing smooth functions

– O(e⁻^t/κ) linear if strongly-convex ⇔ complexity = O(nd · κ log ¹_ε)

quadratic rate ⇔ complexity = O((nd² + d³) · log log ¹_ε)

(67)

Iterative methods for minimizing smooth functions

• Key insights for machine learning (Bottou and Bousquet, 2008) 1. No need to optimize below statistical error

2. Objective functions are averages

3. Testing error is more important than training error

(68)

Iterative methods for minimizing smooth functions

(69)

Outline

1. Introduction/motivation: Supervised machine learning – Machine learning ≈ optimization of finite sums

– Batch optimization methods

2. Fast stochastic gradient methods for convex problems

− Variance reduction: for training error

− Constant step-sizes: for testing error 3. Beyond convex problems

– Generic algorithms with generic “guarantees”

– Global convergence for over-parameterized neural networks

(70)

Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.

θ∈R^d

1 n

n

X

i=1

+ λΩ(θ)o

= 1 n

n

X

i=1

f_i(θ)

• Statistics: guarantees on Ep(x,y)ℓ(y, h(x, θ)) testing cost

(71)

Iterative methods for minimizing smooth functions

(72)

Stochastic gradient descent (SGD) for finite sums

min

θ∈R^d g(θ) = 1 n

n

X

i=1

f_i(θ)

• Iteration: θ_t = θ_t₋₁ − γ_tf_i(t)^′ (θ_t₋₁)

– Sampling with replacement: i(t) random element of {1, . . . , n} – Polyak-Ruppert averaging: θ¯_t = _t+1¹ Pt

u=0 θ_u

(73)

Stochastic gradient descent (SGD) for finite sums

min

θ∈R^d g(θ) = 1 n

n

X

i=1

f_i(θ)

• Iteration: θ_t = θ_t₋₁ − γ_tf_i(t)^′ (θ_t₋₁)

– Sampling with replacement: i(t) random element of {1, . . . , n} – Polyak-Ruppert averaging: θ¯_t = _t+1¹ Pt

u=0 θ_u

• Convergence rate if each f_i is convex L-smooth and g µ-strongly- convex:

Eg(¯θ_t) − g(θ_∗) 6

O(1/√

t) if γ_t = 1/(L√ t) O(L/(µt)) = O(κ/t) if γ_t = 1/(µt) – No adaptivity to strong-convexity in general

– Running-time complexity: O(d · κ/ε)

(74)

Where does κ/t come from? - I

• Iteration: θ_t = θ_t₋₁ − γ_tf_i(t)^′ (θ_t₋₁), with bounded gradients

kθ_t − θ_∗k² = kθ_t₋₁ − θ_∗k² − 2γ_t(θ_t₋₁ − θ_∗)^⊤f_i(t)^′ (θ_t₋₁) + γ_t²kf_i(t)^′ (θ_t₋₁)k² 6 kθ_t₋₁ − θ_∗k² − 2γ_t(θ_t₋₁ − θ_∗)^⊤f_i(t)^′ (θ_t₋₁) + γ_t²B²

(75)

Where does κ/t come from? - I

− leading to, with the proper conditional expectation:

E

kθ_t − θ_∗k²|i(1), . . . , i(t − 1)

6 kθ_t₋₁ − θ_∗k² − 2γ_t(θ_t₋₁ − θ_∗)^⊤g^′(θ_t₋₁) + γ_t²B²

(76)

Where does κ/t come from? - I

− leading to, with the proper conditional expectation:

E

kθ_t − θ_∗k²|i(1), . . . , i(t − 1)

6 kθ_t₋₁ − θ_∗k² − 2γ_t(θ_t₋₁ − θ_∗)^⊤g^′(θ_t₋₁) + γ_t²B²

− Using another magical identity g(θ) + (θ_∗ − θ)^⊤g^′(θ) + ^µ₂kθ − θ_∗k² 6 g(θ_∗):

E

kθ_t−θ_∗k²|i(1), . . . , i(t−1)

6 (1−γ_tµ)kθ_t₋₁−θ_∗k²−2γ_t[g(θ_t₋₁)−g(θ_∗)]+γ_t²B²

slides

Optimization for Large Scale Machine Learning

Francis Bach

INRIA - Ecole Normale Sup´erieure, Paris, France

Hausdorff School - September 2020

Slides available at

Scientific context

Scientific context

Scientific context

Recent progress in perception (vision, audio, text)

Recent progress in perception (vision, audio, text)

Recent progress in perception (vision, audio, text)

Recent progress in perception (vision, audio, text)

Machine learning for large-scale data

Advertising

Object / action recognition in images

Bioinformatics

Machine learning for large-scale data

Machine learning for large-scale data

Outline

Parametric supervised machine learning

Parametric supervised machine learning

Parametric supervised machine learning

Parametric supervised machine learning

Parametric supervised machine learning

x y

Parametric supervised machine learning

Usual losses

Usual losses

Main motivating examples

Usual regularizers

Usual regularizers

Parametric supervised machine learning

Parametric supervised machine learning

Parametric supervised machine learning

Parametric supervised machine learning

Smoothness and (strong) convexity

smooth non-smooth

Smoothness and (strong) convexity

Board

Board

Board

Smoothness and (strong) convexity

convex

Smoothness and (strong) convexity

convex

strongly

convex

Smoothness and (strong) convexity

Smoothness and (strong) convexity

Smoothness and (strong) convexity

Smoothness and (strong) convexity

Smoothness and (strong) convexity

Board

Smoothness and (strong) convexity

Iterative methods for minimizing smooth functions

Iterative methods for minimizing smooth functions

Gradient descent - Proof for quadratic functions

Gradient descent - Proof for quadratic functions

Gradient descent - Proof for quadratic functions

Gradient descent - Proof for quadratic functions

Board

Board

Proof for strongly-convex functions

Proof for strongly-convex functions

Accelerated gradient methods (Nesterov, 1983)

Accelerated gradient methods - strong convexity

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012a)

Optimization for sparsity-inducing norms

(see Bach, Jenatton, Mairal, and Obozinski, 2012a)

Soft-thresholding for the ℓ

-norm

Soft-thresholding for the ℓ

-norm

x

−λ

x*(y)

λ y

Projected gradient descent

x(y)*