
(1)

Stochastic Variance-Reduced Optimization for Machine Learning

Part I

Francis Bach

INRIA - Ecole Normale Supérieure, Paris, France


Joint tutorial with Mark Schmidt, UBC - SIOPT - 2017

(2)

Context

Machine learning for large-scale data

• Large-scale supervised machine learning: large d, large n

– d: dimension of each observation (input) or number of parameters
– n: number of observations

• Examples: computer vision, advertising, bioinformatics, etc.

– Ideal running-time complexity: O(dn)

• Going back to simple methods

− Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present recent progress

(3)

Search engines - Advertising - Marketing

(4)

Visual object recognition

(5)

Context

Machine learning for large-scale data

• Large-scale supervised machine learning: large d, large n

– d: dimension of each observation (input), or number of parameters
– n: number of observations

• Examples: computer vision, advertising, bioinformatics, etc.

• Ideal running-time complexity: O(dn)

• Going back to simple methods

− Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present recent progress

(6)

Context

Machine learning for large-scale data

• Large-scale supervised machine learning: large d, large n

– d: dimension of each observation (input), or number of parameters
– n: number of observations

• Examples: computer vision, advertising, bioinformatics, etc.

• Ideal running-time complexity: O(dn)

• Going back to simple methods

– Stochastic gradient methods (Robbins and Monro, 1951)

• Goal: Present recent progress

(7)

Outline

1. Introduction/motivation: Supervised machine learning

– Optimization of finite sums
– Existing optimization methods for finite sums

2. Convex finite-sum problems
– Linearly-convergent stochastic gradient method
– SAG, SAGA, SVRG, SDCA, MISO, etc.
– From lazy gradient evaluations to variance reduction

3. Non-convex problems
4. Non-independent and identically distributed
5. Non-stochastic: other types of structures
6. Non-serial: parallel and distributed settings

(8)

References

• Textbooks and tutorials

– Nesterov (2004): Introductory lectures on convex optimization

– Bubeck (2015): Convex optimization: algorithms and complexity
– Bertsekas (2016): Nonlinear programming

– Bottou et al. (2016): Optimization methods for large-scale machine learning

• Research papers
– See end of slides

– Slides available at www.ens.fr/~fbach/

(9)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

(10)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• Motivating examples

– Linear predictions: h(x, θ) = θ⊤Φ(x) with features Φ(x) ∈ Rd

(11)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• Motivating examples

– Linear predictions: h(x, θ) = θ⊤Φ(x) with features Φ(x) ∈ Rd
– Neural networks: h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(· · · θ_2⊤ σ(θ_1⊤ x)))

[Figure: neural network mapping x to y through layers with parameters θ1, θ2, θ3]

(12)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• (regularized) empirical risk minimization: find θ̂ solution of

    min_{θ ∈ Rd}  (1/n) ∑_{i=1}^n { ℓ(yi, h(xi, θ)) + λΩ(θ) }  =  (1/n) ∑_{i=1}^n fi(θ)

    data fitting term + regularizer

(13)

Usual losses

• Regression: y ∈ R

– Quadratic loss: ℓ(y, h(x, θ)) = (1/2)(y − h(x, θ))²

(14)

Usual losses

• Regression: y ∈ R

– Quadratic loss: ℓ(y, h(x, θ)) = (1/2)(y − h(x, θ))²

• Classification: y ∈ {−1, 1}
– Logistic loss: ℓ(y, h(x, θ)) = log(1 + exp(−y h(x, θ)))

(15)

Usual losses

• Regression: y ∈ R

– Quadratic loss: ℓ(y, h(x, θ)) = (1/2)(y − h(x, θ))²

• Classification: y ∈ {−1, 1}
– Logistic loss: ℓ(y, h(x, θ)) = log(1 + exp(−y h(x, θ)))

• Structured prediction
– Complex outputs y (k classes/labels, graphs, trees, or {0, 1}^k, etc.)
– Prediction function h(x, θ) ∈ R^k

– Conditional random fields (Lafferty et al., 2001)

– Max-margin (Taskar et al., 2003; Tsochantaridis et al., 2005)
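To make the two scalar losses above concrete, here is a minimal Python sketch (the function names are mine, not from the tutorial) computing each loss and its derivative with respect to the prediction h(x, θ):

```python
import numpy as np

def quadratic_loss(y, pred):
    """Quadratic loss (1/2)(y - pred)^2 and its derivative w.r.t. pred."""
    residual = pred - y
    return 0.5 * residual ** 2, residual

def logistic_loss(y, pred):
    """Logistic loss log(1 + exp(-y * pred)) for y in {-1, +1}, with derivative w.r.t. pred."""
    margin = y * pred
    loss = np.logaddexp(0.0, -margin)        # numerically stable log(1 + exp(-margin))
    grad = -y / (1.0 + np.exp(margin))       # d loss / d pred
    return loss, grad
```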

(16)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• (regularized) empirical risk minimization: find θ̂ solution of

    min_{θ ∈ Rd}  (1/n) ∑_{i=1}^n { ℓ(yi, h(xi, θ)) + λΩ(θ) }  =  (1/n) ∑_{i=1}^n fi(θ)

    data fitting term + regularizer

(17)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• (regularized) empirical risk minimization: find θ̂ solution of

    min_{θ ∈ Rd}  (1/n) ∑_{i=1}^n { ℓ(yi, h(xi, θ)) + λΩ(θ) }  =  (1/n) ∑_{i=1}^n fi(θ)

    data fitting term + regularizer

(18)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• (regularized) empirical risk minimization: find θ̂ solution of

    min_{θ ∈ Rd}  (1/n) ∑_{i=1}^n { ℓ(yi, h(xi, θ)) + λΩ(θ) }  =  (1/n) ∑_{i=1}^n fi(θ)

    data fitting term + regularizer

• Optimization: optimization of regularized risk (training cost)

(19)

Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ Rd

• (regularized) empirical risk minimization: find θ̂ solution of

    min_{θ ∈ Rd}  (1/n) ∑_{i=1}^n { ℓ(yi, h(xi, θ)) + λΩ(θ) }  =  (1/n) ∑_{i=1}^n fi(θ)

    data fitting term + regularizer

• Optimization: optimization of regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
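As an illustration of the finite-sum structure g(θ) = (1/n) ∑_{i=1}^n fi(θ), here is a hedged Python sketch for the special case of ℓ2-regularized logistic regression with linear predictions h(x, θ) = θ⊤Φ(x); the function name and the particular choice of loss and regularizer are assumptions made for this example, not prescribed by the slides:

```python
import numpy as np

def regularized_empirical_risk(theta, Phi, y, lam):
    """g(theta) = (1/n) sum_i [ logistic loss + (lam/2) ||theta||^2 ] and its gradient."""
    n = Phi.shape[0]
    margins = y * (Phi @ theta)                       # y_i * theta^T Phi(x_i)
    losses = np.logaddexp(0.0, -margins)              # logistic loss per observation
    value = losses.mean() + 0.5 * lam * theta @ theta
    coeffs = -y / (1.0 + np.exp(margins))             # loss derivative w.r.t. each prediction
    grad = Phi.T @ coeffs / n + lam * theta
    return value, grad
```

Minimizing `value` corresponds to the training cost; the testing cost E_{p(x,y)} ℓ(y, h(x, θ)) would be estimated on held-out data.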

(20)

Smoothness and (strong) convexity

• A function g : Rd → R is L-smooth if and only if it is twice differentiable and

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≤ L

[Figure: a smooth function vs. a non-smooth function]

(21)

Smoothness and (strong) convexity

• A function g : Rd → R is L-smooth if and only if it is twice differentiable and

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≤ L

• Machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Smooth prediction function θ ↦ h(xi, θ) + smooth loss

(22)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ 0

[Figure: a convex function]

(23)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

[Figure: a convex function vs. a strongly convex function]

(24)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

– Condition number κ = L/µ ≥ 1

[Figure: level sets for small κ = L/µ and large κ = L/µ]

(25)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

• Convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

(26)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

• Convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization

– Easier design and analysis of algorithms

– Global minimum vs. local minimum vs. stationary points

– Gradient-based algorithms only need convexity for their analysis

(27)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

• Strong convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

(28)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

• Strong convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
– Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(xi)Φ(xi)⊤ ⇒ n ≥ d
– Even when µ > 0, µ may be arbitrarily small!

(29)

Smoothness and (strong) convexity

• A twice differentiable function g : Rd → R is µ-strongly convex if and only if

    ∀θ ∈ Rd,   eigenvalues[g′′(θ)] ≥ µ

• Strong convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(yi, h(xi, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
– Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(xi)Φ(xi)⊤ ⇒ n ≥ d
– Even when µ > 0, µ may be arbitrarily small!

• Adding regularization by (µ/2)‖θ‖²
– Creates additional bias unless µ is small, but reduces variance
– Typically L/√n ≥ µ ≥ L/n
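The constants L and µ can be read off the eigenvalues of the Hessian. The following numerical sketch (random synthetic data; all names are mine) checks that, for ℓ2-regularized logistic regression with linear predictions, the eigenvalues of g′′(θ) lie between µ (the regularization parameter) and a data-dependent smoothness constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 200, 10, 1e-2
Phi = rng.standard_normal((n, d))                  # features Phi(x_i)
theta = rng.standard_normal(d)

# Hessian of (1/n) sum_i log(1 + exp(-y_i theta^T Phi_i)) + (mu/2)||theta||^2
# (it does not depend on the labels y_i for the logistic loss)
sigma = 1.0 / (1.0 + np.exp(-(Phi @ theta)))       # sigmoid of the scores
weights = sigma * (1.0 - sigma)                    # per-example curvature, in (0, 1/4]
hessian = (Phi * weights[:, None]).T @ Phi / n + mu * np.eye(d)

eigs = np.linalg.eigvalsh(hessian)
print("smallest eigenvalue:", eigs.min(), ">= mu =", mu)
print("largest eigenvalue (local smoothness constant):", eigs.max())
```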

(30)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)

    g(θt) − g(θ∗) ≤ O(1/t)
    g(θt) − g(θ∗) ≤ O(e^{−t(µ/L)}) = O(e^{−t/κ})

[Figure: level sets for small κ = L/µ and large κ = L/µ]

(31)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)

    g(θt) − g(θ∗) ≤ O(1/t)
    g(θt) − g(θ∗) ≤ O((1 − µ/L)^t) = O(e^{−t(µ/L)}) if µ-strongly convex

[Figure: level sets for small κ = L/µ and large κ = L/µ]

(32)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex

(33)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
– O(e^{−ρ·2^t}) quadratic rate

(34)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
– O(e^{−ρ·2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations

(35)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
– O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

(36)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
– O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error

2. Cost functions are averages

3. Testing error is more important than training error

(37)

Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on Rd

• Gradient descent: θt = θt−1 − γt g′(θt−1)
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
– O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error

2. Cost functions are averages

3. Testing error is more important than training error
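A minimal Python sketch of the two iterations above, assuming callables `grad` and `hessian` for g′ and g′′ (the step size and iteration counts are left to the caller; this is an illustration, not the tutorial's code):

```python
import numpy as np

def gradient_descent(grad, theta0, step, n_iters):
    """theta_t = theta_{t-1} - gamma * g'(theta_{t-1})."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - step * grad(theta)
    return theta

def newton_method(grad, hessian, theta0, n_iters):
    """theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1})."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - np.linalg.solve(hessian(theta), grad(theta))
    return theta
```

On a finite sum of n functions in dimension d, each gradient-descent step costs O(nd) while each Newton step costs O(nd² + d³), matching the complexities on the slide.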

(38)

Stochastic gradient descent (SGD) for finite sums

    min_{θ ∈ Rd}  g(θ) = (1/n) ∑_{i=1}^n fi(θ)

• Iteration: θt = θt−1 − γt f′_{i(t)}(θt−1)
– Sampling with replacement: i(t) random element of {1, . . . , n}
– Polyak-Ruppert averaging: θ̄t = (1/(t+1)) ∑_{u=0}^t θu

(39)

Stochastic gradient descent (SGD) for finite sums

    min_{θ ∈ Rd}  g(θ) = (1/n) ∑_{i=1}^n fi(θ)

• Iteration: θt = θt−1 − γt f′_{i(t)}(θt−1)
– Sampling with replacement: i(t) random element of {1, . . . , n}
– Polyak-Ruppert averaging: θ̄t = (1/(t+1)) ∑_{u=0}^t θu

• Convergence rate if each fi is convex L-smooth and g is µ-strongly convex:

    E[g(θ̄t)] − g(θ∗) ≤ O(1/√t)              if γt = 1/(L√t)
    E[g(θ̄t)] − g(θ∗) ≤ O(L/(µt)) = O(κ/t)   if γt = 1/(µt)

– No adaptivity to strong convexity in general
– Adaptivity with self-concordance assumption (Bach, 2014)
– Running-time complexity: O(d · κ/ε)
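A sketch of the SGD iteration above, with sampling with replacement and Polyak-Ruppert averaging, using the step size γt = 1/(µt) from the strongly convex case; the callable `grad_i(theta, i)` returning f′_i(θ) is an assumed interface:

```python
import numpy as np

def sgd_finite_sum(grad_i, n, theta0, mu, n_iters, seed=0):
    """SGD: theta_t = theta_{t-1} - gamma_t * f'_{i(t)}(theta_{t-1}) with gamma_t = 1/(mu*t),
    plus Polyak-Ruppert averaging of the iterates."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    theta_avg = theta0.copy()
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                          # sampling with replacement
        theta = theta - (1.0 / (mu * t)) * grad_i(theta, i)
        theta_avg += (theta - theta_avg) / (t + 1)   # running average (1/(t+1)) sum_u theta_u
    return theta, theta_avg
```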

(40)

Outline

1. Introduction/motivation: Supervised machine learning
– Optimization of finite sums
– Existing optimization methods for finite sums

2. Convex finite-sum problems
– Linearly-convergent stochastic gradient method
– SAG, SAGA, SVRG, SDCA, etc.
– From lazy gradient evaluations to variance reduction

3. Non-convex problems
4. Parallel and distributed settings
5. Perspectives

(41)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ)  with  fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

(42)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ)  with  fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′_i(θt−1)
– Linear (e.g., exponential) convergence rate in O(e^{−t/κ})
– Iteration complexity is linear in n

(43)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ)  with  fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′_i(θt−1)

(44)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ)  with  fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′_i(θt−1)
– Linear (e.g., exponential) convergence rate in O(e^{−t/κ})
– Iteration complexity is linear in n

• Stochastic gradient descent: θt = θt−1 − γt f′_{i(t)}(θt−1)
– Sampling with replacement: i(t) random element of {1, . . . , n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n

(45)

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ)  with  fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′_i(θt−1)

• Stochastic gradient descent: θt = θt−1 − γt f′_{i(t)}(θt−1)

(46)

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and simple choice of step size

[Figure: log(excess cost) vs. time for deterministic and stochastic methods]

(47)

Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and simple choice of step size

[Figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]

(48)

Accelerating gradient methods - Related work

• Generic acceleration (Nesterov, 1983, 2004)

    θt = ηt−1 − γt g′(ηt−1)   and   ηt = θt + δt (θt − θt−1)

(49)

Accelerating gradient methods - Related work

• Generic acceleration (Nesterov, 1983, 2004)

    θt = ηt−1 − γt g′(ηt−1)   and   ηt = θt + δt (θt − θt−1)

– Good choice of momentum term δt ∈ [0, 1)

    g(θt) − g(θ∗) ≤ O(1/t²)
    g(θt) − g(θ∗) ≤ O(e^{−t√(µ/L)}) = O(e^{−t/√κ}) if µ-strongly convex

– Optimal rates after t = O(d) iterations (Nesterov, 2004)

(50)

Accelerating gradient methods - Related work

• Generic acceleration (Nesterov, 1983, 2004)

    θt = ηt−1 − γt g′(ηt−1)   and   ηt = θt + δt (θt − θt−1)

– Good choice of momentum term δt ∈ [0, 1)

    g(θt) − g(θ∗) ≤ O(1/t²)
    g(θt) − g(θ∗) ≤ O(e^{−t√(µ/L)}) = O(e^{−t/√κ}) if µ-strongly convex

– Optimal rates after t = O(d) iterations (Nesterov, 2004)
– Still O(nd) iteration cost: complexity = O(nd · √κ log(1/ε))
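A sketch of the accelerated iteration in the strongly convex case, using the common constant choices γ = 1/L and δ = (√κ − 1)/(√κ + 1); these particular choices are standard but are an assumption here, since the slide only states that δt ∈ [0, 1) must be chosen well:

```python
import numpy as np

def accelerated_gradient(grad, theta0, L, mu, n_iters):
    """theta_t = eta_{t-1} - (1/L) g'(eta_{t-1});  eta_t = theta_t + delta * (theta_t - theta_{t-1})."""
    kappa = L / mu
    delta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)   # constant momentum term
    theta_prev = theta0.copy()
    eta = theta0.copy()
    for _ in range(n_iters):
        theta = eta - (1.0 / L) * grad(eta)
        eta = theta + delta * (theta - theta_prev)
        theta_prev = theta
    return theta
```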

(51)

Accelerating gradient methods - Related work

• Constant step-size stochastic gradient

– Solodov (1998); Nedic and Bertsekas (2000)

– Linear convergence, but only up to a fixed tolerance

(52)

Accelerating gradient methods - Related work

• Constant step-size stochastic gradient

– Solodov (1998); Nedic and Bertsekas (2000)

– Linear convergence, but only up to a fixed tolerance

• Stochastic methods in the dual (SDCA)
– Shalev-Shwartz and Zhang (2013)

– Similar linear rate but limited choice for the fi’s

– Extensions without duality: see Shalev-Shwartz (2016)

(53)

Accelerating gradient methods - Related work

• Constant step-size stochastic gradient

– Solodov (1998); Nedic and Bertsekas (2000)

– Linear convergence, but only up to a fixed tolerance

• Stochastic methods in the dual (SDCA)
– Shalev-Shwartz and Zhang (2013)

– Similar linear rate but limited choice for the fi’s

– Extensions without duality: see Shalev-Shwartz (2016)

• Stochastic version of accelerated batch gradient methods
– Tseng (1998); Ghadimi and Lan (2010); Xiao (2010)

– Can improve constants, but still have sublinear O(1/t) rate

(54)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

(55)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

[Figure: memory of gradients y_1^t, . . . , y_n^t ∈ Rd associated with the functions f_1, . . . , f_n; their average (1/n) ∑_{i=1}^n y_i^t is used in place of the gradient of g = (1/n) ∑_{i=1}^n fi]

(56)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

[Figure: memory of gradients y_1^t, . . . , y_n^t ∈ Rd associated with the functions f_1, . . . , f_n; their average (1/n) ∑_{i=1}^n y_i^t is used in place of the gradient of g = (1/n) ∑_{i=1}^n fi]

(57)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

[Figure: memory of gradients y_1^t, . . . , y_n^t ∈ Rd associated with the functions f_1, . . . , f_n; their average (1/n) ∑_{i=1}^n y_i^t is used in place of the gradient of g = (1/n) ∑_{i=1}^n fi]

(58)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

• Stochastic version of incremental average gradient (Blatt et al., 2008)

(59)

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n
– Random selection i(t) ∈ {1, . . . , n} with replacement
– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Extra memory requirement: n gradients in Rd in general

• Linear supervised machine learning: only n real numbers
– If fi(θ) = ℓ(yi, Φ(xi)⊤θ), then f′_i(θ) = ℓ′(yi, Φ(xi)⊤θ) Φ(xi)
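A sketch of the SAG iteration, keeping the n stored gradients and their running sum so that each step costs O(d) plus one gradient evaluation; the interface `grad_i(theta, i)` and the zero initialization of the memory are assumptions, and the step size can be taken as 1/(16L), as on the next slide:

```python
import numpy as np

def sag(grad_i, n, d, theta0, step, n_iters, seed=0):
    """SAG: theta_t = theta_{t-1} - (step/n) * sum_i y_i^t, refreshing only y_{i(t)} at step t."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    y = np.zeros((n, d))          # stored gradients (n x d memory in general)
    y_sum = np.zeros(d)           # maintained sum of the stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)                    # random selection with replacement
        g_new = grad_i(theta, i)
        y_sum += g_new - y[i]                  # update the running sum in O(d)
        y[i] = g_new
        theta = theta - (step / n) * y_sum
    return theta
```

For linear predictions, each stored gradient is a scalar multiple of Φ(xi), so `y` can be reduced to n real numbers, which is the memory trick mentioned above.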

(60)

Stochastic average gradient - Convergence analysis

• Assumptions

– Each fi is L-smooth, i = 1, . . . , n
– g = (1/n) ∑_{i=1}^n fi is µ-strongly convex
– Constant step size γt = 1/(16L); no need to know µ

(61)

Stochastic average gradient - Convergence analysis

• Assumptions

– Each fi is L-smooth, i = 1, . . . , n
– g = (1/n) ∑_{i=1}^n fi is µ-strongly convex
– Constant step size γt = 1/(16L); no need to know µ

• Strongly convex case (Le Roux et al., 2012; Schmidt et al., 2016)

    E[g(θt) − g(θ∗)] ≤ cst × ( 1 − min{ 1/(8n), µ/(16L) } )^t

– Linear (exponential) convergence rate with O(d) iteration cost
– After one pass, reduction of cost by exp(−min{1/8, nµ/(16L)})
– NB: in machine learning, may often restrict to µ ≥ L/n
  ⇒ constant error reduction after each effective pass

(62)

Running-time comparisons (strongly-convex)

• Assumptions: g(θ) = (1/n) ∑_{i=1}^n fi(θ)
– Each fi convex L-smooth and g µ-strongly convex

    Stochastic gradient descent     d × (L/µ) × (1/ε)
    Gradient descent                d × n(L/µ) × log(1/ε)
    Accelerated gradient descent    d × n√(L/µ) × log(1/ε)
    SAG                             d × (n + L/µ) × log(1/ε)

NB-1: for (accelerated) gradient descent, L = smoothness constant of g
NB-2: with non-uniform sampling, L = average of the smoothness constants of all fi's

(63)

Running-time comparisons (strongly-convex)

• Assumptions: g(θ) = (1/n) ∑_{i=1}^n fi(θ)
– Each fi convex L-smooth and g µ-strongly convex

    Stochastic gradient descent     d × (L/µ) × (1/ε)
    Gradient descent                d × n(L/µ) × log(1/ε)
    Accelerated gradient descent    d × n√(L/µ) × log(1/ε)
    SAG                             d × (n + L/µ) × log(1/ε)

• Beating two lower bounds (Nemirovski and Yudin, 1983; Nesterov, 2004): with additional assumptions

(1) stochastic gradient: exponential rate for finite sums

(2) full gradient: better exponential rate using the sum structure

(64)

Running-time comparisons (non-strongly-convex)

• Assumptions: g(θ) = (1/n) ∑_{i=1}^n fi(θ)
– Each fi convex L-smooth
– Ill-conditioned problems: g may not be strongly convex (µ = 0)

    Stochastic gradient descent     d × 1/ε²
    Gradient descent                d × n/ε
    Accelerated gradient descent    d × n/√ε
    SAG                             d × √n/ε

• Adaptivity to potentially hidden strong convexity

• No need to know the local/global strong-convexity constant

(65)

Stochastic average gradient

Implementation details and extensions

• Sparsity in the features

– Just-in-time updates ⇒ replace O(d) by the number of non-zeros
– See also Leblond, Pedregosa, and Lacoste-Julien (2016)

• Mini-batches

– Reduces the memory requirement + block access to data

• Line-search

– Avoids knowing L in advance

• Non-uniform sampling

– Favors functions with large variations

• See www.cs.ubc.ca/~schmidtm/Software/SAG.html

(66)

Experimental results (logistic regression)

quantum dataset (n = 50 000, d = 78) and rcv1 dataset (n = 697 641, d = 47 236)

[Figure: convergence comparison of AFG, L-BFGS, SG, ASG, and IAG on the two datasets]

(67)

Experimental results (logistic regression)

quantum dataset (n = 50 000, d = 78) and rcv1 dataset (n = 697 641, d = 47 236)

[Figure: convergence comparison of AFG, L-BFGS, SG, ASG, IAG, and SAG-LS on the two datasets]

(68)

Before non-uniform sampling

protein dataset (n = 145 751, d = 74) and sido dataset (n = 12 678, d = 4 932)

[Figure: convergence comparison of IAG, AGD, L-BFGS, SG-C, ASG-C, PCD-L, DCA, and SAG on the two datasets]

(69)

After non-uniform sampling

protein dataset (n = 145 751, d = 74) and sido dataset (n = 12 678, d = 4 932)

[Figure: convergence comparison of IAG, AGD, L-BFGS, SG-C, ASG-C, PCD-L, DCA, SAG, and SAG-LS (Lipschitz) on the two datasets]

(70)

Linearly convergent stochastic gradient algorithms

• Many related algorithms

– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2013)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015)
– Finito (Defazio et al., 2014b)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014a)
– · · ·

• Similar rates of convergence and iterations

– Different interpretations and proofs / proof lengths
– Lazy gradient evaluations

– Variance reduction

(71)

Linearly convergent stochastic gradient algorithms

• Many related algorithms

– SAG (Le Roux, Schmidt, and Bach, 2012)
– SDCA (Shalev-Shwartz and Zhang, 2013)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– MISO (Mairal, 2015)
– Finito (Defazio et al., 2014b)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014a)
– · · ·

• Similar rates of convergence and iterations

• Different interpretations and proofs / proof lengths
– Lazy gradient evaluations

– Variance reduction

(72)

Variance reduction

• Principle: reducing variance of sample of X by using a sample from another random variable Y with known expectation

    Zα = α(X − Y) + EY

– EZα = α EX + (1 − α) EY
– var(Zα) = α² [ var(X) + var(Y) − 2 cov(X, Y) ]
– α = 1: no bias; α < 1: potential bias (but reduced variance)
– Useful if Y is positively correlated with X

(73)

Variance reduction

• Principle: reducing variance of sample of X by using a sample from another random variable Y with known expectation

    Zα = α(X − Y) + EY

– EZα = α EX + (1 − α) EY
– var(Zα) = α² [ var(X) + var(Y) − 2 cov(X, Y) ]
– α = 1: no bias; α < 1: potential bias (but reduced variance)
– Useful if Y is positively correlated with X

• Application to gradient estimation (Johnson and Zhang, 2013; Zhang, Mahdavi, and Jin, 2013)
– SVRG: X = f′_{i(t)}(θt−1), Y = f′_{i(t)}(θ̃), α = 1, with θ̃ stored
– EY = (1/n) ∑_{i=1}^n f′_i(θ̃) is the full gradient at θ̃;  X − Y = f′_{i(t)}(θt−1) − f′_{i(t)}(θ̃)
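A small numerical illustration of the control variate Zα = α(X − Y) + EY on toy Gaussian variables whose means and correlation are chosen purely for the example (all values here are illustrative, not data from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000

Y = rng.standard_normal(m)                     # E[Y] = 0 is known exactly
X = 3.0 + Y + 0.5 * rng.standard_normal(m)     # E[X] = 3, Y positively correlated with X

for alpha in (1.0, 0.5):
    Z = alpha * (X - Y) + 0.0                  # plug in the known expectation E[Y] = 0
    print(f"alpha={alpha}: mean(Z)={Z.mean():.3f}, var(Z)={Z.var():.3f}, var(X)={X.var():.3f}")

# alpha = 1.0: unbiased, var(Z) = var(X) + var(Y) - 2 cov(X, Y), about 0.25 instead of 1.25
# alpha = 0.5: biased toward E[Y] (mean about 1.5), variance further multiplied by alpha^2
```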

(74)

Stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013; Zhang et al., 2013)

• Initialize θ̃ ∈ Rd

• For iepoch = 1 to # of epochs
– Compute all gradients f′_i(θ̃); store g′(θ̃) = (1/n) ∑_{i=1}^n f′_i(θ̃)
– Initialize θ0 = θ̃
– For t = 1 to length of epochs:

      θt = θt−1 − γ [ g′(θ̃) + f′_{i(t)}(θt−1) − f′_{i(t)}(θ̃) ]

– Update θ̃ = θt

• Output: θ̃
– No need to store gradients; two gradient evaluations per inner step
– Two parameters: length of epochs + step size γ
– Same linear convergence rate as SAG, simpler proof

(75)

Stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013; Zhang et al., 2013)

• Initialize θ̃ ∈ Rd

• For iepoch = 1 to # of epochs
– Compute all gradients f′_i(θ̃); store g′(θ̃) = (1/n) ∑_{i=1}^n f′_i(θ̃)
– Initialize θ0 = θ̃
– For t = 1 to length of epochs:

      θt = θt−1 − γ [ g′(θ̃) + f′_{i(t)}(θt−1) − f′_{i(t)}(θ̃) ]

– Update θ̃ = θt

• Output: θ̃
– No need to store gradients; two gradient evaluations per inner step
– Two parameters: length of epochs + step size γ
– Same linear convergence rate as SAG, simpler proof
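A sketch of the SVRG loop above; the interface `grad_i(theta, i)` returning f′_i(θ) is an assumption, and the two parameters are the epoch length and the step size γ:

```python
import numpy as np

def svrg(grad_i, n, theta0, step, epoch_len, n_epochs, seed=0):
    """Inner step: theta_t = theta_{t-1} - step * [ g'(tilde) + f'_i(theta_{t-1}) - f'_i(tilde) ]."""
    rng = np.random.default_rng(seed)
    theta_tilde = theta0.copy()
    for _ in range(n_epochs):
        # Full gradient at the reference point, recomputed once per epoch (no gradients stored).
        full_grad = np.mean([grad_i(theta_tilde, i) for i in range(n)], axis=0)
        theta = theta_tilde.copy()
        for _ in range(epoch_len):
            i = rng.integers(n)
            theta = theta - step * (full_grad + grad_i(theta, i) - grad_i(theta_tilde, i))
        theta_tilde = theta                    # update the reference point
    return theta_tilde
```

Each inner step makes the two gradient evaluations mentioned above, one at θt−1 and one at θ̃.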

(76)

Interpretation of SAG as variance reduction

• SAG update: θt = θt−1 − (γ/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise
– Interpretation as lazy gradient evaluations

(77)

Interpretation of SAG as variance reduction

• SAG update: θt = θt−1 − (γ/n) ∑_{i=1}^n y_i^t
  with y_i^t = f′_i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise
– Interpretation as lazy gradient evaluations

• SAG update: θt = θt−1 − γ [ (1/n) ∑_{i=1}^n y_i^{t−1} + (1/n) ( f′_{i(t)}(θt−1) − y_{i(t)}^{t−1} ) ]
– Biased update (expectation w.r.t. i(t) not equal to the full gradient)
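Taking the expectation of the bracketed direction over the uniform draw of i(t), with the stored gradients y_i^{t−1} fixed by the past, makes the bias explicit; this is a short check of the claim above, written in LaTeX:

```latex
\mathbb{E}_{i(t)}\!\left[\frac{1}{n}\sum_{i=1}^n y_i^{t-1}
   + \frac{1}{n}\bigl(f'_{i(t)}(\theta_{t-1}) - y_{i(t)}^{t-1}\bigr)\right]
 = \Bigl(1-\frac{1}{n}\Bigr)\frac{1}{n}\sum_{i=1}^n y_i^{t-1}
   + \frac{1}{n}\,g'(\theta_{t-1})
 \;\neq\; g'(\theta_{t-1}) \quad\text{in general.}
```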
