slides

(1)

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Francis Bach

INRIA - Ecole Normale Sup´erieure, Paris, France

É C O L E N O R M A L E S U P É R I E U R E

Joint work with Loucas Pillaud-Vivien and Alessandro Rudi

Newton Institute, Cambridge - June 2018

(2)

Two-minute summary

• Stochastic gradient descent for large-scale machine learning – Processes observations one by one

(3)

Two-minute summary

• Theory: Single pass SGD is optimal – Only for “easy” problems

• Practice: Multiple pass SGD always works better

(4)

Two-minute summary

• Theory: Single pass SGD is optimal – Only for “easy” problems

• Practice: Multiple pass SGD always works better – Provable for “hard” problems

– Quantification of required number of passes – Optimal statistical performance

– Source and capacity conditions from kernel methods

(5)

Least-squares regression in finite dimension

• Data: n observations (x_i, y_i) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = R^d

(6)

Least-squares regression in finite dimension

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = R^d – Optimal prediction θ^∗ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)²

(7)

Least-squares regression in finite dimension

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = R^d – Optimal prediction θ^∗ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)² – Assumption: kΦ(x)k 6 R almost surely

– Assumption: |y| 6 M and |y − hθ∗, Φ(x)i| 6 σ almost surely

(8)

Least-squares regression in finite dimension

• Statistical performance of estimators θˆ defined as EF(ˆθ) − F(θ∗) – Finite dimension: optimal rate ^σ²^dim(_n ^H⁾ = ^σ_n²^d

– Attained by empirical risk minimization (ERM) and SGD

(9)

Least-squares regression in finite dimension

• Statistical performance of estimators θˆ defined as EF(ˆθ) − F(θ∗) – Finite dimension: optimal rate ^σ²^dim(_n ^H⁾ = ^σ_n²^d

– Attained by empirical risk minimization (ERM) and SGD

• What if n ≫ dim(H)?

– Needs assumptions on Σ = E

Φ(x) ⊗ Φ(x)

and θ∗

(10)

Spectrum of covariance matrix Σ = E

Φ(x) ⊗ Φ(x)

• Eigenvalues λ_m(Σ) (in decreasing order)

• Example: News dataset (d = 1 300 000, n = 20 000)

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

(11)

Spectrum of covariance matrix Σ = E

Φ(x) ⊗ Φ(x)

• Eigenvalues λ_m(Σ) (in decreasing order)

• Example: News dataset (d = 1 300 000, n = 20 000)

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

• Assumption: tr(Σ^1/α) = X

m>1

λ_m(Σ)^1/α is “small” (compared to n) – “Equivalent” to λ_m(Σ) = O(m⁻^α)

(12)

Difficulty of the learning problem

• Measuring difficulty through “the” norm of θ∗

• Assumption: kΣ^1/2⁻^rθ∗k is “small” (compared to n) – r = 1/2: usual assumption on kθ∗k

– Larger r: simpler problems

– Smaller r: harder problems (r = 0 always true)

(13)

Difficulty of the learning problem

• Assumption: kΣ^1/2⁻^rθ∗k is “small” (compared to n) – r = 1/2: usual assumption on kθ∗k

(14)

Difficulty of the learning problem

• Assumption: kΣ^1/2⁻^rθ^∗k is “small” (compared to n) – r = 1/2: usual assumption on kθ^∗k

(15)

Optimal statistical performance

• Easy problems r > ^α⁻¹

2α : optimal rate is O(n^2rα+1⁻^2rα )

(16)

Optimal statistical performance

2α : optimal rate is O(n^2rα+1⁻^2rα ), achieved by:

– Regularized ERM (Caponnetto and De Vito, 2007) – Early-stopped gradient descent (Yao et al., 2007)

– Single-pass averaged SGD (Dieuleveut and Bach, 2016)

(17)

Optimal statistical performance

2α : optimal rate is O(n^2rα+1⁻^2rα )

• Hard problems r 6 ^α⁻¹

2α

– Lower bound: O(n^2rα+1⁻^2rα ). Known upper bound: O(n⁻^2r)

(18)

Least-mean-square (LMS) algorithm

• Least-squares: F(θ) = ¹₂E

(y − hΦ(x), θi)²

with θ ∈ R^d

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) – Iteration: θ_i = θ_i⁻₁ − γ hΦ(x_i), θ_i⁻₁i − y_i

Φ(x_i)

(19)

Least-mean-square (LMS) algorithm

• Least-squares: F(θ) = ¹₂E

(y − hΦ(x), θi)²

with θ ∈ R^d

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) – Iteration: θ_i = θ_i⁻₁ − γ hΦ(x_i), θ_i⁻₁i − y_i

Φ(x_i)

• New analysis for averaging and constant step-size γ = 1/(4R²) – Bach and Moulines (2013)

– Assume kΦ(x)k 6 R and |y − hΦ(x), θ^∗i| 6 σ almost surely – No assumption regarding lowest eigenvalues of Σ

– Main result: EF(¯θ_n) − F(θ^∗) 6 4σ²d

n + 4R²kθ₀ − θ^∗k² n

• Matches statistical lower bound (Tsybakov, 2003)

(20)

Markov chain interpretation of constant step sizes

• LMS recursion: θ_i = θ_i−1 − γ hΦ(x_i), θ_i−1i − y_i

Φ(x_i)

(21)

Markov chain interpretation of constant step sizes

Φ(x_i)

• The sequence (θ_i)_i is a homogeneous Markov chain – convergence to a stationary distribution π_γ

– with expectation θ¯_γ ^def= R

θπ_γ(dθ) - For least-squares, θ¯_γ = θ∗

¯ θγ

θ₀

θn

(22)

Markov chain interpretation of constant step sizes

Φ(x_i)

θπ_γ(dθ)

• For least-squares, θ¯_γ = θ∗

¯ θγ

θ₀

θn

(23)

Markov chain interpretation of constant step sizes

Φ(x_i)

θπ_γ(dθ)

θ∗

θ₀

θ_n

¯ θ_n

(24)

Markov chain interpretation of constant step sizes

Φ(x_i)

θπ_γ(dθ)

– θ_n does not converge to θ∗ but oscillates around it

• Ergodic theorem:

– Averaged iterates converge to θ¯_γ = θ^∗ at rate O(1/n)

– See Dieuleveut, Durmus, and Bach (2017) for more details

(25)

Simulations - synthetic examples

• Gaussian distributions - d = 20

0 2 4 6

−5

−4

−3

−2

−1 0

log10(n) log 10[f(θ)−f(θ *)]

synthetic square

1/2R² 1/8R² 1/32R² 1/2R²n^1/2

(26)

Simulations - benchmarks

• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

0 2 4 6

−2

−1.5

−1

−0.5 0 0.5 1

log10(n) log 10[f(θ)−f(θ *)]

alpha square C=1 test

1/R² 1/R²n^1/2 SAG

0 2 4

−0.8

−0.6

−0.4

−0.2 0 0.2

log10(n) log 10[f(θ)−f(θ *)]

news square C=1 test

1/R² 1/R²n^1/2 SAG

(27)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ²d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

(28)

Optimal bounds for least-squares?

• Needs assumptions on Σ = E

Φ(x) ⊗ Φ(x)

and θ^∗

(29)

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λ_m less than a constant – Actual decay as λ_m = o(m⁻^α) with tr Σ^1/α = X

m

λ^1/α_m small

– New result: replace σ²d

n by σ²(γn)^1/α tr Σ^1/α n

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

(30)

Finer assumptions (Dieuleveut and Bach, 2016)

m

λ^1/α_m small – New result: replace σ²d

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

0 50 100

0 10 20 30 40

alpha (γ n)1/α Tr Σ1/α / n

(31)

Finer assumptions (Dieuleveut and Bach, 2016)

m

• Optimal predictor

– Pessimistic assumption: kθ₀ − θ^∗k² finite/small

– Finer assumption: kΣ^1/2⁻^r(θ₀ − θ^∗)k₂ small, for r ∈ [0, 1]

– Always satisfied for r = 0 and θ₀ = 0, since kΣ^1/2θ∗k 6 2p

Ey_n²

(32)

Finer assumptions (Dieuleveut and Bach, 2016)

m

• Optimal predictor

– Pessimistic assumption: kθ₀ − θ^∗k² finite/small

– Finer assumption: kΣ^1/2⁻^r(θ₀ − θ^∗)k₂ small, for r ∈ [0, 1]

– Always satisfied for r = 0 and θ₀ = 0, since kΣ^1/2θ∗k 6 2p

Ey_n² – New result: replace kθ₀ − θ∗k²

γn by kΣ^1/2⁻^r(θ₀ − θ∗)k² γ^2rn^2r

(33)

Optimal bounds for least-squares?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2016) – Beyond strong convexity or lack thereof

EF(¯θ_n) − F(θ^∗) 6 inf

α>1,r∈[0,1]

4σ² tr Σ^1/α

n (γn)^1/α + 4kΣ^1/2⁻^rθ∗k² γ^2rn^2r

– Previous results: α = +∞ and r = 1/2

(34)

Optimal bounds for least-squares?

EF(¯θ_n) − F(θ^∗) 6 inf

α>1,r∈[0,1]

4σ² tr Σ^1/α

n (γn)^1/α + 4kΣ^1/2⁻^rθ∗k² γ^2rn^2r

– Optimal step-size γ potentially decaying with n, but depends on usually unknown quantities α and r ⇔ no adaptivity (yet)

(35)

Optimal bounds for least-squares?

EF(¯θ_n) − F(θ^∗) 6 inf

α>1,r∈[0,1]

4σ² tr Σ^1/α

n (γn)^1/α + 4kΣ^1/2⁻^rθ∗k² γ^2rn^2r

– Optimal step-size γ potentially decaying with n, but depends on usually unknown quantities α and r ⇔ no adaptivity (yet)

– Extension to non-parametric estimation (using kernels) with optimal rates when r > (α−1)/(2α), still with O(n²) running-time

(36)

From least-squares to non-parametric estimation

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θ_i = θ_i−1 − γ hΦ(x_i), θ_i−1i − y_i

Φ(x_i)

• If θ₀ = 0, θ_i is a linear combination of Φ(x₁), . . . , Φ(x_i)

θ_i =

i

X

k=1

a_kΦ(x_k) and a_i = −γ

i−1

X

k=1

a_khΦ(x_k), Φ(x_i)i + γy_i

– Kernel trick: k(x, x^′) = hΦ(x), Φ(x^′)i

– Reproducing kernel Hilbert spaces and non parametric estimation – See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2016) – Still O(n²) overall running-time

(37)

From least-squares to non-parametric estimation

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θ_i = θ_i−1 − γ hΦ(x_i), θ_i−1i − y_i

Φ(x_i)

• If θ₀ = 0, θ_i is a linear combination of Φ(x₁), . . . , Φ(x_i)

θ_i =

i

X

k=1

a_kΦ(x_k) and a_i = −γ

i−1

X

k=1

a_khΦ(x_k), Φ(x_i)i + γy_i

• Kernel trick: k(x, x^′) = hΦ(x), Φ(x^′)i

– Reproducing kernel Hilbert spaces and non-parametric estimation – See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2016) – Still O(n²) overall running-time

(38)

Example: Sobolev spaces in one dimension

• X = [0, 1], functions represented through their Fourier series

– Weighted Fourier basis Φ(x)_m = λ^1/2_m cos(2mπx) (plus sines) – kernel k(x, x^′) = P

m λ_m cos

2mπ(x − x^′)

(39)

Example: Sobolev spaces in one dimension

m λ_m cos

2mπ(x − x^′)

• λ_m ∝ m⁻^α corresponds to Sobolev penalty on f_θ(x) = hθ, Φ(x)i kf_θk² = kθk² = X

m

|Fourier(f_θ)_m|²λ⁻_m¹ ∝

Z ₁

0

|f_θ^(α/2)(x)|²dx

(40)

Example: Sobolev spaces in one dimension

m λ_m cos

2mπ(x − x^′)

• λ_m ∝ m⁻^α corresponds to Sobolev penalty on f_θ(x) = hθ, Φ(x)i kf_θk² = kθk² = X

m

|Fourier(f_θ)_m|²λ⁻_m¹ ∝

Z ₁

0

|f_θ^(α/2)(x)|²dx

• Adapted norm kΣ^1/2⁻^rθk² depends on regularity of f_θ – kΣ^1/2⁻^rθk² = X

m

|Fourier(f_θ)_m|²λ⁻_m^2r ∝

Z 1

0

|f_θ^(rα)(x)|²dx – Optimal rate is O(n^2rα+1⁻^2rα )

(41)

New assumption needed

• Assumption: kΣ^µ/2⁻^1/2Φ(x)k almost surely “small”

– Already used by Steinwart et al. (2009) – True for µ = 1

– Usually µ > 1/α (equal for Sobolev spaces)

– Relationship between L^∞ norm k · k_L∞ and RKHS norm k · k kgk_L_∞ = O(kgk^µkgk¹_L⁻^µ

2 )

– NB: implies bounded leverage scores (Rudi et al., 2015)

(42)

Multiple pass SGD (sampling with replacement)

• Algorithm from n i.i.d. observations (x_i, y_i), i = 1, . . . , n:

θ_u = θ_u−1 + γ y_i(u) − hθ_u−1, Φ(x_i(u))i

Φ(x_i(u)) – θ¯_t averaged iterate after t > n iterations

(43)

Multiple pass SGD (sampling with replacement)

• Theorem (Pillaud-Vivien, Rudi, and Bach, 2018): Assume r 6 ^α⁻¹

2α . – If µ 6 2r, then after t = Θ(n^α/(2rα+1)) iterations, we have:

EF (¯θ_t) − F(θ∗) = O(n⁻2rα/(2rα+1)) Optimal

– Otherwise, then after t = Θ(n^1/µ (log n)^µ¹) iterations, we have:

EF(¯θ_t) − F(θ∗) 6 O(n⁻^2r/µ) Improved

• Proof technique following Rosasco and Villa (2015)

(44)

Proof sketch

θ_u = θ_u⁻₁ + γ y_i(u) − hθ_u⁻₁, Φ(x_i(u))i

• Following Rosasco and Villa (2015), consider batch gradient recursion η_u = θ_u−1 + γ

n

X

i=1

y_i − hθ_u−1, Φ(x_i)i

Φ(x_i) – η¯_t averaged iterate after t > n iterations

(45)

Proof sketch

θ_u = θ_u⁻₁ + γ y_i(u) − hθ_u⁻₁, Φ(x_i(u))i

• Following Rosasco and Villa (2015), consider batch gradient recursion η_u = θ_u−1 + γ

n

X

i=1

y_i − hθ_u−1, Φ(x_i)i

Φ(x_i)

– η¯_t averaged iterate after t > n iterations

• As long as t = O(n^1/µ)

– Property 1: EF (¯θ_t) − EF(¯η_t) = Ot^1/α t

– Property 2: EF (¯η_t) − F(θ^∗) = Ot^1/α n

+ O(t⁻^2r)

(46)

Multiple pass SGD (sampling with replacement)

• Theorem (Pillaud-Vivien, Rudi, and Bach, 2018): Assume r 6 ^α⁻¹

2α . – If µ 6 2r, then after t = Θ(n^α/(2rα+1)) iterations, we have:

EF (¯θ_t) − F(θ∗) = O(n⁻2rα/(2rα+1)) Optimal

– Otherwise, then after t = Θ(n^1/µ (log n)^µ¹) iterations, we have:

• Proof technique following Rosasco and Villa (2015)

(47)

Statistical optimality

• If µ 6 2r, then after t = Θ(n^α/(2rα+1)) iterations, we have:

EF(¯θ_t) − F(θ^∗) = O(n⁻2rα/(2rα+1)) Optimal

• Otherwise, then after t = Θ(n^1/µ (log n)^µ¹) iterations, we have:

(48)

Simulations

• Synthetic examples

– One-dimensional kernel regression – Sobolev spaces

– Arbitrary chosen values for r and α

• Check optimal number of iterations over the data

(49)

Simulations

• Synthetic examples

– One-dimensional kernel regression – Sobolev spaces

– Arbitrary chosen values for r and α

• Check optimal number of iterations over the data

• Comparing three sampling schemes – With replacement

– Without replacement (cycling with random reshuffling) – Cycling

(50)

Simulations (sampling with replacement)

α = 3/2, r = 1/3 > (α − 1)/(2α) α = 4, r = 1/4 = (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < (α − 1)/(2α) α = 3, r = 1/6 < (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(51)

Simulations (sampling without replacement)

α = 3/2, r = 1/3 > (α − 1)/(2α) α = 4, r = 1/4 = (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < (α − 1)/(2α) α = 3, r = 1/6 < (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(52)

Simulations (cycling)

α = 3/2, r = 1/3 > (α − 1)/(2α) α = 4, r = 1/4 = (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < (α − 1)/(2α) α = 3, r = 1/6 < (α − 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log₁₀n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(53)

Simulations - Benchmarks

• MNIST dataset with linear kernel

1000 2000 3000 4000 5000 6000 7000 8000

Number of samples

30 40 50 60

Optimal number of passes

(54)

Conclusion

• Benefits of multiple passes

– Number of passes grows with sample size for “hard” problems – First provable improvement of multiple passes over SGD

[NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step-sizes]

(55)

Conclusion

• Benefits of multiple passes

– Number of passes grows with sample size for “hard” problems – First provable improvement of multiple passes over SGD

[NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step-sizes]

• Current work - Extensions

– Study of cycling and sampling without replacement (Shamir, 2016; G¨urb¨uzbalaban et al., 2015)

– Mini-batches

– Beyond least-squares

– Optimal efficient algorithms for the situation µ > 2r

– Combining analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017)

(56)

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.

Foundations of Computational Mathematics, 7(3):331–368, 2007.

A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Annals of Statistics, 44(4):1363–1399, 2016.

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.

The Annals of Statistics, 44(4):1363–1399, 2016.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. Technical Report 1707.06386, arXiv, 2017.

Mert G¨urb¨uzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. Technical Report 1510.08560, arXiv, 2015.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.

O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission.

Wiley West Sussex, 1995.

(57)

L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical Report 1712.04755, arXiv, 2017.

Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical Report 1805.10074, arXiv, 2018.

Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nystr¨om computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

B. Sch¨olkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46–54, 2016.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression.

In Proc. COLT, 2009.

A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.

Constructive Approximation, 26(2):289–315, 2007.