• Aucun résultat trouvé

slides

N/A
N/A
Protected

Academic year: 2022

Partager "slides"

Copied!
57
0
0

Texte intégral

(1)

Statistical Optimality of Stochastic Gradient Descent through Multiple Passes

Francis Bach

INRIA - Ecole Normale Sup´erieure, Paris, France

É C O L E N O R M A L E S U P É R I E U R E

Joint work with Loucas Pillaud-Vivien and Alessandro Rudi

Newton Institute, Cambridge - June 2018

(2)

Two-minute summary

• Stochastic gradient descent for large-scale machine learning – Processes observations one by one

(3)

Two-minute summary

• Stochastic gradient descent for large-scale machine learning – Processes observations one by one

• Theory: Single pass SGD is optimal – Only for “easy” problems

• Practice: Multiple pass SGD always works better

(4)

Two-minute summary

• Stochastic gradient descent for large-scale machine learning – Processes observations one by one

• Theory: Single pass SGD is optimal – Only for “easy” problems

• Practice: Multiple pass SGD always works better – Provable for “hard” problems

– Quantification of required number of passes – Optimal statistical performance

– Source and capacity conditions from kernel methods

(5)

Least-squares regression in finite dimension

• Data: n observations (xi, yi) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = Rd

(6)

Least-squares regression in finite dimension

• Data: n observations (xi, yi) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = Rd – Optimal prediction θ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)2

(7)

Least-squares regression in finite dimension

• Data: n observations (xi, yi) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = Rd – Optimal prediction θ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)2 – Assumption: kΦ(x)k 6 R almost surely

– Assumption: |y| 6 M and |y − hθ, Φ(x)i| 6 σ almost surely

(8)

Least-squares regression in finite dimension

• Data: n observations (xi, yi) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = Rd – Optimal prediction θ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)2 – Assumption: kΦ(x)k 6 R almost surely

– Assumption: |y| 6 M and |y − hθ, Φ(x)i| 6 σ almost surely

• Statistical performance of estimators θˆ defined as EF(ˆθ) − F(θ) – Finite dimension: optimal rate σ2dim(n H) = σn2d

– Attained by empirical risk minimization (ERM) and SGD

(9)

Least-squares regression in finite dimension

• Data: n observations (xi, yi) ∈ X × R, i = 1, . . . , n, i.i.d.

• Prediction as linear functions hθ, Φ(x)i of features Φ(x) ∈ H = Rd – Optimal prediction θ ∈ H minimizing F(θ) = E(y − hθ, Φ(x)i)2 – Assumption: kΦ(x)k 6 R almost surely

– Assumption: |y| 6 M and |y − hθ, Φ(x)i| 6 σ almost surely

• Statistical performance of estimators θˆ defined as EF(ˆθ) − F(θ) – Finite dimension: optimal rate σ2dim(n H) = σn2d

– Attained by empirical risk minimization (ERM) and SGD

• What if n ≫ dim(H)?

– Needs assumptions on Σ = E

Φ(x) ⊗ Φ(x)

and θ

(10)

Spectrum of covariance matrix Σ = E

Φ(x) ⊗ Φ(x)

• Eigenvalues λm(Σ) (in decreasing order)

• Example: News dataset (d = 1 300 000, n = 20 000)

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

(11)

Spectrum of covariance matrix Σ = E

Φ(x) ⊗ Φ(x)

• Eigenvalues λm(Σ) (in decreasing order)

• Example: News dataset (d = 1 300 000, n = 20 000)

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

• Assumption: tr(Σ1/α) = X

m>1

λm(Σ)1/α is “small” (compared to n) – “Equivalent” to λm(Σ) = O(mα)

(12)

Difficulty of the learning problem

• Measuring difficulty through “the” norm of θ

• Assumption: kΣ1/2rθk is “small” (compared to n) – r = 1/2: usual assumption on kθk

– Larger r: simpler problems

– Smaller r: harder problems (r = 0 always true)

(13)

Difficulty of the learning problem

• Measuring difficulty through “the” norm of θ

• Assumption: kΣ1/2rθk is “small” (compared to n) – r = 1/2: usual assumption on kθk

– Larger r: simpler problems

– Smaller r: harder problems (r = 0 always true)

(14)

Difficulty of the learning problem

• Measuring difficulty through “the” norm of θ

• Assumption: kΣ1/2rθk is “small” (compared to n) – r = 1/2: usual assumption on kθk

– Larger r: simpler problems

– Smaller r: harder problems (r = 0 always true)

(15)

Optimal statistical performance

• Easy problems r > α1

: optimal rate is O(n2rα+12rα )

(16)

Optimal statistical performance

• Easy problems r > α1

: optimal rate is O(n2rα+12rα ), achieved by:

– Regularized ERM (Caponnetto and De Vito, 2007) – Early-stopped gradient descent (Yao et al., 2007)

– Single-pass averaged SGD (Dieuleveut and Bach, 2016)

(17)

Optimal statistical performance

• Easy problems r > α1

: optimal rate is O(n2rα+12rα )

• Hard problems r 6 α1

– Lower bound: O(n2rα+12rα ). Known upper bound: O(n2r)

(18)

Least-mean-square (LMS) algorithm

• Least-squares: F(θ) = 12E

(y − hΦ(x), θi)2

with θ ∈ Rd

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) – Iteration: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

(19)

Least-mean-square (LMS) algorithm

• Least-squares: F(θ) = 12E

(y − hΦ(x), θi)2

with θ ∈ Rd

– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995) – Iteration: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• New analysis for averaging and constant step-size γ = 1/(4R2) – Bach and Moulines (2013)

– Assume kΦ(x)k 6 R and |y − hΦ(x), θi| 6 σ almost surely – No assumption regarding lowest eigenvalues of Σ

– Main result: EF(¯θn) − F(θ) 6 4σ2d

n + 4R20 − θk2 n

• Matches statistical lower bound (Tsybakov, 2003)

(20)

Markov chain interpretation of constant step sizes

• LMS recursion: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

(21)

Markov chain interpretation of constant step sizes

• LMS recursion: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• The sequence (θi)i is a homogeneous Markov chain – convergence to a stationary distribution πγ

– with expectation θ¯γ def= R

θπγ(dθ) - For least-squares, θ¯γ = θ

¯ θγ

θ0

θn

(22)

Markov chain interpretation of constant step sizes

• LMS recursion: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• The sequence (θi)i is a homogeneous Markov chain – convergence to a stationary distribution πγ

– with expectation θ¯γ def= R

θπγ(dθ)

• For least-squares, θ¯γ = θ

¯ θγ

θ0

θn

(23)

Markov chain interpretation of constant step sizes

• LMS recursion: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• The sequence (θi)i is a homogeneous Markov chain – convergence to a stationary distribution πγ

– with expectation θ¯γ def= R

θπγ(dθ)

• For least-squares, θ¯γ = θ

θ

θ0

θn

¯ θn

(24)

Markov chain interpretation of constant step sizes

• LMS recursion: θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• The sequence (θi)i is a homogeneous Markov chain – convergence to a stationary distribution πγ

– with expectation θ¯γ def= R

θπγ(dθ)

• For least-squares, θ¯γ = θ

– θn does not converge to θ but oscillates around it

• Ergodic theorem:

– Averaged iterates converge to θ¯γ = θ at rate O(1/n)

– See Dieuleveut, Durmus, and Bach (2017) for more details

(25)

Simulations - synthetic examples

• Gaussian distributions - d = 20

0 2 4 6

−5

−4

−3

−2

−1 0

log10(n) log 10[f(θ)−f(θ *)]

synthetic square

1/2R2 1/8R2 1/32R2 1/2R2n1/2

(26)

Simulations - benchmarks

alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

0 2 4 6

−2

−1.5

−1

−0.5 0 0.5 1

log10(n) log 10[f(θ)−f(θ *)]

alpha square C=1 test

1/R2 1/R2n1/2 SAG

0 2 4

−0.8

−0.6

−0.4

−0.2 0 0.2

log10(n) log 10[f(θ)−f(θ *)]

news square C=1 test

1/R2 1/R2n1/2 SAG

(27)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

(28)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Needs assumptions on Σ = E

Φ(x) ⊗ Φ(x)

and θ

(29)

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant – Actual decay as λm = o(mα) with tr Σ1/α = X

m

λ1/αm small

– New result: replace σ2d

n by σ2(γn)1/α tr Σ1/α n

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

(30)

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant – Actual decay as λm = o(mα) with tr Σ1/α = X

m

λ1/αm small – New result: replace σ2d

n by σ2(γn)1/α tr Σ1/α n

0 5 10

ï2 0 2 4 6 8

log(m) log( λ m)

0 50 100

0 10 20 30 40

alpha (γ n)1/α Tr Σ1/α / n

(31)

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant – Actual decay as λm = o(mα) with tr Σ1/α = X

m

λ1/αm small – New result: replace σ2d

n by σ2(γn)1/α tr Σ1/α n

• Optimal predictor

– Pessimistic assumption: kθ0 − θk2 finite/small

– Finer assumption: kΣ1/2r0 − θ)k2 small, for r ∈ [0, 1]

– Always satisfied for r = 0 and θ0 = 0, since kΣ1/2θk 6 2p

Eyn2

(32)

Finer assumptions (Dieuleveut and Bach, 2016)

• Covariance eigenvalues

– Pessimistic assumption: all eigenvalues λm less than a constant – Actual decay as λm = o(mα) with tr Σ1/α = X

m

λ1/αm small – New result: replace σ2d

n by σ2(γn)1/α tr Σ1/α n

• Optimal predictor

– Pessimistic assumption: kθ0 − θk2 finite/small

– Finer assumption: kΣ1/2r0 − θ)k2 small, for r ∈ [0, 1]

– Always satisfied for r = 0 and θ0 = 0, since kΣ1/2θk 6 2p

Eyn2 – New result: replace kθ0 − θk2

γn by kΣ1/2r0 − θ)k2 γ2rn2r

(33)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2016) – Beyond strong convexity or lack thereof

EF(¯θn) − F(θ) 6 inf

α>1,r[0,1]

2 tr Σ1/α

n (γn)1/α + 4kΣ1/2rθk2 γ2rn2r

– Previous results: α = +∞ and r = 1/2

(34)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2016) – Beyond strong convexity or lack thereof

EF(¯θn) − F(θ) 6 inf

α>1,r[0,1]

2 tr Σ1/α

n (γn)1/α + 4kΣ1/2rθk2 γ2rn2r

– Previous results: α = +∞ and r = 1/2

– Optimal step-size γ potentially decaying with n, but depends on usually unknown quantities α and r ⇔ no adaptivity (yet)

(35)

Optimal bounds for least-squares?

• Least-squares: cannot beat σ2d/n (Tsybakov, 2003). Really?

– What if d ≫ n?

• Refined assumptions with adaptivity (Dieuleveut and Bach, 2016) – Beyond strong convexity or lack thereof

EF(¯θn) − F(θ) 6 inf

α>1,r[0,1]

2 tr Σ1/α

n (γn)1/α + 4kΣ1/2rθk2 γ2rn2r

– Previous results: α = +∞ and r = 1/2

– Optimal step-size γ potentially decaying with n, but depends on usually unknown quantities α and r ⇔ no adaptivity (yet)

– Extension to non-parametric estimation (using kernels) with optimal rates when r > (α−1)/(2α), still with O(n2) running-time

(36)

From least-squares to non-parametric estimation

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• If θ0 = 0, θi is a linear combination of Φ(x1), . . . , Φ(xi)

θi =

i

X

k=1

akΦ(xk) and ai = −γ

i1

X

k=1

akhΦ(xk), Φ(xi)i + γyi

– Kernel trick: k(x, x) = hΦ(x), Φ(x)i

– Reproducing kernel Hilbert spaces and non parametric estimation – See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2016) – Still O(n2) overall running-time

(37)

From least-squares to non-parametric estimation

• Extension to Hilbert spaces: Φ(x), θ ∈ H

θi = θi1 − γ hΦ(xi), θi1i − yi

Φ(xi)

• If θ0 = 0, θi is a linear combination of Φ(x1), . . . , Φ(xi)

θi =

i

X

k=1

akΦ(xk) and ai = −γ

i1

X

k=1

akhΦ(xk), Φ(xi)i + γyi

• Kernel trick: k(x, x) = hΦ(x), Φ(x)i

– Reproducing kernel Hilbert spaces and non-parametric estimation – See, e.g., Sch¨olkopf and Smola (2001); Shawe-Taylor and

Cristianini (2004); Dieuleveut and Bach (2016) – Still O(n2) overall running-time

(38)

Example: Sobolev spaces in one dimension

• X = [0, 1], functions represented through their Fourier series

– Weighted Fourier basis Φ(x)m = λ1/2m cos(2mπx) (plus sines) – kernel k(x, x) = P

m λm cos

2mπ(x − x)

(39)

Example: Sobolev spaces in one dimension

• X = [0, 1], functions represented through their Fourier series

– Weighted Fourier basis Φ(x)m = λ1/2m cos(2mπx) (plus sines) – kernel k(x, x) = P

m λm cos

2mπ(x − x)

• λm ∝ mα corresponds to Sobolev penalty on fθ(x) = hθ, Φ(x)i kfθk2 = kθk2 = X

m

|Fourier(fθ)m|2λm1

Z 1

0

|fθ(α/2)(x)|2dx

(40)

Example: Sobolev spaces in one dimension

• X = [0, 1], functions represented through their Fourier series

– Weighted Fourier basis Φ(x)m = λ1/2m cos(2mπx) (plus sines) – kernel k(x, x) = P

m λm cos

2mπ(x − x)

• λm ∝ mα corresponds to Sobolev penalty on fθ(x) = hθ, Φ(x)i kfθk2 = kθk2 = X

m

|Fourier(fθ)m|2λm1

Z 1

0

|fθ(α/2)(x)|2dx

• Adapted norm kΣ1/2rθk2 depends on regularity of fθ – kΣ1/2rθk2 = X

m

|Fourier(fθ)m|2λm2r

Z 1

0

|fθ(rα)(x)|2dx – Optimal rate is O(n2rα+12rα )

(41)

New assumption needed

• Assumption: kΣµ/21/2Φ(x)k almost surely “small”

– Already used by Steinwart et al. (2009) – True for µ = 1

– Usually µ > 1/α (equal for Sobolev spaces)

– Relationship between L norm k · kL and RKHS norm k · k kgkL = O(kgkµkgk1Lµ

2 )

– NB: implies bounded leverage scores (Rudi et al., 2015)

(42)

Multiple pass SGD (sampling with replacement)

• Algorithm from n i.i.d. observations (xi, yi), i = 1, . . . , n:

θu = θu1 + γ yi(u) − hθu1, Φ(xi(u))i

Φ(xi(u)) – θ¯t averaged iterate after t > n iterations

(43)

Multiple pass SGD (sampling with replacement)

• Algorithm from n i.i.d. observations (xi, yi), i = 1, . . . , n:

θu = θu1 + γ yi(u) − hθu1, Φ(xi(u))i

Φ(xi(u)) – θ¯t averaged iterate after t > n iterations

• Theorem (Pillaud-Vivien, Rudi, and Bach, 2018): Assume r 6 α1

. – If µ 6 2r, then after t = Θ(nα/(2rα+1)) iterations, we have:

EF (¯θt) − F(θ) = O(n2rα/(2rα+1)) Optimal

– Otherwise, then after t = Θ(n1/µ (log n)µ1) iterations, we have:

EF(¯θt) − F(θ) 6 O(n2r/µ) Improved

• Proof technique following Rosasco and Villa (2015)

(44)

Proof sketch

• Algorithm from n i.i.d. observations (xi, yi), i = 1, . . . , n:

θu = θu1 + γ yi(u) − hθu1, Φ(xi(u))i

Φ(xi(u)) – θ¯t averaged iterate after t > n iterations

• Following Rosasco and Villa (2015), consider batch gradient recursion ηu = θu1 + γ

n

n

X

i=1

yi − hθu1, Φ(xi)i

Φ(xi) – η¯t averaged iterate after t > n iterations

(45)

Proof sketch

• Algorithm from n i.i.d. observations (xi, yi), i = 1, . . . , n:

θu = θu1 + γ yi(u) − hθu1, Φ(xi(u))i

Φ(xi(u)) – θ¯t averaged iterate after t > n iterations

• Following Rosasco and Villa (2015), consider batch gradient recursion ηu = θu1 + γ

n

n

X

i=1

yi − hθu1, Φ(xi)i

Φ(xi)

– η¯t averaged iterate after t > n iterations

• As long as t = O(n1/µ)

– Property 1: EF (¯θt) − EF(¯ηt) = Ot1/α t

– Property 2: EF (¯ηt) − F(θ) = Ot1/α n

+ O(t2r)

(46)

Multiple pass SGD (sampling with replacement)

• Algorithm from n i.i.d. observations (xi, yi), i = 1, . . . , n:

θu = θu1 + γ yi(u) − hθu1, Φ(xi(u))i

Φ(xi(u)) – θ¯t averaged iterate after t > n iterations

• Theorem (Pillaud-Vivien, Rudi, and Bach, 2018): Assume r 6 α1

. – If µ 6 2r, then after t = Θ(nα/(2rα+1)) iterations, we have:

EF (¯θt) − F(θ) = O(n2rα/(2rα+1)) Optimal

– Otherwise, then after t = Θ(n1/µ (log n)µ1) iterations, we have:

EF(¯θt) − F(θ) 6 O(n2r/µ) Improved

• Proof technique following Rosasco and Villa (2015)

(47)

Statistical optimality

• If µ 6 2r, then after t = Θ(nα/(2rα+1)) iterations, we have:

EF(¯θt) − F(θ) = O(n2rα/(2rα+1)) Optimal

• Otherwise, then after t = Θ(n1/µ (log n)µ1) iterations, we have:

EF(¯θt) − F(θ) 6 O(n2r/µ) Improved

(48)

Simulations

• Synthetic examples

– One-dimensional kernel regression – Sobolev spaces

– Arbitrary chosen values for r and α

• Check optimal number of iterations over the data

(49)

Simulations

• Synthetic examples

– One-dimensional kernel regression – Sobolev spaces

– Arbitrary chosen values for r and α

• Check optimal number of iterations over the data

• Comparing three sampling schemes – With replacement

– Without replacement (cycling with random reshuffling) – Cycling

(50)

Simulations (sampling with replacement)

α = 3/2, r = 1/3 > 1)/(2α) α = 4, r = 1/4 = 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < 1)/(2α) α = 3, r = 1/6 < 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(51)

Simulations (sampling without replacement)

α = 3/2, r = 1/3 > 1)/(2α) α = 4, r = 1/4 = 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < 1)/(2α) α = 3, r = 1/6 < 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(52)

Simulations (cycling)

α = 3/2, r = 1/3 > 1)/(2α) α = 4, r = 1/4 = 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1

α = 5/2, r = 1/5 < 1)/(2α) α = 3, r = 1/6 < 1)/(2α)

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.25

2.0 2.2 2.4 2.6 2.8 3.0

log10n

2.0 2.5 3.0 3.5 4.0

log10t*

line of slope 1.5

(53)

Simulations - Benchmarks

• MNIST dataset with linear kernel

1000 2000 3000 4000 5000 6000 7000 8000

Number of samples

30 40 50 60

Optimal number of passes

(54)

Conclusion

• Benefits of multiple passes

– Number of passes grows with sample size for “hard” problems – First provable improvement of multiple passes over SGD

[NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step-sizes]

(55)

Conclusion

• Benefits of multiple passes

– Number of passes grows with sample size for “hard” problems – First provable improvement of multiple passes over SGD

[NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step-sizes]

• Current work - Extensions

– Study of cycling and sampling without replacement (Shamir, 2016; G¨urb¨uzbalaban et al., 2015)

– Mini-batches

– Beyond least-squares

– Optimal efficient algorithms for the situation µ > 2r

– Combining analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017)

(56)

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.

Foundations of Computational Mathematics, 7(3):331–368, 2007.

A. Dieuleveut and F. Bach. Non-parametric Stochastic Approximation with Large Step sizes. Annals of Statistics, 44(4):1363–1399, 2016.

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.

The Annals of Statistics, 44(4):1363–1399, 2016.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and markov chains. Technical Report 1707.06386, arXiv, 2017.

Mert G¨urb¨uzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. Technical Report 1510.08560, arXiv, 2015.

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.

O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission.

Wiley West Sussex, 1995.

(57)

L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical Report 1712.04755, arXiv, 2017.

Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical Report 1805.10074, arXiv, 2018.

Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.

A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nystr¨om computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

B. Sch¨olkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46–54, 2016.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression.

In Proc. COLT, 2009.

A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.

Constructive Approximation, 26(2):289–315, 2007.

Références

Documents relatifs

As the implementation of big DNNs is problematic, proposed works in this context focus on inference process and assume that the training or learning process is done on an

In gossip learning, models perform random walks on the network and are trained on the local data using stochastic gradient descent.. Besides, several models can perform random walks

Dictionary learning algorithms often fail at that task because of the imper- fection of the sparse representation step: when the dictionary is fixed, sparse approximation

In this paper, we have shown that for least-squares regression, in hard problems where single-pass SGD is not statistically optimal (r &lt; α−1 2α ), then multiple passes lead

In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.. Keywords: Stochastic

This contribution presents an overview of the theoretical and practical aspects of the broad family of learning algorithms based on Stochastic Gradient Descent, including

The proof of the following theorems uses a method introduced by (Gladyshev, 1965), and widely used in the mathematical study of adaptive signal processing. We first consider a