Statistical Optimality of Stochastic Gradient Descent through Multiple Passes
Francis Bach
INRIA - École Normale Supérieure, Paris, France
Joint work with Loucas Pillaud-Vivien and Alessandro Rudi
Newton Institute, Cambridge - June 2018
Two-minute summary
• Stochastic gradient descent for large-scale machine learning
– Processes observations one by one
• Theory: Single pass SGD is optimal
– Only for “easy” problems
• Practice: Multiple pass SGD always works better
– Provable for “hard” problems
– Quantification of required number of passes
– Optimal statistical performance
– Source and capacity conditions from kernel methods
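Least-squares regression in finite dimension
• Data: n observations $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$, $i = 1, \dots, n$, i.i.d.
• Prediction as linear functions $\langle \theta, \Phi(x) \rangle$ of features $\Phi(x) \in \mathcal{H} = \mathbb{R}^d$
– Optimal prediction $\theta_* \in \mathcal{H}$ minimizing $F(\theta) = \mathbb{E}(y - \langle \theta, \Phi(x) \rangle)^2$
– Assumption: $\|\Phi(x)\| \le R$ almost surely
– Assumption: $|y| \le M$ and $|y - \langle \theta_*, \Phi(x) \rangle| \le \sigma$ almost surely
• Statistical performance of estimators $\hat\theta$ defined as $\mathbb{E} F(\hat\theta) - F(\theta_*)$
– Finite dimension: optimal rate $\sigma^2 \dim(\mathcal{H}) / n = \sigma^2 d / n$
– Attained by empirical risk minimization (ERM) and SGD
• What if $n \ll \dim(\mathcal{H})$?
– Needs assumptions on $\Sigma = \mathbb{E}[\Phi(x) \otimes \Phi(x)]$ and $\theta_*$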
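Spectrum of covariance matrix $\Sigma = \mathbb{E}[\Phi(x) \otimes \Phi(x)]$
• Eigenvalues $\lambda_m(\Sigma)$ (in decreasing order)
• Example: News dataset (d = 1 300 000, n = 20 000)
[Figure: $\log(\lambda_m)$ plotted against $\log(m)$]
• Assumption: $\operatorname{tr}(\Sigma^{1/\alpha}) = \sum_{m \ge 1} \lambda_m(\Sigma)^{1/\alpha}$ is “small” (compared to n)
– “Equivalent” to $\lambda_m(\Sigma) = O(m^{-\alpha})$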
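To make the capacity condition concrete, the sketch below evaluates $\operatorname{tr}(\Sigma^{1/\alpha}) = \sum_m \lambda_m^{1/\alpha}$ on a hypothetical spectrum $\lambda_m = m^{-2}$ truncated at 10 000 eigenvalues. The function name `capacity_trace` and the chosen spectrum are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def capacity_trace(eigenvalues, alpha):
    """Return tr(Sigma^{1/alpha}) = sum_m lambda_m^{1/alpha}."""
    lam = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(lam ** (1.0 / alpha)))

# Hypothetical spectrum (assumption): lambda_m = m^{-2}, m = 1, ..., 10000.
m = np.arange(1, 10001, dtype=float)
lam = m ** -2.0

# alpha = 1: sum of 1/m^2, which stays bounded (close to pi^2 / 6).
trace_alpha_1 = capacity_trace(lam, 1.0)
# alpha = 2: sum of 1/m, which keeps growing with the truncation level.
trace_alpha_2 = capacity_trace(lam, 2.0)
print(trace_alpha_1, trace_alpha_2)
```

With this spectrum the trace stays bounded for $\alpha = 1$ but grows logarithmically with the number of eigenvalues for $\alpha = 2$, which is why the condition is only loosely "equivalent" to a decay in $m^{-\alpha}$.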
Difficulty of the learning problem
• Measuring difficulty through “the” norm of $\theta_*$
• Assumption: $\|\Sigma^{1/2-r}\theta_*\|$ is “small” (compared to n)
– $r = 1/2$: usual assumption on $\|\theta_*\|$
– Larger r: simpler problems
– Smaller r: harder problems ($r = 0$ always true)
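As a small illustration of the source condition, the sketch below computes $\|\Sigma^{1/2-r}\theta_*\|$ in the eigenbasis of $\Sigma$, where it reduces to $\big(\sum_m \lambda_m^{1-2r} \theta_m^2\big)^{1/2}$. The helper `source_norm` and the two-dimensional example are assumptions made for illustration only.

```python
import numpy as np

def source_norm(eigenvalues, theta_coords, r):
    """||Sigma^{1/2 - r} theta|| when theta is expressed in the eigenbasis of Sigma."""
    lam = np.asarray(eigenvalues, dtype=float)
    theta = np.asarray(theta_coords, dtype=float)
    return float(np.sqrt(np.sum(lam ** (1.0 - 2.0 * r) * theta ** 2)))

# Hypothetical example (assumption): two eigenvalues and a fixed theta.
lam = [1.0, 0.25]
theta = [3.0, 4.0]
norm_r0 = source_norm(lam, theta, 0.0)     # ||Sigma^{1/2} theta||
norm_rhalf = source_norm(lam, theta, 0.5)  # plain Euclidean norm ||theta||
norm_r1 = source_norm(lam, theta, 1.0)     # ||Sigma^{-1/2} theta||
print(norm_r0, norm_rhalf, norm_r1)
```

Since small eigenvalues are raised to the power $1 - 2r$, the norm increases with r, so requiring it to be small for a larger r is a stronger assumption and corresponds to a simpler problem.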
Optimal statistical performance
• Easy problems $r \ge \frac{\alpha-1}{2\alpha}$: optimal rate is $O(n^{-2r\alpha/(2r\alpha+1)})$, achieved by:
– Regularized ERM (Caponnetto and De Vito, 2007)
– Early-stopped gradient descent (Yao et al., 2007)
– Single-pass averaged SGD (Dieuleveut and Bach, 2016)
• Hard problems $r \le \frac{\alpha-1}{2\alpha}$
– Lower bound: $O(n^{-2r\alpha/(2r\alpha+1)})$. Known upper bound: $O(n^{-2r})$
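The regimes above can be tabulated with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the exponents are the ones quoted on the slide, with all constants ignored.

```python
def optimal_rate_exponent(alpha, r):
    """Exponent e of the optimal rate O(n^{-e}), with e = 2 r alpha / (2 r alpha + 1)."""
    return 2.0 * r * alpha / (2.0 * r * alpha + 1.0)

def is_easy(alpha, r, tol=1e-12):
    """Easy regime: r >= (alpha - 1) / (2 alpha), up to a small float tolerance."""
    return r >= (alpha - 1.0) / (2.0 * alpha) - tol

def single_pass_exponent(alpha, r):
    """Exponent of the single-pass guarantees quoted above."""
    if is_easy(alpha, r):
        return optimal_rate_exponent(alpha, r)
    return 2.0 * r  # known upper bound O(n^{-2r}) for hard problems

for alpha, r in [(1.5, 1.0 / 3.0), (3.0, 1.0 / 6.0)]:
    print(alpha, r, is_easy(alpha, r),
          optimal_rate_exponent(alpha, r), single_pass_exponent(alpha, r))
```

For $\alpha = 3$, $r = 1/6$ the lower bound has exponent 1/2 while the known single-pass bound only gives 1/3. The two exponents coincide exactly on the boundary $r = (\alpha-1)/(2\alpha)$.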
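Least-mean-square (LMS) algorithm
• Least-squares: $F(\theta) = \frac{1}{2} \mathbb{E}\big[(y - \langle \Phi(x), \theta \rangle)^2\big]$ with $\theta \in \mathbb{R}^d$
– SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
– Iteration: $\theta_i = \theta_{i-1} - \gamma \big(\langle \Phi(x_i), \theta_{i-1} \rangle - y_i\big) \Phi(x_i)$
• New analysis for averaging and constant step-size $\gamma = 1/(4R^2)$
– Bach and Moulines (2013)
– Assume $\|\Phi(x)\| \le R$ and $|y - \langle \Phi(x), \theta_* \rangle| \le \sigma$ almost surely
– No assumption regarding lowest eigenvalues of $\Sigma$
– Main result: $\mathbb{E} F(\bar\theta_n) - F(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{4R^2 \|\theta_0 - \theta_*\|^2}{n}$
• Matches statistical lower bound (Tsybakov, 2003)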
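The recursion and the averaging step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the synthetic data (features uniform on $[-1, 1]^d$, bounded uniform noise), the choice to average $\theta_1, \dots, \theta_n$, and all variable names are hypothetical. With these features, $\|\Phi(x)\|^2 \le d$, so $R^2 = d$, and $\Sigma = I/3$.

```python
import numpy as np

def averaged_lms(features, targets, gamma):
    """One pass of LMS from theta_0 = 0; returns (last iterate, average of iterates)."""
    n, d = features.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        phi, y = features[i], targets[i]
        # theta_i = theta_{i-1} - gamma * (<phi_i, theta_{i-1}> - y_i) * phi_i
        theta = theta - gamma * (phi @ theta - y) * phi
        # Running average of theta_1, ..., theta_{i+1}.
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta, theta_bar

# Hypothetical synthetic problem (assumption).
rng = np.random.default_rng(0)
n, d, sigma = 20_000, 5, 0.1
theta_star = rng.standard_normal(d)
features = rng.uniform(-1.0, 1.0, size=(n, d))
targets = features @ theta_star + rng.uniform(-sigma, sigma, size=n)

R2 = float(d)                      # ||Phi(x)||^2 <= d for these features
gamma = 1.0 / (4.0 * R2)           # constant step-size from the slide
theta_last, theta_bar = averaged_lms(features, targets, gamma)

# Sigma = I / 3 here, so F(theta) - F(theta_*) = ||theta - theta_*||^2 / 6.
excess_risk = float(np.sum((theta_bar - theta_star) ** 2) / 6.0)
bound = 4 * sigma ** 2 * d / n + 4 * R2 * float(np.sum(theta_star ** 2)) / n
print(excess_risk, bound)
```

The bound on the slide holds in expectation, so a single run falling below it is only a sanity check, not a verification.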
Markov chain interpretation of constant step sizes
• LMS recursion: $\theta_i = \theta_{i-1} - \gamma \big(\langle \Phi(x_i), \theta_{i-1} \rangle - y_i\big) \Phi(x_i)$
• The sequence $(\theta_i)_i$ is a homogeneous Markov chain
– convergence to a stationary distribution $\pi_\gamma$
– with expectation $\bar\theta_\gamma \stackrel{\text{def}}{=} \int \theta \, \pi_\gamma(d\theta)$
• For least-squares, $\bar\theta_\gamma = \theta_*$
– $\theta_n$ does not converge to $\theta_*$ but oscillates around it
• Ergodic theorem:
– Averaged iterates converge to $\bar\theta_\gamma = \theta_*$ at rate $O(1/n)$
– See Dieuleveut, Durmus, and Bach (2017) for more details
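A one-dimensional toy case makes the oscillation visible. In the sketch below (an assumption-laden illustration, not from the talk), $\Phi(x) = x \in \{-1, +1\}$ and the noise is uniform on $[-\sigma, \sigma]$, so the recursion becomes $\theta_i = (1-\gamma)\theta_{i-1} + \gamma(\theta_* + x_i \varepsilon_i)$, an AR(1) chain whose stationary variance is $\gamma v / (2 - \gamma)$ with $v = \sigma^2/3$.

```python
import numpy as np

def lms_scalar_chain(n, gamma, theta_star=1.0, sigma=1.0, seed=0):
    """Iterates of constant-step LMS for the scalar model y = theta_star * x + eps."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1.0, 1.0], size=n)        # hypothetical features, R = 1
    eps = rng.uniform(-sigma, sigma, size=n)   # bounded noise, variance sigma^2 / 3
    y = theta_star * x + eps
    thetas = np.empty(n)
    theta = 0.0
    for i in range(n):
        theta -= gamma * (x[i] * theta - y[i]) * x[i]
        thetas[i] = theta
    return thetas

gamma = 0.25                                   # 1 / (4 R^2) with R = 1
thetas = lms_scalar_chain(50_000, gamma)
tail = thetas[1000:]                           # discard the transient phase
empirical_var = float(tail.var())
stationary_var = gamma * (1.0 / 3.0) / (2.0 - gamma)
average_error = abs(float(thetas.mean()) - 1.0)
print(empirical_var, stationary_var, average_error)
```

The individual iterates keep a spread of order $\sqrt{\gamma}$ around $\theta_* = 1$ however long the chain runs, while the average of the iterates lands much closer to $\theta_*$.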
Simulations - synthetic examples
• Gaussian distributions - d = 20
[Figure: “synthetic square”, $\log_{10}[f(\theta) - f(\theta_*)]$ against $\log_{10}(n)$; curves for step sizes $1/2R^2$, $1/8R^2$, $1/32R^2$, $1/2R^2 n^{1/2}$]
Simulations - benchmarks
• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)
[Figure, two panels: “alpha square C=1 test” and “news square C=1 test”, each showing $\log_{10}[f(\theta) - f(\theta_*)]$ against $\log_{10}(n)$; curves: $1/R^2$, $1/R^2 n^{1/2}$, SAG]
Optimal bounds for least-squares?
• Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really?
– What if $d \gg n$?
• Needs assumptions on $\Sigma = \mathbb{E}[\Phi(x) \otimes \Phi(x)]$ and $\theta_*$
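Finer assumptions (Dieuleveut and Bach, 2016)
• Covariance eigenvalues
– Pessimistic assumption: all eigenvalues $\lambda_m$ less than a constant
– Actual decay as $\lambda_m = o(m^{-\alpha})$ with $\operatorname{tr} \Sigma^{1/\alpha} = \sum_m \lambda_m^{1/\alpha}$ small
– New result: replace $\frac{\sigma^2 d}{n}$ by $\frac{\sigma^2 (\gamma n)^{1/\alpha} \operatorname{tr} \Sigma^{1/\alpha}}{n}$
[Figure, two panels: $\log(\lambda_m)$ against $\log(m)$; $(\gamma n)^{1/\alpha} \operatorname{Tr} \Sigma^{1/\alpha} / n$ against alpha]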
• Optimal predictor
– Pessimistic assumption: $\|\theta_0 - \theta_*\|^2$ finite/small
– Finer assumption: $\|\Sigma^{1/2-r}(\theta_0 - \theta_*)\|^2$ small, for $r \in [0, 1]$
– Always satisfied for $r = 0$ and $\theta_0 = 0$, since $\|\Sigma^{1/2}\theta_*\| \le 2\sqrt{\mathbb{E} y_n^2}$
– New result: replace $\frac{\|\theta_0 - \theta_*\|^2}{\gamma n}$ by $\frac{\|\Sigma^{1/2-r}(\theta_0 - \theta_*)\|^2}{\gamma^{2r} n^{2r}}$
Optimal bounds for least-squares?
• Least-squares: cannot beat $\sigma^2 d / n$ (Tsybakov, 2003). Really?
– What if $d \gg n$?
• Refined assumptions with adaptivity (Dieuleveut and Bach, 2016)
– Beyond strong convexity or lack thereof
$\mathbb{E} F(\bar\theta_n) - F(\theta_*) \le \inf_{\alpha \ge 1,\, r \in [0,1]} \frac{4\sigma^2 \operatorname{tr} \Sigma^{1/\alpha}}{n} (\gamma n)^{1/\alpha} + \frac{4 \|\Sigma^{1/2-r}\theta_*\|^2}{\gamma^{2r} n^{2r}}$
– Previous results: $\alpha = +\infty$ and $r = 1/2$
– Optimal step-size γ potentially decaying with n, but depends on usually unknown quantities α and r ⇔ no adaptivity (yet)
– Extension to non-parametric estimation (using kernels) with optimal rates when $r \ge (\alpha-1)/(2\alpha)$, still with $O(n^2)$ running-time
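The threshold $r \ge (\alpha-1)/(2\alpha)$ can be recovered by balancing the two terms of the bound. The worked computation below is a sketch that treats $\operatorname{tr} \Sigma^{1/\alpha}$ and $\|\Sigma^{1/2-r}\theta_*\|$ as constants and ignores numerical factors; the shorthand $T$, $V$, $B$ is introduced here for convenience and is not notation from the talk.

```latex
% Shorthand (assumption): T = \gamma n,
% V = 4\sigma^2 \operatorname{tr}\Sigma^{1/\alpha}, \quad B = 4\|\Sigma^{1/2-r}\theta_*\|^2.
\[
  \frac{V\,T^{1/\alpha}}{n} + \frac{B}{T^{2r}}
  \quad\text{is balanced when}\quad
  T^{1/\alpha + 2r} \asymp n
  \;\Longleftrightarrow\;
  T \asymp n^{\alpha/(2r\alpha+1)} .
\]
\[
  \text{Both terms are then of order } n^{-2r\alpha/(2r\alpha+1)},
  \qquad
  \gamma = \frac{T}{n} \asymp n^{(\alpha - 1 - 2r\alpha)/(2r\alpha+1)} .
\]
\[
  \gamma \text{ stays bounded as } n \to \infty
  \iff \alpha - 1 - 2r\alpha \le 0
  \iff r \ge \frac{\alpha-1}{2\alpha} .
\]
```

When $r < (\alpha-1)/(2\alpha)$ the balancing step-size would have to grow with n, which a single pass with bounded γ cannot provide; keeping $T = O(n)$ leaves the second term at order $n^{-2r}$.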
From least-squares to non-parametric estimation
• Extension to Hilbert spaces: $\Phi(x), \theta \in \mathcal{H}$
$\theta_i = \theta_{i-1} - \gamma \big(\langle \Phi(x_i), \theta_{i-1} \rangle - y_i\big) \Phi(x_i)$
• If $\theta_0 = 0$, $\theta_i$ is a linear combination of $\Phi(x_1), \dots, \Phi(x_i)$
$\theta_i = \sum_{k=1}^{i} a_k \Phi(x_k)$ and $a_i = -\gamma \sum_{k=1}^{i-1} a_k \langle \Phi(x_k), \Phi(x_i) \rangle + \gamma y_i$
• Kernel trick: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$
– Reproducing kernel Hilbert spaces and non-parametric estimation
– See, e.g., Schölkopf and Smola (2001); Shawe-Taylor and Cristianini (2004); Dieuleveut and Bach (2016)
– Still $O(n^2)$ overall running-time
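The coefficient recursion translates directly into code that only touches the data through kernel evaluations. The sketch below is illustrative: the function names, the linear kernel, and the synthetic data are assumptions, chosen so that the result can be compared with the explicit recursion on $\theta$.

```python
import numpy as np

def kernel_lms_coefficients(kernel, xs, ys, gamma):
    """Single-pass kernel LMS from theta_0 = 0; returns a with theta_n = sum_k a_k Phi(x_k)."""
    n = len(xs)
    a = np.zeros(n)
    for i in range(n):
        # <Phi(x_i), theta_{i-1}> only involves a_1, ..., a_{i-1}.
        prediction = sum(a[k] * kernel(xs[k], xs[i]) for k in range(i))
        a[i] = gamma * (ys[i] - prediction)
    return a

def kernel_predict(kernel, xs, a, x_new):
    """Evaluate <theta_n, Phi(x_new)> through the kernel."""
    return sum(a[k] * kernel(xs[k], x_new) for k in range(len(xs)))

def linear_kernel(u, v):
    """Kernel of the identity feature map, used here only as a check."""
    return float(np.dot(u, v))

# Hypothetical data (assumption).
rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(50, 3))
ys = xs @ np.array([1.0, -2.0, 0.5])
a = kernel_lms_coefficients(linear_kernel, xs, ys, gamma=0.1)

# Explicit recursion on theta for comparison.
theta = np.zeros(3)
for x, y in zip(xs, ys):
    theta = theta - 0.1 * (x @ theta - y) * x
print(np.allclose(xs.T @ a, theta))
```

Step i costs i kernel evaluations, which is where the $O(n^2)$ overall running-time comes from.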
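Example: Sobolev spaces in one dimension
• $\mathcal{X} = [0, 1]$, functions represented through their Fourier series
– Weighted Fourier basis $\Phi(x)_m = \lambda_m^{1/2} \cos(2m\pi x)$ (plus sines)
– kernel $k(x, x') = \sum_m \lambda_m \cos\big(2m\pi(x - x')\big)$
• $\lambda_m \propto m^{-\alpha}$ corresponds to Sobolev penalty on $f_\theta(x) = \langle \theta, \Phi(x) \rangle$
$\|f_\theta\|^2 = \|\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2 \lambda_m^{-1} \propto \int_0^1 |f_\theta^{(\alpha/2)}(x)|^2 dx$
• Adapted norm $\|\Sigma^{1/2-r}\theta\|^2$ depends on regularity of $f_\theta$
– $\|\Sigma^{1/2-r}\theta\|^2 = \sum_m |\mathrm{Fourier}(f_\theta)_m|^2 \lambda_m^{-2r} \propto \int_0^1 |f_\theta^{(r\alpha)}(x)|^2 dx$
– Optimal rate is $O(n^{-2r\alpha/(2r\alpha+1)})$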
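The kernel and its weighted Fourier features can be written out for a truncated series. In the sketch below, the truncation level `n_freq`, the omission of the constant ($m = 0$) term, and the function names are assumptions made for illustration; the identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$ is what makes the cosine and sine features reproduce the kernel.

```python
import numpy as np

def sobolev_features(x, alpha, n_freq):
    """Truncated weighted Fourier features: sqrt(lambda_m) * (cos, sin)(2 m pi x)."""
    m = np.arange(1, n_freq + 1, dtype=float)
    lam = m ** (-float(alpha))                 # lambda_m = m^{-alpha}
    cos_part = np.sqrt(lam) * np.cos(2 * np.pi * m * x)
    sin_part = np.sqrt(lam) * np.sin(2 * np.pi * m * x)
    return np.concatenate([cos_part, sin_part])

def sobolev_kernel(x, x_prime, alpha, n_freq):
    """Truncated kernel k(x, x') = sum_m lambda_m cos(2 m pi (x - x'))."""
    m = np.arange(1, n_freq + 1, dtype=float)
    lam = m ** (-float(alpha))
    return float(np.sum(lam * np.cos(2 * np.pi * m * (x - x_prime))))

x, x_prime = 0.3, 0.7
inner_product = float(sobolev_features(x, 2.0, 50) @ sobolev_features(x_prime, 2.0, 50))
kernel_value = sobolev_kernel(x, x_prime, 2.0, 50)
print(inner_product, kernel_value)
```

On the diagonal, $k(x, x) = \sum_m \lambda_m$, so the features are bounded as soon as the eigenvalues are summable.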
New assumption needed
• Assumption: $\|\Sigma^{\mu/2-1/2}\Phi(x)\|$ almost surely “small”
– Already used by Steinwart et al. (2009)
– True for $\mu = 1$
– Usually $\mu \ge 1/\alpha$ (equal for Sobolev spaces)
– Relationship between $L_\infty$ norm $\|\cdot\|_{L_\infty}$ and RKHS norm $\|\cdot\|$: $\|g\|_{L_\infty} = O(\|g\|^{\mu} \|g\|_{L_2}^{1-\mu})$
– NB: implies bounded leverage scores (Rudi et al., 2015)
Multiple pass SGD (sampling with replacement)
• Algorithm from n i.i.d. observations $(x_i, y_i)$, $i = 1, \dots, n$:
$\theta_u = \theta_{u-1} + \gamma \big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)}) \rangle\big) \Phi(x_{i(u)})$
– $\bar\theta_t$ averaged iterate after $t \ge n$ iterations
• Theorem (Pillaud-Vivien, Rudi, and Bach, 2018): Assume $r \le \frac{\alpha-1}{2\alpha}$.
– If $\mu \le 2r$, then after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, we have: $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$ (Optimal)
– Otherwise, after $t = \Theta(n^{1/\mu} (\log n)^{1/\mu})$ iterations, we have: $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \le O(n^{-2r/\mu})$ (Improved)
• Proof technique following Rosasco and Villa (2015)
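To get a feel for the first case of the theorem ($\mu \le 2r$), the sketch below turns $t = \Theta(n^{\alpha/(2r\alpha+1)})$ into a number of passes over the data. The function name is hypothetical and the constant hidden in $\Theta(\cdot)$ is set to 1, which is an assumption; only the scaling with n is meaningful.

```python
def multipass_schedule(n, alpha, r):
    """Iterations and passes prescribed in the mu <= 2r case, with constants set to 1."""
    exponent = alpha / (2.0 * r * alpha + 1.0)   # t scales as n^exponent
    iterations = n ** exponent
    return {"exponent": exponent, "iterations": iterations, "passes": iterations / n}

for alpha, r in [(2.5, 0.2), (3.0, 1.0 / 6.0)]:
    print(alpha, r, multipass_schedule(1000, alpha, r))
```

For $(\alpha, r) = (5/2, 1/5)$ and $(3, 1/6)$ the exponents are 1.25 and 1.5, the slopes quoted for the hard problems in the simulation slides. Under the theorem's assumption $r \le (\alpha-1)/(2\alpha)$ the exponent is at least 1, so at least one full pass is prescribed.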
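Proof sketch
• Algorithm from n i.i.d. observations $(x_i, y_i)$, $i = 1, \dots, n$:
$\theta_u = \theta_{u-1} + \gamma \big(y_{i(u)} - \langle \theta_{u-1}, \Phi(x_{i(u)}) \rangle\big) \Phi(x_{i(u)})$
– $\bar\theta_t$ averaged iterate after $t \ge n$ iterations
• Following Rosasco and Villa (2015), consider batch gradient recursion
$\eta_u = \eta_{u-1} + \frac{\gamma}{n} \sum_{i=1}^{n} \big(y_i - \langle \eta_{u-1}, \Phi(x_i) \rangle\big) \Phi(x_i)$
– $\bar\eta_t$ averaged iterate after $t \ge n$ iterations
• As long as $t = O(n^{1/\mu})$
– Property 1: $\mathbb{E} F(\bar\theta_t) - \mathbb{E} F(\bar\eta_t) = O\big(\frac{t^{1/\alpha}}{t}\big)$
– Property 2: $\mathbb{E} F(\bar\eta_t) - F(\theta_*) = O\big(\frac{t^{1/\alpha}}{n}\big) + O(t^{-2r})$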
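Statistical optimality
• If $\mu \le 2r$, then after $t = \Theta(n^{\alpha/(2r\alpha+1)})$ iterations, we have: $\mathbb{E} F(\bar\theta_t) - F(\theta_*) = O(n^{-2r\alpha/(2r\alpha+1)})$ (Optimal)
• Otherwise, after $t = \Theta(n^{1/\mu} (\log n)^{1/\mu})$ iterations, we have: $\mathbb{E} F(\bar\theta_t) - F(\theta_*) \le O(n^{-2r/\mu})$ (Improved)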
Simulations
• Synthetic examples
– One-dimensional kernel regression
– Sobolev spaces
– Arbitrarily chosen values for r and α
• Check optimal number of iterations over the data
• Comparing three sampling schemes
– With replacement
– Without replacement (cycling with random reshuffling)
– Cycling
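The three sampling schemes differ only in how the index $i(u)$ is chosen at iteration u. A minimal sketch, with hypothetical function and scheme names and assuming $t \ge 1$:

```python
import numpy as np

def sample_indices(n, t, scheme, seed=0):
    """Return the t data indices i(1), ..., i(t) visited by multi-pass SGD (t >= 1)."""
    rng = np.random.default_rng(seed)
    if scheme == "with_replacement":
        # Each index drawn uniformly at random, independently of the past.
        return rng.integers(0, n, size=t)
    if scheme == "cycling":
        # Fixed order 0, 1, ..., n-1 repeated.
        return np.arange(t) % n
    if scheme == "without_replacement":
        # A fresh random permutation for every pass (random reshuffling).
        n_passes = -(-t // n)  # ceiling division
        passes = [rng.permutation(n) for _ in range(n_passes)]
        return np.concatenate(passes)[:t]
    raise ValueError(f"unknown scheme: {scheme}")

cycling = sample_indices(4, 10, "cycling")
reshuffled = sample_indices(4, 10, "without_replacement")
replaced = sample_indices(4, 10, "with_replacement")
print(cycling, reshuffled, replaced)
```

Feeding these indices into the same LMS update gives the three variants compared in the simulations; only the with-replacement scheme is covered by the theorem.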
Simulations (sampling with replacement)
[Figure, four panels: $\log_{10} t^*$ against $\log_{10} n$, each with a reference line]
– $\alpha = 3/2$, $r = 1/3 > (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 2$, $r = 1/4 = (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 5/2$, $r = 1/5 < (\alpha-1)/(2\alpha)$: line of slope 1.25
– $\alpha = 3$, $r = 1/6 < (\alpha-1)/(2\alpha)$: line of slope 1.5
Simulations (sampling without replacement)
[Figure, four panels: $\log_{10} t^*$ against $\log_{10} n$, each with a reference line]
– $\alpha = 3/2$, $r = 1/3 > (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 2$, $r = 1/4 = (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 5/2$, $r = 1/5 < (\alpha-1)/(2\alpha)$: line of slope 1.25
– $\alpha = 3$, $r = 1/6 < (\alpha-1)/(2\alpha)$: line of slope 1.5
Simulations (cycling)
[Figure, four panels: $\log_{10} t^*$ against $\log_{10} n$, each with a reference line]
– $\alpha = 3/2$, $r = 1/3 > (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 2$, $r = 1/4 = (\alpha-1)/(2\alpha)$: line of slope 1
– $\alpha = 5/2$, $r = 1/5 < (\alpha-1)/(2\alpha)$: line of slope 1.25
– $\alpha = 3$, $r = 1/6 < (\alpha-1)/(2\alpha)$: line of slope 1.5
Simulations - Benchmarks
• MNIST dataset with linear kernel
[Figure: optimal number of passes against number of samples (1000 to 8000)]
Conclusion
• Benefits of multiple passes
– Number of passes grows with sample size for “hard” problems
– First provable improvement of multiple passes over single-pass SGD [NB: Hardt et al. (2016); Lin and Rosasco (2017) consider small step-sizes]
• Current work - Extensions
– Study of cycling and sampling without replacement (Shamir, 2016; Gürbüzbalaban et al., 2015)
– Mini-batches
– Beyond least-squares
– Optimal efficient algorithms for the situation $\mu > 2r$
– Combining analysis with exponential convergence of testing errors (Pillaud-Vivien, Rudi, and Bach, 2017)
References
F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS), 2013.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Technical Report 1707.06386, arXiv, 2017.
Mert Gürbüzbalaban, Asu Ozdaglar, and Pablo Parrilo. Why random reshuffling beats stochastic gradient descent. Technical Report 1510.08560, arXiv, 2015.
Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 2016.
Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(97):1–47, 2017.
O. Macchi. Adaptive processing: The least mean squares approach with applications in transmission. Wiley West Sussex, 1995.
L. Pillaud-Vivien, A. Rudi, and F. Bach. Stochastic gradient methods with exponential convergence of testing errors. Technical Report 1712.04755, arXiv, 2017.
Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Technical Report 1805.10074, arXiv, 2018.
Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems 29, pages 46–54, 2016.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Ingo Steinwart, Don R. Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proc. COLT, 2009.
A. B. Tsybakov. Optimal rates of aggregation. In Proc. COLT, 2003.
Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.