poster

(1)

Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

U. Marteau-Ferey, D. Ostrovskii, F. Bach, A. Rudi

SIERRA Project-Team, INRIA, D´epartement d’Informatique de l’´ENS, PSL Research University

Problem Setting:

Unknown distribution : rv Z ∈ Z with distribution ρ Parameter θ ∈ H, H a Hilbert space

Problem: Minimize an expected loss:

minθ∈H L(θ) := E [`_Z(θ)], `_z(θ) loss function (1) Well-specified assumption θ^? ∈ argmin_θ∈H L(θ) exists

Statistical performance: L(θ) − L(θ^?)

Data: access to ρ through n i.i.d observations (z_i)_1≤i≤n ∈ Zⁿ from Z Regularized ERM:

θb_λ = argmin_θ∈H L(θb ) + λ

2kθk²_H, L(θb ) = 1 n

n

X

i=1

`_z_i(θ) Basic result : slow rates

L(θb_λ) − L(θ^?) ≤ k∇Lk²_∞ + kθ^?k²

√n , λ = 1

√n

Regularized Empirical Risk Minimization

Loss: `_x,y(θ) = ky − θ · xk², L(θ) = E [kY − θ · Xk²] Covariance operator: Σ = E [X ⊗ X]

∀θ ∈ H, L(θ) − L(θ^?) = kΣ^1/2(θ − θ^?)k² = kθ − θ^?k²_Σ. Two main terms :

Effective dimension: df_λ = Tr(Σ⁻¹_λ Σ), Σ_λ = Σ + λI Bias term: b_λ = λkΣ⁻¹_λ θ^?k

Bias-Variance trade-off : L(θb_λ) − L(θ^?) ≤ b²_λ + df_λ

n .

Bias-variance trade-off for least-squares

Effective dimension ↔ spectrum of covariance matrix (λ_i)_i eigenvalues of Σ in decreasing order.

assumption: df_λ ≤ Q²λ^−1/α ↔ λ_i = O (i^−α)

Bias term ↔ difficulty of the learning problem assumption : b_λ ≤ Lλ^1/2+r ↔ kΣ^−rθ^?k < ∞

Optimal fast rates for λ = (Q/L)² n^−α/((1+2r^)α+1)

L(θb_λ) − L(θ^?) ≤ Q^2γL^2(1−γ⁾n^−γ, γ = (1 + 2r)α/(1 + (1 + 2r)α)

Parametrization and optimal rates

We acknowledge support from the ERCIM Alain Bensoussan Fellowship and the European Reasearch Council (SEQUOIA 724063)

• A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm.Found.Comput. Math., 7(3):331–368, July 2007.

• Dmitrii Ostrovskii and Francis Bach.Finite-sample analysis of M-estimators using self-concordance. Technical Report 1810.06838, arXiv, 2018.

Acknowledgments and References

• Hessian at optimum: H(θ^?) = ∇²L(θ^?) = E [∇²`_Z(θ^?)]

Main terms :

• Effective dimension: df_λ = E

hkH^−1/2_λ (θ^?)∇`_Z(θ^?)k²i

• Bias term: b_λ = λkH⁻¹_λ (θ^?)θ^?k = k∇L_λ(θ^?)k_H⁻¹

λ (θ^?)

• Dikin radius: r_λ = √

λ/R

Bias-Variance trade-off :

L(θb_λ) − L(θ^?) ≤ b²_λ + df_λ

n , b_λ, p

df_λ/n ≤ r_λ

Generalized bias-variance trade-off

Effective dimension ↔ renormalized Fischer information assumption: df_λ ≤ Q²λ^−1/α.

For GLMs in the well specified case, df_λ = Tr(H⁻¹_λ (θ^?)H(θ^?)).

Bias term ↔ difficulty of the learning problem assumption : b_λ ≤ Lλ^1/2+r ↔ kH(θ^?)^−rθ^?k < ∞

Optimal fast rates for λ = (Q/L)² n−α/((1+2r)α+1)

L(θb_λ) − L(θ^?) ≤ Q^2γL^2(1−γ⁾n^−γ

Generalized optimal rates

Define θ_λ^? = argmin_θ∈H L(θ) + ^λ₂kθk².

Idea: Decompose the statistical performance θ^? → θ_λ^? → θb_λ. Bias term: θ^? ↔ θ_λ^? Using localisation on L_λ:

b_λ ≤ r_λ =⇒

(R kθ_λ^? − θ^?k ≤ log 2 kθ_λ^? − θ^?k_H_λ ≤ 2b_λ

Equivalence of norms: H_λ ∼ Hb _λ Concentration inequality.

Variance term: θb_λ ↔ θ_λ^? Localization + Concentration inequality : 1. Localization on Lb_λ + Equivalence of norms

k∇Lb_λ(θ_λ^?)k_H⁻¹

λ ≤ r_λ =⇒

(R kθ_λ^? − θb_λk ≤ log 2

kθ_λ^? − θb_λk_H_λ ≤ 2k∇Lb_λ(θ_λ^?)k_H⁻¹

λ

2. Concentration inequality

k∇Lb_λ(θ_λ^?)k_H⁻¹

λ ≤ p

df_λ/n Putting things together pdf_λ/n, b_λ ≤ r_λ =⇒

(R kθ^? − θb_λk ≤ 2 log 2 kθ^? − θb_λk_H_λ ≤ p

df_λ/n + b_λ Quadratic approximation: L(θb_λ) − L(θ^?) ≤ 4(p

df_λ/n + b_λ)²

Sketch of proof

4 3 2 1 0 1 2 3 4

0 t

1 2 3 4 5

f(t)

f(t) = log(1 + e

^t)

f(t) = log((e

^t + e ^t)/2)

f(t) = 1 + t

² 1

f(t) = t

²

∇⁽³⁾F (θ)[h, k, k] ≤ R khk ∇²F (θ)[k, k]

Assumption: the (`_z)_z∈supp(ρ) are all GSC functions for R.

Consequence: L, L_λ = L + ^λ₂k · k², Lb_λ are GSC for R.

Definition (GSC function)

Examples from Supervised Learning

Distribution: Z = (X, Y ), where we want to predict Y from X. Predictor: θ · Φ(x) (θ · Φ(x, y) for multiclass)

Assumption: Y is bounded, Φ(X) is bounded Regression: `_z(θ) = ψ(y − θ · Φ(x))

• Square loss: ψ(t) = ¹₂t²

• Huber loss 1: ψ(t) = √

1 + t² − 1

• Huber loss 2: ψ(t) = log ^e^t^+e₂ ^−t Classification:

• Logistic loss: `_z(θ) = log(1 + e^{−y θ·Φ(x)})

• GLMs: `_z(θ) = −θ · Φ(x, y) + log R

Y exp (θ · Φ(x, y⁰)) dµ(y⁰)

Generalized Self-Concordance

Assume F is GSC, with minimizer θ^?. Let θ ∈ H, t = R kθ − θ^?k.

Quadratic approximation

F (θ) − F (θ^?) ≤ e^t kθ − θ^?k²_∇²_F_(θ)

Localization using gradients Define F_λ = F + ^λ₂k · k². k∇F_λ(θ)k_∇²_F_λ_(θ)⁻¹ ≤ r_λ =⇒

(t ≤ log 2

kθ − θ^?k_∇²_F_λ_(θ) ≤ 2k∇F_λ(θ)k_∇²_F_λ_(θ)⁻¹