• Aucun résultat trouvé

poster

N/A
N/A
Protected

Academic year: 2022

Partager "poster"

Copied!
1
0
0

Texte intégral

(1)

Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

U. Marteau-Ferey, D. Ostrovskii, F. Bach, A. Rudi

SIERRA Project-Team, INRIA, D´epartement d’Informatique de l’´ENS, PSL Research University

Problem Setting:

Unknown distribution : rv Z ∈ Z with distribution ρ Parameter θ ∈ H, H a Hilbert space

Problem: Minimize an expected loss:

minθ∈H L(θ) := E [`Z(θ)], `z(θ) loss function (1) Well-specified assumption θ? ∈ argminθ∈H L(θ) exists

Statistical performance: L(θ) − L(θ?)

Data: access to ρ through n i.i.d observations (zi)1≤i≤n ∈ Zn from Z Regularized ERM:

θbλ = argminθ∈H L(θb ) + λ

2kθk2H, L(θb ) = 1 n

n

X

i=1

`zi(θ) Basic result : slow rates

L(θbλ) − L(θ?) ≤ k∇Lk2 + kθ?k2

n , λ = 1

n

Regularized Empirical Risk Minimization

Loss: `x,y(θ) = ky − θ · xk2, L(θ) = E [kY − θ · Xk2] Covariance operator: Σ = E [X ⊗ X]

∀θ ∈ H, L(θ) − L(θ?) = kΣ1/2(θ − θ?)k2 = kθ − θ?k2Σ. Two main terms :

Effective dimension: dfλ = Tr(Σ−1λ Σ), Σλ = Σ + λI Bias term: bλ = λkΣ−1λ θ?k

Bias-Variance trade-off : L(θbλ) − L(θ?) ≤ b2λ + dfλ

n .

Bias-variance trade-off for least-squares

Effective dimensionspectrum of covariance matrixi)i eigenvalues of Σ in decreasing order.

assumption: dfλ ≤ Q2λ−1/αλi = O (i−α)

Bias termdifficulty of the learning problem assumption : bλ ≤ Lλ1/2+r ↔ kΣ−rθ?k <

Optimal fast rates for λ = (Q/L)2 n−α/((1+2r)α+1)

L(θbλ) − L(θ?) ≤ QL2(1−γ)n−γ, γ = (1 + 2r)α/(1 + (1 + 2r)α)

Parametrization and optimal rates

We acknowledge support from the ERCIM Alain Bensoussan Fellowship and the European Reasearch Council (SEQUOIA 724063)

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm.Found.Comput. Math., 7(3):331–368, July 2007.

Dmitrii Ostrovskii and Francis Bach.Finite-sample analysis of M-estimators using self-concordance. Technical Report 1810.06838, arXiv, 2018.

Acknowledgments and References

Hessian at optimum: H(θ?) = ∇2L(θ?) = E [∇2`Z?)]

Main terms :

Effective dimension: dfλ = E

hkH−1/2λ?)∇`Z?)k2i

Bias term: bλ = λkH−1λ??k = k∇Lλ?)kH−1

λ ?)

Dikin radius: rλ = √

λ/R

Bias-Variance trade-off :

L(θbλ) − L(θ?) ≤ b2λ + dfλ

n , bλ, p

dfλ/n ≤ rλ

Generalized bias-variance trade-off

Effective dimensionrenormalized Fischer information assumption: dfλ ≤ Q2λ−1/α.

For GLMs in the well specified case, dfλ = Tr(H−1λ?)H(θ?)).

Bias termdifficulty of the learning problem assumption : bλ ≤ Lλ1/2+r ↔ kH(θ?)−rθ?k <

Optimal fast rates for λ = (Q/L)2 n−α/((1+2r)α+1)

L(θbλ) − L(θ?) ≤ QL2(1−γ)n−γ

Generalized optimal rates

Define θλ? = argminθ∈H L(θ) + λ2kθk2.

Idea: Decompose the statistical performance θ?θλ?θbλ. Bias term: θ?θλ? Using localisation on Lλ:

bλ ≤ rλ =⇒

(R kθλ?θ?k ≤ log 2 kθλ?θ?kHλ ≤ 2bλ

Equivalence of norms: HλHb λ Concentration inequality.

Variance term: θbλθλ? Localization + Concentration inequality : 1. Localization on Lbλ + Equivalence of norms

k∇Lbλλ?)kH−1

λ ≤ rλ =⇒

(R kθλ?θbλk ≤ log 2

λ?θbλkHλ ≤ 2k∇Lbλλ?)kH−1

λ

2. Concentration inequality

k∇Lbλλ?)kH−1

λ ≤ p

dfλ/n Putting things together pdfλ/n, bλ ≤ rλ =⇒

(R kθ?θbλk ≤ 2 log 2 kθ?θbλkHλ ≤ p

dfλ/n + bλ Quadratic approximation: L(θbλ) − L(θ?) ≤ 4(p

dfλ/n + bλ)2

Sketch of proof

4 3 2 1 0 1 2 3 4

0 t

1 2 3 4 5

f(t)

f(t) = log(1 + e

t)

f(t) = log((e

t + e t)/2)

f(t) = 1 + t

2 1

f(t) = t

2

(3)F (θ)[h, k, k] ≤ R khk ∇2F (θ)[k, k]

Assumption: the (`z)z∈supp(ρ) are all GSC functions for R.

Consequence: L, Lλ = L + λ2k · k2, Lbλ are GSC for R.

Definition (GSC function)

Examples from Supervised Learning

Distribution: Z = (X, Y ), where we want to predict Y from X. Predictor: θ · Φ(x) (θ · Φ(x, y) for multiclass)

Assumption: Y is bounded, Φ(X) is bounded Regression: `z(θ) = ψ(y − θ · Φ(x))

• Square loss: ψ(t) = 12t2

• Huber loss 1: ψ(t) = √

1 + t2 − 1

• Huber loss 2: ψ(t) = log et+e2 −t Classification:

• Logistic loss: `z(θ) = log(1 + e−y θ·Φ(x))

• GLMs: `z(θ) = −θ · Φ(x, y) + log R

Y exp (θ · Φ(x, y0)) dµ(y0)

Generalized Self-Concordance

Assume F is GSC, with minimizer θ?. Let θ ∈ H, t = R kθ − θ?k.

Quadratic approximation

F (θ) − F?) ≤ et kθ − θ?k22F(θ)

Localization using gradients Define Fλ = F + λ2k · k2. k∇Fλ(θ)k2Fλ(θ)−1 ≤ rλ =⇒

(t ≤ log 2

kθ − θ?k2Fλ(θ) ≤ 2k∇Fλ(θ)k2Fλ(θ)−1

Properties of GSC functions

Références

Documents relatifs

In this paper, we proposed to solve a range of computational imaging problems under the perspective of a regularized weighted least-squares model using a PCG approach.. We provided

In order to exemplify the behavior of least-squares methods for policy iteration, we apply two such methods (offline LSPI and online, optimistic LSPI) to the car-on- the-hill

We further detail the case of ordinary Least-Squares Regression (Section 3) and discuss, in terms of M , N , K, the different tradeoffs concerning the excess risk (reduced

For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity

Also, in functional data analysis (FDA) where observed continuous data are measured over a densely sampled grid and then repre- sented by real-valued functions rather than by

In this paper we proposed m-power regularized least squares regression (RLSR), a supervised regression algorithm based on a regularization raised to the power of m, where m is with

Considering a capacity assumption of the space H [32, 5], and a general source condition [1] of the target function f H , we show high-probability, optimal convergence results in

In order to characterize the optimality of these new bounds, our third contribution is to consider an application to non-parametric regression in Section 6 and use the known