
Journal de la Société Française de Statistique, Vol. 160 No. 3 (2019)

Discussion of “Minimal penalties and the slope heuristics: a survey” by Sylvain Arlot

Titre : Discussion sur « Pénalités minimales et heuristique de pente » par Sylvain Arlot

Sylvain Sardy
Section de Mathématiques, Université de Genève

Most machine learning methods require the selection of a regularization parameter that controls the complexity and the fit of the estimated model. The learner considered here is a sequence of projections onto a collection of linear subspaces $(S_m)_{m \in \mathcal{M}}$, the regularization penalizes the least squares by $C \dim(S_m)$, and the goodness-of-fit measure is the predictive risk. The optimal penalty is seen as the constant $C$ that unbiasedly estimates the predictive risk. Sylvain Arlot makes a thorough theoretical and empirical survey and provides great insight into the minimal penalty and the slope heuristics, which circumvent the difficult problem of estimating the noise variance $\sigma^2$.
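For concreteness, the penalized criterion at the heart of the survey can be written down explicitly. The following LaTeX sketch restates the framework with the classical constants; it is a restatement of known material, not a new result:

```latex
% Penalized least-squares model selection (sketch, notation as above):
\[
  \widehat{m}(C) \;\in\; \operatorname*{arg\,min}_{m \in \mathcal{M}}
    \Bigl\{ \, \| Y - \widehat{\mu}_m \|^2 + C \dim(S_m) \Bigr\},
\]
% where \widehat{\mu}_m is the least-squares projection of Y onto S_m.
% C = 2\sigma^2 recovers Mallows' C_p (unbiased risk estimation),
% C = \sigma^2 is the minimal penalty of Birgé and Massart (2007),
% and the slope heuristics selects \widehat{C} = 2\,\widehat{C}_{\min},
% with \widehat{C}_{\min} detected from the jump of C \mapsto \dim(S_{\widehat{m}(C)}).
```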

FIGURE 1. Example of risk estimation compared to true loss. [Two panels of risk versus $\lambda$. Left: subset selection ($\ell_0$), with curves for the true loss, SURE, and Mallows' $C_p$. Right: adaptive lasso.]

Regularization methods know that bias is good, yet they often paradoxically seek an optimal model by minimizing an unbiased estimate of the risk. We recommend biased estimation of the risk towards models of lower complexity. We illustrate our point with orthonormal regression $Y = \mu + \varepsilon$, where $\mu$ is believed to be sparse. We consider two estimators.

A typical projection estimator is subset selection, with $\binom{n}{p}$ models of size $p \in \{0, 1, \ldots, n\}$ and a total of $|\mathcal{M}| = 2^n$ models $(S_m)_{m \in \mathcal{M}}$. Conditional on $p = \dim(\hat{S}_m)$, the best model $\hat{S}_m$ is the support of $\hat{\mu}_\phi = \phi_{\mathrm{hard}}(Y)$ with threshold $\phi = \sqrt{C}$ and $C = Y_{(n-p)}^2$. In that case the unbiased risk estimate formula, based on Stein (1981) and Sardy (2009, Equation (12)), takes into account that the optimal model $\hat{S}_m$ is estimated. On the contrary, Mallows' $C_p$ of (9), whose unbiasedness property is conditional on each $S_m$ of size $p$, underestimates the variance and consequently selects an over-complex model. The left plot of Figure 1 illustrates this behavior. The factor 2 in (9) is too small, which concurs with Birgé and Massart (2007).
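To make this first example concrete, here is a minimal Python simulation sketch (our illustration, not the paper's exact experiment): it assumes an orthonormal design with known $\sigma^2 = 1$, fits the best model of each size $p$ by keeping the $p$ largest $|Y_i|$ (hard thresholding), and prints the true loss next to the conditional Mallows' $C_p$, whose minimizer tends to be over-complex.

```python
# Minimal simulation sketch (assumptions: orthonormal design, sigma = 1;
# variable names and constants are illustrative, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
n = 200
mu = np.zeros(n)
mu[:10] = 5.0                      # sparse mean: 10 active coefficients
y = mu + rng.standard_normal(n)    # Y = mu + eps, eps ~ N(0, I)

abs_sorted = np.sort(np.abs(y))[::-1]   # |Y| in decreasing order
for p in range(0, 31):                  # model sizes up to 30 (arbitrary cap)
    # Best model of size p keeps the p largest |Y_i| (hard thresholding).
    keep = np.abs(y) >= (abs_sorted[p - 1] if p > 0 else np.inf)
    mu_hat = np.where(keep, y, 0.0)
    true_loss = np.sum((mu_hat - mu) ** 2)
    # Mallows' C_p conditional on a fixed model of size p:
    # RSS + 2*sigma^2*p - n*sigma^2, with sigma = 1 here.
    cp = np.sum((y - mu_hat) ** 2) + 2.0 * p - n
    print(f"p={p:2d}  true loss={true_loss:8.1f}  Cp={cp:8.1f}")
```

Because the kept coefficients are data-selected, each extra noise coefficient lowers the residual sum of squares by more than the penalty $2\sigma^2$ charges for it, so $C_p$ keeps decreasing past the true size while the true loss turns upward.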

Adaptive lasso (Zou, 2006), indexed by $(\lambda, \nu)$, includes best subset selection in the limit $\nu \to \infty$. For adaptive lasso, the right plot of Figure 1 shows that, for a fixed large $\nu = 20$, the unbiased estimate of the risk (Sardy, 2012) as a function of $\lambda$ has high variance on the left side of the minimum of the true loss, which itself has a high negative derivative there. Biasing towards smaller complexity (i.e., larger $\lambda$) would lead to an estimator with smaller risk.
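In the orthonormal model, adaptive lasso has a coordinatewise closed form, which makes the $\nu \to \infty$ behavior easy to see. Below is a minimal Python sketch under our assumption that the weights $w_i = 1/|Y_i|^\nu$ are built from the observations themselves, as is natural here; the function name and test values are ours:

```python
# Adaptive lasso in the orthonormal sequence model: coordinatewise
# soft thresholding with data-driven weights w_i = 1 / |y_i|^nu.
# (A closed-form sketch under our assumptions; not code from the paper.)
import numpy as np

def adaptive_lasso(y, lam, nu):
    """Minimize 0.5*(y_i - mu_i)^2 + lam * |mu_i| / |y_i|^nu, coordinatewise."""
    w = np.abs(y) ** (-float(nu))             # weights blow up for small |y_i|
    return np.sign(y) * np.maximum(np.abs(y) - lam * w, 0.0)

y = np.array([0.3, -0.8, 1.5, -4.0, 6.0])     # assumes no exact zeros in y
for nu in (0.0, 1.0, 20.0):                   # nu = 0 is the plain lasso
    print(f"nu={nu:4.1f}:", np.round(adaptive_lasso(y, lam=1.0, nu=nu), 3))
```

For $\nu = 20$ the small observations are killed while the large ones are left nearly unshrunk, i.e., the estimator already behaves like subset selection.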

These two examples suggest a slope method with a constant larger than 2. An even larger constant should be employed when the design is not fixed, the regression matrix is badly conditioned (ill-posed inverse problems), and $\sigma^2$ is unknown. BIC (Schwarz, 1978) and the quantile universal threshold (QUT) (Giacobino et al., 2017) lead to low-complexity models. QUT is also a good competitor of the scree test to recover the number of components in principal component analysis (Josse and Sardy, 2016).
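As an aside, the defining idea of QUT is easy to approximate by Monte Carlo: take $\lambda$ as the $(1-\alpha)$-quantile, under the null model $\mu = 0$, of the smallest $\lambda$ that zeroes out the whole estimate. The Python sketch below does this for soft thresholding with known $\sigma = 1$; the choices of $\alpha$, $n$, and the Monte Carlo size are ours:

```python
# Monte Carlo sketch of the Quantile Universal Threshold (Giacobino et al., 2017)
# for soft thresholding in the orthonormal model with sigma = 1.
# (Illustrative parameters; not the paper's implementation.)
import numpy as np

rng = np.random.default_rng(1)
n, reps, alpha = 200, 10_000, 0.05
# Under mu = 0, the smallest lambda that sets the soft-thresholding
# estimate to zero is the zero-thresholding statistic max_i |eps_i|.
null_stat = np.abs(rng.standard_normal((reps, n))).max(axis=1)
lam_qut = np.quantile(null_stat, 1.0 - alpha)
print(f"lambda_QUT = {lam_qut:.3f}; "
      f"universal threshold sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.3f}")
```

The estimated threshold sits slightly above the classical universal threshold $\sqrt{2 \log n}$, as expected for a small $\alpha$.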

References

Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33–73.

Giacobino, C., Sardy, S., Diaz Rodriguez, J., and Hengartner, N. (2017). Quantile universal threshold. Electronic Journal of Statistics, 11(2):4701–4722.

Josse, J. and Sardy, S. (2016). Adaptive shrinkage of singular values. Statistics and Computing, 26(3):715–724.

Sardy, S. (2009). Adaptive posterior mode estimation of a sparse sequence for model selection. Scandinavian Journal of Statistics, 36:577–601.

Sardy, S. (2012). Smooth blockwise iterative thresholding: a smooth fixed point estimator based on the likelihood's block gradient. Journal of the American Statistical Association, 107(498):800–813.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429.
