Structured High-Dimensional Models

In the document The DART-Europe E-theses Portal (Pages 16-20)

Over the last decades, Statistics has been at the center of attention in a wide variety of ways. Thanks to technological improvements, for instance the increase in computing power and the surge in data-sharing capacity, high-dimensional statistics has been extremely dynamic. As a consequence, this field has become one of the main pillars of the modern statistical landscape.

The standard statistical framework considers the case where the sample size is relatively large and the dimension of the observations is substantially smaller. As pointed out in Giraud (2014), the technological evolution of computing has prompted a paradigm shift from classical statistical theory to high-dimensional statistics. More precisely, we characterize a statistical problem as high-dimensional whenever the dimension of the observations is much larger than the sample size. This setting has become more common as the number of accessible features of data has increased.

Generally speaking, statistical problems are ill-posed in the high-dimensional setting. Further assumptions on the structure of the underlying model are required in order to make the problem meaningful. For instance, in a problem of high-dimensional regression, we may assume that the vector to estimate is sparse (i.e. only a few components are non-zero), or that the signal matrix has small rank when dealing with matrix estimation. These assumptions are usually very realistic and supported by empirical evidence. This is what can be described as Structured High-Dimensional Statistics.

A new paradigm

One way to summarize some paradigms of modern Statistics is the following. For statistical methods to be "successful", they need to fulfill the OCAR criterion, where OCAR stands respectively for Optimality, Computational tractability, Adaptivity and Robustness.

• Optimality

In order to evaluate and compare algorithms, the oldest criterion is probably statistical optimality. An estimator is said to be optimal if it cannot be improved in some sense. A widely used criterion is minimax optimality. The notion of minimax optimality is relative to some risk. In order to make this notion more transparent, let us assume that we observe i.i.d. realizations $X_1, \dots, X_n$ of some random variable $X$. Suppose moreover that the distribution of $X$ is given by $P_\theta$ for some parameter $\theta \in \Theta$ we are interested in. Given a semi-distance $d$, the performance of an estimator $\hat\theta_n = \hat\theta_n(X_1, \dots, X_n)$ of $\theta$ is measured by the maximum risk of this estimator on $\Theta$:

$$r(\hat\theta_n) = \sup_{\theta \in \Theta} \mathbb{E}_\theta\left( d^2(\hat\theta_n, \theta) \right),$$

where $\mathbb{E}_\theta$ denotes the expectation with respect to $(X_1, \dots, X_n)$. The minimax risk is the smallest worst-case risk reached among all measurable estimators. It is given by:

$$R_n = \inf_{\hat\theta_n} r(\hat\theta_n),$$

where the infimum is over all estimators. In practice, we have a general framework to derive minimax lower bounds, cf. Tsybakov (2008). We say that an estimator $\hat\theta_n$ is non-asymptotically minimax optimal if the following holds:

$$r(\hat\theta_n) \le C R_n,$$

where $C > 0$ is a constant.

Given this criterion, we are interested in estimators achieving the minimax optimal rate. As an example, consider the problem of low-rank matrix estimation. It turns out that a simple spectral procedure is minimax optimal. Indeed, Koltchinskii et al. (2011) give a lower bound and a matching upper bound for the problem of low-rank matrix estimation through a nuclear norm penalization procedure. We recall that minimax optimality is only one way to define optimality, and one may think of other criteria, for instance a Bayesian risk instead of the minimax risk.

We should point out that the notion of minimax risk is not without flaws. In general, this notion is pessimistic since the worst-case scenario may be located in a tiny region of $\Theta$. In that case, the worst-case scenario is not likely to be realized. This fact is detailed further in Chapter 5.
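To make the definition of the maximum risk more concrete, here is a minimal simulation sketch. It assumes a Gaussian location model $X_i \sim N(\theta, 1)$, takes $d$ to be the absolute difference, and replaces $\Theta$ by a small finite grid; the model, the grid and the function names are illustrative assumptions, not taken from the text.

```python
import random
import statistics


def squared_error_risk(estimator, theta, n, n_rep, rng):
    """Monte Carlo estimate of E_theta[(estimator(X_1..X_n) - theta)^2]
    when X_i ~ N(theta, 1) (an assumed model, for illustration only)."""
    total = 0.0
    for _ in range(n_rep):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += (estimator(sample) - theta) ** 2
    return total / n_rep


def max_risk(estimator, thetas, n, n_rep=2000, seed=0):
    """Approximate the sup over a finite grid `thetas`, a stand-in for Theta."""
    rng = random.Random(seed)
    return max(squared_error_risk(estimator, t, n, n_rep, rng) for t in thetas)


if __name__ == "__main__":
    grid = [-5.0, -1.0, 0.0, 1.0, 5.0]
    n = 50
    # The sample mean has risk close to 1/n at every point of the grid.
    print("sample mean:", max_risk(statistics.fmean, grid, n))
    # A constant estimator is perfect at theta = 0 and poor elsewhere,
    # so its maximum risk over the grid is large.
    print("constant 0 :", max_risk(lambda xs: 0.0, grid, n))
```

The constant estimator illustrates why the supremum matters: an estimator can be excellent at one parameter value and still have a large worst-case risk.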

• Computational Tractability

Computational tractability captures whether a given algorithm runs in polynomial time. For instance, a method based on a sample of size $N$ that runs in $O(N^2)$ is practical, while another one running in $O(e^{\sqrt{N}})$ is not. The recent importance of this criterion is due to the explosion of sample sizes relative to the limited capacity of current machines. Indeed, for many statistical problems, computationally intractable exhaustive search methods (i.e. methods testing all possible solutions in a finite set of exponential size) are shown to be optimal from a statistical point of view.
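The cost of exhaustive search can be illustrated with a small sketch. It assumes a toy setting in which one observes a vector `y` and looks for the support of size `s` carrying the most energy; the setting and the function names are hypothetical. In this toy case a sorting rule returns the same support in polynomial time, which is not true for harder problems such as those discussed in this section.

```python
import itertools
import math


def exhaustive_best_support(y, s):
    """Try every support of size s and keep the one capturing the most energy.
    The number of candidates is C(p, s), exponential in p when s scales with p."""
    best_support, best_energy = None, -1.0
    for support in itertools.combinations(range(len(y)), s):
        energy = sum(y[i] ** 2 for i in support)
        if energy > best_energy:
            best_support, best_energy = support, energy
    return best_support


def sorted_best_support(y, s):
    """Polynomial-time alternative, valid in this toy setting only:
    keep the s largest entries in absolute value (O(p log p))."""
    order = sorted(range(len(y)), key=lambda i: -abs(y[i]))
    return tuple(sorted(order[:s]))


if __name__ == "__main__":
    y = [0.1, -3.0, 0.2, 2.5, -0.3, 0.05, 4.0, -0.1]
    print(exhaustive_best_support(y, 3), sorted_best_support(y, 3))
    for p in (20, 40, 80):
        # Number of supports of size p/2 that exhaustive search would visit.
        print(p, math.comb(p, p // 2))
```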

One of the most challenging problems concerning the tractability of algorithms is that of computational lower bounds. While a large body of techniques is available to derive general lower bounds for minimax risks, not much is known when we restrict the class of estimators to polynomial-time methods. Karp (1972) has proved, for the specific problem of detecting the presence of a hidden clique, that there is a non-trivial gap between what could be achieved by any method and by polynomial-time methods. This breakthrough shows that it is not always possible to reach statistical optimality through polynomial-time methods. Inspired by the planted clique problem, the previous fact has been extended to Sparse PCA among many other problems, cf. Berthet and Rigollet (2013). Apart from this reduction to the planted clique problem, it is still unclear how to derive general computational lower bounds having the same flavour as information-theoretic lower bounds.

• Adaptivity

In order to measure the performance of a given estimator, we may assume that the data is generated according to some model. This model is then used to evaluate the algorithm. Usually, a model depends on different parameters, and the proposed estimator may depend on these parameters. The criterion of adaptivity aims to compare two optimal algorithms through their ability to adapt to the parameters of the model. Sometimes optimality and adaptive optimality are slightly different, but in many scenarios adaptivity is possible at almost no cost. For instance, consider the problem of high-dimensional estimation in linear regression. The performance of the LASSO and SLOPE (Bogdan et al. (2015)) estimators is studied in Bellec et al. (2018) under similar conditions on the design. It turns out that a sparsity-dependent tuning of LASSO achieves the minimax estimation rate. While LASSO requires prior knowledge of the sparsity, SLOPE is adaptively minimax optimal.

Still, we may argue that SLOPE has a higher complexity due to the sorting step. This may be seen as the price to pay for adaptation. To the best of our knowledge, the question of minimax adaptive optimality under a fixed complexity has not been addressed so far. Generally speaking, adaptation to sparsity can be done through two main techniques, either by a Lepski-type method or by sorted thresholding procedures as in the Benjamini-Hochberg procedure.
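As an illustration of sorted thresholding, the sketch below implements the Benjamini-Hochberg step-up rule on a list of p-values. The function name and the example p-values are made up for illustration. The point is that the rejection threshold depends on the ranks of the sorted data, so the number of non-null hypotheses does not need to be known in advance.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg step-up rule.

    The k-th smallest p-value is compared with k * alpha / m; all hypotheses
    up to the largest rank k passing this comparison are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    return sorted(order[:k_max])


if __name__ == "__main__":
    pvals = [0.001, 0.20, 0.012, 0.03, 0.60, 0.004, 0.90, 0.45]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

The sorting step is the extra cost relative to a fixed threshold, in line with the remark above on the price of adaptation.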

• Robustness

There are two popular notions of robustness. The classical robustness is with respect to outliers, in the sense that a small fraction of the data is corrupted by outliers. The Huber contamination model is a typical example of it (Huber (1992)). Let $X_1, \dots, X_n$ be $n$ i.i.d. random variables and $\bar p$ the probability distribution of $X_i$. There are two probability measures $p$, $q$ and a real $\varepsilon \in [0, 1/2)$ such that

$$\bar p = (1 - \varepsilon)\, p + \varepsilon\, q, \quad \forall i \in \{1, \dots, n\}.$$

This model corresponds to assuming that a $(1 - \varepsilon)$-fraction of observations, called inliers, are drawn from a reference measure $p$, whereas an $\varepsilon$-fraction of observations are outliers and are drawn from another distribution $q$. In general, all three parameters $p$, $q$ and $\varepsilon$ are unknown. The parameter of interest is the reference distribution $p$, whereas $q$ and $\varepsilon$ play the role of nuisance parameters. For instance, the particular case where $p$ is the normal distribution with unknown mean $\theta$ and variance 1 has been extensively studied in the last decade, cf. Diakonikolas et al. (2016, 2017) and references therein.

In dimension one, it is clear that the empirical median is a robust alternative to the empirical mean. The problem becomes more complicated in higher dimensions since there are many generalizations of the median in dimension two or larger. For the normal mean estimation problem, Chen et al. (2018) show that robust estimation can be achieved in a minimax sense through Tukey's median (Tukey (1975)). Unfortunately, this approach is not computationally efficient. Recently, many efforts have been made to prove similar results using polynomial-time methods, for instance, filtering techniques (Diakonikolas et al. (2016)) and group thresholding (Collier and Dalalyan (2017)). We should add here that the outliers may be deterministic, random or even adversarial.
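The one-dimensional remark on the median can be checked on simulated data. The sketch below draws from a contamination model with $p = N(\theta, 1)$ and takes $q$ to be a point mass far from $\theta$; this choice of $q$, the parameter values and the function name are assumptions made only for illustration.

```python
import random
import statistics


def sample_huber(n, theta, eps, outlier_value, rng):
    """Draw n points from (1 - eps) * N(theta, 1) + eps * q, where q is a
    point mass at `outlier_value` (an arbitrary choice of q for this sketch)."""
    return [
        outlier_value if rng.random() < eps else rng.gauss(theta, 1.0)
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    xs = sample_huber(n=2000, theta=1.0, eps=0.1, outlier_value=100.0, rng=rng)
    print("mean  :", statistics.fmean(xs))   # pulled towards the outliers
    print("median:", statistics.median(xs))  # stays close to theta = 1
```

With 10% of the mass at 100, the empirical mean is shifted by roughly ten units, while the median moves only slightly.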

A more recent notion of robustness is with respect to heavy-tailed noise. It was pioneered by Catoni (2012). Although the sub-Gaussian noise assumption is not always realistic, it is quite convenient in order to derive non-asymptotic results thanks to concentration properties. These guarantees fail under heavy-tail assumptions on the noise. Assume that we observe $X_1, \dots, X_n \in \mathbb{R}^p$ such that

$$X_i = \mu + \xi_i,$$

where the $\xi_i$ are i.i.d. centered sub-Gaussian random vectors with independent entries. In that case the empirical mean $\hat X$ satisfies, for any given confidence level $\delta > 0$, the following:

$$\mathbb{P}\left( \|\hat X - \mu\| \ge C \left( \sqrt{\frac{p}{n}} + \sqrt{\frac{\log(1/\delta)}{n}} \right) \right) \le \delta,$$

where $C > 0$ and $\|\cdot\|$ denotes the $\ell_2$ norm. Recently, the Median-Of-Means estimator (Nemirovskii and Yudin (1983)) was shown to achieve similar results under very mild assumptions on the noise in dimension one, cf. Devroye et al. (2016). Generalization to high dimensions through tractable methods has been an active field of research in recent years. The recent paper by Cherapanamjeri et al. (2019) exhibits a new method based on an SDP relaxation achieving similar results in polynomial time. Their algorithm is significantly faster than the one proposed by Hopkins (2018), which was, to the best of our knowledge, the first polynomial-time method achieving sub-Gaussian guarantees for mean estimation under only the second moment assumption.
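For reference, the one-dimensional Median-Of-Means estimator mentioned above can be sketched in a few lines. The function name, the optional shuffling step and the Pareto example are illustrative assumptions; in the literature the number of blocks is typically taken of the order of $\log(1/\delta)$.

```python
import random
import statistics


def median_of_means(xs, n_blocks, rng=None):
    """Split the sample into n_blocks groups of equal size (dropping any
    remainder), average each group, and return the median of the averages."""
    xs = list(xs)
    if rng is not None:
        rng.shuffle(xs)  # optional: guards against ordered data
    block_size = len(xs) // n_blocks
    if block_size == 0:
        raise ValueError("need at least one observation per block")
    means = [
        statistics.fmean(xs[b * block_size:(b + 1) * block_size])
        for b in range(n_blocks)
    ]
    return statistics.median(means)


if __name__ == "__main__":
    rng = random.Random(0)
    # Pareto noise with shape 2.1: finite variance but heavy tails.
    # Its mean is 2.1 / 1.1, roughly 1.91.
    xs = [rng.paretovariate(2.1) for _ in range(5000)]
    print("empirical mean :", statistics.fmean(xs))
    print("median of means:", median_of_means(xs, n_blocks=20))
```

A single extreme observation can only spoil the block it falls in, which is why the median over blocks is less sensitive to heavy tails than the plain average.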

To sum up, we have presented some criteria that we believe are at the core of modern Statistics. Following this perspective, the ideal algorithm would satisfy the OCAR criterion. However, this is subject to further evolution. Distributed implementation along with storage capacity are already attracting attention, cf. Szabo and van Zanten (2017) and Ding et al. (2019) for recent advances in these directions. If data keep growing while speed limitations do not improve, then at some point distributed algorithms will be to polynomial-time methods what polynomial-time methods are today to exponential-time methods.
