Semi-parametric mixture models and applications to multiple testing



HAL Id: tel-00987035

https://tel.archives-ouvertes.fr/tel-00987035

Submitted on 12 May 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Semi-parametric mixture models and applications to

multiple testing

van Hanh Nguyen

To cite this version:

van Hanh Nguyen. Semi-parametric mixture models and applications to multiple testing. General Mathematics [math.GM]. Université Paris Sud - Paris XI, 2013. English. NNT: 2013PA112196. tel-00987035.


Université Paris-Sud Faculté des Sciences d’Orsay

THESIS presented to obtain
THE DEGREE OF DOCTOR OF SCIENCE OF UNIVERSITÉ PARIS XI

Speciality: Mathematics

by

Van Hanh Nguyen

Subject:

MODÈLES DE MÉLANGE SEMI-PARAMÉTRIQUES ET APPLICATIONS AUX TESTS MULTIPLES.

Reviewers: M. Gilles BLANCHARD, M. Stéphane ROBIN

Defended on 1 October 2013 before the examination committee:

Mme. Cristina BUTUCEA (President of the jury)
M. Alain CELISSE (Examiner)
Mme. Élisabeth GASSIAT (Thesis co-advisor)
Mme. Catherine MATIAS (Thesis co-advisor)
M. Stéphane ROBIN (Reviewer)


Acknowledgements

I would first like to warmly thank my two thesis advisors: Catherine MATIAS, for her high-quality scientific supervision, which allowed me to flourish in my research work, and for the great kindness and sympathy she showed me during these three years; and Élisabeth GASSIAT, for her many pieces of advice and for the confidence she placed in me by agreeing to supervise this doctoral work.

I also thank Gilles BLANCHARD and Stéphane ROBIN for having accepted to review my thesis. Their comments on the manuscript helped me greatly to improve it. My thanks also go to Cristina BUTUCEA, Alain CELISSE and Étienne ROQUAIN, who do me the honor of accepting to be members of the jury for my defense.

I would like to thank very warmly all the members of the Statistique et Génome laboratory. I particularly thank Christophe AMBROISE and Michèle ILBERT for their kindness and their help. I also thank Marie-Luce TAUPIN for co-supervising my Master 2 internship. I will never forget the three years spent at the laboratory.

I would like to thank the French Ministry of Higher Education and Research, which funded this thesis by granting me a research allowance position at Université Paris-Sud. I also thank David HARARI, the members of the Mathematics Department at Orsay, and the secretarial staff of Université Paris-Sud.

Finally, I wish to thank my family and my friends, who encouraged me continually throughout my studies in France.


Résumé

Dans un contexte de test multiple, nous considérons un modèle de mélange semi-paramétrique avec deux composantes. Une composante est supposée connue et correspond à la distribution des p-valeurs sous l'hypothèse nulle avec probabilité a priori θ. L'autre composante f est non paramétrique et représente la distribution des p-valeurs sous l'hypothèse alternative. Le problème d'estimer les paramètres θ et f du modèle apparaît dans les procédures de contrôle du taux de faux positifs (« false discovery rate » ou FDR). Dans la première partie de cette thèse, nous étudions l'estimation de la proportion θ. Nous discutons de résultats d'efficacité asymptotique et établissons que deux cas différents se présentent suivant que f s'annule ou non sur un intervalle non vide. Dans le premier cas (annulation sur un intervalle), nous présentons des estimateurs qui convergent à la vitesse paramétrique, calculons la variance asymptotique optimale et conjecturons qu'aucun estimateur n'est asymptotiquement efficace (i.e. n'atteint la variance asymptotique optimale). Dans le deuxième cas, nous prouvons que le risque quadratique de n'importe quel estimateur ne converge pas à la vitesse paramétrique. Dans la deuxième partie de la thèse, nous nous concentrons sur l'estimation de la composante non paramétrique inconnue f du mélange, en nous appuyant sur un estimateur préliminaire de θ. Nous proposons et étudions les propriétés asymptotiques de deux estimateurs différents de cette composante inconnue. Le premier estimateur est un estimateur à noyau avec poids aléatoires. Nous établissons une borne supérieure pour son risque quadratique ponctuel, en montrant une vitesse de convergence non paramétrique classique sur une classe de densités de Hölder. Le deuxième estimateur est un estimateur du maximum de vraisemblance lissée. Il est calculé par un algorithme itératif, pour lequel nous établissons une propriété de décroissance d'un critère. De plus, ces estimateurs sont utilisés dans une procédure de test multiple pour estimer le taux local de faux positifs (« local false discovery rate » ou ℓFDR).


Abstract

In a multiple testing context, we consider a semiparametric mixture model with two components. One component is assumed to be known and corresponds to the distribution of p-values under the null hypothesis with prior probability θ. The other component f is nonparametric and stands for the distribution under the alternative hypothesis. The problem of estimating the parameters θ and f of the model arises in false discovery rate control procedures. In the first part of this dissertation, we study the estimation of the proportion θ. We discuss asymptotic efficiency results and establish that two different cases occur depending on whether f vanishes on a non-empty interval or not. In the first case, we exhibit estimators converging at parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e. attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at parametric rate. In the second part of the dissertation, we focus on the estimation of the nonparametric unknown component f in the mixture, relying on a preliminary estimator of θ. We propose and study the asymptotic properties of two different estimators for this unknown component. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate.


Contents

1 General Introduction
  1.1 Multiple testing framework
    1.1.1 Multiple testing problem
    1.1.2 An example of multiple testing
    1.1.3 P-value and z-value of test
    1.1.4 Mixture model in multiple testing setup
    1.1.5 Multiple testing procedure
    1.1.6 Type I and II error rates
  1.2 Type I error rate control procedures
    1.2.1 FWER control procedures
    1.2.2 FDR control procedures
  1.3 The FDR estimation approach
    1.3.1 Estimation of pFDR and FDR
    1.3.2 Connection between FDR estimation and FDR control
  1.4 Local false discovery rate
  1.5 Semiparametric inference
    1.5.1 Tangent sets and efficient influence function
    1.5.2 Asymptotically efficient estimator
    1.5.3 Expressions for semiparametric models in a strict sense
    1.5.4 One-step estimator method
    1.5.5 The infinite bound case
  1.6 Organization

2 Estimation of the proportion of true null hypotheses
  2.1 Introduction
  2.2 Lower bounds for the quadratic risk and efficiency
  2.3 Upper bounds for the quadratic risk and efficiency (when δ > 0)
    2.3.1 A histogram based estimator
    2.3.2 Celisse and Robin [2010]'s procedure
    2.3.3 One-step estimators
  2.4 Simulations
  2.5 Proofs of main results
    2.5.1 Proof of Proposition 2.1
    2.5.2 Proof of Theorem 2.1
    2.5.3 Proofs from Sections 2.3.1 and 2.3.3
    2.5.4 Proof of Theorem 2.3
  2.6 Proofs of technical lemmas
    2.6.1 Proof of Lemma 2.1
    2.6.2 Proof of Lemma 2.2
    2.6.3 Proof of Lemma 2.3

3 Estimation of the density of the alternative
  3.1 Introduction
  3.2 Algorithmic procedures to estimate the density f
    3.2.1 Direct procedures
    3.2.2 Iterative procedures
  3.3 Mathematical properties of the algorithms
    3.3.1 Randomly weighted kernel estimator
    3.3.2 Maximum smoothed likelihood estimator
  3.4 Estimation of local false discovery rate and simulation study
    3.4.1 Estimation of local false discovery rate
  3.5 Proofs of main results
    3.5.1 Proof of Theorem 3.1
    3.5.2 Other proofs
  3.6 Proofs of technical lemmas
    3.6.1 Proof of Lemma 3.3
    3.6.2 Proof of Lemma 3.4
    3.6.3 Proof of Lemma 3.5
    3.6.4 Proof of Lemma 3.6

4 Another semiparametric mixture model
  4.1 Identifiability
  4.2 Efficient information matrix for estimating θ
  4.3 Perspectives

Bibliography

Appendix

A Adaptive estimation via Lepski's method
  A.1 Lepski's method


Chapter 1

General Introduction

This overview briefly describes the main components of this dissertation, including the multiple testing framework, type I error rate control procedures, the FDR estimation approach, the local false discovery rate and semiparametric inference. The last concept is the central motivation of this dissertation. This introduction borrows some material from Roquain (2011), Storey (2002, 2004) and van der Vaart (1998, 2002).

Contents

1.1 Multiple testing framework
1.2 Type I error rate control procedures
1.3 The FDR estimation approach
1.4 Local false discovery rate
1.5 Semiparametric inference
1.6 Organization

1.1 Multiple testing framework

1.1.1 Multiple testing problem

The problem of multiple testing has a long history in the statistics literature. Microarray analysis [Dudoit and van der Laan, 2008], astrophysics [Meinshausen and Rice, 2006] and neuroimaging [Turkheimer et al., 2001] are some areas in which multiple testing problems occur. We first recall the basic paradigm for single-hypothesis testing. We wish to test a null hypothesis H0 versus an alternative H1 based on a statistic X. For a given rejection region Γ, we reject H0 when X ∈ Γ and we accept H0 when X ∉ Γ. A type I error occurs when the null hypothesis H0 is true but is rejected, while a type II error occurs when the null hypothesis is false but is accepted. To choose Γ, the acceptable type I error is set at some level α; then all rejection regions with a type I error less than or equal to α are considered, and the one with the lowest type II error is chosen. The rejection region is therefore sought with respect to controlling the type I error. Precisely, we find a rejection region with nearly optimal power (power = 1 − type II error rate) while maintaining the desired α-level type I error.

Now, for multiple-hypothesis testing, the situation becomes much more complicated. For instance, suppose we simultaneously test n = 10,000 null hypotheses, of which n0 = 8,000 are true nulls, at level α = 0.05 for each test. This procedure makes on average n0 α = 400 false positives (type I errors). This is unsuitable because it is likely to select a lot of false positives, and it becomes unclear how we should measure the overall error rate. A multiple testing procedure aims at correcting a priori the level of the single tests in order to keep the "quantity" of false positives below a nominal level α. The "quantity" of false positives is measured using global type I error rates, for instance the probability of making at least one type I error among all the hypotheses (family-wise error rate, FWER) or the expected proportion of false positives among all rejected hypotheses (false discovery rate, FDR).
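The arithmetic above can be checked with a short simulation; this is a hypothetical sketch (variable names are ours), with n, n0 and α taken from the example:

```python
import numpy as np

# Testing n hypotheses, of which n0 are true nulls, each at level alpha.
n, n0, alpha = 10_000, 8_000, 0.05

# Each true null yields a p-value uniform on [0, 1], so it is rejected
# with probability alpha; the expected number of false positives is n0 * alpha.
expected_fp = n0 * alpha  # 400 false positives on average

# Simulation check: draw null p-values and count the rejections.
rng = np.random.default_rng(0)
null_pvals = rng.uniform(size=n0)
observed_fp = int(np.sum(null_pvals <= alpha))
```

The observed count fluctuates around 400 with a standard deviation of about 20, which illustrates why per-test level α is unsuitable at this scale.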

1.1.2 An example of multiple testing

In a microarray experiment, the expression levels of a set of genes are measured under two different experimental conditions, and we aim at finding the genes that are differentially expressed between the two conditions. For instance, if the genes come from tumor cells in the first experimental condition and from healthy cells in the second, the differentially expressed genes may be involved in the development of this tumor and thus are genes of special interest. The problem of finding differentially expressed genes can be formalized as a particular case of a general two-sample multiple testing problem. Let us observe two independent samples
$$(Y_1, \dots, Y_{n_1}) \in \mathbb{R}^{n \times n_1} \quad \text{and} \quad (Z_1, \dots, Z_{n_2}) \in \mathbb{R}^{n \times n_2},$$
where $(Y_1, \dots, Y_{n_1})$ is a family of $n_1$ iid copies of a random vector Y in $\mathbb{R}^n$ and $(Z_1, \dots, Z_{n_2})$ is a family of $n_2$ iid copies of a random vector Z in $\mathbb{R}^n$. In the context of microarray data, $Y_{ij}$ (resp. $Z_{ij}$) is the expression level of the i-th gene for the j-th individual of the first (resp. second) experimental condition. Suppose that $Y \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $Z \sim \mathcal{N}(\mu_2, \Sigma_2)$, where $\mu_1 = (\mu_{11}, \dots, \mu_{1n})$ and $\mu_2 = (\mu_{21}, \dots, \mu_{2n})$ are mean vectors of $\mathbb{R}^n$, and $\Sigma_1$ and $\Sigma_2$ are diagonal covariance matrices. The index set $\{1 \le i \le n : \mu_{1i} \ne \mu_{2i}\}$ corresponds to differentially expressed genes. Then we aim at testing simultaneously the n hypotheses
$$H_{0,i}: \mu_{1i} = \mu_{2i} \quad \text{versus} \quad H_{1,i}: \mu_{1i} \ne \mu_{2i}, \quad \text{for } 1 \le i \le n.$$


The individual test statistic is the classical two-sample t-statistic
$$X_i = \frac{\bar{Y}_i - \bar{Z}_i}{\hat{\sigma}_i \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \quad \text{where} \quad \hat{\sigma}_i^2 = \frac{(n_1 - 1)\hat{\sigma}_{1i}^2 + (n_2 - 1)\hat{\sigma}_{2i}^2}{n_1 + n_2 - 2},$$
and $\bar{Y}_i, \hat{\sigma}_{1i}^2$ (resp. $\bar{Z}_i, \hat{\sigma}_{2i}^2$) are the sample mean and the sample variance of the data $\{Y_{ij}\}_j$ (resp. $\{Z_{ij}\}_j$).
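As an illustration, the statistic for a single gene can be computed as follows; this is a minimal sketch and the helper name `two_sample_t` is ours, not the author's:

```python
import numpy as np

def two_sample_t(y, z):
    """Classical pooled two-sample t-statistic for a single gene.

    y, z: 1-D arrays of expression levels under the two experimental
    conditions (n1 and n2 individuals respectively).
    """
    n1, n2 = len(y), len(z)
    # Pooled variance estimate, as in the formula above; ddof=1 gives the
    # unbiased sample variances sigma_hat_{1i}^2 and sigma_hat_{2i}^2.
    s2 = ((n1 - 1) * np.var(y, ddof=1) + (n2 - 1) * np.var(z, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(y) - np.mean(z)) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))
```

The statistic is antisymmetric in the two samples and vanishes when the two sample means coincide, as the formula requires.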

1.1.3 P-value and z-value of test

We define the p-value as the probability of observing something as extreme as or more extreme than the observed test statistic, given that the null hypothesis is true. That is, we can consider the p-value as the minimum probability under the null that our test statistic is in the rejection region (i.e., the minimum type I error rate) over the set of nested rejection regions containing the observed test statistic. Formally, we can write the p-value [see Lehmann, 1986] corresponding to an observed test statistic X = x as
$$\text{p-value}(x) = \inf_{\{\Gamma : x \in \Gamma\}} \{P(X \in \Gamma \mid H = 0)\},$$
where $\{\Gamma : x \in \Gamma\}$ is the set of nested rejection regions that contain the observed test statistic x. Any p-value is stochastically bounded by a uniform distribution under the null, namely,
$$P(p_i(X) \le t \mid H = 0) \le t, \quad \text{for all } t \in [0, 1]. \tag{1.1}$$

For example, when the rejection regions Γ are of the form $\{X \ge c\}$, the p-value of X = x is
$$\text{p-value}(x) = \inf_{\{c : x \ge c\}} \{P(X \ge c \mid H = 0)\} = P(X \ge x \mid H = 0) = 1 - G_0(x),$$
where $G_0$ is the cumulative distribution function (CDF) of the test statistic X under the null hypothesis. If the distribution of the statistic $X_i$ is absolutely continuous, (1.1) holds with equality; that is, the p-values are exactly distributed like a uniform variable on [0, 1] when H0 is true.
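For a concrete instance (a hypothetical one-sided test, with a standard normal null so that G0 = Φ), the computation reads:

```python
import math

def phi_cdf(x):
    # CDF of the standard normal distribution, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value_one_sided(x):
    # Rejection regions of the form {X >= c}:
    # p-value(x) = P(X >= x | H = 0) = 1 - G0(x).
    return 1.0 - phi_cdf(x)
```

For x = 0 this gives 0.5, and larger observed statistics give smaller p-values, matching the monotonicity used in Remark 1.1 below.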

Remark 1.1. When we reject the null hypotheses on the basis of p-values, all rejection regions are of the form [0, γ] for some γ > 0.

Indeed, according to the definition of the p-value, for two p-values $p_1$ and $p_2$, the relation $p_1 \le p_2$ implies that the respective observed statistics $x_1$ and $x_2$ are such that $x_2 \in \Gamma$ implies $x_1 \in \Gamma$.

We now define the z-value of a test as the probit transformation $Z = \text{probit}(P) = \Phi^{-1}(P)$, where P is a p-value and Φ is the CDF of the standard normal distribution.

1.1.4 Mixture model in multiple testing setup

Suppose that we are testing n identical hypotheses $H_1, \dots, H_n$ with observed statistics $X_1, \dots, X_n$. Identical tests mean that the same rejection region type is used for each test. We let $H_i = 0$ when the null hypothesis i is true and $H_i = 1$ otherwise. We denote by $T_i = T(X_i)$ a transformation of the test statistic; for example, $T_i$ is the p-value $P_i$, the z-value $Z_i$, the local false discovery rate $\ell\text{FDR}(X_i)$ (defined below), or the test statistic $X_i$ itself. We assume that the nulls $T_i \mid H_i = 0$ and the alternatives $T_i \mid H_i = 1$ are identically distributed with respective distribution functions $G_0$, which is known, and $G_1$, which is unknown. Finally, we assume that the $H_i$ are Bernoulli random variables with an unknown probability $P(H_i = 0) = \theta$. The marginal distribution of each $T_i$ is thus a mixture
$$G(x) = \theta G_0(x) + (1 - \theta) G_1(x),$$
and we denote by $g = \theta g_0 + (1 - \theta) g_1$ the corresponding probability density function (pdf) of $T_i$ (if it exists). When we assume that the statistics $X_i$ under the null hypotheses are continuous variables, the p-values under the null hypotheses follow the uniform distribution $\mathcal{U}([0, 1])$ on the interval [0, 1], and the marginal distribution of each p-value is
$$F(x) = \theta x + (1 - \theta) F_1(x), \quad \text{for } x \in [0, 1];$$
we denote the corresponding pdf by $f(x) = \theta \mathbf{1}_{[0,1]}(x) + (1 - \theta) f_1(x)$, where $f_1$ is an unknown pdf on [0, 1]. If we consider the transformation $T_i$ as the z-value $Z_i$, then
$$G_0(x) = P_{H_0}(Z_i \le x) = P_{H_0}(P_i \le \Phi(x)) = \Phi(x)$$
and
$$G_1(x) = P_{H_1}(Z_i \le x) = P_{H_1}(P_i \le \Phi(x)) = F_1(\Phi(x)).$$
Thus the pdf of $Z_i$ is $g(x) = \phi(x)[\theta + (1 - \theta) f_1(\Phi(x))]$, for $x \in \mathbb{R}$, where $\phi$ is the pdf of the standard normal distribution.
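The two-group model can be simulated directly. The sketch below assumes, purely for illustration, an alternative in which the test statistic has mean µ > 0 under H1 (the function names and this choice of f1 are ours, not the author's):

```python
import math
import numpy as np

def norm_sf(x):
    """Survival function 1 - Phi of the standard normal, vectorized."""
    x = np.atleast_1d(x)
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in x])

def simulate_pvalues(n, theta, mu, rng):
    """Draw (p-values, null indicators) from the two-component mixture
    f = theta * 1_[0,1] + (1 - theta) * f1, where f1 is induced by a
    one-sided normal test with effect size mu under the alternative."""
    is_null = rng.uniform(size=n) < theta           # H_i = 0 with prob. theta
    stats = rng.normal(loc=np.where(is_null, 0.0, mu))
    return norm_sf(stats), is_null
```

Under the null the simulated p-values are uniform on [0, 1]; under the alternative they pile up near 0, which is the qualitative shape assumed for f1 throughout this chapter.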


1.1.5 Multiple testing procedure

A multiple testing procedure (MTP) provides rejection regions, i.e., sets of values for each $T_i$ that lead to the decision to reject the corresponding null hypothesis $H_i$. In other words, an MTP produces a random subset R of $\{1, \dots, n\}$ whose selected indexes correspond to the rejected null hypotheses. A multiple testing setting includes the p-value family $p = \{p_i, 1 \le i \le n\} \in [0, 1]^n$. The multiple testing procedure based on p is defined as a set-valued function
$$R : p = (p_i)_{1 \le i \le n} \in [0, 1]^n \mapsto R(p) \subset \{1, \dots, n\},$$
taking as input an element of $[0, 1]^n$ and returning a subset of $\{1, \dots, n\}$. The indexes selected by the procedure R(p) correspond to the rejected null hypotheses. When we focus on the case of identical tests based on the p-value family, one procedure, called a thresholding-based procedure, is of the form $R(p) = \{1 \le i \le n : p_i \le t(p)\}$, where the threshold $t(\cdot) \in [0, 1]$ can depend on the data.

1.1.6 Type I and II error rates

To measure the quality of a multiple testing procedure, various error rates have been proposed in the literature. These rates evaluate the importance of the null hypotheses wrongly rejected, that is, the number of false positives (FP). The two error measures most commonly used in multiple-hypothesis testing are the family-wise error rate (FWER) and the false discovery rate (FDR). Moreover, the false discovery proportion (FDP) is also a widely used type I error. The definitions of these rates are recalled in the following. First, the outcome of testing n hypotheses simultaneously can be summarized as indicated in Table 1.1.

Table 1.1: Possible outcomes from testing n hypotheses $H_1, \dots, H_n$.

               | Accepts H_i | Rejects H_i | Total
  H_i is true  |     TN      |     FP      |  n_0
  H_i is false |     FN      |     TP      |  n_1
  Total        |      W      |      R      |   n

The family-wise error rate (FWER) is defined as the probability of making at least one false positive among all the hypotheses,
$$\text{FWER} = P(\text{FP} \ge 1).$$
The false discovery proportion (FDP) is defined as the proportion of false positives among the rejected hypotheses,
$$\text{FDP} = \frac{\text{FP}}{\max(R, 1)}.$$
Let us remark that the FDP is a random variable; it does not define an error rate. Benjamini and Hochberg [1995] define the false discovery rate (FDR) as the expectation of the FDP,
$$\text{FDR} = E\Big[\frac{\text{FP}}{\max(R, 1)}\Big] = E\Big[\frac{\text{FP}}{R} \,\Big|\, R > 0\Big] P(R > 0).$$

They provided sequential p-value methods to control this quantity. The FDR offers a much less strict multiple-testing criterion than the FWER and therefore leads to an increase in power. Storey [2003] proposes to modify the FDR so as to obtain a new criterion, the positive FDR (or pFDR), defined by
$$\text{pFDR} = E\Big[\frac{\text{FP}}{R} \,\Big|\, R > 0\Big],$$
and argues that it is conceptually more sound than the FDR. Indeed, when controlling the FDR at level α and positive findings have occurred, the FDR has really only been controlled at level α/P(R > 0). This can be quite dangerous, and it is not the case for the pFDR. Another similar measure is the marginal FDR (mFDR), defined as
$$\text{mFDR} = \frac{E(\text{FP})}{E(R)}.$$
Under weak conditions, Genovese and Wasserman [2002] showed that $\text{mFDR} = \text{FDR} + O(n^{-1/2})$, and Storey [2003] proved that the mFDR and pFDR are identical. An analog of the FDR in terms of false negatives (type II errors) is the false nondiscovery rate (FNR), defined as
$$\text{FNR} = E\Big[\frac{\text{FN}}{\max(W, 1)}\Big] = E\Big[\frac{\text{FN}}{W} \,\Big|\, W > 0\Big] P(W > 0).$$
Similarly, we define the positive false nondiscovery rate (pFNR) as the conditional expectation
$$\text{pFNR} = E\Big[\frac{\text{FN}}{W} \,\Big|\, W > 0\Big].$$
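The FDP of a single realization is a simple ratio of the counts in Table 1.1; a minimal helper (our naming) makes the definition concrete:

```python
import numpy as np

def fdp(rejected, is_null):
    """False discovery proportion FDP = FP / max(R, 1) for one realization.

    rejected, is_null: boolean arrays over the n hypotheses.
    """
    r = int(np.sum(rejected))                 # R: number of rejections
    fp = int(np.sum(rejected & is_null))      # FP: true nulls rejected
    return fp / max(r, 1)
```

The FDR is then the expectation of this random quantity over replications, and the pFDR conditions on R > 0.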

1.2 Type I error rate control procedures

A multiple testing control procedure aims at finding a rejection region whose type I error rate is no larger than a certain level. There is a well-defined relationship between the two type I error rates, the FDR and the FWER. To see this, we write
$$E\Big[\frac{\text{FP}}{\max(R, 1)}\Big] = E\Big[\frac{\text{FP}}{R} \,\Big|\, \text{FP} \ge 1\Big] P(\text{FP} \ge 1) + 0 \cdot P(\text{FP} = 0) \le P(\text{FP} \ge 1),$$
so the FDR is less than or equal to the FWER. This implies that any procedure that controls the FWER also controls the FDR. The reverse, however, is not true: control of the FDR does not generally imply control of the FWER.

1.2.1 FWER control procedures

Hochberg and Tamhane [1987] describe a variety of FWER-controlling methods based on cut-off rules for ordered p-values. Westfall and Young [1993] provide resampling-based multiple testing procedures for controlling the FWER. We only present here some classical procedures to control the FWER. Bonferroni [1936]'s procedure is perhaps the best-known procedure in the multiple testing literature. It controls the FWER for arbitrary joint null distributions of the test statistics.

Bonferroni [1936]'s procedure. The Bonferroni procedure rejects any null hypothesis $H_i$ with a p-value less than or equal to the common threshold t(p) = α/n. That is, the set of rejected null hypotheses is $R(p) = \{1 \le i \le n : p_i \le \alpha/n\}$. This procedure controls the FWER under arbitrary conditions. That is,
$$\text{FWER} = P(\exists \text{ a false positive}) = P\Big(\bigcup_{i : H_{0,i} \text{ is true}} \{p_i \le t(p)\}\Big) \le \sum_{i : H_{0,i} \text{ is true}} P(p_i \le t(p)) \le n_0 \frac{\alpha}{n} \le \alpha.$$

Closely related to Bonferroni [1936]'s procedure is Šidák [1967]'s procedure, which guarantees control of the FWER for test statistics distributions that satisfy Šidák's inequality. It rejects any null hypothesis $H_i$ with a p-value less than or equal to the common threshold $t(p) = 1 - (1 - \alpha)^{1/n}$. Since $\alpha/n \le 1 - (1 - \alpha)^{1/n}$, Šidák [1967]'s procedure is more powerful than Bonferroni [1936]'s one. In other words, using Šidák [1967]'s procedure, we reject a larger number of hypotheses while controlling the same error rate, which leads to larger power. Besides, there are other procedures that intend to control the family-wise error rate and are more powerful than Bonferroni [1936]'s procedure; among those procedures, we can recall Holm [1979]'s step-down procedure.
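Both single-step procedures reduce to comparing every p-value with a common threshold; a sketch (function names ours):

```python
import numpy as np

def bonferroni_reject(pvals, alpha):
    # Common threshold t(p) = alpha / n, valid under arbitrary dependence.
    return np.asarray(pvals) <= alpha / len(pvals)

def sidak_reject(pvals, alpha):
    # Common threshold t(p) = 1 - (1 - alpha)^(1/n); slightly larger than
    # alpha / n, hence uniformly more rejections than Bonferroni.
    n = len(pvals)
    return np.asarray(pvals) <= 1.0 - (1.0 - alpha) ** (1.0 / n)
```

Since the Šidák threshold dominates the Bonferroni one, every Bonferroni rejection is also a Šidák rejection, which is the power comparison stated above.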


1.2.2 FDR control procedures

A common criticism of multiple testing procedures designed to control the FWER is their lack of power, especially for large-scale testing problems such as those encountered in biomedical and genomic research. In many situations, control of the FWER can lead to unduly conservative procedures. In current areas of application of multiple testing procedures, such as gene expression studies based on microarray experiments, thousands of tests are performed simultaneously and a fairly large proportion of null hypotheses are expected to be false. In this context, Type I error rates based on the proportion of false positives among the rejected hypotheses (FDR) may be more appropriate than error rates based on the absolute number of Type I errors (FWER).

Benjamini and Hochberg [1995] provided a linear step-up procedure (the BH procedure) which controls the FDR at a certain level α.

A linear step-up procedure (the BH procedure). Consider testing $H_1, \dots, H_n$ based on the corresponding p-values $p_1, \dots, p_n$:

– Step 1: let $p_{(1)} \le \dots \le p_{(n)}$ be the ordered p-values and denote by $H_{(i)}$ the null hypothesis corresponding to $p_{(i)}$;

– Step 2: calculate $\hat{k} = \max\{1 \le i \le n : p_{(i)} \le i\alpha/n\}$;

– Step 3: if $\hat{k}$ exists, then reject all $H_{(i)}$ for $i = 1, \dots, \hat{k}$; otherwise reject nothing.
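The three steps translate directly into code; this is a sketch and `bh_rejections` is our name for it:

```python
import numpy as np

def bh_rejections(pvals, alpha):
    """Benjamini-Hochberg linear step-up procedure.

    Returns a boolean mask marking the rejected hypotheses.
    """
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)                       # Step 1: order the p-values
    thresholds = np.arange(1, n + 1) * alpha / n    # i * alpha / n
    below = np.nonzero(pvals[order] <= thresholds)[0]
    reject = np.zeros(n, dtype=bool)
    if below.size > 0:                              # Step 2: k_hat exists
        k_hat = int(below[-1]) + 1                  # largest such i (1-based)
        reject[order[:k_hat]] = True                # Step 3: reject H_(1..k_hat)
    return reject
```

Note the step-up character: a p-value may exceed its own threshold and still be rejected, provided some larger p-value passes its threshold.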

Benjamini and Hochberg [1995] prove that this procedure controls the FDR for independent test statistics. The subsequent article of Benjamini and Yekutieli [2001] establishes FDR control for test statistics with the positive dependence structure called positive regression dependence on a subset. Since Benjamini and Hochberg [1995]'s article, many authors have proposed a variety of multiple testing procedures for controlling the FDR. We first describe an adaptive linear step-up procedure proposed by Benjamini and Hochberg [2000].

An adaptive linear step-up procedure. Note that the BH procedure in fact controls the FDR at level θα under independence or positive dependence conditions; this suggests the use of the following adaptive procedure, which depends on an estimator of θ:

– Step 1: compute an estimator $\hat{\theta}_n$ of θ;

– Step 2: apply the BH linear step-up procedure at level $\alpha/\hat{\theta}_n$. That is, reject all $H_{(i)}$ for $i = 1, \dots, \hat{l}$, where
$$\hat{l} = \max\Big\{i : p_{(i)} \le \frac{i\alpha}{n\hat{\theta}_n}\Big\}.$$

Now suppose that we take the most conservative estimate $\hat{\theta}_n = 1$; then
$$\hat{l} = \max\Big\{i : p_{(i)} \le \frac{i\alpha}{n}\Big\} = \hat{k},$$
which means that the adaptive linear step-up procedure coincides with the BH linear step-up procedure in this case. Moreover, if we take a better estimator $\hat{\theta}_n < 1$, then $\hat{l} \ge \hat{k}$. In other words, using the adaptive linear step-up procedure, we reject a larger number of hypotheses while controlling the same error rate, which leads to larger power.

Since the p-values associated with the false null hypotheses are likely to be small, and a large majority of the p-values in the interval [λ, 1], for λ not too small, should correspond to the true null hypotheses, Schweder and Spjøtvoll [1982] suggested a procedure to estimate θ that depends on the unspecified parameter λ. This estimator is equal to the proportion of p-values larger than the threshold λ, divided by 1 − λ, namely
$$\hat{\theta}_n(\lambda) = \frac{\#\{P_i > \lambda : 1 \le i \le n\}}{n(1 - \lambda)}. \tag{1.2}$$
Benjamini and Hochberg [2000] used this estimator to propose an adaptive linear step-up procedure. They also showed that this adaptive procedure has higher power than the BH one, and Storey et al. [2004] provided a proof that it controls the FDR at level α. Note that $\hat{\theta}_n(\lambda)$ is a conservative estimator of θ (meaning that $\hat{\theta}_n(\lambda)$ overestimates θ). Moreover, small values of λ typically produce estimators with higher bias but lower variance, whereas large values of λ yield low-bias and high-variance estimators. There exist many methods to choose the value of λ, and the most popular choice is to let λ = 1/2. Recently, Liang and Nettleton [2012] have summed up many existing adaptive procedures under two different strategies to select λ: the first includes the adaptive procedures that use predetermined values of λ, and the second includes the dynamic adaptive procedures where the parameter λ is determined by the data.
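Estimator (1.2) is one line of code (our naming):

```python
import numpy as np

def theta_hat(pvals, lam=0.5):
    """Schweder-Spjotvoll estimator (1.2) of the proportion of true nulls:
    the fraction of p-values above lambda, rescaled by 1 / (1 - lambda)."""
    pvals = np.asarray(pvals)
    return np.mean(pvals > lam) / (1.0 - lam)
```

With the popular choice λ = 1/2, p-values above 1/2 are attributed to the (roughly uniform) null component, whose mass on (1/2, 1] is θ/2; rescaling by 1/(1 − λ) recovers θ.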

A plug-in threshold procedure. We now present the FDR controlling method proposed by Genovese and Wasserman (2002, 2004). They consider the threshold
$$t(\theta, F) = \sup\Big\{0 \le t \le 1 : \frac{\theta t}{F(t)} \le \alpha\Big\},$$
where we recall that F is the cumulative distribution function of each p-value $P_i$. Suppose that we reject the null hypotheses whenever the p-value is less than t(θ, F). This threshold depends on the unknown parameters θ and F, so we call t(θ, F) the oracle threshold. From Genovese and Wasserman [2002], it follows that, asymptotically, the FDR is less than α. Moreover, if F is concave, this threshold has the smallest asymptotic FNR among all procedures with FDR less than or equal to α (cf. Genovese and Wasserman [2002]). The standard plug-in method is to estimate the functional t(θ, F) by $t(\hat{\theta}_n, \hat{F})$, where $\hat{\theta}_n$ and $\hat{F}$ are estimators of θ and F. We thus call any threshold of the form $t(\hat{\theta}_n, \hat{F})$ a plug-in threshold. For instance, let $\hat{F}_n$ be the empirical cumulative distribution function of $P_1, P_2, \dots, P_n$. Genovese and Wasserman [2004] showed that, under weak conditions on $\hat{\theta}_n$, the thresholding procedure $t(\hat{\theta}_n, \hat{F}_n)$ asymptotically controls the FDR at level α.
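Because the empirical CDF is piecewise constant, the supremum defining the plug-in threshold can be evaluated over the observed p-values, and the rejection set it induces is the adaptive step-up one. A sketch (our naming):

```python
import numpy as np

def plugin_threshold(pvals, theta_hat, alpha):
    """Largest observed p-value t with theta_hat * t / F_hat_n(t) <= alpha
    (0.0 if there is none); thresholding the p-values at this value gives
    the same rejections as t(theta_hat, F_hat_n)."""
    p = np.sort(np.asarray(pvals))
    n = len(p)
    ecdf_at_p = np.arange(1, n + 1) / n          # F_hat_n(p_(i)) = i / n
    ok = np.nonzero(theta_hat * p / ecdf_at_p <= alpha)[0]
    return float(p[ok[-1]]) if ok.size > 0 else 0.0
```

At the sorted values the condition reads θ̂ p_(i) n / i ≤ α, i.e. p_(i) ≤ iα/(nθ̂), which is exactly the adaptive step-up condition of the previous paragraphs.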

One-stage and two-stage adaptive procedures. Blanchard and Roquain [2009] propose two FDR control procedures, called one-stage and two-stage adaptive step-up procedures. In their one-stage procedure, they reject all null hypotheses for which $p_i \le p_{(k)}$, where
$$k = \max\Big\{i : p_{(i)} \le \min\Big(\frac{(1 - \lambda) i \alpha}{m - i + 1}, \lambda\Big)\Big\} = \max\Big\{i : p_{(i)} \le \frac{i\alpha}{m} \min\Big(\frac{(1 - \lambda) m}{m - i + 1}, \frac{\lambda m}{i\alpha}\Big)\Big\},$$
for a fixed constant λ ∈ (0, 1). They focus on the choice λ = α; this procedure can then be viewed as an adaptive linear step-up procedure with θ-estimator defined as
$$\hat{\theta}_n(i) = \max\Big(\frac{m - i + 1}{(1 - \alpha) m}, \frac{i}{m}\Big).$$
Their two-stage procedure is defined as an adaptive linear step-up procedure with θ-estimator given by
$$\hat{\theta}_n^{BR}(\lambda) = \frac{m - R^{BR}(\lambda) + 1}{(1 - \alpha) m},$$
where $R^{BR}(\lambda)$ is the number of rejections that result from using the one-stage adaptive step-up procedure at level λ ∈ (0, 1). These two procedures are proved to be competitive with previously existing ones under the assumption of independence of the p-values. Moreover, the authors propose some adaptive step-up procedures that have provably controlled FDR under positive dependence and unspecified dependence of the p-values, respectively (for more detail, we refer to Blanchard and Roquain [2009]).
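A sketch of both stages (our naming; m denotes the number of hypotheses, as in the formulas above):

```python
import numpy as np

def br_one_stage_rejections(pvals, alpha, lam):
    """Number of rejections of the Blanchard-Roquain one-stage adaptive
    step-up: k = max{ i : p_(i) <= min((1-lam)*i*alpha/(m-i+1), lam) }."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    i = np.arange(1, m + 1)
    thr = np.minimum((1.0 - lam) * i * alpha / (m - i + 1), lam)
    ok = np.nonzero(p <= thr)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1

def br_two_stage_theta(pvals, alpha):
    """theta-estimator of the two-stage procedure, running the first stage
    at level lambda = alpha."""
    m = len(pvals)
    r = br_one_stage_rejections(pvals, alpha, alpha)
    return (m - r + 1) / ((1.0 - alpha) * m)
```

With no first-stage rejections the estimator equals (m + 1)/((1 − α)m), slightly above 1, which recovers (up to the +1 term) the conservative BH behavior.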


1.3 The FDR estimation approach

1.3.1 Estimation of pFDR and FDR

Rather than searching for a p-value threshold that guarantees FDR control at a specified level α, Storey (2002, 2004) proposed to estimate the FDR for a fixed rejection region and provided a family of conservative point estimators. The following is Theorem 1 from Storey [2002]. It allows us to write the pFDR in a very simple form that does not depend on n.

Theorem 1.1. Suppose that n identical hypothesis tests are performed with independent statistics $X_1, \dots, X_n$ and rejection region Γ. Then
$$\text{pFDR}(\Gamma) = \frac{\theta P(X \in \Gamma \mid H = 0)}{P(X \in \Gamma)} = P(H = 0 \mid X \in \Gamma).$$

In terms of p-values, instead of denoting rejection regions by Γ, we denote them by γ, which refers to the interval [0, γ]. Then the pFDR can be written as
$$\text{pFDR}(\gamma) = \frac{\theta P(P \le \gamma \mid H = 0)}{P(P \le \gamma)} = \frac{\theta \gamma}{F(\gamma)},$$
where P is the random p-value resulting from any test. The FDR can be computed as FDR(γ) = pFDR(γ) P(R > 0), where
$$P(R > 0) = 1 - P(R = 0) = 1 - P(\forall i, P_i > \gamma) = 1 - [1 - P(P \le \gamma)]^n = 1 - [1 - F(\gamma)]^n.$$
Thus, the pFDR and FDR are asymptotically equivalent for a fixed rejection region; precisely, we have
$$\text{pFDR}(\gamma) - \text{FDR}(\gamma) = \text{pFDR}(\gamma)[1 - F(\gamma)]^n \xrightarrow[n \to \infty]{} 0.$$

It is then natural to use the same estimates for $\mathrm{FDR}(\gamma)$ and $\mathrm{pFDR}(\gamma)$. For a given estimator $\hat\theta_n$ of $\theta$, we estimate $\mathrm{FDR}(\gamma)$ by
\[
\widehat{\mathrm{FDR}}(\gamma) = \frac{\hat\theta_n \gamma}{\hat F_n(\gamma)},
\]
where $\hat F_n$ is the empirical distribution function of $P_1, \dots, P_n$. For example, Storey [2002] considers a conservative estimate of $\theta$ that depends on a tuning parameter $\lambda$ and is defined as in (1.2); he then proposes to estimate $\mathrm{FDR}(\gamma)$ by
\[
\widehat{\mathrm{FDR}}_\lambda(\gamma) = \frac{\hat\theta_n(\lambda) \gamma}{\hat F_n(\gamma)}.
\]
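In code, this plug-in estimate could be sketched as follows; $\hat\theta_n(\lambda)$ is taken to be the proportion of $p$-values above $\lambda$ rescaled by $1-\lambda$ (the standard form of Storey's estimator), and flooring the empirical cdf at $1/n$ to avoid division by zero is our convention, not a prescription from the text:

```python
import numpy as np

def theta_hat(pvals, lam):
    """Storey's estimator: proportion of p-values above lam, rescaled by
    the length 1 - lam of the interval they fall in."""
    return np.mean(pvals > lam) / (1 - lam)

def fdr_hat(pvals, gamma, lam):
    """Plug-in estimate theta_hat(lam) * gamma / F_n(gamma); the empirical
    cdf in the denominator is floored at 1/n (our convention)."""
    n = len(pvals)
    Fn = max(np.mean(pvals <= gamma), 1.0 / n)
    return theta_hat(pvals, lam) * gamma / Fn
```

The same function estimates both FDR and pFDR, by the asymptotic equivalence above.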


Note that a good $\theta$-estimator is very important, as a conservative $\theta$-estimator in general leads to a conservative FDR estimator, which can be used to control the FDR; this point was well illustrated through the work of Storey [2002] and Storey et al. [2004]. We can also refer to Benjamini et al. [2006] for more details on this point.

1.3.2 Connection between FDR estimation and FDR control

Most FDR research has focused on FDR control instead of FDR estimation. However, there is a connection between these two approaches. Let us first note that

\[
\hat{l} = \max\Big\{i : p_{(i)} \le \frac{i\alpha}{n\hat\theta_n}\Big\} = \max\Big\{i : \frac{n\hat\theta_n p_{(i)}}{i} \le \alpha\Big\} = \max\big\{i : \widehat{\mathrm{FDR}}(p_{(i)}) \le \alpha\big\},
\]
i.e., the adaptive linear step-up procedure is equivalent to finding the largest $p$-value $p_{(\hat l)}$ such that $\widehat{\mathrm{FDR}}(p_{(\hat l)}) \le \alpha$. The FDR estimation approach can thus be viewed as the "inverse problem" of the FDR control approach. For any function $h$ defined on $[0, 1]$, let the step-up thresholding function be

\[
t_\alpha(h) = \sup\{0 \le t \le 1 : h(t) \le \alpha\}.
\]

Then the threshold of the adaptive linear step-up procedure is exactly $t_\alpha(\widehat{\mathrm{FDR}})$. Similarly, the oracle threshold of the plug-in threshold procedure can be written
\[
t(\theta, F) = \sup\Big\{0 \le t \le 1 : \frac{\theta t}{F(t)} \le \alpha\Big\} = \sup\{0 \le t \le 1 : \mathrm{pFDR}(t) \le \alpha\} = t_\alpha(\mathrm{pFDR}).
\]
The plug-in threshold procedure is thus identical to the adaptive linear step-up procedure when we apply a common estimate $\hat\theta_n$. We now present how an FDR estimation approach leads to an FDR control approach. Since $t_\alpha(\widehat{\mathrm{FDR}})$ is a random variable, we use the following notation:
\[
\mathrm{FDR}\{t_\alpha(\widehat{\mathrm{FDR}})\} := \mathbb{E}\bigg[\frac{\mathrm{FP}\{t_\alpha(\widehat{\mathrm{FDR}})\}}{R\{t_\alpha(\widehat{\mathrm{FDR}})\}}\bigg].
\]

Storey et al. [2004] and Liang and Nettleton [2012] proposed some FDR estimation approaches such that $\mathrm{FDR}\{t_\alpha(\widehat{\mathrm{FDR}})\} \le \alpha$. Therefore, these thresholding procedures $t_\alpha(\widehat{\mathrm{FDR}})$ control the FDR at level $\alpha$.
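The equivalence between the adaptive linear step-up procedure and thresholding $\widehat{\mathrm{FDR}}$ at $\alpha$ can be checked numerically: since $\hat F_n(p_{(i)}) = i/n$, the two rules below always select the same number of rejections (a sketch with our own helper names):

```python
import numpy as np

def adaptive_step_up(pvals, alpha, theta_hat):
    """Largest i with p_(i) <= i * alpha / (n * theta_hat)."""
    n = len(pvals)
    p = np.sort(pvals)
    ok = np.nonzero(p <= np.arange(1, n + 1) * alpha / (n * theta_hat))[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1

def step_up_via_fdr_estimate(pvals, alpha, theta_hat):
    """Largest i with FDRhat(p_(i)) = n * theta_hat * p_(i) / i <= alpha,
    using that the empirical cdf at p_(i) equals i / n."""
    n = len(pvals)
    p = np.sort(pvals)
    fdr = n * theta_hat * p / np.arange(1, n + 1)
    ok = np.nonzero(fdr <= alpha)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1
```

The two conditions are algebraically identical, so the procedures coincide for any $p$-value sample.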

1.4 Local false discovery rate

Efron et al. [2001] define the local false discovery rate (ℓFDR) to quantify the plausibility of a particular hypothesis being true, given its specific test statistic or p-value. In a mixture


framework, the $\ell$FDR is the Bayes posterior probability
\[
\ell\mathrm{FDR}(x) = \mathbb{P}(H_i \text{ being true} \mid X = x) = 1 - \frac{(1-\theta)g_1(x)}{\theta g_0(x) + (1-\theta)g_1(x)}.
\]

In many multiple testing frameworks, we need information at the individual level about the probability for a given observation to be a false positive [Aubert et al., 2004]. This motivates estimating the local false discovery rate $\ell$FDR. Moreover, another motivation for estimating the parameters in this mixture model comes from the works of Sun and Cai (2009, 2007), who develop adaptive compound decision rules for false discovery rate control. These rules are based on the estimation of the local false discovery rate $\ell$FDR. Let $\mathcal{R}$ be the set of ranked $\widehat{\ell\mathrm{FDR}}(x_i)$:
\[
\mathcal{R} = \{\widehat{\ell\mathrm{FDR}}_{(1)}, \dots, \widehat{\ell\mathrm{FDR}}_{(n)}\}.
\]
Sun and Cai [2007] proposed the following adaptive step-up procedure: let
\[
k = \max\Big\{i : \frac{1}{i}\sum_{j=1}^{i} \widehat{\ell\mathrm{FDR}}_{(j)} \le \alpha\Big\};
\]
then reject all $H_{(i)}$, $i = 1, \dots, k$.

Sun and Cai [2007] showed that this procedure asymptotically attains the performance of an oracle procedure, and in some simulation studies it is more efficient than the conventional $p$-value-based methods, including the step-up procedure of Benjamini and Hochberg [1995] and the plug-in procedure of Genovese and Wasserman [2004]. Moreover, recalling that $z_i$ denotes the $z$-value and $p_i$ the $p$-value, we can write
\[
\ell\mathrm{FDR}(i) := \ell\mathrm{FDR}(z_i) := \frac{\theta\phi(z_i)}{\phi(z_i)[\theta + (1-\theta)f_1(\Phi(z_i))]} = \frac{\theta}{\theta + (1-\theta)f_1(p_i)} =: \ell\mathrm{FDR}(p_i),
\]
thus this procedure is more adaptive than the BH adaptive procedure, in the sense that it adapts to both the $p$-values and the $z$-values.

Let us note that pFDR and $\ell$FDR are analytically related by
\[
\mathrm{pFDR}(\gamma) = \frac{\int_{-\infty}^{\gamma} \ell\mathrm{FDR}(p) f(p)\,dp}{\int_{-\infty}^{\gamma} f(p)\,dp} = \mathbb{E}\{\ell\mathrm{FDR}(P) \mid P \le \gamma\},
\]
so we can estimate pFDR or FDR by
\[
\widehat{\mathrm{FDR}}(p_{(i)}) = \frac{1}{i}\sum_{j=1}^{i} \widehat{\ell\mathrm{FDR}}_{(j)}.
\]
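The $\ell$FDR-based step-up of Sun and Cai [2007] reduces to sorting the estimated $\ell$FDR values and thresholding their running mean at $\alpha$; a minimal sketch:

```python
import numpy as np

def lfdr_step_up(lfdr_values, alpha):
    """Sort the estimated lFDR values, take the largest k whose running
    mean is <= alpha, and reject the k hypotheses with smallest lFDR."""
    s = np.sort(lfdr_values)
    running_mean = np.cumsum(s) / np.arange(1, len(s) + 1)
    ok = np.nonzero(running_mean <= alpha)[0]
    return 0 if ok.size == 0 else int(ok[-1]) + 1
```

The running mean is precisely the estimate $\widehat{\mathrm{FDR}}(p_{(i)})$ above, so this is a plug-in threshold procedure.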


The above adaptive step-up procedure can thus be viewed as a plug-in threshold procedure $t_\alpha(\widehat{\mathrm{FDR}})$. To conclude, let us stress that all FDR control procedures presented in Sections 1.2.2 and 1.4 can also be viewed as plug-in threshold procedures $t_\alpha(\widehat{\mathrm{FDR}})$ with suitable estimates $\widehat{\mathrm{FDR}}$.

1.5 Semiparametric inference

In this section, we recall concepts from semiparametric theory. We follow the notation of Chapter 25, and more particularly Section 25.4, in van der Vaart [1998], and refer to this book for more details. Semiparametric models are statistical models indexed by both a finite-dimensional parameter and an infinite-dimensional one. Precisely, a semiparametric model in a strict sense may have a natural parametrization $(\theta, f) \mapsto P_{\theta,f}$, where $\theta$ is a Euclidean parameter and $f$ belongs to a nonparametric class of distributions. Here, we aim at estimating the value $\psi(P_{\theta,f}) = \theta$ and consider $f$ as a nuisance parameter. We shall recall the theory of asymptotic efficiency for semiparametric models, which extends that for parametric models.

1.5.1 Tangent sets and efficient influence function

We first recall the definition of a tangent set in a general model. In this section, suppose that we observe a random sample $X_1, X_2, \dots, X_n$ from a distribution $P$ which belongs to a set $\mathcal{P}$ of probability measures on some measurable space $(\mathcal{X}, \mathcal{A})$. In particular, we consider a framework that is more general than the semiparametric one. We aim at estimating the value $\psi(P)$ of a functional $\psi : \mathcal{P} \to \mathbb{R}^k$. Assume for simplicity that the parameter to be estimated is one-dimensional ($k = 1$). In parametric models, we have a strict definition of the Fisher information for estimating the parameters. So, what can we say about the information of the model $\mathcal{P}$ for estimating $\psi(P)$? For every smooth parametric submodel $\mathcal{P}_0 \subset \mathcal{P}$ that contains the true distribution $P$, we can calculate its Fisher information for estimating $\psi(P)$. The information for estimating $\psi(P)$ in the whole model is not bigger than the information carried by each of these parametric submodels, so it is certainly not bigger than the infimum of the informations over all submodels. The information for $\mathcal{P}$ is then simply defined as this infimum. In most situations, it suffices to consider one-dimensional submodels $\mathcal{P}_0$. They should pass through the true distribution $P$ and be differentiable in quadratic mean at $P$, a notion which we define now.


Definition 1.1. A differentiable path is a map $t \mapsto P_t$ from a neighbourhood $[0, \varepsilon)$ of $0$ to $\mathcal{P}$ with $P_0 = P$ such that, for some measurable function $g : \mathcal{X} \to \mathbb{R}$,
\[
\int \Big[\frac{dP_t^{1/2} - dP^{1/2}}{t} - \frac{1}{2} g\, dP^{1/2}\Big]^2 \to 0 \quad \text{as } t \to 0. \tag{1.3}
\]
The parametric submodel $\{P_t : 0 \le t < \varepsilon\}$ is called differentiable in quadratic mean at $P$, and the function $g$ is called the score function of the submodel $\{P_t : 0 \le t < \varepsilon\}$.

Letting $t \mapsto P_t$ range over a collection of these submodels, we obtain a collection of score functions, which we call a tangent set of the model $\mathcal{P}$ at $P$. We denote this tangent set by $\dot{\mathcal{P}}_P$. When we consider all possible differentiable paths $t \mapsto P_t$, we obtain the maximal collection of score functions. This set is referred to as the maximal tangent set. A tangent set is usually a cone: if $g \in \dot{\mathcal{P}}_P$ and $a \ge 0$, then $ag \in \dot{\mathcal{P}}_P$, since the path $t \mapsto P_{at}$ has score function $ag$ when $t \mapsto P_t$ has score function $g$. Usually, we construct the submodels $t \mapsto P_t$ such that, for every $x$,
\[
g(x) = \frac{\partial}{\partial t}\Big|_{t=0} \log dP_t(x).
\]

This pointwise differentiability is not required by (1.3). Conversely, given this pointwise differentiability, we are not assured of (1.3). We still need to apply a convergence theorem for integrals, such as the dominated convergence theorem of Lebesgue or the monotone convergence theorem, to obtain this type of convergence in quadratic mean, since we need to interchange limit and integration. The following lemma solves most examples, as stated in van der Vaart [2002].

Lemma 1.1. If $p_t$ is the density function of a probability distribution $P_t$ relative to a fixed measure $\mu$, if $t \mapsto \sqrt{p_t(x)}$ is continuously differentiable in a neighbourhood of $0$, and if $t \mapsto \int \dot{p}_t^2 / p_t \, d\mu$, where $\dot{p}_t = \partial p_t / \partial t$, is finite and continuous in this neighbourhood, then $t \mapsto P_t$ is a differentiable path.

The following lemma gives two fundamental but familiar properties of score functions. These are consequences of the differentiability in quadratic mean. We denote by $L_2(P)$ the space of measurable functions $g : \mathcal{X} \to \mathbb{R}$ with $Pg^2 = \int g^2 dP < \infty$, where almost surely equal functions are identified.

Lemma 1.2. Every score function belongs to the set $\{g \in L_2(P) : Pg = 0\}$.

From this lemma, we can conclude that a tangent set is a subset of the space $L_2(P)$, consisting of functions with mean zero under $P$.


Example (nonparametric model). Suppose that $\mathcal{P}$ consists of all probability distributions on the sample space. Let $g$ be an arbitrary function such that $g \in L_2(P)$ and $Pg = 0$. We consider the submodel given by $t \mapsto p_t(x) = c(t) k(t g(x)) p_0(x)$ for a nonnegative function $k$ with $k(0) = k'(0) = 1$ and $[c(t)]^{-1} = \int k(t g(x)) p_0(x) dx$, where $p_0$ denotes the density of $P$. The function $k(x) = 2(1 + \exp(-2x))^{-1}$ can be used, for example. By a direct calculation, or by using Lemma 1.1, we see that the path $t \mapsto p_t(x)$ is differentiable with corresponding score function $g$. The maximal tangent set therefore coincides with the space $\{g \in L_2(P) : Pg = 0\}$.
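The claim of this example can be checked numerically: with $P$ the uniform distribution on $[0,1]$ (so $p_0 \equiv 1$) and the mean-zero direction $g(x) = x - 1/2$ (our choice), a symmetric finite difference of $t \mapsto \log p_t(x)$ at $t = 0$ recovers $g(x)$:

```python
import numpy as np

k = lambda u: 2.0 / (1.0 + np.exp(-2.0 * u))   # satisfies k(0) = k'(0) = 1
g = lambda x: x - 0.5                          # g in L2(P) with Pg = 0, P = U(0,1)

grid = (np.arange(100_000) + 0.5) / 100_000    # midpoint grid on [0, 1], p0 = 1

def log_density(t, x):
    c_inv = np.mean(k(t * g(grid)))            # [c(t)]^{-1} by midpoint rule
    return np.log(k(t * g(x))) - np.log(c_inv)

# symmetric finite difference of t -> log p_t(x) at t = 0
t = 1e-5
x = np.array([0.1, 0.4, 0.9])
score = (log_density(t, x) - log_density(-t, x)) / (2 * t)
```

The computed `score` agrees with $g(x)$ up to discretization and finite-difference error.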

For defining the information for estimating $\psi(P)$, only those submodels $t \mapsto P_t$ along which the parameter $t \mapsto \psi(P_t)$ is differentiable in an appropriate sense are of interest. A minimal requirement is that the map $t \mapsto \psi(P_t)$ is differentiable at $t = 0$, but we need more. More precisely, a map $\psi : \mathcal{P} \to \mathbb{R}$ is called differentiable at $P$ relative to a given tangent set $\dot{\mathcal{P}}_P$ if there exists a continuous linear map $\dot\psi_P : L_2(P) \to \mathbb{R}$ such that, for every $g \in \dot{\mathcal{P}}_P$ and every submodel $t \mapsto P_t$ with score function $g$,
\[
\frac{\partial \psi(P_t)}{\partial t}\Big|_{t=0} = \lim_{t \to 0} \frac{\psi(P_t) - \psi(P)}{t} = \dot\psi_P g.
\]

The Riesz representation theorem for Hilbert spaces yields the existence of a measurable function $\tilde\psi_P : \mathcal{X} \to \mathbb{R}$ such that
\[
\dot\psi_P g = \langle \tilde\psi_P, g \rangle_{L_2(P)} = \int \tilde\psi_P g\, dP. \tag{1.4}
\]
A function $\tilde\psi_P$ satisfying (1.4) is called an influence function. The Riesz representation theorem ensures uniqueness of $\tilde\psi_P$ when the inner product $\langle \cdot, \cdot \rangle_{L_2(P)}$ is specified for all functions of $L_2(P)$. Here, only the inner products of $\tilde\psi_P$ with elements $g$ of the tangent set $\dot{\mathcal{P}}_P$ are specified, and the tangent set does not span all of $L_2(P)$; therefore the function $\tilde\psi_P$ is not uniquely defined by the functional $\psi$ and the model $\mathcal{P}$. However, using the projection theorem of Hilbert spaces, we can construct a unique $\tilde\psi_P$ contained in $\overline{\mathrm{lin}}\, \dot{\mathcal{P}}_P$, the closure of the linear span of the tangent set. This function is called the efficient influence function. For further reference, when we write $\tilde\psi_P$, we refer to the efficient influence function.

1.5.2 Asymptotically efficient estimator

To motivate the definition of information in the semiparametric setup, we first consider a differentiable parametric submodel $t \mapsto P_t$ with score function $g$. It is easy to show that the Fisher information in this parametric submodel is equal to the variance of the score function $g$, i.e., $I = Pg^2 = \langle g, g \rangle_P$. The Cramér-Rao bound for estimating $\psi(P_t)$ in this submodel, evaluated at $t = 0$, is
\[
\frac{[\partial \psi(P_t)/\partial t\,|_{t=0}]^2}{Pg^2} = \frac{\langle \tilde\psi_P, g \rangle_P^2}{\langle g, g \rangle_P}.
\]

We now present an important lemma.

Lemma 1.3. Suppose that the functional $\psi : \mathcal{P} \to \mathbb{R}$ is differentiable at $P$ relative to a tangent set $\dot{\mathcal{P}}_P$. Then
\[
\sup_{g \in \mathrm{lin}\, \dot{\mathcal{P}}_P} \frac{\langle \tilde\psi_P, g \rangle_P^2}{\langle g, g \rangle_P} = P \tilde\psi_P^2.
\]

Now the special meaning of the efficient influence function becomes clear. The squared norm $P\tilde\psi_P^2$ of the efficient influence function $\tilde\psi_P$ plays the role of the smallest asymptotic variance an estimator of $\psi(P)$ can have. We thus call the number $P\tilde\psi_P^2$ the efficiency bound or the optimal variance.

For every function $g$ in a given tangent set $\dot{\mathcal{P}}_P$, we write $P_{t,g}$ for a corresponding submodel with score function $g$ along which the function $\psi$ is differentiable. The asymptotic minimax risk of an estimator sequence $T_n$ (relative to the tangent set $\dot{\mathcal{P}}_P$) is defined as
\[
\sup_{I} \liminf_{n \to \infty} \sup_{g \in I} P_{1/\sqrt{n},g}\Big[\sqrt{n}\big(T_n - \psi(P_{1/\sqrt{n},g})\big)\Big]^2,
\]
where the first supremum is taken over all finite subsets $I$ of the tangent set $\dot{\mathcal{P}}_P$. We now state the local asymptotic minimax theorem, which gives a lower bound on the asymptotic minimax risk of an arbitrary estimator $T_n$ [see Theorem 25.21 in van der Vaart, 1998].

Theorem 1.2. (Local Asymptotic Minimax, LAM). Let the function $\psi : \mathcal{P} \to \mathbb{R}$ be differentiable at $P$ relative to the tangent cone $\dot{\mathcal{P}}_P$ with efficient influence function $\tilde\psi_P$. If $\dot{\mathcal{P}}_P$ is a convex cone, then for any estimator sequence $T_n$,
\[
\sup_{I} \liminf_{n \to \infty} \sup_{g \in I} P_{1/\sqrt{n},g}\Big[\sqrt{n}\big(T_n - \psi(P_{1/\sqrt{n},g})\big)\Big]^2 \ge P\tilde\psi_P^2. \tag{1.5}
\]
The first supremum is taken over all finite subsets $I$ of the tangent set $\dot{\mathcal{P}}_P$.

An estimator sequence $T_n$ is called regular at $P$ for estimating $\psi(P)$ (relative to $\dot{\mathcal{P}}_P$) if there exists a probability measure $L$ such that
\[
\sqrt{n}\big(T_n - \psi(P_{1/\sqrt{n},g})\big) \underset{P_{1/\sqrt{n},g}}{\rightsquigarrow} L, \quad \text{for every } g \in \dot{\mathcal{P}}_P,
\]
where $\rightsquigarrow_P$ denotes convergence in distribution under $P$. We now state the convolution theorem, which shows that the limit distribution $L$ writes as the convolution between some unknown distribution and the centered Gaussian distribution $\mathcal{N}(0, P\tilde\psi_P^2)$ [see Theorem 25.20 in van der Vaart, 1998].

Theorem 1.3. (Convolution). Let the function $\psi : \mathcal{P} \to \mathbb{R}$ be differentiable at $P$ relative to the tangent cone $\dot{\mathcal{P}}_P$ with efficient influence function $\tilde\psi_P$. Then the asymptotic variance of every regular sequence of estimators is bounded below by $P\tilde\psi_P^2$. Furthermore, if $\dot{\mathcal{P}}_P$ is a convex cone, then every limit distribution $L$ of a regular sequence of estimators can be written $L = U + M$, where $U \sim \mathcal{N}(0, P\tilde\psi_P^2)$ and $M$ is some probability distribution independent of $U$.

According to this theorem, we say that an estimator sequence is asymptotically efficient at $P$ (relative to the tangent set $\dot{\mathcal{P}}_P$) if it is regular at $P$ with limit distribution $L = \mathcal{N}(0, P\tilde\psi_P^2)$; in other words, it is the best regular estimator. The definition of asymptotic efficiency is not absolute, since it is defined relative to a given tangent set. In practice, we aim at finding a tangent set that is big enough, together with an estimator sequence that is asymptotically efficient relative to this tangent set. We end this section on general efficiency theory with an interesting lemma.

Lemma 1.4. Let the functional $\psi : \mathcal{P} \to \mathbb{R}$ be differentiable at $P$ relative to the tangent cone $\dot{\mathcal{P}}_P$ with efficient influence function $\tilde\psi_P$. A sequence of estimators $T_n$ is regular at $P$ with limit distribution $\mathcal{N}(0, P\tilde\psi_P^2)$ if and only if
\[
\sqrt{n}\big(T_n - \psi(P)\big) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde\psi_P(X_i) + o_P(1).
\]

The nice thing about asymptotically efficient estimators is that they have interesting asymptotic properties and are fully characterized by their efficient influence function. First, we note that $T_n$ is consistent, i.e., $T_n \xrightarrow{P} \psi(P)$. In addition, by the central limit theorem and Slutsky's theorem, we obtain that
\[
\sqrt{n}\big(T_n - \psi(P)\big) \rightsquigarrow \mathcal{N}(0, P\tilde\psi_P^2).
\]
This means that an asymptotically efficient estimator is asymptotically normal with asymptotic variance equal to the optimal variance. By Prohorov's theorem, the estimator $T_n$ is also $\sqrt{n}$-consistent, i.e., $\sqrt{n}\big(T_n - \psi(P)\big) = O_P(1)$. We conclude that every asymptotically efficient estimator is $\sqrt{n}$-consistent.


1.5.3 Expressions for semiparametric models in a strict sense

We now focus our attention on semiparametric models in a strict sense, $\mathcal{P} = \{P_{\theta,f} : \theta \in \Theta, f \in \mathcal{F}\}$, with $\Theta \subset \mathbb{R}$ an open set and $\mathcal{F}$ an arbitrary infinite-dimensional set of probability distributions. Our aim is to study the efficiency of an estimator $T_n$ of $\psi(P_{\theta,f}) = \theta$. Thus, we are looking for the efficient influence function $\tilde\psi_{\theta,f}$ in this special setting. We will express the efficient influence function in terms of the efficient score function and the efficient information matrix. As submodels, we use paths of the form $t \mapsto P_{\theta+ta, f_t}$, for given paths $t \mapsto f_t$ in $\mathcal{F}$ and $a \in \mathbb{R}$. The score functions for such submodels will typically have the form of a sum of partial derivatives with respect to the parametric component $\theta$ and the nonparametric component $f$. If $\dot{l}_{\theta,f}$ is the ordinary score function for $\theta$ in the model where $f$ is fixed (as in an ordinary parametric model), then we expect
\[
\frac{\partial}{\partial t}\Big|_{t=0} \log dP_{\theta+ta, f_t} = a \dot{l}_{\theta,f} + g.
\]

The function $g$ has the interpretation of a score function for $f$ when $\theta$ is fixed and typically runs through an infinite-dimensional set. We refer to this set as the tangent set for $f$, and denote it by ${}_f\dot{\mathcal{P}}_{P_{\theta,f}}$. The functional $\psi(P_{\theta+ta, f_t}) = \theta + ta$ is certainly differentiable with respect to $t$ in the ordinary sense, with derivative $a$. However, for differentiability at $P_{\theta,f}$ relative to $\dot{\mathcal{P}}_{P_{\theta,f}}$, we need something more. By definition, $\psi$ is differentiable relative to $\dot{\mathcal{P}}_{P_{\theta,f}}$ if and only if there exists a function $\tilde\psi_{\theta,f}$ such that
\[
a = \frac{\partial}{\partial t}\Big|_{t=0} \psi(P_{\theta+ta, f_t}) = \langle \tilde\psi_{\theta,f}, a \dot{l}_{\theta,f} + g \rangle_{P_{\theta,f}}, \quad \forall a \in \mathbb{R},\ g \in {}_f\dot{\mathcal{P}}_{P_{\theta,f}}.
\]
By putting $a = 0$, we obtain that $\langle \tilde\psi_{\theta,f}, g \rangle_{P_{\theta,f}} = 0$ for all $g \in {}_f\dot{\mathcal{P}}_{P_{\theta,f}}$. Thus, $\tilde\psi_{\theta,f}$ must be orthogonal to the tangent set ${}_f\dot{\mathcal{P}}_{P_{\theta,f}}$ for the nuisance parameter. In particular, the efficient influence function, which we denote again by $\tilde\psi_{\theta,f}$, is orthogonal to the nuisance tangent space ${}_f\dot{\mathcal{P}}_{P_{\theta,f}}$. We shall state a lemma that gives an interesting form for the efficient influence function.

Before doing that, we define the operator $\Pi_{\theta,f} : L_2(P_{\theta,f}) \to \overline{\mathrm{lin}}\, {}_f\dot{\mathcal{P}}_{P_{\theta,f}}$ to be the orthogonal projection onto the closure of the linear span of the nuisance tangent space in $L_2(P_{\theta,f})$. The function defined by $\tilde{l}_{\theta,f} = \dot{l}_{\theta,f} - \Pi_{\theta,f} \dot{l}_{\theta,f}$ is called the efficient score function for $\theta$, and its variance $\tilde{I}_{\theta,f} = P_{\theta,f} \tilde{l}_{\theta,f}^2$ is called the efficient information matrix for $\theta$.

Lemma 1.5. Suppose that for every $a \in \mathbb{R}$ and every $g \in {}_f\dot{\mathcal{P}}_{P_{\theta,f}}$ there exists a path $t \mapsto f_t$ in $\mathcal{F}$ such that
\[
\int \Big[\frac{dP_{\theta+ta, f_t}^{1/2} - dP_{\theta,f}^{1/2}}{t} - \frac{1}{2}(a \dot{l}_{\theta,f} + g)\, dP_{\theta,f}^{1/2}\Big]^2 \to 0 \quad \text{as } t \to 0.
\]
If $\tilde{I}_{\theta,f}$ is nonsingular, then the function $\psi(P_{\theta,f}) = \theta$ is differentiable at $P_{\theta,f}$ relative to the tangent set $\dot{\mathcal{P}}_{P_{\theta,f}} = \mathrm{lin}\, \dot{l}_{\theta,f} + {}_f\dot{\mathcal{P}}_{P_{\theta,f}} = \{a \dot{l}_{\theta,f} + g : a \in \mathbb{R},\ g \in {}_f\dot{\mathcal{P}}_{P_{\theta,f}}\}$, with efficient influence function $\tilde\psi_{\theta,f} = \tilde{I}_{\theta,f}^{-1} \tilde{l}_{\theta,f}$.

As a consequence, we obtain a specialized version of Lemma 1.4. Suppose the nuisance tangent set ${}_f\dot{\mathcal{P}}_{P_{\theta,f}}$ is a cone; then a sequence of estimators $T_n$ is regular at $P_{\theta,f}$ with limiting distribution $\mathcal{N}(0, P_{\theta,f} \tilde\psi_{\theta,f}^2)$ (i.e., asymptotically efficient) if and only if it satisfies
\[
\sqrt{n}(T_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde{I}_{\theta,f}^{-1} \tilde{l}_{\theta,f}(X_i) + o_{P_{\theta,f}}(1).
\]
We first see that
\[
P_{\theta,f} \tilde\psi_{\theta,f}^2 = \big(P_{\theta,f} \tilde{l}_{\theta,f}^2\big)^{-1} = \tilde{I}_{\theta,f}^{-1}.
\]

The variance of the efficient influence function is equal to the inverse of the variance of the efficient score function, i.e., the inverse of the efficient information matrix. Thus, the reason why we call $\tilde{l}_{\theta,f}$ the efficient score function and $\tilde{I}_{\theta,f}$ the efficient information matrix is now clear. Secondly, under regularity conditions (see Chapters 5 and 8 in van der Vaart [1998]), the maximum likelihood estimator $\hat\theta_n$ in a parametric model satisfies
\[
\sqrt{n}(\hat\theta_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I_\theta^{-1} \dot{l}_\theta(X_i) + o_{P_\theta}(1),
\]
where $I_\theta$ is the ordinary Fisher information matrix and $\dot{l}_\theta$ is the ordinary score function. The only difference in a semiparametric model is that the ordinary score function $\dot{l}_{\theta,f}$ is replaced by the efficient score function $\tilde{l}_{\theta,f}$, and the Fisher information matrix $I_{\theta,f}$ for $\theta$ is replaced by the efficient information matrix $\tilde{I}_{\theta,f}$. Since a part of the score function for $\theta$ can be accounted for by score functions for the nuisance parameter $f$, a part of the information for $\theta$ is lost when the nuisance parameter is unknown. The orthogonal projection $\Pi_{\theta,f} \dot{l}_{\theta,f}$ of the score function for $\theta$ onto the nuisance tangent space ${}_f\dot{\mathcal{P}}_{P_{\theta,f}}$ corresponds to this loss. When there is no nuisance parameter, there is no nuisance tangent space and thus no loss of information for estimating $\theta$.
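As a concrete (textbook, not from this chapter) illustration of this projection, consider the Gaussian model $\mathcal{N}(\theta, \sigma^2)$ with nuisance parameter $\sigma^2$. The score for $\theta$ and the score for $\sigma^2$ are uncorrelated under normality, so the Monte Carlo projection below finds a coefficient near zero and an efficient information near the full information $1/\sigma^2$: estimating the mean loses nothing when the variance is unknown (an adaptive case).

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma2 = 0.0, 2.0
x = rng.normal(theta, np.sqrt(sigma2), size=200_000)

# ordinary score for theta (mean) and score for the nuisance sigma^2
score_theta = (x - theta) / sigma2
score_nuis = ((x - theta) ** 2 - sigma2) / (2 * sigma2 ** 2)

# L2(P)-projection of score_theta onto the span of the nuisance score
coef = np.mean(score_theta * score_nuis) / np.mean(score_nuis ** 2)
efficient_score = score_theta - coef * score_nuis

# efficient information: close to 1/sigma2, so no information is lost
I_eff = np.mean(efficient_score ** 2)
```

In a non-adaptive model, `coef` would be nonzero and `I_eff` strictly smaller than the Fisher information for $\theta$ with $f$ known.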

1.5.4 One-step estimator method

In this section, we introduce the one-step method to construct an asymptotically efficient estimator, relying on a $\sqrt{n}$-consistent one [see van der Vaart, 1998, Section 25.8]. Let $\hat\theta_n$ be a $\sqrt{n}$-consistent estimator of $\theta$, and suppose that we are given a sequence of estimators $\hat{l}_{n,\theta}(\cdot) = \hat{l}_{n,\theta}(\cdot; X_1, \dots, X_n)$ of the efficient score function $\tilde{l}_{\theta,f}$. With $m = \lfloor n/2 \rfloor$, define
\[
\hat{l}_{n,\theta,i}(\cdot) =
\begin{cases}
\hat{l}_{m,\theta}(\cdot; X_1, \dots, X_m) & \text{if } i > m, \\
\hat{l}_{n-m,\theta}(\cdot; X_{m+1}, \dots, X_n) & \text{if } i \le m.
\end{cases}
\]
Thus, for $X_i$ ranging through each of the two halves of the sample, we use an estimator $\hat{l}_{n,\theta,i}$ based on the other half of the sample. Then the one-step estimator is defined as
\[
\tilde\theta_n = \hat\theta_n + \Big(\sum_{i=1}^{n} \hat{l}^2_{n,\hat\theta_n,i}(X_i)\Big)^{-1} \sum_{i=1}^{n} \hat{l}_{n,\hat\theta_n,i}(X_i).
\]

This estimator $\tilde\theta_n$ can be considered a one-step iteration of the Newton-Raphson algorithm for solving an approximation of the equation $\sum_i \tilde{l}_{\theta,f}(X_i) = 0$ with respect to $\theta$, starting at the initial guess $\hat\theta_n$. We now assume that, for every deterministic sequence $\theta_n = \theta + O(n^{-1/2})$, we have
\[
\sqrt{n}\, P_{\theta_n,f}\, \hat{l}_{n,\theta_n} \xrightarrow[n\to\infty]{P_{\theta,f}} 0, \tag{1.6}
\]
\[
P_{\theta_n,f} \big\|\hat{l}_{n,\theta_n} - \tilde{l}_{\theta_n,f}\big\|^2 \xrightarrow[n\to\infty]{P_{\theta,f}} 0, \tag{1.7}
\]
\[
\int \big\|\tilde{l}_{\theta_n,f}\, dP_{\theta_n,f}^{1/2} - \tilde{l}_{\theta,f}\, dP_{\theta,f}^{1/2}\big\|^2 \xrightarrow[n\to\infty]{} 0. \tag{1.8}
\]

Note that in the above notation, the term $P_{\theta_n,f}\, \hat{l}$ for some random function $\hat{l}$ is an abbreviation for the integral $\int \hat{l}(x)\, dP_{\theta_n,f}(x)$. Thus the expectation is taken with respect to $x$ only, and not with respect to the random variables in $\hat{l}$.

Theorem 1.4. [Theorem 25.57 in van der Vaart, 1998] Suppose that the model $\{P_{\theta,f} : \theta \in \Theta\}$ is differentiable in quadratic mean with respect to $\theta$ at $(\theta, f)$, and let the efficient information matrix $\tilde{I}_{\theta,f}$ be nonsingular. Assume that (1.6)-(1.8) hold. Then the one-step estimator $\tilde\theta_n$ is asymptotically efficient at $(\theta, f)$.

This theorem reduces the problem of efficient estimation of $\theta$ to the estimation of the efficient score function. The estimator of the efficient score function must satisfy a "no-bias" condition (1.6) and a consistency condition (1.7). The consistency condition is usually easy to arrange, but the "no-bias" condition requires the bias to converge to zero at a rate faster than $1/\sqrt{n}$. If it fails, then the sequence $\tilde\theta_n$ is not asymptotically efficient and may even converge at a rate slower than $\sqrt{n}$. The good news is that if an efficient estimator sequence exists, then it can always be constructed by the one-step method. In that sense the no-bias condition is necessary.
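A toy sketch of the one-step construction in a purely parametric model $\mathcal{N}(\theta, 1)$, where the efficient score is the ordinary score $x - \theta$ and the initial $\sqrt{n}$-consistent estimator is the sample median; we use the "+" Newton correction of van der Vaart [1998], and the sample splitting is kept in the code even though it is vacuous here, since the score is known exactly rather than estimated:

```python
import numpy as np

def one_step(x, score):
    """One Newton step from a sqrt(n)-consistent initial guess; in general
    the score used on each half of the sample would be estimated from the
    other half (sample splitting as in the definition above)."""
    n = len(x)
    m = n // 2
    theta0 = np.median(x)            # sqrt(n)-consistent initial estimator
    # in this toy model the efficient score is known exactly, so both
    # halves use the same function; with a nuisance parameter, `score`
    # would be fitted separately on each half
    s = np.concatenate([score(theta0, x[:m]), score(theta0, x[m:])])
    return theta0 + np.sum(s) / np.sum(s ** 2)
```

For the Gaussian location model this single step moves the median essentially onto the efficient estimator, the sample mean.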


Theorem 1.5. [Theorem 7.4 in van der Vaart, 2002] Suppose that the model $\{P_{\theta,f} : \theta \in \Theta\}$ is differentiable in quadratic mean with respect to $\theta$ at $(\theta, f)$, let the efficient information matrix $\tilde{I}_{\theta,f}$ be nonsingular, and assume that (1.8) holds. Then the existence of an asymptotically efficient estimator of $\psi(P_{\theta,f}) = \theta$ implies the existence of a sequence of estimators $\hat{l}_{n,\theta}$ satisfying (1.6) and (1.7).

1.5.5 The infinite bound case

We end this section with an impossibility result due to Chamberlain [1986]. Chamberlain showed that if the semiparametric efficiency bound (i.e., the variance of the efficient influence function $\tilde\psi_P$) is infinitely large (e.g., if $\tilde{I}_{\theta,f} = P_{\theta,f} \tilde{l}_{\theta,f}^2$ is singular), then no regular estimator exists. More precisely, if the efficient information matrix is singular, the variance of the efficient influence function is infinite; since this is a lower bound on the variance of any regular estimator, no regular estimator can exist. More details about this and other impossibility theorems can be found in Newey [1990].

1.6 Organization

Throughout this dissertation, we assume the test statistics are independent and identically distributed (iid) with a continuous distribution under the corresponding null or alternative hypotheses; the $p$-values are then iid and, under the null hypotheses, follow the uniform distribution $\mathcal{U}([0,1])$ on the interval $[0,1]$. The density $g$ of the $p$-values is modeled by a two-component mixture with the following expression:
\[
\forall x \in [0,1], \quad g(x) = \theta + (1-\theta) f(x),
\]
where $\theta \in [0,1]$ is the unknown proportion of true null hypotheses and $f$ denotes the density of the $p$-values generated under the alternative (false null hypotheses). We recall that an adaptive linear step-up procedure or a plug-in threshold procedure requires an estimator of the parameter $\theta$. A good $\theta$-estimator is very important, since a conservative $\theta$-estimator in general leads to a conservative FDR estimator, which can be used in FDR control procedures. Besides, the problem of estimating the component $f$ arises in the estimation of the local false discovery rate, which is used in the adaptive step-up procedure of Sun and Cai [2007].
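For simulation purposes, $p$-values from this mixture can be drawn by choosing $f$ to be, say, a Beta$(a,1)$ density with $a < 1$ (a decreasing density vanishing nowhere, our toy choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pvalues(n, theta, a=0.25):
    """Draw n p-values from g = theta * U(0,1) + (1 - theta) * Beta(a, 1);
    Beta(a, 1) has cdf x**a, so inverse-cdf sampling gives u**(1/a)."""
    null = rng.uniform(size=n)
    alt = rng.uniform(size=n) ** (1.0 / a)
    is_null = rng.uniform(size=n) < theta
    return np.where(is_null, null, alt)
```

Such synthetic samples are the natural test bed for the $\theta$- and $f$-estimators studied in the next chapters.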

In the second chapter, we study the estimation of the proportion $\theta$. Let us first note that many different estimators of $\theta$ have been proposed in the literature, but their rates of convergence or asymptotic efficiency have only been partly studied. To our knowledge, there only exist estimators that converge to $\theta$ at a nonparametric rate, and it has not been investigated whether the parametric rate of convergence may be achieved by a consistent estimator of $\theta$ in this semiparametric setup. We discuss asymptotic efficiency results and establish that two different cases occur, according to whether $f$ vanishes on a non-empty interval or not. In the first case, we exhibit estimators converging at parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e., attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at parametric rate. We illustrate those results on simulated data. This chapter is a revised version of a paper submitted for publication in a statistics journal.

Motivated by the issue of local false discovery rate estimation, in the third chapter we focus on the estimation of the unknown nonparametric component $f$ in the mixture, relying on a preliminary estimator of the unknown proportion $\theta$ of true null hypotheses. We propose and study the asymptotic properties of two different estimators for this unknown component. The first estimator is a randomly weighted kernel estimator. We establish an upper bound for its pointwise quadratic risk, exhibiting the classical nonparametric rate of convergence over a class of Hölder densities. To our knowledge, this is the first result establishing convergence, as well as the corresponding rate, for the estimation of the unknown component in this nonparametric mixture. The second estimator is a maximum smoothed likelihood estimator. It is computed through an iterative algorithm, for which we establish a descent property. In addition, these estimators are used in a multiple testing procedure in order to estimate the local false discovery rate. Their respective performances are then compared on synthetic data. This chapter is accepted for publication in ESAIM: Probability and Statistics.

In the fourth chapter, we consider another mixture model that is useful for analyzing gene expression data coming from microarray analysis. It is a two-component mixture where one component is assumed to be a known density with prior probability $1-p$, and the other component is an unknown density assumed to be symmetric on $\mathbb{R}$ with non-null location parameter $\mu$. This model was introduced by Bordes et al. [2006]. Here, we aim at computing the efficient information matrix for estimating the Euclidean parameter $\theta = (p, \mu)$, and some ideas are proposed for future work.


Chapter 2

Estimation of the proportion of true

null hypotheses

Abstract

We consider the problem of estimating the proportion θ of true null hypotheses in a multiple testing context. The setup is classically modeled through a semiparametric mixture with two components: a uniform distribution on the interval [0, 1] with prior probability θ and a nonparametric density f. We discuss asymptotic efficiency results and establish that two different cases occur, according to whether f vanishes on a non-empty interval or not. In the first case, we exhibit estimators converging at parametric rate, compute the optimal asymptotic variance and conjecture that no estimator is asymptotically efficient (i.e., attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at parametric rate. We illustrate those results on simulated data.

Contents

2.1 Introduction
2.2 Lower bounds for the quadratic risk and efficiency
2.3 Upper bounds for the quadratic risk and efficiency (when δ > 0)
2.4 Simulations
2.5 Proofs of main results
2.6 Proofs of technical lemmas

2.1 Introduction

The problem of estimating the proportion θ of true null hypotheses is of interest in situations where several thousands of (independent) hypotheses can be tested simultaneously. One of the typical applications in which multiple testing problems occur is estimating the proportion of


genes that are not differentially expressed in deoxyribonucleic acid (DNA) microarray experiments [see for instance Dudoit and van der Laan, 2008]. Among other application domains, we mention astrophysics [Meinshausen and Rice, 2006] and neuroimaging [Turkheimer et al., 2001]. A reliable estimate of θ is important when one wants to control multiple error rates, such as the false discovery rate (FDR) introduced by Benjamini and Hochberg [1995]. In this work, we discuss the asymptotic efficiency of estimators of the proportion of true null hypotheses. We stress that the asymptotic framework is particularly relevant in the above-mentioned contexts, where the number of tested hypotheses is huge.

In many recent articles [such as Broberg, 2005, Celisse and Robin, 2010, Genovese and Wasserman, 2004, Langaas et al., 2005, etc.], a two-component mixture density is used to model the behavior of the $p$-values $X_1, X_2, \dots, X_n$ associated with $n$ independent tested hypotheses. More precisely, assume the test statistics are independent and identically distributed (iid) with a continuous distribution under the corresponding null hypotheses; then the $p$-values $X_1, X_2, \dots, X_n$ are iid and follow the uniform distribution $\mathcal{U}([0,1])$ on the interval $[0,1]$ under the null hypotheses. The density $g$ of the $p$-values is modeled by a two-component mixture with the following expression:
\[
\forall x \in [0,1], \quad g(x) = \theta + (1-\theta) f(x), \tag{2.1}
\]
where $\theta \in [0,1]$ is the unknown proportion of true null hypotheses and $f$ denotes the density of the $p$-values generated under the alternative (false null hypotheses).

Many different identifiability conditions on the parameter $(\theta, f)$ in model (2.1) have been discussed in the literature. For example, Genovese and Wasserman [2004] introduce the concept of purity, which corresponds to the case where the essential infimum of $f$ on $[0,1]$ is zero. They prove that purity implies identifiability, but not vice versa. Langaas et al. [2005] suppose that $f$ is decreasing with $f(1) = 0$, while Neuvial [2010] assumes that $f$ is regular near $x = 1$ with $f(1) = 0$, and Celisse and Robin [2010] consider that $f$ vanishes on a whole interval included in $[0,1]$. These are sufficient but not necessary conditions on $f$ ensuring identifiability. Now, if we assume more generally that $f$ belongs to some set $\mathcal{F}$ of densities on $[0,1]$, then a necessary and sufficient condition for parameter identifiability is stated in the next result, whose proof is given in Section 2.5.1.

Proposition 2.1. The parameter $(\theta, f)$ is identifiable on a set $(0,1) \times \mathcal{F}$ if and only if for all $f \in \mathcal{F}$ and for all $c \in (0,1)$, we have $c + (1-c)f \notin \mathcal{F}$.
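To see how the purity condition of Genovese and Wasserman [2004] fits this criterion, the following one-line computation (a standard argument, reconstructed here rather than quoted from the text) shows that purity is sufficient:

```latex
% If every f in F has essential infimum 0 on [0,1], then for any
% f in F and any c in (0,1),
\operatorname*{ess\,inf}_{x \in [0,1]} \bigl( c + (1-c) f(x) \bigr)
  \;=\; c + (1-c) \operatorname*{ess\,inf}_{x \in [0,1]} f(x) \;=\; c \;>\; 0,
% so c + (1-c) f cannot belong to F and, by Proposition 2.1, the
% parameter (theta, f) is identifiable on (0,1) x F.
```

The converse fails, as the proposition only requires $c + (1-c)f \notin \mathcal{F}$, which can hold for sets $\mathcal{F}$ without the purity property.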


This very general result is the starting point to considering explicit sets F of densities that ensure the parameter’s identifiability on (0, 1) × F. In particular, if F is a set of densities constrained to have essential infimum equal to zero, one recovers the purity result of Genovese and Wasserman[2004]. However, from an estimation perspective, the purity assumption is very weak and it is hopeless to obtain a reliable estimate of θ based on the value of f at a unique value (or at a finite number of values). Since the p-values that are associated with the false null hypotheses are likely to be small and a large majority of the p-values in the interval [1 − δ, 1], for δ not too large, should correspond to the true null hypotheses, the assumption that f is non-increasing with f(1) = 0 is reasonable. Recall that this assumption is used in Langaas et al.[2005] and partially in Celisse and Robin[2010]. In the following, we explore asymptotic efficiency results for the estimation of θ by assuming that the function f belongs to a set of densities (with respect to the Lebesgue measure µ) defined as

Fδ = {f : [0, 1] ↦ R+ : f is a continuous non-increasing density, positive on [0, 1 − δ) and such that f|[1−δ,1] = 0}.    (2.2)

We establish that two different cases must be distinguished: δ positive and δ equal to zero. In the first case, we obtain the existence of √n-consistent estimators of θ, that is, estimators θ̂n such that √n(θ̂n − θ) is bounded in probability (denoted by √n(θ̂n − θ) = OP(1)). We exhibit such estimators and also compute the asymptotic optimal variance for this problem. Moreover, we conjecture that asymptotically efficient estimators (that is, estimators asymptotically attaining this variance lower bound) do not exist. In the second case, while the existence of an estimator θ̂n of θ converging at the parametric rate has not been established yet, we prove that if such a √n-consistent estimator of θ exists, then the variance Var(√n θ̂n) cannot have a finite limit. In other words, the quadratic risk of θ̂n cannot converge to zero at the parametric rate. Note that these results remain true in the more general case where the function f either vanishes on some non-empty interval included in [0, 1] (thus not necessarily of the form [1 − δ, 1]) or does not vanish at all.
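The δ > 0 case lends itself to a quick simulation (a hedged sketch under our own choices of θ, δ and of a triangular alternative density f; none of this code is from the thesis): since g is identically equal to θ on [1 − δ, 1], the fraction of p-values falling in that interval, divided by δ, estimates θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, delta, n = 0.7, 0.2, 200_000   # illustrative values only

# Sample p-values from g = theta * 1 + (1 - theta) * f, with f in F_delta:
# f(t) = (2 / (1 - delta)) * (1 - t / (1 - delta)) on [0, 1 - delta), 0 beyond,
# sampled by inverting its cdf F(t) = 1 - (1 - t / (1 - delta))^2.
is_null = rng.random(n) < theta
u = rng.random(n)
p = np.where(is_null,
             u,                                    # true nulls: Uniform(0, 1)
             (1 - delta) * (1 - np.sqrt(1 - u)))   # false nulls: inverse-cdf sampling of f

# Simple histogram-type estimator: on [1 - delta, 1] the density g equals theta,
# so the proportion of p-values there, divided by delta, estimates theta.
theta_hat = np.mean(p >= 1 - delta) / delta
```

Since the count of p-values in [1 − δ, 1] is Binomial(n, θδ), this estimator satisfies E[θ̂n] = θ and Var(√n(θ̂n − θ)) = θ(1 − θδ)/δ, a finite limit, in line with the δ > 0 case above.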

Let us now discuss the different estimators of θ proposed in the literature, starting with those assuming (implicitly or not) that f attains its minimum value on a whole interval. First, Schweder and Spjøtvoll [1982] suggested a procedure to estimate θ, which was later used by Storey [2002]. This estimator depends on an unspecified parameter λ ∈ [0, 1) and is equal to the proportion of p-values larger than the threshold λ, divided by 1 − λ. Storey established that it is a conservative estimator, and one can note that it is consistent only if f attains its minimum value on the interval [λ, 1] (an assumption not made in the article by Schweder and Spjøtvoll [1982] nor


in the one by Storey [2002]). Note that even if such an assumption were made, it would not solve the problem of choosing λ such that f attains its infimum on [λ, 1]. Adapting this procedure in order to end up with an estimate of the positive FDR (pFDR), Storey [2002] proposes a bootstrap strategy to pick λ. More precisely, his procedure minimizes the mean squared error for estimating the pFDR. Note that Genovese and Wasserman [2004] established that, for a fixed value λ such that the cumulative distribution function (cdf) F of f satisfies F(λ) < 1, Storey's estimator converges at the parametric rate and is asymptotically normal, but is also asymptotically biased: thus it does not converge to θ at the parametric rate. Other choices of λ are, for instance, based on break point estimation [Turkheimer et al., 2001] or spline smoothing [Storey, 2003]. Another natural class of procedures in this context relies on a histogram estimator of g [Mosig et al., 2001, Nettleton et al., 2006]. Among this kind of procedures, we mention the one proposed recently by Celisse and Robin [2010], who proved convergence in probability of their estimator (to the true parameter value) under the assumption that f vanishes on an interval. Note that both Storey's and histogram-based estimators of θ are constructed using nonparametric estimates ĝ of the density g, and then estimate θ relying on the value of ĝ on a specific interval. The main issue with these procedures is to automatically select an interval where the true density g is identically equal to θ. As a conclusion on the existing results for this setup (f vanishing on a non-empty interval), we stress that none of these estimators was proven to converge to θ at the parametric rate. In Theorem 2.2 below, we prove that a very simple histogram-based estimator possesses this property, while in Theorem 2.3, we establish that this is also true for the more elaborate procedure proposed by Celisse and Robin [2010], which has the advantage of automatically selecting the "best" partition among a fixed collection. However, we are not aware of a procedure for estimating θ that asymptotically attains the optimal variance in this context. Besides, one might conjecture that such a procedure does not exist for regular models (see Section 2.3.3).
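The behaviour of the Schweder and Spjøtvoll / Storey estimator described above can be reproduced on simulated p-values (a minimal sketch under our own assumptions: θ = 0.7 and a triangular alternative density f vanishing on [0.8, 1]; none of this code comes from the papers cited):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, delta, n = 0.7, 0.2, 200_000   # illustrative values only

# Simulated p-values: true nulls Uniform(0, 1); alternatives drawn (by inverse cdf)
# from the triangular density f(t) = (2/(1-delta))(1 - t/(1-delta)) on [0, 1-delta).
is_null = rng.random(n) < theta
u = rng.random(n)
p = np.where(is_null, u, (1 - delta) * (1 - np.sqrt(1 - u)))

def storey(pvals, lam):
    """Schweder-Spjotvoll / Storey estimator: #{p_i > lambda} / (n (1 - lambda))."""
    return float(np.mean(pvals > lam) / (1.0 - lam))

# Consistent when f vanishes on [lambda, 1] (here lambda = 0.8 = 1 - delta) ...
est_good = storey(p, 0.8)
# ... but conservative otherwise: the limit is theta + (1 - theta)(1 - F(lambda))/(1 - lambda),
# where F is the cdf of f.
est_biased = storey(p, 0.5)
```

For λ inside the vanishing region, the estimator targets θ itself; for smaller λ its limit θ + (1 − θ)(1 − F(λ))/(1 − λ) exceeds θ, which is the conservativeness (asymptotic bias) noted by Genovese and Wasserman [2004].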

Other estimators of θ are based on regularity or monotonicity assumptions made on f, or equivalently on g, combined with the assumption that the infimum of g is attained at x = 1. These estimators rely on nonparametric estimates of g and appear to inherit nonparametric rates of convergence. Langaas et al. [2005] derive estimators based on nonparametric maximum likelihood estimation of the p-value density, in two setups: decreasing densities f and convex decreasing densities f. We mention that no theoretical properties of these estimators are given. Hengartner and Stark [1995] propose a very general finite sample confidence envelope for a monotone density.
