
Top scoring pair classifiers: asymptotics and applications

Christophe Denis

Laboratoire MAP5, UMR CNRS 8145,

Université Paris Descartes, Sorbonne Paris Cité, Paris, France

christophe.denis@parisdescartes.fr

Abstract

The original top scoring pair (TSP) classifier was proposed by Geman et al. (2004) for binary classification of diseases based on genetic profiles. We show the consistency of two versions of the TSP classifier and their two cross-validated counterparts relative to two different risks: the classical misclassification risk and an asymmetric version of this risk which gives more weight to the rarer class. A numerical study illustrates our results and sheds further light on the different TSP classification procedures.

Keywords: Top scoring pair classifier; Classification; Cross-validation

1 Introduction

This work is devoted to the study of two top scoring pair (TSP) classifiers inspired by the original TSP classifier defined and studied by Geman et al. (2004).

The original TSP classifier was coined to address the classification of different cancers based on gene-expression profiles. The main feature of the original TSP classifier is that it is based on pairwise comparisons. More precisely, the method consists in differentiating between two classes by finding pairs of genes whose expression levels typically invert from one class to the other. A single pair of genes (or sometimes a handful of them) is selected by maximizing a score. Then the resulting TSP classification rule is based only on the selected pair(s) of genes. Thus, the TSP classifier does not suffer from the lack of interpretability which often arises in the statistical analysis of microarray data. Indeed, it is easy to interpret the fact that the expression level of a gene is larger than the expression level of another gene, even if these expression levels are obtained under different experimental conditions. One can hopefully pass the so-called "elevator exam": explaining to one's colleague from the biology department how this classifier works in the elevator between the first and the fourth floors. Moreover, this classification rule is robust to quantization effects and invariant to pre-processing such as normalization methods (Yang et al., 2001).


Furthermore, the TSP classifier is easy to compute, and its implementation requires no tuning parameters. We refer to the tspair package for an R implementation. Geman et al. (2004) also argue that the TSP classifier behaves well in the "small n large p" paradigm, and they show on several real datasets that the TSP classification rule compares favorably with more complex ones. Note that the TSP procedure proved useful in other contexts: for instance, Chambaz and Denis (2012) apply the TSP classification procedure for classifying subjects in terms of postural maintenance.

Various extensions of the TSP classification procedure have been proposed. Tan et al. (2005) and Zhou et al. (2012) are interested in k-TSP procedures which involve the pairs that achieve the k largest scores rather than the highest score only. Note that the k-TSP procedure of Tan et al. (2005) applies to the multi-class framework. Furthermore, Czajkowski and Kretowski (2011) propose a TSP procedure based on decision trees. As far as we know, there is no theoretical study of the TSP classification procedure.

The aim of this work is to provide such a theoretical study. We show that the differences in risks (for two different risks and their cross-validated versions) between the two empirical TSP classifiers that we consider here and their theoretical counterparts are O(√(log(M)/N)), where M is the number of pairs and N is either the sample size n or the product nπ, with π the probability of observing the rarer of the two labels. In particular, the results shed some light on how the empirical TSP classifiers behave in the "small n large p" paradigm.

The article is organized as follows. In Section 2, we define the two TSP classification procedures of interest as two maximizers of two different scores. Their empirical counterparts are defined as maximizers of the corresponding empirical scores in Section 3, where we also carry out their asymptotic study. We introduce the cross-validated versions of the two TSP classification rules in Section 4, where we also investigate their asymptotic behaviors. We present a numerical illustration based on a real dataset in Section 5, where we also summarize the results of a simulation study to compare the performances of the different TSP procedures. We draw some conclusions and present some perspectives in Section 6. The proofs of the main results are postponed to Section 7.

2 General framework

This section is devoted to the definition of two TSP classifiers. We first introduce some useful notations and definitions in Section 2.1. We define the TSP classifiers in Section 2.2.

2.1 Notations

Let O = (X, Y) be the observed data structure taking values in R^G × {0, 1} (for a possibly large integer G). For instance, X can be viewed as the expression levels of G genes while Y can indicate whether the subject is healthy or not. The true data-generating distribution is P0, which is an element of the set M of all candidate data-generating distributions. We denote by Dn = {O1, . . . , On} the dataset, where O1 = (X1, Y1), . . . , On = (Xn, Yn) are independent copies of O. We set J = {J = (i, j) ∈ {1, . . . , G}², i < j} and ZJ = 1{Xi < Xj} for all J ∈ J. Obviously, card(J) = G(G − 1)/2.

We introduce the following notations: p1 = P0(Y = 1) = 1 − p0, p = min(p1, p0), and, for each J ∈ J, αJ = P0(ZJ = 1), ηJ(ZJ) = P0(Y = 1 | ZJ), pJ(1) = P0(ZJ = 1 | Y = 1), and pJ(0) = P0(ZJ = 1 | Y = 0). We assume that card(J) ≥ 2 and p > 0.
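For illustration only, the following R sketch (synthetic data; the object names X, Y, pairs and Z are ours, not the paper's) builds the indicators ZJ = 1{Xi < Xj} for every pair J ∈ J from an n × G expression matrix.

# Minimal sketch (toy data): build Z_J = 1{X_i < X_j} for every pair J = (i, j), i < j.
set.seed(1)
n <- 40; G <- 6
X <- matrix(rnorm(n * G), nrow = n)   # n observations of G "genes"
Y <- rbinom(n, 1, 0.3)                # binary labels
pairs <- t(combn(G, 2))               # the card(J) = G(G - 1)/2 pairs (i, j)
Z <- apply(pairs, 1, function(J) as.integer(X[, J[1]] < X[, J[2]]))
dim(Z)                                # n rows, one column per pair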

Let F be the set of the functions which map R^G to {0, 1}, and consider the loss function L : R^G × {0, 1} × F → R+ such that

L((X, Y), Ψ) = 1{Ψ(X) ≠ Y} = Y 1{Ψ(X) ≠ 1} + (1 − Y) 1{Ψ(X) ≠ 0}.

The loss function L is the usual loss function in the classification framework. For all P ∈ M, this yields the risks R1^{(P)}, R2^{(P)} : F → R+ characterized by

R1^{(P)}(Ψ) = E_P[L(O, Ψ)] = P(Ψ(X) ≠ Y), and

R2^{(P)}(Ψ) = E_P[L(O, Ψ) | Y = 1] + E_P[L(O, Ψ) | Y = 0] = (1/p1) P(Ψ(X) ≠ 1, Y = 1) + (1/p0) P(Ψ(X) ≠ 0, Y = 0).

The risk R1 is called the misclassification risk. We call R2 a weighted misclassification risk. The risk R2 is particularly useful when p ≪ max(p1, p0) and it is important to identify the elements of the rare class.

Finally, we define Fpair = ∪_{J∈J} FJ, where FJ is the set of the functions t of ZJ such that t(ZJ) ∈ {0, 1}. For J ∈ J, a classifier t ∈ FJ is called a pair classifier. Note that card(Fpair) = 4 card(J).

2.2 Definition of the TSP classifiers

The two TSP classifiers that we consider here are elements of Fpair. Their definitions involve the risks R1 and R2. Of course, there is no a priori guarantee that classifying based on basic comparisons, as they do, will prove efficient. However, they are so simple and so fast that one can try them at almost no cost.

2.2.1 TSP for the misclassification risk

We first introduce the TSP classifier for the misclassification risk R1.

For each J ∈ J, let ΨJ denote the Bayes classifier on the set FJ, defined by

ΨJ(X) = ΨJ(ZJ) = 1{ηJ(ZJ) ≥ 1/2}.   (1)

The classifier ΨJ votes for the class with the larger probability conditionally on Xi < Xj or Xi ≥ Xj. We recall that ΨJ is also characterized by ΨJ ∈ arg min_{t∈FJ} R1^{(P0)}(t). We define the score γJ of the pair J as

γJ = αJ |ηJ(1) − 1/2| + (1 − αJ) |ηJ(0) − 1/2|.   (2)

The following lemma connects the score of a pair J ∈ J to the misclassification risk of ΨJ.


Lemma 1. For each J ∈ J, it holds that γJ = 1/2 − R1^{(P0)}(ΨJ).

Proof. Set J ∈ J. We first decompose R1^{(P0)}(ΨJ) as follows:

R1^{(P0)}(ΨJ) = E_{P0}[1{ΨJ(1) ≠ 1}1{Y = 1}1{ZJ = 1}] + E_{P0}[1{ΨJ(0) ≠ 1}1{Y = 1}1{ZJ = 0}]
+ E_{P0}[1{ΨJ(1) ≠ 0}1{Y = 0}1{ZJ = 1}] + E_{P0}[1{ΨJ(0) ≠ 0}1{Y = 0}1{ZJ = 0}].

From this decomposition, we deduce that

R1^{(P0)}(ΨJ) = αJ [1{ΨJ(1) ≠ 1}ηJ(1) + (1 − 1{ΨJ(1) ≠ 1})(1 − ηJ(1))]
+ (1 − αJ) [1{ΨJ(0) ≠ 1}ηJ(0) + (1 − 1{ΨJ(0) ≠ 1})(1 − ηJ(0))].

Using the facts that, firstly, ΨJ(1) ≠ 1 implies ηJ(1) < 1/2 and, secondly, ΨJ(0) ≠ 1 implies ηJ(0) < 1/2, we obtain that

1/2 − [1{ΨJ(1) ≠ 1}ηJ(1) + (1 − 1{ΨJ(1) ≠ 1})(1 − ηJ(1))] = |ηJ(1) − 1/2|,
1/2 − [1{ΨJ(0) ≠ 1}ηJ(0) + (1 − 1{ΨJ(0) ≠ 1})(1 − ηJ(0))] = |ηJ(0) − 1/2|.

The last equalities together with 1/2 = αJ/2 + (1 − αJ)/2 complete the proof.

Lemma 1 teaches us that the larger the score γJ, the better the classification based only on the pair J. Therefore, the TSP J1∗ is characterized by

J1∗ ∈ arg max_{J∈J} γJ.   (3)

It yields the TSP classifier for the misclassification risk:

Ψ_{J1∗}(X) = Ψ_{J1∗}(Z_{J1∗}) = 1{η_{J1∗}(Z_{J1∗}) ≥ 1/2}.

By (3) and Lemma 1, one can equivalently characterize this TSP classifier as

Ψ_{J1∗} ∈ arg min_{t∈Fpair} R1^{(P0)}(t),

showing that Ψ_{J1∗} can also be viewed as a risk minimizer; we will take advantage of this remark later.

2.2.2 TSP for the weighted misclassification risk

We now introduce the TSP classifier for the weighted misclassification risk. It is the original TSP classifier of Geman et al. (2004). It can be viewed as a weighted counterpart of the TSP classifier Ψ_{J1∗} in the sense that it is a minimizer of the weighted misclassification risk over Fpair.

For each J ∈ J, we introduce the classifier ΦJ ∈ FJ defined by

ΦJ(X) = ΦJ(ZJ) = 1{pJ(1) > pJ(0)}1{ZJ = 1} + 1{pJ(1) ≤ pJ(0)}1{ZJ = 0}.   (4)

The classifier ΦJ votes for the class for which the observed ordering between Xi and Xj is the more likely. We also introduce the score ∆J of each J ∈ J as

∆J = |pJ(1) − pJ(0)|.   (5)

The following lemma teaches us that one can interpret ∆J as the weighted counterpart of γJ and ΦJ as the weighted counterpart of ΨJ.

Lemma 2. Set J ∈ J. For all t ∈ FJ, it holds that

R2^{(P0)}(t) − R2^{(P0)}(ΦJ) = ∆J (1{t(1) ≠ ΦJ(1)} + 1{t(0) ≠ ΦJ(0)}),

which implies that ΦJ ∈ arg min_{t∈FJ} R2^{(P0)}(t). Moreover, ∆J = 1 − R2^{(P0)}(ΦJ).

Proof. Set J ∈ J, t ∈ FJ, and define

A1 = E_{P0}[1{t(ZJ) ≠ 1} | Y = 1], and A0 = E_{P0}[1{t(ZJ) ≠ 0} | Y = 0].

We can decompose A1 as

A1 = E_{P0}[1{t(1) ≠ 1}1{ZJ = 1} | Y = 1] + E_{P0}[1{t(0) ≠ 1}1{ZJ = 0} | Y = 1]
   = pJ(1) 1{t(1) ≠ 1} + (1 − pJ(1)) 1{t(0) ≠ 1}.   (6)

Similarly,

A0 = E_{P0}[1{t(1) ≠ 0}1{ZJ = 1} | Y = 0] + E_{P0}[1{t(0) ≠ 0}1{ZJ = 0} | Y = 0]
   = pJ(0) 1{t(1) ≠ 0} + (1 − pJ(0)) 1{t(0) ≠ 0}
   = pJ(0) (1 − 1{t(1) ≠ 1}) + (1 − pJ(0)) (1 − 1{t(0) ≠ 1}).   (7)

Since R2^{(P0)}(t) = A1 + A0, we deduce from (6) and (7) that

R2^{(P0)}(t) = 1 + (pJ(0) − pJ(1)) 1{t(0) ≠ 1} + (pJ(1) − pJ(0)) 1{t(1) ≠ 1}.   (8)

Equation (8) holds in particular when t = ΦJ. Therefore,

R2^{(P0)}(t) − R2^{(P0)}(ΦJ) = (pJ(0) − pJ(1)) (1{t(0) ≠ 1} − 1{ΦJ(0) ≠ 1}) + (pJ(1) − pJ(0)) (1{t(1) ≠ 1} − 1{ΦJ(1) ≠ 1})
   = ∆J (1{t(1) ≠ ΦJ(1)} + 1{t(0) ≠ ΦJ(0)}),

which is the first stated result. Moreover, a direct application of (8) with t = ΦJ yields the second result.


The TSP classifier for the weighted misclassification risk is Φ_{J2∗}, with J2∗ characterized by

J2∗ ∈ arg max_{J∈J} ∆J.   (9)

By Lemma 2 and (9), it holds that Φ_{J2∗} ∈ arg min_{t∈Fpair} R2^{(P0)}(t), showing that Φ_{J2∗} can be viewed as a minimizer of the weighted misclassification risk over Fpair.
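As a sanity check of these definitions and of the identities γJ = 1/2 − R1^{(P0)}(ΨJ) and ∆J = 1 − R2^{(P0)}(ΦJ) stated in Lemmas 1 and 2, one can run the following R sketch; the population quantities (αJ, ηJ(0), ηJ(1)) are toy values chosen for illustration only.

# Toy sanity check of Lemmas 1 and 2 for one pair J (illustrative numbers).
alpha <- 0.4                                  # alpha_J = P(Z_J = 1)
eta1  <- 0.8                                  # eta_J(1) = P(Y = 1 | Z_J = 1)
eta0  <- 0.3                                  # eta_J(0) = P(Y = 1 | Z_J = 0)
p1    <- alpha * eta1 + (1 - alpha) * eta0    # P(Y = 1)
p0    <- 1 - p1
pJ1   <- alpha * eta1 / p1                    # p_J(1) = P(Z_J = 1 | Y = 1)
pJ0   <- alpha * (1 - eta1) / p0              # p_J(0) = P(Z_J = 1 | Y = 0)
gamma <- alpha * abs(eta1 - 1/2) + (1 - alpha) * abs(eta0 - 1/2)   # score (2)
Delta <- abs(pJ1 - pJ0)                                            # score (5)
R1_Psi <- alpha * min(eta1, 1 - eta1) + (1 - alpha) * min(eta0, 1 - eta0)
R2_Phi <- if (pJ1 > pJ0) (1 - pJ1) + pJ0 else pJ1 + (1 - pJ0)
c(gamma, 1/2 - R1_Psi)   # equal, as stated in Lemma 1
c(Delta, 1 - R2_Phi)     # equal, as stated in Lemma 2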

3 Empirical TSP classifiers

In this section, we introduce our empirical TSP classifiers and study their asymptotic behaviors in terms of risk control. Section 3.1 and Section 3.2 are devoted to the empirical TSP classification procedures for the misclassification risk and the weighted misclassification risk, respectively.

3.1 Empirical TSP classifier for the misclassification risk

The definition of the empirical TSP classifier for the misclassification risk relies on estimators of J1∗ and η_{J1∗} that we plug into (1).

For every t ∈ Fpair, we set R̂1(t) = (1/n) Σ_{k=1}^n 1{t(Xk) ≠ Yk}, the empirical misclassification risk of t. For each J ∈ J, let γ̂J = α̂J |η̂J(1) − 1/2| + (1 − α̂J) |η̂J(0) − 1/2| be the empirical score, where α̂J = (1/n) Σ_{k=1}^n ZJ^k and

η̂J(z) = (1/(n β̂J(z))) Σ_{k=1}^n 1{ZJ^k = z} 1{Yk = 1} if β̂J(z) > 0, and η̂J(z) = 1/2 otherwise,

with β̂J(z) = z α̂J + (1 − z)(1 − α̂J) (for both z = 0, 1). The random variable η̂J(z) is the empirical version of ηJ(z). If card({k : ZJ^k = z}) = 0, we choose η̂J(z) = 1/2 by convention.

The plug-in estimator Ψ̂J(·) = 1{η̂J(·) ≥ 1/2} of ΨJ implements a majority voting rule:

Ψ̂J(z) = 1 if card({k : ZJ^k = z, Yk = 1}) ≥ card({k : ZJ^k = z, Yk = 0}), and Ψ̂J(z) = 0 otherwise,

hence

Ψ̂J ∈ arg min_{t∈FJ} R̂1(t).   (10)

We illustrate the classification rule Ψ̂J in Figure 1. Finally, Ĵ1 ∈ arg max_{J∈J} γ̂J defines an estimator of the TSP J1∗, which leads to the empirical TSP classifier Ψ̂_{Ĵ1}. A slight adaptation of the proof of Lemma 1 shows the following result:

Lemma 3. For each J ∈ J, it holds that γ̂J = 1/2 − R̂1(Ψ̂J).
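To make the construction concrete, here is a minimal R sketch (synthetic data, illustrative names; this is not the tspair implementation) that computes the empirical scores γ̂J of all pairs, selects Ĵ1, and applies the majority-vote rule Ψ̂_{Ĵ1} to a new profile.

# Sketch of the empirical TSP for the misclassification risk (toy data).
set.seed(1)
n <- 60; G <- 8
X <- matrix(rnorm(n * G), nrow = n)
Y <- rbinom(n, 1, 0.35)
pairs <- t(combn(G, 2))
gamma_hat <- apply(pairs, 1, function(J) {
  Z     <- as.integer(X[, J[1]] < X[, J[2]])
  alpha <- mean(Z)
  eta1  <- if (any(Z == 1)) mean(Y[Z == 1]) else 0.5   # convention: 1/2 if empty
  eta0  <- if (any(Z == 0)) mean(Y[Z == 0]) else 0.5
  alpha * abs(eta1 - 0.5) + (1 - alpha) * abs(eta0 - 0.5)
})
J1_hat <- pairs[which.max(gamma_hat), ]                # selected pair (i, j)
# Majority-vote rule applied to a new profile x:
predict_tsp1 <- function(x, X, Y, J) {
  Z <- as.integer(X[, J[1]] < X[, J[2]])
  z <- as.integer(x[J[1]] < x[J[2]])
  as.integer(sum(Y[Z == z] == 1) >= sum(Y[Z == z] == 0))
}
predict_tsp1(X[1, ], X, Y, J1_hat)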


Figure 1: Illustration of the empirical classification rules Ψ̂J and Φ̂J for a pair J = (i, j). First, we have η̂J(1) = 5/8 and η̂J(0) = 5/7. Therefore, for a new observation X, Ψ̂J(X) = 1 both if Xi < Xj and if Xj ≤ Xi. Moreover, the score γ̂J of the pair J is equal to (8/15)|5/8 − 1/2| + (7/15)|5/7 − 1/2| = 1/6. For the computation of Φ̂J, p̂J(1) = 1/2 and p̂J(0) = 3/5. Therefore, for a new observation X, we obtain Φ̂J(X) = 0 if Xi < Xj and Φ̂J(X) = 1 if Xj ≤ Xi. Moreover, the score ∆̂J of the pair J is equal to |1/2 − 3/5| = 1/10.

Lemma 3 and (10) entail that

Ψ̂_{Ĵ1} ∈ arg min_{t∈Fpair} R̂1(t).   (11)

This property leads to the following asymptotic result, which teaches us that, in the limit, Ψ̂_{Ĵ1} performs as well as the TSP classifier Ψ_{J1∗}.

Theorem 1. It holds that

0 ≤ E[R1^{(P0)}(Ψ̂_{Ĵ1}) − R1^{(P0)}(Ψ_{J1∗})] = O(√(log(card(J))/n)).

This is the classical rate of convergence that one expects for a classifier which can be viewed as a minimizer, over the set Fpair of pair classifiers, of the empirical misclassification risk (Bousquet et al., 2004). We see clearly how the number of pairs affects the rate of convergence.


3.2 Empirical TSP classifier for the weighted misclassification risk

The definition of the empirical TSP classifier for the weighted misclassification risk relies on estimators of J2∗ and p_{J2∗}. It is the original empirical TSP classifier of Geman et al. (2004).

Set I(y) = {k ≤ n : Yk = y} and N(y) = card(I(y)) for y ∈ {0, 1}. For each J ∈ J, the empirical score ∆̂J is defined as ∆̂J = |p̂J(1) − p̂J(0)|, where, for each y = 0, 1,

p̂J(y) = (1{N(y) > 0}/N(y)) Σ_{k∈I(y)} ZJ^k

(with the convention 0/0 = 0). We also define, for each J ∈ J, the empirical counterpart Φ̂J of ΦJ as

Φ̂J(X) = Φ̂J(ZJ) = 1{p̂J(1) > p̂J(0)}1{ZJ = 1} + 1{p̂J(1) ≤ p̂J(0)}1{ZJ = 0}.

Finally, Ĵ2 ∈ arg max_{J∈J} ∆̂J defines an estimator of the TSP J2∗, which leads to the empirical TSP classifier Φ̂_{Ĵ2} for the weighted misclassification risk.
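A matching R sketch (again synthetic data and illustrative names, not the tspair code) for the original TSP procedure: compute ∆̂J for every pair, select Ĵ2, and classify a new profile with Φ̂_{Ĵ2}.

# Sketch of the empirical TSP for the weighted misclassification risk (toy data).
set.seed(2)
n <- 60; G <- 8
X <- matrix(rnorm(n * G), nrow = n)
Y <- rbinom(n, 1, 0.2)                            # class 1 is the rarer one here
pairs <- t(combn(G, 2))
delta_hat <- apply(pairs, 1, function(J) {
  Z  <- as.integer(X[, J[1]] < X[, J[2]])
  p1 <- if (any(Y == 1)) mean(Z[Y == 1]) else 0   # p_hat_J(1), convention 0/0 = 0
  p0 <- if (any(Y == 0)) mean(Z[Y == 0]) else 0   # p_hat_J(0)
  abs(p1 - p0)
})
J2_hat <- pairs[which.max(delta_hat), ]
predict_tsp2 <- function(x, X, Y, J) {
  Z  <- as.integer(X[, J[1]] < X[, J[2]])
  p1 <- if (any(Y == 1)) mean(Z[Y == 1]) else 0
  p0 <- if (any(Y == 0)) mean(Z[Y == 0]) else 0
  z  <- as.integer(x[J[1]] < x[J[2]])
  if (p1 > p0) z else 1L - z                      # vote for the class making z the more likely
}
predict_tsp2(X[1, ], X, Y, J2_hat)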

The following asymptotic result shows that Φ̂_{Ĵ2} performs, in the limit, as well as Φ_{J2∗}:

Theorem 2. It holds that

0 ≤ E[(R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{J2∗})) 1{0 < N(1) < n}] = O(√(log(card(J))/(np))).   (12)

The above rate of convergence is the same as in Theorem 1 with n replaced by np, the expected number of observations Ok = (Xk, Yk) such that Yk = y, where y is the rarer outcome (i.e., p = P0(Y = y)). The additional factor 1/√p featured in (12) quantifies to what extent working with R2 instead of R1 makes the classification problem more difficult.

The proof of Theorem 2 is given in Section 7.3.

4 Cross-validated TSP classifiers

This section parallels Section 3. The main idea is to adopt a different approach to estimate the two TSP classifiers: instead of building the empirical TSP classifiers that we introduce and study in Section 3, we rely here on the cross-validation principle. By doing so, we may achieve greater stability and better performance for the resulting estimators. The cross-validation principle has been widely studied from both the theoretical and practical viewpoints (Dudoit and van der Laan, 2005; Arlot, 2007, and references therein). We define the cross-validated counterparts of R1 and R2 in Section 4.1. We introduce the two cross-validated TSP classifiers in Section 4.2, and we study their asymptotic behaviors in Section 4.3.


4.1 Cross-validated risk estimator

We set an integer V ≥ 2 and a regular partition (Bv)_{1≤v≤V} of {1, . . . , n}, i.e., a partition such that, for each v = 1, . . . , V, card(Bv) ∈ {⌊n/V⌋, ⌊n/V⌋ + 1}. For each v ∈ {1, . . . , V}, we denote by Dn^{(v)} (respectively Dn^{(−v)}) the dataset {Ok, k ∈ Bv} (respectively {Ok, k ∉ Bv}), and define the corresponding empirical measures

Pn^{(v)} = (1/card(Bv)) Σ_{k∈Bv} Dirac(Ok), and Pn^{(−v)} = (1/(n − card(Bv))) Σ_{k∉Bv} Dirac(Ok).

Let t̂ be a pair classifier, i.e., a function mapping the empirical distribution to Fpair. Note that t̂ can be viewed simply as a black-box algorithm that one applies to data. We characterize the empirical cross-validated risk estimators R̂1,n, for the misclassification risk, and R̂2,n, for the weighted misclassification risk, by

R̂1,n(t̂) = (1/V) Σ_{v=1}^V R1^{(Pn^{(v)})}(t̂(Pn^{(−v)})), and R̂2,n(t̂) = (1/V) Σ_{v=1}^V R2^{(Pn^{(v)})}(t̂(Pn^{(−v)})), for all t̂.

For each v ∈ {1, . . . , V} and m = 1, 2, Rm^{(Pn^{(v)})}(t̂(Pn^{(−v)})) is the empirical estimator of Rm^{(P0)}(t̂(Pn^{(−v)})), based on Dn^{(v)} and conditionally on Dn^{(−v)}. Obviously, it holds that, for every v ∈ {1, . . . , V},

R1^{(Pn^{(v)})}(t̂(Pn^{(−v)})) = (1/card(Bv)) Σ_{k∈Bv} L(Ok, t̂(Pn^{(−v)})), and

R2^{(Pn^{(v)})}(t̂(Pn^{(−v)})) = (1{Nv(1) > 0}/Nv(1)) Σ_{k∈Iv(1)} L(Ok, t̂(Pn^{(−v)})) + (1{Nv(0) > 0}/Nv(0)) Σ_{k∈Iv(0)} L(Ok, t̂(Pn^{(−v)})),

with Iv(y) = {k ∈ Bv : Yk = y} and Nv(y) = card(Iv(y)) for y = 0, 1.

4.2 V-fold cross-validation principle

Let t̂1, . . . , t̂L be L pair classifiers (with L a possibly large integer).

We first address the case of the misclassification risk R1. Each pair classifier can be viewed as a candidate to estimate the TSP classifier Ψ_{J1∗} for the misclassification risk R1. The goal is to select a pair classifier in the collection {t̂1, . . . , t̂L} whose risk is the closest to R1^{(P0)}(Ψ_{J1∗}). The V-fold cross-validation procedure consists in selecting the pair classifier which minimizes the cross-validated risk R̂1,n. So, we introduce the cross-validated selector ℓ̂1,n ∈ arg min_{ℓ≤L} R̂1,n(t̂ℓ). The cross-validated TSP classifier is finally defined as Ψ̂n = t̂_{ℓ̂1,n}.

Consider now the case of the weighted misclassification risk R2. In that case, each pair classifier can be viewed as a candidate to estimate the TSP classifier Φ_{J2∗} for the weighted misclassification risk R2. One could for instance take L = card(J) and {t̂1, . . . , t̂L} = {Φ̂J, J ∈ J}. Similarly, we set ℓ̂2,n ∈ arg min_{ℓ≤L} R̂2,n(t̂ℓ) and Φ̂n = t̂_{ℓ̂2,n}.
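The following R sketch (synthetic data, illustrative names) makes the V-fold procedure of Sections 4.1 and 4.2 concrete in the special case where the candidate pair classifiers are {Ψ̂J, J ∈ J}: each candidate is trained by majority vote on Dn^{(−v)}, evaluated on Dn^{(v)}, and the selector minimizes the cross-validated misclassification risk R̂1,n.

# Sketch of V-fold cross-validated selection among the pair rules (toy data).
set.seed(3)
n <- 60; G <- 6; V <- 5
X <- matrix(rnorm(n * G), nrow = n)
Y <- rbinom(n, 1, 0.35)
pairs <- t(combn(G, 2))
fold  <- sample(rep(1:V, length.out = n))        # regular partition B_1, ..., B_V
# Majority-vote prediction of Psi_hat_J trained on 'train', applied to 'test'.
psi_hat <- function(J, train, test) {
  Ztr <- as.integer(X[train, J[1]] < X[train, J[2]])
  Zte <- as.integer(X[test,  J[1]] < X[test,  J[2]])
  vote <- function(z) {
    ok <- Ztr == z
    as.integer(sum(Y[train][ok] == 1) >= sum(Y[train][ok] == 0))
  }
  vapply(Zte, vote, integer(1))
}
# Cross-validated risk R_hat_{1,n}(Psi_hat_J) for every pair J.
cv_risk <- apply(pairs, 1, function(J) {
  mean(vapply(1:V, function(v) {
    test <- which(fold == v); train <- which(fold != v)
    mean(psi_hat(J, train, test) != Y[test])     # per-fold empirical risk
  }, numeric(1)))
})
pairs[which.min(cv_risk), ]                      # pair defining the cross-validated TSP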

4.3 Asymptotic performances of the cross-validated TSP classifiers

The asymptotic results that we obtain for the cross-validated TSP classifiers defined in Section 4.2 are similar in nature to those of Dudoit and van der Laan (2005). They are expressed as comparisons to the oracle counterparts of the cross-validated TSP classifiers in terms of risks. Accordingly, define R̃1,n and R̃2,n, the oracle counterparts of R̂1,n and R̂2,n: for any t̂,

R̃1,n(t̂) = (1/V) Σ_{v=1}^V R1^{(P0)}(t̂(Pn^{(−v)})), and

R̃2,n(t̂) = (1/V) Σ_{v=1}^V { E_{P0}[L(O, t̂(Pn^{(−v)})) | Y = 1] 1{Nv(1) > 0} + E_{P0}[L(O, t̂(Pn^{(−v)})) | Y = 0] 1{Nv(0) > 0} }.

They yield the oracle counterparts ℓ̃1,n ∈ arg min_{ℓ≤L} R̃1,n(t̂ℓ) and ℓ̃2,n ∈ arg min_{ℓ≤L} R̃2,n(t̂ℓ) of ℓ̂1,n and ℓ̂2,n, which yield in turn the oracle counterparts Ψ̃n = t̂_{ℓ̃1,n} and Φ̃n = t̂_{ℓ̃2,n} of Ψ̂n and Φ̂n. We obtain the following result:

Theorem 3. It holds that

E[R̃1,n(Ψ̂n) − R̃1,n(Ψ̃n)] = O(√(log(L)/⌊n/V⌋)), and   (13)

E[R̃2,n(Φ̂n) − R̃2,n(Φ̃n)] = O(√(log(L)/(⌊n/V⌋ p))).   (14)

As usual when one deals with cross-validated estimators, the theorem compares Ψ̂n and Φ̂n to their oracle counterparts in terms of the oracle cross-validated risks. The theorem teaches us that, in the limit, Ψ̂n and Φ̂n perform as well as Ψ̃n and Φ̃n.

If we choose {t̂1, . . . , t̂L} equal to {Ψ̂J, J ∈ J} or {Φ̂J, J ∈ J}, then the results in Theorem 3 are similar to those in Theorems 1 and 2. However, the rates of convergence in Theorem 3 are slightly slower than those of Theorems 1 and 2, due to the factor √V.


Equation (13) directly stems from (Dudoit and van der Laan, 2005, Theorem 2). The proof of (14) is postponed to Section 7.4.

5 Numerical study

We gather here the presentations of the application to a real dataset, and the results of a simulation study. The R (R Core Team, 2012) coding of our original TSP procedures was eased by the tspair package of Leek (2012).

5.1 Application to a real dataset

The different versions of the TSP classifier were applied to the Central Nervous System (CNS) cancer dataset. Originally used by Pomeroy et al. (2002) for a study of medulloblastoma (a brain tumor), this dataset is included in the R-package stepwiseCM. The CNS dataset consists of the 60 vectors of gene expression measurements of 7128 genes of 60 patients who received a treatment for medulloblastoma. Twenty-one patients died within two years after the end of their treatment. We tackle the classification problem of recovering whether the patient died or survived based on the gene expression measurements. More specifically, we evaluate the risks R1 and R2 achieved by the different versions of the TSP classifiers and by the stepwise classification rule (implemented in the R-package stepwiseCM). We actually provide two different evaluations, relying either on the leave-one-out rule or on the validation hold-out rule. The training and validation sets (respectively made of 40 and 20 patients) are defined in the package.
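As an illustration of the leave-one-out rule used here, the following R sketch (synthetic stand-in data; the CNS dataset itself is shipped with the stepwiseCM package, and tsp2 is re-implemented in a few lines for the sake of the example) produces leave-one-out predictions and the corresponding empirical estimates of R1 and R2.

# Sketch of a leave-one-out evaluation (toy data standing in for the CNS dataset).
set.seed(4)
n <- 60; G <- 20
X <- matrix(rnorm(n * G), nrow = n)
Y <- rbinom(n, 1, 0.35)
loo_predictions <- function(fit_predict) {
  vapply(1:n, function(k) fit_predict(train = setdiff(1:n, k), test = k), integer(1))
}
# A compact stand-in for tsp2 (assumes both classes are present in the training set).
tsp2_fit_predict <- function(train, test) {
  pairs <- t(combn(G, 2))
  score <- apply(pairs, 1, function(J) {
    Z <- as.integer(X[train, J[1]] < X[train, J[2]])
    abs(mean(Z[Y[train] == 1]) - mean(Z[Y[train] == 0]))
  })
  J <- pairs[which.max(score), ]
  Z <- as.integer(X[train, J[1]] < X[train, J[2]])
  z <- as.integer(X[test, J[1]] < X[test, J[2]])
  if (mean(Z[Y[train] == 1]) > mean(Z[Y[train] == 0])) z else 1L - z
}
pred <- loo_predictions(tsp2_fit_predict)
mean(pred != Y)                                     # leave-one-out estimate of R_1
mean(pred[Y == 1] != 1) + mean(pred[Y == 0] != 0)   # leave-one-out estimate of R_2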

We refer the reader to Table 1 for a succinct presentation of each classifier, and to Tables 2 and 3 for the evaluations of their performances (by leave-one-out in Table 2 and by validation hold-out in Table 3).

classifier
tsp1        empirical TSP classifier for R1
tsp2        empirical TSP classifier for R2
ctsp1(2)    2-fold cross-validated TSP for R1
ctsp1(5)    5-fold cross-validated TSP for R1
ctsp2(2)    2-fold cross-validated TSP for R2
ctsp2(5)    5-fold cross-validated TSP for R2
stepwise    stepwise classifier

Table 1: Description of the different classifiers involved in the numerical study.

Four features of Tables 2 and 3 are especially worth commenting on.

First, it appears that tsp1 performs better than tsp2 in terms of the misclassification error rate R̂1 for both performance evaluations. Note that a TSP for R1 (R2, respectively) is designed to perform well in terms of R1 (R2, respectively).


leave-one-out rule
          tsp1   tsp2   ctsp1(2)   ctsp1(5)   ctsp2(2)   ctsp2(5)   stepwise
1 − R̂1   0.78   0.38   0.80       0.82       0.50       0.38       0.68
1 − R̂2   1.40   0.69   1.52       1.42       0.63       0.96       1.22

Table 2: Performances of the different versions of the TSP classifier on the dataset CNS, with leave-one-out evaluation.

validation hold-out rule
          tsp1   tsp2   ctsp1(2)   ctsp1(5)   ctsp2(2)   ctsp2(5)   stepwise
1 − R̂1   0.80   0.65   0.80       0.80       0.65       0.65       0.85
1 − R̂2   1.43   1.12   1.43       1.43       1.43       1.43       1.59

Table 3: Performances of the different versions of the TSP classifier on the dataset CNS, with validation hold-out evaluation.

Second, and perhaps disappointingly at first glance, we also see that tsp2 does not perform better than tsp1 in terms of the weighted misclassification error rate R̂2, although the definition of tsp2 relies on the weighted misclassification risk R2, and although Theorem 2 guarantees its R2-consistency. This may be a numerical illustration of the fact that its rate of convergence is O(1/√(np)), with p the proportion of the rarer class (and not O(1/√n)). In the numerical example, 21 observations out of 60 belong to the rarer class.

Third, let us comment on the interest of the cross-validated versions of tsp1 and tsp2. On the one hand, we note that both cross-validated versions of tsp1 perform at least as well as tsp1 in terms of R̂1, for both performance evaluations. On the other hand, we note that, except for ctsp2(2), the cross-validated versions of tsp2 perform better than tsp2 in terms of R̂2, for both performance evaluations.

Fourth, comparing what can be compared, tsp1 performs better than stepwise in terms of the leave-one-out evaluation relative to R̂1 (5 more correct labellings), but slightly worse in terms of the validation hold-out evaluation relative to R̂1 (one less correct labelling).

5.2 Simulation study

In light of the third comment above, we now undertake a simulation study of the influence of the sample size and true probability of the rarer class on the performances of tsp2 relative to those of tsp1. The simulation scheme relies on the dataset CNS. To lessen the computational burden, we only consider the gene expression measurements of the first 100 genes of the original dataset. The simulation of an observation (X, Y ) meets the following constraints:

(i) The label Y is drawn from the Bernoulli law with parameter 1 − p = 0.8. Thus, the rarer class has probability p = 0.2.

(ii) The vector of gene expression measurements X is subsequently drawn conditionally on Y from a slightly perturbed version of the empirical conditional distribution of X given Y in the CNS dataset.

We rely on the leave-one-out rule to evaluate and compare the performances of tsp1 and tsp2. More specifically, we repeat independently B = 100 times the following steps:

1. simulate a dataset of sample size n = 60 (hence np = 12);

2. compute the performances of tsp1 and tsp2 (leave-one-out rule) over the rare and the frequent classes separately.

From these results, we compute the mean and standard deviation of the performances obtained by each classifier. The results are presented in Table 4.

classification performances (leave-one-out rule)
        rare+frequent class   rare class    frequent class
tsp1    0.75 (0.11)           0.13 (0.19)   0.89 (0.09)
tsp2    0.61 (0.17)           0.45 (0.31)   0.65 (0.19)

Table 4: Classification performances of tsp1 and tsp2 on simulated data with np = 12 (p = 0.2, n = 60). We report the empirical mean (and standard deviation, between parentheses) of the performances of each classifier over both classes (first column), the rare class (second column) and the frequent class (third column).

Although the standard deviations are rather large (especially for tsp2), we can draw interesting conclusions from Table 4. (Note that in each column, the standard deviations are larger for tsp2 than for tsp1. This may be due to the fact that the rate of convergence of tsp2 is O(1/√(np)) and not O(1/√n); more on this later.) First, we see again that tsp1 performs better than tsp2 in terms of R1. Inspecting the second and third columns of the table confirms the intuition that this happens because tsp1 does a good job on the frequent class and a poor one (at low cost for R1) on the rare class. By construction, tsp2 outperforms tsp1 as far as the rare class is concerned.

We now take a closer look at the influence of the sample size on tsp2, for fixed p = 0.2. To that end, we repeat independently B = 100 times the above two-step simulation scheme with n = 300, for tsp2 only. The results are presented in Table 5.

It is striking that the performances of tsp2 on both classes and on the frequent class alone are almost identical in Tables 4 and 5. (In particular, this suggests that the larger standard deviations attached to tsp2 relative to tsp1 are not due to the difference in rates of convergence.) On the contrary, increasing the sample size does seem to enhance the performances of tsp2 over the rare class (both in mean and standard deviation).

classification performance (leave-one-out rule)
        rare+frequent class   rare class    frequent class
tsp2    0.64 (0.13)           0.54 (0.22)   0.66 (0.17)

Table 5: Classification performances of tsp2 only on simulated data with np = 60 (p = 0.2, n = 300). We report the empirical mean (and standard deviation, between parentheses) of the performances computed over both classes (first column), the rare class (second column) and the frequent class (third column).

6 Discussion

The TSP procedures for binary classification that we have studied here involve only one TSP. In future work, we will extend our results to TSP procedures for multi-class classification that may involve several TSPs, in the spirit of (Tan et al., 2005; Zhou et al., 2012).

Obviously, the TSP procedures do not lead in general to optimal classification rules. In future work, we will characterize and study the families of distributions for which the TSP procedures lead to (near) optimal classifiers.

Acknowledgments

The author warmly thanks his supervisor A. Chambaz for his helpful suggestions throughout this work.

7 Proofs

This section gathers the proofs of Theorems 1, 2 and 3.

7.1 Two useful lemmas

Lemma 4. Set two positive integers N, M and introduce the function f defined on the set of non-negative real numbers by f(x) = min(1, exp(log(2M) − 2Nx²)). The following inequality holds:

∫_0^{+∞} f(x) dx ≤ √(log(2M)/(2N)) + √π/(2√(2N)).

Proof. For all x ≥ 0, we have f(x) = exp(−(2Nx² − log(2M))₊). Therefore

∫_0^{+∞} f(x) dx = √(log(2M)/(2N)) + ∫_{x ≥ √(log(2M)/(2N))} exp(−(2Nx² − log(2M))) dx.   (15)

Since a² − b² ≥ (a − b)² for a ≥ b ≥ 0, note that:

∫_{x ≥ √(log(2M)/(2N))} exp(−(2Nx² − log(2M))) dx ≤ ∫_{x ≥ √(log(2M)/(2N))} exp(−2N(x − √(log(2M)/(2N)))²) dx = (1/√(2N)) ∫_0^{+∞} exp(−x²) dx = √π/(2√(2N)).   (16)


Finally, Equation (15) and Equation (16) yield the result.

Lemma 5. Let Z be a random variable with the binomial law B(n, p). Then

E[1{Z > 0}/√Z] ≤ √(2/((n + 1)p)).

Proof. By the Cauchy-Schwarz inequality, it holds that

(E[1/√(Z + 1)])² ≤ E[1/(Z + 1)] = Σ_{k=0}^n (1/(k + 1)) C(n, k) p^k (1 − p)^{n−k} = ∫_0^1 (xp + 1 − p)^n dx ≤ 1/((n + 1)p).

Now, since 1/√k ≤ √2/√(k + 1) for all k ≥ 1, we obtain

E[1{Z > 0}/√Z] = Σ_{k=1}^n (1/√k) C(n, k) p^k (1 − p)^{n−k} ≤ √2 Σ_{k=1}^n (1/√(k + 1)) C(n, k) p^k (1 − p)^{n−k} ≤ √2 E[1/√(Z + 1)] ≤ √(2/((n + 1)p)),

which is the stated result.
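A quick Monte Carlo check of Lemma 5 in R (toy parameter values chosen for illustration):

# Monte Carlo check of Lemma 5: E[1{Z > 0}/sqrt(Z)] for Z ~ B(n, p) should not
# exceed sqrt(2 / ((n + 1) * p)).
set.seed(5)
n <- 50; p <- 0.1
Z <- rbinom(1e5, n, p)
mean(ifelse(Z > 0, 1 / sqrt(Z), 0))   # empirical left-hand side
sqrt(2 / ((n + 1) * p))               # upper bound from Lemma 5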

7.2 Proof of Theorem 1

The proof of Theorem 1 relies on the characterization (11) of the empirical TSP classifier. We have:

0 ≤ R1^{(P0)}(Ψ̂_{Ĵ1}) − R1^{(P0)}(Ψ_{J1∗}) = (R1^{(P0)}(Ψ̂_{Ĵ1}) − R̂1(Ψ̂_{Ĵ1})) + (R̂1(Ψ̂_{Ĵ1}) − R1^{(P0)}(Ψ_{J1∗})).

By (11), this yields that

0 ≤ R1^{(P0)}(Ψ̂_{Ĵ1}) − R1^{(P0)}(Ψ_{J1∗}) ≤ 2 sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)|.

Therefore

0 ≤ E[R1^{(P0)}(Ψ̂_{Ĵ1}) − R1^{(P0)}(Ψ_{J1∗})] ≤ 2 E[sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)|].   (17)

Next, we provide an upper bound for the right-hand side expectation. By the Bonferroni inequality, we have, for all h ≥ 0,

P(sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)| ≥ h) ≤ min(1, Σ_{t∈Fpair} P(|R1^{(P0)}(t) − R̂1(t)| ≥ h)).


Since, for each t ∈ Fpair, R̂1(t) is an empirical mean of i.i.d. Bernoulli random variables with common mean R1^{(P0)}(t), we deduce from Hoeffding's inequality that:

P(sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)| ≥ h) ≤ min(1, exp(log(2 card(Fpair)) − 2nh²)).

Now, with card(Fpair) = 4 card(J),

E[sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)|] = ∫_0^{+∞} P(sup_{t∈Fpair} |R1^{(P0)}(t) − R̂1(t)| ≥ h) dh ≤ √(log(8 card(J))/(2n)) + √π/(2√(2n)),

by Lemma 4. Then (17) yields the theorem.

7.3 Proof of Theorem 2

We have:

0 ≤ R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{J2∗})
  = (R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{Ĵ2})) + (R2^{(P0)}(Φ_{Ĵ2}) − R2^{(P0)}(Φ_{J2∗}))
  = (R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{Ĵ2})) + (∆_{J2∗} − ∆_{Ĵ2})
  = (R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{Ĵ2})) + (∆_{J2∗} − ∆̂_{J2∗}) + (∆̂_{J2∗} − ∆_{Ĵ2})
  ≤ (R2^{(P0)}(Φ̂_{Ĵ2}) − R2^{(P0)}(Φ_{Ĵ2})) + (∆_{J2∗} − ∆̂_{J2∗}) + (∆̂_{Ĵ2} − ∆_{Ĵ2}),

by definition of Ĵ2.

To complete the proof, it remains to control E[1{0 < N(1) < n} sup_{J∈J} |∆J − ∆̂J|] and E[1{0 < N(1) < n} sup_{J∈J} |R2^{(P0)}(ΦJ) − R2^{(P0)}(Φ̂J)|], by relying on Lemmas 6 and 7.

Lemma 6. For all J ∈ J, it holds that

|R2^{(P0)}(Φ̂J) − R2^{(P0)}(ΦJ)| ≤ 2 (|p̂J(1) − pJ(1)| + |p̂J(0) − pJ(0)|), and   (18)

|∆̂J − ∆J| ≤ |p̂J(1) − pJ(1)| + |p̂J(0) − pJ(0)|.   (19)

Proof. Inequality (18) is a by-product of Lemma 2 and the fact that, for each y ∈ {0, 1}, {Φ̂J(y) ≠ ΦJ(y)} implies ∆J = |pJ(1) − pJ(0)| ≤ |p̂J(1) − pJ(1)| + |p̂J(0) − pJ(0)|. To show this implication, we just check one of the four different cases that can arise (the others can be addressed similarly). For instance, if y = 1 and Φ̂J(1) = 0, then p̂J(0) ≥ p̂J(1) and pJ(0) < pJ(1). Thus,

∆J = |pJ(1) − pJ(0)| = pJ(1) − pJ(0) = (pJ(1) − p̂J(1)) + (p̂J(1) − pJ(0))
   ≤ (pJ(1) − p̂J(1)) + (p̂J(0) − pJ(0))
   ≤ |p̂J(1) − pJ(1)| + |p̂J(0) − pJ(0)|.


Lemma 7. For each y ∈ {0, 1}, it holds that

E_{P0}[1{N(y) > 0} sup_{J∈J} |p̂J(y) − pJ(y)|] ≤ √(2 log(2 card(J))/(np)) + √(π/(2np)).   (20)

Proof. By symmetry, it suffices to present the proof in the case where y = 1. Let Y denote the σ-field spanned by {Yk, k = 1, . . . , n}. We have:

E_{P0}[1{N(1) > 0} sup_{J∈J} |p̂J(1) − pJ(1)|] = E_{P0}[E[1{N(1) > 0} sup_{J∈J} |p̂J(1) − pJ(1)| | Y]],

which equals

E_{P0}[1{N(1) > 0} ∫_0^{+∞} P(sup_{J∈J} |p̂J(1) − pJ(1)| ≥ h | Y) dh].

If N(1) > 0 then, conditionally on Y and for each J ∈ J, the random variable p̂J(1) is an empirical mean of i.i.d. Bernoulli random variables with common mean pJ(1). Therefore, by the Bonferroni and Hoeffding inequalities, we obtain, for all h ≥ 0:

1{N(1) > 0} P(sup_{J∈J} |p̂J(1) − pJ(1)| ≥ h | Y) ≤ 1{N(1) > 0} min(1, exp(log(2 card(J)) − 2N(1)h²)).

Applying Lemma 4 then gives

1{N(1) > 0} ∫_0^{+∞} P(sup_{J∈J} |p̂J(1) − pJ(1)| ≥ h | Y) dh ≤ (1{N(1) > 0}/√(2N(1))) (√(log(2 card(J))) + √π/2).   (21)

Since N(1) has the binomial law B(n, p1), (21) and Lemma 5 yield the result.

7.4 Proof of Theorem 3

We recall that (13) directly stems from Dudoit and van der Laan (2005). We now give the proof of (14). By definition of ℓ̃2,n, one has

0 ≤ R̃2,n(Φ̂n) − R̃2,n(Φ̃n) = (R̃2,n(Φ̂n) − R̂2,n(Φ̂n)) + (R̂2,n(Φ̂n) − R̃2,n(Φ̃n))
  ≤ (R̃2,n(Φ̂n) − R̂2,n(Φ̂n)) + (R̂2,n(Φ̃n) − R̃2,n(Φ̃n))
  ≤ 2 sup_{ℓ≤L} |R̂2,n(t̂ℓ) − R̃2,n(t̂ℓ)|.


Now, for each ℓ ∈ {1, . . . , L}, R̂2,n(t̂ℓ) − R̃2,n(t̂ℓ) is equal to

(1/V) Σ_{v=1}^V (1{Nv(1) > 0}/Nv(1)) Σ_{i∈Iv(1)} ( L(Oi, t̂ℓ(Pn^{(−v)})) − E_{P0}[L(O, t̂ℓ(Pn^{(−v)})) | Y = 1] )
+ (1/V) Σ_{v=1}^V (1{Nv(0) > 0}/Nv(0)) Σ_{i∈Iv(0)} ( L(Oi, t̂ℓ(Pn^{(−v)})) − E_{P0}[L(O, t̂ℓ(Pn^{(−v)})) | Y = 0] ),

hence

sup_{ℓ≤L} |R̂2,n(t̂ℓ) − R̃2,n(t̂ℓ)| ≤ (1/V) Σ_{v=1}^V ( sup_{ℓ∈{1,...,L}} |H^1_{ℓ,v}| + sup_{ℓ∈{1,...,L}} |H^0_{ℓ,v}| ),   (22)

where, for y = 0, 1,

H^y_{ℓ,v} = (1{Nv(y) > 0}/Nv(y)) Σ_{i∈Iv(y)} ( L(Oi, t̂ℓ(Pn^{(−v)})) − E_{P0}[L(O, t̂ℓ(Pn^{(−v)})) | Y = y] ).

For each v ∈ {1, . . . , V} and y ∈ {0, 1}, conditionally on Dn^{(−v)} and (Yi)_{i∈Bv}, H^y_{ℓ,v} is an empirical mean of i.i.d. bounded centered random variables. Thus, the Bonferroni and Hoeffding inequalities imply that, for all h ≥ 0,

P(sup_{ℓ∈{1,...,L}} |H^y_{ℓ,v}| ≥ h | Dn^{(−v)}, (Yi)_{i∈Bv}) ≤ min(1, exp(log(2L) − 2Nv(y)h²)),

so that, for each v ∈ {1, . . . , V}, we deduce by Lemma 4 that

E[sup_{ℓ∈{1,...,L}} |H^y_{ℓ,v}| | Dn^{(−v)}, (Yi)_{i∈Bv}] ≤ (1{Nv(y) > 0}/√(2Nv(y))) (√(log(2L)) + √π/2).

Since Nv(y) has the binomial law B(card(Bv), py), we complete the proof by applying again Lemma 5 and (22).

References

Arlot, S. (2007). Rééchantillonnage et sélection de modèles. PhD thesis, Université Paris-Sud, Orsay.

Bousquet, O., Boucheron, S. and Lugosi, G. (2004). Introduction to statistical learning theory. Advanced Lectures in Machine Learning, Springer, pp. 169–207.

Chambaz, A. and Denis, C. (2012). Classification in postural style. Annals of Applied Statistics 6, 977–993.

Czajkowski, M. and Kretowski, M. (2011). Top scoring pair decision tree for gene expression data analysis. Software Tools and Algorithms for Biological Systems, Springer 3, 27–36.

Dudoit, S. and van der Laan, M. J. (2005). Asymptotics of cross-validated risk estimation in estimator selection and performance assessment. Stat. Methodol. 2, 131–154. ISSN 1572-3127.

Geman, D., d'Avignon, C., Naiman, D. Q. and Winslow, R. L. (2004). Classifying gene expression profiles from pairwise mRNA comparisons. Statistical Applications in Genetics and Molecular Biology 3.

Leek, J. T. (2012). tspair: Top Scoring Pairs for Microarray Classification. R package version 1.16.0.

Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J., Lander, E. and Golub, T. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415, 436–442.

R Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/. ISBN 3-900051-07-0.

Tan, A., Naiman, D., Xu, L., Winslow, R. and Geman, D. (2005). Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 21, 3896–3904.

Yang, Y., Dudoit, S., Luu, P., Peng, V., Ngai, J. and Speed, T. (2001). Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics 4266, 141–152.

Zhou, C., Wang, S., Blanzieri, E. and Liang, Y. (2012). An entropy-based improved k-top scoring pairs (TSP) method for classifying human cancers. African Journal of

