Variable selection in discriminant analysis for mixed variables and several groups


HAL Id: hal-01491838

https://hal.archives-ouvertes.fr/hal-01491838

Preprint submitted on 17 Mar 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.



Alban Mbina Mbina, Guy Nkiet, Fulgence Eyi-Obiang

To cite this version:

Alban Mbina Mbina, Guy Nkiet, Fulgence Eyi-Obiang. Variable selection in discriminant analysis for mixed variables and several groups. 2017. ⟨hal-01491838⟩


Variable selection in discriminant analysis for mixed variables and several groups

Alban Mbina Mbina^{1,a}, Guy Martial Nkiet^{1,b}, Fulgence Eyi-Obiang^{1,c}

^1 Laboratoire URMI, Université des Sciences et Techniques de Masuku, BP 943 Franceville, Gabon.

Email: ^a [email protected], ^b [email protected], ^c [email protected]

Abstract

We propose a method for variable selection in discriminant analysis with mixed categorical and continuous variables. This method is based on a criterion that reduces the variable selection problem to that of estimating a suitable permutation and dimensionality. Estimators for these parameters are then proposed, and the resulting method for selecting variables is shown to be consistent. A simulation study examines several properties of the proposed approach and compares it with an existing method.

Keywords:

Variable selection; Discriminant analysis; Classification; Mixed variables

MSC:

62H30; 62H12

1 Introduction

The problem of classifying an observation into one of several classes on the basis of data consisting of both continuous and categorical variables is an old problem that has been tackled under different forms in the literature. The earliest works in this field go back to Chang and Afifi (1974) and Krzanowski (1975), who used the location model introduced by Olkin and Tate (1961) to form a classification rule in the context of discriminant analysis involving two groups.

More recent work has focused on defining distance measures between populations or making inference on them (e.g., Krzanowski 1983, Krzanowski 1984, Bar-Hen and Daudin 1995, Bedrick et al. 2000, de Leon and Carrière 2005). One of the most important problems in this context is that of selecting the appropriate categorical and/or continuous variables to use for discrimination. Indeed, it is well recognized that using fewer variables improves classification performance and avoids estimation problems (e.g. McLachlan 1992, Mahat et al. 2007). Several works deal with this problem, mainly in the context of the location model. Some of these works are based on the use of distances between populations for determining the most predictive variables (Krzanowski 1983, Daudin 1986, Bar-Hen and Daudin 1995, Daudin and Bar-Hen 1999). Krusinska (1989a, 1989b, 1990) used methods based on the percentage of misclassification, Hotelling's $T^2$ and graphical models. More recently, Mahat et al. (2007) proposed a method based on the distance between groups as measured by a smoothed Kullback-Leibler divergence. All these works consider the case of two groups and, to the best of our knowledge, the case of more than two groups has not yet been considered for variable selection. So, it is of great interest to introduce a method that can be used when the number of groups is greater than two. Such an approach was proposed recently in Nkiet (2012) for the case of continuous variables only. It is based on a criterion that characterizes the set of variables that are appropriate for discrimination by means of two parameters, so that the variable selection problem reduces to that of estimating these parameters.

In this paper, we extend the approach of Nkiet (2012) to the case of mixed variables. The resulting method has two advantages: first, it can be used when the number of groups is greater than two, and secondly, it only requires that the random vector consisting of the continuous variables has a finite fourth order moment. No assumption on the distribution of this random vector is needed and, therefore, we do not suppose that the location model holds. In Section 2, we introduce a criterion by means of which the set of variables to be estimated is characterized through a suitable permutation and dimensionality. Estimation of this criterion is tackled in Section 3. More precisely, empirical estimators as well as a non-parametric smoothing procedure are used for defining an estimator of the criterion. In the first case, we obtain properties of the resulting estimator that allow us to derive its asymptotic distribution. Section 4 is devoted to the definition of our proposal for variable selection. Consistency of the method, when empirical estimators are used, is then proved. Section 5 presents numerical experiments made in order to study several properties of the proposal and to compare it with an existing method. The first issue addressed is the impact of the choice of the penalty functions involved in our procedure, and that of the type of estimator used. The results reveal a low impact on the performance of the proposed method. Since this method depends on two real parameters, it is of interest to study their influence on its performance and, consequently, to define a strategy for choosing optimal values for them. The simulation results clearly show their impact on the performance, and we propose a method based on leave-one-out cross validation for obtaining optimal results. When using this approach, the obtained results show that the proposal is competitive with that of Mahat et al. (2007). All the proofs are given in Section 6.

2 Statement of the problem

Letting $(\Omega,\mathcal{A},P)$ be a probability space, we consider random vectors $X=(X^{(1)},\cdots,X^{(p)})^T$ and $Y=(Y^{(1)},\cdots,Y^{(d)})^T$ defined on this probability space and valued into $\mathbb{R}^p$ and $\{0,1\}^d$ respectively. The r.v. $X$ consists of continuous random variables whereas $Y$ consists of binary random variables. As usual, $Y$ may be associated to a multinomial random variable by considering $U=1+\sum_{j=1}^{d}Y^{(j)}2^{j-1}$, which is valued into $\{1,\cdots,M\}$, where $M=2^d$. Suppose that the observations of $(X,Y)$ come from $q$ groups $\pi_1,\cdots,\pi_q$ (with $q\geq 2$) characterized by a random variable $Z$ valued into $\{1,\cdots,q\}$; this means that $(X,Y)$ belongs to $\pi_\ell$ if, and only if, one has $Z=\ell$. Such a framework has been considered in the literature for classification purposes. Indeed, for the case of two groups, that is when $q=2$, Krzanowski (1975) proposed a classification rule based on the location model under the assumption that the distribution of $X$, conditionally on $U=m$ and $Z=\ell$, is the multivariate normal distribution $N(\mu_{m\ell},\Sigma)$. This rule allocates a future observation $(x,y)$ of $(X,Y)$ to $\pi_1$ if

$$(\mu_{m1}-\mu_{m2})^T\Sigma^{-1}\left(x-\frac{1}{2}(\mu_{m1}+\mu_{m2})\right)\geq\log\left(\frac{p_{m2}}{p_{m1}}\right)+\log(\alpha), \tag{1}$$

where $m=1+\sum_{j=1}^{d}y^{(j)}2^{j-1}$, $p_{m\ell}=P(U=m|Z=\ell)$ and $\alpha$ is a constant that depends on the costs due to misclassification and on the prior probabilities for the two groups. The case where $q>2$ was considered by de Leon et al. (2011) in the context of general mixed-data models. In this case, the optimum rule classifies an observation $(x,y)$ as belonging to $\pi_{\ell^*}$ if

$$\delta_m^{(\ell^*)}(x,y)=\max_{\ell=1,\cdots,q}\delta_m^{(\ell)}(x,y) \tag{2}$$

where

$$\delta_m^{(\ell)}(x,y)=(\mu_{m\ell})^T\Sigma^{-1}x-\frac{1}{2}(\mu_{m\ell})^T\Sigma^{-1}\mu_{m\ell}+\log(p_{m\ell})+\log(\beta_\ell),$$

with $\beta_\ell=P(Z=\ell)$. As can be seen, these rules involve observations of all the variables $X^{(j)}$ in $X$. Nevertheless, as is well recognized (see, e.g., McLachlan 1992, Mahat et al. 2007), using fewer variables improves classification performance. So, it is of real interest to perform selection of the $X^{(j)}$'s from a sample of $(X,Y,Z)$. To do so, we extend to the case of mixed variables an approach used in Nkiet (2012) for the case of continuous variables only. This approach first consists in introducing a criterion by means of which the set of variables that are adequate for discrimination is characterized. From now on, we assume that $E(\|X\|^4)<+\infty$, where $\|\cdot\|$ denotes the usual Euclidean norm of $\mathbb{R}^p$, and for any $m\in\{1,\cdots,M\}$ we consider

$$p_m=P(U=m),\quad \mu_m=E(X|U=m)\quad\text{and}\quad \mu_{\ell,m}=E(X|Z=\ell,U=m).$$

We denote by $\otimes$ the tensor product defined as follows: for any $(u,v)\in(\mathbb{R}^p)^2$, $u\otimes v$ is the linear map $h\in\mathbb{R}^p\mapsto\langle u,h\rangle v$, where $\langle\cdot,\cdot\rangle$ is the usual Euclidean inner product of $\mathbb{R}^p$. Then, we consider the conditional covariance operator given by

$$V_m=E\left((X-\mu_m)\otimes(X-\mu_m)\,|\,U=m\right)$$

and assume that it is invertible. Let us put $I:=\{1,\cdots,p\}$ and, for any subset $K$ of $I$:

$$Q_{K|m}:=A_K^*(A_KV_mA_K^*)^{-1}A_K \tag{3}$$

where $A_K$ denotes the projector

$$x=(x_i)_{i\in I}\in\mathbb{R}^p\mapsto x_K=(x_i)_{i\in K}\in\mathbb{R}^{\mathrm{card}(K)}.$$

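Then, putting

$$p_{\ell|m}=P(Z=\ell|U=m),$$

we introduce

$$\xi_{K|m}=\sum_{\ell=1}^{q}p_{\ell|m}^2\left\|\left(I_{\mathbb{R}^p}-V_mQ_{K|m}\right)(\mu_{\ell,m}-\mu_m)\right\|^2, \tag{4}$$

where $I_{\mathbb{R}^p}$ is the identity operator of $\mathbb{R}^p$. This leads us to consider the criterion

$$\xi_K=\sum_{m=1}^{M}p_m^2\,\xi_{K|m} \tag{5}$$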

by means of which we will try to characterize the subset of variables that is adequate for discrimination. For defining this subset of $I$, let us keep in mind that if a variable $X^{(j)}$ is not adequate for discrimination, then it is the case whatever $Y$. So, denoting by $I_{0,m}$ the subset of $I$ consisting of continuous variables in $X$ that are not adequate for discrimination when $U=m$, we can consider the subset

$$I_0=\bigcap_{m=1}^{M}I_{0,m}$$

as the subset of variables that are not adequate for discrimination in the mixed case. Therefore, our problem reduces to the problem of estimating the subset $I_1$ given by

$$I_1=I-I_0=\bigcup_{m=1}^{M}I_{1,m}$$

where $I_{1,m}=I-I_{0,m}$. As was done in Nkiet (2012) for the case of continuous variables only, an explicit expression of $I_{1,m}$ can be obtained by using results from McKay (1977). Indeed, let $\lambda_{1,m}\geq\lambda_{2,m}\geq\cdots\geq\lambda_{p,m}$ denote the eigenvalues of $T_m=V_m^{-1}B_m$, where $B_m$ is the between groups covariance operator conditionally to $U=m$ given by

$$B_m=\sum_{\ell=1}^{q}p_{\ell|m}(\mu_{\ell,m}-\mu_m)\otimes(\mu_{\ell,m}-\mu_m),$$

and let $\upsilon_i^m=(\upsilon_{i1}^m,\cdots,\upsilon_{ip}^m)^T$ ($i=1,\cdots,p$) be an eigenvector of $T_m$ associated with $\lambda_{i,m}$. Then, $I_{1,m}=\{k\in I\,|\,\exists i\in\{1,\cdots,r_m\},\ \upsilon_{ik}^m\neq 0\}$. Now, we are able to give a characterization of $I_1$ by means of the criterion given in (5).

Proposition 1. For $K\subset I$, we have $\xi_K=0$ if and only if $I_1\subset K$.

From this proposition, it is easily seen that, putting $K_i=I-\{i\}$, one has the equivalence: $\xi_{K_i}>0\Leftrightarrow i\in I_1$. Now, let $\sigma$ be the permutation of $I$ such that:

(A1) $\xi_{K_{\sigma(1)}}\geq\xi_{K_{\sigma(2)}}\geq\cdots\geq\xi_{K_{\sigma(p)}}$;

(A2) $\xi_{K_{\sigma(i)}}=\xi_{K_{\sigma(j)}}$ and $i<j$ imply $\sigma(i)<\sigma(j)$.

Since $I_1$ is a non-empty set, there exists an integer $s\in I$ which is equal to $p$ when $I_1=I$, and satisfying

$$\xi_{K_{\sigma(1)}}\geq\cdots\geq\xi_{K_{\sigma(s)}}>\xi_{K_{\sigma(s+1)}}=\cdots=\xi_{K_{\sigma(p)}}=0$$

when $I_1\neq I$. Hence

$$I_1=\{\sigma(i)\,;\,1\leq i\leq s\}.$$

Therefore, estimating $I_1$ reduces to estimating the two parameters $\sigma$ and $s$. For doing that, we first need to consider an estimator of the criterion given in (5).

3 Estimating the criterion

Let $\{(X_i,Y_i,Z_i)\}_{1\leq i\leq n}$ be an i.i.d. sample of $(X,Y,Z)$ with $X_i=(X_i^{(1)},\cdots,X_i^{(p)})^T$ and $Y_i=(Y_i^{(1)},\cdots,Y_i^{(d)})^T$; we put $U_i=1+\sum_{j=1}^{d}Y_i^{(j)}2^{j-1}$. In this section, we define estimators for the criterion given in (5) by estimating the parameters involved in its definition. First, empirical estimators are introduced and properties of the resulting estimator of the criterion are given; secondly, we consider estimators obtained by using non-parametric smoothing procedures as in Mahat et al. (2007).
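The correspondence between a binary vector $Y_i$ and its cell index $U_i$ can be illustrated with a short sketch. The function names `cell_index` and `cell_pattern` are hypothetical and are introduced here only for illustration.

```python
def cell_index(y):
    """Map a binary vector y = (y^(1), ..., y^(d)) to U = 1 + sum_j y^(j) * 2**(j-1)."""
    # enumerate starts at 0, so the j-th coordinate (1-based) is shifted by j-1 bits
    return 1 + sum(bit << j for j, bit in enumerate(y))


def cell_pattern(u, d):
    """Inverse map: recover the binary vector of length d from a cell index u in 1..2**d."""
    return [((u - 1) >> j) & 1 for j in range(d)]


d = 3
print(cell_index([0, 0, 0]))  # 1
print(cell_index([1, 0, 1]))  # 1 + 1 + 4 = 6
print(cell_pattern(6, d))     # [1, 0, 1]
```

With $d$ binary variables the index runs over $M=2^d$ cells, which is why empty or nearly empty cells become common as $d$ grows.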

3.1 Empirical estimators

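Putting

$$\hat N_m^{(n)}=\sum_{i=1}^{n}1_{\{U_i=m\}}\quad\text{and}\quad\hat N_{\ell,m}^{(n)}=\sum_{i=1}^{n}1_{\{Z_i=\ell,U_i=m\}},$$

we estimate $p_m$, $p_{\ell|m}$, $\mu_m$, $\mu_{\ell,m}$ and $V_m$ respectively by:

$$\hat p_m^{(n)}=\frac{\hat N_m^{(n)}}{n},\quad\hat p_{\ell|m}^{(n)}=\frac{\hat N_{\ell,m}^{(n)}}{\hat N_m^{(n)}},\quad\hat\mu_m^{(n)}=\frac{1}{\hat N_m^{(n)}}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i,$$

$$\hat\mu_{\ell,m}^{(n)}=\frac{1}{\hat N_{\ell,m}^{(n)}}\sum_{i=1}^{n}1_{\{Z_i=\ell,U_i=m\}}X_i$$

and

$$\hat V_m^{(n)}=\frac{1}{\hat N_m^{(n)}}\sum_{i=1}^{n}1_{\{U_i=m\}}(X_i-\hat\mu_m^{(n)})\otimes(X_i-\hat\mu_m^{(n)}).$$

Then, considering

$$\hat Q_{K|m}^{(n)}=A_K^*\left(A_K\hat V_m^{(n)}A_K^*\right)^{-1}A_K$$

and

$$\hat\xi_{K|m}^{(n)}=\sum_{\ell=1}^{q}\left(\hat p_{\ell|m}^{(n)}\right)^2\left\|\left(I_{\mathbb{R}^p}-\hat V_m^{(n)}\hat Q_{K|m}^{(n)}\right)\left(\hat\mu_{\ell,m}^{(n)}-\hat\mu_m^{(n)}\right)\right\|^2,$$

we take as estimator of $\xi_K$ the random variable $\hat\xi_K^{(n)}$ defined by:

$$\hat\xi_K^{(n)}=\sum_{m=1}^{M}\left(\hat p_m^{(n)}\right)^2\hat\xi_{K|m}^{(n)}. \tag{6}$$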
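As an illustration, the empirical criterion (6) can be computed directly from its definition. The following is a minimal pure-Python sketch, not the authors' implementation: the function names `solve`, `col_mean` and `xi_hat`, the 0-based indexing of the variables and the toy data are assumptions made here. It uses the fact that applying $Q_{K|m}$ to a vector only requires solving a linear system with the sub-block of $\hat V_m^{(n)}$ indexed by $K$, and it assumes that this sub-block is invertible in every non-empty cell.

```python
from collections import defaultdict


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.
    Raises ZeroDivisionError when A is singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def col_mean(rows):
    return [sum(row[j] for row in rows) / len(rows) for j in range(len(rows[0]))]


def xi_hat(X, U, Z, K):
    """Empirical criterion (6) for a subset K of 0-based variable indices."""
    n, p, K = len(X), len(X[0]), sorted(K)
    cells = defaultdict(list)
    for i, u in enumerate(U):
        cells[u].append(i)
    total = 0.0
    for idx in cells.values():                      # loop over non-empty cells m
        rows = [X[i] for i in idx]
        mu = col_mean(rows)
        V = [[sum((x[a] - mu[a]) * (x[b] - mu[b]) for x in rows) / len(rows)
              for b in range(p)] for a in range(p)]
        V_KK = [[V[a][b] for b in K] for a in K]    # A_K V A_K^*
        groups = defaultdict(list)
        for i in idx:
            groups[Z[i]].append(X[i])
        xi_m = 0.0
        for rows_l in groups.values():              # loop over groups l in cell m
            p_lm = len(rows_l) / len(idx)
            mu_l = col_mean(rows_l)
            d = [mu_l[j] - mu[j] for j in range(p)]
            z = solve(V_KK, [d[k] for k in K])      # (A_K V A_K^*)^{-1} A_K d
            resid = [d[a] - sum(V[a][k] * zk for k, zk in zip(K, z)) for a in range(p)]
            xi_m += p_lm ** 2 * sum(v * v for v in resid)
        total += (len(idx) / n) ** 2 * xi_m
    return total


# Toy data (made up): variable 0 separates the two groups, variable 1 does not.
square = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
X, U, Z = [], [], []
for m in (1, 2):
    for z_label, shift in ((1, 0.0), (2, 3.0)):
        for a, b in square:
            X.append([shift + a, b])
            U.append(m)
            Z.append(z_label)

print(xi_hat(X, U, Z, K=[1]))  # variable 0 left out: 0.5625
print(xi_hat(X, U, Z, K=[0]))  # variable 0 kept: 0 up to rounding
```

On this toy sample the criterion vanishes exactly when the discriminating variable belongs to $K$, which is the behaviour described by Proposition 1 at the population level.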

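Now, we give a result which establishes strong consistency for $\hat\xi_K^{(n)}$ and will be useful for determining its asymptotic distribution and the consistency of the proposed method for selecting variables. Let us consider the random variables

$$\mathcal{X}=\begin{pmatrix}1_{\{Z=1,U=1\}}X&\cdots&1_{\{Z=1,U=M\}}X\\1_{\{Z=2,U=1\}}X&\cdots&1_{\{Z=2,U=M\}}X\\\vdots&\ddots&\vdots\\1_{\{Z=q,U=1\}}X&\cdots&1_{\{Z=q,U=M\}}X\end{pmatrix},\quad\mathcal{X}_i=\begin{pmatrix}1_{\{Z_i=1,U_i=1\}}X_i&\cdots&1_{\{Z_i=1,U_i=M\}}X_i\\1_{\{Z_i=2,U_i=1\}}X_i&\cdots&1_{\{Z_i=2,U_i=M\}}X_i\\\vdots&\ddots&\vdots\\1_{\{Z_i=q,U_i=1\}}X_i&\cdots&1_{\{Z_i=q,U_i=M\}}X_i\end{pmatrix}$$

valued into the space $\mathcal{M}_{qp,M}(\mathbb{R})$ of $pq\times M$ matrices. We also introduce the random vectors

$$\mathcal{Y}=\begin{pmatrix}1_{\{U=1\}}X\\1_{\{U=2\}}X\\\vdots\\1_{\{U=M\}}X\end{pmatrix},\quad\mathcal{Y}_i=\begin{pmatrix}1_{\{U_i=1\}}X_i\\1_{\{U_i=2\}}X_i\\\vdots\\1_{\{U_i=M\}}X_i\end{pmatrix},\quad\mathcal{Z}=\begin{pmatrix}1_{\{U=1\}}\\1_{\{U=2\}}\\\vdots\\1_{\{U=M\}}\end{pmatrix},\quad\mathcal{Z}_i=\begin{pmatrix}1_{\{U_i=1\}}\\1_{\{U_i=2\}}\\\vdots\\1_{\{U_i=M\}}\end{pmatrix},$$

the random variables valued into $\mathcal{M}_{q,M}(\mathbb{R})$,

$$\mathcal{U}=\begin{pmatrix}1_{\{Z=1,U=1\}}&\cdots&1_{\{Z=1,U=M\}}\\1_{\{Z=2,U=1\}}&\cdots&1_{\{Z=2,U=M\}}\\\vdots&\ddots&\vdots\\1_{\{Z=q,U=1\}}&\cdots&1_{\{Z=q,U=M\}}\end{pmatrix},\quad\mathcal{U}_i=\begin{pmatrix}1_{\{Z_i=1,U_i=1\}}&\cdots&1_{\{Z_i=1,U_i=M\}}\\1_{\{Z_i=2,U_i=1\}}&\cdots&1_{\{Z_i=2,U_i=M\}}\\\vdots&\ddots&\vdots\\1_{\{Z_i=q,U_i=1\}}&\cdots&1_{\{Z_i=q,U_i=M\}}\end{pmatrix},$$

and the random variables

$$\mathcal{V}=\begin{pmatrix}1_{\{U=1\}}X\otimes X\\1_{\{U=2\}}X\otimes X\\\vdots\\1_{\{U=M\}}X\otimes X\end{pmatrix},\quad\mathcal{V}_i=\begin{pmatrix}1_{\{U_i=1\}}X_i\otimes X_i\\1_{\{U_i=2\}}X_i\otimes X_i\\\vdots\\1_{\{U_i=M\}}X_i\otimes X_i\end{pmatrix}$$

valued into $(\mathcal{L}(\mathbb{R}^p))^M$, where $\mathcal{L}(\mathbb{R}^p)$ denotes the space of operators from $\mathbb{R}^p$ to itself. Further, we consider the random variables

$$\mathcal{W}=(\mathcal{X},\mathcal{Y},\mathcal{Z},\mathcal{U},\mathcal{V}),\quad\mathcal{W}_i=(\mathcal{X}_i,\mathcal{Y}_i,\mathcal{Z}_i,\mathcal{U}_i,\mathcal{V}_i)$$

valued into the Euclidean vector space

$$\mathcal{E}=\mathcal{M}_{qp,M}(\mathbb{R})\times\mathbb{R}^{pM}\times\mathbb{R}^{M}\times\mathcal{M}_{q,M}(\mathbb{R})\times\mathcal{L}(\mathbb{R}^p)^M;$$

then, we put

$$\widehat{\mathcal{W}}^{(n)}=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}\mathcal{W}_i-E(\mathcal{W})\right). \tag{7}$$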

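Note that, for any $(a,b,c,d,e)\in\mathcal{E}$, we can write:

$$a=\begin{pmatrix}a_{1,1}&a_{1,2}&\cdots&a_{1,M}\\a_{2,1}&a_{2,2}&\cdots&a_{2,M}\\\vdots&\vdots&\ddots&\vdots\\a_{q,1}&a_{q,2}&\cdots&a_{q,M}\end{pmatrix},\quad\text{where }a_{\ell,m}\in\mathbb{R}^p,\ 1\leq\ell\leq q,\ 1\leq m\leq M,$$

$$b=\begin{pmatrix}b_1\\b_2\\\vdots\\b_M\end{pmatrix},\quad\text{where }b_m\in\mathbb{R}^p,\ 1\leq m\leq M,$$

$$c=\begin{pmatrix}c_1\\c_2\\\vdots\\c_M\end{pmatrix},\quad\text{where }c_m\in\mathbb{R},\ 1\leq m\leq M,$$

$$d=\begin{pmatrix}d_{1,1}&d_{1,2}&\cdots&d_{1,M}\\d_{2,1}&d_{2,2}&\cdots&d_{2,M}\\\vdots&\vdots&\ddots&\vdots\\d_{q,1}&d_{q,2}&\cdots&d_{q,M}\end{pmatrix},\quad\text{where }d_{\ell,m}\in\mathbb{R},\ 1\leq\ell\leq q,\ 1\leq m\leq M,$$

$$e=\begin{pmatrix}e_1\\e_2\\\vdots\\e_M\end{pmatrix},\quad\text{where }e_m\in\mathcal{L}(\mathbb{R}^p),\ 1\leq m\leq M.$$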

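We introduce the projectors

$$\pi_1^{\ell m}:(a,b,c,d,e)\in\mathcal{E}\mapsto a_{\ell,m}\in\mathbb{R}^p,\quad\pi_2^{m}:(a,b,c,d,e)\in\mathcal{E}\mapsto b_m\in\mathbb{R}^p,$$

$$\pi_3^{m}:(a,b,c,d,e)\in\mathcal{E}\mapsto c_m\in\mathbb{R},\quad\pi_4^{\ell m}:(a,b,c,d,e)\in\mathcal{E}\mapsto d_{\ell,m}\in\mathbb{R},$$

$$\pi_5^{m}:(a,b,c,d,e)\in\mathcal{E}\mapsto e_m\in\mathcal{L}(\mathbb{R}^p),$$

the vector

$$\Delta_{\ell,K|m}=\left(I_{\mathbb{R}^p}-V_mQ_{K|m}\right)(\mu_{\ell,m}-\mu_m)$$

and the operators $\Lambda_{K|m}$, $\Phi_{\ell,K|m}$ and $\Psi_{\ell,K|m}$ defined on $\mathcal{E}$ by

$$\Lambda_{K|m}(T)=2p_m\pi_3^m(T)\,\xi_{K|m},$$

$$\Phi_{\ell,K|m}(T)=p_m^{-1}\left(\pi_4^{\ell m}(T)-\pi_3^m(T)p_{\ell|m}\right)\|\Delta_{\ell,K|m}\|,$$

and

$$\begin{aligned}\Psi_{\ell,K|m}(T)=p_m^{-1}\left(I_{\mathbb{R}^p}-V_mQ_{K|m}\right)\Big[&p_{\ell|m}^{-1}\left(\pi_1^{\ell m}(T)-\pi_4^{\ell m}(T)\mu_{\ell,m}\right)-\pi_2^m(T)+\pi_3^m(T)\mu_m\\&-\left(\pi_5^m(T)-\pi_3^m(T)(V_m+\mu_m\otimes\mu_m)\right)Q_{K|m}(\mu_{\ell,m}-\mu_m)\\&+\left((\pi_2^m(T)-\pi_3^m(T)\mu_m)\otimes\mu_m\right)Q_{K|m}(\mu_{\ell,m}-\mu_m)\\&+\left(\mu_m\otimes(\pi_2^m(T)-\pi_3^m(T)\mu_m)\right)Q_{K|m}(\mu_{\ell,m}-\mu_m)\Big].\end{aligned}$$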

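Then, we obtain the following theorem, which gives the consistency of $\hat\xi_K^{(n)}$ together with a result that allows us to derive its asymptotic distribution and that is useful for proving the consistency of the proposed method for variable selection.

Theorem 1. For any subset $K$ of $I$,

(i) $\hat\xi_K^{(n)}$ converges almost surely to $\xi_K$ as $n\to+\infty$.

(ii)

$$n\hat\xi_K^{(n)}=\sum_{m=1}^{M}\sqrt{n}\,\hat\Lambda_{K|m}^{(n)}(\widehat{\mathcal{W}}^{(n)})+\sum_{m=1}^{M}\sum_{\ell=1}^{q}\left(p_m\hat\Phi_{\ell,K|m}^{(n)}(\widehat{\mathcal{W}}^{(n)})+p_mp_{\ell|m}\left\|\hat\Psi_{\ell,K|m}^{(n)}(\widehat{\mathcal{W}}^{(n)})+\sqrt{n}\Delta_{\ell,K|m}\right\|^2\right)$$

where $(\hat\Lambda_{K|m}^{(n)})_{n\in\mathbb{N}^*}$, $(\hat\Phi_{\ell,K|m}^{(n)})_{n\in\mathbb{N}^*}$ and $(\hat\Psi_{\ell,K|m}^{(n)})_{n\in\mathbb{N}^*}$ are sequences of random operators that converge almost surely uniformly to $\Lambda_{K|m}$, $\Phi_{\ell,K|m}$ and $\Psi_{\ell,K|m}$ respectively.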

3.2 Non-parametric smoothing procedure

As is well known, empirical estimators may be unsuitable, because many cell incidences can be very small or zero, so that the corresponding estimates are poor or do not exist (see Asparoukhov and Krzanowski 2000). To overcome these problems, we propose in this section a non-parametric smoothing procedure for estimating the criterion (5). This procedure is based on smoothing as done in Asparoukhov and Krzanowski (2000) and Mahat et al. (2007).

Denote by $D$ the dissimilarity defined on $\{1,\cdots,M\}^2$ by

$$D(m,k)=\|x_m-x_k\|^2,$$

where, for any $k\in\{1,\cdots,M\}$, $x_k=(x_k^{(1)},\cdots,x_k^{(d)})^T\in\{0,1\}^d$ is the vector of binary variables satisfying $1+\sum_{j=1}^{d}x_k^{(j)}2^{j-1}=k$. Then, given a smoothing parameter $\lambda\in]0,1[$, we consider the weights $w(m,k)=\lambda^{D(m,k)}$ and estimate $p_m$, $p_{\ell|m}$, $\mu_m$, $\mu_{\ell,m}$ and $V_m$ respectively by:

$$\tilde p_m^{(n)}=\frac{\sum_{j=1}^{M}w(m,j)\hat N_j^{(n)}}{\sum_{k=1}^{M}\sum_{j=1}^{M}w(m,j)\hat N_j^{(n)}},\quad\tilde p_{\ell|m}^{(n)}=\frac{\sum_{j=1}^{M}w(m,j)\hat N_{\ell,j}^{(n)}}{\tilde p_m^{(n)}\sum_{k=1}^{M}\sum_{j=1}^{M}w(m,j)\hat N_{\ell,j}^{(n)}},$$

$$\tilde\mu_m^{(n)}=\left(\sum_{j=1}^{M}w(m,j)\hat N_j^{(n)}\right)^{-1}\sum_{j=1}^{M}\left\{w(m,j)\sum_{i=1}^{n}X_i1_{\{U_i=j\}}\right\},$$

$$\tilde\mu_{\ell,m}^{(n)}=\left(\sum_{j=1}^{M}w(m,j)\hat N_{\ell,j}^{(n)}\right)^{-1}\sum_{j=1}^{M}\left\{w(m,j)\sum_{i=1}^{n}X_i1_{\{Z_i=\ell,U_i=j\}}\right\}$$

and

$$\tilde V_m^{(n)}=\left(\sum_{j=1}^{M}w(m,j)\hat N_j^{(n)}\right)^{-1}\sum_{j=1}^{M}\left\{w(m,j)\sum_{i=1}^{n}1_{\{U_i=j\}}(X_i-\tilde\mu_m^{(n)})\otimes(X_i-\tilde\mu_m^{(n)})\right\}.$$

Then, we obtain an estimator $\tilde\xi_K^{(n)}$ of the criterion by replacing in (3), (4) and (5) the parameters $p_m$, $p_{\ell|m}$, $\mu_m$, $\mu_{\ell,m}$ and $V_m$ by their estimators given above.
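To make the weights concrete: since $x_m$ and $x_k$ are binary vectors, $D(m,k)$ is the number of coordinates in which the two cell patterns differ, that is their Hamming distance. The sketch below computes $w(m,k)=\lambda^{D(m,k)}$ and, as one example of a smoothed quantity, the smoothed mean $\tilde\mu_m^{(n)}$. The function names are hypothetical and the data are made up; this is an illustration of the formulas, not the authors' code.

```python
def dissimilarity(m, k):
    """D(m, k): number of binary coordinates in which cells m and k differ."""
    return bin((m - 1) ^ (k - 1)).count("1")


def weight(m, k, lam):
    """w(m, k) = lam ** D(m, k), with smoothing parameter lam in ]0, 1[."""
    return lam ** dissimilarity(m, k)


def smoothed_mean(X, U, m, lam):
    """Smoothed mean for cell m: every observation contributes with weight w(m, U_i)."""
    p = len(X[0])
    num, den = [0.0] * p, 0.0
    for x, u in zip(X, U):
        w = weight(m, u, lam)
        den += w                      # sum_j w(m, j) * N_j
        for j in range(p):
            num[j] += w * x[j]        # sum_j w(m, j) * sum_i X_i 1{U_i = j}
    return [v / den for v in num]


print(dissimilarity(1, 8))                                  # cells 000 and 111: 3
print(smoothed_mean([[0.0], [2.0]], [1, 2], m=1, lam=0.5))  # [0.666...]
```

Small values of $\lambda$ give almost no borrowing between cells, whereas values close to 1 pool all cells together.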

4 Selection of variables

Estimation ofI1reduces to that ofσands. In this section, estimators for these two parameters are proposed. When empirical estimators are used, we then obtain consistency properties for these estimators.


4.1 Estimation of σ and s

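Let us consider a sequence $(f_n)_{n\in\mathbb{N}^*}$ of functions from $I$ to $\mathbb{R}_+$ such that $f_n\sim n^{-\alpha}f$, where $\alpha\in]0,1/2[$ and $f$ is a strictly decreasing function from $I$ to $\mathbb{R}_+$. Then, recalling that $K_i=I-\{i\}$, we put

$$\hat\phi_i^{(n)}=\hat\xi_{K_i}^{(n)}+f_n(i)\quad(i\in I)$$

and we take as estimator of $\sigma$ the random permutation $\hat\sigma^{(n)}$ of $I$ such that

$$\hat\phi^{(n)}_{\hat\sigma^{(n)}(1)}\geq\hat\phi^{(n)}_{\hat\sigma^{(n)}(2)}\geq\cdots\geq\hat\phi^{(n)}_{\hat\sigma^{(n)}(p)}$$

and if $\hat\phi^{(n)}_{\hat\sigma^{(n)}(i)}=\hat\phi^{(n)}_{\hat\sigma^{(n)}(j)}$ with $i<j$, then $\hat\sigma^{(n)}(i)<\hat\sigma^{(n)}(j)$. Furthermore, we consider the random set $\hat J_i^{(n)}=\{\hat\sigma^{(n)}(j)\,;\,1\leq j\leq i\}$ and the random variable

$$\hat\psi_i^{(n)}=\hat\xi^{(n)}_{\hat J_i^{(n)}}+g_n\left(\hat\sigma^{(n)}(i)\right)\quad(i\in I)$$

where $(g_n)_{n\in\mathbb{N}^*}$ is a sequence of functions from $I$ to $\mathbb{R}_+$ such that $g_n\sim n^{-\beta}g$, where $\beta\in]0,1[$ and $g$ is a strictly increasing function. Then, we take as estimator of $s$ the random variable

$$\hat s^{(n)}=\min\left\{i\in I\ /\ \hat\psi_i^{(n)}=\min_{j\in I}\hat\psi_j^{(n)}\right\}.$$

The variable selection is achieved by taking the random set $\hat I_1^{(n)}=\left\{\hat\sigma^{(n)}(i)\,;\,1\leq i\leq\hat s^{(n)}\right\}$ as estimator of $I_1$.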
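The two estimation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `xi` stands for any callable returning the estimated criterion for a list of 1-based variable indices, the penalty shapes $f(i)=1/i$ and $g(i)=i$ are chosen only for illustration, and `toy_xi` is a made-up criterion. Following the text literally, the penalty $g_n$ is evaluated at $\hat\sigma^{(n)}(i)$.

```python
def select_variables(xi, p, n, alpha=0.25, beta=0.5,
                     f=lambda i: 1.0 / i, g=lambda i: float(i)):
    """Return (sigma_hat, s_hat, I1_hat) for variables 1..p and sample size n."""
    I = list(range(1, p + 1))
    # phi_i = xi(K_i) + f_n(i), with K_i = I - {i} and f_n = n**(-alpha) * f
    phi = {i: xi([k for k in I if k != i]) + n ** (-alpha) * f(i) for i in I}
    # decreasing phi; ties are broken by increasing index
    sigma = sorted(I, key=lambda i: (-phi[i], i))
    # psi_i = xi(J_i) + g_n(sigma(i)), with J_i = {sigma(1), ..., sigma(i)}
    psi = [xi(sigma[:i]) + n ** (-beta) * g(sigma[i - 1]) for i in I]
    s = psi.index(min(psi)) + 1       # smallest index attaining the minimum
    return sigma, s, sorted(sigma[:s])


def toy_xi(K):
    """Made-up criterion: variables 1 and 2 matter, with different strengths."""
    missing_cost = {1: 1.0, 2: 0.5}
    return sum(cost for i, cost in missing_cost.items() if i not in K)


sigma, s, selected = select_variables(toy_xi, p=4, n=10_000)
print(sigma, s, selected)  # [1, 2, 3, 4] 2 [1, 2]
```

In practice `xi` would be an estimator of the criterion such as $\hat\xi_K^{(n)}$ or $\tilde\xi_K^{(n)}$ computed on the training sample.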

4.2 Consistency

When the empirical estimators defined in Section 3.1 are considered, we establish consistency for the preceding estimators. We first give a proposition that is useful for proving the consistency theorem. There exist $t\in I$ and $(m_1,\cdots,m_t)\in I^t$ such that $m_1+\cdots+m_t=p$, and

$$\xi_{K_{\sigma(1)}}=\cdots=\xi_{K_{\sigma(m_1)}}>\xi_{K_{\sigma(m_1+1)}}=\cdots=\xi_{K_{\sigma(m_1+m_2)}}>\cdots>\xi_{K_{\sigma(m_1+\cdots+m_{t-1}+1)}}=\cdots=\xi_{K_{\sigma(m_1+\cdots+m_t)}}.$$

Then, considering the set $E=\{\ell\in\mathbb{N}^*\,;\,1\leq\ell\leq t,\ m_\ell\geq 2\}$ and putting $m_0:=0$ and $F_\ell:=\left\{\sum_{k=0}^{\ell-1}m_k+1,\cdots,\sum_{k=0}^{\ell}m_k-1\right\}$ ($\ell\in\{1,\cdots,t\}$), we have:

Proposition 2. If $E\neq\emptyset$, then for any $\ell\in E$ and any $i\in F_\ell$, the sequence $n^{\alpha}\left(\hat\xi^{(n)}_{K_{\sigma(i)}}-\hat\xi^{(n)}_{K_{\sigma(i+1)}}\right)$ converges in probability to $0$, as $n\to+\infty$.

From a proof that uses the preceding proposition and that is similar to the proof of Theorem 3.1 in Nkiet (2012), we obtain the consistency theorem given below.

Theorem 2. We have:

(i) $\lim_{n\to+\infty}P\left(\hat\sigma^{(n)}=\sigma\right)=1$;

(ii) $\hat s^{(n)}$ converges in probability to $s$, as $n\to+\infty$.

As a consequence of this theorem, we easily obtain:

$$\lim_{n\to+\infty}P\left(\hat I_1^{(n)}=I_1\right)=1.$$

This shows the consistency of our method for selecting variables in discriminant analysis with mixed variables.


5 Numerical experiments

In this section, we report results of simulations made for studying properties of the proposed method. Several issues are addressed: the influence of the penalty functions, of the type of estimator and of the parameters $\alpha$ and $\beta$ on the performance of the procedure, the optimal choice of these parameters, and the comparison with the method proposed in Mahat et al. (2007).

Since the latter method considers only the case of two groups, we have placed ourselves in this framework, although our method can be used for more than two groups. Each data set was generated as follows: $X_i$ is generated from a multivariate normal distribution in $\mathbb{R}^5$ with mean $\mu$ and covariance matrix $\Gamma=\frac{1}{2}(I_5+J_5)$, where $I_5$ is the 5-dimensional identity matrix and $J_5$ is the $5\times 5$ matrix whose elements are all equal to 1; $U_i$ is generated from the uniform distribution on $\{1,\cdots,8\}$, which is equivalent to generating $Y_i$ as a random vector with 3 binary coordinates. Two groups of data were generated as indicated above, with $\mu=\mu_1=(0,0,0,0,0)^T$ for the first group (for which $Z_i=1$) and $\mu=\mu_2=(1/4,0,1/2,0,3/4)^T$ for the second group (for which $Z_i=2$). Our simulated data are based on two independent data sets, training data and test data, each with sample size $n=100,200,300,400,500$ and with sizes $n_1=n_2=n/2$ for the two groups. The training data are used for selecting variables and the test data are used for computing the classification capacity (CC), that is the proportion of correct classification (obtained by using the rule (1)) after variable selection. The average of CC over 1000 independent replications is used for measuring the performance of the methods.
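The data-generating scheme described above can be sketched in a few lines. This is an illustration under assumptions made here (the function names, the use of Python's standard `random` module and the seed are not from the paper). It relies on the observation that if $\varepsilon\sim N(0,I_5)$ and $\eta\sim N(0,1)$ are independent, then $\sqrt{1/2}\,\varepsilon+\sqrt{1/2}\,\eta\,(1,\cdots,1)^T$ has covariance $\frac{1}{2}I_5+\frac{1}{2}J_5=\Gamma$.

```python
import math
import random

# Group means from the simulation design
MU = {1: [0.0, 0.0, 0.0, 0.0, 0.0], 2: [0.25, 0.0, 0.5, 0.0, 0.75]}


def draw_x(mu, rng):
    """Draw X ~ N(mu, Gamma) with Gamma = (I_5 + J_5) / 2."""
    c = math.sqrt(0.5)
    eta = rng.gauss(0.0, 1.0)                 # shared component, gives the J_5 / 2 part
    return [m + c * rng.gauss(0.0, 1.0) + c * eta for m in mu]


def simulate(n, seed=0):
    """Return (X, U, Z) with n/2 observations per group and U uniform on 1..8."""
    rng = random.Random(seed)
    X, U, Z = [], [], []
    for z in (1, 2):
        for _ in range(n // 2):
            X.append(draw_x(MU[z], rng))
            U.append(rng.randint(1, 8))       # equivalent to 3 independent fair binary variables
            Z.append(z)
    return X, U, Z


X_train, U_train, Z_train = simulate(100, seed=1)
X_test, U_test, Z_test = simulate(100, seed=2)
print(len(X_train), Z_train.count(1), Z_train.count(2))  # 100 50 50
```

Note that only the first, third and fifth coordinates of $\mu_2$ differ from $\mu_1$, so these are the variables a selection method is expected to retain in this design.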

5.1 Influence of penalty functions and type of estimator

In order to evaluate the impact of the penalty functions on the performance of our method, we took $f_n(i)=n^{-1/4}/h_k(i)$ and $g_n(i)=n^{-1/4}h_k(i)$, $k=1,\cdots,13$, with $h_1(x)=x$, $h_2(x)=x^{0.1}$, $h_3(x)=x^{0.5}$, $h_4(x)=x^{0.9}$, $h_5(x)=x^{10}$, $h_6(x)=\ln(x)$, $h_7(x)=\ln(x)^{0.1}$, $h_8(x)=\ln(x)^{0.5}$, $h_9(x)=\ln(x)^{0.9}$, $h_{10}(x)=x\ln(x)$, $h_{11}(x)=(x\ln(x))^{0.1}$, $h_{12}(x)=(x\ln(x))^{0.5}$, $h_{13}(x)=(x\ln(x))^{0.9}$. For each of these functions, we computed the classification capacity by using both the empirical estimators and the non-parametric smoothing procedure introduced in Section 3.2. For the latter type of estimator, the smoothing parameter $\lambda$ was computed by cross validation on the training sample in order to maximize classification capacity. The results are given in Table 1. There is no significant difference between the results obtained for the different functions, nor between the two types of estimators. So, it seems that the choice of penalty functions has no influence on the performance of our method.

5.2 Influence of parameters α and β

Since tuning parameters may have an impact on the performance of a statistical procedure, it is important to study their influence. That is why numerical experiments were made in order to assess the influence of $\alpha$ and $\beta$ on the performance of our method. To do so, we made simulations as indicated above by taking

$$f_n(i)=n^{-\alpha}/h_7(i)\quad\text{and}\quad g_n(i)=n^{-\beta}h_7(i) \tag{8}$$

with $\alpha=0.1,0.2,0.3,0.4,0.45$, and $\beta$ varying in $[0,1[$. The results are reported in Fig. 1 (a)–(c) and show that the parameters $\alpha$ and $\beta$ clearly have an impact on the performance of the method. Indeed, the obtained curves vary as $\alpha$ varies. Further, for a fixed $\alpha$, CC is not constant as $\beta$ varies in $[0,1[$. Hence, choosing proper values for $\alpha$ and $\beta$ is an important issue to address in practice. An approach for doing that via cross validation is described in the following section.

5.3 Choosing optimal (α, β)

We propose a method for making an optimal choice of $(\alpha,\beta)$ based on leave-one-out cross validation used in order to maximize classification capacity. For each $k\in\{1,\cdots,n\}$, after removing the $k$-th observation for $X$ and $Y$ in the training sample, our method for selecting variables is applied on the remaining sample with a given value for $(\alpha,\beta)$ and penalty functions taken as in (8). Then, the observation that has been removed is allocated to a group $\tilde g_{\alpha,\beta}(k)$ in $\{1,\cdots,q\}$ by using the rule given in (1) (for the two groups case) or in (2) (for the case of more than two groups) based on the variables that have been selected in the previous step. Then, we consider

$$CV(\alpha,\beta)=\frac{1}{n}\sum_{k=1}^{n}1_{\{Z_k=\tilde g_{\alpha,\beta}(k)\}}$$

and we take as optimal value for $(\alpha,\beta)$ the pair $(\alpha_{opt},\beta_{opt})$ defined by:

$$(\alpha_{opt},\beta_{opt})=\underset{(\alpha,\beta)\in]0,1/2[\times]0,1[}{\mathrm{argmax}}\ CV(\alpha,\beta). \tag{9}$$

            n = 100 (n1 = n2 = 50)     n = 300 (n1 = n2 = 150)    n = 500 (n1 = n2 = 250)
Function    Empirical   Non-param.     Empirical   Non-param.     Empirical   Non-param.
h1          0.60800     0.60800        0.56421     0.56424        0.54998     0.54998
h2          0.60900     0.60900        0.56328     0.56332        0.55047     0.55067
h3          0.61000     0.61000        0.56417     0.56417        0.55032     0.55039
h4          0.60900     0.60900        0.56346     0.56346        0.55049     0.55058
h5          0.60800     0.60800        0.56382     0.56382        0.54982     0.55982
h6          0.61028     0.61028        0.56343     0.56343        0.55058     0.55062
h7          0.60903     0.60903        0.56374     0.56374        0.55050     0.55054
h8          0.61066     0.61066        0.56354     0.56354        0.55008     0.55013
h9          0.60898     0.60898        0.56312     0.56312        0.54972     0.54934
h10         0.60842     0.60842        0.56379     0.56379        0.55013     0.55018
h11         0.60948     0.60948        0.56404     0.56404        0.55036     0.55037
h12         0.60915     0.60915        0.56345     0.56345        0.55046     0.55046
h13         0.60908     0.60908        0.56375     0.56375        0.55028     0.55028

Table 1: Average of classification capacity (CC) over 1000 replications, for the empirical estimator and the non-parametric estimator, with different penalty functions $f_n=n^{-1/4}/h_k$ and $g_n=n^{-1/4}h_k$, $k=1,\cdots,13$, and $\alpha=0.25$, $\beta=0.5$.

[Figure 1, panels (a) and (b): classification capacity versus beta, one curve for each of alpha = 0.1, 0.2, 0.3, 0.4, 0.45.]

[Figure 1, panel (c): classification capacity versus beta.]

Figure 1: Average of CC over 1000 replications versus $\beta$, for different values of $\alpha$. (a) $n=100$ (b) $n=300$ (c) $n=500$.

n      n1 = n2    CC (our method)    CC (Mahat method)
100    50         0.60860            0.60826
200    100        0.57810            0.57836
300    150        0.56244            0.56221
400    200        0.55534            0.55546
500    250        0.54989            0.54950

Table 2: Average of classification capacity (CC) over 1000 replications.

5.4 Comparison with the method of Mahat et al. (2007)

For each of the 1000 independent replications:

(i) the training sample is used for selecting variables from our method and that of Mahat et al. (2007); our method is used with penalty functions given in (8) and optimal(α, β)obtained by using leave-one-out cross validation as indicated in (9);

(ii) the test sample is then used for computing classification capacity for the two methods.

The average of classification capacity over the 1000 replications is then computed. The results, given in Table 2, do not show a superiority of one of the methods compared to the other.
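The leave-one-out choice of $(\alpha,\beta)$ described in Section 5.3 and used in step (i) can be sketched generically. In the sketch below, `select` and `classify` are hypothetical callables standing for the variable selection step and for the allocation rule, and the search is carried out over a finite grid of candidate values, which is an assumption made here since the text does not specify how the maximum in (9) is computed. The toy functions and data are made up.

```python
def loo_cv(X, Z, select, classify, alpha, beta):
    """CV(alpha, beta): leave-one-out proportion of correctly allocated observations."""
    hits = 0
    for k in range(len(X)):
        X_train = X[:k] + X[k + 1:]
        Z_train = Z[:k] + Z[k + 1:]
        kept = select(X_train, Z_train, alpha, beta)              # variable selection step
        hits += classify(X_train, Z_train, kept, X[k]) == Z[k]    # allocation step
    return hits / len(X)


def best_pair(X, Z, select, classify, alphas, betas):
    """Grid approximation of the argmax in (9); the first maximizer is returned."""
    grid = [(a, b) for a in alphas for b in betas]
    return max(grid, key=lambda ab: loo_cv(X, Z, select, classify, *ab))


# Toy stand-ins (hypothetical): the selected variable depends on alpha only,
# and allocation is to the group with the nearest mean on the kept variables.
def toy_select(X_train, Z_train, alpha, beta):
    return [0] if alpha < 0.3 else [1]


def toy_classify(X_train, Z_train, kept, x):
    def dist(z):
        rows = [r for r, lab in zip(X_train, Z_train) if lab == z]
        return sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in kept)
    return min(sorted(set(Z_train)), key=dist)


X = [[0.0, 5.0], [1.0, 4.0], [10.0, 5.0], [11.0, 4.0]]
Z = [1, 1, 2, 2]
print(loo_cv(X, Z, toy_select, toy_classify, 0.1, 0.5))              # 1.0
print(best_pair(X, Z, toy_select, toy_classify, [0.1, 0.4], [0.5]))  # (0.1, 0.5)
```

Each evaluation of `loo_cv` reruns the selection $n$ times, so the cost of the grid search grows with both $n$ and the number of candidate pairs.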


6 Proofs

6.1 Proof of Proposition 1

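For any fixed $m\in\{1,\cdots,M\}$, we denote by $P^{\{U=m\}}$ the conditional probability to the event $\{U=m\}$. Then, applying Theorem 2.1 of Nkiet (2012) with the probability space $(\Omega,\mathcal{A},P^{\{U=m\}})$, we obtain the equivalence: $\xi_{K|m}=0\Leftrightarrow I_{1,m}\subset K$. Thus:

$$\begin{aligned}\xi_K=0&\Leftrightarrow\sum_{m=1}^{M}p_m^2\,\xi_{K|m}=0\\&\Leftrightarrow\forall m\in\{1,\cdots,M\},\ \xi_{K|m}=0\\&\Leftrightarrow\forall m\in\{1,\cdots,M\},\ I_{1,m}\subset K\\&\Leftrightarrow\bigcup_{m=1}^{M}I_{1,m}\subset K.\end{aligned}$$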

6.2 Some useful results

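In this section, we give two lemmas that are useful for proving Theorem 1.

Lemma 1. We have:

(i) $\sqrt{n}\left(\hat p_m^{(n)}-p_m\right)=\pi_3^m(\widehat{\mathcal{W}}^{(n)})$;

(ii) $\sqrt{n}\left(\hat p_{\ell|m}^{(n)}-p_{\ell|m}\right)=p_m^{-1}\left(\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat p_{\ell|m}^{(n)}\right)$;

(iii) $\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right)=p_m^{-1}\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_m^{(n)}\right)$;

(iv) $\sqrt{n}\left(\hat\mu_{\ell,m}^{(n)}-\mu_{\ell,m}\right)=p_m^{-1}p_{\ell|m}^{-1}\left(\pi_1^{\ell m}(\widehat{\mathcal{W}}^{(n)})-\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_{\ell,m}^{(n)}\right)$;

(v)

$$\begin{aligned}\sqrt{n}\left(\hat V_m^{(n)}-V_m\right)=\ &p_m^{-1}\left[\pi_5^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\left(\hat V_m^{(n)}+\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}\right)\right]\\&-p_m^{-1}\left[\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right)\otimes\hat\mu_m^{(n)}+\mu_m\otimes\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right)\right].\end{aligned}$$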

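Proof. (i). Clearly,

$$\pi_3^m(\widehat{\mathcal{W}}^{(n)})=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{U_i=m\}}-E(1_{\{U=m\}})\right)=\sqrt{n}\left(\hat p_m^{(n)}-p_m\right).$$

(ii).

$$\begin{aligned}\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})&=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{Z_i=\ell,U_i=m\}}-E(1_{\{Z=\ell,U=m\}})\right)\\&=\sqrt{n}\left(\frac{\hat N_{\ell,m}^{(n)}}{n}-P(Z=\ell,U=m)\right)\\&=\sqrt{n}\left(\hat p_m^{(n)}\hat p_{\ell|m}^{(n)}-p_mp_{\ell|m}\right).\end{aligned}$$

Further,

$$\begin{aligned}\sqrt{n}\left(\hat p_m^{(n)}\hat p_{\ell|m}^{(n)}-p_mp_{\ell|m}\right)&=\sqrt{n}\left(\hat p_m^{(n)}-p_m\right)\hat p_{\ell|m}^{(n)}+p_m\sqrt{n}\left(\hat p_{\ell|m}^{(n)}-p_{\ell|m}\right)\\&=\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat p_{\ell|m}^{(n)}+p_m\sqrt{n}\left(\hat p_{\ell|m}^{(n)}-p_{\ell|m}\right).\end{aligned}$$

Hence

$$\sqrt{n}\left(\hat p_{\ell|m}^{(n)}-p_{\ell|m}\right)=p_m^{-1}\left(\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat p_{\ell|m}^{(n)}\right).$$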

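(iii).

$$\begin{aligned}\pi_2^m(\widehat{\mathcal{W}}^{(n)})&=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i-E(1_{\{U=m\}}X)\right)=\sqrt{n}\left(\hat p_m^{(n)}\hat\mu_m^{(n)}-p_m\mu_m\right)\\&=\sqrt{n}\left(\hat p_m^{(n)}-p_m\right)\hat\mu_m^{(n)}+p_m\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right)\\&=\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_m^{(n)}+p_m\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right).\end{aligned}$$

Hence

$$\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right)=p_m^{-1}\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_m^{(n)}\right).$$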

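(iv).

$$\pi_1^{\ell m}(\widehat{\mathcal{W}}^{(n)})=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{Z_i=\ell,U_i=m\}}X_i-E(1_{\{Z=\ell,U=m\}}X)\right)=\sqrt{n}\left(\hat p_{\ell,m}^{(n)}\hat\mu_{\ell,m}^{(n)}-p_{\ell,m}\mu_{\ell,m}\right),$$

where

$$\hat p_{\ell,m}^{(n)}=\frac{\hat N_{\ell,m}^{(n)}}{n}=\hat p_m^{(n)}\hat p_{\ell|m}^{(n)}\quad\text{and}\quad p_{\ell,m}=P(Z=\ell,U=m)=p_mp_{\ell|m}. \tag{10}$$

Moreover,

$$\begin{aligned}\sqrt{n}\left(\hat p_{\ell,m}^{(n)}\hat\mu_{\ell,m}^{(n)}-p_{\ell,m}\mu_{\ell,m}\right)&=\sqrt{n}\left(\hat p_{\ell,m}^{(n)}-p_{\ell,m}\right)\hat\mu_{\ell,m}^{(n)}+p_{\ell,m}\sqrt{n}\left(\hat\mu_{\ell,m}^{(n)}-\mu_{\ell,m}\right)\\&=\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_{\ell,m}^{(n)}+p_{\ell,m}\sqrt{n}\left(\hat\mu_{\ell,m}^{(n)}-\mu_{\ell,m}\right).\end{aligned}$$

From this last equality and the second one in (10), we deduce that

$$\sqrt{n}\left(\hat\mu_{\ell,m}^{(n)}-\mu_{\ell,m}\right)=p_m^{-1}p_{\ell|m}^{-1}\left(\pi_1^{\ell m}(\widehat{\mathcal{W}}^{(n)})-\pi_4^{\ell m}(\widehat{\mathcal{W}}^{(n)})\,\hat\mu_{\ell,m}^{(n)}\right).$$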

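(v).

$$\begin{aligned}\pi_5^m(\widehat{\mathcal{W}}^{(n)})&=\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i\otimes X_i-E(1_{\{U=m\}}X\otimes X)\right)\\&=\sqrt{n}\left(\hat p_m^{(n)}\frac{1}{\hat N_m^{(n)}}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i\otimes X_i-p_mE(X\otimes X|U=m)\right)\\&=\sqrt{n}\left(\hat p_m^{(n)}\hat V_m^{(n)}+\hat p_m^{(n)}\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}-p_mV_m-p_m\mu_m\otimes\mu_m\right)\\&=p_m\sqrt{n}\left(\hat V_m^{(n)}-V_m\right)+\sqrt{n}\left(\hat p_m^{(n)}-p_m\right)\left(\hat V_m^{(n)}+\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}\right)\\&\quad+p_m\left(\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right)\otimes\hat\mu_m^{(n)}+\mu_m\otimes\sqrt{n}\left(\hat\mu_m^{(n)}-\mu_m\right)\right)\\&=p_m\sqrt{n}\left(\hat V_m^{(n)}-V_m\right)+\pi_3^m(\widehat{\mathcal{W}}^{(n)})\left(\hat V_m^{(n)}+\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}\right)\\&\quad+\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right)\otimes\hat\mu_m^{(n)}+\mu_m\otimes\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right).\end{aligned}$$

Thus

$$\begin{aligned}\sqrt{n}\left(\hat V_m^{(n)}-V_m\right)=\ &p_m^{-1}\left[\pi_5^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\left(\hat V_m^{(n)}+\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}\right)\right]\\&-p_m^{-1}\left[\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right)\otimes\hat\mu_m^{(n)}+\mu_m\otimes\left(\pi_2^m(\widehat{\mathcal{W}}^{(n)})-\pi_3^m(\widehat{\mathcal{W}}^{(n)})\hat\mu_m^{(n)}\right)\right].\end{aligned}$$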
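The identities of Lemma 1 are exact algebraic identities that hold for every sample, not only asymptotically. As a quick sanity check, part (ii) can be verified numerically; the counts and probabilities below are made up for illustration, and the variable names are assumptions of this sketch.

```python
import math

n, N_m, N_lm = 100, 55, 20          # sample size and observed counts (made up)
p_m, p_l_given_m = 0.5, 0.4         # "true" probabilities (made up)

p_hat_m = N_m / n
p_hat_l_given_m = N_lm / N_m
pi3 = math.sqrt(n) * (p_hat_m - p_m)                    # pi_3^m applied to W_hat
pi4 = math.sqrt(n) * (N_lm / n - p_m * p_l_given_m)     # pi_4^{lm} applied to W_hat

lhs = math.sqrt(n) * (p_hat_l_given_m - p_l_given_m)
rhs = (pi4 - pi3 * p_hat_l_given_m) / p_m
print(abs(lhs - rhs) < 1e-12)  # True
```

The same kind of check applies to parts (iii) to (v), replacing scalars by vectors and operators.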

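Lemma 2. For any $K\subset I$ and any $m\in\{1,\cdots,M\}$:

(i) $\hat\xi_{K|m}^{(n)}$ converges almost surely to $\xi_{K|m}$ as $n\to+\infty$;

(ii) $n\hat\xi_{K|m}^{(n)}=\sum_{\ell=1}^{q}\left(\hat\Phi_{\ell,K|m}^{(n)}(\widehat{\mathcal{W}}^{(n)})+p_{\ell|m}\left\|\hat\Psi_{\ell,K|m}^{(n)}(\widehat{\mathcal{W}}^{(n)})+\sqrt{n}\Delta_{\ell,K|m}\right\|^2\right)$.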

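Proof. (i). First, using the strong law of large numbers (SLLN) we obtain the almost sure convergence of $\hat N_m^{(n)}/n$ (resp. $\hat N_{\ell,m}^{(n)}/n$) to $p_m$ (resp. $p_{\ell,m}=p_mp_{\ell|m}$) as $n\to+\infty$. Then, as $n\to+\infty$, $\hat p_{\ell|m}^{(n)}$ converges almost surely (a.s.) to $p_{\ell|m}$ and, by the preceding convergence properties and the SLLN,

$$\hat\mu_m^{(n)}=\left(\frac{\hat N_m^{(n)}}{n}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i\right)\xrightarrow{a.s.}p_m^{-1}E\left(1_{\{U=m\}}X\right)=\mu_m,$$

$$\hat\mu_{\ell,m}^{(n)}=\left(\frac{\hat N_{\ell,m}^{(n)}}{n}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{Z_i=\ell,U_i=m\}}X_i\right)\xrightarrow{a.s.}p_{\ell,m}^{-1}E\left(1_{\{Z=\ell,U=m\}}X\right)=\mu_{\ell,m}$$

and

$$\hat V_m^{(n)}=\left(\frac{\hat N_m^{(n)}}{n}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}1_{\{U_i=m\}}X_i\otimes X_i\right)-\hat\mu_m^{(n)}\otimes\hat\mu_m^{(n)}\xrightarrow{a.s.}p_m^{-1}E\left(1_{\{U=m\}}X\otimes X\right)-\mu_m\otimes\mu_m=V_m.$$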