A Non-parametric Semi-supervised Discretization Method

Bondu, A. and Boulle, M. and Lemaire, V.

Orange Labs, 2 av. Pierre Marzin, 22300 Lannion, France (alexis.bondu@orange-ftgroup.com)

Loiseau, S. and Duval, B.

LERIA, Université d'Angers, 2 Boulevard Lavoisier, 49045 Angers Cedex 01, France

Abstract

Semi-supervised classification methods aim to exploit labelled and unlabelled examples to train a predictive model. Most of these approaches make assumptions on the distribution of classes. This article first proposes a new semi-supervised discretization method which adopts a very low informative prior on the data. This method discretizes the numerical domain of a continuous input variable, while keeping the information relative to the prediction of classes. Then, an in-depth comparison of this semi-supervised method with the original supervised MODL approach is presented. We demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, improved with a post-optimization of the interval bounds location.

1 Introduction

Data mining can be defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [10]. Even though the modeling phase is the core of the process, the quality of the results relies heavily on data preparation, which usually takes around 80% of the total time [17]. An interesting method for data preparation is to discretize the input variables.

Discretization methods aim to induce a list of intervals which splits the numerical domain of a continuous input variable, while keeping the information relative to the output variable [6] [12] [8] [14] [15]. A naïve Bayes classifier can exploit a discretization of its input space [3] as the set of intervals used to estimate the conditional probabilities of classes given the data. Discretization methods are useful for data mining, to explore, prepare and model data.

The objective of semi-supervised learning is to exploit unlabelled data to improve a predictive model. This article focuses on semi-supervised classification, a well known problem in the literature. Most semi-supervised approaches deal with particular cases where information about unlabelled data is available. Semi-supervised learning without strong assumptions on the data distribution is a great challenge.

This article proposes a new semi-supervised discretization method which adopts very low informative priors on the data. Our semi-supervised discretization method is based on the MODL framework [3] ("Minimal Optimized Description Length"). This approach turns the discretization problem into a model selection one. A Bayesian approach is applied and leads to an analytical evaluation criterion. Then, the best discretization model is selected by optimizing this criterion.

The organization of this paper is as follows: Section 2 presents the motivation for non-parametric semi-supervised learning; Section 3 formalizes our semi-supervised approach; our discretization method is compared with the supervised approach in Section 4; in Section 5, empirical and theoretical results are exploited to demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach improved with a post-optimization of the interval bounds location; Section 6 reports experiments on an artificial dataset; future work is discussed in Section 7.

2 Related works

This section introduces semi-supervised learning through a short state of the art. Previous works on supervised discretization are then summarized.

2.1 Semi-supervised algorithms

Semi-supervised classification methods [7] exploit labelled and unlabelled examples to train a predictive model.

The main existing approaches are the following:

The Self-training approach is a heuristic which iteratively uses the predictions of a model to label new examples. The newly labelled examples are in turn used to train the model. The uncertainty of the predictions is evaluated in order to label only the most confident examples [18] (a schematic sketch of this loop is given after this list).

The Co-training approach involves two predictive models which are independently trained on disjoint sub-feature sets. This heuristic uses the predictions of both models to label two examples at every iteration. Each model labels one example and "teaches" the other classifier with its prediction [2] [16].

The Covariate shift approach estimates the distributions of labelled and unlabelled examples [21]. The covariate shift formulation [20] weights labelled examples according to the disagreement between these distributions. This approach incorporates this disagreement into the training algorithm of a supervised model.

Generative model based approaches estimate the distribution of classes, under hypotheses on the data. These methods make the assumption that the distributions of classes belong to a known parametric family. Training data is then exploited in order to fit the parameter values [11].
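As an illustration of the first heuristic above, the following minimal sketch implements a generic self-training loop. It is not taken from the paper: the confidence threshold and the helper name `self_training` are assumptions, and any probabilistic classifier exposing `fit`/`predict_proba` in the scikit-learn style can be plugged in.

```python
# Schematic self-training loop (illustration only, not the authors' algorithm).
import numpy as np

def self_training(model, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Iteratively label the most confident unlabelled examples and retrain."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # keep only confident predictions
        if not confident.any():
            break
        new_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~confident]
    return model
```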

Semi-supervised learning without making hypotheses on the data distribution is a great challenge. Therefore, most semi-supervised approaches make assumptions on the distribution of classes.

For instance, generative model based approaches aim to estimate the joint distribution of data and classes $P(x, y) = P(y)P(x|y)$ (with data denoted by $x \in X$ and classes denoted by $y \in Y$). The distribution $P(x, y)$ is assumed to belong to a parametric family $\{P(x, y)_\theta\}$. The vector $\theta$ of finite size corresponds to the modeling parameters of $P(x, y)$. The joint distribution can be rewritten as $P(x, y)_\theta = P(y)_\theta P(x|y)_\theta$. The term $P(y)_\theta$ is defined by prior knowledge on the distribution of classes; $P(x|y)_\theta$ is identified within a given family of distributions, thanks to the vector $\theta$.

Let $U$ be the set of unlabelled examples and $L$ the set of labelled examples. The set $L$ contains pairs $(x, y)$, with $x$ a scalar value and $y \in [1, J]$ a discrete class value. The set $U$ contains scalar values without labels. Semi-supervised generative model based approaches aim to find the parameters $\theta$ which maximize $P(x, y)_\theta$ on the data set $D = U \cup L$. The quantity to be maximized is $p(L, U|\theta)$, the probability of the data given the parameters $\theta$. Maximum Likelihood Estimation (MLE) is widely employed to maximize $p(L, U|\theta)$ (with $(x_i, y_i) \in L$ and $x_i \in U$):

\[ \max_{\theta \in \Theta} \; \sum_{i=1}^{|L|} \log\left[ p(y_i)_\theta \, p(x_i|y_i)_\theta \right] \;+\; \sum_{i=1}^{|U|} \log \sum_{j=1}^{|Y|} p(y_j)_\theta \, p(x_i|y_j)_\theta \tag{1} \]
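To make Equation 1 concrete, the following sketch evaluates this semi-supervised log-likelihood under one particular parametric assumption, a univariate Gaussian class-conditional model chosen here purely for illustration; the function names and the numerical values are assumptions, not material from the paper.

```python
# Evaluating the objective of Equation 1 for a fixed parameter vector theta,
# assuming (for illustration only) Gaussian class-conditional densities.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def semi_supervised_log_likelihood(labelled, unlabelled, priors, means, stds):
    """labelled: list of (x, y) pairs; unlabelled: list of x values."""
    # labelled examples contribute log p(y) p(x|y)
    ll = sum(math.log(priors[y] * gaussian_pdf(x, means[y], stds[y]))
             for x, y in labelled)
    # unlabelled examples contribute the log of the class-marginalized density
    ll += sum(math.log(sum(priors[j] * gaussian_pdf(x, means[j], stds[j])
                           for j in range(len(priors))))
              for x in unlabelled)
    return ll

labelled = [(-1.2, 0), (-0.8, 0), (1.1, 1)]
unlabelled = [-0.9, 0.1, 1.3]
print(semi_supervised_log_likelihood(labelled, unlabelled,
                                     priors=[0.5, 0.5],
                                     means=[-1.0, 1.0],
                                     stds=[1.0, 1.0]))
```

In a full generative method, this quantity would be maximized over $\theta$, typically with the EM algorithm.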

These approaches are usable only if information about the distribution of classes is available. The hypothesis that P(x, y) belongs to a known family of distributions is a strong assumption which could be invalid in practice.

The objective of a non-parametric semi-supervised method is to estimate the distribution of classes without making strong hypotheses on these distributions. Therefore, our approach can be contrasted with the generative approaches. This article exploits the MODL framework [3] and proposes a new semi-supervised discretization method. This "objective" Bayesian approach makes very few assumptions on the data distribution.

2.2 Summary of the supervised MODL discretization method

The discretization of a descriptive variable aims at estimating the conditional distribution of the class labels, by means of a piecewise constant density estimator. In the MODL approach [3], discretization is turned into a model selection problem. First, a space of discretization models is defined. The parameters of a specific discretization are the number of intervals, the bounds of the intervals and the output frequencies in each interval. A Bayesian approach is applied to select the best discretization model, which is found by maximizing the probability $P(M|D)$ of the model $M$ given the data $D$. Using the Bayes rule, and since the probability $P(D)$ is constant when the model varies, this is equivalent to maximizing $P(M)P(D|M)$.

Let $N^l$ be the number of labelled examples, $J$ the number of classes, and $I$ the number of intervals of the input domain. $N_i^l$ denotes the number of labelled examples in the interval $i$, and $N_{ij}^l$ the number of labelled examples of output value $j$ in the interval $i$. A discretization model is then defined by the parameter set $\{I, \{N_i^l\}_{1 \le i \le I}, \{N_{ij}^l\}_{1 \le i \le I, 1 \le j \le J}\}$.

Owing to the definition of the model space and its prior distribution, the prior $P(M)$ and the conditional likelihood $P(D|M)$ can be calculated analytically. Taking the negative log of $P(M)P(D|M)$, we obtain the following criterion to minimize:

\[ C_{sup} = \underbrace{\log N^l + \log \binom{N^l + I - 1}{I - 1} + \sum_{i=1}^{I} \log \binom{N_i^l + J - 1}{J - 1}}_{-\log P(M)} \;+\; \underbrace{\sum_{i=1}^{I} \log \frac{N_i^l!}{N_{i1}^l! \, N_{i2}^l! \cdots N_{iJ}^l!}}_{-\log P(D|M)} \tag{2} \]

The first term of the criterion $C_{sup}$ stands for the choice of the number of intervals and the second term for the choice of the bounds of the intervals. The third term corresponds to the choice of the output distribution in each interval, and the last term represents the conditional likelihood of the data given the model. Therefore, "complex" models with large numbers of intervals are penalized. This discretization method for classification provides the most probable discretization given the data sample. Extensive comparative experiments showed high performance [3].
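The criterion of Equation 2 can be evaluated directly from the labelled counts. The sketch below is an illustrative implementation, not the authors' code; the helper names (`log_binom`, `c_sup`) and the example counts are assumptions, and `lgamma` is used so that the log-binomial and log-multinomial terms stay numerically stable.

```python
# Evaluating the supervised MODL criterion C_sup (Equation 2) for a candidate
# discretization, described by the labelled class counts of each interval.
from math import lgamma, log

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def c_sup(label_counts):
    """label_counts[i][j] = number of labelled examples of class j in interval i."""
    I = len(label_counts)
    J = len(label_counts[0])
    Nl = sum(sum(row) for row in label_counts)
    cost = log(Nl) + log_binom(Nl + I - 1, I - 1)           # number and bounds of intervals
    for row in label_counts:
        Ni = sum(row)
        cost += log_binom(Ni + J - 1, J - 1)                # class distribution of interval i
        cost += lgamma(Ni + 1) - sum(lgamma(n + 1) for n in row)  # likelihood term
    return cost

# Two candidate discretizations of the same 20 labelled examples:
print(c_sup([[10, 0], [0, 10]]))   # two class-pure intervals: lower cost
print(c_sup([[10, 10]]))           # a single mixed interval: higher cost
```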

3 A new semi-supervised discretization method

This section presents a new semi-supervised discretization method which is based on the previous work described above. The same modeling hypotheses as in [3] are adopted. A prior distribution $P(M)$ which exploits the hierarchy of the model parameters is first proposed. This prior distribution is uniform at each stage of this hierarchy. Then, we define the conditional likelihood $P(D|M)$ of the data given the model. This leads to an exact analytical criterion for the posterior probability $P(M|D)$.

Discretization models:

Let $\mathcal{M}$ be a family of semi-supervised discretization models, denoted $M(I, \{N_i\}, \{N_{ij}\})$. These models consider unlabelled and labelled examples together, and $N$ is the total number of examples in the data set. The model parameters are defined as follows: $I$ is the number of intervals; $\{N_i\}$ the number of examples in each interval; $\{N_{ij}\}$ the number of examples of each class in each interval.

3.1 Prior distribution

A prior distribution $P(M)$ is defined on the parameters of the models. This prior exploits the hierarchy of the parameters: the number of intervals is first chosen, then the bounds of the intervals, and finally the output frequencies. The joint distribution $P(I, \{N_i\}, \{N_{ij}\})$ can be written as follows:

\[ P(M) = P(I, \{N_i\}, \{N_{ij}\}) = P(I) \times P(\{N_i\}|I) \times P(\{N_{ij}\}|\{N_i\}, I) \]

The number of intervals is assumed to be uniformly distributed between 1 and $N$. Thus we get:

\[ P(I) = \frac{1}{N} \]

We now assume that, for a given number of intervals, all partitions of the data into $I$ intervals are equiprobable. Computing the probability of one set of intervals turns into the combinatorial evaluation of the number of possible interval sets, which is equal to $\binom{N+I-1}{I-1}$. The second term is defined as:

\[ P(\{N_i\}|I) = \frac{1}{\binom{N+I-1}{I-1}} \]

The last term $P(\{N_{ij}\}|\{N_i\}, I)$ can be rewritten as a product, assuming the independence of the distribution of classes between the intervals. For a given interval $i$ containing $N_i$ examples, all the distributions of the class values are considered equiprobable. The probabilities of the distributions are computed as follows:

\[ P(\{N_{ij}\}|\{N_i\}, I) = \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \]

Finally, the prior distribution of the model is similar to that of the supervised approach [3]. The only difference is that the semi-supervised prior takes into account all examples, including the unlabelled ones:

\[ P(M) = \frac{1}{N} \times \frac{1}{\binom{N+I-1}{I-1}} \times \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \tag{3} \]

3.2 Likelihood

This section focuses on the conditional likelihood $P(D|M)$ of the data given the model. First, the family $\Lambda$ of labelling models has to be defined. Semi-supervised discretization handles labelled and unlabelled pieces of data, and $\Lambda$ represents all possible labellings. Each model $\lambda(N^l, \{N_i^l\}, \{N_{ij}^l\}) \in \Lambda$ is characterized by the following parameters: $N^l$ is the total number of labelled examples; $\{N_i^l\}$ the number of labelled examples in the interval $i$; $\{N_{ij}^l\}$ the number of labelled examples of the class $j$ in the interval $i$.

Owing to the law of total probability, the likelihood can be written as follows:

\[ P(D|M) = \sum_{\lambda \in \Lambda} P(\lambda|M) \times P(D|M, \lambda) \]

$P(D|M)$ can be drastically simplified considering that $P(D|M, \lambda)$ is equal to 0 for all labelling models which are incompatible with the observed data $D$ and the discretization model $M$. The only compatible labelling model that is considered is denoted $\lambda$. The previous expression can be rewritten as follows:

\[ P(D|M) = P(\lambda|M) \times P(D|M, \lambda) \tag{4} \]

The first term $P(\lambda|M)$ can be written as a product, using the hypothesis of independence of the likelihood between the intervals. In a given interval $i$ which contains $N_{ij}$ examples of each class, the computation of $P(\lambda|M)$ consists in finding the probability of observing $\{N_{ij}^l\}$ examples of each class when drawing $N_i^l$ examples. Once again, this problem is turned into a combinatorial evaluation. The number of draws which induce $\{N_{ij}^l\}$ can be calculated, assuming the $N_i^l$ labelled examples are uniformly drawn:

\[ P(\lambda|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \binom{N_{ij}}{N_{ij}^l}}{\binom{N_i}{N_i^l}} \tag{5} \]

Let us consider a very simple and intuitive problem to explain Equation 5. An interval $i$ can be compared with a "bag" containing $N_{i1}$ "black balls" and $N_{i2}$ "white balls". Given the parameters $N_{i1} = 6$ and $N_{i2} = 20$, what is the probability to simultaneously draw $N_{i1}^l = 2$ black balls and $N_{i2}^l = 3$ white balls? Let $\binom{26}{5}$ be the number of possible draws, and $\binom{6}{2} \times \binom{20}{3}$ the number of draws which are composed of 2 black balls and 3 white balls. Assuming that all draws are equiprobable, the probability to simultaneously draw 2 black balls and 3 white balls is given by $\frac{\binom{6}{2} \times \binom{20}{3}}{\binom{26}{5}}$.

The second term $P(D|M, \lambda)$ of Equation 4 is estimated considering a uniform prior over all possible permutations of $\{N_{ij}^l\}$ examples of each class among $N_i^l$. The independence assumption between the intervals gives:

\[ P(D|M, \lambda) = \prod_{i=1}^{I} \frac{1}{\dfrac{N_i^l!}{N_{i1}^l! \, N_{i2}^l! \cdots N_{iJ}^l!}} = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}^l!}{N_i^l!} \]

Finally, the likelihood of the model is:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \binom{N_{ij}}{N_{ij}^l} \times N_{ij}^l!}{\binom{N_i}{N_i^l} \times N_i^l!} \]

In every interval, the number of unlabelled examples is denoted by $N_{ij}^u = N_{ij} - N_{ij}^l$ and $N_i^u = N_i - N_i^l$. The previous expression can be rewritten as:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \dfrac{N_{ij}!}{N_{ij}^u!}}{\dfrac{N_i!}{N_i^u!}} \]

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{6} \]

3.3 Evaluation criterion

The best semi-supervised discretization model is found by maximizing the probability $P(M|D)$. A Bayesian evaluation criterion is obtained by exploiting Equations 3 and 6. The maximum a posteriori model, denoted $M_{map}$, is defined by:

\[ M_{map} = \operatorname*{argmax}_{M \in \mathcal{M}} \; \frac{1}{N} \times \frac{1}{\binom{N+I-1}{I-1}} \times \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \times \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{7} \]

Taking the negative log of the probabilities, the maximization problem turns into the minimization of the criterion $C_{semi\,sup}$:

\[ M_{map} = \operatorname*{argmin}_{M \in \mathcal{M}} \; C_{semi\,sup}(M) \]

\[ C_{semi\,sup}(M) = \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} + \sum_{i=1}^{I} \log\frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} - \sum_{i=1}^{I} \log\frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{8} \]
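As with the supervised criterion, Equation 8 can be evaluated directly from the per-interval counts. The following sketch is illustrative only (the helper names and example counts are assumptions); each interval is described by its total class counts $N_{ij}$ and its labelled class counts $N_{ij}^l$.

```python
# Evaluating the semi-supervised criterion C_semi-sup (Equation 8).
from math import lgamma, log

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_multinom(n, parts):
    """log of n! / (parts[0]! * parts[1]! * ...)."""
    return lgamma(n + 1) - sum(lgamma(p + 1) for p in parts)

def c_semi_sup(intervals):
    """intervals = list of (Nij, Nij_l): total and labelled class counts per interval."""
    I = len(intervals)
    N = sum(sum(nij) for nij, _ in intervals)
    cost = log(N) + log_binom(N + I - 1, I - 1)              # prior: number and bounds
    for nij, nij_l in intervals:
        ni = sum(nij)
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        cost += log_binom(ni + len(nij) - 1, len(nij) - 1)   # prior: class distribution
        cost += log_multinom(ni, nij) - log_multinom(sum(nij_u), nij_u)  # likelihood
    return cost

# With no unlabelled example (Nij == Nij_l), the last term vanishes and the
# value coincides with C_sup of Equation 2 (see Section 4.1).
print(c_semi_sup([([10, 0], [10, 0]), ([0, 10], [0, 10])]))
```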

4 Comparison: semi-supervised vs. supervised criteria

In this section, the semi-supervised criterion $C_{semi\,sup}$ of Equation 8 is compared with the supervised criterion $C_{sup}$ of Equation 2:

both criteria are analytically equivalent when $U = \emptyset$;

the semi-supervised criterion reduces to the prior distribution when $L = \emptyset$; in this case, the semi-supervised and supervised approaches give the same discretization;

the semi-supervised approach is penalized by a high modeling cost when the data set includes both labelled and unlabelled examples; in this case, the optimization of the criterion $C_{semi\,sup}$ gives a model with fewer intervals than the supervised approach.

4.1 Labelled examples only

In this case, all training examples are supposed to be labelled: $D = L$ and $U = \emptyset$. We have $N_i^u = 0$ for each interval and $N_{ij}^u = 0$ for each class. Therefore, the last term of Equation 8 is equal to zero. The criterion $C_{semi\,sup}$ can be rewritten as follows:

\[ \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} + \sum_{i=1}^{I} \log\frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} \tag{9} \]

When all the training examples are labelled, $N = N^l$, $N_i = N_i^l$ and $N_{ij} = N_{ij}^l$. The semi-supervised criterion $C_{semi\,sup}$ and the supervised criterion $C_{sup}$ are equivalent.


4.2 Unlabelled examples only

In the case where no example is labelled, we have $D = U$ and $L = \emptyset$. For each interval $N_i^u = N_i$ and for each class $N_{ij}^u = N_{ij}$. Therefore, the term $P(D|M)$ is equal to 1 for any model. The conditional likelihood (Equation 6) can be rearranged as follows:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} = 1 \]

The posterior distribution then reduces to the prior distribution, $P(M|D) = P(M)$, in which case the model $M_{map}$ includes a single interval. Both criteria give the same discretization, since the supervised approach is not able to cut the numerical domain of the input variable in this case either. $C_{semi\,sup}$ can be rewritten as:

\[ \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} \]

4.3 Mixture of labelled and unlabelled examples

The main difference between the semi-supervised and the supervised approaches lies in the prior distribution $P(M)$. In the semi-supervised approach, the space of discretization models is larger than in the supervised approach: unlabelled examples represent additional possible locations for the interval bounds. Therefore, the modeling cost of the prior distribution is higher for the semi-supervised criterion. When the number of unlabelled examples increases, the criterion $C_{semi\,sup}$ prefers models with fewer intervals.

This behaviour is illustrated with a very simple experiment. Let us consider a binary classification problem. All examples belonging to the class "0" [respectively "1"] are located at $x = 0$ [respectively $x = 1$]. During the experiment, the total number of examples $N$ increases. The number of labelled examples is always the same in both classes. For every value of $N$, we evaluate $N_{min}^l$, the minimal number of labelled examples which induces an $M_{map}$ with two intervals (rather than a single interval).

Figure 1 plots $N_{min}^l$ against $N = N^l + N^u$ for both criteria. For the criterion $C_{sup}$, the minimal number of labelled examples necessary to split the data does not depend on $N$: in this case, $N_{min}^l = 6$ for every value of $N$. A different behaviour is observed for $C_{semi\,sup}$. Figure 1 quantifies the influence of $N$ on the selection of the model $M_{map}$: when the number of examples $N$ grows, $N_{min}^l$ increases approximately as $\log(N)$. Therefore, the criterion $C_{semi\,sup}$ gives a model $M_{map}$ with fewer intervals than the supervised approach, due to its higher modeling cost.

Figure 1. Mixture of labelled and unlabelled examples. The vertical axis represents the minimal number of labelled examples necessary to obtain a model with two intervals, rather than a model with a single interval. The horizontal axis represents the total number of examples, using a logarithmic scale.
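The toy experiment above can be reproduced approximately with the sketch below. It is an assumption-laden reconstruction rather than the authors' protocol: the examples are supposed to be split evenly between $x = 0$ and $x = 1$, the labelled examples evenly between the two classes, and the helper names are hypothetical.

```python
# Minimal number of labelled examples N_l_min for which the two-interval model
# beats the single-interval model under C_semi-sup (Equation 8), for the toy
# problem of Section 4.3 (class 0 at x=0, class 1 at x=1).
from math import lgamma, log

def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_multinom(n, parts):
    return lgamma(n + 1) - sum(lgamma(p + 1) for p in parts)

def c_semi_sup(intervals):
    """Equation 8; intervals = list of (Nij, Nij_l) per interval."""
    I = len(intervals)
    N = sum(sum(nij) for nij, _ in intervals)
    cost = log(N) + log_binom(N + I - 1, I - 1)
    for nij, nij_l in intervals:
        ni = sum(nij)
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        cost += log_binom(ni + len(nij) - 1, len(nij) - 1)
        cost += log_multinom(ni, nij) - log_multinom(sum(nij_u), nij_u)
    return cost

def n_l_min(N):
    """Smallest (even) number of labelled examples for which two intervals win (N even)."""
    for N_l in range(2, N + 1, 2):
        half, l_half = N // 2, N_l // 2
        # two class-pure intervals: both multinomial terms vanish
        two = c_semi_sup([([half, 0], [l_half, 0]), ([0, half], [0, l_half])])
        # single interval: the class totals are free parameters, take the best split
        one = min(c_semi_sup([([a, N - a], [l_half, l_half])])
                  for a in range(l_half, N - l_half + 1))
        if two < one:
            return N_l
    return None

for N in (10, 100, 1000):
    print(N, n_l_min(N))   # N_l_min grows slowly with N, as reported in Figure 1
```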

5 Theoretical and empirical results

Figure 2. Structure of Section 5: the discretization bias on the interval bounds location (Section 5.1, empirical result), the interpretation of the likelihood in terms of entropy (Lemma A, Section 5.2, theoretical result) and the optimization of the parameters $N_{ij}$ (Lemma B, Section 5.2, theoretical result) lead to the asymptotic equivalence between the semi-supervised approach and the supervised approach combined with a post-optimization of the bounds location.

Figure 2 illustrates the structure of the results presented in this section, and their relations. An additional discretization bias is first empirically established for our semi-supervised discretization method. Then, two theoretical results are demonstrated: an interpretation of the likelihood in terms of entropy, and an analytical expression of the optimal $N_{ij}$. Taking into account these empirical and theoretical results, we demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, associated with a post-optimization of the bounds location.

5.1 Discretization bias

The semi-supervised and the supervised discretization approaches are based on rank statistics. Therefore, the locations of the bounds between the intervals of the optimal model are defined in a discrete space, through the number of examples in every interval. The discretization bias aims to define the bounds location in the numerical domain of the continuous input variable.

5.1.1 How to position a bound between two training examples?

The parameters $\{N_i\}$ [respectively $\{N_i^l\}$] given by the optimization of $C_{semi\,sup}$ [respectively $C_{sup}$] are not sufficient to define a continuous bounds location. Indeed, there is an infinity of possible locations between two training examples.

A prior is adopted in [3] which considers the best bound location as the median of the distribution of the true bound locations, denoted $P_{tb}$. This median minimizes the generalization mean squared error, for any $P_{tb}$. The objective is to place a bound between two examples without information about the distribution $P_{tb}$. In this case, $P_{tb}$ is assumed to be uniform, and a bound is placed midway between the two concerned examples.

5.1.2 How to position a bound in an unlabelled area?

The optimization of the semi-supervised criterion $C_{semi\,sup}$ does not indicate the best bounds location when the parameters $\{N_i^l\}$ are constant. This phenomenon is observed on a toy example below. Considering an area of the input space $X$ where no example is labelled, all possible bound locations have the same cost according to the criterion $C_{semi\,sup}$. Therefore, the semi-supervised approach is not able to determine the bounds location in such an unlabelled area. The same prior as in [3], which aims at minimizing the generalization mean squared error, is adopted in order to define a continuous bounds location. The unlabelled examples are supposed to be drawn from the distribution $P_{tb}$. In this case, the median of $P_{tb}$ is estimated by exploiting the unlabelled examples. The interval bounds are placed in the middle of unlabelled areas.
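In code, this bound placement rule can be sketched as follows (an illustrative interpretation, not the authors' implementation): the median of the unlabelled values falling in the gap is used as the continuous bound, and the rule falls back to the midway placement of Section 5.1.1 when the gap contains no unlabelled example.

```python
# Placing a continuous bound inside the gap between two labelled examples,
# using the unlabelled examples that fall in that gap (illustrative sketch).
import statistics

def refine_bound(left_labelled_x, right_labelled_x, unlabelled_x):
    """left/right_labelled_x: labelled values surrounding the cut;
    unlabelled_x: all unlabelled values (only those inside the gap are used)."""
    in_gap = [x for x in unlabelled_x if left_labelled_x < x < right_labelled_x]
    if not in_gap:
        # no unlabelled information: place the bound midway between the two
        # labelled examples, as in the supervised prior
        return (left_labelled_x + right_labelled_x) / 2.0
    return statistics.median(in_gap)

print(refine_bound(0.40, 0.60, [0.41, 0.45, 0.52, 0.55, 0.59]))  # 0.52
```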

Finally, neither the supervised nor the semi-supervised approach is able to position a continuous bound between two labelled examples. In both cases, the same prior on the best bounds location is adopted. The only interest of the unlabelled examples is to bring information about $P_{tb}$, and to refine the median of this distribution.

5.1.3 Empirical evidence

Let us consider a univariate binary classification problem. Training examples are uniformly distributed in the interval $[0, 1]$. This data set contains three separate areas denoted "A", "B", "C". The part "A" [respectively "C"] includes 40 labelled examples of class "0" [respectively "1"] and corresponds to the interval $[0, 0.4]$ [respectively $[0.6, 1]$]. The part "B" corresponds to the interval $[0.4, 0.6]$ and contains 20 unlabelled examples.

As part of this experiment, the family of discretization models $\mathcal{M}$ is restricted to the models which contain two intervals. This toy problem consists in finding the best bound $b \in [0, 1]$ between the two intervals of the model. Every bound is related to the number of examples in each interval, $\{N_1, N_2\}$.

There are many possible models for a given bound (due to the $N_{ij}$ parameters). We estimate the probability of a bound by Bayesian averaging over all possible models which are compatible with the bound. This evaluation is not biased by the choice of a particular model among all possible models. For a given bound $b$, the parameters $\{N_{ij}\}$ are not defined, and we have:

\[ P(b|D) = \sum_{\{N_{ij}\}} P(b, \{N_{ij}\} \mid D) \]

Using the Bayes rule, we get:

\[ P(b|D) \times P(D) = \sum_{\{N_{ij}\}} P(D \mid b, \{N_{ij}\}) \times P(b, \{N_{ij}\}) \]

Figure 3 plots $-\log P(b|D)$ against the bound's location $b$. Minimal values of this curve give the best bound locations. This figure indicates that it is neither wise to cut the data set in part "A" nor in part "C". All bound locations in part "B" are equivalent and optimal according to the criterion $C_{semi\,sup}$.

This experiment empirically shows that the criterion $C_{semi\,sup}$ cannot distinguish between bound locations in an unlabelled area of the input space $X$. This result is unexpected and difficult to demonstrate formally (due to the Bayesian averaging over models). Intuitively, this phenomenon can be explained by the fact that the criterion $C_{semi\,sup}$ has no expressed preference on the bound location. This is consistent with an "objective" Bayesian approach [1].


Figure 3. Bound's quantity of information vs. bound's location.

5.2 A post-optimisation of the supervised approach

This section demonstrates that the semi-supervised approach is asymptotically equivalent to the supervised approach improved with a post-optimization of the bounds location. This post-optimization consists in exploiting unlabelled examples in order to position the interval bounds in the middle of unlabelled areas.

5.2.1 Equivalent prior distribution

The discretization bias established in Section 5.1 modifies our a priori knowledge about the distribution $P(M)$. From now on, the bounds are forced to be placed in the middle of unlabelled areas. The number of possible locations for each bound is substantially reduced: the criterion $C_{semi\,sup}$ considers $N - 1$ possible locations for each bound, whereas only $N^l - 1$ possible locations remain once the discretization bias of Section 5.1 is exploited. Under these conditions, the prior distribution $P(M)$ (see Equation 3) can easily be rewritten as in the supervised approach (see Equation 2).

5.2.2 Asymptotically equivalent likelihood

Lemma A: The conditional likelihood of the data given the model can be expressed using the entropy (denoted $H_M$) of the sets $U$, $L$ and $D$, given the model $M$:

Supervised case:
\[ -\log P(D|M) = N^l H_M(L) + O(\log N) \]

Semi-supervised case:
\[ -\log P(D|M) = N H_M(D) - N^u H_M(U) + O(\log N) \]

Proof: Let us denote $H_M(D)$ the Shannon entropy [19] of the data, given a discretization model $M$. We assume that $H_M(D)$ is equal to its empirical evaluation:

\[ H_M(D) = -\sum_{i=1}^{I} \frac{N_i}{N} \sum_{j=1}^{J} \frac{N_{ij}}{N_i} \log \frac{N_{ij}}{N_i} \]

In the semi-supervised case, $P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}! / N_{ij}^u!}{N_i! / N_i^u!}$ (Equation 6). Consequently:

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ \log(N_i!) - \log(N_i^u!) - \sum_{j=1}^{J} \log(N_{ij}!) + \sum_{j=1}^{J} \log(N_{ij}^u!) \right] \]

Stirling's approximation gives $\log(n!) = n\log(n) - n + O(\log n)$:

\[ -\log P(D|M) = \sum_{i=1}^{I} \Bigg[ N_i\log(N_i) - N_i - N_i^u\log(N_i^u) + N_i^u - \sum_{j=1}^{J} \big[ N_{ij}\log(N_{ij}) - N_{ij} \big] + \sum_{j=1}^{J} \big[ N_{ij}^u\log(N_{ij}^u) - N_{ij}^u \big] + O(\log N_i) - O(\log N_i^u) - \sum_{j=1}^{J} O(\log N_{ij}) + \sum_{j=1}^{J} O(\log N_{ij}^u) \Bigg] \]

Exploiting the fact that $\sum_{j=1}^{J} N_{ij} = N_i$ and $\sum_{j=1}^{J} N_{ij}^u = N_i^u$, we obtain:

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ \sum_{j=1}^{J} N_{ij}^u \big( \log N_{ij}^u - \log N_i^u \big) - \sum_{j=1}^{J} N_{ij} \big( \log N_{ij} - \log N_i \big) + O(\log N_i) \right] \]

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ -N_i \sum_{j=1}^{J} \frac{N_{ij}}{N_i} \log\!\left(\frac{N_{ij}}{N_i}\right) + N_i^u \sum_{j=1}^{J} \frac{N_{ij}^u}{N_i^u} \log\!\left(\frac{N_{ij}^u}{N_i^u}\right) + O(\log N_i) \right] \]

The entropy is additive on disjoint sets. We get:

\[ -\log P(D|M) = N H_M(D) - N^u H_M(U) + O(\log N) \]
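The approximation of Lemma A is easy to check numerically: for a fixed discretization, the exact $-\log P(D|M)$ of Equation 6 and the entropy-based expression should differ only by a term that grows like $\log N$. The sketch below (assumed helper names, arbitrary counts) scales the same class proportions up by factors of 10 and prints both quantities.

```python
# Numerical check of Lemma A: exact -log P(D|M) vs N*H_M(D) - N^u*H_M(U).
from math import lgamma, log

def neg_log_likelihood(intervals):
    """Exact -log P(D|M) from Equation 6; intervals = list of (Nij, Nij_l)."""
    total = 0.0
    for nij, nij_l in intervals:
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        ni, ni_u = sum(nij), sum(nij_u)
        total += lgamma(ni + 1) - sum(lgamma(n + 1) for n in nij)
        total -= lgamma(ni_u + 1) - sum(lgamma(n + 1) for n in nij_u)
    return total

def entropy(counts_per_interval):
    """Empirical entropy H_M of a set described by per-interval class counts."""
    n = sum(sum(c) for c in counts_per_interval)
    h = 0.0
    for counts in counts_per_interval:
        ni = sum(counts)
        h += (ni / n) * -sum((c / ni) * log(c / ni) for c in counts if c > 0)
    return h

for scale in (10, 100, 1000):
    intervals = [([7 * scale, 3 * scale], [4 * scale, 1 * scale]),
                 ([2 * scale, 8 * scale], [1 * scale, 4 * scale])]
    totals = [nij for nij, _ in intervals]
    unlab = [[a - b for a, b in zip(nij, nij_l)] for nij, nij_l in intervals]
    N, Nu = sum(map(sum, totals)), sum(map(sum, unlab))
    exact = neg_log_likelihood(intervals)
    approx = N * entropy(totals) - Nu * entropy(unlab)
    print(N, round(exact, 1), round(approx, 1), round(exact - approx, 1))
```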


Lemma B: The values of the parameters $\{N_{ij}\}$ which minimize the criterion $C_{semi\,sup}$ (denoted $\{N_{ij}^*\}$) correspond to the proportion of labels observed in each interval:

\[ N_{ij}^* = \left\lceil \frac{(N_i + 1) \times N_{ij}^l}{N_i^l} \right\rceil - 1 \]

If $\sum_{j=1}^{J} N_{ij}^* = N_i - 1$, simply choose one of the $N_{ij}^*$ and add 1; all such completions are equivalent and optimal for $C_{semi\,sup}$.

Proof: This proof handles the case of a single-interval model. Since the data distribution is assumed to be independent between the intervals, the proof can be repeated independently on the $I$ intervals. We consider a binary classification problem. Let the function $f(N_{i1}, N_{i2})$ denote the criterion $C_{semi\,sup}$ with all parameters fixed except $N_{i1}$ and $N_{i2}$. We aim to find an analytical expression of the minimum of the function $f(N_{i1}, N_{i2})$:

\[ f(N_{i1}, N_{i2}) = \log\!\left( \frac{(N_{i1} - N_{i1}^l)!}{N_{i1}!} \right) + \log\!\left( \frac{(N_{i2} - N_{i2}^l)!}{N_{i2}!} \right) \]

The terms $N_{i1}^l$ and $N_{i2}^l$ are constant, and $N_{i2} = N_i - N_{i1}$; $f$ can be rewritten as a single-parameter function:

\[ f(N_{i1}) = \log\!\left( \frac{(N_{i1} - N_{i1}^l)!}{N_{i1}!} \right) + \log\!\left( \frac{(N_i - N_{i1} - N_{i2}^l)!}{(N_i - N_{i1})!} \right) \]

\[ = \sum_{k=1}^{N_{i1} - N_{i1}^l} \log k - \sum_{k=1}^{N_{i1}} \log k + \sum_{k=1}^{N_i - N_{i1} - N_{i2}^l} \log k - \sum_{k=1}^{N_i - N_{i1}} \log k \]

\[ = -\sum_{k=N_{i1} - N_{i1}^l + 1}^{N_{i1}} \log k \;-\; \sum_{k=N_i - N_{i1} - N_{i2}^l + 1}^{N_i - N_{i1}} \log k \]

And:

\[ f(N_{i1} + 1) = -\sum_{k=N_{i1} - N_{i1}^l + 2}^{N_{i1} + 1} \log k \;-\; \sum_{k=N_i - N_{i1} - N_{i2}^l}^{N_i - N_{i1} - 1} \log k \]

Consequently:

\[ f(N_{i1}) - f(N_{i1} + 1) = \log(N_{i1} + 1) - \log(N_{i1} + 1 - N_{i1}^l) - \log(N_i - N_{i1}) + \log(N_i - N_{i2}^l - N_{i1}) \]

\[ = \log\!\left( \frac{(N_{i1} + 1)(N_i - N_{i2}^l - N_{i1})}{(N_{i1} + 1 - N_{i1}^l)(N_i - N_{i1})} \right) \]

$f(N_{i1})$ decreases if:

\[ f(N_{i1}) - f(N_{i1} + 1) > 0 \;\Leftrightarrow\; \frac{(N_{i1} + 1)(N_i - N_{i2}^l - N_{i1})}{(N_{i1} + 1 - N_{i1}^l)(N_i - N_{i1})} > 1 \;\Leftrightarrow\; -N_{i2}^l N_{i1} - N_{i2}^l > -N_{i1}^l N_i + N_{i1}^l N_{i1} \;\Leftrightarrow\; N_{i1} < \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \]

In the same way, $f(N_{i1})$ increases if:

\[ f(N_{i1}) - f(N_{i1} + 1) < 0 \;\Leftrightarrow\; N_{i1} > \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \]

As $f(N_{i1})$ is a discrete function, its minimum is reached for $N_{i1} = \left\lceil \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \right\rceil$. This expression can be generalized to the case of $J$ classes¹:

\[ N_{ij}^* = \left\lceil \frac{(N_i + 1) \times N_{ij}^l}{N_i^l} \right\rceil - 1 \]
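Lemma B translates directly into a small routine that fills in the optimal class counts of an interval from its labelled counts. The sketch below is illustrative (the function name and the final tie-breaking choice are assumptions; the lemma states that all completions of the $N_i - 1$ case are equivalent).

```python
# Optimal class counts N*_ij of an interval under C_semi-sup, following Lemma B.
import math

def optimal_class_counts(Ni, Nij_l):
    """Ni: total number of examples in the interval; Nij_l: labelled counts per class."""
    Ni_l = sum(Nij_l)
    counts = [max(math.ceil((Ni + 1) * nl / Ni_l) - 1, 0) for nl in Nij_l]
    # the counts may sum to Ni - 1; any class may then absorb the remaining example
    if sum(counts) == Ni - 1:
        counts[0] += 1
    return counts

print(optimal_class_counts(26, [2, 3]))  # [10, 16]: roughly 2/5 and 3/5 of the 26 examples
```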

Theorem 5.1 Given the best model $M_{map}$, Lemma B states that the proportions of the labels are the same in the sets $L$ and $D$. Thus, $L$ and $D$ have the same entropy. The set $U$ also has the same entropy, because $U = D \setminus L$. Exploiting Lemma A, we have for the semi-supervised case:

\[ -\log P(D|M_{map}) = N H_{M_{map}}(D) - N^u H_{M_{map}}(U) + O(\log N) \]
\[ -\log P(D|M_{map}) = (N - N^u) H_{M_{map}}(L) + O(\log N) \]
\[ -\log P(D|M_{map}) = N^l H_{M_{map}}(L) + O(\log N) \]

We have:

\[ -\log P(D|M_{map}) + \log P^*(D|M_{map}) = O(\log N) \]
\[ \lim_{N \to +\infty} \frac{-\log P(D|M_{map}) + \log P^*(D|M_{map})}{-\log P(D|M_{map})} = 0 \]

with $P(D|M_{map})$ [respectively $P^*(D|M_{map})$] corresponding to the semi-supervised [respectively supervised] approach.

The conditional likelihood $P(D|M_{map})$ is asymptotically the same in the supervised and the semi-supervised cases. Both approaches aim to solve the same optimization problem. Owing to this result, the semi-supervised approach can be reformulated a posteriori: our approach is equivalent to [3] improved with a post-optimization of the bounds location.

¹ The generalized expression of $N_{ij}^*$ has been empirically verified on multi-class data sets.


6 Experiments

A toy problem is exploited to evaluate the behavior of the post-optimized method (see Section 5.2). This problem consists in estimating a step function from data. The artificial dataset consists of examples which belong to the class "1" on the left-hand part of Figure 4 and to the class "2" on the right-hand part. The objective of this experiment is to find the step location with as few labelled examples as possible.

Figure 4. The step dataset: examples of class 1 to the left of the step located at x = 220, examples of class 2 to the right.

The values of the variable (horizontal axis) which characterize the 100 examples $x_i$ are drawn according to the expression $x_i = e^\alpha$, with $\alpha$ varying between 0 and 10 with a step of 0.1. The step is placed at $x = 220$: 47 examples belong to class 1 and 53 to class 2. The train set and the test set each contain 100 examples.
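The dataset can be regenerated approximately as below. This is a sketch under stated assumptions: the exact $\alpha$ grid used by the authors is not fully specified by the text, so the resulting class counts may differ slightly from the reported 47/53 split.

```python
# Rough reconstruction of the step dataset of Figure 4 (assumed alpha grid).
import math

alphas = [0.1 * k for k in range(100)]        # 0.0, 0.1, ..., 9.9
xs = [math.exp(a) for a in alphas]            # x_i = exp(alpha)
ys = [1 if x < 220 else 2 for x in xs]        # class step at x = 220
print(sum(y == 1 for y in ys), sum(y == 2 for y in ys))
```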

We compare two discretization methods: the supervised method (see Section 2) and the supervised method with a post-optimisation of the bounds location (see Section 5.2). For both methods, the $M_{map}$ is exploited to discretize the input variable; this variable is then given as input to a naive Bayes classifier. The predictive model is evaluated using the area under the ROC curve (AUC) [9]. The number of labelled examples is the only free parameter and allows comparisons between both methods; the examples to be labelled are drawn randomly. The experiment in this section is carried out considering discretization models with one or two intervals, which is consistent with the theoretical proofs demonstrated above.

Figure 5 plots the average AUC versus the number of labelled examples. For each value of the number of labelled examples, the experiment has been repeated 10 times. Points represent the mean AUC and notches the variance of the results. With fewer than 6 labelled examples, both discretization methods give an $M_{map}$ with a single interval; in this case, the AUC is equal to 0.5. From 6 labelled examples onwards, the $M_{map}$ includes two intervals for both discretization methods. The bound between these two intervals becomes better and better as examples are labelled. Figure 5 shows that the post-optimization of the bound location improves the supervised discretization method. This improvement is weak but always present: (i) the post-optimization always improves the discretization; (ii) the improvement is larger when the number of labelled examples is low.

Figure 5. Comparison between the post-optimised method and the supervised method (AUC vs. number of labelled examples).

Results on this artificial dataset, which is often used to test discretization methods [4], are very promising. This section shows that our post-optimization of the bounds location improves the optimal discretization model when the distribution of examples is not uniform. The method described in this paper will be tested on step functions [5], with or without noise, as in [4] [13]. We will show its robustness compared to other methods which discretize a single dimension into two intervals.

7 Conclusion

This article presents a new semi-supervised discretization method based on very few assumptions on the data distribution. It provides an in-depth analysis of the problem of dealing with a set of labelled and unlabelled examples.

This paper significantly extends the previous research of Boulle in [3] on the supervised discretization method MODL: it presents a semi-supervised generalization of it, where additional unlabelled learning examples are taken into account. The results have been explained in an intuitive manner, and mathematical proofs have also been given.
