A Non-parametric Semi-supervised Discretization Method

Bondu, A. and Boulle, M. and Lemaire, V.

Orange Labs, 2 av. Pierre Marzin, 22300 Lannion, France (alexis.bondu@orange-ftgroup.com)

Loiseau, S. and Duval, B.

LERIA, Université d'Angers, 2 Boulevard Lavoisier, 49045 Angers Cedex 01, France

Abstract

Semi-supervised classification methods aim to exploit labelled and unlabelled examples to train a predictive model. Most of these approaches make assumptions on the distribution of classes. This article first proposes a new semi-supervised discretization method which adopts a very low informative prior on the data. This method discretizes the numerical domain of a continuous input variable, while keeping the information relative to the prediction of classes. Then, an in-depth comparison of this semi-supervised method with the original supervised MODL approach is presented. We demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, improved with a post-optimization of the interval bounds location.

1 Introduction

Data mining can be defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [10]. Even though the modeling phase is the core of the process, the quality of the results relies heavily on data preparation, which usually takes around 80% of the total time [17]. An interesting method for data preparation is to discretize the input variables.

Discretization methods aim to induce a list of intervals which splits the numerical domain of a continuous input variable, while keeping the information relative to the output variable [6] [12] [8] [14] [15]. A naïve Bayes classifier can exploit a discretization of its input space [3] as the set of intervals used to estimate the conditional probabilities of classes given the data. Discretization methods are useful for data mining, to explore, prepare and model data.

The objective of semi-supervised learning is to exploit unlabelled data to improve a predictive model. This article focuses on semi-supervised classification, a well known problem in the literature. Most semi-supervised approaches deal with particular cases where information about unlabelled data is available. Semi-supervised learning without strong assumptions on the data distribution is a great challenge.

This article proposes a new semi-supervised discretization method which adopts very low informative priors on the data. Our semi-supervised discretization method is based on the MODL framework [3] ("Minimal Optimized Description Length"). This approach turns the discretization problem into a model selection one. A Bayesian approach is applied and leads to an analytical evaluation criterion. Then, the best discretization model is selected by optimizing this criterion.

The organization of this paper is as follows: Section 2 presents the motivation for non-parametric semi-supervised learning; Section 3 formalizes our semi-supervised approach; our discretization method is compared with the supervised approach in Section 4; in Section 5, empirical and theoretical results are exploited to demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach improved with a post-optimization of the interval bounds location; Section 6 reports experiments on an artificial dataset; future work is discussed in Section 7.

2 Related works

This section introduces semi-supervised learning through a short state of the art. Previous works on supervised discretization are then summarized.

2.1 Semi-supervised algorithms

Semi-supervised classification methods [7] exploit labelled and unlabelled examples to train a predictive model.

The main existing approaches are the following:

The Self-training approach is a heuristic which iteratively uses the predictions of a model to label new examples. The newly labelled examples are in turn used to train the model. The uncertainty of the predictions is evaluated in order to label only the most confident examples [18] (a schematic sketch of this loop is given after this list).

The Co-training approach involves two predictive models which are independently trained on disjoint sub-feature sets. This heuristic uses the predictions of both models to label two examples at every iteration. Each model labels one example and "teaches" the other classifier with its prediction [2] [16].

The Covariate shift approach estimates the distributions of labelled and unlabelled examples [21]. The covariate shift formulation [20] weights labelled examples according to the disagreement between these distributions. This approach incorporates this disagreement into the training algorithm of a supervised model.

Generative model based approaches estimate the distribution of classes, under hypotheses on the data. These methods make the assumption that the distributions of classes belong to a known parametric family. Training data is then exploited in order to fit the parameter values [11].
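As an illustration of the first heuristic above, the following minimal sketch implements a generic self-training loop. It is not taken from the paper: the confidence threshold and the helper name `self_training` are assumptions, and any probabilistic classifier exposing `fit`/`predict_proba` in the scikit-learn style can be plugged in.

```python
# Schematic self-training loop (illustration only, not the authors' algorithm).
import numpy as np

def self_training(model, X_lab, y_lab, X_unlab, threshold=0.95, max_iter=10):
    """Iteratively label the most confident unlabelled examples and retrain."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold   # keep only confident predictions
        if not confident.any():
            break
        new_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~confident]
    return model
```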

Semi-supervised learning without making hypotheses on the data distribution is a great challenge. Therefore, most semi-supervised approaches make assumptions on the distribution of classes.

For instance, generative model based approaches aim to estimate the joint distribution of data and classes $P(x, y) = P(y)P(x|y)$ (with data denoted by $x \in X$ and classes denoted by $y \in Y$). The distribution $P(x, y)$ is assumed to belong to a parametric family $\{P(x, y)_\theta\}$. The vector $\theta$ of finite size corresponds to the modeling parameters of $P(x, y)$. The joint distribution can be rewritten as $P(x, y)_\theta = P(y)_\theta P(x|y)_\theta$. The term $P(y)_\theta$ is defined by prior knowledge on the distribution of classes; $P(x|y)_\theta$ is identified within a given family of distributions, thanks to the vector $\theta$.

Let $U$ be the set of unlabelled examples and $L$ the set of labelled examples. The set $L$ contains pairs $(x, y)$, with $x$ a scalar value and $y \in [1, J]$ a discrete class value. The set $U$ contains scalar values without labels. Semi-supervised generative model based approaches aim to find the parameters $\theta$ which maximize $P(x, y)_\theta$ on the data set $D = U \cup L$. The quantity to be maximized is $p(L, U|\theta)$, the probability of the data given the parameters $\theta$. Maximum Likelihood Estimation (MLE) is widely employed to maximize $p(L, U|\theta)$ (with $(x_i, y_i) \in L$ and $x_i \in U$):

\[ \max_{\theta \in \Theta} \; \sum_{i=1}^{|L|} \log\left[ p(y_i)_\theta \, p(x_i|y_i)_\theta \right] \;+\; \sum_{i=1}^{|U|} \log \sum_{j=1}^{|Y|} p(y_j)_\theta \, p(x_i|y_j)_\theta \tag{1} \]
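To make Equation 1 concrete, the following sketch evaluates this semi-supervised log-likelihood under one particular parametric assumption, a univariate Gaussian class-conditional model chosen here purely for illustration; the function names and the numerical values are assumptions, not material from the paper.

```python
# Evaluating the objective of Equation 1 for a fixed parameter vector theta,
# assuming (for illustration only) Gaussian class-conditional densities.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def semi_supervised_log_likelihood(labelled, unlabelled, priors, means, stds):
    """labelled: list of (x, y) pairs; unlabelled: list of x values."""
    # labelled examples contribute log p(y) p(x|y)
    ll = sum(math.log(priors[y] * gaussian_pdf(x, means[y], stds[y]))
             for x, y in labelled)
    # unlabelled examples contribute the log of the class-marginalized density
    ll += sum(math.log(sum(priors[j] * gaussian_pdf(x, means[j], stds[j])
                           for j in range(len(priors))))
              for x in unlabelled)
    return ll

labelled = [(-1.2, 0), (-0.8, 0), (1.1, 1)]
unlabelled = [-0.9, 0.1, 1.3]
print(semi_supervised_log_likelihood(labelled, unlabelled,
                                     priors=[0.5, 0.5],
                                     means=[-1.0, 1.0],
                                     stds=[1.0, 1.0]))
```

In a full generative method, this quantity would be maximized over $\theta$, typically with the EM algorithm.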

These approaches are usable only if information about the distribution of classes is available. The hypothesis that P(x, y) belongs to a known family of distributions is a strong assumption which could be invalid in practice.

The objective of a non-parametric semi-supervised method is to estimate the distribution of classes without making strong hypotheses on these distributions. Therefore, our approach can be contrasted with the generative approaches. This article exploits the MODL framework [3] and proposes a new semi-supervised discretization method. This "objective" Bayesian approach makes very few assumptions on the data distribution.

2.2 Summary of the supervised MODL discretization method

The discretization of a descriptive variable aims at estimating the conditional distribution of the class labels, by means of a piecewise constant density estimator. In the MODL approach [3], discretization is turned into a model selection problem. First, a space of discretization models is defined. The parameters of a specific discretization are the number of intervals, the bounds of the intervals and the output frequencies in each interval. A Bayesian approach is applied to select the best discretization model, which is found by maximizing the probability $P(M|D)$ of the model $M$ given the data $D$. Using the Bayes rule, and since the probability $P(D)$ is constant when the model varies, this is equivalent to maximizing $P(M)P(D|M)$.

Let $N^l$ be the number of labelled examples, $J$ the number of classes, and $I$ the number of intervals of the input domain. $N_i^l$ denotes the number of labelled examples in the interval $i$, and $N_{ij}^l$ the number of labelled examples of output value $j$ in the interval $i$. A discretization model is then defined by the parameter set $\{I, \{N_i^l\}_{1 \le i \le I}, \{N_{ij}^l\}_{1 \le i \le I, 1 \le j \le J}\}$.

Owing to the definition of the model space and its prior distribution, the prior $P(M)$ and the conditional likelihood $P(D|M)$ can be calculated analytically. Taking the negative log of $P(M)P(D|M)$, we obtain the following criterion to minimize:

\[ C_{sup} = \underbrace{\log N^l + \log \binom{N^l + I - 1}{I - 1} + \sum_{i=1}^{I} \log \binom{N_i^l + J - 1}{J - 1}}_{-\log P(M)} \;+\; \underbrace{\sum_{i=1}^{I} \log \frac{N_i^l!}{N_{i1}^l! \, N_{i2}^l! \cdots N_{iJ}^l!}}_{-\log P(D|M)} \tag{2} \]

The first term of the criterion $C_{sup}$ stands for the choice of the number of intervals and the second term for the choice of the bounds of the intervals. The third term corresponds to the choice of the output distribution in each interval, and the last term represents the conditional likelihood of the data given the model. Therefore, "complex" models with large numbers of intervals are penalized. This discretization method for classification provides the most probable discretization given the data sample. Extensive comparative experiments showed high performance [3].
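The criterion of Equation 2 can be evaluated directly from the labelled counts. The sketch below is an illustrative implementation, not the authors' code; the helper names (`log_binom`, `c_sup`) and the example counts are assumptions, and `lgamma` is used so that the log-binomial and log-multinomial terms stay numerically stable.

```python
# Evaluating the supervised MODL criterion C_sup (Equation 2) for a candidate
# discretization, described by the labelled class counts of each interval.
from math import lgamma, log

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def c_sup(label_counts):
    """label_counts[i][j] = number of labelled examples of class j in interval i."""
    I = len(label_counts)
    J = len(label_counts[0])
    Nl = sum(sum(row) for row in label_counts)
    cost = log(Nl) + log_binom(Nl + I - 1, I - 1)           # number and bounds of intervals
    for row in label_counts:
        Ni = sum(row)
        cost += log_binom(Ni + J - 1, J - 1)                # class distribution of interval i
        cost += lgamma(Ni + 1) - sum(lgamma(n + 1) for n in row)  # likelihood term
    return cost

# Two candidate discretizations of the same 20 labelled examples:
print(c_sup([[10, 0], [0, 10]]))   # two class-pure intervals: lower cost
print(c_sup([[10, 10]]))           # a single mixed interval: higher cost
```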

3 A new semi-supervised discretization method

This section presents a new semi-supervised discretization method which is based on the previous work described above. The same modeling hypotheses as in [3] are adopted. A prior distribution $P(M)$ which exploits the hierarchy of the model parameters is first proposed. This prior distribution is uniform at each stage of this hierarchy. Then, we define the conditional likelihood $P(D|M)$ of the data given the model. This leads to an exact analytical criterion for the posterior probability $P(M|D)$.

Discretization models:

Let $\mathcal{M}$ be a family of semi-supervised discretization models, denoted $M(I, \{N_i\}, \{N_{ij}\})$. These models consider unlabelled and labelled examples together, and $N$ is the total number of examples in the data set. The model parameters are defined as follows: $I$ is the number of intervals; $\{N_i\}$ the number of examples in each interval; $\{N_{ij}\}$ the number of examples of each class in each interval.

3.1 Prior distribution

A prior distribution $P(M)$ is defined on the parameters of the models. This prior exploits the hierarchy of the parameters: the number of intervals is first chosen, then the bounds of the intervals, and finally the output frequencies. The joint distribution $P(I, \{N_i\}, \{N_{ij}\})$ can be written as follows:

\[ P(M) = P(I, \{N_i\}, \{N_{ij}\}) = P(I) \times P(\{N_i\}|I) \times P(\{N_{ij}\}|\{N_i\}, I) \]

The number of intervals is assumed to be uniformly distributed between 1 and $N$. Thus we get:

\[ P(I) = \frac{1}{N} \]

We now assume that, for a given number of intervals, all partitions of the data into $I$ intervals are equiprobable. Computing the probability of one set of intervals turns into the combinatorial evaluation of the number of possible interval sets, which is equal to $\binom{N+I-1}{I-1}$. The second term is defined as:

\[ P(\{N_i\}|I) = \frac{1}{\binom{N+I-1}{I-1}} \]

The last term $P(\{N_{ij}\}|\{N_i\}, I)$ can be rewritten as a product, assuming the independence of the distribution of classes between the intervals. For a given interval $i$ containing $N_i$ examples, all the distributions of the class values are considered equiprobable. The probabilities of the distributions are computed as follows:

\[ P(\{N_{ij}\}|\{N_i\}, I) = \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \]

Finally, the prior distribution of the model is similar to that of the supervised approach [3]. The only difference is that the semi-supervised prior takes into account all examples, including the unlabelled ones:

\[ P(M) = \frac{1}{N} \times \frac{1}{\binom{N+I-1}{I-1}} \times \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \tag{3} \]

3.2 Likelihood

This section focuses on the conditional likelihood $P(D|M)$ of the data given the model. First, the family $\Lambda$ of labelling models has to be defined. Semi-supervised discretization handles labelled and unlabelled pieces of data, and $\Lambda$ represents all possible labellings. Each model $\lambda(N^l, \{N_i^l\}, \{N_{ij}^l\}) \in \Lambda$ is characterized by the following parameters: $N^l$ is the total number of labelled examples; $\{N_i^l\}$ the number of labelled examples in the interval $i$; $\{N_{ij}^l\}$ the number of labelled examples of the class $j$ in the interval $i$.

Owing to the law of total probability, the likelihood can be written as follows:

\[ P(D|M) = \sum_{\lambda \in \Lambda} P(\lambda|M) \times P(D|M, \lambda) \]

$P(D|M)$ can be drastically simplified considering that $P(D|M, \lambda)$ is equal to 0 for all labelling models which are incompatible with the observed data $D$ and the discretization model $M$. The only compatible labelling model that is considered is denoted $\lambda$. The previous expression can be rewritten as follows:

\[ P(D|M) = P(\lambda|M) \times P(D|M, \lambda) \tag{4} \]

The first term $P(\lambda|M)$ can be written as a product, using the hypothesis of independence of the likelihood between the intervals. In a given interval $i$ which contains $N_{ij}$ examples of each class, the computation of $P(\lambda|M)$ consists in finding the probability of observing $\{N_{ij}^l\}$ examples of each class when drawing $N_i^l$ examples. Once again, this problem is turned into a combinatorial evaluation. The number of draws which induce $\{N_{ij}^l\}$ can be calculated, assuming the $N_i^l$ labelled examples are uniformly drawn:

\[ P(\lambda|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \binom{N_{ij}}{N_{ij}^l}}{\binom{N_i}{N_i^l}} \tag{5} \]

Let us consider a very simple and intuitive problem to explain Equation 5. An interval $i$ can be compared with a "bag" containing $N_{i1}$ "black balls" and $N_{i2}$ "white balls". Given the parameters $N_{i1} = 6$ and $N_{i2} = 20$, what is the probability to simultaneously draw $N_{i1}^l = 2$ black balls and $N_{i2}^l = 3$ white balls? Let $\binom{26}{5}$ be the number of possible draws, and $\binom{6}{2} \times \binom{20}{3}$ the number of draws which are composed of 2 black balls and 3 white balls. Assuming that all draws are equiprobable, the probability to simultaneously draw 2 black balls and 3 white balls is given by $\frac{\binom{6}{2} \times \binom{20}{3}}{\binom{26}{5}}$.

The second term $P(D|M, \lambda)$ of Equation 4 is estimated considering a uniform prior over all possible permutations of $\{N_{ij}^l\}$ examples of each class among $N_i^l$. The independence assumption between the intervals gives:

\[ P(D|M, \lambda) = \prod_{i=1}^{I} \frac{1}{\dfrac{N_i^l!}{N_{i1}^l! \, N_{i2}^l! \cdots N_{iJ}^l!}} = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}^l!}{N_i^l!} \]

Finally, the likelihood of the model is:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \binom{N_{ij}}{N_{ij}^l} \times N_{ij}^l!}{\binom{N_i}{N_i^l} \times N_i^l!} \]

In every interval, the number of unlabelled examples is denoted by $N_{ij}^u = N_{ij} - N_{ij}^l$ and $N_i^u = N_i - N_i^l$. The previous expression can be rewritten as:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} \dfrac{N_{ij}!}{N_{ij}^u!}}{\dfrac{N_i!}{N_i^u!}} \]

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{6} \]

3.3 Evaluation criterion

The best semi-supervised discretization model is found by maximizing the probability $P(M|D)$. A Bayesian evaluation criterion is obtained by exploiting Equations 3 and 6. The maximum a posteriori model, denoted $M_{map}$, is defined by:

\[ M_{map} = \operatorname*{argmax}_{M \in \mathcal{M}} \; \frac{1}{N} \times \frac{1}{\binom{N+I-1}{I-1}} \times \prod_{i=1}^{I} \frac{1}{\binom{N_i + J - 1}{J - 1}} \times \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{7} \]

Taking the negative log of the probabilities, the maximization problem turns into the minimization of the criterion $C_{semi\,sup}$:

\[ M_{map} = \operatorname*{argmin}_{M \in \mathcal{M}} \; C_{semi\,sup}(M) \]

\[ C_{semi\,sup}(M) = \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} + \sum_{i=1}^{I} \log\frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} - \sum_{i=1}^{I} \log\frac{N_i^u!}{\prod_{j=1}^{J} N_{ij}^u!} \tag{8} \]
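As with the supervised criterion, Equation 8 can be evaluated directly from the per-interval counts. The following sketch is illustrative only (the helper names and example counts are assumptions); each interval is described by its total class counts $N_{ij}$ and its labelled class counts $N_{ij}^l$.

```python
# Evaluating the semi-supervised criterion C_semi-sup (Equation 8).
from math import lgamma, log

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_multinom(n, parts):
    """log of n! / (parts[0]! * parts[1]! * ...)."""
    return lgamma(n + 1) - sum(lgamma(p + 1) for p in parts)

def c_semi_sup(intervals):
    """intervals = list of (Nij, Nij_l): total and labelled class counts per interval."""
    I = len(intervals)
    N = sum(sum(nij) for nij, _ in intervals)
    cost = log(N) + log_binom(N + I - 1, I - 1)              # prior: number and bounds
    for nij, nij_l in intervals:
        ni = sum(nij)
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        cost += log_binom(ni + len(nij) - 1, len(nij) - 1)   # prior: class distribution
        cost += log_multinom(ni, nij) - log_multinom(sum(nij_u), nij_u)  # likelihood
    return cost

# With no unlabelled example (Nij == Nij_l), the last term vanishes and the
# value coincides with C_sup of Equation 2 (see Section 4.1).
print(c_semi_sup([([10, 0], [10, 0]), ([0, 10], [0, 10])]))
```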

4 Comparison: semi-supervised vs. supervised criteria

In this section, the semi-supervised criterion $C_{semi\,sup}$ of Equation 8 is compared with the supervised criterion $C_{sup}$ of Equation 2:

both criteria are analytically equivalent when $U = \emptyset$;

the semi-supervised criterion reduces to the prior distribution when $L = \emptyset$; in this case, the semi-supervised and supervised approaches give the same discretization;

the semi-supervised approach is penalized by a high modeling cost when the data set includes both labelled and unlabelled examples; in this case, the optimization of the criterion $C_{semi\,sup}$ gives a model with fewer intervals than the supervised approach.

4.1 Labelled examples only

In this case, all training examples are supposed to be labelled: $D = L$ and $U = \emptyset$. We have $N_i^u = 0$ for each interval and $N_{ij}^u = 0$ for each class. Therefore, the last term of Equation 8 is equal to zero. The criterion $C_{semi\,sup}$ can be rewritten as follows:

\[ \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} + \sum_{i=1}^{I} \log\frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} \tag{9} \]

When all the training examples are labelled, $N = N^l$, $N_i = N_i^l$ and $N_{ij} = N_{ij}^l$. The semi-supervised criterion $C_{semi\,sup}$ and the supervised criterion $C_{sup}$ are equivalent.


4.2 Unlabelled examples only

In the case where no example is labelled, we have $D = U$ and $L = \emptyset$. For each interval $N_i^u = N_i$ and for each class $N_{ij}^u = N_{ij}$. Therefore, the term $P(D|M)$ is equal to 1 for any model. The conditional likelihood (Equation 6) can be rearranged as follows:

\[ P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}!}{N_i!} \times \frac{N_i!}{\prod_{j=1}^{J} N_{ij}!} = 1 \]

The posterior distribution then reduces to the prior distribution, $P(M|D) = P(M)$, in which case the model $M_{map}$ includes a single interval. Both criteria give the same discretization, since the supervised approach is not able to cut the numerical domain of the input variable in this case either. $C_{semi\,sup}$ can be rewritten as:

\[ \log(N) + \log\binom{N+I-1}{I-1} + \sum_{i=1}^{I} \log\binom{N_i+J-1}{J-1} \]

4.3 Mixture of labelled and unlabelled examples

The main difference between the semi-supervised and the supervised approaches lies in the prior distribution $P(M)$. In the semi-supervised approach, the space of discretization models is larger than in the supervised approach: unlabelled examples represent additional possible locations for the interval bounds. Therefore, the modeling cost of the prior distribution is higher for the semi-supervised criterion. When the number of unlabelled examples increases, the criterion $C_{semi\,sup}$ prefers models with fewer intervals.

This behaviour is illustrated with a very simple experiment. Let us consider a binary classification problem. All examples belonging to the class "0" [respectively "1"] are located at $x = 0$ [respectively $x = 1$]. During the experiment, the total number of examples $N$ increases. The number of labelled examples is always the same in both classes. For every value of $N$, we evaluate $N_{min}^l$, the minimal number of labelled examples which induces an $M_{map}$ with two intervals (rather than a single interval).

Figure 1 plots $N_{min}^l$ against $N = N^l + N^u$ for both criteria. For the criterion $C_{sup}$, the minimal number of labelled examples necessary to split the data does not depend on $N$: in this case, $N_{min}^l = 6$ for every value of $N$. A different behaviour is observed for $C_{semi\,sup}$. Figure 1 quantifies the influence of $N$ on the selection of the model $M_{map}$: when the number of examples $N$ grows, $N_{min}^l$ increases approximately as $\log(N)$. Therefore, the criterion $C_{semi\,sup}$ gives a model $M_{map}$ with fewer intervals than the supervised approach, due to its higher modeling cost.

Figure 1. Mixture of labelled and unlabelled examples. The vertical axis represents the minimal number of labelled examples necessary to obtain a model with two intervals, rather than a model with a single interval. The horizontal axis represents the total number of examples, using a logarithmic scale.
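The toy experiment above can be reproduced approximately with the sketch below. It is an assumption-laden reconstruction rather than the authors' protocol: the examples are supposed to be split evenly between $x = 0$ and $x = 1$, the labelled examples evenly between the two classes, and the helper names are hypothetical.

```python
# Minimal number of labelled examples N_l_min for which the two-interval model
# beats the single-interval model under C_semi-sup (Equation 8), for the toy
# problem of Section 4.3 (class 0 at x=0, class 1 at x=1).
from math import lgamma, log

def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def log_multinom(n, parts):
    return lgamma(n + 1) - sum(lgamma(p + 1) for p in parts)

def c_semi_sup(intervals):
    """Equation 8; intervals = list of (Nij, Nij_l) per interval."""
    I = len(intervals)
    N = sum(sum(nij) for nij, _ in intervals)
    cost = log(N) + log_binom(N + I - 1, I - 1)
    for nij, nij_l in intervals:
        ni = sum(nij)
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        cost += log_binom(ni + len(nij) - 1, len(nij) - 1)
        cost += log_multinom(ni, nij) - log_multinom(sum(nij_u), nij_u)
    return cost

def n_l_min(N):
    """Smallest (even) number of labelled examples for which two intervals win (N even)."""
    for N_l in range(2, N + 1, 2):
        half, l_half = N // 2, N_l // 2
        # two class-pure intervals: both multinomial terms vanish
        two = c_semi_sup([([half, 0], [l_half, 0]), ([0, half], [0, l_half])])
        # single interval: the class totals are free parameters, take the best split
        one = min(c_semi_sup([([a, N - a], [l_half, l_half])])
                  for a in range(l_half, N - l_half + 1))
        if two < one:
            return N_l
    return None

for N in (10, 100, 1000):
    print(N, n_l_min(N))   # N_l_min grows slowly with N, as reported in Figure 1
```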

5 Theoretical and empirical results

Figure 2. Structure of Section 5: the discretization bias on the interval bounds location (Section 5.1, empirical result), the interpretation of the likelihood in terms of entropy (Lemma A, Section 5.2, theoretical result) and the optimization of the parameters $N_{ij}$ (Lemma B, Section 5.2, theoretical result) lead to the asymptotic equivalence between the semi-supervised approach and the supervised approach combined with a post-optimization of the bounds location.

Figure 2 illustrates the structure of the results presented in this section, and their relations. An additional discretization bias is first empirically established for our semi-supervised discretization method. Then, two theoretical results are demonstrated: an interpretation of the likelihood in terms of entropy, and an analytical expression of the optimal $N_{ij}$. Taking into account these empirical and theoretical results, we demonstrate that the semi-supervised approach is asymptotically equivalent to the supervised approach, associated with a post-optimization of the bounds location.

5.1 Discretization bias

The semi-supervised and the supervised discretization approaches are based on rank statistics. Therefore, the locations of the bounds between the intervals of the optimal model are defined in a discrete space, through the number of examples in every interval. The discretization bias aims to define the bounds location in the numerical domain of the continuous input variable.

5.1.1 How to position a bound between two training examples?

The parameters $\{N_i\}$ [respectively $\{N_i^l\}$] given by the optimization of $C_{semi\,sup}$ [respectively $C_{sup}$] are not sufficient to define a continuous bounds location. Indeed, there is an infinity of possible locations between two training examples.

A prior is adopted in [3] which considers the best bound location as the median of the distribution of the true bound locations, denoted $P_{tb}$. This median minimizes the generalization mean squared error, for any $P_{tb}$. The objective is to place a bound between two examples without information about the distribution $P_{tb}$. In this case, $P_{tb}$ is assumed to be uniform, and a bound is placed midway between the two concerned examples.

5.1.2 How to position a bound in an unlabelled area?

The optimization of the semi-supervised criterion $C_{semi\,sup}$ does not indicate the best bounds location when the parameters $\{N_i^l\}$ are constant. This phenomenon is observed on a toy example below. Considering an area of the input space $X$ where no example is labelled, all possible bound locations have the same cost according to the criterion $C_{semi\,sup}$. Therefore, the semi-supervised approach is not able to determine the bounds location in such an unlabelled area. The same prior as in [3], which aims at minimizing the generalization mean squared error, is adopted in order to define a continuous bounds location. The unlabelled examples are supposed to be drawn from the distribution $P_{tb}$. In this case, the median of $P_{tb}$ is estimated by exploiting the unlabelled examples. The interval bounds are placed in the middle of unlabelled areas.
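In code, this bound placement rule can be sketched as follows (an illustrative interpretation, not the authors' implementation): the median of the unlabelled values falling in the gap is used as the continuous bound, and the rule falls back to the midway placement of Section 5.1.1 when the gap contains no unlabelled example.

```python
# Placing a continuous bound inside the gap between two labelled examples,
# using the unlabelled examples that fall in that gap (illustrative sketch).
import statistics

def refine_bound(left_labelled_x, right_labelled_x, unlabelled_x):
    """left/right_labelled_x: labelled values surrounding the cut;
    unlabelled_x: all unlabelled values (only those inside the gap are used)."""
    in_gap = [x for x in unlabelled_x if left_labelled_x < x < right_labelled_x]
    if not in_gap:
        # no unlabelled information: place the bound midway between the two
        # labelled examples, as in the supervised prior
        return (left_labelled_x + right_labelled_x) / 2.0
    return statistics.median(in_gap)

print(refine_bound(0.40, 0.60, [0.41, 0.45, 0.52, 0.55, 0.59]))  # 0.52
```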

Finally, neither the supervised nor the semi-supervised approach is able to position a continuous bound between two labelled examples. In both cases, the same prior on the best bounds location is adopted. The only interest of the unlabelled examples is to bring information about $P_{tb}$, and to refine the median of this distribution.

5.1.3 Empirical evidence

Let us consider a univariate binary classification problem. Training examples are uniformly distributed in the interval $[0, 1]$. This data set contains three separate areas denoted "A", "B", "C". The part "A" [respectively "C"] includes 40 labelled examples of class "0" [respectively "1"] and corresponds to the interval $[0, 0.4]$ [respectively $[0.6, 1]$]. The part "B" corresponds to the interval $[0.4, 0.6]$ and contains 20 unlabelled examples.

As part of this experiment, the family of discretization models $\mathcal{M}$ is restricted to the models which contain two intervals. This toy problem consists in finding the best bound $b \in [0, 1]$ between the two intervals of the model. Every bound is related to the number of examples in each interval, $\{N_1, N_2\}$.

There are many possible models for a given bound (due to the $N_{ij}$ parameters). We estimate the probability of a bound by Bayesian averaging over all possible models which are compatible with the bound. This evaluation is not biased by the choice of a particular model among all possible models. For a given bound $b$, the parameters $\{N_{ij}\}$ are not defined, and we have:

\[ P(b|D) = \sum_{\{N_{ij}\}} P(b, \{N_{ij}\} \mid D) \]

Using the Bayes rule, we get:

\[ P(b|D) \times P(D) = \sum_{\{N_{ij}\}} P(D \mid b, \{N_{ij}\}) \times P(b, \{N_{ij}\}) \]

Figure 3 plots $-\log P(b|D)$ against the bound's location $b$. Minimal values of this curve give the best bound locations. This figure indicates that it is neither wise to cut the data set in part "A" nor in part "C". All bound locations in part "B" are equivalent and optimal according to the criterion $C_{semi\,sup}$.

This experiment empirically shows that the criterion $C_{semi\,sup}$ cannot distinguish between bound locations in an unlabelled area of the input space $X$. This result is unexpected and difficult to demonstrate formally (due to the Bayesian averaging over models). Intuitively, this phenomenon can be explained by the fact that the criterion $C_{semi\,sup}$ has no expressed preference on the bound location. This is consistent with an "objective" Bayesian approach [1].


Figure 3. Bound's quantity of information vs. bound's location.

5.2 A post-optimisation of the supervised approach

This section demonstrates that the semi-supervised approach is asymptotically equivalent to the supervised approach improved with a post-optimization of the bounds location. This post-optimization consists in exploiting unlabelled examples in order to position the interval bounds in the middle of unlabelled areas.

5.2.1 Equivalent prior distribution

The discretization bias established in Section 5.1 modifies our a priori knowledge about the distribution $P(M)$. From now on, the bounds are forced to be placed in the middle of unlabelled areas. The number of possible locations for each bound is substantially reduced: the criterion $C_{semi\,sup}$ considers $N - 1$ possible locations for each bound, whereas only $N^l - 1$ possible locations remain once the discretization bias of Section 5.1 is exploited. Under these conditions, the prior distribution $P(M)$ (see Equation 3) can easily be rewritten as in the supervised approach (see Equation 2).

5.2.2 Asymptotically equivalent likelihood

Lemma A: The conditional likelihood of the data given the model can be expressed using the entropy (denoted $H_M$) of the sets $U$, $L$ and $D$, given the model $M$:

Supervised case:
\[ -\log P(D|M) = N^l H_M(L) + O(\log N) \]

Semi-supervised case:
\[ -\log P(D|M) = N H_M(D) - N^u H_M(U) + O(\log N) \]

Proof: Let us denote $H_M(D)$ the Shannon entropy [19] of the data, given a discretization model $M$. We assume that $H_M(D)$ is equal to its empirical evaluation:

\[ H_M(D) = -\sum_{i=1}^{I} \frac{N_i}{N} \sum_{j=1}^{J} \frac{N_{ij}}{N_i} \log \frac{N_{ij}}{N_i} \]

In the semi-supervised case, $P(D|M) = \prod_{i=1}^{I} \frac{\prod_{j=1}^{J} N_{ij}! / N_{ij}^u!}{N_i! / N_i^u!}$ (Equation 6). Consequently:

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ \log(N_i!) - \log(N_i^u!) - \sum_{j=1}^{J} \log(N_{ij}!) + \sum_{j=1}^{J} \log(N_{ij}^u!) \right] \]

Stirling's approximation gives $\log(n!) = n\log(n) - n + O(\log n)$:

\[ -\log P(D|M) = \sum_{i=1}^{I} \Bigg[ N_i\log(N_i) - N_i - N_i^u\log(N_i^u) + N_i^u - \sum_{j=1}^{J} \big[ N_{ij}\log(N_{ij}) - N_{ij} \big] + \sum_{j=1}^{J} \big[ N_{ij}^u\log(N_{ij}^u) - N_{ij}^u \big] + O(\log N_i) - O(\log N_i^u) - \sum_{j=1}^{J} O(\log N_{ij}) + \sum_{j=1}^{J} O(\log N_{ij}^u) \Bigg] \]

Exploiting the fact that $\sum_{j=1}^{J} N_{ij} = N_i$ and $\sum_{j=1}^{J} N_{ij}^u = N_i^u$, we obtain:

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ \sum_{j=1}^{J} N_{ij}^u \big( \log N_{ij}^u - \log N_i^u \big) - \sum_{j=1}^{J} N_{ij} \big( \log N_{ij} - \log N_i \big) + O(\log N_i) \right] \]

\[ -\log P(D|M) = \sum_{i=1}^{I} \left[ -N_i \sum_{j=1}^{J} \frac{N_{ij}}{N_i} \log\!\left(\frac{N_{ij}}{N_i}\right) + N_i^u \sum_{j=1}^{J} \frac{N_{ij}^u}{N_i^u} \log\!\left(\frac{N_{ij}^u}{N_i^u}\right) + O(\log N_i) \right] \]

The entropy is additive on disjoint sets. We get:

\[ -\log P(D|M) = N H_M(D) - N^u H_M(U) + O(\log N) \]
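The approximation of Lemma A is easy to check numerically: for a fixed discretization, the exact $-\log P(D|M)$ of Equation 6 and the entropy-based expression should differ only by a term that grows like $\log N$. The sketch below (assumed helper names, arbitrary counts) scales the same class proportions up by factors of 10 and prints both quantities.

```python
# Numerical check of Lemma A: exact -log P(D|M) vs N*H_M(D) - N^u*H_M(U).
from math import lgamma, log

def neg_log_likelihood(intervals):
    """Exact -log P(D|M) from Equation 6; intervals = list of (Nij, Nij_l)."""
    total = 0.0
    for nij, nij_l in intervals:
        nij_u = [a - b for a, b in zip(nij, nij_l)]
        ni, ni_u = sum(nij), sum(nij_u)
        total += lgamma(ni + 1) - sum(lgamma(n + 1) for n in nij)
        total -= lgamma(ni_u + 1) - sum(lgamma(n + 1) for n in nij_u)
    return total

def entropy(counts_per_interval):
    """Empirical entropy H_M of a set described by per-interval class counts."""
    n = sum(sum(c) for c in counts_per_interval)
    h = 0.0
    for counts in counts_per_interval:
        ni = sum(counts)
        h += (ni / n) * -sum((c / ni) * log(c / ni) for c in counts if c > 0)
    return h

for scale in (10, 100, 1000):
    intervals = [([7 * scale, 3 * scale], [4 * scale, 1 * scale]),
                 ([2 * scale, 8 * scale], [1 * scale, 4 * scale])]
    totals = [nij for nij, _ in intervals]
    unlab = [[a - b for a, b in zip(nij, nij_l)] for nij, nij_l in intervals]
    N, Nu = sum(map(sum, totals)), sum(map(sum, unlab))
    exact = neg_log_likelihood(intervals)
    approx = N * entropy(totals) - Nu * entropy(unlab)
    print(N, round(exact, 1), round(approx, 1), round(exact - approx, 1))
```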


Lemma B: The values of the parameters $\{N_{ij}\}$ which minimize the criterion $C_{semi\,sup}$ (denoted $\{N_{ij}^*\}$) correspond to the proportion of labels observed in each interval:

\[ N_{ij}^* = \left\lceil \frac{(N_i + 1) \times N_{ij}^l}{N_i^l} \right\rceil - 1 \]

If $\sum_{j=1}^{J} N_{ij}^* = N_i - 1$, simply choose one of the $N_{ij}^*$ and add 1; all such completions are equivalent and optimal for $C_{semi\,sup}$.

Proof: This proof handles the case of a single-interval model. Since the data distribution is assumed to be independent between the intervals, the proof can be repeated independently on the $I$ intervals. We consider a binary classification problem. Let the function $f(N_{i1}, N_{i2})$ denote the criterion $C_{semi\,sup}$ with all parameters fixed except $N_{i1}$ and $N_{i2}$. We aim to find an analytical expression of the minimum of the function $f(N_{i1}, N_{i2})$:

\[ f(N_{i1}, N_{i2}) = \log\!\left( \frac{(N_{i1} - N_{i1}^l)!}{N_{i1}!} \right) + \log\!\left( \frac{(N_{i2} - N_{i2}^l)!}{N_{i2}!} \right) \]

The terms $N_{i1}^l$ and $N_{i2}^l$ are constant, and $N_{i2} = N_i - N_{i1}$; $f$ can be rewritten as a single-parameter function:

\[ f(N_{i1}) = \log\!\left( \frac{(N_{i1} - N_{i1}^l)!}{N_{i1}!} \right) + \log\!\left( \frac{(N_i - N_{i1} - N_{i2}^l)!}{(N_i - N_{i1})!} \right) \]

\[ = \sum_{k=1}^{N_{i1} - N_{i1}^l} \log k - \sum_{k=1}^{N_{i1}} \log k + \sum_{k=1}^{N_i - N_{i1} - N_{i2}^l} \log k - \sum_{k=1}^{N_i - N_{i1}} \log k \]

\[ = -\sum_{k=N_{i1} - N_{i1}^l + 1}^{N_{i1}} \log k \;-\; \sum_{k=N_i - N_{i1} - N_{i2}^l + 1}^{N_i - N_{i1}} \log k \]

And:

\[ f(N_{i1} + 1) = -\sum_{k=N_{i1} - N_{i1}^l + 2}^{N_{i1} + 1} \log k \;-\; \sum_{k=N_i - N_{i1} - N_{i2}^l}^{N_i - N_{i1} - 1} \log k \]

Consequently:

\[ f(N_{i1}) - f(N_{i1} + 1) = \log(N_{i1} + 1) - \log(N_{i1} + 1 - N_{i1}^l) - \log(N_i - N_{i1}) + \log(N_i - N_{i2}^l - N_{i1}) \]

\[ = \log\!\left( \frac{(N_{i1} + 1)(N_i - N_{i2}^l - N_{i1})}{(N_{i1} + 1 - N_{i1}^l)(N_i - N_{i1})} \right) \]

$f(N_{i1})$ decreases if:

\[ f(N_{i1}) - f(N_{i1} + 1) > 0 \;\Leftrightarrow\; \frac{(N_{i1} + 1)(N_i - N_{i2}^l - N_{i1})}{(N_{i1} + 1 - N_{i1}^l)(N_i - N_{i1})} > 1 \;\Leftrightarrow\; -N_{i2}^l N_{i1} - N_{i2}^l > -N_{i1}^l N_i + N_{i1}^l N_{i1} \;\Leftrightarrow\; N_{i1} < \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \]

In the same way, $f(N_{i1})$ increases if:

\[ f(N_{i1}) - f(N_{i1} + 1) < 0 \;\Leftrightarrow\; N_{i1} > \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \]

As $f(N_{i1})$ is a discrete function, its minimum is reached for $N_{i1} = \left\lceil \frac{N_{i1}^l N_i - N_{i2}^l}{N_{i1}^l + N_{i2}^l} \right\rceil$. This expression can be generalized to the case of $J$ classes¹:

\[ N_{ij}^* = \left\lceil \frac{(N_i + 1) \times N_{ij}^l}{N_i^l} \right\rceil - 1 \]
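Lemma B translates directly into a small routine that fills in the optimal class counts of an interval from its labelled counts. The sketch below is illustrative (the function name and the final tie-breaking choice are assumptions; the lemma states that all completions of the $N_i - 1$ case are equivalent).

```python
# Optimal class counts N*_ij of an interval under C_semi-sup, following Lemma B.
import math

def optimal_class_counts(Ni, Nij_l):
    """Ni: total number of examples in the interval; Nij_l: labelled counts per class."""
    Ni_l = sum(Nij_l)
    counts = [max(math.ceil((Ni + 1) * nl / Ni_l) - 1, 0) for nl in Nij_l]
    # the counts may sum to Ni - 1; any class may then absorb the remaining example
    if sum(counts) == Ni - 1:
        counts[0] += 1
    return counts

print(optimal_class_counts(26, [2, 3]))  # [10, 16]: roughly 2/5 and 3/5 of the 26 examples
```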

Theorem 5.1 Given the best model $M_{map}$, Lemma B states that the proportions of the labels are the same in the sets $L$ and $D$. Thus, $L$ and $D$ have the same entropy. The set $U$ also has the same entropy, because $U = D \setminus L$. Exploiting Lemma A, we have for the semi-supervised case:

\[ -\log P(D|M_{map}) = N H_{M_{map}}(D) - N^u H_{M_{map}}(U) + O(\log N) \]
\[ -\log P(D|M_{map}) = (N - N^u) H_{M_{map}}(L) + O(\log N) \]
\[ -\log P(D|M_{map}) = N^l H_{M_{map}}(L) + O(\log N) \]

We have:

\[ -\log P(D|M_{map}) + \log P^*(D|M_{map}) = O(\log N) \]
\[ \lim_{N \to +\infty} \frac{-\log P(D|M_{map}) + \log P^*(D|M_{map})}{-\log P(D|M_{map})} = 0 \]

with $P(D|M_{map})$ [respectively $P^*(D|M_{map})$] corresponding to the semi-supervised [respectively supervised] approach.

The conditional likelihood $P(D|M_{map})$ is asymptotically the same in the supervised and the semi-supervised cases. Both approaches aim to solve the same optimization problem. Owing to this result, the semi-supervised approach can be reformulated a posteriori: our approach is equivalent to [3] improved with a post-optimization of the bounds location.

¹ The generalized expression of $N_{ij}^*$ has been empirically verified on multi-class data sets.


6 Experiments

A toy problem is exploited to evaluate the behavior of the post-optimized method (see Section 5.2). This problem consists in estimating a step function from data. The artificial dataset consists of examples which belong to the class "1" on the left-hand part of Figure 4 and to the class "2" on the right-hand part. The objective of this experiment is to find the step location with as few labelled examples as possible.

Figure 4. The step dataset: examples of class 1 to the left of the step located at x = 220, examples of class 2 to the right.

The values of the variable (horizontal axis) which characterize the 100 examples $x_i$ are drawn according to the expression $x_i = e^\alpha$, with $\alpha$ varying between 0 and 10 with a step of 0.1. The step is placed at $x = 220$: 47 examples belong to class 1 and 53 to class 2. The train set and the test set each contain 100 examples.
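The dataset can be regenerated approximately as below. This is a sketch under stated assumptions: the exact $\alpha$ grid used by the authors is not fully specified by the text, so the resulting class counts may differ slightly from the reported 47/53 split.

```python
# Rough reconstruction of the step dataset of Figure 4 (assumed alpha grid).
import math

alphas = [0.1 * k for k in range(100)]        # 0.0, 0.1, ..., 9.9
xs = [math.exp(a) for a in alphas]            # x_i = exp(alpha)
ys = [1 if x < 220 else 2 for x in xs]        # class step at x = 220
print(sum(y == 1 for y in ys), sum(y == 2 for y in ys))
```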

We compare two discretization methods: the supervised method (see Section 2) and the supervised method with a post-optimisation of the bounds location (see Section 5.2). For both methods, the $M_{map}$ is exploited to discretize the input variable; this variable is then given as input to a naive Bayes classifier. The predictive model is evaluated using the area under the ROC curve (AUC) [9]. The number of labelled examples is the only free parameter and allows comparisons between both methods; the examples to be labelled are drawn randomly. The experiment in this section is carried out considering discretization models with one or two intervals, which is consistent with the theoretical proofs demonstrated above.

Figure 5 plots the average AUC versus the number of labelled examples. For each value of the number of labelled examples, the experiment has been repeated 10 times. Points represent the mean AUC and notches the variance of the results. With fewer than 6 labelled examples, both discretization methods give an $M_{map}$ with a single interval; in this case, the AUC is equal to 0.5. From 6 labelled examples onwards, the $M_{map}$ includes two intervals for both discretization methods. The bound between these two intervals becomes better and better as examples are labelled. Figure 5 shows that the post-optimization of the bound location improves the supervised discretization method. This improvement is weak but always present: (i) the post-optimization always improves the discretization; (ii) the improvement is larger when the number of labelled examples is low.

Figure 5. Comparison between the post-optimised method and the supervised method (AUC vs. number of labelled examples).

Results on this artificial dataset, which is often used to test discretization methods [4], are very promising. This section shows that our post-optimization of the bounds location improves the optimal discretization model when the distribution of examples is not uniform. The method described in this paper will be tested on step functions [5], with or without noise, as in [4] [13]. We will show its robustness compared to other methods which discretize a single dimension into two intervals.

7 Conclusion

This article presents a new semi-supervised discretization method based on very few assumptions on the data distribution. It provides an in-depth analysis of the problem of dealing with a set of labelled and unlabelled examples.

This paper significantly extends the previous research of Boulle in [3] on the supervised discretization method MODL: it presents a semi-supervised generalization of it, where additional unlabelled learning examples are taken into account. The results have been explained in an intuitive manner, and mathematical proofs have also been given.
