
2.9. Semi-supervised learning

Semi-supervised classification is a framework of algorithms proposed to improve the performance of supervised algorithms by exploiting both labeled and unlabeled data [DES 00]. One reported limitation of supervised techniques is that they require training corpora of considerable size to achieve accurate predictions on the test data.

Furthermore, a second limiting factor is the high effort and cost associated with labeling large amounts of training samples by hand, a typical example being the manual compilation of labeled text documents. This led to the development of semi-supervised techniques. Many studies have shown that the knowledge learned from unlabeled data can significantly reduce the amount of labeled data required to achieve appropriate classification performance [NIG 99, CAS 95].

Different approaches to semi-supervised classification have been proposed in the literature, including, among others, co-training [MAE 04], self-training [YAR 95], and generative models [NIG 00, NIG 99]. Two extensive surveys on semi-supervised learning are provided in [ZHU 06] and [SEE 01].

Semi-supervised learning algorithms can be divided into several groups: self-training, co-training, and generative models.

2.9.1. Self-training

In self-training, a single classifier is iteratively trained with a growing set of labeled data, starting from a small initial seed of labeled samples. Commonly, an iteration of the algorithm entails the following steps: 1) training on the labeled data available from previous iterations, 2) applying the model learned from the labeled data to predict the unlabeled data, and 3) sorting the predicted samples according to their confidence scores and adding the most confident ones, with their predicted labels, to the labeled set, which implies removing them from the unlabeled set.
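A minimal sketch of this loop is given below, assuming scikit-learn is available and using logistic regression as an arbitrary base classifier; the function and parameter names (e.g. n_per_iter) are illustrative and not taken from the referenced works.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, n_per_iter=10, max_iter=20):
    # Iteratively grow the labeled set with the most confident predictions.
    X_lab, y_lab = X_labeled.copy(), y_labeled.copy()
    X_unl = X_unlabeled.copy()
    for _ in range(max_iter):
        # 1) train on the labeled data available from previous iterations
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        # 2) apply the learned model to predict the unlabeled data
        proba = clf.predict_proba(X_unl)
        conf = proba.max(axis=1)                  # confidence score per sample
        pred = clf.classes_[proba.argmax(axis=1)]
        # 3) move the top most confident samples, with their predicted labels,
        #    from the unlabeled pool into the labeled set
        top = np.argsort(conf)[-n_per_iter:]
        X_lab = np.vstack([X_lab, X_unl[top]])
        y_lab = np.concatenate([y_lab, pred[top]])
        X_unl = np.delete(X_unl, top, axis=0)
    return clf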

One example of self-training is the work by Yarowsky [YAR 95], in which a self-training approach was applied to word sense disambiguation. The basic problem was to classify a word and its context into one of the possible word senses in a polysemic corpus.

The algorithm was supported by two important constraints for the augmentation of the labeled senses: (1) the collocation constraint, according to which a word’s sense is unaltered if the word co-occurs with the same words in the same position (collocation), and (2) the one sense per discourse constraint, according to which a word’s sense is unaltered within the discourse where the word appears, e.g. within a document. The algorithm was started from a tagged seed for each possible sense of the word, including important seed collocates for each sense. The sense labels were then iteratively augmented following the self-training approach. In this case, the one sense per discourse criterion was also applied to achieve further augmentation with samples from within the same documents.
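As an illustration of the one sense per discourse constraint only (the collocation-based decision lists of [YAR 95] are omitted), the following hypothetical sketch propagates the majority sense found in each document to its still-unlabeled occurrences after a self-training pass.

from collections import Counter

def one_sense_per_discourse(occurrences):
    # occurrences: list of dicts such as {"doc_id": ..., "sense": "SENSE" or None}
    # describing the occurrences of one ambiguous word, grouped here by document.
    by_doc = {}
    for occ in occurrences:
        by_doc.setdefault(occ["doc_id"], []).append(occ)
    for doc_occs in by_doc.values():
        labeled = [o["sense"] for o in doc_occs if o["sense"] is not None]
        if not labeled:
            continue
        # one sense per discourse: give the document's majority sense
        # to the occurrences that are still unlabeled
        majority = Counter(labeled).most_common(1)[0][0]
        for o in doc_occs:
            if o["sense"] is None:
                o["sense"] = majority
    return occurrences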

2.9.2. Co-training

In a similar way to self-training, co-training approaches are based on an incremental augmentation of the labeled set by iteratively classifying the unlabeled set and attaching the most confidently predicted samples to the labeled set. However, in contrast to self-training, two complementary classifiers are applied simultaneously, fed with two different “views” of the feature set. The prediction of the first classifier is used to augment the labeled set available to the second classifier and vice versa. To obtain maximum benefit from this “synergy” of classifiers, two important assumptions should be fulfilled:

Compatibility of the classifiers: the classification models should mostly “agree” in their predictions, i.e. if a sample is assigned to class y_j by the first classifier, it should most probably be assigned to the same class by the second classifier.

Conditional independence of the feature subsets: no conditional dependence should be observed between the two feature subsets applied to the classifiers.
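The exchange between the two views can be sketched as follows, assuming the two feature views are provided as separate matrices and using Gaussian naive Bayes from scikit-learn as an arbitrary base classifier; all names and parameters are illustrative and not taken from the referenced works.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X1_lab, X2_lab, y_lab, X1_unl, X2_unl, n_per_iter=5, max_iter=20):
    # Two classifiers, each trained on one feature view, label samples for each other.
    lab1 = (X1_lab.copy(), y_lab.copy())   # labeled pool seen by classifier 1
    lab2 = (X2_lab.copy(), y_lab.copy())   # labeled pool seen by classifier 2
    for _ in range(max_iter):
        clf1 = GaussianNB().fit(*lab1)
        clf2 = GaussianNB().fit(*lab2)
        if len(X1_unl) == 0:
            break
        # each classifier ranks the unlabeled pool by its own confidence
        conf1 = clf1.predict_proba(X1_unl).max(axis=1)
        conf2 = clf2.predict_proba(X2_unl).max(axis=1)
        top1 = np.argsort(conf1)[-n_per_iter:]
        top2 = np.argsort(conf2)[-n_per_iter:]
        # predictions of classifier 1 augment the labeled set of classifier 2,
        # and vice versa
        lab2 = (np.vstack([lab2[0], X2_unl[top1]]),
                np.concatenate([lab2[1], clf1.predict(X1_unl[top1])]))
        lab1 = (np.vstack([lab1[0], X1_unl[top2]]),
                np.concatenate([lab1[1], clf2.predict(X2_unl[top2])]))
        # remove the transferred samples from the shared unlabeled pool
        drop = np.unique(np.concatenate([top1, top2]))
        X1_unl = np.delete(X1_unl, drop, axis=0)
        X2_unl = np.delete(X2_unl, drop, axis=0)
    return clf1, clf2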

In [MAE 04], a co-training strategy was applied to predict the emotional/non-emotional character of a corpus of student utterances collected within the ITSPOKE project (Intelligent Tutoring Spoken dialog system). As the conditional independence between the different feature sets could not be proved, the authors selected two high-precision classifiers.

The first one was trained to recognize the emotional status of an utterance (“1” emotional vs. “0” non-emotional), whereas the second one predicted its non-emotional status (“1” non-emotional vs. “0” emotional). The labeled set was iteratively increased by attaching the most confidently predicted samples to the labeled set from previous iterations.

Furthermore, the feature subsets applied to each classifier were optimized according to two evaluation criteria, using a greedy search algorithm.

2.9.3. Generative models

Denoting by X the set of data points in R^D and by Y the set of class labels corresponding to the dataset, a generative model assumes an underlying mixture model p(x|y), which should be identifiable by using tools such as the EM algorithm or clustering methods.

In [NIG 99], the EM algorithm was used for the semi-supervised classification of texts. The model parameters θ to be inferred by the algorithm were defined as the set of word/class probabilities and the class prior probabilities.
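A simplified sketch of this procedure is given below, assuming dense document-term count matrices and the multinomial naive Bayes implementation of scikit-learn; unlike the full EM of [NIG 99], it hardens the E-step posteriors to their most probable class and reuses them as confidence weights, which is only an approximation, and all names are illustrative.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_text_classifier(X_lab, y_lab, X_unl, n_iter=10):
    # Initialize the parameters (word/class probabilities and class priors)
    # from the labeled documents only.
    clf = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        # E-step: posterior class probabilities for the unlabeled documents
        resp = clf.predict_proba(X_unl)
        y_hat = clf.classes_[resp.argmax(axis=1)]
        # M-step: re-estimate the parameters from the labeled documents plus
        # the unlabeled documents, weighted by their posterior confidence
        w = np.concatenate([np.ones(len(y_lab)), resp.max(axis=1)])
        clf = MultinomialNB().fit(np.vstack([X_lab, X_unl]),
                                  np.concatenate([y_lab, y_hat]),
                                  sample_weight=w)
    return clf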

Other strategies attempt to derive the mixture model by means of clustering. These approaches are commonly referred to as cluster-and-label. For example, in [DEM 99] a genetic k-means clustering was implemented using a genetic algorithm (GA) (see the section on GAs for more details). The goal of the algorithm was to find a set of k cluster centers that simultaneously optimized an internal quality objective (e.g. minimum cluster dispersion) and an external criterion based on the available labels (e.g. minimum cluster entropy).

Thus, a real-valued chromosome representation was selected, with a chromosome length CL = D × k, where D is the number of features in the dataset. Within each iteration (generation) of the GA, the clusters were built by connecting each point to its closest center (k-means). The labels were then expanded to all patterns by labeling each cluster using a majority voting strategy. This way, the total cluster entropy could be calculated. As aforementioned, the simultaneous optimization of both internal and external criteria was attained through the formulation of a new objective as a linear combination of both:

minimize O = α · Dispersion + β · Entropy    [2.100]
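The following hypothetical fitness function illustrates how such a combined objective [2.100] could be evaluated for one chromosome; the way the per-cluster entropy is weighted by the number of labeled points is an assumption and not necessarily the exact formulation of [DEM 99].

import numpy as np

def fitness(chromosome, X, labels, k, alpha=1.0, beta=1.0):
    # chromosome: flat real-valued vector of length D*k encoding the k centers;
    # labels: class index per point, with -1 marking unlabeled points.
    D = X.shape[1]
    centers = chromosome.reshape(k, D)
    # assign every point to its closest center (k-means step)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    dispersion = dists[np.arange(len(X)), assign].sum()   # internal criterion
    # external criterion: entropy of the labeled points inside each cluster
    ent = 0.0
    for c in range(k):
        lab_c = labels[(assign == c) & (labels >= 0)]
        if len(lab_c) > 0:
            p = np.bincount(lab_c) / len(lab_c)
            p = p[p > 0]
            ent += len(lab_c) * -(p * np.log(p)).sum()
    # value that the GA minimizes (equation [2.100])
    return alpha * dispersion + beta * ent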

In [DAR 02], a SOM was applied to cluster unlabeled data.

The SOM was first trained using the labeled data seed. If all the labeled samples that shared an identical winning node also had an identical label, say l_i, that node was labeled as l_i. Otherwise, the node was considered as “non-labeling.”

In a subsequent clustering phase, the unlabeled data were “clustered” to their closest units in the map (winning nodes). During the clustering process, all unlabeled data clustered to a particular node were also implicitly labeled with the node’s label, provided the node had been assigned a label in the training phase.
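The two phases can be sketched as follows, assuming the trained SOM weight vectors are available as an array of shape (number of units, D); the function names are illustrative and not taken from [DAR 02].

import numpy as np

def label_som_nodes(som_weights, X_lab, y_lab):
    # Label a unit only if all labeled samples whose winning node is that
    # unit share the same label; otherwise leave it as "non-labeling" (None).
    node_labels = np.full(som_weights.shape[0], None, dtype=object)
    winners = np.linalg.norm(X_lab[:, None, :] - som_weights[None, :, :],
                             axis=2).argmin(axis=1)
    for node in range(som_weights.shape[0]):
        hits = y_lab[winners == node]
        if len(hits) > 0 and np.all(hits == hits[0]):
            node_labels[node] = hits[0]
    return node_labels

def label_unlabeled_data(som_weights, node_labels, X_unl):
    # Cluster each unlabeled point to its closest unit and inherit the unit's
    # label when that unit was labeled during the training phase.
    winners = np.linalg.norm(X_unl[:, None, :] - som_weights[None, :, :],
                             axis=2).argmin(axis=1)
    return np.array([node_labels[w] for w in winners], dtype=object)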
