2 State of the Art

2.1 Text Classification

2.1.3 Feature selection

A major characteristic and difficulty of text categorization problems is the high dimensionality of the feature space (Joachims 1998). Indeed, most work related to text classification employs the “bag of words” approach to represent documents. With such a representation, each unique word found in the collection of documents represents a dimension in the space onto which documents are projected, which leads to a feature space of very large dimensionality. For instance, collections of moderate size often contain thousands or even tens of thousands of features (Gabrilovich and Markovitch 2004). Even after eliminating stopwords and performing stemming, the feature set remains too large for many algorithms. One of the problems linked to the high dimensionality of the feature space is the time requirement (Zheng and Srihari 2003). Indeed, the time required to execute an induction algorithm grows dramatically with the number of features. Such a dependency between execution time and feature space size makes many algorithms impractical for problems with a large number of features (Koller and Sahami 1996).

The most common answer to these problems consists in reducing the feature space by identifying the most informative words in the dictionary built from a collection of documents (Seo, Ankolekar et al. 2004). Such a process is called feature selection. By doing so, it is possible to considerably reduce the execution time and increase the accuracy. Indeed, by reducing the dimensionality of the original feature space, the computational requirements of the classification algorithms are reduced as well (especially for those that do not scale well with the feature set size) (Fragoudis, Meretakis et al. 2002). Moreover, it has been observed that the use of a feature selection method can be a very efficient means of speeding up computation without losing classification accuracy. Indeed, if the selected features still contain sufficiently reliable information about the original data after the reduction of the search space, the classification accuracy usually even improves.

The feature selection process uses an evaluation function that is applied to each single word. All words are evaluated independently and ranked according to the chosen criterion. From there, a predefined number of features is selected to compose the best feature subset (Novovicova and Malik 2005). As mentioned above, feature selection methods rely strongly on statistical or information-theoretic criteria to evaluate the usefulness of each feature. For text learning tasks, for example, they primarily rely on the vocabulary-specific characteristics of the given textual data set to identify good word features. Although the statistics themselves do not account for the meaning of the text, these methods have been proven useful for text learning tasks (e.g., classification and clustering) (Seo, Ankolekar et al. 2004).
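This evaluate-rank-select loop can be sketched in a few lines (the corpus, criterion, and function names below are illustrative, not taken from the cited works):

```python
def select_features(docs, score_fn, k):
    """Evaluate every unique term independently, rank terms by the
    chosen criterion, and keep the k best-scoring ones."""
    vocabulary = {term for doc in docs for term in doc}
    scores = {term: score_fn(term, docs) for term in vocabulary}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def collection_frequency(term, docs):
    """A very simple frequency criterion: total occurrences in the corpus."""
    return sum(doc.count(term) for doc in docs)

docs = [["free", "offer", "free"], ["meeting", "agenda"], ["free"]]
best = select_features(docs, collection_frequency, k=1)
```

Any of the criteria discussed below (document frequency, CHI2, information gain) can be plugged in as `score_fn`, since each evaluates a single term independently of the others.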

2.1.3.1 Feature selection methods

A variety of feature selection techniques have been tested for text categorization, for instance document frequency, word frequency, mutual information, information gain, odds ratio, the CHI2 statistic (chi-square) and term strength (Yang and Pedersen 1997; Mladenic and Grobelnik 1999; Forman 2003). These criteria have been experimentally compared by Yang and Pedersen (Yang and Pedersen 1997) and by Mladenic and Grobelnik (Mladenic and Grobelnik 1999). Another study, performed by Forman (Forman 2003), presents an extensive comparison of twelve feature selection criteria for the high-dimensional domain of text classification (Novovicova and Malik 2005). All these studies have revealed that both information gain and very simple frequency measures work well on text data.

2.1.3.1.1 Document frequency

A simple and effective global function for performing feature selection is the document frequency. The basic assumption is that rare terms are either non-informative for category prediction or not influential in global performance. Frequency-based feature selection easily scales to very large corpora, with a computational complexity approximately linear in the number of training documents (Yang, Wu et al. 2002). Performing a feature selection based on document frequency consists first in computing the document frequency of each unique term in the training corpus. Then, all terms whose document frequency is below some predetermined threshold are removed from the feature space. In a series of experiments, Yang and Pedersen (Yang and Pedersen 1997) have shown that with document frequency it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 bringing about only a small loss).
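The two steps just described, counting document frequencies and pruning below a threshold, can be sketched as follows (corpus and threshold are illustrative):

```python
from collections import Counter

def df_prune(train_docs, min_df):
    """Document-frequency selection: count, for each unique term, the
    number of training documents it appears in, then drop every term
    whose DF falls below the predetermined threshold min_df."""
    df = Counter()
    for doc in train_docs:
        for term in set(doc):   # count each term at most once per document
            df[term] += 1
    return {term for term, n in df.items() if n >= min_df}

corpus = [["the", "cat"], ["the", "dog"], ["a", "cat"], ["the", "cat", "sat"]]
kept = df_prune(corpus, min_df=2)
```

Note that the single pass over the corpus is what gives the method its near-linear cost in the number of training documents.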

2.1.3.1.2 CHI2

The CHI2 statistic measures the lack of independence between a term t and a category c by comparing the observed co-occurrence frequencies in a 2-way contingency table with the frequencies expected for independence.

X²(t, c) = N × (AD − CB)² / [(A + B) × (A + C) × (B + D) × (C + D)]   (1)

where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents. t and c are dependent if the difference between the observed and expected frequencies is large, whereas they are independent if the CHI2 score is close to zero (Manning and Schütze 1999; Seo, Ankolekar et al. 2004).
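Equation (1) can be evaluated directly from the four cell counts of the contingency table; a minimal sketch, with the same A, B, C, D, N naming as above:

```python
def chi2(A, B, C, D):
    """CHI2(t, c) from a 2x2 contingency table: A = docs containing both
    t and c, B = t without c, C = c without t, D = neither; N = A+B+C+D."""
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + B) * (A + C) * (B + D) * (C + D)
    return numerator / denominator
```

When the observed table matches the independence assumption exactly (e.g. all four cells equal), the score is zero, as stated above.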

The computation of CHI2 scores has a quadratic complexity, similar to mutual information and information gain (Zaffalon and Hutter 2002; Hutter and Zaffalon 2002; Lee and Lee 2006). A major difference between CHI2 and mutual information is that CHI2 is a normalized value, and hence CHI2 values are comparable across terms for the same category. However, this normalization breaks down (the score can no longer be accurately compared to the CHI2 distribution) if any cell in the contingency table is lightly populated, which is the case for low-frequency terms. Hence, the CHI2 statistic is known not to be reliable for low-frequency terms, and its scores should be interpreted with care (Dunning 1993; Yang and Pedersen 1997).

2.1.3.1.3 Information Gain

Information Gain (IG) is frequently employed as a term goodness criterion in the field of machine learning (Quinlan 1986). The idea behind IG is to select features that contain the most valuable information to discriminate the target classes. For this purpose, it measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.

IG(t) = −Σᵢ P(cᵢ) log P(cᵢ) + P(t) Σᵢ P(cᵢ|t) log P(cᵢ|t) + P(t̄) Σᵢ P(cᵢ|t̄) log P(cᵢ|t̄)   (2)

IG is considered a global measure because it averages, over all target classes, the reduction of uncertainty obtained by the selection of feature t (Yang and Pedersen 1997). In cases where features are not redundant with one another, IG is very appropriate. However, in cases where many features are highly redundant with one another, it can be more relevant to employ other means, for example more complex dependence models. Comparative experimental studies have consistently shown that Information Gain (Cover and Thomas 1991) based feature selection results in good classifier performance (Yang and Pedersen 1997; Forman 2003; Gabrilovich and Markovitch 2004). This performance, together with IG's well-founded principles in information theory (Shannon 2001), makes it a popular feature selection algorithm for text classification (Mukras 2007).
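As a sketch of equation (2), IG can be computed from the class prior, the term probability, and the class distributions conditioned on the term's presence or absence (argument names are illustrative):

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability classes."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(p_classes, p_term, p_classes_given_t, p_classes_given_not_t):
    """IG(t): the reduction in class-distribution uncertainty obtained by
    knowing whether term t is present in a document."""
    return (entropy(p_classes)
            - p_term * entropy(p_classes_given_t)
            - (1 - p_term) * entropy(p_classes_given_not_t))

# A term that perfectly separates two equally likely classes yields 1 bit.
ig = information_gain([0.5, 0.5], 0.5, [1.0, 0.0], [0.0, 1.0])
```

Conversely, a term whose presence leaves the class distribution unchanged yields an IG of zero, which is why low-IG terms are the first candidates for removal.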

2.1.4 Discretization

The features employed to represent the data to classify can take different forms: nominal, discrete and/or continuous. In contrast to nominal values, which possess no order among them, discrete and continuous values are ordered. Discrete values correspond to specific intervals in a continuous spectrum of values: whereas a continuous interval can contain an infinite number of values, a discrete feature can take only a limited number of them. Continuous values are not always easy to manipulate, for two main reasons. Firstly, it is easier for both users and experts to understand and work with discrete features. Indeed, discrete features are closer to a knowledge-level representation than continuous ones (Simon 1996). Secondly, many algorithms developed in the machine learning community rely on discrete data only and cannot deal with continuous data; such algorithms therefore cannot be applied to real-world classification tasks involving continuous features (Liu, Hussain et al. 2002).

In order to employ such learning algorithms on representations containing continuous features, a pre-processing step is required to transform the continuous values into discrete ones. This process, called discretization, consists in performing a quantization of the data to generate discrete values.

Quantization is the process of approximating a continuous range of values (or a very large set of possible discrete values) by a relatively small set of discrete symbols or integer values. Discretization has the advantage of reducing and simplifying the data. As reported in a study (Dougherty, Kohavi et al. 1995), discretization increases the speed of induction algorithms and makes the learning process more accurate. In the context of a decision tree, the two kinds of values make a noticeable difference. A decision tree obtained using discrete features is usually more compact and shorter; therefore, the rules generated can be examined and compared more easily. However, there are many cases where a decision tree produces a more efficient classification model by choosing by itself the best splitting value for the continuous features (Liu, Hussain et al. 2002).
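As a minimal sketch of such a quantization, the simplest scheme cuts the observed range of a continuous feature into equal-width intervals (the function name and bin count are illustrative; the studies cited above also cover more sophisticated, e.g. entropy-based, schemes):

```python
def equal_width_discretize(values, n_bins):
    """Quantize a continuous feature into n_bins equal-width intervals,
    returning one discrete bin index per input value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # avoid zero width for constant features
    bins = []
    for v in values:
        # Clamp the maximum value into the last bin.
        bins.append(min(int((v - lo) / width), n_bins - 1))
    return bins

codes = equal_width_discretize([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
```

The resulting integer codes are ordered, so they preserve the ranking of the original continuous values while exposing only a small symbol set to the learner.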

2.1.4.1 Balance between recall and precision, towards utility measures

A divergent behavior between recall and precision is frequently observed in classification. Indeed, an increase in precision generally involves a loss in recall, and an increase in recall usually comes with a loss in precision. The reasons for this phenomenon are quite simple. In order to increase the recall, we tend to multiply the answers to a query so as to be sure that the correct one is in the returned set. In following such a strategy, we unfortunately also produce many false positives and therefore reduce the precision. On the other hand, when we try to increase the precision, we tend to answer a query only when we are sure of the answer. Consequently, many queries are not answered, which tends to reduce the recall.

For example, consider 1000 queries, each with 1000 possible answers: the more answers are returned for a query, the more likely the returned set contains the correct answer. Indeed, when answering a query randomly by returning a single answer, there is one chance in 1000 of returning the correct one. If 200 answers are returned, there is one chance in five that the set contains the correct answer. Finally, to be certain to have the right answer, we can return all the possible answers. This example shows that, by increasing the number of answers, we obviously increase the recall, but we decrease the precision in a similar way. Indeed, returning to our example: if we return only one answer and it is the right one, the precision is one; if we return 200 answers that contain the right one, the precision falls to 1/200; and if we return all the possible answers, the precision falls to 1/1000.
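The arithmetic of this example can be checked with a small sketch (one correct answer hidden among `n_possible` candidates; names are illustrative):

```python
def hit_rate_and_precision(n_returned, n_possible=1000):
    """For a query with a single correct answer among n_possible candidates:
    the chance a random set of n_returned answers contains it (a stand-in
    for recall), and the precision of the set when it does."""
    p_hit = n_returned / n_possible      # grows with the size of the answer set
    precision_if_hit = 1 / n_returned    # exactly one relevant item in the set
    return p_hit, precision_if_hit
```

Evaluating it at 1, 200, and 1000 returned answers reproduces the trade-off above: the hit rate climbs from 1/1000 to 1 while the precision drops from 1 to 1/1000.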