
Experiments

In the document tel-00431117, version 1 - 10 Nov 2009 (Page 162-169)

CHAPTER 8. VALIDATION OF UNEXPECTED SENTENCES IN TEXT DOCUMENTS

According to Property 4, we can say that the unexpectedness stated in the discovered unexpected sentences is valid, because the elimination of such sentences increases the accuracy of the classification task.

Figure 8.1 shows the number of discovered sequential patterns of different sequence lengths. According to the figure, the numbers of 4-length and 5-length sequential patterns decrease strongly as the minimum support value increases; for instance, with min_supp = 0.05%, the numbers of 2-, 3-, 4-, and 5-length sequential patterns of the class “positive” are respectively 7013, 3677, 705, and 46. Therefore, in order to obtain significant results, we limit the class patterns to 2- and 3-length sequential patterns in the next steps of our experiments.
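As a minimal illustration of how such counts are obtained, the sketch below enumerates 2-length candidate patterns from a toy corpus, keeps those whose relative support meets min_supp, and buckets the survivors by length. The corpus and the naive candidate generation are illustrative, not the thesis's mining algorithm (which would use a full sequential pattern miner such as PrefixSpan).

```python
from collections import Counter
from itertools import combinations

def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def frequent_patterns(sequences, candidates, min_supp):
    """Keep the candidate patterns whose relative support reaches min_supp."""
    n = len(sequences)
    result = {}
    for pat in candidates:
        supp = sum(is_subsequence(pat, seq) for seq in sequences) / n
        if supp >= min_supp:
            result[pat] = supp
    return result

# toy corpus of tokenized review sentences (illustrative)
sequences = [
    ("very", "good", "movie"),
    ("very", "bad", "movie"),
    ("good", "movie"),
    ("very", "good", "plot"),
]
# naive 2-length candidates: ordered pairs drawn from each sequence
candidates = {pair for seq in sequences for pair in combinations(seq, 2)}
freq = frequent_patterns(sequences, candidates, min_supp=0.5)
by_length = Counter(len(p) for p in freq)  # pattern counts grouped by length
```

With min_supp = 0.5, only the pairs supported by at least two of the four toy sequences survive, mirroring how higher support thresholds thin out the longer patterns in Figure 8.1.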

[Figure: panels (a) “positive reviews” and (b) “negative reviews” plot the number of 2-, 3-, 4-, and 5-length sequential patterns against minimum support values from 0.01% to 0.05%.]

Figure 8.1: Number of discovered sequential patterns with different sequence lengths.

We extract the sequential patterns consisting of adjectives, adverbs, nouns, verbs, and negation identifiers as the class descriptor. Figure 8.2 shows the total numbers of 2-phrase and 3-phrase class patterns that contain at least one adjective and/or adverb, since adjectives and adverbs are essential in sentiment classification.
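The filtering step described above can be sketched as follows; the tag labels, the `model_of` helper, and the example patterns are hypothetical, standing in for whatever the POS tagger actually produces.

```python
# Each pattern is a tuple of (word, POS) pairs; the tag names are illustrative.
KEPT_TAGS = {"ADJ", "ADV", "N", "V", "NEG"}

def is_class_pattern(pattern):
    """A class pattern uses only the kept word classes, has 2 or 3 phrases,
    and contains at least one adjective or adverb (essential for sentiment)."""
    tags = [tag for _, tag in pattern]
    return (set(tags) <= KEPT_TAGS
            and len(pattern) in (2, 3)
            and any(t in ("ADJ", "ADV") for t in tags))

def model_of(pattern):
    """The 'model' of a pattern is its tag sequence, e.g. ADJ.-N."""
    return "-".join(f"{tag}." if tag != "NEG" else "NEG" for _, tag in pattern)

patterns = [
    (("good", "ADJ"), ("movie", "N")),
    (("not", "NEG"), ("good", "ADJ")),
    (("watch", "V"), ("movie", "N")),                   # no ADJ/ADV: rejected
    (("very", "ADV"), ("good", "ADJ"), ("plot", "N")),
]
class_patterns = [p for p in patterns if is_class_pattern(p)]
models = [model_of(p) for p in class_patterns]
```

The `model_of` naming mirrors the notation of Table 8.5 (ADJ.-N., NEG-ADJ., and so on), where each model groups the class patterns sharing a tag sequence.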

[Figure: panels (a) “positive reviews” and (b) “negative reviews” plot the number of 2-phrase and 3-phrase class patterns against minimum support values from 0.01% to 0.05%.]

Figure 8.2: Number of 2-phrase and 3-phrase class patterns.

The occurrences of the discovered 2-phrase class pattern models are listed in Table 8.5, ordered alphabetically by model with the negation models listed last, with respect to different minimum support values. In order to save space, we only list the models corresponding to the confmin values 0.01%, 0.03%, and 0.05%.



Class Pattern Models   P-0.01%   N-0.01%   P-0.03%   N-0.03%   P-0.05%   N-0.05%
ADJ.-ADV.                 1089       892       134       134        34        32
ADJ.-N.                   4049      3109       566       517       257       206
ADJ.-V.                   2813      2474       581       558       321       276
ADV.-ADJ.                 1654      1314       219       221        83        76
ADV.-N.                   3348      3014       452       469       209       169
ADV.-V.                   3084      2954       728       781       394       390
N.-ADJ.                   2571      2045       292       286       127       100
N.-ADV.                   2929      2729       438       478       194       189
V.-ADJ.                   3841      3367       940       901       507       448
V.-ADV.                   3157      2940       846       931       498       492
NEG-ADJ.                   329       314       103        90        60        49
ADJ.-NEG                   254       232        70        64        38        34
NEG-ADV.                   166       147        79        83        66        62
ADV.-NEG                   147       138        71        71        51        52

Table 8.5: 2-phrase class pattern models.

Number   Models for class “positive”      Number   Models for class “negative”
2289     V.-V.-ADV.                       2343     V.-V.-ADV.
2121     V.-ADV.-V.                       2106     V.-ADV.-V.
1801     V.-V.-ADJ.                       1689     V.-V.-ADJ.
1691     V.-ADJ.-N.                       1616     ADV.-V.-V.
1607     ADV.-V.-V.                       1433     V.-ADJ.-N.
1546     V.-ADJ.-V.                       1362     V.-ADJ.-V.
1340     V.-ADV.-N.                       1212     N.-V.-ADV.
1276     N.-V.-ADV.                       1159     V.-ADV.-N.
1045     ADJ.-V.-V.                        969     ADJ.-V.-V.
 946     N.-V.-ADJ.                        861     V.-N.-ADV.

Table 8.6: The 10 most frequent 3-phrase class pattern models.


For the discovered 3-phrase class pattern models, the 10 most frequent ones, corresponding to confmin = 0.01%, are listed in Table 8.6.

The unexpected class patterns are generated from the semantic oppositions of class patterns.

In our experiments, the lexical database WordNet [Cog] is used to determine the antonyms of adjectives and adverbs for constructing semantic oppositions. For a class pattern, if an adjective and an adverb appear together, then only the antonyms of the adjective are considered; if the adjective or adverb has no antonym, then the class pattern is ignored; and if there exists more than one antonym, then more than one unexpected class pattern is generated, one from each antonym. The total numbers of unexpected 2-phrase and 3-phrase class patterns are shown in Figure 8.3.
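The generation rules above can be sketched as follows, with a toy antonym table standing in for WordNet (the table entries and function names are illustrative).

```python
# Toy antonym table in place of WordNet lookups (illustrative entries).
ANTONYMS = {"good": ["bad", "poor"], "quickly": ["slowly"]}

def unexpected_patterns(pattern):
    """Generate one unexpected pattern per antonym of the pivot word.

    Pivot choice: if the pattern contains an adjective, its antonyms are
    used (even when an adverb is also present); otherwise the adverb's.
    A pattern whose pivot has no antonym yields nothing.
    """
    # find the pivot: prefer the adjective over the adverb
    pivot = next(((w, t) for w, t in pattern if t == "ADJ"), None)
    if pivot is None:
        pivot = next(((w, t) for w, t in pattern if t == "ADV"), None)
    if pivot is None or pivot[0] not in ANTONYMS:
        return []  # no semantic opposition can be built
    out = []
    for ant in ANTONYMS[pivot[0]]:
        # replace the pivot word by each antonym, keeping the rest intact
        out.append(tuple((ant, t) if (w, t) == pivot else (w, t)
                         for w, t in pattern))
    return out

pats = unexpected_patterns((("very", "ADV"), ("good", "ADJ"), ("plot", "N")))
```

Here “good” has two antonyms, so the single class pattern yields two unexpected class patterns, which is how the counts in Figure 8.3 can exceed the class pattern counts of Figure 8.2 for some models.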

[Figure: panels (a) “positive reviews” and (b) “negative reviews” plot the number of unexpected 2-phrase and 3-phrase class patterns against minimum support values from 0.01% to 0.05%.]

Figure 8.3: Number of 2-phrase and 3-phrase unexpected class patterns.

The total numbers of unexpected sentences determined from unexpected 2-phrase and 3-phrase class patterns are shown in Figure 8.4, and the total numbers of documents that contain unexpected sentences are shown in Figure 8.5.

[Figure: panels (a) “positive reviews” and (b) “negative reviews” plot the number of unexpected sentences discovered from 2-phrase and 3-phrase class patterns against minimum support values from 0.01% to 0.05%.]

Figure 8.4: Number of unexpected sentences discovered from 2-phrase and 3-phrase unexpected class patterns.



[Figure: panels (a) “positive reviews” and (b) “negative reviews” plot the number of documents containing unexpected sentences discovered from 2-phrase and 3-phrase class patterns against minimum support values from 0.01% to 0.05%.]

Figure 8.5: Number of documents that contain unexpected sentences discovered from 2-phrase and 3-phrase unexpected class patterns.

Validation of Unexpected Sentences.

The goal of the evaluation is to use the text classification method to validate the unexpectedness stated in the discovered unexpected sentences with respect to the document class. The unexpectedness is examined with the Bow toolkit [McC96] by comparing the average accuracy of text classification tasks with and without the unexpected sentences.

Three methods, k-Nearest Neighbor (k-NN), Naive Bayes, and TFIDF, are selected for testing our approach with classification tasks. Classifiers based on the k-NN method [YC94] are example-based: to decide whether a document D |= Ci for a class Ci, they examine whether the k training documents most similar to D are also in Ci. Classifiers based on Naive Bayes (see [Lew98]) compute the probability that a document D belongs to a class Ci by an application of Bayes’ theorem, which accounts for most of the probabilistic approaches in text classification.
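A k-NN document classifier of this kind can be sketched as below, using cosine similarity over term-count vectors and a majority vote among the k nearest training documents; the toy training set is illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(train, doc, k=3):
    """train: list of (term-count dict, label) pairs.
    Majority vote over the k training documents most similar to doc."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], doc), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    (Counter("great great film".split()), "positive"),
    (Counter("wonderful film".split()), "positive"),
    (Counter("awful boring film".split()), "negative"),
    (Counter("boring plot".split()), "negative"),
]
label = knn_classify(train, Counter("great wonderful film".split()), k=3)
```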

In contrast, classifiers based on TFIDF (term frequency-inverse document frequency) [SB88] compute the term frequency to decide whether a document D belongs to a class Ci, but incorporate an inverse document frequency factor, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. Briefly, in order to learn a model, a prototype vector based on the TFIDF weights of terms is computed for each class, and then the cosine similarity between a new document and each prototype vector is calculated to assign the relevant class.
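A minimal sketch of this prototype-based (Rocchio-style) TFIDF classifier follows, assuming raw term counts and a logarithmic idf; the weighting details and example documents are illustrative, not the Bow toolkit's exact formulas.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists. Returns per-document TFIDF dicts and the idf table."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequencies
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vecs, idf

def prototypes(vecs, labels):
    """One summed TFIDF prototype vector per class."""
    proto = defaultdict(lambda: defaultdict(float))
    for vec, lab in zip(vecs, labels):
        for t, w in vec.items():
            proto[lab][t] += w
    return {lab: dict(v) for lab, v in proto.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["great", "film"], ["wonderful", "film"],
        ["awful", "film"], ["boring", "awful", "plot"]]
labels = ["positive", "positive", "negative", "negative"]
vecs, idf = tfidf_vectors(docs)
proto = prototypes(vecs, labels)
# weight a new document with the learned idf, then pick the closest prototype
new = {t: c * idf.get(t, 0.0) for t, c in Counter(["awful", "boring", "film"]).items()}
pred = max(proto, key=lambda lab: cosine(proto[lab], new))
```

Note how “film”, which occurs in three of the four documents, gets a low idf weight and so contributes little to the class decision, exactly the dampening effect the text describes.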

In our experiments, two groups of tests are performed, with and without pruning the most frequent words common to all documents in the two classes, by selecting the words with the highest average mutual information with the class variable. Each test is performed with 20 trials of a randomized 40%-60% test-train split, and we report the final average accuracy values. All tests are based on the unexpected sentences extracted with the 2-phrase and 3-phrase unexpected class patterns obtained with different min_supp values from 0.01% to 0.05%.
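The evaluation harness can be sketched as below. It assumes a 40% train / 60% test split (the split direction is ambiguous in the text) and a hypothetical `train_and_score` callback; the majority-class baseline exists only to exercise the harness.

```python
import random

def average_accuracy(examples, train_and_score, trials=20, train_frac=0.4, seed=0):
    """Average accuracy over repeated randomized train/test splits.

    `train_and_score(train, test)` is assumed to fit a classifier on
    `train` and return its accuracy on `test` (hypothetical callback).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        total += train_and_score(shuffled[:cut], shuffled[cut:])
    return total / trials

# trivial scorer: a majority-class baseline, just to exercise the harness
def majority_baseline(train, test):
    labels = [lab for _, lab in train]
    guess = max(set(labels), key=labels.count)
    return sum(lab == guess for _, lab in test) / len(test)

data = [(f"doc{i}", "positive" if i % 2 else "negative") for i in range(50)]
acc = average_accuracy(data, majority_baseline, trials=20)
```

Averaging over 20 randomized trials, as the experiments do, smooths out the variance that any single train-test split would introduce.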

The evaluation results on the change of accuracy are shown in Figure 8.6, Figure 8.7, and



[Figure: panels (a) without and (b) with frequent word pruning plot the change of accuracy (%) against minimum support (0.01% to 0.05%) for sentences discovered from 2-phrase and 3-phrase unexpected class patterns, with the random-removal result as a reference line.]

Figure 8.6: Change of average accuracy before and after eliminating unexpected sentences by using the k-NN method.

Figure 8.8. The results are compared with removing the same number of randomly selected sentences from the documents. In each figure, the average accuracy of the original documents α(D) is taken as the baseline “0”, and the change of accuracy δR of the documents with randomly removed sentences is shown as a reference line.

In the test results on the k-NN classifier shown in Figure 8.6(a), the change of accuracy varies with the min_supp value used for extracting class patterns; however, the results shown in Figure 8.6(b) clearly confirm Property 4. The behavior in Figure 8.6(a) also shows that although selecting frequent terms improves the accuracy of classification tasks, the frequent words common to all classes decrease the confidence in the accuracy of the classification.

[Figure: panels (a) without and (b) with frequent word pruning plot the change of accuracy (%) against minimum support (0.01% to 0.05%) for sentences discovered from 2-phrase and 3-phrase unexpected class patterns, with the random-removal result as a reference line.]

Figure 8.7: Change of average accuracy before and after eliminating unexpected sentences by using the Naive Bayes method.

Because Naive Bayes classifiers are probability based, Figure 8.7(a) is reasonable: the unexpected class patterns contained in the eliminated unexpected sentences only weakly affect the probability that a document belongs to a class, since the eliminated terms are not frequent, whereas randomly selected sentences contain terms important for classifying the documents. Pruning the most frequent common words amplifies the effect of the unexpected sentences, thus the results shown in Figure 8.7(b) clearly confirm Property 4.

[Figure: panels (a) without and (b) with frequent word pruning plot the change of accuracy (%) against minimum support (0.01% to 0.05%) for sentences discovered from 2-phrase and 3-phrase unexpected class patterns, with the random-removal result as a reference line.]

Figure 8.8: Change of average accuracy before and after eliminating unexpected sentences by using the TFIDF method.

[Figure: panels (a) without and (b) with frequent word pruning plot the change of accuracy (%) against minimum support (0.01% to 0.05%) for the k-NN, Naive Bayes, and TFIDF methods.]

Figure 8.9: Change of average accuracy between the original documents and the documents consisting of the unexpected sentences discovered from 2-phrase unexpected class patterns.

According to the principle of the TFIDF weight, Figure 8.8(a) shows that the effect of common frequent words in classification tasks is important, so that the elimination of a limited number of sentences does not change the overall accuracy. Unlike the Naive Bayes classifiers, Figure 8.8(b) clearly confirms Property 4.(1) and Property 4.(2); however, Property 4.(3) is not satisfied because the elimination of randomly selected sentences increases the overall accuracy of the classification.

We also test the accuracy of the classification tasks on documents consisting of only unexpected sentences, in order to study the characteristics of unexpected sentences, as shown in Figure 8.9 and Figure 8.10. It is not difficult to see that the unexpected sentences are difficult to classify with

