

From the document Case Studies Using Open-Source Tools (pages 116–122)

Introduction to the KNIME Text Processing Extension

3.5 Example Application: Sentiment Classification

Sentiment analysis of free-text documents is a common task in the field of text mining. In sentiment analysis, predefined sentiment labels, such as “positive” or “negative”, are assigned to documents. This section demonstrates how a predictive model can be built to assign sentiment labels to documents with the KNIME Text Processing extension in combination with traditional KNIME learner and predictor nodes.

The data set used in this example is a set of 2,000 documents sampled from the training set of the Large Movie Review Dataset v1.0 [10]. The Large Movie Review Dataset v1.0 contains 50,000 English movie reviews along with their associated sentiment labels “positive” and “negative”. For details about the data set see [10]. 1,000 documents of the positive class and 1,000 documents of the negative class were sampled.

FIGURE 3.11: Chain of preprocessing nodes of the Preprocessing meta node.

3.5.1 Reading Textual Data

The workflow starts with a File Reader node. The node reads a CSV file, which contains the review texts, the associated sentiment labels, the IMDb URL of the corresponding movie, and its index in the Large Movie Review Dataset v1.0. In addition to the sentiment column, the text column is also important here. In the first meta node, Document Creation, document cells are created from string cells using the Strings to Document node. The sentiment labels are stored in the category field of each document in order to extract the category afterwards.

All columns are filtered with the exception of the column that contains the document cells.

The output of the first meta node is a document list data table.

3.5.2 Preprocessing

The tagging step is skipped here since it is not necessary to recognize named entities or assign POS tags. All preprocessing nodes are contained in the Preprocessing meta node, shown in Figure 3.11.

First, punctuation marks are removed, numbers and stop words are filtered, and all terms are converted to lowercase. Then the word stem is extracted for each term.
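In KNIME these steps are carried out by dedicated preprocessing nodes. As a rough stand-in outside KNIME, the same pipeline can be sketched in plain Python; the stop-word set and the suffix stripper below are deliberately minimal toy substitutes for KNIME's full stop-word lists and the Snowball stemmer:

```python
import re

# Toy stand-ins for illustration only: KNIME uses full stop-word lists
# and the Snowball stemmer; this set and stemmer are deliberately minimal.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "this", "was"}

def crude_stem(term):
    # Naive suffix stripping standing in for proper stemming.
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    text = re.sub(r"[^\w\s]", " ", text)                 # erase punctuation
    tokens = text.lower().split()                        # convert to lowercase
    tokens = [t for t in tokens if not t.isdigit()]      # filter numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    return [crude_stem(t) for t in tokens]               # extract word stems

print(preprocess("This movie is amazing, and the acting was great in 2010!"))
# ['movie', 'amaz', 'act', 'great']
```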

In the lower part of the meta node, all terms that occur in fewer than 20 documents are filtered from the bag-of-words. This is done by grouping by the terms, counting all the unique documents that contain these terms, filtering this list of terms, and finally filtering the bag-of-words with the Reference Row Filter node. Thereby we reduce the feature space from 22,105 distinct terms to 1,500.

FIGURE 3.12: Chain of preprocessing nodes inside the Preprocessing meta node.

The minimum number of documents is set to 20 since we assume that a term has to occur in at least 1% of all documents in order to represent a useful feature for classification. This is a rule of thumb and can, of course, be adjusted individually.
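The same document-frequency filter can be sketched in a few lines of Python (hypothetical toy data; `min_df` plays the role of the 20-document threshold):

```python
from collections import Counter

def filter_by_document_frequency(bags, min_df):
    # Document frequency: number of unique documents containing each term.
    df = Counter()
    for bag in bags:
        df.update(set(bag))
    keep = {t for t, n in df.items() if n >= min_df}
    # Keep only retained terms in each bag-of-words (the role played by
    # the Reference Row Filter node in the workflow).
    return [[t for t in bag if t in keep] for bag in bags]

docs = [["good", "film"], ["good", "plot"], ["bad", "plot"], ["good"]]
print(filter_by_document_frequency(docs, min_df=2))
# [['good'], ['good', 'plot'], ['plot'], ['good']]
```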

3.5.3 Transformation

Based on these extracted terms, document vectors are created and used in the following for classification by a decision tree classifier. In this example, bit vectors were created by the Document Vector node. The values of bit vectors are 1 or 0, depending on the presence of a term in a document. However, the previously computed tf values, or any other scores or frequencies computed beforehand, could be used as vector values as well.
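The bit-vector encoding itself is straightforward; a minimal sketch over a hypothetical vocabulary:

```python
def document_vector(bag, vocabulary):
    # Bit vector: 1 if the term occurs in the document, 0 otherwise.
    # Using tf counts here instead would yield a frequency vector.
    present = set(bag)
    return [1 if term in present else 0 for term in vocabulary]

vocabulary = ["act", "amaz", "bad", "good"]
print(document_vector(["good", "act", "good"], vocabulary))
# [1, 0, 0, 1]
```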

3.5.4 Classification

Any of the traditional mining algorithms available in KNIME can be used for classification (e.g., decision trees, ensembles, support vector machines). As in all supervised learning algorithms, a target variable is required (see, e.g., [11]). In this example the target is the sentiment label, which is stored as the category in the documents. The target or class column is extracted from the documents and appended, as a string column, using the Category To Class node. This category can then be used as the target class for the classification procedure.
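As an illustration of this setup outside KNIME, the sketch below trains scikit-learn's DecisionTreeClassifier on a few hand-made bit vectors; the data, the four-term vocabulary, and the library choice are assumptions for demonstration, with the Decision Tree Learner node playing this role in the actual workflow:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical bit vectors over a four-term vocabulary; in the workflow,
# the Document Vector output and the extracted class column feed the
# Decision Tree Learner node.
vocabulary = ["amaz", "bad", "good", "terribl"]
X = [
    [1, 0, 1, 0],  # positive review
    [0, 0, 0, 0],  # positive review
    [1, 1, 0, 1],  # negative review
    [0, 1, 1, 0],  # negative review
]
y = ["positive", "positive", "negative", "negative"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[0, 1, 0, 0]])[0])  # 'negative'
```

Here the presence of the stem "bad" perfectly separates the two classes, so the learned tree reduces to a single split on that feature.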

Based on the category, a color is assigned to each document by the Color Manager node: documents labeled “positive” are colored green; documents labeled “negative” are colored red.

FIGURE 3.13: Confusion matrix and accuracy scores of the sentiment decision tree model.

The corresponding receiver operating characteristic (ROC) curve is shown in Figure 3.14.

FIGURE 3.14: ROC curve of the sentiment decision tree model.

The aim of this example application is to clarify and demonstrate the usage of the KNIME Text Processing extension rather than to achieve the best possible accuracy. Other learner nodes, such as support vector machines or decision tree ensembles, can easily be applied in KNIME in order to build better models. Furthermore, scores such as tf-idf could be used instead of bit vectors. In addition to features that represent single words, n-gram features representing multiple consecutive words can be used as well. Cross-validation could be applied to achieve a more precise estimate of accuracy.
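For instance, a simple tf-idf weighting (one common variant; several tf and idf definitions exist) and bigram extraction can be sketched as follows, on toy data for illustration only:

```python
import math
from collections import Counter

def tf_idf(bags):
    # One common tf-idf variant: raw term frequency times log(N / df).
    n = len(bags)
    df = Counter()
    for bag in bags:
        df.update(set(bag))
    vocab = sorted(df)
    vectors = []
    for bag in bags:
        tf = Counter(bag)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

vocab, vecs = tf_idf([["good", "film", "good"], ["bad", "film"]])
print(vocab)  # ['bad', 'film', 'good']

# n-gram features: consecutive token pairs (bigrams) instead of single words.
tokens = ["not", "good", "film"]
bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
print(bigrams)  # ['not good', 'good film']
```

Note that "film", which occurs in every toy document, receives a tf-idf weight of zero, mirroring how globally common terms carry little discriminative information.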

3.6 Summary

In this chapter the KNIME Text Processing extension has been introduced. The functionality as well as the philosophy and usage of the extension have been described. It has been shown which nodes need to be applied, and in what order, to read textual data, add semantic information by named entity recognition, preprocess the documents, and finally transform them into numerical vectors that can be used by regular KNIME nodes for clustering or classification. The KNIME data types provided by the Text Processing feature, the document cell and the term cell, have also been explained. These techniques were demonstrated in an example application: sentiment classification of movie reviews.

More information and example workflows for clustering and classification of documents, usage of tag clouds, and other visualizations can be found at http://tech.knime.org/examples.

All results were generated with the open-source KNIME Analytics Platform. KNIME workflows and data are provided with this chapter. The required KNIME extension is the Text Processing extension.

Bibliography

[1] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, and B. Wiswedel. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, 2007.

[2] M. Marcus, G. Kim, M. A. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz, and B. Schasberger. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, 1994. Association for Computational Linguistics.

[3] Apache OpenNLP, 2014. http://opennlp.apache.org.

[4] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, 2014.

[5] STTS Stuttgart-Tübingen Tag Set, 2014. http://www.isocat.org/rest/dcs/376.

[6] A. Abeillé, L. Clément, and F. Toussenel. Building a treebank for French. In Treebanks, pages 165–187. Springer, 2003.

[7] B. Settles. ABNER: An open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics, 21(14):3191–3192, 2005.

[8] L. Hawizy, D. M. Jessop, N. Adams, and P. Murray-Rust. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics, 3(1):17, 2011.

[9] Snowball stemming library, 2014. http://snowball.tartarus.org.

Chapter 4

Social media channels have become more and more important for many organizations in order to reach targeted groups of individuals as well as to understand the needs and behaviors of their users and customers. Huge amounts of social media data are already available and are growing rapidly from day to day. The challenge lies in accessing that data and creating usable and actionable insights from it. So far, three major approaches are typically used to analyze social media data: channel reporting tools, overview scorecarding systems, and predictive analytics with a focus on sentiment analysis. Each has its useful aspects but also its limitations. In this chapter we discuss a new approach that combines text mining and network analysis to overcome some of the limitations of the standard approaches and to create actionable, fact-based insights.

Sentiment analysis, on the one hand, is a common approach in text mining to estimate and predict the attitude of users with respect to products, services, or certain topics in general. Centrality measures, on the other hand, are used in the field of network analysis to identify important nodes (users), so-called influencers. In our approach we combine these two techniques in order to analyze the sentiments of influencers on a social media platform, as well as the impact of users with certain sentiments. We show that participants who are

