
Chapter Authors

In the document Case Studies Using Open-Source Tools (Page 28-44)

• Nelson Areal, Department of Management, University of Minho, Braga, Portugal

• Patrick Buckley, Institute of Technology, Blanchardstown, Ireland

• Brian Carter, IBM Analytics, Dublin, Ireland

• Andrew Chisholm, Information Gain Ltd., UK

• David Colton, IBM, Dublin, Ireland

• Paul Clough, Information School, University of Sheffield, UK

• Paulo Cortez, ALGORITMI Research Centre/Department of Information Systems, University of Minho, Guimarães, Portugal

• Pavlina Davcheva, Chair of Information Systems II, Institute of Information Systems, Friedrich-Alexander-University Erlangen-Nuremberg, Germany

• Kyle Goslin, Department of Computer Science, College of Computing Technology, Dublin, Ireland

• Tobias Kötter, KNIME.com, Berlin, Germany

• Nuno Oliveira, ALGORITMI Research Centre, University of Minho, Guimarães, Portugal

• Alexander Piazza, Chair of Information Systems II, Institute of Information Systems, Friedrich-Alexander-University Erlangen-Nuremberg, Germany

• Tony Russell-Rose, UXLabs, UK

• John Ryan, Blanchardstown Institute of Technology, Dublin, Ireland

• Rosaria Silipo, KNIME.com, Zurich, Switzerland

• Kilian Thiel, KNIME.com, Berlin, Germany

• Christos Iraklis Tsatsoulis, Nodalpoint Systems, Athens, Greece

• Phil Winters, KNIME.com, Zurich, Switzerland


Many people have contributed to making this book and the underlying open-source software solutions a reality. We are thankful to all of you.

We would like to thank the contributing authors of this book, who shared their experience in its chapters and thereby enable others to get a quick and successful start in text mining with open-source tools. Their chapters provide successful application examples and blueprints for readers to tackle their own text mining tasks and benefit from the strength of open and freely available tools.

Many thanks to Dr. Brian Nolan, Head of School of Informatics, Institute of Technology Blanchardstown (ITB); and Dr. Anthony Keane, Head of Department of Informatics, ITB for continuously supporting projects such as this one.

Many thanks also to our families. MH: A special thanks goes to Glenda, Killian, Darragh, Daniel, SiSi, and Judy for making my life fun, and to my parents, Gertrud and Karl-Heinz Hofmann, for continuously supporting my endeavours. Also a huge thank you to Hans Trautwein and Heidi Krauss for introducing me to computers and my first data-related application, MultiPlan, in 1986. AC: To my parents for making it possible and to my wife for keeping it possible.

The entire team of the Taylor & Francis Group was very professional, responsive, and always helpful in guiding us through this project. Should any of you readers consider publishing a book, we can highly recommend this publisher.

Open-source projects grow strong with their community. We are thankful to all contributors, particularly those behind text analysis related open-source tools, and to all supporters of these open-source projects. We are grateful not only for source code contributions, community support in the forums, and bug reports and fixes, but also to those who spread the word with their blogs, videos, and word of mouth.

With best regards and appreciation to all contributors,

Dr. Markus Hofmann, Institute of Technology Blanchardstown, Dublin, Ireland

Andrew Chisholm, Information Gain Ltd., UK


1.1 Creating your repository: Overall process. . . 6
1.2 Creating your repository: Step A – Get Page operator. . . 7
1.3 Creating your repository: Step B. . . 8
1.4 Creating your repository: Step B – vector window. . . 8
1.5 Creating your repository: Step B – Cut Document operator. . . 9
1.6 Creating your repository: Step B – Extract Information operator. . . 9
1.7 Creating your repository: Step C – create attributes. . . 9
1.8 Creating your repository: Step C – attribute name. . . 10
1.9 Creating your repository: Step D. . . 10
1.10 Creating your repository: Step E – Write Excel operator. . . 10
1.11 Build a token repository: Process I. . . 11
1.12 Build a token repository: Process I Step A – Read Excel operator. . . 12
1.13 Build a token repository: Process I Step B – Get Pages operator. . . 12
1.14 Build a token repository: Process I Step C – Process Documents operator. . . 13
1.15 Build a token repository: Process I Step C – vector window. . . 13
1.16 Build a token repository: Process I Step C – Extract Information operator. . . 13
1.17 Build a token repository: Process I Step C – extracting date. . . 14
1.18 Build a token repository: Process I Step C – extracting president’s name. . . 14
1.19 Build a token repository: Process I Step C – extracting regular expression. . . 15
1.20 Build a token repository: Process I Step C – regular region. . . 16
1.21 Build a token repository: Process I Step C – cutting speech content. . . 16
1.22 Build a token repository: Process I Step C – cut document nodes. . . 16


1.28 Build a token repository: Process II Step C – Select Attributes operator. . . 18
1.29 Build a token repository: Process II Step C – subset attributes. . . 19
1.30 Build a token repository: Process II Step D – Write Database operator. . . 19
1.31 Analyzing the corpus: Process I. . . 20
1.32 Analyzing the corpus: Process I Step A – Read Database operator. . . 21
1.33 Analyzing the corpus: Process I Step A – SQL. . . 21
1.34 Analyzing the corpus: Process I Step B – term occurrences. . . 22
1.35 Analyzing the corpus: Process I Step B – vector creation window. . . 22
1.36 Analyzing the corpus: Process I Step B – Extract Content operator. . . 23
1.37 Analyzing the corpus: Process I Step B – Tokenize operator. . . 23
1.38 Analyzing the corpus: Process I Step B – Filter Stopwords (English) operator. . . 24
1.39 Analyzing the corpus: Process I Step B – custom dictionary. . . 24
1.40 Analyzing the corpus: Process I Step C – Write Excel operator. . . 24
1.41 Analyzing the corpus: Process I Step C – transposed report. . . 25
1.42 Analyzing the corpus: Process II Step B. . . 26
1.43 Analyzing the corpus: Process II Step B – Generate n-Grams (Terms) operator. . . 26
1.44 Analyzing the corpus: Process II Step B – Filter Tokens (by Content) operator. . . 26
1.45 Analyzing the corpus: Process II Step C. . . 26
1.46 Visualization: Layout of transposed worksheet. . . 28
1.47 Visualization: Wordle menu. . . 29
1.48 Visualization: Copy and paste data into create section. . . 30
1.49 Visualization: Speech represented as a word cloud. . . 30
1.50 Visualization: Word cloud layout. . . 31
1.51 Visualization: Word cloud colour. . . 31
1.52 Visualization: Word cloud token limit. . . 32
1.53 Visualization: Word cloud font types. . . 32
1.54 Visualization: Word cloud remove tokens (filtered to 20 tokens). . . 33
1.55 Visualization: Word cloud filter option (remove token). . . 33
1.56 Visualization: Word cloud filtered (specific tokens removed). . . 34
1.57 Visualization: Word cloud options (print, save, new window, randomize). . . 34
1.58 Visualization: Layout of transposed bigram worksheet. . . 34
1.59 Visualization: Word cloud bigrams. . . 35
1.60 Visualization: Word cloud bigrams filtered (removal of tokens). . . 35

2.1 Observed variation for the word “the” for consecutive 5,000-word windows within the novel Moby Dick. . . 42
2.2 RapidMiner process to calculate word frequencies. . . 43
2.3 Process Section A within Figure 2.2. . . 44
2.4 Process Section B within Figure 2.2. . . 44
2.5 Process Section C within Figure 2.2. . . 45
2.6 Process Section D within Figure 2.2. . . 45
2.7 RapidMiner process to execute process for all attributes to fit Zipf-Mandelbrot distribution. . . 48
2.8 Detail of RapidMiner process to execute Zipf-Mandelbrot distribution fit. . . 48
2.9 RapidMiner process to fit Zipf-Mandelbrot distribution. . . 49
2.10 Configuration for Optimize Parameters (Evolutionary) operator. . . 50
2.11 Details for Optimize Parameters (Evolutionary) operator. . . 50
2.12 Details for macro-generation workaround to pass numerical parameters to Optimize Parameters operator. . . 51
2.13 Calculation of Zipf-Mandelbrot probability and error from known probability. . . 51
2.17 Zipf-Mandelbrot scatter plot for A and C parameters for random samples and sequential windows within Sense and Sensibility. . . 56
2.18 Zipf-Mandelbrot scatter plot for A and C parameters for random samples and sequential windows within Mansfield Park. . . 56
2.19 Zipf-Mandelbrot scatter plot for A and C parameters for random samples and sequential windows within The Return of Sherlock Holmes. . . 57
2.20 Zipf-Mandelbrot scatter plot for A and C parameters for random samples and sequential windows within The Adventures of Sherlock Holmes. . . 57

3.1 An example workflow illustrating the basic philosophy and order of KNIME text processing nodes. . . 65
3.2 A data table with a column containing document cells. The documents are reviews of Italian restaurants in San Francisco. . . 67
3.3 A column of a data table containing term cells. The terms have been assigned POS tags (tag values and tag types). . . 68
3.4 Dialog of the OpenNLP NE Tagger node. The first checkbox allows for specification as to whether or not the named entities should be flagged unmodifiable. . . 69
3.5 Dialog of the OpenNLP NE Tagger node. The number of parallel threads to use for tagging can be specified here. . . 69
3.6 Typical chain of preprocessing nodes to remove punctuation marks, numbers, very small words, and stop words, and to apply conversion to lowercase and stemming. . . 70
3.7 The Preprocessing tab of the Stop word Filter node. Deep preprocessing is applied, original documents are appended, and unmodifiable terms are not filtered. . . 71
3.8 Bag-of-words data table with one term column and two document columns. The “Orig Document” column contains original documents. The “Document” column contains preprocessed documents. . . 72
3.9 Bag-of-words data table with an additional column with absolute term frequencies. . . 73
3.10 Document vectors of 10 documents. The documents are stored in the leftmost column. The other columns represent the terms of the whole set of documents, one for each unique term. . . 75
3.11 Chain of preprocessing nodes of the Preprocessing meta node. . . 76
3.12 Chain of preprocessing nodes inside the Preprocessing meta node. . . 77
3.13 Confusion matrix and accuracy scores of the sentiment decision tree model. . . 78
3.14 ROC curve of the sentiment decision tree model. . . 78

4.1 The text mining workflow used to compute the sentiment score for each user. . . 84
4.2 Distribution of the level of attitude λ by user, with −20 as minimum attitude and 50 as maximum attitude. . . 84
4.3 Scatter plot of frequency of negative words vs. frequency of positive words for all users. . . 85
4.4 Tag cloud of user “dada21”. . . 86
4.5 Tag cloud of user “pNutz”. . . 86
4.6 Example of a network extracted from Slashdot where vertices represent users, and edges comments. . . 87
4.7 Scatter plot of leader vs. follower score for all users. . . 88
4.8 KNIME workflow that combines text and network mining. . . 90
4.9 Leader vs. follower score colored by attitude for all users. Users with a positive attitude are marked green, users with a negative attitude red. . . 91

5.1 Pillreports.net standard report. . . 96
5.2 Code: Check Python setup. . . 99
5.3 Code: Creating a sparse matrix in Python. . . 102
5.4 Code: Confusing output with text in Python. . . 103
5.5 Original byte encoding — 256 characters. . . 104
5.6 Unicode encoding paradigm. . . 104
5.7 Character bytes and code points. . . 104
5.8 Code: Python encoding for text correct use. . . 105
5.9 Code: Scraping webpages. . . 107
5.10 Code: Connecting to a database in MongoDB. . . 107
5.16 Code: matplotlib subplots. . . 116
5.17 Weekly count of reports. . . 117
5.18 Warning: column cross-tabulated with Country: column. . . 117
5.19 Code: Weekly count of reports submitted. . . 118
5.20 String length of Description: column. . . 120
5.21 Code: Setting up vectorizer and models for classification. . . 121
5.22 Code: Splitting into train and test sets. . . 121
5.23 Code: sklearn pipeline. . . 122
5.24 Code: Model metrics. . . 122
5.25 Code: Feature selection. . . 123
5.26 Scatter plot of top predictive features. . . 125
5.27 Code: Clustering and PCA models. . . 126
5.28 Principal components scatter plot. . . 127
5.29 Code: Tagging words using nltk library. . . 128
5.30 Code: Counting word frequency with the collections module. . . 129
5.31 User report: Word cloud. . . 129

6.1 Sentiment classification and visualization process. . . 135
6.2 Word cloud for the positive and negative reviews of the mobile phone category. . . 148
6.3 Jigsaw’s welcome screen. . . 148
6.4 Import screen in Jigsaw. . . 149
6.5 Entity identification screen in Jigsaw. . . 149
6.6 Word tree view for “screen”. . . 150
6.7 Word tree view for “screen is”. . . 151

7.1 A sample of records from the AOL log. . . 155
7.2 A sample from the AOL log divided into sessions. . . 156
7.3 A set of feature vectors from the AOL log. . . 158
7.4 The Weka GUI chooser. . . 159
7.5 Loading the data into Weka. . . 160
7.6 Configuring the EM algorithm. . . 160
7.7 Configuring the visualization. . . 161
7.8 100,000 AOL sessions, plotted as queries vs. clicks. . . 162
7.9 Four clusters based on six features. . . 162
7.10 Three clusters based on seven features. . . 163
7.11 Applying EM using Wolfram et al.’s 6 features to 10,000 sessions from AOL. . . 165
7.12 Applying EM using Wolfram et al.’s 6 features to 100,000 sessions from AOL. . . 166
7.13 Applying XMeans (k <= 10) and Wolfram et al.’s 6 features to 100,000 sessions from AOL. . . 166
7.14 Applying EM and Wolfram et al.’s 6 features to 100,000 filtered sessions from AOL. . . 167
7.15 Sum of squared errors by k for 100,000 filtered sessions from AOL. . . 168
7.16 Applying kMeans (k = 4) and Wolfram et al.’s 6 features to 100,000 sessions from AOL. . . 169

8.1 Windows command prompt. . . 176
8.2 Contents of the SigmaJS folder. . . 178
8.3 XAMPP control panel. . . 193
8.4 News stories. . . 195
8.5 Closer view of news stories. . . 195

9.1 Verifying your Python environment on a Windows machine. . . 201
9.2 Using the NLTK built-in NLTK Downloader tool to retrieve the movie review corpus. . . 202

9.8 Performance of NLTK model using the prepare review function. . . 214
9.9 Performance of the first Naïve Bayes scikit-learn model. . . 217
9.10 Naïve Bayes scikit-learn most informative features. . . 218
9.11 Performance of the SVM scikit-learn model. . . 218
9.12 SVM scikit-learn model most informative features. . . 219

11.1 News extract topics. . . 242
11.2 Word-cloud of the DTM. . . 250
11.3 Cross-validation — optimum number of topics. . . 254
11.4 Cross-validation — optimum number of topics (2 to 5). . . 255
11.5 Term distribution. . . 258

12.1 Tag frequency distribution for the first 100 tags. . . 269
12.2 Communities revealed by Infomap, with corresponding sizes. . . 280
12.3 Visualization of our communities graph. . . 286
12.4 Visualization of the “R” community. . . 287
12.5 The “R” community, with the r node itself removed. . . 288
12.6 The “Big Data” and “Machine Learning” communities (excluding these terms themselves). . . 289

1.1 Creating your repository: Step E – report snapshot I. . . 10
1.2 Creating your repository: Step E – report snapshot II. . . 10
1.3 Creating your repository: Step E – report snapshot III. . . 11
1.4 Speechstop.txt content. . . 24

2.1 Variation of z-score for the most common words in sequential 5,000-word windows for the novel Moby Dick. . . 41
2.2 RapidMiner processes and sections where they are described. . . 43
2.3 Process sections for RapidMiner process to calculate rank–frequency distributions. . . 46
2.4 Details of texts used in this chapter. . . 52
2.5 Details of parameter ranges used in this chapter. . . 52

5.1 Column descriptions and statistics. . . 98
5.2 Example dense matrix. . . 101
5.3 Example sparse matrix. . . 101
5.4 Geo-coding State/Province: column. . . 113
5.5 Summary of Country: and Language: columns. . . 114
5.6 Summary of language prediction confidence. . . 114
5.7 Suspected contents (SC Category:) and Warning: label. . . 118
5.8 User Report: string length grouped by Country: and Warning:. . . 119
5.9 Top 5 models: binary classification on Warning: column. . . 123
5.10 Classification accuracy using feature selection. . . 124

6.1 Table of used libraries. . . 136


6.7 Resulting average confusion matrix for the mobile phone category. . . 145

8.1 Time windows. . . 180

9.1 Performance comparison of unigrams, bigrams, and trigrams when stop words have been removed. . . 211
9.2 Confusion matrix for a Naïve Bayes scikit-learn classifier. . . 218

10.1 Examples of sentiment analysis according to each model (pos = positive, neg = negative, neu = neutral). . . 234
10.2 Sentiment analysis results (in %, best values in bold). . . 236
10.3 Sentiment analysis results excluding messages without lexicon items (in %, best values in bold). . . 236
10.4 Sentiment analysis results with emoticons and hashtags features (in %, best values in bold). . . 238

11.1 Topic document distributions. . . 259

12.1 Statistics of raw data. . . 267

Part I

RapidMiner


Chapter 1

The objectives of this chapter are twofold: to introduce you to the fundamentals of building a text-based repository, and to explain how you can apply text mining techniques to documents, such as speeches, in order to discover any valuable insights and patterns contained within their content. This chapter is aimed at the novice or start-up level; techniques and processes are therefore explained in a more detailed and visual manner than would otherwise be necessary.
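The chapter builds its pipeline visually with RapidMiner operators rather than code. As a rough sketch of the underlying idea only (tokenize each document, filter stop words, count term frequencies), here is a minimal Python example; the two-speech corpus and the stop word list are invented for illustration and are not the data used in the chapter:

```python
from collections import Counter
import re

# Hypothetical mini-corpus standing in for the speech repository.
speeches = {
    "1961": "ask not what your country can do for you ask what you can do for your country",
    "1863": "government of the people by the people for the people",
}

# Illustrative stop word list (the chapter uses a custom dictionary file).
STOPWORDS = {"the", "of", "by", "for", "can", "do", "what", "not", "you", "your"}

def token_counts(text):
    """Lowercase, tokenize on letter runs, drop stop words, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

for year, text in speeches.items():
    print(year, token_counts(text).most_common(3))
```

The resulting per-document term counts are exactly what a word cloud tool then renders, with font size proportional to frequency.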

