Processing Textual Data

Data Mining Pastes

Finding a needle in a haystack: how to analyze unstructured data sets?

Alexandre Dulaunoy

February 27, 2015

An Introduction to the Dataset

- One day of pastes (31 MB of compressed, unstructured documents)

- Format: Paste-id.gz

- Various users pasted information with different objectives

- Some of the pastes include sensitive information (e.g. password leaks, vulnerabilities)
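A minimal sketch of how such a dataset could be loaded, assuming each paste is a separate gzip file named after its paste ID and that all files sit in one directory. The function name `iter_pastes` and the directory name `pastes/` are hypothetical and only serve the illustration.

```python
import gzip
import os


def iter_pastes(directory):
    """Yield (paste_id, text) for every *.gz file found in directory."""
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".gz"):
            continue  # skip anything that is not a compressed paste
        path = os.path.join(directory, name)
        with gzip.open(path, "rb") as handle:
            raw = handle.read()
        # The pastes are unstructured, so the encoding is unknown:
        # decode leniently instead of failing on invalid bytes.
        yield name[:-3], raw.decode("utf-8", errors="replace")
```

Usage would look like `for paste_id, text in iter_pastes("pastes/"): ...`. Reading one file at a time keeps memory use low, which matters little for 31 MB but helps if the same approach is applied to a longer collection period.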

Strategies for Analysis

- Sampling and human analysis

- File-type detection

    ls -1 | parallel --gnu 'zcat {1} | file -'

- Term searching

- What else?
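The sampling and term-searching strategies above can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not a reference implementation: the pastes are assumed to be already decompressed into `(paste_id, text)` pairs, and the helper names `sample_ids` and `search_terms` are hypothetical.

```python
import random
import re


def sample_ids(paste_ids, k, seed=None):
    """Pick up to k paste ids at random for manual review."""
    ids = sorted(paste_ids)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(ids, min(k, len(ids)))


def search_terms(pastes, terms):
    """Map each term to the ids of the pastes that contain it.

    Matching is case-insensitive and literal (terms are escaped,
    so they are not interpreted as regular expressions).
    """
    patterns = {t: re.compile(re.escape(t), re.IGNORECASE) for t in terms}
    hits = {t: [] for t in terms}
    for paste_id, text in pastes:
        for term, pattern in patterns.items():
            if pattern.search(text):
                hits[term].append(paste_id)
    return hits
```

For example, `search_terms(pastes, ["password", "exploit"])` returns a dictionary with one list of paste IDs per term. Literal matching is a deliberately simple starting point; terms such as "password" will also match harmless documentation, so the hits still need the human review that sampling provides.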

Processing Textual Data

- TextBlob (built on NLTK) is a simple Python library for processing textual data: extracting noun phrases and sentences, sentiment analysis, translation, and more.

    pip2 install -U textblob
    python2 -m textblob.download_corpora

- The corpora are installed in ~/nltk_data in your home directory to support the natural language processing functionality.

Processing Textual Data - A Minimal Example

    from textblob import TextBlob

    w = TextBlob("This is an interesting project but there is still a lot of work")
    print(w.noun_phrases)  # noun phrases found in the sentence
    print(w.sentiment)     # (polarity, subjectivity) scores