Data Mining Pastes
Finding a needle in a haystack. How to analyze unstructured data sets?
Alexandre Dulaunoy
February 27, 2015
An Introduction to the Dataset
- One day of pastes (31 MB of compressed, unstructured documents)
- Format: one gzip-compressed file per paste (Paste-id.gz)
- Various users pasted information with different objectives
- Some of the pastes include sensitive information (e.g. password leaks, vulnerabilities)
Strategies for Analysis
- Sampling and human analysis
- File-type detection

    ls -1 | parallel --gnu 'zcat {1} | file -'

- Term searching
- What else?
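
The sampling and term-searching strategies above can be sketched in a few lines of Python. This is only a minimal illustration, assuming Python 3 and a directory of `*.gz` pastes; the helper names (`read_paste`, `sample_pastes`, `search_terms`) are hypothetical and not part of the original material.

```python
import gzip
import random
from pathlib import Path


def read_paste(path):
    """Return the decompressed text of one gzip-compressed paste."""
    with gzip.open(path, "rb") as handle:
        # Pastes are unstructured, so undecodable bytes are replaced, not fatal.
        return handle.read().decode("utf-8", errors="replace")


def sample_pastes(directory, k, seed=None):
    """Pick up to k paste files at random for manual review."""
    paths = sorted(Path(directory).glob("*.gz"))
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))


def search_terms(directory, terms):
    """Map each paste file name to the terms it contains (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    hits = {}
    for path in sorted(Path(directory).glob("*.gz")):
        text = read_paste(path).lower()
        found = [term for term in lowered if term in text]
        if found:
            hits[path.name] = found
    return hits
```

For example, `search_terms(".", ["password", "exploit"])` would list the pastes in the current directory that mention either term, and `sample_pastes(".", 20)` would pick twenty files to read by hand. Plain substring matching is a deliberate simplification; regular expressions may suit structured leaks better.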
Processing Textual Data
- TextBlob (built on NLTK) is a simple Python library for processing textual data: noun-phrase extraction, sentence splitting, sentiment analysis, translation, and more.

    pip2 install -U textblob
    python2 -m textblob.download_corpora

- The corpora are installed in the nltk_data directory under your home directory (~/nltk_data) to support the natural language processing functions.
Processing Textual Data - A Minimal Example
    from textblob import TextBlob
    w = TextBlob("This is an interesting project but there is still a lot of work")
    w.noun_phrases
    w.sentiment
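
To apply the same calls to an actual paste, the file first has to be decompressed and decoded. The following is a rough sketch under stated assumptions: Python 3, TextBlob and its corpora already installed, and two hypothetical helpers (`load_text`, `summarize`) that do not come from the original slides. The `limit` parameter is also an assumption, added because large pastes can make the analysis slow.

```python
import gzip


def load_text(path, limit=100000):
    """Read at most `limit` decompressed bytes of a gzip paste as text."""
    with gzip.open(path, "rb") as handle:
        # Replace undecodable bytes, since pastes come in arbitrary encodings.
        return handle.read(limit).decode("utf-8", errors="replace")


def summarize(text):
    """Return noun phrases and sentiment scores for one paste using TextBlob."""
    # Imported here so load_text stays usable without TextBlob installed.
    from textblob import TextBlob

    blob = TextBlob(text)
    return {
        "noun_phrases": list(blob.noun_phrases),
        "polarity": blob.sentiment.polarity,          # -1.0 (negative) to 1.0 (positive)
        "subjectivity": blob.sentiment.subjectivity,  # 0.0 (objective) to 1.0 (subjective)
    }
```

A call such as `summarize(load_text("Paste-id.gz"))` would then give one small dictionary per paste. Many pastes are not English prose, so the noun phrases and sentiment scores should be treated as rough hints.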