
Processing Textual Data



Data Mining Pastes

Finding a needle in a haystack: how to analyze unstructured data sets?

Alexandre Dulaunoy

February 27, 2015


An Introduction to the Dataset

• One day of pastes (31 MB of compressed unstructured documents)

• Format: Paste-id.gz

• Various users pasted information with different objectives

• Some of the pastes include sensitive information (e.g. password leaks, vulnerabilities)
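
A minimal sketch of how a single paste could be read in Python, assuming each file is a gzip-compressed document as the Paste-id.gz format suggests. The function name read_paste is hypothetical, Python 3 is assumed, and undecodable bytes are replaced because pastes are arbitrary user input and may not be valid UTF-8.

```python
import gzip


def read_paste(path, encoding="utf-8"):
    """Return the decompressed content of one paste as text."""
    # Pastes are arbitrary user input, so undecodable bytes are replaced
    # instead of raising an exception.
    with gzip.open(path, "rb") as handle:
        return handle.read().decode(encoding, errors="replace")
```

Calling read_paste on one file of the dataset returns its text, which can then be inspected by hand or passed to further processing.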


Strategies for Analysis

• Sampling and human analysis

• File-type detection

    ls -1 | parallel --gnu 'zcat {1} | file -'

• Term searching

• What else?
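
The sampling and term-searching strategies above could be sketched in Python as follows. This is an illustration under assumptions, not the method used in the slides: the function names sample_pastes and search_pastes are hypothetical, the pastes are assumed to sit as *.gz files in one directory, and matching is a plain case-insensitive substring search.

```python
import glob
import gzip
import os
import random
import re


def sample_pastes(directory, k, seed=None):
    """Pick up to k paste files at random for manual review."""
    paths = sorted(glob.glob(os.path.join(directory, "*.gz")))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(paths, min(k, len(paths)))


def search_pastes(directory, terms):
    """Map each matching paste file name to the sorted terms it contains."""
    if not terms:
        return {}
    # One case-insensitive pattern covering all terms; re.escape keeps
    # characters such as "." or "+" literal.
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.gz"))):
        with gzip.open(path, "rb") as handle:
            text = handle.read().decode("utf-8", errors="replace")
        found = sorted({m.group(0).lower() for m in pattern.finditer(text)})
        if found:
            hits[os.path.basename(path)] = found
    return hits
```

For example, search_pastes(".", ["password", "exploit"]) would list the pastes in the current directory that mention either term, which narrows down what has to be read by a human.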


Processing Textual Data

• TextBlob (built on Python NLTK) is a simple Python library for processing textual data: extracting noun phrases and sentences, sentiment analysis, translation and more.

    pip2 install -U textblob
    python2 -m textblob.download_corpora

• The corpora are installed in ~/nltk_data in your home directory to support the natural language processing functionality.


Processing Textual Data - A Minimal Example

    from textblob import TextBlob

    # Build a TextBlob from a short English sentence
    w = TextBlob("This is an interesting project but there is still a lot of work")
    w.noun_phrases  # noun phrases found in the text
    w.sentiment     # polarity and subjectivity of the text
