M1 LI

PoS Tagging with a Perceptron

Guillaume Wisniewski guillaume.wisniewski@u-paris.fr

September 2021

The goal of this lab is to implement a tagger that relies on the perceptron algorithm to predict the PoS of the words of a sentence, and to evaluate it on the corpora of the UD project. This lab is made of four parts that correspond to the usual steps of a ML project:

1. reading the corpora;

2. feature extraction;

3. parameter estimation;

4. evaluation.

In all our experiments we will consider the French GSD corpus.

1 Corpus reading

• Write a function that takes a filename as parameter and returns a list of examples. Each example is a pair (observation, label) in which observation is a sentence (a sequence of words) and label is a sequence of labels (one label per word).

• Plot the distribution of labels in the train and test corpora. How many examples are there in the train set? In the test set?

A description of the CoNLL-U format can be found at https://universaldependencies.org/format.html. When reading the data, it is necessary to pay attention to multiword tokens (indexed with integer ranges).
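The reading step above can be sketched as follows. This is a minimal reader, assuming a plain CoNLL-U file; the function name and the decision to skip multiword tokens (IDs like "3-4") and empty nodes (IDs like "5.1") follow the format description, but everything else is illustrative.

```python
def read_conllu(filename):
    """Return a list of (words, tags) pairs, one per sentence."""
    examples = []
    words, tags = [], []
    with open(filename, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith("#"):          # sentence-level comments
                continue
            if not line:                      # blank line = end of sentence
                if words:
                    examples.append((words, tags))
                    words, tags = [], []
                continue
            fields = line.split("\t")
            # skip multiword tokens ("1-2") and empty nodes ("1.1")
            if "-" in fields[0] or "." in fields[0]:
                continue
            words.append(fields[1])           # FORM column
            tags.append(fields[3])            # UPOS column
    if words:                                 # file may not end with a blank line
        examples.append((words, tags))
    return examples
```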


2 Feature extraction

We will consider eight simple feature templates to describe the i-th word of a sentence w = w1, ..., wn:

• the current word wi;

• the previous word wi−1;

• the following word wi+1;

• the word wi+2;

• the word wi−2;

• a bias (i.e. a feature that is always present);

• a binary feature that is true when the word starts with a capital letter;

• a binary feature that is true when the word contains at least one number.

As usual, the feature vector will be represented by a sparse vector: each word is described by a list of strings corresponding to its non-zero features. For instance, the list of features describing the 3rd word of the sentence “Mélina et Mélio dorment dans leur chambre.” is: "curr_word_Mélio", "prev_word_et", "prev_prev_word_Mélina", "next_word_dorment", "next_next_word_dans", "bias", "starts_with_upper"

1. Write a function that takes a corpus (a list of sentences and their labels) as input and returns a list of pairs (feature vector, label).

2. What is the dimension of the feature vectors? How many non-zero features are there in general?
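The feature templates above can be sketched as follows. The padding tokens used for positions outside the sentence are an assumption (the lab does not specify how to handle sentence boundaries), and the exact feature-name strings beyond those shown in the example are illustrative.

```python
def extract_features(words, i):
    """Return the list of non-zero (string) features for the i-th word."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>", "</s>"]
    j = i + 2                                  # index in the padded sentence
    word = padded[j]
    features = [
        "curr_word_" + word,
        "prev_word_" + padded[j - 1],
        "prev_prev_word_" + padded[j - 2],
        "next_word_" + padded[j + 1],
        "next_next_word_" + padded[j + 2],
        "bias",                                # always-on feature
    ]
    if word[:1].isupper():
        features.append("starts_with_upper")
    if any(c.isdigit() for c in word):
        features.append("contains_digit")
    return features

def corpus_to_features(corpus):
    """Turn a list of (words, tags) pairs into (feature list, tag) pairs."""
    return [(extract_features(words, i), tag)
            for words, tags in corpus
            for i, tag in enumerate(tags)]
```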

3 Averaged perceptron implementation

We will now implement the averaged perceptron algorithm to estimate the parameters of the tagger.
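A hedged sketch of the averaged perceptron, assuming the sparse feature lists from the previous section. Weights are stored per label as dicts, and the averaging uses the standard timestamp trick so that each update costs time proportional to the number of active features. All names are illustrative, not the expected solution.

```python
from collections import defaultdict

def predict(weights, features, labels):
    """Return the label with the highest score for a sparse feature list."""
    return max(labels,
               key=lambda y: sum(weights[y].get(f, 0.0) for f in features))

def train_averaged(data, labels, epochs=5):
    """data: list of (feature list, gold label) pairs. Returns averaged weights."""
    weights = {y: defaultdict(float) for y in labels}
    totals  = {y: defaultdict(float) for y in labels}  # sum of weights over time
    stamps  = {y: defaultdict(int)   for y in labels}  # last update time
    t = 0

    def bump(y, f, delta):
        # accumulate the weight's contribution since its last update,
        # then apply the perceptron update
        totals[y][f] += (t - stamps[y][f]) * weights[y][f]
        stamps[y][f] = t
        weights[y][f] += delta

    for _ in range(epochs):
        for features, gold in data:
            t += 1
            pred = predict(weights, features, labels)
            if pred != gold:
                for f in features:
                    bump(gold, f, +1.0)
                    bump(pred, f, -1.0)

    # flush remaining contributions and divide by the number of steps
    averaged = {}
    for y in labels:
        averaged[y] = {f: (totals[y][f] + (t - stamps[y][f]) * w) / max(t, 1)
                       for f, w in weights[y].items()}
    return averaged
```

The averaged weights are then used with the same `predict` function at test time.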

