
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 143–152, Brussels, Belgium, October 31 – November 1, 2018. © 2018 Association for Computational Linguistics. https://doi.org/10.18653/v1/K18-2014

SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations

KyungTae Lim*, Cheoneum Park+, Changki Lee+ and Thierry Poibeau*

* LATTICE (CNRS & ENS / PSL & Université Sorbonne nouvelle / USPC)
+ Department of Computer Science, Kangwon National University

{kyungtae.lim, thierry.poibeau}@ens.fr, {parkce, leeck}@kangwon.ac.kr

Abstract

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) treebank feature representations, (ii) multilingual word representations, and (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st at the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.

1 Introduction

Feature representation methods are an essential element for neural dependency parsing. Methods such as Feed-Forward Neural Networks (FFN) (Chen and Manning, 2014) or LSTM-based word representations (Kiperwasser and Goldberg, 2016; Ballesteros et al., 2016) have been proposed to provide fine-grained token representations, and these methods provide state-of-the-art performance. However, learning efficient feature representations is still challenging, especially for under-resourced languages.

One way to cope with the lack of training data is a multilingual approach, which makes it possible to use different corpora in different languages as training data. In most cases, for instance in the CoNLL 2017 shared task (Zeman et al., 2017), the teams that adopted this approach used a multilingual delexicalized parser (i.e. a multi-source parser trained without taking lexical features into account). However, it is evident that delexicalized parsing cannot capture contextual features that depend on the meaning of words within the sentence.

Following previous proposals promoting a model-transfer approach with lexicalized feature representations (Guo et al., 2016; Ammar et al., 2016; Lim and Poibeau, 2017), we have developed the SEx BiST parser (Semantically EXtended Bi-LSTM parser), a multi-source trainable parser using three different contextualized lexical representations:

• Corpus representation: a vector representation of each training corpus.

• Multilingual word representation: a multilingual word representation obtained by the projection of several pre-trained monolingual embeddings into a unique semantic space (following a linear transformation of each embedding).

• ELMo representation: a token-based representation integrating abundant contexts gathered from external resources (Peters et al., 2018).

In this paper, we extend the multilingual graph-based parser proposed by Lim and Poibeau (2017) with the three above representations.

Our parser is open source and available at https://github.com/CoNLL-UD-2018/LATTICE/.

Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th out of 26 teams, and 78.72 UAS – 2nd out of 26). We obtained very good results for French, English and Korean, where we were able to extensively exploit the three above features (for these languages, we obtained the best UAS performance on all the treebanks, and among the best LAS performance as well). Unfortunately, we were not able to exploit the same strategy for all the languages, due to the lack of a GPU and, correspondingly, of time for training, and also due to a lack of training data for some languages.

The structure of the paper is as follows. We first describe the feature extraction and representation methods (Sections 2 and 3) and then present our POS tagger and our parser based on multi-task learning (Section 4). We then give some details on our implementation (Section 5) and finally provide an analysis of our official results (Section 6).

2 Deep Contextualized Token Representations

The architecture of our parser follows the multilingual LATTICE parser presented in Lim and Poibeau (2017), with the addition of the three feature representations presented in the introduction. The basic token representation is as follows.

Given a sentence of tokens s = (t_1, t_2, .., t_n), the i-th token t_i can be represented by a vector x_i, which is the result of the concatenation (◦) of a word vector w_i and a character-level vector c_i of t_i:

x_i = c_i ◦ w_i
c_i = Char(t_i; θ_c)
w_i = Word(t_i; θ_w)

When the approach is monolingual, w_i corresponds to the external word embeddings provided by Facebook (Bojanowski et al., 2016). Otherwise, we used our own multilingual strategy based on multilingual embeddings (see Section 3.2).

2.1 Character-Level Word Representation

Token t_i can be decomposed as a vector of characters (ch_1, ch_2, .., ch_m) where ch_j is the j-th character of t_i. The function Char (which generates the character-level word vector c_i) corresponds to a vector obtained from the hidden state representation h_j of the LSTM, with an initial state h_0 (m is the length of token t_i)^1:

1 Note that i refers to the i-th token in the sentence and that j refers to the j-th character of the i-th token. Here, we use lower-case italics for vectors and uppercase italics for matrices, so a set of hidden states H_i is a matrix stacked on m characters. In this paper, the letters w and W denote parameters that the system has to learn.

h_j = LSTM^(ch)(h_0, (ch_1, ch_2, .., ch_m))_j
c_i = w_c h_m

For LSTM-based character-level representations, previous studies have shown that the last hidden layer h_m represents a summary of all the information in the input character sequence (Shi et al., 2017). It is then possible to linearly transform it with a parameter w_c so as to get the desired dimensionality. Another representation method involves applying an attention-based linear transformation of the hidden layer matrix H_i, for which attention weights a_i are calculated as follows:

a_i = Softmax(w_att H_i^T)
c_i = a_i H_i

Since we apply the Softmax function, which makes the weights sum up to 1 after a linear transformation of H_i with the attention parameter w_att, the self-attention weight a_i intuitively points to the most informative characters of token t_i for parsing. Finally, by summing the hidden states H_i of each word according to their attention weights a_i, we obtain our character-level word representation vector for token t_i. Most recently, Dozat et al. (2017) suggested an enhanced character-level representation based on the concatenation of h_m and a_i H_i, so as to capture both the summary and the context information in one go for parsing. This is an option that could be explored in the future.

After some empirical experiments, we chose bidirectional LSTM encoders rather than single-directional ones, and then introduced the hidden states H_i into a two-layered Multi-Layer Perceptron (MLP) without bias terms for computing the attention weights a_i:

a_i = Softmax(w_att2 tanh(W_att1 H_i^T))
c_i = a_i H_i

For training, we used the character-level word representations for all the languages except Kazakh and Thai (see Section 5).
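To make this concrete, here is a minimal PyTorch sketch of the attention-based character encoder in its final form above (a character BiLSTM whose hidden states H_i are pooled with a bias-free two-layer MLP attention, a_i = Softmax(w_att2 tanh(W_att1 H_i^T)), c_i = a_i H_i). The class name, dimensions and hyperparameters are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch (not the parser's actual code): attention pooling over a
# character BiLSTM, following a_i = Softmax(w_att2 tanh(W_att1 H_i^T)), c_i = a_i H_i.
import torch
import torch.nn as nn

class CharAttentionEncoder(nn.Module):
    def __init__(self, n_chars, char_dim=64, hidden_dim=100, att_dim=200):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Bidirectional LSTM over the characters of one token.
        self.lstm = nn.LSTM(char_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        # Two-layered MLP without bias terms for the attention weights.
        self.W_att1 = nn.Linear(hidden_dim, att_dim, bias=False)
        self.w_att2 = nn.Linear(att_dim, 1, bias=False)

    def forward(self, char_ids):
        # char_ids: (batch of tokens, max_chars) character indices of each token
        H, _ = self.lstm(self.embed(char_ids))             # (batch, max_chars, hidden_dim)
        scores = self.w_att2(torch.tanh(self.W_att1(H)))   # (batch, max_chars, 1)
        a = torch.softmax(scores, dim=1)                   # attention weight per character
        return (a * H).sum(dim=1)                          # c_i = a_i H_i, (batch, hidden_dim)
```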

2.2 Corpus Representation

Following Lim and Poibeau (2017), we used a one-hot treebank representation strategy to encode language-specific features. In other words, each language has its own set of specific lexical features.

For languages with several training corpora (e.g., French-GSD and French-Spoken), our parser computes an additional feature vector taking into account corpus specificities at the word level. Following the recent work of Stymne et al. (2018), who proposed a similar approach for treebank representations, we chose to use a 12-dimensional vector for the corpus representation.

This representation tr_i is concatenated with the token representation x_i:

tr_i = Treebank(t_i; θ_tr)
x_i = c_i ◦ w_i ◦ tr_i

We used this approach (corpus representation) for 24 corpora, and its effectiveness will be discussed in Section 5.
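As a rough sketch of how such a corpus vector can be attached to each token (x_i = c_i ◦ w_i ◦ tr_i), the snippet below uses a learned 12-dimensional treebank embedding indexed by a treebank id; the module and argument names are hypothetical.

```python
# Illustrative sketch: appending a 12-dimensional treebank (corpus) embedding
# to the character- and word-level vectors, x_i = c_i ◦ w_i ◦ tr_i.
import torch
import torch.nn as nn

class TokenWithTreebank(nn.Module):
    def __init__(self, n_treebanks, tb_dim=12):
        super().__init__()
        self.treebank_embed = nn.Embedding(n_treebanks, tb_dim)

    def forward(self, c, w, treebank_id):
        # c: (batch, n_tokens, char_dim), w: (batch, n_tokens, word_dim)
        # treebank_id: (batch,) index of the treebank each sentence comes from
        tr = self.treebank_embed(treebank_id)            # (batch, tb_dim)
        tr = tr.unsqueeze(1).expand(-1, c.size(1), -1)   # repeat for every token
        return torch.cat([c, w, tr], dim=-1)             # x_i = c_i ◦ w_i ◦ tr_i
```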

2.3 Contextualized Representation

ELMo (Embedding from Language Model (Peters et al., 2018)) is a function that provides a representation based on the entire input sentence. ELMo contextualized embedding is a new technique for word representation that has achieved state-of-the-art performance across a wide range of language understanding tasks. This approach is able to capture both subword and contextual information. As stated in the original paper by Peters et al. (2018), the goal is to "learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer".

We trained our language model as a bidirectional LSTM, using ELMo as an intermediate layer in the bidirectional language model (biLM), and we used the resulting ELMo embeddings to further improve the performance of our model.

R_i = { x_i^LM, →h_i,j^LM, ←h_i,j^LM | j = 1, ..., L }
    = { h_i,j^LM | j = 0, ..., L }    (1)

ELMo_i = E(R_i; Θ) = γ Σ_{j=0}^{L} s_j h_i,j^LM    (2)

In (1), x_i^LM and h_i,0^LM are word embedding vectors corresponding to the token layer, and →h_i,j^LM and ←h_i,j^LM are the hidden vectors of the multi-layer bidirectional LSTM. h_i,j^LM is a concatenated vector composed of x_i^LM and the bidirectional hidden vectors →h_i,j^LM and ←h_i,j^LM. We computed our model with all the biLM layers weighted. In (2), s_j is a trainable softmax-normalized weight over the LSTM layers, and γ is a scalar parameter that allows the model to be trained efficiently.

We used 1024-dimensional ELMo embeddings.
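Equation (2) amounts to a trainable scalar mix of the biLM layers. The sketch below illustrates that weighting step alone, assuming the L+1 layer representations h_i,j^LM have already been produced by the biLM (e.g. via AllenNLP); it is not the actual training code.

```python
# Illustrative sketch of equation (2): ELMo_i = γ · Σ_j softmax(s)_j · h_{i,j}^LM.
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized layer weights
        self.gamma = nn.Parameter(torch.ones(1))        # scalar scale parameter γ

    def forward(self, layers):
        # layers: list of L+1 tensors, each of shape (batch, n_tokens, dim)
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layers))
        return self.gamma * mixed
```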

3 Multilingual Feature Representations

The supervised, monolingual approach to parsing, based on syntactically annotated corpora, has long been the most common one. However, thanks to recent developments involving powerful word representation methods (a.k.a. word embeddings), it is now possible to develop accurate multilingual lexical models by mapping several monolingual embeddings into a single vector space. This multilingual approach to parsing has yielded encouraging results for both low- (Guo et al., 2015) and high-resource languages (Ammar et al., 2016). In this work, we extend the recent multilingual dependency parsing approach proposed by Lim and Poibeau (2017), which achieved state-of-the-art performance during the last CoNLL shared task by using multilingual embeddings mapped based on bilingual dictionaries.

3.1 Embedding Projection

There are different strategies to produce multilingual word embeddings (Ruder et al., 2018), but a very efficient one consists in simply projecting one word embedding on top of the other, so that both representations share the same semantic space (Artetxe et al., 2016). The alternative involves directly generating bilingual word embeddings from bilingual corpora (Gouws et al., 2015; Gouws and Søgaard, 2015), but this requires a large amount of bilingual data aligned at the sentence or document level. This kind of resource is not available for most language pairs, especially for under-resourced languages.

We thus chose to train monolingual word embeddings independently and then map these embeddings onto one another. This approach is powerful since monolingual word embeddings generally share a similar structure (especially if they have been trained on similar corpora) and so can be superimposed with little information loss.

To project embeddings, we applied the linear transformation method using bilingual dictionaries proposed by Artetxe et al. (2017). We took the bilingual dictionaries from OPUS (http://opus.nlpl.eu/) and Wikipedia.

The projection method can be described as follows. Let X and Y be the source and target word embedding matrices, so that x_i refers to the i-th word embedding of X and y_j refers to the j-th word embedding of Y, and let D be a binary matrix where D_ij = 1 if x_i and y_j are aligned. Our goal is to find a transformation matrix W such that x_i W approximates y_j for the aligned pairs. This is done by minimizing the sum of squared errors:

arg min_W  Σ_{i=1}^{m} Σ_{j=1}^{n} D_ij ‖x_i W − y_j‖²

The method is relatively simple since converting a bilingual dictionary into D is quite straightforward. The size of the dictionary used for training is around 250 pairs, and the projected word embedding is around 1.8 GB. The dictionaries and the projected word embeddings are publicly available on GitHub (https://github.com/jujbob/multilingual-models).
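The core of this projection step can be sketched as a plain least-squares fit over the dictionary-aligned rows, as below; the length normalization and orthogonality constraints used by Artetxe et al. (2017) are omitted, so this is only an approximation of their method, with hypothetical function names.

```python
# Illustrative sketch: find W minimizing Σ_ij D_ij ||x_i W − y_j||², using only
# the embedding pairs aligned by the bilingual dictionary.
import numpy as np

def learn_projection(X, Y, pairs):
    """X: (m, d) source embeddings, Y: (n, d) target embeddings,
    pairs: list of (i, j) dictionary alignments, i.e. the cells with D_ij = 1."""
    src = np.stack([X[i] for i, _ in pairs])        # aligned source rows
    tgt = np.stack([Y[j] for _, j in pairs])        # aligned target rows
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)   # least-squares solution of src @ W ≈ tgt
    return W

# Usage: map every source-language vector into the target space.
# X_projected = X @ learn_projection(X, Y, dictionary_pairs)
```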

3.2 Training with Multilingual Embedding

After having trained multilingual embeddings, we associate them with the word representation w_i as follows:

w_i = Word(t_i; θ_mw)

We applied the multilingual embedding mostly to train the nine low-resource languages of the 2018 CoNLL evaluation, for which only a handful of annotated sentences were provided.

4 Multi-Task Learning for Tagging and Parsing

In this section, we describe our Part-Of-Speech (POS) tagger and dependency parser using the encoded token representation x_i, based on Multi-Task Learning (MTL) (Zhang and Yang, 2017).

4.1 Part-Of-Speech Tagger

As presented in Sections 2 and 3, our parser is based on models trained with a combination of features encoding different contextual information. However, the attention mechanism for the character-level word vector c_i focuses only on a limited number of features within the token, and the word representation w_i thus needs to be transformed by a bidirectional LSTM, as a way to capture the overall context of the sentence. Finally, a token is encoded as a vector g_i:

g_i = BiLSTM^(pos)(g_0, (x_1, x_2, .., x_n))_i

We transform the token vector g_i to a vector of the desired dimensionality with a two-layered MLP with a bias term, in order to classify the best universal part-of-speech (UPOS) candidate:

p′_i = W_pos2 leaky_relu(W_pos1 g_i^T) + b_pos
y′_i = arg max_j p′_ij

Finally, we randomly initialize the UPOS embedding p_i and map the predicted UPOS y′_i to a POS vector:

p_i = Pos(y′_i; θ_pos)

4.2 Dependency Parser

To take into account the predicted POS vector in the main target task (i.e. parsing), we concatenate the predicted POS vector p_i with the word representation w_i and then encode the resulting vector via a BiLSTM. This enriches the syntactic representation of the token by back-propagation during training:

v_i = BiLSTM^(dep)(v_0, (x_1, x_2, .., x_n))_i

Following Dozat and Manning (2016), we used a deep bi-affine classifier to score all the possible head and modifier pairs Y = (h, m). We then selected the best dependency graph based on Eisner's algorithm (Eisner and Satta, 1999). This algorithm tries to find the maximum spanning tree among all the possible graphs:

arg max_{valid Y} Σ_{(h,m)∈Y} Score_MST(h, m)
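For reference, a minimal sketch of bi-affine arc scoring in the spirit of Dozat and Manning (2016) is given below; the MLP dimensions, initialization and exact scoring form are illustrative assumptions rather than the configuration actually used.

```python
# Illustrative sketch of deep bi-affine arc scoring: every (head, modifier) pair
# receives a score dep_m^T U head_h + b^T head_h.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, enc_dim, arc_dim=400):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.randn(arc_dim, arc_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(arc_dim))

    def forward(self, v):
        # v: (batch, n_tokens, enc_dim) BiLSTM states of the dependency encoder
        head = self.head_mlp(v)                                  # (batch, n, arc_dim)
        dep = self.dep_mlp(v)                                    # (batch, n, arc_dim)
        scores = torch.einsum('bmd,de,bhe->bmh', dep, self.U, head)
        scores = scores + torch.einsum('e,bhe->bh', self.b, head).unsqueeze(1)
        return scores   # scores[b, m, h]: score of token h as head of modifier m
```

The decoder (here, Eisner's algorithm) then searches for the highest-scoring well-formed tree over these scores.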

With this algorithm, it has been observed that the parsing results (for some sentences) can have multiple roots, which is not a desirable feature. We thus followed an empirical method that selects a unique root based on the word order of the sentence, as already proposed by Lim and Poibeau (2017), to ensure tree well-formedness. After the selection of the best-scored tree, another bi-affine classifier is applied for the classification of relation labels, based on the predicted tree.
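The root-selection heuristic is only described as being based on the word order of the sentence; one plausible minimal reading, sketched below, is to keep the leftmost token predicted as root and reattach every other root-attached token to it.

```python
# Illustrative sketch (one possible reading of the heuristic, not the exact rule):
# keep the leftmost predicted root and reattach the remaining roots to it.
def enforce_single_root(heads):
    """heads[i] is the predicted head of token i (1-based index), 0 meaning root."""
    roots = [i for i, h in enumerate(heads) if h == 0]
    if len(roots) <= 1:
        return heads
    keep = roots[0]                 # leftmost root wins
    fixed = list(heads)
    for r in roots[1:]:
        fixed[r] = keep + 1         # reattach to the kept root (1-based head index)
    return fixed

# Example: enforce_single_root([0, 1, 0, 3]) -> [0, 1, 1, 3]
```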

We trained our tagger and parser simultaneously using a single objective function with penalized terms:

loss = α CrossEntropy(p′, p^(gold))
     + β CrossEntropy(arc′, arc^(gold))
     + γ CrossEntropy(dep′, dep^(gold))

where arc′ and dep′ refer to the predicted arc (head) and dependency (modifier) results.

Since UAS directly affects LAS, we assumed that UAS would be crucial for parsing unseen corpora such as Finnish PUD, as well as other corpora from low-resource languages. Therefore, we gave more weight to the parameters predicting arc′ than to those predicting dep′ and p′, since arc′ directly affects UAS. We set α = 0.1, β = 0.7 and γ = 0.2. Unfortunately, during the testing phase, we did not adjust the weight parameters in a way that would have benefited LAS for the 61 big treebanks, and this made our LAS results on big treebanks suffer a bit (7th) compared to those we obtained on the small and PUD treebanks (3rd). This also explains the gap between the UAS and LAS scores in our overall results.
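A minimal sketch of this penalized objective with the weights above (α = 0.1 for UPOS, β = 0.7 for arcs, γ = 0.2 for labels), assuming each classifier head already produces logits per token, could look as follows; the function signature is hypothetical.

```python
# Illustrative sketch of the multi-task objective:
# loss = α·CE(pos) + β·CE(arc) + γ·CE(label), with α = 0.1, β = 0.7, γ = 0.2.
import torch.nn.functional as F

def multitask_loss(pos_logits, pos_gold, arc_logits, arc_gold,
                   label_logits, label_gold,
                   alpha=0.1, beta=0.7, gamma=0.2):
    # Each *_logits tensor is (n_tokens, n_classes); each *_gold is (n_tokens,) indices.
    return (alpha * F.cross_entropy(pos_logits, pos_gold)
            + beta * F.cross_entropy(arc_logits, arc_gold)
            + gamma * F.cross_entropy(label_logits, label_gold))
```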

5 Implementation Details

In this section, we provide some details on our implementation for the CoNLL 2018 shared task (Zeman et al., 2018b).

5.1 Training

We have trained both monolingual and multilingual models for parsing. In the first case, we simply used the available Universal Dependency 2.2 corpora for training (Zeman et al., 2018a). In the second case, for the multilingual approach, as both multilingual word embeddings and corresponding training corpora (in the Universal Dependency 2.2 format) were required, we concatenated the corresponding available Universal Dependency 2.2 corpora to artificially create multilingual training corpora.

The number of epochs was set to 200, with one epoch processing the entire training corpus in each language and with a batch size of 32. We then picked the best five performing models to parse the test corpora on TIRA (Potthast et al., 2014).

The five models were used as an ensemble run (described in Section 5.2).

Hyperparameters. Each deep learning parser has a number of hyperparameters that can boost the overall performance of the system. In our implementation, most hyperparameter settings were identical to Dozat et al. (2017), except of course those concerning the additional features introduced above. We used 100-dimensional character-level word representations with a 200-dimensional MLP, as presented in Section 2, and for the corpus representation we used a 12-dimensional vector. We set the learning rate to 0.002 with Adam optimization.

Multilingual Embeddings. As described in Section 3, we specifically trained multilingual embedding models for nine low-resource languages. Table 2 gives the list of languages for which we adopted this approach, along with the language used for knowledge transfer. We selected language pairs based on previous studies (Lim and Poibeau, 2017; Lim et al., 2018; Partanen et al., 2018) for bxr, kk, kmr, sme, and hsb, and the others were chosen based on the public availability of bilingual dictionaries (this explains why we chose to map several languages with English, even when there was no real linguistically motivated reason to do so). Since we could not find any pre-trained embeddings for pcm nsc, we applied a delexicalized parsing approach based on an English monolingual model.

ELMo. We used ELMo weights to train specific models for five languages: Korean, French, English, Japanese and Chinese. The ELMo weights were pre-trained using the CoNLL resources provided (http://hdl.handle.net/11234/1-1989). We used AllenNLP (https://github.com/allenai/allennlp) for training, with the default hyperparameters. We included ELMo only at the level of the input layer for both training and inference (we set the dropout to 0.5 and used 1024 dimensions for the ELMo embedding layer in our model). All the other hyperparameters are the same as for our other models (without ELMo).

5.2 Testing

All the tests were done on the TIRA platform provided by the shared task organizers. During the test phase, we applied an ensemble mechanism using five models trained with two different "seeds". The seeds are integers randomly produced by the Python random library and are used to initialize the two parameters W and w (see Section 2). Generally, an ensemble mechanism combines the best-performing models obtained from different seeds, so as to ensure robustness and efficiency. In our case, due to the lack of a GPU, the different models were trained simply based on the use of two different seeds. Finally, the five best-performing models produced by the two seeds were put together to form the ensemble model. This improved the performance by up to 0.6%, but further improvements could be expected by testing with a larger set of seeds.
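How the five models are combined is not detailed above; a common choice, and one plausible reading, is to average the arc score matrices produced by the ensemble members before decoding, roughly as in this sketch.

```python
# Illustrative sketch (the combination scheme is an assumption): average the arc
# scores of the ensemble members, then decode a single tree from the mean scores.
import torch

def ensemble_arc_scores(models, batch):
    with torch.no_grad():
        all_scores = [model(batch) for model in models]   # each: (n_modifiers, n_heads)
        return torch.stack(all_scores).mean(dim=0)        # averaged scores for the decoder
```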


Corpus UAS LAS Rank(UAS) Rank(LAS) Baseline(LAS)

Overall (82) 78.71 73.02 2 4 65.80

Big treebanks only (61) 85.36 80.97 4 7 74.14

PUD treebanks only (5) 76.81 72.34 3 3 66.63

Small treebanks only (7) 75.67 68.12 2 3 55.01

Low-resource only (9) 37.03 23.39 4 5 17.17

Corpus Method UAS(Rank) LAS(Rank)

af afribooms 87.42 (7) 83.72 (8)

grc perseus tr 79.15 (4) 71.63 (8)

grc proiel tr 79.53 (5) 74.46 (8)

ar padt 75.96 (8) 71.13 (10)

hy armtdp tr, mu 53.56 (1) 37.01 (1)

eu bdt 85.72 (7) 81.13 (8)

br keb tr, mu 43.78 (3) 23.65 (5)

bg btb 92.1 (9) 88.02 (11)

bxr bdt tr, mu 36.89 (3) 17.16 (4)

ca ancora 92.83 (6) 89.56 (9)

hr set 90.18 (8) 84.67 (9)

cs cac tr 93.43 (2) 91 (2)

cs fictree tr 94.78 (1) 91.62 (3)

cs pdt tr 92.73 (2) 90.13 (7)

cs pud tr 89.49 (7) 83.88 (9)

da ddt 85.36 (8) 80.49 (11)

nl alpino tr 90.59 (2) 86.13 (5)

nl lassysmall tr 87.83 (2) 84.02 (4)

en ewt tr, el 86.9 (1) 84.02 (2)

en gum tr, el 88.57 (1) 85.05 (1)
en lines tr, el 86.01 (1) 81.44 (2)
en pud tr, el 90.83 (1) 87.89 (1)

et edt 86.25 (7) 82.33 (7)

fo oft tr, mu 48.64 (9) 25.17 (17)

fi ftb tr 89.74 (4) 86.54 (6)

fi pud tr 90.91 (4) 88.12 (6)

fi tdt tr 88.39 (6) 85.42 (7)

fr gsd tr, el 89.5 (1) 86.17 (3)

fr sequoia tr, el 91.81 (1) 89.89 (1)
fr spoken tr, el 79.47 (2) 73.62 (3)

gl ctg tr 84.05 (7) 80.63 (10)

gl treegal tr 78.71 (2) 73.13 (3)

de gsd 82.09 (8) 76.86 (11)

got proiel 73 (6) 65.3 (8)

el gdt 89.29 (8) 86.02 (11)

he htb 66.54 (9) 62.29 (9)

hi hdtb 94.44 (8) 90.4 (12)

hu szeged 80.49 (8) 74.21 (10)

zh gsd tr, el 71.48 (5) 68.09 (5)

id gsd 85.03 (3) 77.61 (10)

ga idt 79.13 (2) 69.1 (4)


it isdt tr 92.41 (6) 89.96 (8)

it postwita tr 77.52 (6) 72.66 (7)

ja gsd tr, el 76.4 (6) 74.82 (6)

ja modern 29.36 (8) 22.71 (8)

kk ktb tr, mu 39.24 (15) 23.97 (9)

ko gsd tr, el 88.03 (2) 84.31 (2)

ko kaist tr, el 88.92 (1) 86.32 (4)

kmr mg tr, mu 38.64 (3) 27.94 (4)

la ittb tr 87.88 (8) 84.72 (8)

la perseus tr 75.6 (3) 64.96 (3)

la proiel tr 73.97 (6) 67.73 (8)

lv lvtb tr 82.99 (8) 76.91 (11)

pcm nsc tr, mu 18.15 (21) 11.63 (18)

sme giella tr, mu 76.66 (1) 69.87 (1)

no bokmaal 91.4 (5) 88.43 (11)

no nynorsk tr 90.78 (8) 87.8 (11)

no nynorsklia tr 76.17 (2) 68.71 (2)

cu proiel 77.49 (6) 70.48 (8)

fro srcmf 91.35 (5) 85.51 (7)

fa seraji 89.1 (7) 84.8 (10)

pl lfg tr 95.69 (8) 92.86 (11)

pl sz tr 92.24 (9) 88.95 (10)

pt bosque 89.77 (5) 86.84 (7)

ro rrt 89.8 (8) 84.33 (10)

ru syntagrus tr 93.1 (4) 91.14 (6)

ru taiga tr 79.77 (1) 74 (2)

sr set 90.48 (10) 85.74 (11)

sk snk 86.81 (11) 82.4 (11)

sl ssj tr 87.18 (10) 84.68 (10)

sl sst tr 63.64 (3) 57.07 (3)

es ancora 91.81 (6) 89.25 (7)

sv lines tr 85.65 (4) 80.88 (6)

sv pud tr 83.44 (3) 79.1 (4)

sv talbanken tr 89.02 (4) 85.24 (7)

th pud tr, mu 0.33 (21) 0.12 (21)

tr imst 69.06 (7) 60.9 (11)

uk iu 85.36 (10) 81.33 (9)

hsb ufal tr, mu 54.01 (2) 43.83 (2)

ur udtb 87.4 (7) 80.74 (10)

ug udt 75.11 (6) 62.25 (9)

vi vtb 49.65 (6) 43.31 (8)

Table 1: Official experiment results for each corpus, where tr (Treebank), mu (Multilingual) and el (ELMo) in the Method column denote the feature representation methods used (see Sections 2 and 3).

Corpus Projected languages UAS rank LAS rank
hy armtdp Greek 1 1
br keb English 3 5
bxr bdt Russian 3 4
fo oft English 9 17
kk ktb Turkish 15 9
kmr mg English 3 4
pcm nsc - 21 18
sme giella Finnish+Russian 1 1
th pud English 21 21
hsb ufal Polish 2 2

Table 2: Languages trained with multilingual word embeddings and their ranking.

Representation Methods UAS LAS

baseline 81.79 78.45

+em 83.39 80.15

+em, tr 83.67 80.64

+em, el 85.47 82.72

+em, tr, el 85.49 82.93

Table 3: Relative contribution of the different representation methods on the overall results.


5.3 Hardware Resources

The training process for all the language models, with the ensemble and ELMo, took approximately two weeks using 32 CPUs and 7 GPUs (GeForce 1080Ti). The memory usage of each model depends on the size of the external word embeddings (3 GB of RAM by default, plus the amount needed for loading the external embeddings). In the testing phase on the TIRA platform, we submitted our models separately, since testing with a model trained with ELMo takes around three hours. Testing took 46.2 hours for the 82 corpora using 16 CPUs and 16 GB of RAM.

6 Results

In this section, we discuss the results of our system and the relative contributions of the different features to the global results.

Overall results. The official evaluation results are given in Table 1. Our system achieved 73.02 LAS (4th out of 26 teams) and 78.71 UAS (2nd out of 26).

The comparison of our results with those obtained by other teams shows that there is room for improvement regarding preprocessing. For example, our system is 0.86 points below HIT-SCIR (Harbin) for sentence segmentation and 1.03 for tokenization (HIT-SCIR obtained the best overall results). These two preprocessing tasks (sentence segmentation and tokenization) affect tagging and parsing performance directly. As a result, our parser ranked second on small treebanks (LAS), where most teams used the default segmenter and tokenizer, which avoids differences on this aspect.

In contrast, we achieved 7th place on the big treebanks, probably because there is a more significant gap (1.72) here at the tokenization level.

Corpus Representation. Results with corpus representation (corpora marked tr in the Method column of Table 1) exhibit relatively better performance than those without it, since tr makes it possible to capture corpus-oriented features. Results were positive not only for small treebanks (e.g., cs fictree and ru taiga) but also for big treebanks (e.g., cs cac and ru syntagrus). Corpus representation with ELMo shows the best performance for parsing English and French.

Multilinguality. As described in Section 3, we applied the multilingual approach to most of the low-resource languages. The best result was obtained for hy armtdp, while sme giella and hsb ufal also gave satisfactory results. We only applied the delexicalized approach to pcm nsc, since we could not find any pre-trained embeddings for this language. We got a relatively poor result for pcm nsc despite testing different strategies and different feature combinations (we assume that the English model is not fit for it).

Additionally, we found that character-level representation is not always helpful, even in the case of some low-resource languages. When we tested kk ktb (Kazakh) trained with a Turkish corpus, with multilingual word embeddings and character-level representations, the performance dramatically decreased. We suspect this has to do with the writing systems (Arabic versus Latin), but this theory should be further investigated.

sme giella is another exceptional case, since we chose to use a multilingual model trained with three different languages. Although Russian and Finnish do not use the same writing system, applying character and corpus representations improves the results. This is because the size of the training corpus for sme giella is around 900 sentences, which seems to be enough to capture its main characteristics.

Language Model (ELMo). We used ELMo embeddings for five languages: Korean, French, English, Japanese and Chinese (they are marked with el in the Method column of Table 1). The experiments with ELMo models showed excellent overall performance. All the English corpora, fr gsd and fr sequoia in French, and ko kaist in Korean obtained the best UAS. We also obtained the best LAS for English en gum and en pud, and for fr sequoia in French.


Task Precision Recall F1 (Rank)
Event Extraction 58.93 43.12 49.80 (1)
Negation Resolution 99.08 41.06 58.06 (12)

Opinion Analysis 63.91 56.88 60.19 (9)

Task LAS MLAS BLEX

Intrinsic Evaluation 84.66 (1) 72.93 (3) 77.62 (1)

Table 4: Official evaluation results on the three EPE tasks (see https://goo.gl/3Fmjke).

Contributions of the Different System Components to the General Results. To analyze the effect of the proposed representation methods on parsing, we evaluated four different models with different components. We set our baseline model with a token representation x_i = w_i ◦ c_i ◦ p_i, where w_i is a randomly initialized word vector, c_i is a character-level word vector and p_i is a POS vector predicted by UDPipe 1.1 (note that we did not apply our 2018 POS tagger here, since it is trained jointly with the parser and that affects the overall feature representation). We then initialized the word vector w_i with the external word embeddings provided by the CoNLL shared task organizers. We also re-ran the experiment, adding treebank and ELMo representations. The results are shown in Table 3 (em denotes the use of the external word embeddings, and tr and el denote the treebank and ELMo representations, respectively). We observe that each representation improves the overall results. This is especially true regarding LAS when using ELMo (el), which means this representation has a positive effect on relation labeling.

Extrinsic Parser Evaluation (EPE 2018). Participants in the CoNLL shared task were invited to also participate in the 2018 Extrinsic Parser Evaluation (EPE) campaign (http://epe.nlpl.eu/) (Fares et al., 2018), as a way to confirm the applicability of the developed methods on practical tasks. Three downstream tasks were proposed this year in the EPE: biomedical event extraction, negation resolution and opinion analysis (each task was run independently from the others). For this evaluation, participants were only required to send a parsed version of the different corpora received as input back to the organizers, using a UD-type format (the organizers then ran the different scripts related to the different tasks and computed the corresponding results). We trained one single English model for the three tasks using the three English corpora provided (en lines, en ewt, en gum) without treebank embeddings (tr), since we did not know which corpus embedding would perform better. In addition, we did not apply our ensemble process on TIRA since it would have been too time consuming.

Our results are listed in Table 4. They include an intrinsic evaluation (overall performance of the parser on the different corpora considered as a whole) (Nivre and Fang, 2017) and task-specific evaluations (i.e. results for the three different tasks). In the intrinsic evaluation, we obtained the best LAS among all the participating systems, which confirms the portability of our approach across different domains. As for the task-specific evaluations, we obtained the best result for event extraction, but our parser did not perform so well on negation resolution and opinion analysis. This means that specific developments would be required to properly address these two tasks, taking semantics into consideration.

7 Conclusion

In this paper, we described the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task. Our system was an extension of our 2017 parser (Lim and Poibeau, 2017) with three deep contextual representations (multilingual word representations, corpus representations, ELMo representations). It also included a multi-task learning process able to simultaneously handle tagging and parsing. SEx BiST achieved 73.02 LAS (4th over 26 teams) and 78.72 UAS (2nd out of 26) over the 82 test corpora of the evaluation. In the future, we hope to improve our sentence segmenter and our tokenizer, since this seems to be the most obvious target for improvements to our system. The generalization of the ELMo representation to new languages (beyond what we could do for the 2018 evaluation) should also have a positive effect on the results.

Acknowledgments

KyungTae Lim is supported by the ANR ERA-NET ATLANTIS project. This work is also supported by the LAKME project funded by IDEX PSL (ANR-10-IDEX-0001-02). Lastly, we want to thank the National Science & Technology Information (NTIS) for providing computing resources and the Korean national R&D corpora.

References

Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2016. One parser, many languages. CoRR abs/1602.01595. http://arxiv.org/abs/1602.01595.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, USA, pages 2289–2294.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462. aclweb.org/anthology/P17-1042.

Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack LSTM parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2005–2010. https://aclweb.org/anthology/D16-1211.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734. http://arxiv.org/abs/1611.01734.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pages 20–30. http://www.aclweb.org/anthology/K/K17/K17-3002.pdf.

Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and head automaton grammars. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics. Association for Computational Linguistics, pages 457–464.

Murhaf Fares, Stephan Oepen, Lilja Øvrelid, Jari Björne, and Richard Johansson. 2018. The 2018 Shared Task on Extrinsic Parser Evaluation. On the downstream utility of English Universal Dependency parsers. In Proceedings of the 22nd Conference on Natural Language Learning. Brussels, Belgium.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. Bilbowa: Fast bilingual distributed representations without word alignments. In Proceedings of the 32nd International Conference on Machine Learning. Lille, France, volume 37 of Proceedings of Machine Learning Research, pages 748–756.

Stephan Gouws and Anders Søgaard. 2015. Simple task-specific bilingual word embeddings. In HLT-NAACL. The Association for Computational Linguistics, pages 1386–1390.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of ACL.

Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In AAAI, pages 2734–2740.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327. https://transacl.org/ojs/index.php/tacl/article/view/885.

KyungTae Lim, Niko Partanen, and Thierry Poibeau. 2018. Multilingual Dependency Parsing for Low-Resource Languages: Case Studies on North Saami and Komi-Zyrian. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan.

KyungTae Lim and Thierry Poibeau. 2017. A system for multilingual dependency parsing based on bidirectional LSTM feature representations. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pages 63–70. http://www.aclweb.org/anthology/K/K17/K17-3006.pdf.

Joakim Nivre and Chiao-Ting Fang. 2017. Universal dependency evaluation. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 86–95.

Niko Partanen, KyungTae Lim, Michael Rießler, and Thierry Poibeau. 2018. Dependency parsing of code-switching data with cross-lingual feature representations. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 1–17.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. CoRR abs/1802.05365. http://arxiv.org/abs/1802.05365.


Martin Potthast, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2014. Improving the reproducibility of PAN's shared tasks: Plagiarism detection, author identification, and author profiling. In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York, pages 268–299. https://doi.org/10.1007/978-3-319-11382-1_22.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2018. A survey of cross-lingual word embedding models.

Tianze Shi, Felix G. Wu, Xilun Chen, and Yao Cheng. 2017. Combining global models for parsing universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pages 31–39. http://www.aclweb.org/anthology/K/K17/K17-3003.pdf.

Sara Stymne, Miryam de Lhoneux, Aaron Smith, and Joakim Nivre. 2018. Parser training with heterogeneous treebanks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers). Melbourne, Australia.

Dan Zeman et al. 2018a. Universal Dependencies 2.2 CoNLL 2018 shared task development and test data. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague. http://hdl.handle.net/11234/1-2184.

Daniel Zeman, Filip Ginter, Jan Hajič, Joakim Nivre, Martin Popel, Milan Straka, et al. 2017. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, pages 1–20.

Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018b. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Brussels, Belgium, pages 1–20.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.
