NATURAL LANGUAGE PROCESSING: RECENT ADVANCES & DEEP LEARNING ARCHITECTURES

(1)

Vincent Guigue

(2)

Introduction

(3)

Textual data = different analysis levels

At the document scale:

⇒ Several efficient algorithms to classify/categorize

At the sentence scale:

Segmentation, POS, NER, SRL...

⇒ Sequential approaches (HMM, CRF)

At the word scale:

Defining / understanding the word

Linking the word with its antonyms, synonyms, etc.

Binding the word to a lexical field

⇒ Latent semantics / linguistic resources

⇒ How deep learning / representation learning can improve our previous solutions?


(4)

History, data sources & paradigms

1980-2000: From NLP to machine learning

TREC / MUC: cheap access to large datasets

⇒ Several breakthroughs in IR, information extraction, ...

Linguistic resources are not so cheap

ML algorithms progressively become crucial in NLP

Algorithmic resources = article description + re-implementation

2000-2010: Internet era

New resources: large, open & rich corpora, Wikipedia

New computing resources: cheap clusters, GPU

Sharing of efficient implementations (SVM, boosting)

ML becomes pervasive

2010-: GitHub era

Easy / encouraged source sharing

Embedding sharing ⇒ real interest of representation learning

(5)

Deep learning & representation learning

General philosophy: building a virtuous cycle

Tackling raw data directly

Learning a representation that can be useful for several tasks...

... each task improving the representation

Important keyword = embedding

From a standard chain...

Raw textual data → Pre-processing (expert know-how) → High-dimensional vector representation → Machine learning algorithm (SVM, trees, logistic regression, ...) → Task, with a learning mechanism

(6)

... to a deep learning architecture:

Raw textual data → Representation learning step (convolution, fully connected, ...) → Learned representation (embedding) → Decision step(s) → Tasks, with a learning mechanism

(7)

NLP & representation learning

1990: Matrix factorization

PCA is a historical way to learn representations

Criterion = reconstruction

In NLP ⇒ SVD / PCA [Deerwester, 1990]

2005: First neural architecture for text representation

an alternative to PLSA [Keller, 2005]

2008: Convolutional neural architecture for text

precursor of modern architectures

multi-task learning [Collobert, 2008]

2012: The Word2Vec wave

High-quality & cheap word embeddings [Mikolov, 2012]

2013: Manifesto for representation learning [Bengio, 2013]

2015: Seq2seq paradigm [Luong, 2015]

⇒ Also an approximate schedule of the presentation

(8)

Deep architectures

(9)


Representation learning in text

Task = Is the word w present in document d?

Close to the LSA / PLSA paradigm:

compressing the original data matrix through embeddings

⇒ Learning a language model

of words (resp. documents); these two representations are then concatenated and transformed non-linearly in order to obtain the target score using MLP_T, as summarized in (2):

$$\mathrm{NNTR}(w_j, d_i) = \mathrm{MLP}_T\{[\mathrm{MLP}_W(w_j), \mathrm{MLP}_D(d_i)]\}. \qquad (2)$$

All the parameters of the model are trained jointly on a text corpus, assigning high scores to pairs (w_j, d_i) corresponding to documents d_i containing words w_j and low scores for all the other pairs.

A naive criterion would be to maximize the likelihood of the correct class, however, doing so would give the same weight to each seen example. Note that our data presents a huge imbalance between the number of positive pairs and the number of negative pairs (each document only contains a fraction of words of the vocabulary). Thus, the model would quickly be biased towards answering negatively and would then have difficulties in learning anything else. Another kind of imbalance specific to our data is that, among the positive examples, a few words tend to appear really often, while a lot appear only in few documents, which would bias the model to give lower probabilities to pairs with infrequent words independently of the document.

[Figure 1: The NNTR model. One word (one-hot encoding) → word MLP → word representation; one document (TF-IDF encoding) → document MLP → document representation; both representations → word/document MLP.]

Task      | TFIDF | PLSA  | NNTR
Retrieval | 0.170 | 0.199 | 0.215
Filtering | 0.185 | 0.189 | 0.192

Figure 2: Compared mean Averaged Precisions (the higher the better) for Document Retrieval and Batch Document Filtering tasks.

The approach we thus propose does not try to obtain probabilities but simply unnormalized scores.

Thanks to this additional freedom, we can help the optimization process by weighting the training examples in order to balance the total number of positive examples (words w_j that are indeed in documents d_i) with the total number of negative examples. We can also balance each positive example independently of its document frequency (the number of documents in which the word appears). The criterion we thus optimize can be expressed as follows:

$$C = \frac{1}{L^-} \sum_{l=1}^{L^-} Q(x_l, -1) + \frac{1}{M} \sum_{j=1}^{M} \frac{1}{df_j} \sum_{i=1}^{df_j} Q((d_i, w_j), 1), \qquad (3)$$

where L⁻ is the number of negative examples, M the number of words in the dictionary extracted from the training set, df_j the number of documents the word w_j appears in, and Q(x, y) the cost function for an example x = (d, w) and its target y. Note that using this weighting technique we do not need to present the whole negative example set but a sub-sampling of it at each iteration, which makes stochastic gradient descent training much faster. Furthermore, we use a margin-based cost function Q(x, y) = |1 − y·NNTR(x)|₊, as proposed in [4], where |z|₊ = max(0, z).
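To make criterion (3) concrete, here is a minimal NumPy sketch. It assumes the model scores NNTR(x) have already been computed; the function names `hinge` and `weighted_criterion` and the toy scores are illustrative and do not come from the paper.

```python
import numpy as np

def hinge(score, target):
    """Margin-based cost Q(x, y) = max(0, 1 - y * score)."""
    return np.maximum(0.0, 1.0 - target * score)

def weighted_criterion(neg_scores, pos_scores_by_word):
    """Balanced criterion in the spirit of (3).

    neg_scores: scores of the (sub-sampled) negative pairs.
    pos_scores_by_word: one list per word w_j, holding the scores of the
        df_j documents that contain w_j.
    """
    # Average cost over the L^- negative examples.
    neg_term = hinge(np.asarray(neg_scores), -1.0).mean()
    # Each word first averages over its own df_j documents,
    # then all M words are averaged with equal weight.
    per_word = [hinge(np.asarray(s), 1.0).mean() for s in pos_scores_by_word]
    pos_term = np.mean(per_word)
    return neg_term + pos_term

# Toy scores (hypothetical values, for illustration only).
neg = [-2.0, 0.5, -0.2]
pos = [[1.5, 0.0], [2.0]]
print(weighted_criterion(neg, pos))
```

With this weighting, a frequent word and a rare word contribute equally to the positive term, which is the balancing effect described above.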

4 Experiments

TDT2 is a database of transcripted broadcast news in American English. For this experiment we used 24 823 documents from a manually produced transcription and segmentation, referred to in

Conclusion:

Word & document representations are more efficient than PLSA ones for several TREC information retrieval tasks

M. Keller, S. Bengio, ICANN 2005

A neural network for text representation

(10)

Convolutional Neural Architecture

Lookup table concept

= table of embeddings

State of the art on: POS, NER, SRL

Quite difficult to set up...

... but open source + open embeddings

Based on Torch, by Collobert

R. Collobert, J. Weston, ICML 2008

A unified architecture for natural language processing: Deep neural networks with multitask learning
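The lookup table concept can be sketched as a simple matrix indexed by word identifiers. The vocabulary, the embedding size and the random initialization below are assumptions chosen for illustration; in the actual architecture the table is a trainable layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary: each word is mapped to a row index.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embedding_dim = 4  # illustrative size

# The lookup table: one embedding (row) per word of the vocabulary.
lookup_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(sentence):
    """Replace each word of the sentence by its embedding (row lookup)."""
    ids = [vocab[w] for w in sentence.split()]
    return lookup_table[ids]

sentence_matrix = embed("the cat sat on the mat")
print(sentence_matrix.shape)  # (6, 4)
```

The sentence keeps its word order: the result is one row per token, which is what a convolution layer can then slide over.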

(14)

Convolutional Neural Architecture (2)

Several important points:

Embeddings are learned while keeping the sentence structure

Embeddings benefit from multi-task learning

Learning is slow...

Our embeddings have been trained for about 2 months, over Wikipedia.

https://ronan.collobert.com/senna/

... But inference is fast & efficient

+ Costly embeddings have been made available to the community

(15)

All modern ingredients

Raw textual data → Representation learning step (convolution, fully connected, ...) → Learned representation (embedding) → Decision step(s) → Tasks (language modeling), with a learning mechanism

First task = modeling language

Embeddings are made available

Several tasks can be improved

(16)

Word2Vec: extracting local semantics

[Figure: Skip-gram. A sliding window (size = 5) moves over "the cat sat on the mat"; the center word w(t) = "sat" is projected and used to predict its context w(t-2), w(t-1), w(t+1), w(t+2).]

[Figure: CBOW. With the same sliding window (size = 5), the context words w(t-2), w(t-1), w(t+1), w(t+2) are summed (SUM) to predict the center word w(t) = "sat".]

Skip-gram (& negative sampling implementation): easy learning

Local analysis

Predictive criterion: estimating the missing value

Sliding window = local context C of word w

Mikolov, Sutskever, Chen, Corrado, Dean, NIPS 2013 (arXiv 2012)

Distributed representations of words and phrases and their compositionality
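The sliding window idea can be sketched in a few lines: for each position t, the center word w(t) is paired with every other word of its window. The function name `skipgram_pairs` is illustrative; real implementations add frequent-word subsampling and negative sampling on top of this.

```python
def skipgram_pairs(tokens, window_size=5):
    """Build (center, context) pairs with a sliding window centered on each word."""
    half = window_size // 2
    pairs = []
    for t, center in enumerate(tokens):
        # The window is truncated at the sentence boundaries.
        lo, hi = max(0, t - half), min(len(tokens), t + half + 1)
        for k in range(lo, hi):
            if k != t:
                pairs.append((center, tokens[k]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens)
print([c for w, c in pairs if w == "sat"])  # ['the', 'cat', 'on', 'the']
```

Skip-gram uses each pair as "predict the context from the center word"; CBOW would instead group the context words of one window and predict the center word.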

(17)

W2V: detailed explanation

Given a word w and its local contexts C:

Skip-gram idea:

$$\arg\max_{\theta} \prod_{C} \prod_{w \in C} p(C \mid w; \theta)$$

$p(D = 1 \mid w_i, w_j; \theta)$ ⇒ probability that $w_i$ and $w_j$ occur in the same context

[Figure: vector space, with a context C]

$$\arg\max_{\theta} \prod_{i,j \in C} p(D = 1 \mid w_i, w_j; \theta) \; \underbrace{\prod_{i,j \in \bar{C}} p(D = 0 \mid w_i, w_j; \theta)}_{\text{Negative sampling}}$$

Goldberg, Levy, arXiv 2014

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method

Hammer, NN 2002

Generalized Relevance Learning Vector Quantization

(18)

W2V: Skip-gram & negative sampling

$$\arg\max_{\theta} \prod_{i,j \in C} p(D = 1 \mid w_i, w_j; \theta) \; \underbrace{\prod_{i,j \in \bar{C}} p(D = 0 \mid w_i, w_j; \theta)}_{\text{Negative sampling}}$$

Using the logistic function:

$$p(D = 1 \mid w_i, w_j) = \frac{1}{1 + \exp(-z_i \cdot z_j)}$$

Global log-likelihood:

$$Z^\star = \arg\max_{Z} \left[ \sum_{i,j \in C} \log \sigma(z_i \cdot z_j) + \sum_{i,j \in \bar{C}} \log \sigma(-z_i \cdot z_j) \right]$$

σ: sigmoid function, C: set of co-occurrences, C̄: set of non-co-occurrences

Stochastic Gradient Descent + triplet loss

Frequent word subsampling trick: each occurrence of word $w_i$ is discarded with probability

$$p(w_i) = 1 - \sqrt{\frac{t}{\mathrm{freq}(w_i)}}, \quad t = 10^{-5}$$
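The log-likelihood above can be sketched with NumPy as follows. This is a toy illustration under several assumptions: a single embedding matrix Z is used for both roles (following the notation of the slide, whereas word2vec keeps separate word and context vectors), the pairs and the learning rate are made up, and the updates are plain per-pair gradient steps rather than a triplet loss.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_log_likelihood(Z, positives, negatives):
    """Sum of log sigma(z_i . z_j) over C and log sigma(-z_i . z_j) over C-bar."""
    ll = 0.0
    for i, j in positives:
        ll += np.log(sigmoid(Z[i] @ Z[j]))
    for i, j in negatives:
        ll += np.log(sigmoid(-(Z[i] @ Z[j])))
    return ll

def sgd_step(Z, i, j, label, lr=0.1):
    """One stochastic gradient ascent step on a single pair.

    label = 1 for a co-occurring pair (C), 0 for a negative sample (C-bar).
    """
    zi, zj = Z[i].copy(), Z[j].copy()
    g = label - sigmoid(zi @ zj)  # derivative of the pair's log-likelihood
    Z[i] += lr * g * zj
    Z[j] += lr * g * zi

def discard_probability(freq, t=1e-5):
    """Subsampling: probability of dropping one occurrence of a word of frequency freq."""
    return max(0.0, 1.0 - np.sqrt(t / freq))

rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(4, 3))   # 4 words, latent dimension 3
positives = [(0, 1), (2, 3)]             # hypothetical co-occurrences
negatives = [(0, 2), (1, 3)]             # hypothetical negative samples

before = sgns_log_likelihood(Z, positives, negatives)
for _ in range(100):
    for i, j in positives:
        sgd_step(Z, i, j, 1)
    for i, j in negatives:
        sgd_step(Z, i, j, 0)
after = sgns_log_likelihood(Z, positives, negatives)
print(before, after)
```

Co-occurring words are pulled together and sampled negatives are pushed apart, so the log-likelihood increases over the iterations.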

(19)

W2V: Parametrization & results

Choosing a large latent space: dimension z ∈ [500, 1000] (vs PLSA/LDA: z ∈ [10, 100])

Operates on large & small corpora...

... very fast!

(20)

W2V: Playing in the latent space

a is to b what c is to ??? ⇔ $z_b - z_a + z_c$

Syntactical property (1):

$$z_{woman} - z_{man} \approx z_{queen} - z_{king}$$

$$z_{kings} - z_{king} \approx z_{queens} - z_{queen}$$

Query: $z_{woman} - z_{man} + z_{king} = z_{req}$

Nearest neighbor: $\arg\min_i \| z_{req} - z_i \| = \text{queen}$
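The query and nearest-neighbor search can be sketched as below. The 3-dimensional vectors are hand-made so that the analogy holds exactly; they are not trained embeddings. Excluding the three query words from the candidates is a common practical convention, assumed here.

```python
import numpy as np

# Hand-made toy embeddings (assumption): axis 1 encodes gender, axis 2 royalty.
Z = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.0, 0.0]),
}

def analogy(a, b, c):
    """a is to b what c is to ???: nearest neighbor of z_b - z_a + z_c."""
    z_req = Z[b] - Z[a] + Z[c]
    candidates = [w for w in Z if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(z_req - Z[w]))

print(analogy("man", "woman", "king"))  # queen
```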

(21)

W2V: Playing in the latent space

a is to b what c is to ??? ⇔ $z_b - z_a + z_c$

Syntactical property (2):

Query: $z_{easy} - z_{easiest} + z_{luckiest} = z_{req}$

Nearest neighbor: $\arg\min_i \| z_{req} - z_i \| = \text{lucky}$

(22)

W2V: Playing in the latent space

a is to b what c is to ??? ⇔ $z_b - z_a + z_c$

Semantic Property (1)

(23)

W2V: Playing in the latent space

a is to b what c is to ??? ⇔ $z_b - z_a + z_c$

Semantic Property (2)

Phrase        | NEG-15 with 10⁻⁵ subsampling | HS with 10⁻⁵ subsampling
Vasco de Gama | Lingsugur                    | Italian explorer
Lake Baikal   | Great Rift Valley            | Aral Sea
Alan Bean     | Rebbeca Naomi                | moonwalker
Ionian Sea    | Ruegen                       | Ionian Islands
chess master  | chess grandmaster            | Garry Kasparov

Table 4: Examples of the closest entities to the given short phrases, using two different models.

Czech + currency | Vietnam + capital | German + airlines      | Russian + river | French + actress
koruna           | Hanoi             | airline Lufthansa      | Moscow          | Juliette Binoche
Check crown      | Ho Chi Minh City  | carrier Lufthansa      | Volga River     | Vanessa Paradis
Polish zolty     | Viet Nam          | flag carrier Lufthansa | upriver         | Charlotte Gainsbourg
CTK              | Vietnamese        | Lufthansa              | Russia          | Cecile De

Table 5: Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.

To maximize the accuracy on the phrase analogy task, we increased the amount of the training data by using a dataset with about 33 billion words. We used the hierarchical softmax, dimensionality of 1000, and the entire sentence for the context. This resulted in a model that reached an accuracy of 72%. We achieved lower accuracy 66% when we reduced the size of the training dataset to 6B words, which suggests that the large amount of the training data is crucial.

To gain further insight into how different the representations learned by different models are, we did inspect manually the nearest neighbours of infrequent phrases using various models. In Table 4, we show a sample of such comparison. Consistently with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.

5 Additive Compositionality

We demonstrated that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Interestingly, we found that the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5.

The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. Thus, if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors will result in such a feature vector that is close to the vector of “Volga River”.

6 Comparison to Published Word Representations

Many authors who previously worked on the neural network based representations of words have published their resulting models for further use and comparison: amongst the most well known authors are Collobert and Weston [2], Turian et al. [17], and Mnih and Hinton [10]. We downloaded their word vectors from the web³. Mikolov et al. [8] have already evaluated these word representations on the word analogy task, where the Skip-gram models achieved the best performance with a huge margin.

³ http://metaoptimize.com/projects/wordreprs/

(24)

Why is it working?

Predicting instead of counting... Not the right explanation!

Learning local semantics!

⇒ GloVe ≈ PLSA + local context + embedding-based implementation

$X \in \mathbb{R}^{V \times V}$: word co-occurrence matrix

$X_{ij}$: frequency of word i co-occurring with word j

$X_i = \sum_k X_{ik}$: total number of occurrences of word i in the corpus

$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}$: probability of word j occurring within the context of word i

$w \in \mathbb{R}^z$: a word embedding of dimension z

$\tilde{w} \in \mathbb{R}^z$: a context word embedding of dimension z

Baroni, Dinu & Kruszewski, ACL 2014

Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

Pennington, Socher & Manning, EMNLP 2014

GloVe: Global Vectors for Word Representation
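The quantities X, X_i and P_ij defined above can be computed on a toy sentence as follows. The window size and the function name `cooccurrence_matrix` are assumptions for illustration; this sketch only builds the counts and does not train GloVe embeddings.

```python
import numpy as np

def cooccurrence_matrix(tokens, vocab, window=2):
    """X[i, j] = number of times word j appears within the window around word i."""
    V = len(vocab)
    X = np.zeros((V, V))
    for t, word in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for k in range(lo, hi):
            if k != t:
                X[vocab[word], vocab[tokens[k]]] += 1
    return X

tokens = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}

X = cooccurrence_matrix(tokens, vocab)
X_i = X.sum(axis=1, keepdims=True)   # X_i = sum_k X_ik
P = X / X_i                          # P_ij = X_ij / X_i
print(P[vocab["sat"]])
```

Each row of P is a probability distribution over the context words of one word, which is the "local context" information that the embeddings w and w̃ are trained to reproduce.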

(25)

W2V : Strengths & weaknesses

Incredible impact for a badly written article...

... associated with strong results

... a lightweight algorithm and a powerful implementation

... & available pretrained embeddings

One remaining question: we learned powerful semantics at the word level... How can we scale to the sentence or document level?

(26)

Q. Le, T. Mikolov, ICML 2014

Distributed representations of sentences and documents

(27)

Or simple averaging of word embeddings:

+ great results on small word groups

− poor results on larger groups: the averages quickly converge to a central abstract point of the latent space
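A small sketch of the averaging strategy and of its weakness. Random Gaussian vectors stand in for trained embeddings (an assumption made only to keep the example self-contained): the average of many vectors ends up close to the center of the space, whatever the words are.

```python
import numpy as np

def average_embedding(words, Z):
    """Represent a group of words by the mean of their embeddings."""
    return np.mean([Z[w] for w in words], axis=0)

rng = np.random.default_rng(0)
# Stand-in embeddings: 1000 hypothetical words in dimension 50.
Z = {f"w{i}": rng.normal(size=50) for i in range(1000)}

small = average_embedding(["w0", "w1"], Z)   # small word group
large = average_embedding(list(Z), Z)        # very large word group

# The norm of the large-group average is much smaller: it collapses to the center.
print(np.linalg.norm(small), np.linalg.norm(large))
```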

(28)

1. Aggregate multiple words associated with a single entity, using a Pointwise Mutual Information threshold:

Newspapers
New York            | New York Times         | Baltimore    | Baltimore Sun
San Jose            | San Jose Mercury News  | Cincinnati   | Cincinnati Enquirer

NHL Teams
Boston              | Boston Bruins          | Montreal     | Montreal Canadiens
Phoenix             | Phoenix Coyotes        | Nashville    | Nashville Predators

NBA Teams
Detroit             | Detroit Pistons        | Toronto      | Toronto Raptors
Oakland             | Golden State Warriors  | Memphis      | Memphis Grizzlies

Airlines
Austria             | Austrian Airlines      | Spain        | Spainair
Belgium             | Brussels Airlines      | Greece       | Aegean Airlines

Company executives
Steve Ballmer       | Microsoft              | Larry Page   | Google
Samuel J. Palmisano | IBM                    | Werner Vogels | Amazon

Table 2: Examples of the analogical reasoning task for phrases (the full test set has 3218 examples). The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.

This way, we can form many reasonable phrases without greatly increasing the size of the vocabulary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. Many techniques have been previously developed to identify phrases in the text; however, it is out of scope of our work to compare them. We decided to use a simple data-driven approach, where phrases are formed based on the unigram and bigram counts, using

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}. \qquad (6)$$

The δ is used as a discounting coefficient and prevents too many phrases consisting of very infrequent words to be formed. The bigrams with score above the chosen threshold are then used as phrases. Typically, we run 2-4 passes over the training data with decreasing threshold value, allowing longer phrases that consist of several words to be formed. We evaluate the quality of the phrase representations using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task. This dataset is publicly available on the web².

4.1 Phrase Skip-Gram Results

Starting with the same news data as in the previous experiments, we first constructed the phrase based training corpus and then we trained several Skip-gram models using different hyper-parameters. As before, we used vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens.

The results are summarized in Table 3.

The results show that while Negative Sampling achieves a respectable accuracy even with k = 5, using k = 15 achieves considerably better performance. Surprisingly, while we found the Hierarchical Softmax to achieve lower performance when trained without subsampling, it became the best performing method when we downsampled the frequent words. This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases.

² code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt

Method     | Dimensionality | No subsampling [%] | 10⁻⁵ subsampling [%]
NEG-5      | 300            | 24                 | 27
NEG-15     | 300            | 27                 | 42
HS-Huffman | 300            | 19                 | 47

Table 3: Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.

2. Include new terms in the dictionary before running word2vec
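The two steps above (scoring bigrams, then adding the selected phrases to the dictionary as single tokens) can be sketched as follows. The values of `delta` and `threshold`, the underscore used to join words and the toy sentence are assumptions for illustration, and only one pass is shown.

```python
from collections import Counter

def find_phrases(tokens, delta=1.0, threshold=0.1):
    """Select bigrams whose score (count(wi wj) - delta) / (count(wi) * count(wj)) is high."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    phrases = set()
    for (wi, wj), n in bigrams.items():
        score = (n - delta) / (unigrams[wi] * unigrams[wj])
        if score > threshold:
            phrases.add((wi, wj))
    return phrases

def merge_phrases(tokens, phrases):
    """Rewrite the corpus so that each selected bigram becomes a single token."""
    merged, k = [], 0
    while k < len(tokens):
        if k + 1 < len(tokens) and (tokens[k], tokens[k + 1]) in phrases:
            merged.append(tokens[k] + "_" + tokens[k + 1])
            k += 2
        else:
            merged.append(tokens[k])
            k += 1
    return merged

tokens = "i read the new york times in new york then a new book".split()
phrases = find_phrases(tokens)
print(phrases)                        # {('new', 'york')}
print(merge_phrases(tokens, phrases))
```

Running the same procedure again on the merged corpus, with a lower threshold, is how longer phrases such as a three-word newspaper name could be formed.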

(29)

Recurrent Neural Network & LSTM

The next step: providing a better modeling of the link between words

⇒ Learning a latent representation of a specific word group...

... but sentences have different lengths

⇒ using a recurrent aggregator, i.e. an RNN

Chris Olah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

(30)

Classical RNN: the gradient vanishes & long-term dependencies are not modeled...

(31)

The phenomenon has been understood & (partially) overcome:

Neurons learn what should be kept in memory and what should be forgotten

Gated architecture

S. Hochreiter, J. Schmidhuber, Neural Computation 1997

Long short-term memory
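A minimal sketch of one step of a gated cell, in the commonly used LSTM formulation with forget, input and output gates. The weight layout (a single matrix W applied to the concatenation of the previous state and the input), the sizes and the random weights are assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, hidden + input), b has shape (4 * hidden,)."""
    hidden = h_prev.shape[0]
    gates = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(gates[:hidden])                 # forget gate: what is erased from memory
    i = sigmoid(gates[hidden:2 * hidden])       # input gate: what is written to memory
    o = sigmoid(gates[2 * hidden:3 * hidden])   # output gate: what is exposed
    g = np.tanh(gates[3 * hidden:])             # candidate memory content
    c = f * c_prev + i * g                      # updated memory cell
    h = o * np.tanh(c)                          # updated hidden state
    return h, c

rng = np.random.default_rng(0)
hidden_size, input_size = 3, 2                  # illustrative sizes
W = rng.normal(size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)

# Aggregate a sequence of 7 input vectors into one fixed-size state.
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(7, input_size)):
    h, c = lstm_step(x, h, c, W, b)
print(h)
```

The same loop works for a sequence of any length, which is how a recurrent aggregator copes with sentences of different lengths.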

(32)

RNN architecture : different settings

From non recurrent modeling...

... to sequential aggregation... [doc2vec]

... through predicting N future steps

... to continuous supervision [POS/NER tagging]

Karpathy’s blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

(33)

State of the art: Bi-LSTM

LSTM

+ Sequential modeling

− Sequential dependencies! = partial modeling

The bi-directional representation $[\overrightarrow{S}, \overleftarrow{S}]$ is a more powerful representation of the sentence S than each single representation.

Classical notation: $s = [\overrightarrow{s}, \overleftarrow{s}]$

(34)

ELMo: Deep contextualized word representations

Static word embeddings ⇒ Mapping function = dynamic embeddings

Method: with long short-term memory (LSTM) networks, predicting the next words in both directions to build biLMs

[Figure: the forward LM architecture, expanded in the forward direction of k. Input tokens t_k ("have a nice one …") → embedding layer (x_k) → hidden layers (LSTMs: h^LM_k1, h^LM_k2) → output layer (o_k), predicting the next tokens ("a nice one …").]

bi-LSTM architecture + double hidden layer

M. E. Peters et al., arXiv 2018

Deep contextualized word representations

(35)

Method: ELMo represents a word as a linear combination of the corresponding hidden layers (including its embedding)

For each token t_k, the hidden layers of the forward LM and of the backward LM are concatenated: $h^{LM}_{k,j} = [\overrightarrow{h}^{LM}_{k,j}; \overleftarrow{h}^{LM}_{k,j}]$, with $h^{LM}_{k,0} = [x_k; x_k]$

$$\mathrm{ELMo}^{task}_k = \gamma^{task} \sum_{j=0}^{2} s^{task}_j \, h^{LM}_{k,j}$$

Unlike usual word embeddings, ELMo is assigned to every token instead of a type

ELMo is a task-specific representation: a downstream task learns the weighting parameters
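The linear combination can be sketched as below for a single token. The layer vectors are toy values; the softmax normalization of the weights s follows the description in Peters et al. and is not stated on the slide, so treat it as an assumption here.

```python
import numpy as np

def elmo_combination(layers, s_logits, gamma):
    """Weighted sum of the layers of one token.

    layers: array of shape (num_layers, dim), one row per layer h_{k,j}.
    s_logits: unnormalized layer weights learned by the downstream task.
    gamma: task-specific scaling factor.
    """
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                              # softmax-normalized weights s_j
    return gamma * (s[:, None] * layers).sum(axis=0)

# Toy layers for one token: embedding layer + two hidden layers (dimension 2).
layers = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
out = elmo_combination(layers, s_logits=np.zeros(3), gamma=3.0)
print(out)  # [2. 2.]
```

With equal weights the result is a scaled average of the layers; a downstream task can shift the weights toward the layer that helps it most.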

(36)

Evaluation: many linguistic tasks are improved by using ELMo

Q&A, textual entailment, sentiment analysis, semantic role labelling, coreference resolution, named entity recognition

(37)

Analysis: the higher layer seemed to learn semantics while the lower layer probably captured syntactic features

Analysis tasks: word sense disambiguation, PoS tagging

(38)

Next step: removing the recurrent architecture

RNN = computation bottleneck

Vaswani et al., NIPS 2017

Attention Is All You Need

Devlin et al., arXiv 2018

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

(40)

fastText: syntactic robustness

How to deal with unknown/rare words?

Word2Vec: creating a specific embedding to keep the global structure of the sentence

fastText:

Word embeddings

Subword embeddings = N-grams of characters

Implementation:

relies on special boundary characters: < >

+ skip-gram & negative sampling = W2V formulation

Word "where" / N = 3 ⇒ <where> + <wh, whe, her, ere, re>

⇒ Rare/unknown words can be represented by their N-grams of characters

P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Trans. ACL 2017

Enriching Word Vectors with Subword Information
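The decomposition of a word into character N-grams can be sketched as follows. The function name `char_ngrams` is illustrative and this is not the fastText library API; only a single value of N is used, as in the example of the slide.

```python
def char_ngrams(word, n=3):
    """Return the bounded word itself plus its character n-grams."""
    bounded = "<" + word + ">"            # special boundary characters
    grams = [bounded[k:k + n] for k in range(len(bounded) - n + 1)]
    return [bounded] + grams

print(char_ngrams("where"))
# ['<where>', '<wh', 'whe', 'her', 'ere', 're>']
```

An unknown word still shares most of its N-grams with known words, so a representation can be built from the embeddings of these N-grams.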

(41)

How to deal with unknown/rare words?

Few words appear a lot... many words appear rarely

A major issue!

(42)

How to deal with unknown/rare words?

Typical result with fastText:

query | tiling    | tech-rich        | english-born | micromanaging | eateries    | dendritic
sisg  | tile      | tech-dominated   | british-born | micromanage   | restaurants | dendrite
      | flooring  | tech-heavy       | polish-born  | micromanaged  | eaterie     | dendrites
sg    | bookcases | technology-heavy | most-capped  | defang        | restaurants | epithelial
      | built-ins | .ixic            | ex-scotland  | internalise   | delis       | p53

Table 7: Nearest neighbors of rare words using our representations and skipgram. These hand picked examples are for illustration.

[Figure 2: Illustration of the similarity between character n-grams in out-of-vocabulary words. For each pair, only one word is OOV, and is shown on the x axis. Red indicates positive cosine, while blue negative.]

Cherry on the cake: fastText... is fast

(43)

Exploiting new representations in applications

(44)

Karpathy’s demonstration on char2char

    import numpy as np

    class RNN:
        # ...
        def step(self, x):
            # update the hidden state
            self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
            # compute the output vector
            y = np.dot(self.W_hy, self.h)
            return y

+ multilayer architecture:

    y1 = rnn1.step(x)
    y = rnn2.step(y1)

A. Karpathy’s blog : The Unreasonable Effectiveness of Recurrent Neural Networks

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

(45)

Karpathy’s demonstration on char2char: sample of Shakespeare generation

PANDARUS:

Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death,

I should not sleep.

Second Senator:

They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.

(46)

Karpathy’s demonstration on char2char: Wikipedia sample

Naturalism and decision for the majority of Arab

countries’ capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham’s sovereignty. His generals were the powerful ruler of the Portugal in the

[[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]]

(47)

Karpathy’s demonstration on char2char: Linux code generation

/*

* Increment the size file of the new incorrect UI_FILTER group information

* of the size generatively.

*/

static int indicate_policy(void) {

int error;

if (fd == MARN_EPT) { /*

* The kernel blank will coeld it to userspace.

*/

if (ss->segment < mem_total) unblock_graph_and_set_blocked();

else ret = 1;

goto bail;

}

(48)

RNN & Context modeling

Still char2char architecture

[Figure 3: (a) Standard generative RNN; (b) encoder-decoder RNN; (c) concatenated input RNN.]

3.2 Generative Concatenative RNNs

Our goal is to generate text in a supervised fashion, conditioned on an auxiliary inputxaux. This has been done at the word-level with encoder-decoder models (Figure 3b), in which the auxiliary input is encoded and passed as the initial state to a decoder, which then must preserve this input signal across many sequence steps (Sutskever et al., 2014;

Karpathy and Fei-Fei, 2014). Such models have success- fully produced (short) image captions, but seem impracti- cal for generating full reviews at the character level because signal fromxaux must survive for hundreds of sequence steps.

We take inspiration from an analogy to human text generation. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.

The value of re-iterating high-level material is borne out in one study, Surber and Schroeder (2007), which showed that repetitive subject headings in textbooks resulted in faster learning, less rereading and more accurate answers to high- level questions.

Thus we propose the generative concatenative network, a simple architecture in which input x_aux is concatenated with the character representation x^(t)_char. Given this new input x'^(t) = [x^(t)_char; x_aux] we can train the model pre-

training time, x_aux is a feature of the training set. At prediction time, we fix some x_aux, concatenating it with each character sampled from ŷ^(t). One might reasonably note that this replicated input information is redundant. However, since it is fixed over the course of the review, we see no reason to require the model to transmit this signal across hundreds of time steps. By replicating x_aux at each input, we free the model to focus on learning the complex interaction between the auxiliary input and language, rather than memorizing the input.
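The concatenated input described above can be illustrated with a minimal sketch. It assumes one-hot character vectors and a one-hot star rating as x_aux; the helper names (`one_hot`, `concat_inputs`) and the toy alphabet are hypothetical and are not taken from the paper.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def concat_inputs(text, alphabet, x_aux):
    """Build x'(t) = [x_char(t); x_aux] for every character of `text`."""
    char_to_idx = {c: i for i, c in enumerate(alphabet)}
    # The same x_aux is appended at every time step, so the network never
    # has to carry it through its hidden state.
    return [one_hot(char_to_idx[c], len(alphabet)) + list(x_aux) for c in text]


alphabet = "abc "        # toy alphabet (assumption)
x_aux = one_hot(4, 5)    # hypothetical 5-level rating, here the top rating
inputs = concat_inputs("a cab", alphabet, x_aux)
```

Each element of `inputs` has length `len(alphabet) + len(x_aux)`, and its tail is identical at every step, which is the redundancy the excerpt accepts on purpose.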

3.3 Weight Transplantation

Models with even modestly sized auxiliary input representations are considerably harder to train than a typical unsupervised character model. To overcome this problem, we first train a character model to convergence. Then we transplant these weights into a concatenated input model, initializing the extra weights (between the input layer and the first hidden layer) to zero. Zero initialization is not problematic here because symmetry in the hidden layers is already broken. Thus we guarantee that the model will achieve a strictly lower loss than a character model, saving (days of) repeated training. This scheme bears some resemblance to the pre-training common in the computer vision community (Yosinski et al., 2014). Here, instead of new output weights, we train new input weights.
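The following sketch illustrates why zero-initialized input weights are a safe starting point. It is a simplification under stated assumptions: a single feedforward tanh layer stands in for the first hidden layer of the recurrent model, random numbers stand in for trained weights, and the function names (`hidden`, `transplant`) are hypothetical.

```python
import math
import random


def hidden(W, b, x):
    """One tanh layer: h = tanh(W x + b), with W given as a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def transplant(W_char, n_aux):
    """Copy character-model weights and append zero columns for x_aux."""
    return [row + [0.0] * n_aux for row in W_char]


random.seed(0)
n_char, n_aux, n_hidden = 4, 5, 3
# Stand-in for the weights of a character model trained to convergence.
W_char = [[random.uniform(-1, 1) for _ in range(n_char)]
          for _ in range(n_hidden)]
b = [random.uniform(-1, 1) for _ in range(n_hidden)]

W_concat = transplant(W_char, n_aux)

x_char = [0.0, 1.0, 0.0, 0.0]
x_aux = [0.0, 0.0, 0.0, 0.0, 1.0]
h_before = hidden(W_char, b, x_char)           # character model
h_after = hidden(W_concat, b, x_char + x_aux)  # concatenated input model
```

Because the new columns are zero, `h_after` matches `h_before`: the transplanted model starts from the character model's loss, and further training can only adjust it from there.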

3.4 Running the Model in Reverse

Many common document classification models, like tf-idf logistic regression, maximize the likelihood of the training labels given the text. Given our generative model, we can produce a predictor by reversing the order of inference, that is, by maximizing the likelihood of the text, given a classification. The relationship between these two tasks (P(x_aux|Review) and P(Review|x_aux)) follows from Bayes’ rule. That is, our model predicts the conditional probability P(Review|x_aux) of an entire review given some x_aux (such as a star rating). The normalizing term can be disregarded in determining the most probable rating and when the classes are balanced, as they are in our test cases, the prior also vanishes from the decision rule leaving P(x_aux|Review) ∝ P(Review|x_aux).

In the specific case of author identification, we are solving the problem

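\begin{align*}
& \operatorname*{argmax}_{u \in \text{users}} \; P(\text{Review} \mid u) \cdot P(u) \\
={} & \operatorname*{argmax}_{u} \left( \prod_{t=1}^{T} P(y_t \mid u, y_1, \ldots, y_{t-1}) \right) \cdot P(u) \\
={} & \operatorname*{argmax}_{u} \left( \sum_{t=1}^{T} \log P(y_t \mid u, y_1, \ldots, y_{t-1}) \right) + \log P(u)
\end{align*}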

Input = char + context

Inference = test of all contexts + max. likelihood
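The inference rule above (score the text under every context, keep the most likely one) can be sketched as follows. This is only an illustration: smoothed per-class character frequencies stand in for the RNN's P(y_t | u, y_1, ..., y_{t-1}), so the history is ignored, and the toy reviews, class names and function names are all assumptions.

```python
import math
from collections import Counter


def train_toy_model(reviews_by_class, alphabet, smoothing=1.0):
    """Per-class smoothed character log-frequencies (stand-in for the RNN)."""
    model = {}
    for label, reviews in reviews_by_class.items():
        counts = Counter("".join(reviews))
        total = sum(counts[c] for c in alphabet) + smoothing * len(alphabet)
        model[label] = {c: math.log((counts[c] + smoothing) / total)
                        for c in alphabet}
    return model


def classify(review, model, log_prior):
    """Return argmax_u of sum_t log P(y_t | u) + log P(u), and all scores."""
    scores = {
        label: sum(char_logp[c] for c in review) + log_prior[label]
        for label, char_logp in model.items()
    }
    return max(scores, key=scores.get), scores


alphabet = "abcdefghijklmnopqrstuvwxyz "
reviews_by_class = {  # toy data, not from the paper
    "positive": ["great beer", "excellent taste"],
    "negative": ["awful beer", "bad flat taste"],
}
model = train_toy_model(reviews_by_class, alphabet)
log_prior = {label: math.log(0.5) for label in model}  # balanced classes
label, scores = classify("excellent", model, log_prior)
```

With balanced classes the `log_prior` term is the same for every context and does not change the argmax, which matches the remark in the excerpt that the prior vanishes from the decision rule. Characters outside `alphabet` would raise a `KeyError` in this sketch.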

Lipton, Vikram, McAuley, arXiv, 2016 Gen. Concatenative Nets Jointly Learn to Write and Classify Reviews Demo: http://deepx.ucsd.edu/

Figure 5: Most likely star rating as each letter is encountered. The GCN learns to tilt positive by the ‘c’ in ‘excellent’ and that the ‘f’ in ‘awful’ reveals negative sentiment.

4.5 Learning Nonlinear Dynamics of Negation

By qualitatively evaluating the beliefs of the network about the sentiment (rating) corresponding to various phrases, we can easily show many clear cases where the GCN is able to model the nonlinear dynamics in text. To demonstrate this capacity, we plot the likelihoods of the review conditioned on each setting of the rating for the phrases “this beer is great”, “this beer is not great”, “this beer is bad”, and “this beer is not bad” (Figure 6).

5 Related Work

The prospect of capturing meaning in character-level text has long captivated neural network researchers. In the seminal work, “Finding Structure in Time”, Elman (1990) speculated, “one can ask whether the notion ‘word’ (or something which maps on to this concept) could emerge as a consequence of learning the sequential structure of letter sequences that form words and sentences (but in which word boundaries are not marked).” In this work, an ‘Elman RNN’ was trained with 5 input nodes, 5 output nodes, and a single hidden layer of 20 nodes, each of which had a corresponding context unit to predict the next character in a sequence. At each step, the network received a binary encoding (not one-hot) of a character and tried to predict the next character’s binary encoding. Elman plots the error

Predicting User from Review

                 Accuracy   AUC     Recall@10%
GCN-Character    .9190      .9979   .9700
TF-IDF n-gram    .9133      .9979   .9756

Predicting Item from Review

                 Accuracy   AUC     Recall@10%
GCN-Character    .1280      .9620   .4370
TF-IDF n-gram    .2427      .9672   .5974

Table 2: Information retrieval performance for predicting the author of a review and item described.

(a) GCN (rows: true category, columns: predicted category)

          F/V   Lager   Stout   Porter   IPA
F/V       910   28      7       14       41
Lager     50    927     3       3        17
Stout     16    1       801     180      2
Porter    22    3       111     856      8
IPA       19    12      4       12       953

(b) ngram tf-idf (rows: true category, columns: predicted category)

          F/V   Lager   Stout   Porter   IPA
F/V       923   36      9       10       22
Lager     16    976     0       1        7
Stout     9     4       920     65       2
Porter    11    6       90      887      6
IPA       18    13      1       2        966

Table 3: Beer category classification confusion matrices.

of the net character by character, showing that it is typically high at the onset of words, but decreasing as it becomes clear what each word is. While these nets do not possess the size or capabilities of large modern LSTM networks trained on GPUs, this work lays the foundation for much of our research. Subsequently, in 2011, Sutskever et al. (2011) introduced the model of text generation on which we build. In that paper, the authors generate text resembling Wikipedia articles and New York Times articles. They sanity check the model by showing that it can perform a debagging task in which it unscrambles bag-of-words representations of sentences by determining which unscrambling has the highest likelihood. Also relevant to our work is Zhang and LeCun (2015), which trains a strictly discriminative model of text at the character level using convolutional neural networks (LeCun et al., 1989, 1998).

Demonstrating success on both English and Chinese language datasets, their models achieve high accuracy on a number of classification tasks. Dai and Le (2015) train a character level LSTM to perform document classification, using LSTM RNNs pretrained as either language models
