NATURAL LANGUAGE PROCESSING:
RECENT ADVANCES &
DEEP LEARNING ARCHITECTURES
Vincent Guigue
Introduction
Textual data = different analysis level At the document scale:
⇒ Several efficient algorithm to classify/categorize
At the sentence scale:
Segmentation POS, NER, SRL...
⇒ Sequential approaches (HMM, CRF)
At the word scale:
Defining / understanding the word
Linking the word with antonym, synonym, etc, ...
Binding the word to a lexical field
⇒ Latent semantics / linguistic resources
⇒ How deep learning / representation learning can improve our previous solutions?
NLP & deep learning 2/39
History, data sources & paradigms 1980-2000 From NLP to Machine learning
TREC / MUC : cheap access to large datasets
⇒Several breakthroughs in IR, Info. Extraction, ...
Not so cheap linguistic resources
ML algorithms progressively become crucial in NLP Algo. resources = article description + re-implementation
2000-2010 Internet era
New resources: large open & rich corpora, wikipedia New computing resources: cheap clusters, GPU Efficient implementations sharing (SVM, boosting) ML becomes pervasive
2010- Github era
Easy / encouraged source sharing
Embedding sharing⇒real interest of representation learning
NLP & deep learning 3/39
Deep learning & representation learning
General philosophy : building a virtuous cycle Tackling directly raw data
Learn a representation that can be useful for several tasks...
Important keyword = embedding ... Each task improving the representation
From a standard chain...
Raw textual
data
Expert know-how
Pre-processing
High dimensional
vector representation
SVM, trees, Reg Log, ...
Machine learning algorithm
Learning mechanism Task
NLP & deep learning 4/39
Deep learning & representation learning
General philosophy : building a virtuous cycle Tackling directly raw data
Learn a representation that can be useful for several tasks...
Important keyword = embedding ... Each task improving the representation
... to a deep learning architecture:
Raw textual
data
Convolution, fully connected, ...
Representation learning step
Learned representation
Embedding
Decision
step(s) Tasks
Learning mechanism
NLP & deep learning 4/39
NLP & representation learning 1990 Matrix factorization :
PCA is a historical way to learn representations Criterion = reconstruction
In NLP⇒SVD / PCA [Deerwester, 1990]
2005 First neural architecture for text representation
an alternative to PLSA [Keller, 2005]
2008 Convolutional Neural architecture for text
precursor of modern architecturesmulti-tasks [Collobert, 2008]
2012 The Word2Vec wave
Qualitative & cheap word embeddings [Mikolov, 2012]
2013 Manifest for representation learning [Bengio, 2013]
2015 Seq2seq paradigm [Luong, 2015]
⇒ Also an approximate schedule of the presentation
NLP & deep learning 5/39
Deep architectures
Introduction Deep architectures Exploitation Knowledge Conclusion
Representation learning in text Task =
Is the word w present is document d ? Close to LSA / PLSA paradigm:
compressing the original data matrix through embeddings
⇒ Learning a language model
of words (resp. documents); these two representations are then concatenated and transformed non- linearly in order to obtain the target score using MLPT, as summarized in (2):
NNTR(wj, di) = MLPT{[MLPW(wj),MLPD(dj)]} . (2) All the parameters of the model are trained jointly on a text corpus, assigning high scores to pairs (wj,di) corresponding to documentsdi containing wordswj and low scores for all the other pairs.
A naive criterion would be to maximize the likelihood of the correct class, however, doing so would give the same weight to each seen example. Note that our data presents a huge imbalance between the number ofpositivepairs and the number ofnegativepairs (each document only contains a fraction of words of the vocabulary). Thus, the model would quickly be biased towards answering negatively and would then have difficulties in learning anything else. Another kind of imbalance specific to our data is that, among the positive examples, a few words tend to appear really often, while a lot appear only in few documents, which would bias the model to give lower probabilities to pairs with infrequent words independently of the document.
One Document One Word
One-Hot Encoding TF-IDF Encoding Word Repres. Document Repres.
Document MLP Word/Document
MLP
Word MLP
Figure 1: The NNTR model
Task TFIDF PLSA NNTR
Retrieval 0.170 0.199 0.215 Filtering 0.185 0.189 0.192 Figure 2: Compared mean Averaged Precisions (the higher the better) for Document Retrieval and Batch Docu- ment Filtering tasks.
The approach we thus propose does not try to obtain probabilities but simply unnormalized scores.
Thanks to this additional freedom, we can help the optimization process by weighting the training examples in order to balance the total number of positive examples (words wj that are indeed in documentsdi) with the total number of negative examples. We can also balance each positive example independently of itsdocument frequency(the number of documents into which the word appears). The criterion we thus optimize can be expressed as follows:
C= 1 L−
L−
!
l=1
Q(xl,−1) + 1 M
!M
j=1
1 dfj
dfj
!
i=1
Q((di, wj),1), (3)
where L− is the number of negative examples,M the number of words in the dictionary extracted from the training set, dfj the number of documents the word wj appears in, and Q(x, y) the cost function for an examplex= (d, w) and its targety. Note that using this weighting technique we do not need to present the whole negative example set but a sub-sampling of it at each iteration, which makes stochastic gradient descent training much faster. Furthermore, we use a margin-based cost functionQ(x, y) =|1−y·NNTR(x)|+, as proposed in [4], where|z|+= max(0, z).
4 Experiments
TDT2 is a database of transcripted broadcast news in American English. For this experiment we used 24 823 documents from a manually produced transcription and segmentation, referred to in
Conclusion:
Words & documents representations are more efficient than PLSA ones for several TREC information retrieval tasks
M. Keller, S. Bengio,ICANN 2005 A neural network for text representation
NLP & deep learning 6/39
Convolutional Neural Architecture
Lookup table concept
= table of embeddings State of the art on : POS, NER, SRL Quite difficult to set...
... But open source + open embeddings Based on torch...
By Collobert
R. Collobert, J. WestonICML 2008
A unified architecture for natural language processing: Deep neural networks with multitask learning
NLP & deep learning 7/39
Convolutional Neural Architecture
Lookup table concept
= table of embeddings State of the art on : POS, NER, SRL Quite difficult to set...
... But open source + open embeddings Based on torch...
By Collobert
R. Collobert, J. WestonICML 2008
A unified architecture for natural language processing: Deep neural networks with multitask learning
NLP & deep learning 7/39
Convolutional Neural Architecture
Lookup table concept
= table of embeddings State of the art on : POS, NER, SRL Quite difficult to set...
... But open source + open embeddings Based on torch...
By Collobert
R. Collobert, J. WestonICML 2008
A unified architecture for natural language processing: Deep neural networks with multitask learning
NLP & deep learning 7/39
Convolutional Neural Architecture
Lookup table concept
= table of embeddings State of the art on : POS, NER, SRL Quite difficult to set...
... But open source + open embeddings Based on torch...
By Collobert
R. Collobert, J. WestonICML 2008
A unified architecture for natural language processing: Deep neural networks with multitask learning
NLP & deep learning 7/39
Convolutional Neural Architecture (2) Several important informations
Embedding are learned keeping the sentence structure Embeddings benefit from multi-tasks
Learning is slow...
Our embeddings have been trained for about 2 months, over Wikipedia.
https://ronan.collobert.com/senna/
... But inference is fast & efficient
+ Costly embeddings have been made available to the community
NLP & deep learning 8/39
All modern ingredients
Raw textual
data
Convolution, fully connected, ...
Representation learning step
Learned representation
Embedding
Decision
step(s) Tasks
language modeling
Learning mechanism
First task = modeling language
Embeddings are made available Several tasks can be improved
NLP & deep learning 9/39
Word2Vec : extracting local semantics
the cat sat on the mat
sat Projection
w(t)
w(t-2) w(t-1) w(t+1) w(t+2)
Skip-gram
the cat sat on the mat
Sliding window (size = 5)
SUM
w(t)
w(t-2) w(t-1) w(t+1) w(t+2)
sat
CBOW
Sliding window (size = 5)
Skip-Gram (& negative sampling implementation) : easy learning Local analysis
predictive criterion : estimating missing value Sliding window = local context C of word w
Mikolov, Sutskever, Chen, Corrado, Dean,NIPS 2013 (arXiv 2012) Distributed representations of words and phrases and their compositionality
NLP & deep learning 10/39
W2V : detailed explaination
Given word w and local contexts C : Idée SG: arg max
θ
Y
C
Y
w∈C
p(C | w ; θ)
p (D = 1 | w
i, w
j; θ) ⇒ proba. that w
iand w
joccur in the same context
Espace vectoriel
C
arg max
θ
Y
i,j∈C
p(D = 1 | w
i, w
j; θ) + Y
i,j∈C¯
p(D = 0 | w
i, w
j; θ)
| {z }
Negative Sampling
Goldberg, Levy, arXiv 2014
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method
Hammer, NN 2002
Generalized Relevance Learning Vector Quantization
NLP & deep learning 11/39
W2V : Skip-gram & negative sampling
arg max
θ
Y
i,j∈C
p(D = 1 | w
i, w
j; θ) + Y
i,j∈C¯
p(D = 0 | w
i, w
j; θ)
| {z }
Negative Sampling Using logistic function : p(D = 1 | w
i, w
j) =
1+exp(1−zizj)
Global log-likelihood : Z
?= arg max
Z
X
i,j∈C
log σ(z
i· z
j) + X
i,j∈C¯
log σ( − z
i· z
j)
σ : fct sigmoide, C : Set of Cooccurences, C ¯ : Set of Non-Cooc
Stochastic Gradient Descent + triplet loss
Frequent word subsampling trick : picking words with p(w
i) = 1 − q
tfreq(wi)
, t = 10
−5NLP & deep learning 12/39
W2V: Parametrization & results
Choosing a large latent space: z ∈ [500, 1000]
(vs PLSA/LDA z ∈ [10, 100]) Operate on large & small corpora...
... very fast !
NLP & deep learning 13/39
W2V: Playing in the latent space
a is to b what c is to ??? ⇔ z
b− z
a+ z
cSyntactical property (1):
z
woman− z
man≈ z
queen− z
kingz
kings− z
king≈ z
queens− z
kingsQuery:
z
woman− z
man+ z
king= z
reqNearest neighbor:
arg min
ik z
req− z
ik = queen
NLP & deep learning 14/39
W2V: Playing in the latent space
a is to b what c is to ??? ⇔ z
b− z
a+ z
cSyntactical property (2):
Query:
z
easy− z
easiest+ z
luckiest= z
reqNearest neighbor:
arg min
ik z
req− z
ik = lucky
NLP & deep learning 14/39
W2V: Playing in the latent space
a is to b what c is to ??? ⇔ z
b− z
a+ z
cSemantic Property (1)
NLP & deep learning 14/39
W2V: Playing in the latent space
a is to b what c is to ??? ⇔ z
b− z
a+ z
cSemantic Property (2)
NEG-15 with10−5subsampling HS with10−5subsampling
Vasco de Gama Lingsugur Italian explorer
Lake Baikal Great Rift Valley Aral Sea
Alan Bean Rebbeca Naomi moonwalker
Ionian Sea Ruegen Ionian Islands
chess master chess grandmaster Garry Kasparov
Table 4: Examples of the closest entities to the given short phrases, using two different models.
Czech + currency Vietnam + capital German + airlines Russian + river French + actress
koruna Hanoi airline Lufthansa Moscow Juliette Binoche
Check crown Ho Chi Minh City carrier Lufthansa Volga River Vanessa Paradis Polish zolty Viet Nam flag carrier Lufthansa upriver Charlotte Gainsbourg
CTK Vietnamese Lufthansa Russia Cecile De
Table 5: Vector compositionality using element-wise addition. Four closest tokens to the sum of two vectors are shown, using the best Skip-gram model.
To maximize the accuracy on the phrase analogy task, we increased the amount of the training data by using a dataset with about 33 billion words. We used the hierarchical softmax, dimensionality of 1000, and the entire sentence for the context. This resulted in a model that reached an accuracy of72%. We achieved lower accuracy 66% when we reduced the size of the training dataset to 6B words, which suggests that the large amount of the training data is crucial.
To gain further insight into how different the representations learned by different models are, we did inspect manually the nearest neighbours of infrequent phrases using various models. In Table 4, we show a sample of such comparison. Consistently with the previous results, it seems that the best representations of phrases are learned by a model with the hierarchical softmax and subsampling.
5 Additive Compositionality
We demonstrated that the word and phrase representations learned by the Skip-gram model exhibit a linear structure that makes it possible to perform precise analogical reasoning using simple vector arithmetics. Interestingly, we found that the Skip-gram representations exhibit another kind of linear structure that makes it possible to meaningfully combine words by an element-wise addition of their vector representations. This phenomenon is illustrated in Table 5.
The additive property of the vectors can be explained by inspecting the training objective. The word vectors are in a linear relationship with the inputs to the softmax nonlinearity. As the word vectors are trained to predict the surrounding words in the sentence, the vectors can be seen as representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. The product works here as the AND function: words that are assigned high probabilities by both word vectors will have high probability, and the other words will have low probability. Thus, if “Volga River” appears frequently in the same sentence together with the words “Russian” and “river”, the sum of these two word vectors will result in such a feature vector that is close to the vector of “Volga River”.
6 Comparison to Published Word Representations
Many authors who previously worked on the neural network based representations of words have published their resulting models for further use and comparison: amongst the most well known au- thors are Collobert and Weston [2], Turian et al. [17], and Mnih and Hinton [10]. We downloaded their word vectors from the web3. Mikolov et al. [8] have already evaluated these word representa- tions on the word analogy task, where the Skip-gram models achieved the best performance with a huge margin.
3http://metaoptimize.com/projects/wordreprs/
7
NLP & deep learning 14/39
Why is it working?
Predicting instead of counting... Not the good explanation!
Learning a local semantics !
⇒ GloVe ≈ PLSA + local context + embedding based implem.
X ∈RV×V word co-occurrence matrix
Xij frequency of wordi co-occurring with wordj Xi =P
kXik total number of occurrences of wordi in corpus Pij=P(j|i) = XXij
i a.k.a. probability of word j occurring within the context of wordi
w ∈Rz a word embedding of dimension z
˜
w ∈Rz a context word embedding of dimension z
Baroni, Dinu & Kruszewski, ACL 2014
Don’t count, predict! A systematic comparison of context-counting vs.
context-predicting semantic vectors
Pennington, Socher & Manning, EMNLP 2014 Glove: Global Vectors for Word Representation
NLP & deep learning 15/39
W2V : Strengths & weaknesses
Incredible impact for a badly written article...
... Associated to strong results
... Lightweight algorithm and powerful implementation ... & available pretrained embeddings
One remaining question: we learned a powerful semantics at theword level... How scaling to thesentence ordocument level?
NLP & deep learning 16/39
W2V : Strengths & weaknesses
Incredible impact for a badly written article...
... Associated to strong results
... Lightweight algorithm and powerful implementation ... & available pretrained embeddings
One remaining question: we learned a powerful semantics at theword level... How scaling to thesentence ordocument level?
Q. Le, T. Mikolov,ICML 2014
Distributed representations of sentences and documents
NLP & deep learning 16/39
W2V : Strengths & weaknesses
Incredible impact for a badly written article...
... Associated to strong results
... Lightweight algorithm and powerful implementation ... & available pretrained embeddings
One remaining question: we learned a powerful semantics at theword level... How scaling to thesentence ordocument level?
Or simple averaging of word embeddings:
+ great results on small word groups
− poor results on larger groups
quickly converge to a central abstract point of the latent space
NLP & deep learning 16/39
W2V : Strengths & weaknesses
Incredible impact for a badly written article...
... Associated to strong results
... Lightweight algorithm and powerful implementation ... & available pretrained embeddings
One remaining question: we learned a powerful semantics at theword level... How scaling to thesentence ordocument level?
1 Aggregate multiple words associated to a single entity
Pointwise Mu-
tual Informationthreshold:
New York New York Times Baltimore Baltimore Sun San Jose San Jose Mercury News Cincinnati Cincinnati Enquirer
NHL Teams
Boston Boston Bruins Montreal Montreal Canadiens Phoenix Phoenix Coyotes Nashville Nashville Predators
NBA Teams
Detroit Detroit Pistons Toronto Toronto Raptors Oakland Golden State Warriors Memphis Memphis Grizzlies
Airlines
Austria Austrian Airlines Spain Spainair
Belgium Brussels Airlines Greece Aegean Airlines Company executives
Steve Ballmer Microsoft Larry Page Google
Samuel J. Palmisano IBM Werner Vogels Amazon
Table 2: Examples of the analogical reasoning task for phrases (the full test set has 3218 examples).
The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.
This way, we can form many reasonable phrases without greatly increasing the size of the vocabu- lary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. Many techniques have been previously developed to identify phrases in the text; however, it is out of scope of our work to compare them. We decided to use a simple data-driven approach, where phrases are formed based on the unigram and bigram counts, using
score(wi, wj) = count(wiwj)−δ
count(wi)×count(wj). (6)
Theδis used as a discounting coefficient and prevents too many phrases consisting of very infre- quent words to be formed. The bigrams with score above the chosen threshold are then used as phrases. Typically, we run 2-4 passes over the training data with decreasing threshold value, allow- ing longer phrases that consists of several words to be formed. We evaluate the quality of the phrase representations using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task. This dataset is publicly available on the web2. 4.1 Phrase Skip-Gram Results
Starting with the same news data as in the previous experiments, we first constructed the phrase based training corpus and then we trained several Skip-gram models using different hyper- parameters. As before, we used vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens.
The results are summarized in Table 3.
The results show that while Negative Sampling achieves a respectable accuracy even withk= 5, usingk= 15achieves considerably better performance. Surprisingly, while we found the Hierar- chical Softmax to achieve lower performance when trained without subsampling, it became the best performing method when we downsampled the frequent words. This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases.
2code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt
Method Dimensionality No subsampling [%] 10−5subsampling [%]
NEG-5 300 24 27
NEG-15 300 27 42
HS-Huffman 300 19 47
Table 3: Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.
6 2 Include new terms in the
dictionary before running word2vec
Newspapers
New York New York Times Baltimore Baltimore Sun San Jose San Jose Mercury News Cincinnati Cincinnati Enquirer
NHL Teams
Boston Boston Bruins Montreal Montreal Canadiens Phoenix Phoenix Coyotes Nashville Nashville Predators
NBA Teams
Detroit Detroit Pistons Toronto Toronto Raptors Oakland Golden State Warriors Memphis Memphis Grizzlies
Airlines
Austria Austrian Airlines Spain Spainair
Belgium Brussels Airlines Greece Aegean Airlines Company executives
Steve Ballmer Microsoft Larry Page Google
Samuel J. Palmisano IBM Werner Vogels Amazon
Table 2: Examples of the analogical reasoning task for phrases (the full test set has 3218 examples).
The goal is to compute the fourth phrase using the first three. Our best model achieved an accuracy of 72% on this dataset.
This way, we can form many reasonable phrases without greatly increasing the size of the vocabu- lary; in theory, we can train the Skip-gram model using all n-grams, but that would be too memory intensive. Many techniques have been previously developed to identify phrases in the text; however, it is out of scope of our work to compare them. We decided to use a simple data-driven approach, where phrases are formed based on the unigram and bigram counts, using
score(wi, wj) = count(wiwj)−δ
count(wi)×count(wj). (6)
Theδis used as a discounting coefficient and prevents too many phrases consisting of very infre- quent words to be formed. The bigrams with score above the chosen threshold are then used as phrases. Typically, we run 2-4 passes over the training data with decreasing threshold value, allow- ing longer phrases that consists of several words to be formed. We evaluate the quality of the phrase representations using a new analogical reasoning task that involves phrases. Table 2 shows examples of the five categories of analogies used in this task. This dataset is publicly available on the web2. 4.1 Phrase Skip-Gram Results
Starting with the same news data as in the previous experiments, we first constructed the phrase based training corpus and then we trained several Skip-gram models using different hyper- parameters. As before, we used vector dimensionality 300 and context size 5. This setting already achieves good performance on the phrase dataset, and allowed us to quickly compare the Negative Sampling and the Hierarchical Softmax, both with and without subsampling of the frequent tokens.
The results are summarized in Table 3.
The results show that while Negative Sampling achieves a respectable accuracy even withk= 5, usingk= 15achieves considerably better performance. Surprisingly, while we found the Hierar- chical Softmax to achieve lower performance when trained without subsampling, it became the best performing method when we downsampled the frequent words. This shows that the subsampling can result in faster training and can also improve accuracy, at least in some cases.
2code.google.com/p/word2vec/source/browse/trunk/questions-phrases.txt
Method Dimensionality No subsampling [%] 10−5subsampling [%]
NEG-5 300 24 27
NEG-15 300 27 42
HS-Huffman 300 19 47
Table 3: Accuracies of the Skip-gram models on the phrase analogy dataset. The models were trained on approximately one billion words from the news dataset.
6
NLP & deep learning 16/39
Recurrent Neural Network & LSTM The next step:
providing a better modeling of the link between words
⇒ Learning a latent representation of a specific word group ... but sentences have different lengths
⇒ using a recurrent aggregator ... ie a RNN
Chris Olah’s bloghttp://colah.github.io/posts/2015-08-Understanding-LSTMs/
NLP & deep learning 17/39
Recurrent Neural Network & LSTM The next step:
providing a better modeling of the link between words Classical RNN :
Gradient vanishes & long term dependancies are not modeled....
Chris Olah’s bloghttp://colah.github.io/posts/2015-08-Understanding-LSTMs/
NLP & deep learning 17/39
Recurrent Neural Network & LSTM The next step:
providing a better modeling of the link between words
The phenomenon has been understood & (partially) overcome:
Neuronslearnwhat should bekept in memoryand what should beforgotten
Gated architecture S. Hochreiter, J. Schmidhuber,Neural computation 1997
Long short-term memory
Chris Olah’s bloghttp://colah.github.io/posts/2015-08-Understanding-LSTMs/
NLP & deep learning 17/39
RNN architecture : different settings
From non recurrent modeling...
... to sequential aggregation... [doc2vec]
... through predicting N future steps
... to continuous supervision [POS/NER tagging]
Karpathy’s bloghttp://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLP & deep learning 18/39
State Of The Art : Bi-LSTM LSTM
+ Sequential modeling
− Sequential dependencies ! = partial modeling
Bi-dimensional representation [S
1, S
10] is more powerful
representation of the sentence S than each single representation.
Classical notation: s = [ − → s , ← − s ]
NLP & deep learning 19/39
ElMo : Deep contextualized word representations
Static word embeddings ⇒ Mapping function = dynamic embeddings
Method
With long short term memory (LSTM) network, predicting the next words in both directions to build biLMs
36
have a nice one …
Output layer
Hidden layers (LSTMs)
Embedding layer
…
a nice one
xk ok
k− 1 k− 1
Expanded in the forward direction of k The forward LM architecture
• Overview
• Method
• Evaluation
• Analysis
• Comments
tk hLMk1 hLMk2
bi-LSTM architecture + double hidden layer
M. E. Peters et al., arXiv 2018
Deep contextualized word representations
NLP & deep learning 20/39
ElMo : Deep contextualized word representations
Static word embeddings ⇒ Mapping function = dynamic embeddings
Method
ELMo represents a word as a linear combination of corresponding hidden layers (inc. its embedding)
37
tk
xk ok
k− 1
k− 1 hLMk1
ok k+ 1 k+ 1 hLMk2
hLMk1 hLMk2 Forward LM Backward LM
tk tk
s0task s1task
s2task Concatenate
hidden layers hLMk1
hLMk2
hLMk0 ([xk;xk])
×
×
× [hLMkj;hLMkj]
{
ELMotaskk = γtask ×∑
Unlike usual word embeddings, ELMo is assigned to every token instead of a type
biLMs ELMo is a task specific
representation. A down-stream task learns weighting parameters
M. E. Peters et al., arXiv 2018
Deep contextualized word representations
NLP & deep learning 20/39
ElMo : Deep contextualized word representations
Static word embeddings ⇒ Mapping function = dynamic embeddings
Many linguistic tasks are improved by using ELMo39
Evaluation
• Overview
• Method
• Evaluation
• Analysis
• Comments
Q&A Textual entailment
Sentiment analysis Semantic role labelling Coreference resolution Named entity recognition
M. E. Peters et al., arXiv 2018
Deep contextualized word representations
NLP & deep learning 20/39
Introduction Deep architectures Exploitation Knowledge Conclusion
ElMo : Deep contextualized word representations
Static word embeddings ⇒ Mapping function = dynamic embeddings
Analysis
The higher layer seemed to learn semantics while the lower layer probably captured syntactic features
40
• Overview
• Method
• Evaluation
• Analysis
• Comments
Word sense disambiguation PoS tagging
M. E. Peters et al., arXiv 2018
Deep contextualized word representations
NLP & deep learning 20/39
Next step: removing the recurrent architecture RNN = computation bottleneck
Vaswani et al., NIPS 2017 Attention Is All You Need Devlin et al., arXiv 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
NLP & deep learning 21/39
Next step: removing the recurrent architecture RNN = computation bottleneck
Vaswani et al., NIPS 2017 Attention Is All You Need Devlin et al., arXiv 2018
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
NLP & deep learning 21/39
fastText : syntactic robustness
How to deal with unknown/rare words ? Word2Vec : creating a specific embedding
to keep the global structure of the sentence fastText :
Word embeddings
Subword embeddings =N-grams of characters Implementation
relies on special bounding characters: <>
+ skip-gram & negative sampling = W2V formulation
Word where / N = 3 ⇒ <where> + <wh, whe, her, ere, re>
⇒ Rare/unknown words can be represented by their N-grams of characters
P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Trans. ACL 2017 Enriching Word Vectors with Subword Information
NLP & deep learning 22/39
fastText : syntactic robustness
How to deal with unknown/rare words ?
Few words appear a lot...
many words appear rarely Great issue !
P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Trans. ACL 2017 Enriching Word Vectors with Subword Information
NLP & deep learning 22/39
Introduction Deep architectures Exploitation Knowledge Conclusion
fastText : syntactic robustness
How to deal with unknown/rare words ? Typical result with fastText:
query tiling tech-rich english-born micromanaging eateries dendritic sisg tile tech-dominated british-born micromanage restaurants dendrite
flooring tech-heavy polish-born micromanaged eaterie dendrites sg bookcases technology-heavy most-capped defang restaurants epithelial
built-ins .ixic ex-scotland internalise delis p53
Table 7: Nearest neighbors of rare words using our representations and skipgram. These hand picked examples are for illustration.
Figure 2: Illustration of the similarity between character n -grams in out-of-vocabulary words. For each pair, only one word is OOV, and is shown on the x axis. Red indicates positive cosine, while blue negative.
Cherry on the cake: fastText... is fast
P. Bojanowski, E. Grave, A. Joulin and T. Mikolov, Trans. ACL 2017 Enriching Word Vectors with Subword Information
NLP & deep learning 22/39
Exploiting new
representations in
Applications
Karpathy’s demonstration on char2char
c l a s s RNN :
# . . .
d e f s t e p ( s e l f , x ) :
# u p d a t e t h e h i d d e n s t a t e
s e l f . h = np . t a n h ( np . d o t ( s e l f . W_hh, s e l f . h ) + np . d o t ( s e l f . W_xh, x ) )
# compute t h e o u t p u t v e c t o r y = np . d o t ( s e l f . W_hy, s e l f . h ) r e t u r n y
+ multilayer architecture:
y1 = r n n 1 . s t e p ( x ) y = r n n 2 . s t e p ( y1 )
A. Karpathy’s blog : The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLP & deep learning 23/39
Karpathy’s demonstration on char2char Sample of Shakespeare generation
PANDARUS:
Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states.
A. Karpathy’s blog : The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLP & deep learning 24/39
Karpathy’s demonstration on char2char Wikipedia sample
Naturalism and decision for the majority of Arab
countries’ capitalide was grounded by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated with Guangzham’s sovereignty. His generals were the powerful ruler of the Portugal in the
[[Protestant Immineners]], which could be said to be directly in Cantonese Communication, which followed a ceremony and set inspired prison, training. The emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom of Costa Rica, unsuccessful fashioned the [[Thrales]]
A. Karpathy’s blog : The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLP & deep learning 25/39
Karpathy’s demonstration on char2char Linux code generation
*
* Increment the size file of the new incorrect UI_FILTER group information
* of the size generatively.
*/
static int indicate_policy(void) {
int error;
if (fd == MARN_EPT) { /*
* The kernel blank will coeld it to userspace.
*/
if (ss->segment < mem_total) unblock_graph_and_set_blocked();
else ret = 1;
goto bail;
}
A. Karpathy’s blog : The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
NLP & deep learning 26/39
Introduction Deep architectures Exploitation Knowledge Conclusion
RNN & Context modeling
Still char2char architecture
(a)
(b)
(c)
Figure 3: (a) Standard generative RNN; (b) encoder- decoder RNN; (c) concatenated input RNN.
3.2 Generative Concatenative RNNs
Our goal is to generate text in a supervised fashion, condi- tioned on an auxiliary inputxaux. This has been done at the word-level with encoder-decoder models (Figure 3b), in which the auxiliary input is encoded and passed as the ini- tial state to a decoder, which then must preserve this input signal across many sequence steps (Sutskever et al., 2014;
Karpathy and Fei-Fei, 2014). Such models have success- fully produced (short) image captions, but seem impracti- cal for generating full reviews at the character level because signal fromxaux must survive for hundreds of sequence steps.
We take inspiration from an analogy to human text gen- eration. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.
The value of re-iterating high-level material is borne out in one study, Surber and Schroeder (2007), which showed that repetitive subject headings in textbooks resulted in faster learning, less rereading and more accurate answers to high- level questions.
Thus we propose the generative concatenative network, a simple architecture in which inputxaux is concatenated with the character representationx(t)char. Given this new inputx0(t) = [x(t)char;xaux]we can train the model pre-
training time,xauxis a feature of the training set. At pre- diction time, we fix somexaux, concatenating it with each character sampled fromyˆ(t). One might reasonably note that this replicated input information is redundant. How- ever, since it is fixed over the course of the review, we see no reason to require the model to transmit this signal across hundreds of time steps. By replicatingxauxat each input, we free the model to focus on learning the complex interac- tion between the auxiliary input and language, rather than memorizing the input.
3.3 Weight Transplantation
Models with even modestly sized auxiliary input represen- tations are considerably harder to train than a typical un- supervised character model. To overcome this problem, we first train a character model to convergence. Then we trans- plant these weights into a concatenated input model, initial- izing the extra weights (between the input layer and the first hidden layer) to zero. Zero initialization is not problem- atic here because symmetry in the hidden layers is already broken. Thus we guarantee that the model will achieve a strictly lower loss than a character model, saving (days of) repeated training. This scheme bears some resemblance to the pre-training common in the computer vision commu- nity (Yosinski et al., 2014). Here, instead of new output weights, we train new input weights.
3.4 Running the Model in Reverse
Many common document classification models, like tf-idf logistic regression, maximize the likelihood of the train- ing labels given the text. Given our generative model, we can produce a predictor by reversing the order of in- ference, that is, by maximizing the likelihood of the text, given a classification. The relationship between these two tasks (P(xaux|Review)andP(Review|xaux)) follows from Bayes’ rule. That is, our model predicts the con- ditional probability P(Review|xaux) of an entire review given somexaux(such as a star rating). The normalizing term can be disregarded in determining the most probable rating and when the classes are balanced, as they are in our test cases, the prior also vanishes from the decision rule leavingP(xaux|Review)/P(Review|xaux).
In the specific case of author identification, we are solving the problem
argmax
u2users P(Review|u)·P(u)
= argmax
u
YT t=1
P(yt|u, y1, ..., yt 1)
!
·P(u)
= argmax
u
XT
t=1
log(P(yt|u, y1, ..., yt 1))
!
+log(P(u))
Input = char + context Inference = test of all contexts + max. likelyhood
Lipton, Vikram, McAuley, arXiv, 2016 Gen. Concatenative Nets Jointly Learn to Write and Classify Reviews Demo: http://deepx.ucsd.edu/
Figure 5: Most likely star rating as each letter is encoun- tered. The GCN learns to tilt positive by the ‘c’ in ‘excel- lent’ and that the ‘f’ in ‘awful’ reveals negative sentiment.
4.5 Learning Nonlinear Dynamics of Negation By qualitatively evaluating the beliefs of the network about the sentiment (rating) corresponding to various phrases, we can easily show many clear cases where the GCN is able to model the nonlinear dynamics in text. To demonstrate this capacity, we plot the likelihoods of the review conditioned on each setting of the rating for the phrases “this beer is great”, “this beer is not great”, “this beer is bad”, and “this beer is not bad” Figure 6.
5 Related Work
The prospect of capturing meaning in character-level text has long captivated neural network researchers. In the seminal work, “Finding Structure in Time”, Elman (1990) speculated, “one can ask whether the notion ‘word’ (or something which maps on to this concept) could emerge as a consequence of learning the sequential structure of let- ter sequences that form words and sentences (but in which word boundaries are not marked).” In this work, an ‘El- man RNN’ was trained with5input nodes,5output nodes, and a single hidden layer of20nodes, each of which had a corresponding context unit to predict the next character in a sequence. At each step, the network received a binary encoding (not one-hot) of a character and tried to predict the next character’s binary encoding. Elman plots the error
Predicting User from Review
Accuracy AUC Recall@10% GCN-Character .9190 .9979 .9700 TF-IDF n-gram .9133 .9979 .9756
Predicting Item from Review
Accuracy AUC Recall@10% GCN-Character .1280 .9620 .4370 TF-IDF n-gram .2427 .9672 .5974 Table 2: Information retrieval performance for predicting the author of a review and item described.
Predicted
F/V Lager Stout Porter IPA
True
F/V 910 28 7 14 41
Lager 50 927 3 3 17
Stout 16 1 801 180 2
Porter 22 3 111 856 8
IPA 19 12 4 12 953
(a) GCN
Predicted
F/V Lager Stout Porter IPA
True
F/V 923 36 9 10 22
Lager 16 976 0 1 7
Stout 9 4 920 65 2
Porter 11 6 90 887 6
IPA 18 13 1 2 966
(b) ngram tf-idf.
Table 3: Beer category classification confusion matrices.
of the net character by character, showing that it is typi- cally high at the onset of words, but decreasing as it be- comes clear what each word is. While these nets do not possess the size or capabilities of large modern LSTM net- works trained on GPUs, this work lays the foundation for much of our research. Subsequently, in 2011, Sutskever et al. (2011) introduced the model of text generation on which we build. In that paper, the authors generate text resembling Wikipedia articles and New York Times arti- cles. They sanity check the model by showing that it can perform adebaggingtask in which it unscrambles bag-of- words representations of sentences by determining which unscrambling has the highest likelihood. Also relevant to our work is Zhang and LeCun (2015), which trains a strictly discriminative model of text at the character level using convolutional neural networks (LeCun et al., 1989, 1998).
Demonstrating success on both English and Chinese lan- guage datasets, their models achieve high accuracy on a number of classification tasks. Dai and Le (2015) train a character level LSTM to perform document classification, using LSTM RNNs pretrained as either language models
NLP & deep learning 27/39
Introduction Deep architectures Exploitation Knowledge Conclusion
RNN & Context modeling
Still char2char architecture
(a)
(b)
(c)
Figure 3: (a) Standard generative RNN; (b) encoder- decoder RNN; (c) concatenated input RNN.
3.2 Generative Concatenative RNNs
Our goal is to generate text in a supervised fashion, condi- tioned on an auxiliary inputxaux. This has been done at the word-level with encoder-decoder models (Figure 3b), in which the auxiliary input is encoded and passed as the ini- tial state to a decoder, which then must preserve this input signal across many sequence steps (Sutskever et al., 2014;
Karpathy and Fei-Fei, 2014). Such models have success- fully produced (short) image captions, but seem impracti- cal for generating full reviews at the character level because signal fromxaux must survive for hundreds of sequence steps.
We take inspiration from an analogy to human text gen- eration. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.
The value of re-iterating high-level material is borne out in one study, Surber and Schroeder (2007), which showed that repetitive subject headings in textbooks resulted in faster learning, less rereading and more accurate answers to high- level questions.
Thus we propose the generative concatenative network, a simple architecture in which inputxaux is concatenated with the character representationx(t)char. Given this new inputx0(t) = [x(t)char;xaux]we can train the model pre-
training time,xauxis a feature of the training set. At pre- diction time, we fix somexaux, concatenating it with each character sampled fromyˆ(t). One might reasonably note that this replicated input information is redundant. How- ever, since it is fixed over the course of the review, we see no reason to require the model to transmit this signal across hundreds of time steps. By replicatingxauxat each input, we free the model to focus on learning the complex interac- tion between the auxiliary input and language, rather than memorizing the input.
3.3 Weight Transplantation
Models with even modestly sized auxiliary input represen- tations are considerably harder to train than a typical un- supervised character model. To overcome this problem, we first train a character model to convergence. Then we trans- plant these weights into a concatenated input model, initial- izing the extra weights (between the input layer and the first hidden layer) to zero. Zero initialization is not problem- atic here because symmetry in the hidden layers is already broken. Thus we guarantee that the model will achieve a strictly lower loss than a character model, saving (days of) repeated training. This scheme bears some resemblance to the pre-training common in the computer vision commu- nity (Yosinski et al., 2014). Here, instead of new output weights, we train new input weights.
3.4 Running the Model in Reverse
Many common document classification models, like tf-idf logistic regression, maximize the likelihood of the train- ing labels given the text. Given our generative model, we can produce a predictor by reversing the order of in- ference, that is, by maximizing the likelihood of the text, given a classification. The relationship between these two tasks (P(xaux|Review)andP(Review|xaux)) follows from Bayes’ rule. That is, our model predicts the con- ditional probability P(Review|xaux) of an entire review given somexaux(such as a star rating). The normalizing term can be disregarded in determining the most probable rating and when the classes are balanced, as they are in our test cases, the prior also vanishes from the decision rule leavingP(xaux|Review)/P(Review|xaux).
In the specific case of author identification, we are solving the problem
argmax
u2users P(Review|u)·P(u)
= argmax
u
YT t=1
P(yt|u, y1, ..., yt 1)
!
·P(u)
= argmax
u
XT
t=1
log(P(yt|u, y1, ..., yt 1))
!
+log(P(u))
Input = char + context Inference = test of all contexts + max. likelyhood
Lipton, Vikram, McAuley, arXiv, 2016 Gen. Concatenative Nets Jointly Learn to Write and Classify Reviews Demo: http://deepx.ucsd.edu/
Figure 5: Most likely star rating as each letter is encoun- tered. The GCN learns to tilt positive by the ‘c’ in ‘excel- lent’ and that the ‘f’ in ‘awful’ reveals negative sentiment.
4.5 Learning Nonlinear Dynamics of Negation By qualitatively evaluating the beliefs of the network about the sentiment (rating) corresponding to various phrases, we can easily show many clear cases where the GCN is able to model the nonlinear dynamics in text. To demonstrate this capacity, we plot the likelihoods of the review conditioned on each setting of the rating for the phrases “this beer is great”, “this beer is not great”, “this beer is bad”, and “this beer is not bad” Figure 6.
5 Related Work
The prospect of capturing meaning in character-level text has long captivated neural network researchers. In the seminal work, “Finding Structure in Time”, Elman (1990) speculated, “one can ask whether the notion ‘word’ (or something which maps on to this concept) could emerge as a consequence of learning the sequential structure of let- ter sequences that form words and sentences (but in which word boundaries are not marked).” In this work, an ‘El- man RNN’ was trained with5input nodes,5output nodes, and a single hidden layer of20nodes, each of which had a corresponding context unit to predict the next character in a sequence. At each step, the network received a binary encoding (not one-hot) of a character and tried to predict the next character’s binary encoding. Elman plots the error
Predicting User from Review
Accuracy AUC Recall@10% GCN-Character .9190 .9979 .9700 TF-IDF n-gram .9133 .9979 .9756
Predicting Item from Review
Accuracy AUC Recall@10% GCN-Character .1280 .9620 .4370 TF-IDF n-gram .2427 .9672 .5974 Table 2: Information retrieval performance for predicting the author of a review and item described.
Predicted
F/V Lager Stout Porter IPA
True
F/V 910 28 7 14 41
Lager 50 927 3 3 17
Stout 16 1 801 180 2
Porter 22 3 111 856 8
IPA 19 12 4 12 953
(a) GCN
Predicted
F/V Lager Stout Porter IPA
True
F/V 923 36 9 10 22
Lager 16 976 0 1 7
Stout 9 4 920 65 2
Porter 11 6 90 887 6
IPA 18 13 1 2 966
(b) ngram tf-idf.
Table 3: Beer category classification confusion matrices.
of the net character by character, showing that it is typi- cally high at the onset of words, but decreasing as it be- comes clear what each word is. While these nets do not possess the size or capabilities of large modern LSTM net- works trained on GPUs, this work lays the foundation for much of our research. Subsequently, in 2011, Sutskever et al. (2011) introduced the model of text generation on which we build. In that paper, the authors generate text resembling Wikipedia articles and New York Times arti- cles. They sanity check the model by showing that it can perform adebaggingtask in which it unscrambles bag-of- words representations of sentences by determining which unscrambling has the highest likelihood. Also relevant to our work is Zhang and LeCun (2015), which trains a strictly discriminative model of text at the character level using convolutional neural networks (LeCun et al., 1989, 1998).
Demonstrating success on both English and Chinese lan- guage datasets, their models achieve high accuracy on a number of classification tasks. Dai and Le (2015) train a character level LSTM to perform document classification, using LSTM RNNs pretrained as either language models
NLP & deep learning 27/39