Word representations:
A simple and general method for semi-supervised learning
Joseph Turian
with Lev Ratinov and Yoshua Bengio
[Diagram slides contrasting setups for using unlabeled data:
• Supervised training: sup data → sup model
• Semi-sup training? Joint semi-sup: sup data + unsup data → semi-sup model
• Unsup pretraining: unsup data → unsup model, then sup data → semi-sup model
• Unsup feature extraction: unsup data → unsup training → unsup feats, added as more feats to the sup models for sup task 1, sup task 2, sup task 3]
What unsupervised features are most useful in NLP?
Natural language processing
• Words, words, words
• Words, words, words
• Words, words, words
How do we handle words?
• Not very well
“One-hot” word representation
• |V| = |vocabulary|, e.g. 50K for PTB2
[Diagram: window (word -1, word 0, word +1), each word one-hot → Pr dist over labels; input dim 3*|V|, parameter matrix (3*|V|) x m, m labels]
One-hot word representation
• 85% of vocab words occur as only 10% of corpus tokens
• Bad estimate of Pr(label|rare word)
[Diagram: single word (word 0), one-hot → labels; input dim |V|, parameter matrix |V| x m, m labels]
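A minimal sketch (illustrative vocabulary and label sizes, not the authors' code) of what the two one-hot diagrams describe: each window word is a sparse |V|-dim indicator, so the weight matrix grows with |V| and rare words leave most of it poorly estimated.

```python
import numpy as np

V = 50_000   # vocabulary size |V| (illustrative)
m = 45       # number of output labels (illustrative)

def one_hot(word_index, size):
    v = np.zeros(size)
    v[word_index] = 1.0
    return v

# A window (word -1, word 0, word +1) becomes a sparse 3*|V| vector,
# so a linear model needs a (3*|V|) x m weight matrix; a single word
# needs |V| x m.  Rare words touch almost none of these parameters.
window = [17, 4200, 9]                               # toy vocab indices
x = np.concatenate([one_hot(i, V) for i in window])  # shape (3*|V|,)
W = np.zeros((3 * V, m))                             # huge, mostly untrained for rare words
scores = x @ W                                       # unnormalized label scores
print(x.shape, scores.shape)                         # (150000,) (45,)
```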
Approach
• Manual feature engineering
Approach
• Induce word reprs over large corpus, unsupervised
• Use word reprs as word features for supervised task
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
Distributional representations
• Matrix F with one row per vocab word W (size of vocab) and one column per context feature C
• e.g. F(w,v) = Pr(v follows word w)
• Reduce F to d dims: g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, random projection
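A minimal sketch of this recipe under illustrative assumptions: F(w,v) = Pr(v follows w) counted from a toy corpus, and g = truncated SVD (LSA-style) as the dimensionality reducer.

```python
import numpy as np

# Toy corpus and vocabulary (illustrative only).
corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count bigram successors: counts[w, v] = #(v follows w).
counts = np.zeros((len(vocab), len(vocab)))
for w, v in zip(corpus, corpus[1:]):
    counts[idx[w], idx[v]] += 1

# Row-normalize to get F(w, v) = Pr(v follows w); avoid division by zero.
row_sums = counts.sum(axis=1, keepdims=True)
F = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# g(F) = f: keep the top-d left singular directions as d-dim word reprs.
d = 2
U, S, Vt = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]               # one d-dim vector per vocab word
print(dict(zip(vocab, f.round(2))))
```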
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
Class-based word repr
• |C| classes, hard clustering
[Diagram: word 0 represented by one-hot word plus one-hot class; input dim |V|+|C|, parameter matrix (|V|+|C|) x m, m labels]
Class-based word repr
• Hard vs. soft clustering
• Hierarchical vs. flat clustering
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
– Brown (hard, hierarchical) clustering
– HMM (soft, flat) clustering
• Distributed word reprs
Brown clustering
• Hard, hierarchical class-based LM
• Brown et al. (1992)
• Greedy technique for maximizing bigram mutual information
• Merge words by contextual similarity
Brown clustering
cluster(chairman) = `0010’
2-prefix(cluster(chairman)) = `00’
Brown clustering
• Hard, hierarchical class-based LM
• 1000 classes
• Use cluster prefixes of lengths 4, 6, 10, 20
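A minimal sketch (hypothetical feature names) of turning a Brown cluster bit-string into prefix features of the lengths listed above.

```python
def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
    """Return {feature_name: 1} prefix features for one word's cluster path."""
    feats = {}
    for n in lengths:
        # Shorter cluster paths are used whole when the prefix length exceeds them.
        feats["brown-pre%d=%s" % (n, bitstring[:n])] = 1
    return feats

# e.g. cluster(chairman) = '0010' (from the slide above)
print(brown_prefix_features("0010"))
# {'brown-pre4=0010': 1, 'brown-pre6=0010': 1, 'brown-pre10=0010': 1, 'brown-pre20=0010': 1}
```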
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
– Brown (hard, hierarchical) clustering
– HMM (soft, flat) clustering
• Distributed word reprs
Distributed word repr
• k-dimensional (low-dimensional), dense representation
• “word embedding” matrix E of size |V| x k
[Diagram: word 0 → k-dim embedding lookup → distribution over m labels via a k x m matrix]
Sequence labeling w/ embeddings
[Diagram: window (word -1, word 0, word +1) → embedding lookups with tied weights → concatenated 3*k input → (3*k) x m matrix → distribution over m labels]
• “word embedding” matrix E of size |V| x k
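A minimal sketch (illustrative sizes, random weights) of this diagram: window words are looked up in a shared embedding matrix E, concatenated, and scored by a (3*k) x m layer.

```python
import numpy as np

V, k, m = 50_000, 50, 45                   # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, k))     # |V| x k word embedding matrix (tied across positions)
W = rng.normal(scale=0.1, size=(3 * k, m)) # window weights

def label_scores(window_indices):
    """window_indices: vocab ids of (word -1, word 0, word +1)."""
    x = np.concatenate([E[i] for i in window_indices])   # shape (3*k,)
    return x @ W                                         # unnormalized label scores

print(label_scores([17, 4200, 9]).shape)   # (45,)
```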
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
– Collobert + Weston (2008)
– HLBL embeddings (Mnih + Hinton, 2007)
Collobert + Weston 2008
[Diagram: 5-word window w1..w5 → 50-dim embeddings concatenated (50*5 input) → hidden layer of 100 units → scalar score (1 output)]
• Ranking criterion: score(true window w1..w5) > μ + score(corrupted window)
50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008)
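A minimal sketch (illustrative sizes, random weights; not the original C&W code) of the ranking idea: the score of an observed 5-word window must beat the score of a window with a corrupted middle word by a margin μ, trained as a hinge loss.

```python
import numpy as np

V, k, h, win = 50_000, 50, 100, 5              # illustrative sizes
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, k))        # 50-dim word embeddings
W1 = rng.normal(scale=0.1, size=(win * k, h))  # hidden layer (100 units)
W2 = rng.normal(scale=0.1, size=h)             # scalar score output

def score(window_ids):
    x = np.concatenate([E[i] for i in window_ids])   # 50*5 input
    return float(np.tanh(x @ W1) @ W2)

def ranking_loss(true_window, mu=1.0):
    # Corrupt the middle word with a random vocabulary word.
    corrupt = list(true_window)
    corrupt[win // 2] = int(rng.integers(V))
    return max(0.0, mu + score(corrupt) - score(true_window))

print(ranking_loss([11, 17, 4200, 9, 23]))
```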
Log bilinear Language Model (LBL)
[Diagram: context words w1..w4 → linear prediction of w5's embedding]
• Pr(w5) ∝ exp(prediction · target embedding), normalized by Z
HLBL
• HLBL = hierarchical (fast) training of LBL
• Mnih + Hinton (2009)
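A minimal sketch (illustrative sizes; the flat LBL, without HLBL's hierarchical speed-up) of the log-bilinear idea: linearly predict w5's embedding from the context embeddings, then Pr(w5) ∝ exp(prediction · target embedding).

```python
import numpy as np

V, k, n_ctx = 50_000, 50, 4                        # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, k))             # word embeddings
C = [rng.normal(scale=0.1, size=(k, k)) for _ in range(n_ctx)]  # per-position matrices

def next_word_probs(context_ids):
    """context_ids: vocab ids of w1..w4; returns Pr over the vocab for w5."""
    pred = sum(E[w] @ C[i] for i, w in enumerate(context_ids))   # predicted embedding
    logits = E @ pred                                            # dot with every word's embedding
    logits -= logits.max()                                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                           # normalize by Z

probs = next_word_probs([11, 17, 4200, 9])
print(probs.shape, probs.sum())    # (50000,) 1.0
```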
Approach
• Induce word reprs over large corpus, unsupervised
– Brown: 3 days
– HLBL: 1 week, 100 epochs
– C&W: 4 weeks, 50 epochs
• Use word reprs as word features for supervised task
Unsupervised corpus
• RCV1 newswire
• 40M tokens (vocab = all 270K types)
Supervised Tasks
• Chunking (CoNLL, 2000)
– CRF (Sha + Pereira, 2003)
• Named entity recognition (NER)
– Averaged perceptron (linear classifier)
– Based upon Ratinov + Roth (2009)
Unsupervised word reprs as features
Word = “the”
Embedding = [-0.2, …, 1.6]
Brown cluster = 1010001100
(cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)
Unsupervised word reprs as features
Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}
X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}
X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …}
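A minimal sketch (hypothetical helper names) of building these augmented feature dictionaries from a Brown cluster string or an embedding vector.

```python
def add_brown_feats(feats, pos, bitstring, lengths=(4, 6, 10, 20)):
    """Add indicator features for cluster prefixes of the word at relative position pos."""
    for n in lengths:
        feats["class%+d-pre%d=%s" % (pos, n, bitstring[:n])] = 1
    return feats

def add_embedding_feats(feats, pos, embedding):
    """Add one real-valued feature per embedding dimension."""
    for dim, value in enumerate(embedding):
        feats["word%+d-dim%02d" % (pos, dim)] = value
    return feats

x = {"pos-2=DT": 1, "word-2=the": 1}
add_brown_feats(x, -2, "1010001100")
add_embedding_feats(x, -2, [-0.2, 0.7, 1.6])   # toy 3-dim embedding
print(x)
```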
Embeddings: Normalization
E = σ * E / stddev(E)
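A minimal sketch of this normalization, assuming stddev(E) is taken over all entries of the embedding matrix.

```python
import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale E so its overall standard deviation equals sigma."""
    return sigma * E / E.std()

E = np.random.default_rng(0).normal(size=(50_000, 50))
E = normalize_embeddings(E)                  # sigma = 0.1, as in the summary slide
print(round(float(E.std()), 3))              # 0.1
```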
Embeddings: Normalization (Chunking)
Embeddings: Normalization (NER)
Repr capacity (Chunking)
Repr capacity (NER)
Test results (Chunking)
[Bar chart, F1 roughly 93.5–95.5: baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08), 15M]
Test results (NER)
[Bar chart, F1 roughly 86–91: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown]
MUC7 (OOD) results (NER)
[Bar chart, F1 roughly 68–84: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All]
Test results (NER)
[Bar chart, F1 roughly 88.5–91: Lin+Wu (09), 3.4B; Suzuki+Isozaki (08), 37M; Suzuki+Isozaki (08), 1B; All+nonlocal, 37M; Lin+Wu (09), …]
Test results
• Chunking: C&W = Brown
• NER: C&W < Brown
• Why?
Word freq vs word error (Chunking)
Word freq vs word error (NER)
Summary
• Both Brown + word emb can increase acc of near-SOTA system
• Combining can improve accuracy further
• On rare words, Brown > word emb
• Scale parameter σ = 0.1
• Goodies:
http://metaoptimize.com/projects/wordreprs/
Difficulties with word embeddings
• No stopping criterion during unsup training
• More active features (slower sup training)
• Hyperparameters
– Learning rate for model
– (optional) Learning rate for embeddings
– Normalization constant
• vs. Brown clusters, few hyperparams
HMM approach
• Soft, flat class-based repr
• Multinomial distribution over hidden states = word representation
• 80 hidden states
• Huang and Yates (2009)
• No results with HMM approach yet
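A minimal sketch (random stand-in parameters, not Huang and Yates' code) of the idea: run forward-backward over a sentence and use each token's posterior distribution over the 80 hidden states as its word representation. In practice the HMM would first be trained on unlabeled text (e.g. with EM).

```python
import numpy as np

n_states, vocab = 80, 50_000
rng = np.random.default_rng(0)
pi = np.full(n_states, 1.0 / n_states)                # initial state distribution
A  = rng.dirichlet(np.ones(n_states), size=n_states)  # transition matrix (rows sum to 1)
B  = rng.dirichlet(np.ones(vocab), size=n_states)     # emission matrix (rows sum to 1)

def state_posteriors(obs):
    """Forward-backward: Pr(state | whole sentence) at each position."""
    T = len(obs)
    alpha = np.zeros((T, n_states))
    beta  = np.zeros((T, n_states))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()                         # rescale for numerical stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)      # one 80-dim repr per token

reprs = state_posteriors([11, 17, 4200, 9])            # toy vocab ids
print(reprs.shape, reprs.sum(axis=1))                  # (4, 80), rows sum to 1
```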