Word representations:
A simple and general method for semi-supervised learning
Joseph Turian
with Lev Ratinov and Yoshua Bengio
[Diagram slides contrasting setups for using unlabeled data:
• Supervised training: sup data → sup model
• Semi-sup training? Joint semi-sup: sup data + unsup data → semi-sup model
• Unsup pretraining: unsup data → unsup model, then sup data → semi-sup model
• Unsup feature extraction: unsup data → unsup training → unsup feats, added as more feats to the sup models for sup task 1, sup task 2, sup task 3]
What unsupervised features are most useful in NLP?
Natural language processing
• Words, words, words
• Words, words, words
• Words, words, words
How do we handle words?
• Not very well
“One-hot” word representation
• |V| = |vocabulary|, e.g. 50K for PTB2
[Diagram: window (word -1, word 0, word +1), each word one-hot → Pr dist over labels; input dim 3*|V|, parameter matrix (3*|V|) x m, m labels]
One-hot word representation
• 85% of vocab words occur as only 10% of corpus tokens
• Bad estimate of Pr(label|rare word)
[Diagram: single word (word 0), one-hot → labels; input dim |V|, parameter matrix |V| x m, m labels]
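A minimal sketch (illustrative vocabulary and label sizes, not the authors' code) of what the two one-hot diagrams describe: each window word is a sparse |V|-dim indicator, so the weight matrix grows with |V| and rare words leave most of it poorly estimated.

```python
import numpy as np

V = 50_000   # vocabulary size |V| (illustrative)
m = 45       # number of output labels (illustrative)

def one_hot(word_index, size):
    v = np.zeros(size)
    v[word_index] = 1.0
    return v

# A window (word -1, word 0, word +1) becomes a sparse 3*|V| vector,
# so a linear model needs a (3*|V|) x m weight matrix; a single word
# needs |V| x m.  Rare words touch almost none of these parameters.
window = [17, 4200, 9]                               # toy vocab indices
x = np.concatenate([one_hot(i, V) for i in window])  # shape (3*|V|,)
W = np.zeros((3 * V, m))                             # huge, mostly untrained for rare words
scores = x @ W                                       # unnormalized label scores
print(x.shape, scores.shape)                         # (150000,) (45,)
```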
Approach
• Manual feature engineering
Approach
• Induce word reprs over large corpus, unsupervised
• Use word reprs as word features for supervised task
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
Distributional representations
• Matrix F with one row per vocab word W (size of vocab) and one column per context feature C
• e.g. F(w,v) = Pr(v follows word w)
• Reduce F to d dims: g(F) = f, e.g. g = LSI/LSA, LDA, PCA, ICA, random projection
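A minimal sketch of this recipe under illustrative assumptions: F(w,v) = Pr(v follows w) counted from a toy corpus, and g = truncated SVD (LSA-style) as the dimensionality reducer.

```python
import numpy as np

# Toy corpus and vocabulary (illustrative only).
corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Count bigram successors: counts[w, v] = #(v follows w).
counts = np.zeros((len(vocab), len(vocab)))
for w, v in zip(corpus, corpus[1:]):
    counts[idx[w], idx[v]] += 1

# Row-normalize to get F(w, v) = Pr(v follows w); avoid division by zero.
row_sums = counts.sum(axis=1, keepdims=True)
F = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# g(F) = f: keep the top-d left singular directions as d-dim word reprs.
d = 2
U, S, Vt = np.linalg.svd(F, full_matrices=False)
f = U[:, :d] * S[:d]               # one d-dim vector per vocab word
print(dict(zip(vocab, f.round(2))))
```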
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
Class-based word repr
• |C| classes, hard clustering
[Diagram: word 0 represented by one-hot word plus one-hot class; input dim |V|+|C|, parameter matrix (|V|+|C|) x m, m labels]
Class-based word repr
• Hard vs. soft clustering
• Hierarchical vs. flat clustering
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
– Brown (hard, hierarchical) clustering
– HMM (soft, flat) clustering
• Distributed word reprs
Brown clustering
• Hard, hierarchical class-based LM
• Brown et al. (1992)
• Greedy technique for maximizing bigram mutual information
• Merge words by contextual similarity
Brown clustering
cluster(chairman) = `0010’
2-prefix(cluster(chairman)) = `00’
Brown clustering
• Hard, hierarchical class-based LM
• 1000 classes
• Use cluster prefixes of lengths 4, 6, 10, 20
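A minimal sketch (hypothetical feature names) of turning a Brown cluster bit-string into prefix features of the lengths listed above.

```python
def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
    """Return {feature_name: 1} prefix features for one word's cluster path."""
    feats = {}
    for n in lengths:
        # Shorter cluster paths are used whole when the prefix length exceeds them.
        feats["brown-pre%d=%s" % (n, bitstring[:n])] = 1
    return feats

# e.g. cluster(chairman) = '0010' (from the slide above)
print(brown_prefix_features("0010"))
# {'brown-pre4=0010': 1, 'brown-pre6=0010': 1, 'brown-pre10=0010': 1, 'brown-pre20=0010': 1}
```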
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
– Brown (hard, hierarchical) clustering
– HMM (soft, flat) clustering
• Distributed word reprs
Distributed word repr
• k-dimensional (low-dimensional), dense representation
• “word embedding” matrix E of size |V| x k
[Diagram: word 0 → k-dim embedding lookup → distribution over m labels via a k x m matrix]
Sequence labeling w/ embeddings
[Diagram: window (word -1, word 0, word +1) → embedding lookups with tied weights → concatenated 3*k input → (3*k) x m matrix → distribution over m labels]
• “word embedding” matrix E of size |V| x k
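A minimal sketch (illustrative sizes, random weights) of this diagram: window words are looked up in a shared embedding matrix E, concatenated, and scored by a (3*k) x m layer.

```python
import numpy as np

V, k, m = 50_000, 50, 45                   # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, k))     # |V| x k word embedding matrix (tied across positions)
W = rng.normal(scale=0.1, size=(3 * k, m)) # window weights

def label_scores(window_indices):
    """window_indices: vocab ids of (word -1, word 0, word +1)."""
    x = np.concatenate([E[i] for i in window_indices])   # shape (3*k,)
    return x @ W                                         # unnormalized label scores

print(label_scores([17, 4200, 9]).shape)   # (45,)
```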
Less sparse word reprs?
• Distributional word reprs
• Class-based (clustering) word reprs
• Distributed word reprs
– Collobert + Weston (2008)
– HLBL embeddings (Mnih + Hinton, 2007)
Collobert + Weston 2008
[Diagram: 5-word window w1..w5 → 50-dim embeddings concatenated (50*5 input) → hidden layer of 100 units → scalar score (1 output)]
• Ranking criterion: score(true window w1..w5) > μ + score(corrupted window)
50-dim embeddings: Collobert + Weston (2008) t-SNE vis by van der Maaten + Hinton (2008)
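A minimal sketch (illustrative sizes, random weights; not the original C&W code) of the ranking idea: the score of an observed 5-word window must beat the score of a window with a corrupted middle word by a margin μ, trained as a hinge loss.

```python
import numpy as np

V, k, h, win = 50_000, 50, 100, 5              # illustrative sizes
rng = np.random.default_rng(0)
E  = rng.normal(scale=0.1, size=(V, k))        # 50-dim word embeddings
W1 = rng.normal(scale=0.1, size=(win * k, h))  # hidden layer (100 units)
W2 = rng.normal(scale=0.1, size=h)             # scalar score output

def score(window_ids):
    x = np.concatenate([E[i] for i in window_ids])   # 50*5 input
    return float(np.tanh(x @ W1) @ W2)

def ranking_loss(true_window, mu=1.0):
    # Corrupt the middle word with a random vocabulary word.
    corrupt = list(true_window)
    corrupt[win // 2] = int(rng.integers(V))
    return max(0.0, mu + score(corrupt) - score(true_window))

print(ranking_loss([11, 17, 4200, 9, 23]))
```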
Log bilinear Language Model (LBL)
[Diagram: context words w1..w4 → linear prediction of w5's embedding]
• Pr(w5) ∝ exp(prediction · target embedding), normalized by Z
HLBL
• HLBL = hierarchical (fast) training of LBL
• Mnih + Hinton (2009)
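A minimal sketch (illustrative sizes; the flat LBL, without HLBL's hierarchical speed-up) of the log-bilinear idea: linearly predict w5's embedding from the context embeddings, then Pr(w5) ∝ exp(prediction · target embedding).

```python
import numpy as np

V, k, n_ctx = 50_000, 50, 4                        # illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, k))             # word embeddings
C = [rng.normal(scale=0.1, size=(k, k)) for _ in range(n_ctx)]  # per-position matrices

def next_word_probs(context_ids):
    """context_ids: vocab ids of w1..w4; returns Pr over the vocab for w5."""
    pred = sum(E[w] @ C[i] for i, w in enumerate(context_ids))   # predicted embedding
    logits = E @ pred                                            # dot with every word's embedding
    logits -= logits.max()                                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()                                           # normalize by Z

probs = next_word_probs([11, 17, 4200, 9])
print(probs.shape, probs.sum())    # (50000,) 1.0
```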
Approach
• Induce word reprs over large corpus, unsupervised
– Brown: 3 days
– HLBL: 1 week, 100 epochs
– C&W: 4 weeks, 50 epochs
• Use word reprs as word features for supervised task
Unsupervised corpus
• RCV1 newswire
• 40M tokens (vocab = all 270K types)
Supervised Tasks
• Chunking (CoNLL, 2000)
– CRF (Sha + Pereira, 2003)
• Named entity recognition (NER)
– Averaged perceptron (linear classifier)
– Based upon Ratinov + Roth (2009)
Unsupervised word reprs as features
Word = “the”
Embedding = [-0.2, …, 1.6]
Brown cluster = 1010001100
(cluster 4-prefix = 1010, cluster 6-prefix = 101000, …)
Unsupervised word reprs as features
Orig X = {pos-2=“DT”: 1, word-2=“the”: 1, ...}
X w/ Brown = {pos-2=“DT”: 1, word-2=“the”: 1, class-2-pre4=“1010”: 1, class-2-pre6=“101000”: 1}
X w/ emb = {pos-2=“DT”: 1, word-2=“the”: 1, word-2-dim00: -0.2, …}
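A minimal sketch (hypothetical helper names) of building these augmented feature dictionaries from a Brown cluster string or an embedding vector.

```python
def add_brown_feats(feats, pos, bitstring, lengths=(4, 6, 10, 20)):
    """Add indicator features for cluster prefixes of the word at relative position pos."""
    for n in lengths:
        feats["class%+d-pre%d=%s" % (pos, n, bitstring[:n])] = 1
    return feats

def add_embedding_feats(feats, pos, embedding):
    """Add one real-valued feature per embedding dimension."""
    for dim, value in enumerate(embedding):
        feats["word%+d-dim%02d" % (pos, dim)] = value
    return feats

x = {"pos-2=DT": 1, "word-2=the": 1}
add_brown_feats(x, -2, "1010001100")
add_embedding_feats(x, -2, [-0.2, 0.7, 1.6])   # toy 3-dim embedding
print(x)
```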
Embeddings: Normalization
E = σ * E / stddev(E)
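A minimal sketch of this normalization, assuming stddev(E) is taken over all entries of the embedding matrix.

```python
import numpy as np

def normalize_embeddings(E, sigma=0.1):
    """Rescale E so its overall standard deviation equals sigma."""
    return sigma * E / E.std()

E = np.random.default_rng(0).normal(size=(50_000, 50))
E = normalize_embeddings(E)                  # sigma = 0.1, as in the summary slide
print(round(float(E.std()), 3))              # 0.1
```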
Embeddings: Normalization (Chunking)
Embeddings: Normalization (NER)
Repr capacity (Chunking)
Repr capacity (NER)
Test results (Chunking)
[Bar chart, F1 roughly 93.5–95.5: baseline, HLBL, C&W, Brown, C&W+Brown, Suzuki+Isozaki (08), 15M]
Test results (NER)
[Bar chart, F1 roughly 86–91: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown]
MUC7 (OOD) results (NER)
[Bar chart, F1 roughly 68–84: Baseline, Baseline+nonlocal, Gazetteers, C&W, HLBL, Brown, All]
Test results (NER)
[Bar chart, F1 roughly 88.5–91: Lin+Wu (09), 3.4B; Suzuki+Isozaki (08), 37M; Suzuki+Isozaki (08), 1B; All+nonlocal, 37M; Lin+Wu (09), …]
Test results
• Chunking: C&W = Brown
• NER: C&W < Brown
• Why?
Word freq vs word error (Chunking)
Word freq vs word error (NER)
Summary
• Both Brown + word emb can increase acc of near-SOTA system
• Combining can improve accuracy further
• On rare words, Brown > word emb
• Scale parameter σ = 0.1
• Goodies:
http://metaoptimize.com/projects/wordreprs/
Difficulties with word embeddings
• No stopping criterion during unsup training
• More active features (slower sup training)
• Hyperparameters
– Learning rate for model
– (optional) Learning rate for embeddings
– Normalization constant
• vs. Brown clusters, few hyperparams
HMM approach
• Soft, flat class-based repr
• Multinomial distribution over hidden states = word representation
• 80 hidden states
• Huang and Yates (2009)
• No results with HMM approach yet
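A minimal sketch (random stand-in parameters, not Huang and Yates' code) of the idea: run forward-backward over a sentence and use each token's posterior distribution over the 80 hidden states as its word representation. In practice the HMM would first be trained on unlabeled text (e.g. with EM).

```python
import numpy as np

n_states, vocab = 80, 50_000
rng = np.random.default_rng(0)
pi = np.full(n_states, 1.0 / n_states)                # initial state distribution
A  = rng.dirichlet(np.ones(n_states), size=n_states)  # transition matrix (rows sum to 1)
B  = rng.dirichlet(np.ones(vocab), size=n_states)     # emission matrix (rows sum to 1)

def state_posteriors(obs):
    """Forward-backward: Pr(state | whole sentence) at each position."""
    T = len(obs)
    alpha = np.zeros((T, n_states))
    beta  = np.zeros((T, n_states))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()                         # rescale for numerical stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)      # one 80-dim repr per token

reprs = state_posteriors([11, 17, 4200, 9])            # toy vocab ids
print(reprs.shape, reprs.sum(axis=1))                  # (4, 80), rows sum to 1
```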