• Aucun résultat trouvé

Improving Distributional Similarity with Lessons Learned from Word Embeddings

N/A
N/A
Protected

Academic year: 2022

Partager "Improving Distributional Similarity with Lessons Learned from Word Embeddings"

Copied!
16
0
0

Texte intégral

(1)

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy Yoav Goldberg Ido Dagan Computer Science Department

Bar-Ilan University Ramat-Gan, Israel

{omerlevy,yogo,dagan}@cs.biu.ac.il

Abstract

Recent trends suggest that neural- network-inspired word embedding models outperform traditional count-based distri- butional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter op- timizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.

1 Introduction

Understanding the meaning of a word is at the heart of natural language processing (NLP). While a deep, human-like, understanding remains elu- sive, many methods have been successful in recov- ering certain aspects of similarity between words.

Recently, neural-network based approaches in which words are “embedded” into a low- dimensional space were proposed by various au- thors (Bengio et al., 2003; Collobert and Weston, 2008). These models represent each word as ad- dimensional vector of real numbers, and vectors that are close to each other are shown to be se- mantically related. In particular, a sequence of pa- pers by Mikolov et al. (2013a; 2013b) culminated in the skip-gram with negative-sampling training method (SGNS): an efficient embedding algorithm that provides state-of-the-art results on various lin- guistic tasks. It was popularized viaword2vec, a program for creating word embeddings.

A recent study by Baroni et al. (2014) con- ducts a set of systematic experiments compar- ing word2vec embeddings to the more tradi- tional distributional methods, such as pointwise mutual information (PMI) matrices (see Turney and Pantel (2010) and Baroni and Lenci (2010) for comprehensive surveys). These results suggest that the new embedding methods consistently out- perform the traditional methods by a non-trivial margin on many similarity-oriented tasks. How- ever, state-of-the-art embedding methods are all based on the same bag-of-contexts representation of words. Furthermore, analysis by Levy and Goldberg (2014c) shows thatword2vec’s SGNS is implicitly factorizing a word-context PMI ma- trix. That is, the mathematical objective and the sources of information available to SGNS are in fact very similar to those employed by the more traditional methods.

What, then, is the source of superiority (or per- ceived superiority) of these recent embeddings?

While the focus of the presentation in the word- embedding literature is on the mathematical model and the objective being optimized, other factors af- fect the results as well. In particular, embedding algorithms suggest some naturalhyperparameters that can be tuned; many of which were already tuned to some extent by the algorithms’ design- ers. Some hyperparameters, such as the number of negative samples to use, are clearly marked as tunable. Other modifications, such as smoothing the negative-sampling distribution, are reported in passing and considered thereafter as part of the al- gorithm. Others still, such as dynamically-sized context windows, are not even mentioned in some of the papers, but are part of the canonical imple- mentation. All of these modifications and system design choices, which we collectively denote as hyperparameters, are part of the final algorithm, and, as we show, have a substantial impact on per- formance.

211

Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015. Action Editor: Patrick Pantel.

(2)

In this work, we make these hyperparameters explicit, and show how they can beadapted and transferred into the traditional count-based ap- proach. To asses how each hyperparameter con- tributes to the algorithms’ performance, we con- duct a comprehensive set of experiments and com- pare four different representation methods, while controlling for the various hyperparameters.

Once adapted across methods, hyperparameter tuning significantly improves performance in ev- ery task. In many cases, changing the setting of a single hyperparameter yields a greater increase in performance than switching to a better algorithm or training on a larger corpus.

In particular, word2vec’s smoothing of the negative sampling distribution can be adapted to PPMI-based methods by introducing a novel, smoothed variant of the PMI association measure (see Section 3.2). Using this variant increases per- formance by over 3 points per task, on average.

We suspect that this smoothing partially addresses the “Achilles’ heel” of PMI: its bias towards co- occurrences of rare words.

We also show that when all methods are allowed to tune a similar set of hyperparameters, their per- formance is largely comparable. In fact, there is no consistent advantage to one algorithmic approach over another, a result that contradicts the claim that embeddings are superior to count-based methods.

2 Background

We consider four word representation methods:

the explicit PPMI matrix, SVD factorization of said matrix, SGNS, and GloVe. For historical reasons, we refer to PPMI and SVD as “count- based” representations, as opposed to SGNS and GloVe, which are often referred to as “neural”

or “prediction-based” embeddings. All of these methods (as well as all other “skip-gram”-based embedding methods) are essentially bag-of-words models, in which the representation of each word reflects a weighted bag of context-words that co- occur with it. Such bag-of-word embedding mod- els were previously shown to perform as well as or better than more complex embedding methods on similarity and analogy tasks (Mikolov et al., 2013a; Pennington et al., 2014).

Notation We assume a collection of wordsw∈ VW and their contextsc∈VC, whereVW andVC are the word and context vocabularies, and denote the collection of observed word-context pairs as

D. We use#(w, c)to denote the number of times the pair(w, c)appears inD. Similarly,#(w) = P

c0VC#(w, c0) and#(c) = P

w0VW#(w0, c) are the number of timesw and coccurred in D, respectively. In some algorithms, words and con- texts are embedded in a space ofd dimensions.

In these cases, each wordw ∈ VW is associated with a vector w~ ∈ Rd and similarly each con- textc ∈ VC is represented as a vector~c ∈ Rd. We sometimes refer to the vectorsw~ as rows in a

|VW|×dmatrix W, and to the vectors~cas rows in a|VC|×dmatrixC. When referring to embed- dings produced by a specific methodx, we may useWxandCx(e.g.WSGN SorCSV D). All vec- tors are normalized to unit length before they are used for similarity calculation, making cosine sim- ilarity and dot-product equivalent (see Section 3.3 for further discussion).

Contexts D is commonly obtained by taking a corpusw1, w2, . . . , wn and defining the contexts of wordwi as the words surrounding it in anL- sized window wiL, . . . , wi1, wi+1, . . . , wi+L. While other definitions of contexts have been stud- ied (Pad´o and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014a) this work fo- cuses on fixed-window bag-of-words contexts.

2.1 Explicit Representations (PPMI Matrix) The traditional way to represent words in the distributional approach is to construct a high- dimensional sparse matrix M, where each row represents a word w in the vocabulary VW and each column a potential contextc∈VC. The value of each matrix cellMijrepresents the association between the wordwiand the contextcj. A popular measure of this association is pointwise mutual in- formation (PMI) (Church and Hanks, 1990). PMI is defined as the log ratio betweenwandc’s joint probability and the product of their marginal prob- abilities, which can be estimated by:

P M I(w, c) = log ˆPˆ(w,c)

P(w) ˆP(c) = log#(w,c)#(w)·#(c)·|D| The rows ofMPMIcontain many entries of word- context pairs (w, c) that were never observed in the corpus, for whichP M I(w, c) = log 0 =−∞. A common approach is thus to replace theMPMI matrix withM0PMI, in whichP M I(w, c) = 0 in cases where#(w, c) = 0. A more consistent ap- proach is to usepositive PMI(PPMI), in which all

(3)

negative values are replaced by0:

P P M I(w, c) = max (P M I(w, c),0) Bullinaria and Levy (2007) showed that MPPMI outperformsM0PMIon semantic similarity tasks.

A well-known shortcoming of PMI, which per- sists in PPMI, is its bias towards infrequent events (Turney and Pantel, 2010). A rare context c that co-occurred with a target wordweven once, will often yield relatively high PMI score because Pˆ(c), which is in PMI’s denominator, is very small. This creates a situation in which the top

“distributional features” (contexts) ofw are often extremely rare words, which do not necessarily appear in the respective representations of words that are semantically similar tow. Nevertheless, the PPMI measure is widely regarded as state-of- the-art for these kinds of distributional-similarity models.

2.2 Singular Value Decomposition (SVD) While sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved com- putational efficiency and, arguably, better gener- alization. Such vectors can be obtained by per- forming dimensionality reduction over the sparse high-dimensional matrix.

A common method of doing so is truncated Sin- gular Value Decomposition (SVD), which finds the optimal rankdfactorization with respect toL2

loss (Eckart and Young, 1936). It was popular- ized in NLP via Latent Semantic Analysis (LSA) (Deerwester et al., 1990).

SVD factorizesMinto the product of three ma- tricesU ·Σ·V>, where U and V are orthonor- mal andΣis a diagonal matrix of eigenvalues in decreasing order. By keeping only the topdele- ments ofΣ, we obtainMd = Ud·Σd·Vd>. The dot-products between the rows ofW =Ud·Σdare equal to the dot-products between rows ofMd.

In the setting of word-context matrices, the dense,d-dimensional rows ofWcan substitute the very high-dimensional rows ofM. Indeed, a com- mon approach in NLP literature is factorizing the PPMI matrixMPPMI with SVD, and then taking the rows of:

WSVD=Ud·Σd CSVD=Vd (1) as word and context representations, respectively.

2.3 Skip-Grams with Negative Sampling (SGNS)

We present a brief sketch of SGNS – the skip-gram embedding model introduced in (Mikolov et al., 2013a) trained using the negative-sampling proce- dure presented in (Mikolov et al., 2013b). A de- tailed derivation of SGNS is available in (Gold- berg and Levy, 2014).

SGNS seeks to represent each word w ∈ VW

and each contextc ∈ VC asd-dimensional vec- tors w~ and~c, such that words that are “similar”

to each other will have similar vector representa- tions. It does so by trying to maximize a function of the productw~ ·~cfor(w, c)pairs that occur in D, and minimize it for negative examples:(w, cN) pairs that do not necessarily occur inD. The neg- ative examples are created by stochastically cor- rupting observed(w, c)pairs fromD– hence the name “negative sampling”. For each observation of(w, c), SGNS drawsk contexts from the em- pirical unigram distribution PD(c) = #(c)|D| . In word2vec’s implementation of SGNS, this dis- tribution is smoothed, a design choice that boosts its performance. We explore this hyperparameter and others in Section 3.

SGNS as Implicit Matrix Factorization Levy and Golberg (2014c) show that SGNS’s corpus- level objective achieves its optimal value when:

~

w·~c=PMI(w, c)−logk

Hence, SGNS is implicitly factorizing a word- context matrix whose cell’s values are PMI, shifted by a global constant (logk):

W·C>=MPMI−logk

SGNS performs a different kind of factorization from traditional SVD (see 2.2). In particular, the factorization’s loss function is not based on L2, and is much less sensitive to extreme and infi- nite values due to a sigmoid function surrounding

~

w·~c. Furthermore, the loss is weighted, caus- ing rare(w, c)pairs to affect the objective much less than frequent ones. Thus, while many cells in MPMIequallog 0 =−∞, the cost incurred for re- constructing these cells as a small negative value, such as−5instead of as−∞, is negligible.1

1The logistic (sigmoidal) objective also curbs very high positive values of PMI. We suspect that this property, along with the weighted factorization property, addresses the afore- mentioned shortcoming of PMI, i.e. its overweighting of in- frequent events.

(4)

An additional difference from SVD, which will be explored further in Section 3.3, is that SVD factorizesM into three matrices, two of them or- thonormal and one diagonal, while SGNS factor- izesM into two unconstrained matrices.

2.4 Global Vectors (GloVe)

GloVe (Pennington et al., 2014) seeks to represent each wordw ∈ VW and each contextc ∈ VC as d-dimensional vectorsw~and~csuch that:

~

w·~c+bw+bc= log (#(w, c)) ∀(w, c)∈D Here,bwandbc(scalars) are word/context-specific biases, and are also parameters to be learned in addition tow~and~c.

GloVe’s objective is explicitly defined as a fac- torization of the log-count matrix, shifted by the entire vocabularies’ bias terms:

Mlog(#(w,c))

≈W·C>+b~w+b~c Whereb~wis a|VW|dimensional row vector andb~c is a|VC|dimensional column vector.

If we were to fix bw = log #(w) and bc = log #(c), this would be almost2equivalent to fac- torizing the PMI matrix shifted bylog(|D|). How- ever, GloVe learns these parameters, giving an ex- tra degree of freedom over SVD and SGNS. The model is fit to minimize a weighted least square loss, giving more weight to frequent(w, c)pairs.3 Finally, an important novelty introduced in (Pennington et al., 2014) is that, assumingVC = VW, one could take the representation of a word wto bew~+c~wwherec~wis the row correspond- ing tow in C>. This may improve results con- siderably in some circumstances, as we discuss in Sections 3.3 and 6.2.

3 Transferable Hyperparameters

This section presents various hyperparameters im- plemented inword2vec and GloVe, and shows how to adapt and apply them to count-based methods. We divide these into: pre-processing hyperparameters, which affect the algorithms’

input data; association metric hyperparameters, which define how word-context interactions are calculated; and post-processing hyperparameters, which modify the resulting word vectors.

2GloVe’s objective ignores (w, c)pairs that do not co- occur in the training corpus, treating them as missing values.

SGNS, on the other hand, does take such pairs into account through the negative sampling procedure.

3The weighting formula is another hyper-parameter that could be tuned, but we keep to the default weighting scheme.

3.1 Pre-processing Hyperparameters

All the matrix-based algorithms rely on a col- lectionDof word-context pairs (w, c)as inputs.

word2vecintroduces three novel variations on the wayD is collected, which can be easily ap- plied to other methods beyond SGNS.

Dynamic Context Window (dyn) The tradi- tional approaches usually use a constant-sized un- weighted context window. For instance, if the win- dow size is5, then a word five tokens apart from the target is treated the same as an adjacent word.

Following the intuition that contexts closer to the target are more important, context words can be weighted according to their distance from the fo- cus word. Both GloVe and word2vec employ such a weighting scheme, and while less com- mon, this approach was also explored in tradi- tional count-based methods, e.g. (Sahlgren, 2006).

GloVe’s implementation weights contexts using the harmonic function, e.g. a context word three tokens away will be counted as13of an occurrence.

On the other hand,word2vec’s implementation is equivalent to weighing by the distance from the focus word divided by the window size. For ex- ample, a size-5 window will weigh its contexts by

5

5,45,35,25,15.

The reason we call this modification dynamic context windows is because word2vec imple- ments its weighting scheme by uniformly sam- pling the actual window size between1andL, for each token (Mikolov et al., 2013a). The sampling method is faster than the direct method in terms of training time, since there are fewer SGD updates in SGNS and fewer non-zero matrix cells in the other methods. For our systematic experiments, we used theword2vec-style sampled version for all methods, including GloVe.

Subsampling (sub) Subsampling is a method of diluting very frequent words, akin to removing stop-words. The subsampling method presented in (Mikolov et al., 2013a) randomly removes words that are more frequent than some thresholdtwith a probability ofp, wherefmarks the word’s corpus frequency:

p= 1− rt

f (2)

Following the recommendation in (Mikolov et al., 2013a), we uset= 10−5in our experiments.4

4word2vec’s code implements a slightly different for- mula:p= fftq

t

f. We followed the formula presented

(5)

Another implementation detail of subsampling in word2vec is that the removal of tokens is done before the corpus is processed into word- context pairs. This practically enlarges the con- text window’s size for many tokens, because they can now reach words that were not in their origi- nalL-sized windows. We call this kind of subsam- pling “dirty”, as opposed to “clean” subsampling, which removes subsampled words without affect- ing the context window’s size. We found their im- pact on performance comparable, and report re- sults of only the “dirty” variant.

Deleting Rare Words (del) While it is com- mon to ignore words that are rare in the training corpus, word2vec removes these tokens from the corpusbeforecreating context windows. As with subsampling, this variation narrows the dis- tance between tokens, inserting new word-context pairs that did not exist in the original corpus with the same window size. Though this variation may also have an effect on performance, preliminary experiments showed that it was small, and we therefore do not investigate its effect in this paper.

3.2 Association Metric Hyperparameters The PMI (or PPMI) between a word and its con- text is well known to be an effective association measure in the word similarity literature. Levy and Golberg (2014c) show that SGNS is implicitly fac- torizing a word-context matrix whose cell’s val- ues are shifted PMI. Following their analysis, we present two variations of the PMI (and implicitly PPMI) association metric, which we adopt from SGNS. These enhancements of PMI are not di- rectly applicable to GloVe, which, by definition, uses a different association measure.

Shifted PMI (neg) SGNS has a natural hyper- parameter k (the number of negative samples), which affects the value that SGNS is trying to op- timize for each(w, c): P M I(w, c)−logk. The shift caused byk > 1can be applied to distri- butional methods through shifted PPMI (Levy and Goldberg, 2014c):

SP P M I(w, c) = max (P M I(w, c)−logk,0) It is important to understand that in SGNS,khas two distinct functions. First, it is used to better estimate the distribution of negative examples; a higherk means more data and better estimation.

in the original paper (equation 2).

Second, it acts as a prior on the probability of ob- serving a positive example (an actual occurrence of(w, c)in the corpus) versus a negative example;

a higherkmeans that negative examples are more probable. Shifted PPMI captures only the second aspect ofk (a prior). We experiment with three values ofk:1,5,15.

Context Distribution Smoothing (cds) In word2vec, negative examples (contexts) are sampled according to a smoothed unigram dis- tribution. In order to smooth the original con- texts’ distribution, all context counts are raised to the power of α (Mikolov et al. (2013b) found α = 0.75 to work well). This smoothing varia- tion has an analog when calculating PMI directly:

P M Iα(w, c) = log Pˆ(w, c)

Pˆ(w) ˆPα(c) (3) Pˆα(c) = # (c)α

P

c# (c)α

Like other smoothing techniques (Pantel and Lin, 2002; Turney and Littman, 2003), context distri- bution smoothing alleviates PMI’s bias towards rare words. It does so by enlarging the probability of sampling a rare context (sincePˆα(c) > Pˆ(c) when cis infrequent), which in turn reduces the PMI of(w, c)for anywco-occurring with the rare contextc. In Section 6.2 we demonstrate that this novel variant of PMI is very effective, and consis- tently improves performance across tasks, meth- ods, and configurations. We experiment with two values ofα:1(unsmoothed) and0.75(smoothed).

3.3 Post-processing Hyperparameters We present three hyperparameters that modify the algorithms’ output: the word vectors.

Adding Context Vectors (w+c) Pennington et al. (2014) propose using the context vectors in ad- dition to the word vectors as GloVe’s output. For example, the word “cat” can be represented as:

~vcat=w~cat+~ccat

wherew~ and~care the word and context embed- dings, respectively.

This vector combination was originally moti- vated as an ensemble method. Here, we provide a different interpretation of its effect on the co- sine similarity function. Specifically, we show

(6)

that adding context vectors effectively adds first- order similarity terms to the second-order similar- ity function.

Consider the cosine similarity of two words:

cos(x, y) = ~vx·~vy

√~vx·~vxp

~vy·~vy

= (w~x+~cx)·(w~y+~cy) p(w~x+~cx)·(w~x+~cx)p

(w~y+~cy)·(w~y+~cy)

= w~x·w~y+~cx·~cy+w~x·~cy+~cx·w~y

pw~2x+ 2w~x·~cx+~c2xq

~

w2y+ 2w~y·~cy+~c2y

= w~x·w~y+~cx·~cy+w~x·~cy+~cx·w~y 2√

~

wx·~cx+ 1p

~

wy·~cy+ 1 (4) (The last step follows because, as noted in Sec- tion 2, the word and context vectors are normal- ized after training.)

The resulting expression combines similarity terms which can be divided into two groups:

second-order similarity (wx·wy,cx·cy) and first- order similarity (w·c). The second-order terms measure the extent to which the two words arere- placeablebased on their tendencies to appear in similar contexts, and are the manifestation of Har- ris’s (1954) distributional hypothesis. The first- order terms measure the tendency of one word to appear in the context of the other.

In SVD and SGNS, the first-order similarity terms betweenw andcconverge toP M I(w, c), while in GloVe it converges into their log-count (with some bias terms).

The similarity calculated in equation 4 is thus a symmetric combination of the first-order and sec- ond order similarities ofxandy, normalized by a function of their reflective first-order similarities:

sim(x, y) = sim2(x, y) +sim1(x, y) psim1(x, x) + 1p

sim1(y, y) + 1 This similarity measure states that words are similar if they tend to appear in similar contexts, orif they tend to appear in the contexts of each other (and preferably both).

The additive w+c representation can be triv- ially applied to other methods that produce distinct word and context vectors (e.g. SVD and SGNS).

On the other hand, explicit methods such as PPMI are sparse by definition, and nullify the vast ma- jority of first-order similarities. We therefore do not applyw+cto PPMI in this study.

Eigenvalue Weighting (eig) As mentioned in Section 2.2, the word and context vectors derived using SVD are typically represented by (equa- tion 1):

WSVD=Ud·Σd CSVD=Vd

However, this is not necessarily the optimal con- struction ofWSVD for word similarity tasks. We note that in the SVD-based factorization, the re- sulting word and context matrices have very dif- ferent properties. In particular, the context ma- trix CSVD is orthonormal while the word matrix WSVD is not. On the other hand, the factorization achieved by SGNS’s training procedure is much more “symmetric”, in the sense that neitherWW2V norCW2Vis orthonormal, and no particular bias is given to either of the matrices in the training ob- jective. Similar symmetry can be achieved with the following factorization:

W =Ud·p

Σd C=Vd·p

Σd (5) Alternatively, the eigenvalue matrix can be dis- missed altogether:

W =Ud C =Vd (6)

While it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically (see Sec- tion 6.1). A similar observation was made by Caron (2001), who suggested adding a parameter pto control the eigenvalue matrixΣ:

WSVDp =Ud·Σpd

Later studies show that weighting the eigenvalue matrixΣdwith the exponentpcan have a signif- icant effect on performance, and should be tuned (Bullinaria and Levy, 2012; Turney, 2012). Adapt- ing the notion of symmetric decomposition from SGNS, this study experiments only with symmet- ric variants of SVD (p= 0,p= 0.5; equations (6) and (5)) and the traditional factorization (p = 1; equation (1)).

Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normal- ized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine sim- ilarity. This normalization is a hyperparameter set- ting in itself, and other normalizations are also ap- plicable. The trivial case is using no normalization

(7)

Hyper- Explored Applicable

parameter Values Methods

win 2,5,10 All

dyn none, with All

sub none, dirty, clean All

del none, with All

neg 1,5,15 PPMI, SVD, SGNS

cds 1,0.75 PPMI, SVD, SGNS

w+c onlyw,w+c SVD, SGNS, GloVe

eig 0,0.5,1 SVD

nrm none, row, col, both All

Table 1: The space of hyperparameters explored in this work.

Explored only in preliminary experiments.

at all. Another setting, used by Pennington et al.

(2014), normalizes the columns ofW rather than its rows. It is also possible to consider a fourth setting that combines both row and column nor- malizations.

Note that column normalization is akin to dis- missing the eigenvalues in SVD. While the hy- perparameter settingeig = 0 has an important positive impact on SVD, the same cannot be said of column normalization on other methods. In preliminary experiments, we tried the four differ- ent normalization schemes described above (none, row, column, and both), and found the standardL2

normalization ofW’s rows (i.e. using the cosine similarity measure) to be consistently superior.

4 Experimental Setup

We explored a large space of hyperparameters, representations, and evaluation datasets.

4.1 Hyperparameter Space

Table 1 enumerates the hyperparameter space. We generated 72 PPMI, 432 SVD, 144 SGNS, and 24 GloVe representations; 672 overall.

4.2 Word Representations

Corpus All models were trained on English Wikipedia (August 2013 dump), pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 mil- lion sentences, spanning 1.5 billion tokens. Mod- els were derived using windows of 2, 5, and 10 tokens to each side of the focus word (the window size parameter is denotedwin). Words that ap- peared less than 100 times in the corpus were ig- nored, resulting in vocabularies of 189,533 terms for both words and contexts.

Training Embeddings We trained a 500- dimensional representation with SVD, SGNS, and

GloVe. SGNS was trained using a modified ver- sion ofword2vecwhich receives a sequence of pre-extracted word-context pairs (Levy and Gold- berg, 2014a). GloVe was trained with 50 itera- tions using the original implementation (Penning- ton et al., 2014), applied to the pre-extracted word- context pairs.

4.3 Test Datasets

We evaluated each word representation on eight datasets covering similarity and analogy tasks.

Word Similarity We used six datasets to eval- uate word similarity: the popular WordSim353 (Finkelstein et al., 2002) partitioned into two datasets,WordSim SimilarityandWordSim Relat- edness (Zesch et al., 2008; Agirre et al., 2009);

Bruni et al.’s (2012) MEN dataset; Radinsky et al.’s (2011) Mechanical Turk dataset; Luong et al.’s (2013)Rare Words dataset; and Hill et al.’s (2014)SimLex-999dataset. All these datasets con- tain word pairs together with human-assigned sim- ilarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similar- ities, and measuring the correlation (Spearman’s ρ) with the human ratings.

Analogy The twoanalogydatasets present ques- tions of the form “ais toa asbis tob”, where bis hidden, and must be guessed from the entire vocabulary. MSR’s analogy dataset (Mikolov et al., 2013c) contains 8000 morpho-syntactic anal- ogy questions, such as “goodis tobestassmartis tosmartest”. Google’s analogy dataset (Mikolov et al., 2013a) contains 19544 questions, about half of the same kind as in MSR (syntactic analogies), and another half of a more semantic nature, such as capital cities (“Parisis toFranceasTokyois to Japan”). After filtering questions involving out- of-vocabulary words, i.e. words that appeared in English Wikipedia less than 100 times, we remain with 7118 instances in MSR and 19258 instances in Google. The analogy questions are answered using 3CosAdd (addition and subtraction):

arg max

b∈VW\{a,b,a}cos(b, aa+b) =

arg max

b∈VW\{a,b,a}(cos(b, a)cos(b, a) + cos(b, b)) as well as 3CosMul, which is state-of-the-art in analogy recovery (Levy and Goldberg, 2014b):

arg max

bVW\{a,b,a}

cos(b, a)·cos(b, b) cos(b, a) +ε

(8)

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul

PPMI .709 .540 .688 .648 .393 .338 .491 /.650 .246 / .439

SVD .776 .658 .752 .557 .506 .422 .452 / .498 .357 / .412

SGNS .724 .587 .686 .678 .434 .401 .530 / .552 .578 /.592

GloVe .666 .467 .659 .599 .403 .398 .442 / .465 .529 / .576

Table 2: Performance of each method across different tasks in the “vanilla” scenario (all hyperparameters set to default):

win= 2;dyn=none;sub=none;neg= 1;cds= 1;w+c=onlyw;eig= 0.0.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul

PPMI .755 .688 .745 .686 .423 .354 .553 /.629 .289 / .413

SVD .784 .672 .777 .625 .514 .402 .547 / .587 .402 / .457

SGNS .773 .623 .723 .676 .431 .423 .599 / .625 .514 / .546

GloVe .667 .506 .685 .599 .372 .389 .539 / .563 .503 /.559

CBOW .766 .613 .757 .663 .480 .412 .547 / .591 .557 /.598

Table 3: Performance of each method across different tasks usingword2vec’s recommended configuration: win = 2;

dyn=with;sub=dirty;neg= 5;cds= 0.75;w+c=onlyw;eig= 0.0. CBOW is presented for comparison.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul

PPMI .755 .697 .745 .686 .462 .393 .553 / .679 .306 / .535

SVD .793 .691 .778 .666 .514 .432 .554 / .591 .408 / .468

SGNS .793 .685 .774 .693 .470 .438 .676 /.688 .618 /.645

GloVe .725 .604 .729 .632 .403 .398 .569 / .596 .533 / .580

Table 4: Performance of each method across different tasks using the best configuration for that method and task combination, assumingwin= 2.

ε= 0.001is used to prevent division by zero. We abbreviate the two methods “Add” and “Mul”, re- spectively. The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer (b).

5 Results

We begin by comparing the effect of various hy- perparameter configurations, and observe that dif- ferent settings have a substantial impact on per- formance (Section 5.1); at times, this improve- ment is greater than that of switching to a dif- ferent representation method. We then show that, in some tasks, careful hyperparameter tuning can also outweigh the importance of adding more data (5.2). Finally, we observe that our results do not agree with a few recent claims in the word embed- ding literature, and suggest that these discrepan- cies stem from hyperparameter settings that were not controlled for in previous experiments (5.3).

5.1 Hyperparameters vs Algorithms

We first examine a “vanilla” scenario (Table 2), in which all hyperparameters are “turned off” (set to

default values): small context windows (win = 2), no dynamic contexts (dyn = none), no sub- sampling (sub = none), one negative sample (neg= 1), no smoothing (cds= 1), no context vectors (w+c = onlyw), and default eigenvalue weights (eig= 0.0).5Overall, SVD outperforms other methods on most word similarity tasks, often having a considerable advantage over the second- best. In contrast, analogy tasks present mixed re- sults; SGNS yields the best result in MSR’s analo- gies, while PPMI dominates Google’s dataset.

The second scenario (Table 3) sets the hyper- parameters toword2vec’s default values: small context windows (win = 2),6 dynamic contexts (dyn= with), dirty subsampling (sub = dirty), five negative samples (neg= 5), context distribu- tion smoothing (cds= 0.75), no context vectors (w+c = onlyw), and default eigenvalue weights

5While it is more common to seteig= 1, this setting degrades SVD’s performance considerably (see Section 6.1).

6Whileword2vec’s default window size is5, we present a single window size (win= 2) in Tables 2-4, in order to iso- latewin’s effect from the effects of other hyperparameters.

Running the same experiments with different window sizes reveals similar trends. Additional results with broader win- dow sizes are shown in Table 5.

(9)

win Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul 2

PPMI .732 .699 .744 .654 .457 .382 .552 / .677 .306 / .535

SVD .772 .671 .777 .647 .508 .425 .554 / .591 .408 / .468

SGNS .789 .675 .773 .661 .449 .433 .676 /.689 .617 /.644

GloVe .720 .605 .728 .606 .389 .388 .649 / .666 .540 / .591

5

PPMI .732 .706 .738 .668 .442 .360 .518 / .649 .277 / .467

SVD .764 .679 .776 .639 .499 .416 .532 / .569 .369 / .424

SGNS .772 .690 .772 .663 .454 .403 .692 /.714 .605 /.645

GloVe .745 .617 .746 .631 .416 .389 .700 / .712 .541 / .599

10

PPMI .735 .701 .741 .663 .235 .336 .532 / .605 .249 / .353

SVD .766 .681 .770 .628 .312 .419 .526 / .562 .356 / .406

SGNS .794 .700 .775 .678 .281 .422 .694 / .710 .520 /.557

GloVe .746 .643 .754 .616 .266 .375 .702 /.712 .463 / .519

10 SGNS-LS .766 .681 .781 .689 .451 .414 .739 /.758 .690 /.729

GloVe-LS .678 .624 .752 .639 .361 .371 .732 / .750 .628 / .685

Table 5: Performance of each method across different tasks using 2-fold cross-validation for hyperparameter tuning. Configu- rations on large-scale (LS) corpora are also presented for comparison.

(eig= 0.0). The results in this scenario are quite different than those of the vanilla scenario, with better performance in many cases. However, this change is not uniform, as we observe that differ- ent settings boost different algorithms. In fact, the question “Which method is best?” might have a completely different answer when comparing on the same task but with different hyperparameter values. Looking at Table 2 and Table 3, for ex- ample, SVD is the best algorithm for SimLex-999 in the vanilla scenario, whereas in theword2vec scenario, it does not perform as well as SGNS.

The third scenario (Table 4) enables the full range of hyperparameters given small context win- dows (win = 2); we evaluate each method on each task given every hyperparameter configura- tion, and choose the best performance. We see a considerable performance increase across all methods when comparing to both the vanilla (Ta- ble 2) and word2vec scenarios (Table 3): the best combination of hyperparameters improves up to 15.7 points beyond the vanilla setting, and over 6 points on average. It appears that selecting the right hyperparameter settings often has more im- pact than choosing the most suitable algorithm.

Main Result The numbers in Table 4 result from an “oracle” experiment, in which the hyperparam- eters are tuned on the test data, providing an upper bound on the potential performance improvement of hyperparameter tuning. Are such gains achiev- able in practice?

Table 5 describes a realistic scenario, where the hyperparameters are tuned on a training set, which is separate from the unseen test data. We also

report results for different window sizes (win = 2,5,10). We use 2-fold cross validation, in which, for each task, the hyperparameters are tuned on each half of the data and evaluated on the other half. The numbers reported in Table 5 are the av- erages of the two runs for each data-point.

The results indicate that approaching the ora- cle’s improvements are indeed feasible. When comparing the performance of the trained config- uration (Table 5) to that of the optimal one (Ta- ble 4), their average difference is about1%, with larger datasets usually finding the optimal configu- ration. It is therefore both practical and beneficial to properly tune hyperparameters for word simi- larity and analogy detection tasks.

An interesting observation, which immediately appears when looking at Table 5, is that there is no single method that consistently performs better than the rest. This behavior is visible across all window sizes, and is discussed in further detail in Section 5.3.

5.2 Hyperparameters vs Big Data

An important factor in evaluating distributional methods is the size of corpus and vocabulary, where larger corpora tend to yield better repre- sentations. However, training word vectors from larger corpora is more costly in computation time, which could be spent in tuning hyperparameters.

To compare the effect of bigger data versus more flexible hyperparameter settings, we created a large corpus with over 10.5 billion words (7 times larger than our original corpus). This cor- pus was built from an 8.5 billion word corpus sug-

(10)

gested by Mikolov for training word2vec,7 to which we added UKWaC (Ferraresi et al., 2008).

As with the original setup, our vocabulary con- tained every word that appeared at least 100 times in the corpus, amounting to about 620,000 words.

Finally, we fixed the context windows to be broad and dynamic (win = 10,dyn = with), and ex- plored 16 hyperparameter settings comprising of:

subsampling (sub), shifted PMI (neg = 1,5), context distribution smoothing (cds), and adding context vectors (w+c). This space is somewhat more restricted than the original hyperparameter space.

In terms of computation, SGNS scales nicely, requiring about half a day of computation per setup. GloVe, on the other hand, took several days to run a single 50-iteration instance for this corpus.

Applying the traditional count-based methods to this setting proved technically challenging, as they consumed too much memory to be efficiently ma- nipulated. We thus present results for only SGNS and GloVe (Table 5).

Remarkably, there are some cases (3/6 word similarity tasks) in which tuning a larger space of hyperparameters is indeed more beneficial than expanding the corpus. In other cases, however, more data does seem to pay off, as evident with both analogy tasks.

5.3 Re-evaluating Prior Claims

Prior art raises several claims regarding the superi- ority of certain methods over the others. However, these studies did not control for the hyperparame- ters presented in this work. We thus revisit these claims, and examine their validity based on the re- sults in Table 5.8

Are embeddings superior to count-based dis- tributional methods? It is commonly believed that modern prediction-based embeddings per- form better than traditional count-based methods.

This claim was recently supported by a series of systematic evaluations by Baroni et al. (2014).

However, our results suggest a different trend. Ta- ble 5 shows that in word similarity tasks, the av- erage score of SGNS is actually lower than SVD’s whenwin = 2,5, and it never outperforms SVD

7http://word2vec.googlecode.com/svn/

trunk/demo-train-big-model-v1.sh

8We note that all conclusions drawn in this section rely on the specific data and settings with which we experiment. It is indeed feasible that experiments on different tasks, data, and hyperparameters may yield other conclusions.

by more than 1.7 points in those cases. In Google’s analogies SGNS and GloVe indeed perform bet- ter than PPMI, but only by a margin of 3.7 points (compare PPMI withwin = 2 and SGNS with win= 5). MSR’s analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD.9Overall, there does not seem to be a consistent significant advantage to one ap- proach over the other, thus refuting the claim that prediction-based methods are superior to count- based approaches.

The contradictory results in (Baroni et al., 2014) stem from creating word2vec embed- dings with somewhat pre-tuned hyperparameters (recommended by word2vec), and comparing them to “vanilla” PPMI and SVD representa- tions. In particular, shifted PMI (negative sam- pling) and context distribution smoothing (cds= 0.75, equation (3) in Section 3.2) were turned on for SGNS, but not for PPMI and SVD. An additional difference is Baroni et al.’s setting of eig=1, which significantly deteriorates SVD’s performance (see Section 6.1).

Is GloVe superior to SGNS? Pennington et al.

(2014) show a variety of experiments in which GloVe outperforms SGNS (among other meth- ods). However, our results show the complete op- posite. In fact, SGNS outperforms GloVe inevery task (Table 5). Only when restricted to 3CosAdd, a suboptimal configuration, does GloVe show a 0.8 point advantage over SGNS. This trend persists when scaling up to a larger corpus and vocabulary.

This contradiction can be explained by three major differences in the experimental setup. First, in our experiments, hyperparameters were allowed to vary; in particular,w+cwas applied to all the methods, including SGNS. Secondly, Pennington et al. (2014) only evaluated on Google’s analo- gies, but not on MSR’s. Finally, in our work, all methods are compared using the same underlying corpus.

It is also important to bear in mind that, by definition, GloVe cannot use two hyperparame- ters: shifted PMI (neg) and context distribution smoothing (cds). Instead, GloVe learns a set of bias parameters that subsumes these two modifica- tions and many other potential changes to the PMI metric. Albeit its greater flexibility, GloVe does not fair better than SGNS in our experiments.

9Unlike PPMI, SVD underperforms in both analogy tasks.

(11)

Is PPMI on-par with SGNS on analogy tasks?

Levy and Goldberg (2014b) show that PPMI and SGNS perform similarly on both Google’s and MSR’s analogy tasks. Nevertheless, the results in Table 5 show a clear advantage to SGNS.

While the gap on Google’s analogies is not very large (PPMI lags behind SGNS by only 3.7 points), SGNS consistently outperforms PPMI by a large margin on the MSR dataset. MSR’s analogy dataset captures syntactic relations, such as singular-plural inflections for nouns and tense modifications for verbs. We conjecture that cap- turing these syntactic relations may rely on certain types of contexts, such as determiners and func- tion words, which SGNS might be better at cap- turing – perhaps due to the way it assigns weights to different examples, or because it also captures negative correlations which are filtered by PPMI.

A deeper look into Levy and Goldberg’s (2014b) experiments reveals the use of PPMI with positionalcontexts (i.e. each context is a conjunc- tion of a word and its relative position to the target word), whereas SGNS was employed with regular bag-of-words contexts. Positional contexts might contain relevant information for recovering syn- tactic analogies, explaining PPMI’s relatively high score on MSR’s analogy task in (Levy and Gold- berg, 2014b).

Does 3CosMul recover more analogies than 3CosAdd? Levy and Goldberg (2014b) show that using similarity multiplication (3CosMul) rather than addition (3CosAdd) improves results on all methods and on every task. This claim is consistent with our findings; indeed, 3CosMul dominates 3CosAdd in every case. The improve- ment is particularly noticeable for SVD and PPMI, which considerably underperform other methods when using 3CosAdd.

5.4 Comparison with CBOW

Another algorithm featured in word2vec is CBOW. Unlike the other methods, CBOWcannot be easily expressed as a factorization of a word- context matrix; it ties together the tokens of each context window by representing the context vec- tor as the sum of its words’ vectors. It is thus more expressive than the other methods, and has a po- tential of deriving better word representations.

While Mikolov et al. (2013b) found SGNS to outperform CBOW, Baroni et al. (2014) reports that CBOW had a slight advantage. We com-

win eig Average Performance

0 .612

2 0.5 .611

1 .551

0 .616

5 0.5 .612

1 .534

0 .584

10 0.5 .567

1 .484

Table 6: The average performance of SVD on word similarity tasks given different values ofeig, in the vanilla scenario.

pared CBOW to the other methods when setting all the hyperparameters to the defaults provided by word2vec (Table 3). With the exception of MSR’s analogy task, CBOW is not the best- performing method of any other task in this sce- nario. Other scenarios showed similar trends in our preliminary experiments.

While CBOW can potentially derive better rep- resentations by combining the tokens in each con- text window, this potential is not realized in prac- tice. Nevertheless, Melamud et al. (2014) show that capturing joint contexts can indeed improve performance on word similarity tasks, and we be- lieve it is a direction worth pursuing.

6 Hyperparameter Analysis

We analyze the individual impact of each hyper- parameter, and try to characterize the conditions in which a certain setting is beneficial.

6.1 Harmful Configurations

Certain hyperparameter settings might cripple the performance of a certain method. We observe two scenarios in which SVD performs poorly.

SVD does not benefit from shifted PPMI. Set- tingneg>1consistently deteriorates SVD’s per- formance. Levy and Goldberg (2014c) made a similar observation, and hypothesized that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix. SVD’sL2ob- jective is unweighted, and it does not distinguish between observed and unobserved matrix cells.

Using SVD “correctly” is bad. The traditional way of representing words with SVD uses the eigenvalue matrix (eig= 1):W =Ud·Σd. De- spite being theoretically well-motivated, this set- ting leads to very poor results in practice, when compared to other settings (eig= 0.5or0). Ta- ble 6 demonstrates this gap.

(12)

The drop in average accuracy when setting eig = 1 is astounding. The performance gap persists under different hyperparameter settings as well, and drops in performance of over15points (absolute) when usingeig= 1instead ofeig= 0.5or0are not uncommon. This setting is one of the main reasons for SVD’s inferior results in the study by Baroni et al. (2014), and also the reason we chose to useeig = 0.5as the default setting for SVD in the vanilla scenario.

6.2 Beneficial Configurations

To identify which hyperparameter settings are beneficial, we looked at the best configuration of each method on each task. We then counted the number of times each hyperparameter setting was chosen in these configurations (Table 7). Some trends emerge, such as PPMI and SVD’s prefer- ence towards shorter context windows10 (win = 2), and that SGNS always prefers numerous nega- tive samples (neg>1).

To get a closer look and isolate the effect of each hyperparameter, we controlled for said hy- perparameter, and compared the best configura- tions given each of the hyperparameter’s settings.

Table 8 shows the difference between default and non-default settings of each hyperparameter.

While many hyperparameter settings can im- prove performance, they may also degrade it when chosen incorrectly. For instance, in the case of shifted PMI (neg), SGNS consistently profits fromneg> 1, while SVD’s performance is dra- matically reduced. For PPMI, the utility of ap- plying neg > 1 depends on the type of task:

word similarity or analogy. Another example is dynamic context windows (dyn), which is benefi- cial for MSR’s analogy task, but largely detrimen- tal to other tasks.

It appears that the only hyperparameter that can be “blindly” applied in any situation is context distribution smoothing (cds = 0.75), yielding a consistent improvement at an insignificant risk.

Note thatcdshelps PPMI more than it does other methods; we suggest that this is because it re- duces the relative impact of rare words on the dis- tributional representation, thus addressing PMI’s

“Achilles’ heel”.

10This might also relate to PMI’s bias towards infrequent events (see Section 2.1). Broader windows create more ran- dom co-occurrences with rare words, “polluting” the distribu- tional vector with random words that have high PMI scores.

7 Practical Recommendations

It is generally advisable to tune all hyperparam- eters, as well as algorithm-specific hyperparame- ters, for the task at hand. However, this may be computationally expensive. We thus provide some

“rules of thumb”, which we found to work well in our setting:

• Always use context distribution smoothing (cds = 0.75) to modify PMI, as described in Section 3.2. It consistently improves performance, and is applicable to PPMI, SVD, and SGNS.

•Do not use SVD “correctly” (eig= 1). Instead, use one of the symmetric variants (Section 3.3).

•SGNS is a robust baseline. While it might not be the best method for every task, it does not signif- icantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory con- sumption.

•With SGNS, prefer many negative samples.

•for both SGNS and GloVe, it is worthwhile to ex- periment with thew~+~cvariant, which is cheap to apply (does not require retraining) and can result in substantial gains (as well as substantial losses).

8 Conclusions

Recent embedding methods introduce a plethora of design choices beyond network architecture and optimization algorithms. We reveal that these seemingly minor variations can have a large im- pact on the success of word representation meth- ods. By showing how to adapt and tune these hy- perparameters in traditional methods, we allow a proper comparison between representations, and challenge various claims of superiority from the word embedding literature.

This study also exposes the need for more controlled-variable experiments, and extending the concept of “variable” from the obvious task, data, and method to the often ignored prepro- cessing steps and hyperparameter settings. We also stress the need for transparent and repro- ducible experiments, and commend authors such as Mikolov, Pennington, and others for making their code publicly available. In this spirit, we make our code available as well.11

11http://bitbucket.org/omerlevy/

hyperwords

(13)

Method win dyn sub neg cds w+c 2 : 5 : 10 none : with none : dirty 1 : 5 : 15 1.00 : 0.75 onlyw:w+c

PPMI 7 : 1 : 0 4 : 4 4 : 4 2 : 6 : 0 1 : 7

SVD 7 : 1 : 0 4 : 4 1 : 7 8 : 0 : 0 2 : 6 7 : 1

SGNS 2 : 3 : 3 6 : 2 4 : 4 0 : 4 : 4 3 : 5 4 : 4

GloVe 1 : 3 : 4 6 : 2 7 : 1 4 : 4

Table 7: The impact of each hyperparameter, measured by the number of tasks in which the best configuration had that hyper- parameter setting. Non-applicable combinations are marked by “—”.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul

PPMI +0.5% –1.0% 0.0% +0.1% +0.4% –0.1% –0.1% +1.2%

SVD –0.8% –0.2% 0.0% +0.6% +0.4% –0.1% +0.6% +2.1%

SGNS –0.9% –1.5% –0.3% +0.1% –0.1% –0.1% –1.0% +0.7%

GloVe –0.8% –1.2% –0.9% –0.8% +0.1% –0.9% –3.3% +1.8%

(a) Performance difference between best models withdyn=withanddyn=none.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul PPMI +0.6% +1.9% +1.3% +1.0% –3.8% –3.9% –5.0% –12.2%

SVD +0.7% +0.2% +0.6% +0.7% +0.8% –0.3% +4.0% +2.4%

SGNS +1.5% +2.2% +1.5% +0.1% –0.4% –0.1% –4.4% –5.4%

GloVe +0.2% –1.3% –1.0% –0.2% –3.4% –0.9% –3.0% –3.6%

(b) Performance difference between best models withsub=dirtyandsub=none.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul PPMI +0.6% +4.9% +1.3% +1.0% +2.2% +0.8% –6.2% –9.2%

SVD –1.7% –2.2% –1.9% –4.6% –3.4% –3.5% –13.9% –14.9%

SGNS +1.5% +2.9% +2.3% +0.5% +1.5% +1.1% +3.3% +2.1%

GloVe

(c) Performance difference between best models withneg>1andneg= 1.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul

PPMI +1.3% +2.8% 0.0% +2.1% +3.5% +2.9% +2.7% +9.2%

SVD +0.4% –0.2% +0.1% +1.1% +0.4% –0.3% +1.4% +2.2%

SGNS +0.4% +1.4% 0.0% +0.1% 0.0% +0.2% +0.6% 0.0%

GloVe

(d) Performance difference between best models withcds= 0.75andcds= 1.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul

PPMI

SVD –0.6% –0.2% –0.4% –2.1% –0.7% +0.7% –1.8% –3.4%

SGNS +1.4% +2.2% +1.2% +1.1% –0.3% –2.3% –1.0% –7.5%

GloVe +2.3% +4.7% +3.0% –0.1% –0.7% –2.6% +3.3% –8.9%

(e) Performance difference between best models withw+c=w+candw+c=onlyw.

Table 8: The added value versus the risk of setting each hyperparameter. The figures show the differences in performance between the best achievable configurations when restricting a hyperparameter to different values. This difference indicates the potential gain of tuning a given hyperparameter, as well as the risks of decreased performance when not tuning it. For example, an entry of +9.2% in Table (d) means that the best model withcds= 0.75is 9.2% more accurate (absolute) than the best model withcds= 1; i.e. on MSR’s analogies, usingcds= 0.75instead ofcds= 1improved PPMI’s accuracy from .443 to .535.

Références

Documents relatifs

Figures 2-3 plot the results of two experiments from P&W which focused on the abstract/concrete distinction. P&W used a false memory task in the generalisation phase,

• contextual relatedness: judging pairs in a text where they appear together: is the relation between the items relevant for textual cohesion.. While they are insectivores,

Words are then represented as posterior distributions over those latent classes, and the model allows to naturally obtain in-context and out-of-context word representations, which

We evaluate sentence autoencoding in two ways. First, we test the bag-of-words reconstruc- tion on its own, by feeding the system the encoded sentence embedding and evaluating

For every author, we built two lists of word pairs (with their lemma and PoS), one relative to the tagged reference corpus (reference pairs) and the other to the tagged test set

Baroni M., Lenci A., “How we BLESSed distributional semantic evaluation”, Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Asso- ciation

We extend our sharp minimax results in different directions: first, to Gaussian matrices with unknown variance, next, to matrices of random variables having a distribution from

In this paper we focus on the problem of high-dimensional matrix estimation from noisy observations with unknown variance of the noise.. Our main interest is the high