Improving Distributional Similarity with Lessons Learned from Word Embeddings

(1)

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy Yoav Goldberg Ido Dagan Computer Science Department

Bar-Ilan University Ramat-Gan, Israel

{omerlevy,yogo,dagan}@cs.biu.ac.il

Abstract

Recent trends suggest that neural- network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter op- timizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.

1 Introduction

Understanding the meaning of a word is at the heart of natural language processing (NLP). While a deep, human-like, understanding remains elu- sive, many methods have been successful in recovering certain aspects of similarity between words.

Recently, neural-network based approaches in which words are “embedded” into a low- dimensional space were proposed by various authors (Bengio et al., 2003; Collobert and Weston, 2008). These models represent each word as ad- dimensional vector of real numbers, and vectors that are close to each other are shown to be semantically related. In particular, a sequence of papers by Mikolov et al. (2013a; 2013b) culminated in the skip-gram with negative-sampling training method (SGNS): an efficient embedding algorithm that provides state-of-the-art results on various lin- guistic tasks. It was popularized viaword2vec, a program for creating word embeddings.

A recent study by Baroni et al. (2014) con- ducts a set of systematic experiments comparing word2vec embeddings to the more traditional distributional methods, such as pointwise mutual information (PMI) matrices (see Turney and Pantel (2010) and Baroni and Lenci (2010) for comprehensive surveys). These results suggest that the new embedding methods consistently outperform the traditional methods by a non-trivial margin on many similarity-oriented tasks. How- ever, state-of-the-art embedding methods are all based on the same bag-of-contexts representation of words. Furthermore, analysis by Levy and Goldberg (2014c) shows thatword2vec’s SGNS is implicitly factorizing a word-context PMI matrix. That is, the mathematical objective and the sources of information available to SGNS are in fact very similar to those employed by the more traditional methods.

What, then, is the source of superiority (or per- ceived superiority) of these recent embeddings?

While the focus of the presentation in the word- embedding literature is on the mathematical model and the objective being optimized, other factors affect the results as well. In particular, embedding algorithms suggest some naturalhyperparameters that can be tuned; many of which were already tuned to some extent by the algorithms’ design- ers. Some hyperparameters, such as the number of negative samples to use, are clearly marked as tunable. Other modifications, such as smoothing the negative-sampling distribution, are reported in passing and considered thereafter as part of the algorithm. Others still, such as dynamically-sized context windows, are not even mentioned in some of the papers, but are part of the canonical implementation. All of these modifications and system design choices, which we collectively denote as hyperparameters, are part of the final algorithm, and, as we show, have a substantial impact on performance.

211

Transactions of the Association for Computational Linguistics, vol. 3, pp. 211–225, 2015. Action Editor: Patrick Pantel.

(2)

In this work, we make these hyperparameters explicit, and show how they can beadapted and transferred into the traditional count-based approach. To asses how each hyperparameter con- tributes to the algorithms’ performance, we con- duct a comprehensive set of experiments and compare four different representation methods, while controlling for the various hyperparameters.

Once adapted across methods, hyperparameter tuning significantly improves performance in every task. In many cases, changing the setting of a single hyperparameter yields a greater increase in performance than switching to a better algorithm or training on a larger corpus.

In particular, word2vec’s smoothing of the negative sampling distribution can be adapted to PPMI-based methods by introducing a novel, smoothed variant of the PMI association measure (see Section 3.2). Using this variant increases performance by over 3 points per task, on average.

We suspect that this smoothing partially addresses the “Achilles’ heel” of PMI: its bias towards co- occurrences of rare words.

We also show that when all methods are allowed to tune a similar set of hyperparameters, their performance is largely comparable. In fact, there is no consistent advantage to one algorithmic approach over another, a result that contradicts the claim that embeddings are superior to count-based methods.

2 Background

We consider four word representation methods:

the explicit PPMI matrix, SVD factorization of said matrix, SGNS, and GloVe. For historical reasons, we refer to PPMI and SVD as “count- based” representations, as opposed to SGNS and GloVe, which are often referred to as “neural”

or “prediction-based” embeddings. All of these methods (as well as all other “skip-gram”-based embedding methods) are essentially bag-of-words models, in which the representation of each word reflects a weighted bag of context-words that co- occur with it. Such bag-of-word embedding models were previously shown to perform as well as or better than more complex embedding methods on similarity and analogy tasks (Mikolov et al., 2013a; Pennington et al., 2014).

Notation We assume a collection of wordsw∈ V_W and their contextsc∈V_C, whereV_W andV_C are the word and context vocabularies, and denote the collection of observed word-context pairs as

D. We use#(w, c)to denote the number of times the pair(w, c)appears inD. Similarly,#(w) = P

c⁰∈VC#(w, c⁰) and#(c) = P

w⁰∈VW#(w⁰, c) are the number of timesw and coccurred in D, respectively. In some algorithms, words and contexts are embedded in a space ofd dimensions.

In these cases, each wordw ∈ VW is associated with a vector w~ ∈ R^d and similarly each contextc ∈ VC is represented as a vector~c ∈ R^d. We sometimes refer to the vectorsw~ as rows in a

|V_W|×dmatrix W, and to the vectors~cas rows in a|V_C|×dmatrixC. When referring to embeddings produced by a specific methodx, we may useW^xandC^x(e.g.W^{SGN S}orC^{SV D}). All vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent (see Section 3.3 for further discussion).

Contexts D is commonly obtained by taking a corpusw1, w2, . . . , wn and defining the contexts of wordwi as the words surrounding it in anL- sized window wi−L, . . . , wi−1, wi+1, . . . , wi+L. While other definitions of contexts have been stud- ied (Pad´o and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014a) this work fo- cuses on fixed-window bag-of-words contexts.

2.1 Explicit Representations (PPMI Matrix) The traditional way to represent words in the distributional approach is to construct a high- dimensional sparse matrix M, where each row represents a word w in the vocabulary VW and each column a potential contextc∈VC. The value of each matrix cellMijrepresents the association between the wordwiand the contextcj. A popular measure of this association is pointwise mutual information (PMI) (Church and Hanks, 1990). PMI is defined as the log ratio betweenwandc’s joint probability and the product of their marginal prob- abilities, which can be estimated by:

P M I(w, c) = log _ˆ^P^ˆ^(w,c)

P(w) ˆP(c) = log^#(w,c)_#(w)_·_#(c)^·|^D^| The rows ofM^PMIcontain many entries of word- context pairs (w, c) that were never observed in the corpus, for whichP M I(w, c) = log 0 =−∞. A common approach is thus to replace theM^PMI matrix withM₀^PMI, in whichP M I(w, c) = 0 in cases where#(w, c) = 0. A more consistent approach is to usepositive PMI(PPMI), in which all

(3)

negative values are replaced by0:

P P M I(w, c) = max (P M I(w, c),0) Bullinaria and Levy (2007) showed that M^PPMI outperformsM₀^PMIon semantic similarity tasks.

A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infrequent events (Turney and Pantel, 2010). A rare context c that co-occurred with a target wordweven once, will often yield relatively high PMI score because Pˆ(c), which is in PMI’s denominator, is very small. This creates a situation in which the top

“distributional features” (contexts) ofw are often extremely rare words, which do not necessarily appear in the respective representations of words that are semantically similar tow. Nevertheless, the PPMI measure is widely regarded as state-of- the-art for these kinds of distributional-similarity models.

2.2 Singular Value Decomposition (SVD) While sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved computational efficiency and, arguably, better gener- alization. Such vectors can be obtained by performing dimensionality reduction over the sparse high-dimensional matrix.

A common method of doing so is truncated Sin- gular Value Decomposition (SVD), which finds the optimal rankdfactorization with respect toL2

loss (Eckart and Young, 1936). It was popularized in NLP via Latent Semantic Analysis (LSA) (Deerwester et al., 1990).

SVD factorizesMinto the product of three ma- tricesU ·Σ·V^>, where U and V are orthonormal andΣis a diagonal matrix of eigenvalues in decreasing order. By keeping only the topdele- ments ofΣ, we obtainM_d = U_d·Σ_d·V_d^>. The dot-products between the rows ofW =U_d·Σ_dare equal to the dot-products between rows ofM_d.

In the setting of word-context matrices, the dense,d-dimensional rows ofWcan substitute the very high-dimensional rows ofM. Indeed, a common approach in NLP literature is factorizing the PPMI matrixM^PPMI with SVD, and then taking the rows of:

W^SVD=U_d·Σ_d C^SVD=V_d (1) as word and context representations, respectively.

2.3 Skip-Grams with Negative Sampling (SGNS)

We present a brief sketch of SGNS – the skip-gram embedding model introduced in (Mikolov et al., 2013a) trained using the negative-sampling procedure presented in (Mikolov et al., 2013b). A de- tailed derivation of SGNS is available in (Gold- berg and Levy, 2014).

SGNS seeks to represent each word w ∈ VW

and each contextc ∈ VC asd-dimensional vectors w~ and~c, such that words that are “similar”

to each other will have similar vector representations. It does so by trying to maximize a function of the productw~ ·~cfor(w, c)pairs that occur in D, and minimize it for negative examples:(w, c_N) pairs that do not necessarily occur inD. The negative examples are created by stochastically cor- rupting observed(w, c)pairs fromD– hence the name “negative sampling”. For each observation of(w, c), SGNS drawsk contexts from the em- pirical unigram distribution P_D(c) = ^#(c)_|D| . In word2vec’s implementation of SGNS, this distribution is smoothed, a design choice that boosts its performance. We explore this hyperparameter and others in Section 3.

SGNS as Implicit Matrix Factorization Levy and Golberg (2014c) show that SGNS’s corpus- level objective achieves its optimal value when:

~

w·~c=PMI(w, c)−logk

Hence, SGNS is implicitly factorizing a word- context matrix whose cell’s values are PMI, shifted by a global constant (logk):

W·C^>=M^PMI−logk

SGNS performs a different kind of factorization from traditional SVD (see 2.2). In particular, the factorization’s loss function is not based on L₂, and is much less sensitive to extreme and infi- nite values due to a sigmoid function surrounding

~

w·~c. Furthermore, the loss is weighted, caus- ing rare(w, c)pairs to affect the objective much less than frequent ones. Thus, while many cells in M^PMIequallog 0 =−∞, the cost incurred for re- constructing these cells as a small negative value, such as−5instead of as−∞, is negligible.¹

1The logistic (sigmoidal) objective also curbs very high positive values of PMI. We suspect that this property, along with the weighted factorization property, addresses the afore- mentioned shortcoming of PMI, i.e. its overweighting of infrequent events.

(4)

An additional difference from SVD, which will be explored further in Section 3.3, is that SVD factorizesM into three matrices, two of them orthonormal and one diagonal, while SGNS factor- izesM into two unconstrained matrices.

2.4 Global Vectors (GloVe)

GloVe (Pennington et al., 2014) seeks to represent each wordw ∈ VW and each contextc ∈ VC as d-dimensional vectorsw~and~csuch that:

~

w·~c+b_w+b_c= log (#(w, c)) ∀(w, c)∈D Here,bwandbc(scalars) are word/context-specific biases, and are also parameters to be learned in addition tow~and~c.

GloVe’s objective is explicitly defined as a factorization of the log-count matrix, shifted by the entire vocabularies’ bias terms:

Mlog(#(w,c))

≈W·C^>+b~^w+b~^c Whereb~^wis a|VW|dimensional row vector andb~^c is a|VC|dimensional column vector.

If we were to fix bw = log #(w) and bc = log #(c), this would be almost²equivalent to factorizing the PMI matrix shifted bylog(|D|). How- ever, GloVe learns these parameters, giving an ex- tra degree of freedom over SVD and SGNS. The model is fit to minimize a weighted least square loss, giving more weight to frequent(w, c)pairs.³ Finally, an important novelty introduced in (Pennington et al., 2014) is that, assumingVC = V_W, one could take the representation of a word wto bew~+c~_wwherec~_wis the row correspond- ing tow in C^>. This may improve results considerably in some circumstances, as we discuss in Sections 3.3 and 6.2.

3 Transferable Hyperparameters

This section presents various hyperparameters im- plemented inword2vec and GloVe, and shows how to adapt and apply them to count-based methods. We divide these into: pre-processing hyperparameters, which affect the algorithms’

input data; association metric hyperparameters, which define how word-context interactions are calculated; and post-processing hyperparameters, which modify the resulting word vectors.

2GloVe’s objective ignores (w, c)pairs that do not co- occur in the training corpus, treating them as missing values.

SGNS, on the other hand, does take such pairs into account through the negative sampling procedure.

3The weighting formula is another hyper-parameter that could be tuned, but we keep to the default weighting scheme.

3.1 Pre-processing Hyperparameters

All the matrix-based algorithms rely on a col- lectionDof word-context pairs (w, c)as inputs.

word2vecintroduces three novel variations on the wayD is collected, which can be easily applied to other methods beyond SGNS.

Dynamic Context Window (dyn) The traditional approaches usually use a constant-sized unweighted context window. For instance, if the window size is5, then a word five tokens apart from the target is treated the same as an adjacent word.

Following the intuition that contexts closer to the target are more important, context words can be weighted according to their distance from the focus word. Both GloVe and word2vec employ such a weighting scheme, and while less common, this approach was also explored in traditional count-based methods, e.g. (Sahlgren, 2006).

GloVe’s implementation weights contexts using the harmonic function, e.g. a context word three tokens away will be counted as¹₃of an occurrence.

On the other hand,word2vec’s implementation is equivalent to weighing by the distance from the focus word divided by the window size. For example, a size-5 window will weigh its contexts by

5

5,⁴₅,³₅,²₅,¹₅.

The reason we call this modification dynamic context windows is because word2vec implements its weighting scheme by uniformly sampling the actual window size between1andL, for each token (Mikolov et al., 2013a). The sampling method is faster than the direct method in terms of training time, since there are fewer SGD updates in SGNS and fewer non-zero matrix cells in the other methods. For our systematic experiments, we used theword2vec-style sampled version for all methods, including GloVe.

Subsampling (sub) Subsampling is a method of diluting very frequent words, akin to removing stop-words. The subsampling method presented in (Mikolov et al., 2013a) randomly removes words that are more frequent than some thresholdtwith a probability ofp, wherefmarks the word’s corpus frequency:

p= 1− rt

f (2)

Following the recommendation in (Mikolov et al., 2013a), we uset= 10⁻⁵in our experiments.⁴

4word2vec’s code implements a slightly different formula:p= ^f⁻_f^t−q

t

f. We followed the formula presented

(5)

Another implementation detail of subsampling in word2vec is that the removal of tokens is done before the corpus is processed into word- context pairs. This practically enlarges the context window’s size for many tokens, because they can now reach words that were not in their origi- nalL-sized windows. We call this kind of subsampling “dirty”, as opposed to “clean” subsampling, which removes subsampled words without affect- ing the context window’s size. We found their impact on performance comparable, and report results of only the “dirty” variant.

Deleting Rare Words (del) While it is common to ignore words that are rare in the training corpus, word2vec removes these tokens from the corpusbeforecreating context windows. As with subsampling, this variation narrows the distance between tokens, inserting new word-context pairs that did not exist in the original corpus with the same window size. Though this variation may also have an effect on performance, preliminary experiments showed that it was small, and we therefore do not investigate its effect in this paper.

3.2 Association Metric Hyperparameters The PMI (or PPMI) between a word and its context is well known to be an effective association measure in the word similarity literature. Levy and Golberg (2014c) show that SGNS is implicitly factorizing a word-context matrix whose cell’s values are shifted PMI. Following their analysis, we present two variations of the PMI (and implicitly PPMI) association metric, which we adopt from SGNS. These enhancements of PMI are not directly applicable to GloVe, which, by definition, uses a different association measure.

Shifted PMI (neg) SGNS has a natural hyperparameter k (the number of negative samples), which affects the value that SGNS is trying to op- timize for each(w, c): P M I(w, c)−logk. The shift caused byk > 1can be applied to distributional methods through shifted PPMI (Levy and Goldberg, 2014c):

SP P M I(w, c) = max (P M I(w, c)−logk,0) It is important to understand that in SGNS,khas two distinct functions. First, it is used to better estimate the distribution of negative examples; a higherk means more data and better estimation.

in the original paper (equation 2).

Second, it acts as a prior on the probability of ob- serving a positive example (an actual occurrence of(w, c)in the corpus) versus a negative example;

a higherkmeans that negative examples are more probable. Shifted PPMI captures only the second aspect ofk (a prior). We experiment with three values ofk:1,5,15.

Context Distribution Smoothing (cds) In word2vec, negative examples (contexts) are sampled according to a smoothed unigram distribution. In order to smooth the original contexts’ distribution, all context counts are raised to the power of α (Mikolov et al. (2013b) found α = 0.75 to work well). This smoothing variation has an analog when calculating PMI directly:

P M Iα(w, c) = log Pˆ(w, c)

Pˆ(w) ˆPα(c) (3) Pˆα(c) = # (c)^α

P

c# (c)^α

Like other smoothing techniques (Pantel and Lin, 2002; Turney and Littman, 2003), context distribution smoothing alleviates PMI’s bias towards rare words. It does so by enlarging the probability of sampling a rare context (sincePˆα(c) > Pˆ(c) when cis infrequent), which in turn reduces the PMI of(w, c)for anywco-occurring with the rare contextc. In Section 6.2 we demonstrate that this novel variant of PMI is very effective, and consistently improves performance across tasks, methods, and configurations. We experiment with two values ofα:1(unsmoothed) and0.75(smoothed).

3.3 Post-processing Hyperparameters We present three hyperparameters that modify the algorithms’ output: the word vectors.

Adding Context Vectors (w+c) Pennington et al. (2014) propose using the context vectors in addition to the word vectors as GloVe’s output. For example, the word “cat” can be represented as:

~v_cat=w~_cat+~c_cat

wherew~ and~care the word and context embeddings, respectively.

This vector combination was originally motivated as an ensemble method. Here, we provide a different interpretation of its effect on the cosine similarity function. Specifically, we show

(6)

that adding context vectors effectively adds first- order similarity terms to the second-order similarity function.

Consider the cosine similarity of two words:

cos(x, y) = ~v_x·~v_y

√~vx·~vxp

~vy·~vy

= (w~x+~cx)·(w~y+~cy) p(w~_x+~c_x)·(w~_x+~c_x)p

(w~_y+~c_y)·(w~_y+~c_y)

= w~x·w~y+~cx·~cy+w~x·~cy+~cx·w~y

pw~²_x+ 2w~x·~cx+~c²_xq

~

w²_y+ 2w~y·~cy+~c²_y

= w~_x·w~_y+~c_x·~c_y+w~_x·~c_y+~c_x·w~_y 2√

~

wx·~cx+ 1p

~

wy·~cy+ 1 (4) (The last step follows because, as noted in Sec- tion 2, the word and context vectors are normalized after training.)

The resulting expression combines similarity terms which can be divided into two groups:

second-order similarity (wx·wy,cx·cy) and first- order similarity (w_∗·c_∗). The second-order terms measure the extent to which the two words arere- placeablebased on their tendencies to appear in similar contexts, and are the manifestation of Har- ris’s (1954) distributional hypothesis. The first- order terms measure the tendency of one word to appear in the context of the other.

In SVD and SGNS, the first-order similarity terms betweenw andcconverge toP M I(w, c), while in GloVe it converges into their log-count (with some bias terms).

The similarity calculated in equation 4 is thus a symmetric combination of the first-order and second order similarities ofxandy, normalized by a function of their reflective first-order similarities:

sim(x, y) = sim2(x, y) +sim1(x, y) psim1(x, x) + 1p

sim1(y, y) + 1 This similarity measure states that words are similar if they tend to appear in similar contexts, orif they tend to appear in the contexts of each other (and preferably both).

The additive w+c representation can be triv- ially applied to other methods that produce distinct word and context vectors (e.g. SVD and SGNS).

On the other hand, explicit methods such as PPMI are sparse by definition, and nullify the vast ma- jority of first-order similarities. We therefore do not applyw+cto PPMI in this study.

Eigenvalue Weighting (eig) As mentioned in Section 2.2, the word and context vectors derived using SVD are typically represented by (equation 1):

W^SVD=Ud·Σd C^SVD=Vd

However, this is not necessarily the optimal con- struction ofW^SVD for word similarity tasks. We note that in the SVD-based factorization, the resulting word and context matrices have very different properties. In particular, the context matrix C^SVD is orthonormal while the word matrix W^SVD is not. On the other hand, the factorization achieved by SGNS’s training procedure is much more “symmetric”, in the sense that neitherW^W2V norC^W2Vis orthonormal, and no particular bias is given to either of the matrices in the training objective. Similar symmetry can be achieved with the following factorization:

W =Ud·p

Σd C=Vd·p

Σd (5) Alternatively, the eigenvalue matrix can be dis- missed altogether:

W =U_d C =V_d (6)

While it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically (see Sec- tion 6.1). A similar observation was made by Caron (2001), who suggested adding a parameter pto control the eigenvalue matrixΣ:

W^SVD^p =U_d·Σ^p_d

Later studies show that weighting the eigenvalue matrixΣdwith the exponentpcan have a significant effect on performance, and should be tuned (Bullinaria and Levy, 2012; Turney, 2012). Adapt- ing the notion of symmetric decomposition from SGNS, this study experiments only with symmetric variants of SVD (p= 0,p= 0.5; equations (6) and (5)) and the traditional factorization (p = 1; equation (1)).

Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity. This normalization is a hyperparameter setting in itself, and other normalizations are also applicable. The trivial case is using no normalization

(7)

Hyper- Explored Applicable

parameter Values Methods

win 2,5,10 All

dyn none, with All

sub none, dirty, clean^† All

del none, with^† All

neg 1,5,15 PPMI, SVD, SGNS

cds 1,0.75 PPMI, SVD, SGNS

w+c onlyw,w+c SVD, SGNS, GloVe

eig 0,0.5,1 SVD

nrm none^†, row, col^†, both^† All

Table 1: The space of hyperparameters explored in this work.

†Explored only in preliminary experiments.

at all. Another setting, used by Pennington et al.

(2014), normalizes the columns ofW rather than its rows. It is also possible to consider a fourth setting that combines both row and column normalizations.

Note that column normalization is akin to dis- missing the eigenvalues in SVD. While the hyperparameter settingeig = 0 has an important positive impact on SVD, the same cannot be said of column normalization on other methods. In preliminary experiments, we tried the four different normalization schemes described above (none, row, column, and both), and found the standardL2

normalization ofW’s rows (i.e. using the cosine similarity measure) to be consistently superior.

4 Experimental Setup

We explored a large space of hyperparameters, representations, and evaluation datasets.

4.1 Hyperparameter Space

Table 1 enumerates the hyperparameter space. We generated 72 PPMI, 432 SVD, 144 SGNS, and 24 GloVe representations; 672 overall.

4.2 Word Representations

Corpus All models were trained on English Wikipedia (August 2013 dump), pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 mil- lion sentences, spanning 1.5 billion tokens. Mod- els were derived using windows of 2, 5, and 10 tokens to each side of the focus word (the window size parameter is denotedwin). Words that appeared less than 100 times in the corpus were ignored, resulting in vocabularies of 189,533 terms for both words and contexts.

Training Embeddings We trained a 500- dimensional representation with SVD, SGNS, and

GloVe. SGNS was trained using a modified version ofword2vecwhich receives a sequence of pre-extracted word-context pairs (Levy and Gold- berg, 2014a). GloVe was trained with 50 itera- tions using the original implementation (Penning- ton et al., 2014), applied to the pre-extracted word- context pairs.

4.3 Test Datasets

We evaluated each word representation on eight datasets covering similarity and analogy tasks.

Word Similarity We used six datasets to evaluate word similarity: the popular WordSim353 (Finkelstein et al., 2002) partitioned into two datasets,WordSim SimilarityandWordSim Relat- edness (Zesch et al., 2008; Agirre et al., 2009);

Bruni et al.’s (2012) MEN dataset; Radinsky et al.’s (2011) Mechanical Turk dataset; Luong et al.’s (2013)Rare Words dataset; and Hill et al.’s (2014)SimLex-999dataset. All these datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman’s ρ) with the human ratings.

Analogy The twoanalogydatasets present questions of the form “ais toa^∗ asbis tob^∗”, where b^∗is hidden, and must be guessed from the entire vocabulary. MSR’s analogy dataset (Mikolov et al., 2013c) contains 8000 morpho-syntactic analogy questions, such as “goodis tobestassmartis tosmartest”. Google’s analogy dataset (Mikolov et al., 2013a) contains 19544 questions, about half of the same kind as in MSR (syntactic analogies), and another half of a more semantic nature, such as capital cities (“Parisis toFranceasTokyois to Japan”). After filtering questions involving out- of-vocabulary words, i.e. words that appeared in English Wikipedia less than 100 times, we remain with 7118 instances in MSR and 19258 instances in Google. The analogy questions are answered using 3CosAdd (addition and subtraction):

arg max

b^∗∈VW\{a^∗,b,a}cos(b^∗, a^∗−a+b) =

arg max

b^∗∈VW\{a^∗,b,a}(cos(b^∗, a^∗)−cos(b^∗, a) + cos(b^∗, b)) as well as 3CosMul, which is state-of-the-art in analogy recovery (Levy and Goldberg, 2014b):

arg max

b^∗∈VW\{a^∗,b,a}

cos(b^∗, a^∗)·cos(b^∗, b) cos(b^∗, a) +ε

(8)

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul

PPMI .709 .540 .688 .648 .393 .338 .491 /.650 .246 / .439

SVD .776 .658 .752 .557 .506 .422 .452 / .498 .357 / .412

SGNS .724 .587 .686 .678 .434 .401 .530 / .552 .578 /.592

GloVe .666 .467 .659 .599 .403 .398 .442 / .465 .529 / .576

Table 2: Performance of each method across different tasks in the “vanilla” scenario (all hyperparameters set to default):

win= 2;dyn=none;sub=none;neg= 1;cds= 1;w+c=onlyw;eig= 0.0.

PPMI .755 .688 .745 .686 .423 .354 .553 /.629 .289 / .413

SVD .784 .672 .777 .625 .514 .402 .547 / .587 .402 / .457

SGNS .773 .623 .723 .676 .431 .423 .599 / .625 .514 / .546

GloVe .667 .506 .685 .599 .372 .389 .539 / .563 .503 /.559

CBOW .766 .613 .757 .663 .480 .412 .547 / .591 .557 /.598

Table 3: Performance of each method across different tasks usingword2vec’s recommended configuration: win = 2;

dyn=with;sub=dirty;neg= 5;cds= 0.75;w+c=onlyw;eig= 0.0. CBOW is presented for comparison.

PPMI .755 .697 .745 .686 .462 .393 .553 / .679 .306 / .535

SVD .793 .691 .778 .666 .514 .432 .554 / .591 .408 / .468

SGNS .793 .685 .774 .693 .470 .438 .676 /.688 .618 /.645

GloVe .725 .604 .729 .632 .403 .398 .569 / .596 .533 / .580

Table 4: Performance of each method across different tasks using the best configuration for that method and task combination, assumingwin= 2.

ε= 0.001is used to prevent division by zero. We abbreviate the two methods “Add” and “Mul”, respectively. The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer (b^∗).

5 Results

We begin by comparing the effect of various hyperparameter configurations, and observe that different settings have a substantial impact on performance (Section 5.1); at times, this improvement is greater than that of switching to a different representation method. We then show that, in some tasks, careful hyperparameter tuning can also outweigh the importance of adding more data (5.2). Finally, we observe that our results do not agree with a few recent claims in the word embedding literature, and suggest that these discrepan- cies stem from hyperparameter settings that were not controlled for in previous experiments (5.3).

5.1 Hyperparameters vs Algorithms

We first examine a “vanilla” scenario (Table 2), in which all hyperparameters are “turned off” (set to

default values): small context windows (win = 2), no dynamic contexts (dyn = none), no subsampling (sub = none), one negative sample (neg= 1), no smoothing (cds= 1), no context vectors (w+c = onlyw), and default eigenvalue weights (eig= 0.0).⁵Overall, SVD outperforms other methods on most word similarity tasks, often having a considerable advantage over the second- best. In contrast, analogy tasks present mixed results; SGNS yields the best result in MSR’s analogies, while PPMI dominates Google’s dataset.

The second scenario (Table 3) sets the hyperparameters toword2vec’s default values: small context windows (win = 2),⁶ dynamic contexts (dyn= with), dirty subsampling (sub = dirty), five negative samples (neg= 5), context distribution smoothing (cds= 0.75), no context vectors (w+c = onlyw), and default eigenvalue weights

5While it is more common to seteig= 1, this setting degrades SVD’s performance considerably (see Section 6.1).

6Whileword2vec’s default window size is5, we present a single window size (win= 2) in Tables 2-4, in order to iso- latewin’s effect from the effects of other hyperparameters.

Running the same experiments with different window sizes reveals similar trends. Additional results with broader window sizes are shown in Table 5.

(9)

win Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Add / Mul Add / Mul 2

PPMI .732 .699 .744 .654 .457 .382 .552 / .677 .306 / .535

SVD .772 .671 .777 .647 .508 .425 .554 / .591 .408 / .468

SGNS .789 .675 .773 .661 .449 .433 .676 /.689 .617 /.644

GloVe .720 .605 .728 .606 .389 .388 .649 / .666 .540 / .591

5

PPMI .732 .706 .738 .668 .442 .360 .518 / .649 .277 / .467

SVD .764 .679 .776 .639 .499 .416 .532 / .569 .369 / .424

SGNS .772 .690 .772 .663 .454 .403 .692 /.714 .605 /.645

GloVe .745 .617 .746 .631 .416 .389 .700 / .712 .541 / .599

10

PPMI .735 .701 .741 .663 .235 .336 .532 / .605 .249 / .353

SVD .766 .681 .770 .628 .312 .419 .526 / .562 .356 / .406

SGNS .794 .700 .775 .678 .281 .422 .694 / .710 .520 /.557

GloVe .746 .643 .754 .616 .266 .375 .702 /.712 .463 / .519

10 SGNS-LS .766 .681 .781 .689 .451 .414 .739 /.758 .690 /.729

GloVe-LS .678 .624 .752 .639 .361 .371 .732 / .750 .628 / .685

Table 5: Performance of each method across different tasks using 2-fold cross-validation for hyperparameter tuning. Configu- rations on large-scale (LS) corpora are also presented for comparison.

(eig= 0.0). The results in this scenario are quite different than those of the vanilla scenario, with better performance in many cases. However, this change is not uniform, as we observe that different settings boost different algorithms. In fact, the question “Which method is best?” might have a completely different answer when comparing on the same task but with different hyperparameter values. Looking at Table 2 and Table 3, for example, SVD is the best algorithm for SimLex-999 in the vanilla scenario, whereas in theword2vec scenario, it does not perform as well as SGNS.

The third scenario (Table 4) enables the full range of hyperparameters given small context windows (win = 2); we evaluate each method on each task given every hyperparameter configuration, and choose the best performance. We see a considerable performance increase across all methods when comparing to both the vanilla (Ta- ble 2) and word2vec scenarios (Table 3): the best combination of hyperparameters improves up to 15.7 points beyond the vanilla setting, and over 6 points on average. It appears that selecting the right hyperparameter settings often has more impact than choosing the most suitable algorithm.

Main Result The numbers in Table 4 result from an “oracle” experiment, in which the hyperparameters are tuned on the test data, providing an upper bound on the potential performance improvement of hyperparameter tuning. Are such gains achievable in practice?

Table 5 describes a realistic scenario, where the hyperparameters are tuned on a training set, which is separate from the unseen test data. We also

report results for different window sizes (win = 2,5,10). We use 2-fold cross validation, in which, for each task, the hyperparameters are tuned on each half of the data and evaluated on the other half. The numbers reported in Table 5 are the av- erages of the two runs for each data-point.

The results indicate that approaching the oracle’s improvements are indeed feasible. When comparing the performance of the trained configuration (Table 5) to that of the optimal one (Ta- ble 4), their average difference is about1%, with larger datasets usually finding the optimal configuration. It is therefore both practical and beneficial to properly tune hyperparameters for word similarity and analogy detection tasks.

An interesting observation, which immediately appears when looking at Table 5, is that there is no single method that consistently performs better than the rest. This behavior is visible across all window sizes, and is discussed in further detail in Section 5.3.

5.2 Hyperparameters vs Big Data

An important factor in evaluating distributional methods is the size of corpus and vocabulary, where larger corpora tend to yield better representations. However, training word vectors from larger corpora is more costly in computation time, which could be spent in tuning hyperparameters.

To compare the effect of bigger data versus more flexible hyperparameter settings, we created a large corpus with over 10.5 billion words (7 times larger than our original corpus). This corpus was built from an 8.5 billion word corpus sug-

(10)

gested by Mikolov for training word2vec,⁷ to which we added UKWaC (Ferraresi et al., 2008).

As with the original setup, our vocabulary con- tained every word that appeared at least 100 times in the corpus, amounting to about 620,000 words.

Finally, we fixed the context windows to be broad and dynamic (win = 10,dyn = with), and explored 16 hyperparameter settings comprising of:

subsampling (sub), shifted PMI (neg = 1,5), context distribution smoothing (cds), and adding context vectors (w+c). This space is somewhat more restricted than the original hyperparameter space.

In terms of computation, SGNS scales nicely, requiring about half a day of computation per setup. GloVe, on the other hand, took several days to run a single 50-iteration instance for this corpus.

Applying the traditional count-based methods to this setting proved technically challenging, as they consumed too much memory to be efficiently ma- nipulated. We thus present results for only SGNS and GloVe (Table 5).

Remarkably, there are some cases (3/6 word similarity tasks) in which tuning a larger space of hyperparameters is indeed more beneficial than expanding the corpus. In other cases, however, more data does seem to pay off, as evident with both analogy tasks.

5.3 Re-evaluating Prior Claims

Prior art raises several claims regarding the superiority of certain methods over the others. However, these studies did not control for the hyperparameters presented in this work. We thus revisit these claims, and examine their validity based on the results in Table 5.⁸

Are embeddings superior to count-based distributional methods? It is commonly believed that modern prediction-based embeddings perform better than traditional count-based methods.

This claim was recently supported by a series of systematic evaluations by Baroni et al. (2014).

However, our results suggest a different trend. Ta- ble 5 shows that in word similarity tasks, the average score of SGNS is actually lower than SVD’s whenwin = 2,5, and it never outperforms SVD

7http://word2vec.googlecode.com/svn/

trunk/demo-train-big-model-v1.sh

8We note that all conclusions drawn in this section rely on the specific data and settings with which we experiment. It is indeed feasible that experiments on different tasks, data, and hyperparameters may yield other conclusions.

by more than 1.7 points in those cases. In Google’s analogies SGNS and GloVe indeed perform better than PPMI, but only by a margin of 3.7 points (compare PPMI withwin = 2 and SGNS with win= 5). MSR’s analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD.⁹Overall, there does not seem to be a consistent significant advantage to one approach over the other, thus refuting the claim that prediction-based methods are superior to count- based approaches.

The contradictory results in (Baroni et al., 2014) stem from creating word2vec embeddings with somewhat pre-tuned hyperparameters (recommended by word2vec), and comparing them to “vanilla” PPMI and SVD representations. In particular, shifted PMI (negative sampling) and context distribution smoothing (cds= 0.75, equation (3) in Section 3.2) were turned on for SGNS, but not for PPMI and SVD. An additional difference is Baroni et al.’s setting of eig=1, which significantly deteriorates SVD’s performance (see Section 6.1).

Is GloVe superior to SGNS? Pennington et al.

(2014) show a variety of experiments in which GloVe outperforms SGNS (among other methods). However, our results show the complete op- posite. In fact, SGNS outperforms GloVe inevery task (Table 5). Only when restricted to 3CosAdd, a suboptimal configuration, does GloVe show a 0.8 point advantage over SGNS. This trend persists when scaling up to a larger corpus and vocabulary.

This contradiction can be explained by three major differences in the experimental setup. First, in our experiments, hyperparameters were allowed to vary; in particular,w+cwas applied to all the methods, including SGNS. Secondly, Pennington et al. (2014) only evaluated on Google’s analogies, but not on MSR’s. Finally, in our work, all methods are compared using the same underlying corpus.

It is also important to bear in mind that, by definition, GloVe cannot use two hyperparameters: shifted PMI (neg) and context distribution smoothing (cds). Instead, GloVe learns a set of bias parameters that subsumes these two modifications and many other potential changes to the PMI metric. Albeit its greater flexibility, GloVe does not fair better than SGNS in our experiments.

9Unlike PPMI, SVD underperforms in both analogy tasks.

(11)

Is PPMI on-par with SGNS on analogy tasks?

Levy and Goldberg (2014b) show that PPMI and SGNS perform similarly on both Google’s and MSR’s analogy tasks. Nevertheless, the results in Table 5 show a clear advantage to SGNS.

While the gap on Google’s analogies is not very large (PPMI lags behind SGNS by only 3.7 points), SGNS consistently outperforms PPMI by a large margin on the MSR dataset. MSR’s analogy dataset captures syntactic relations, such as singular-plural inflections for nouns and tense modifications for verbs. We conjecture that capturing these syntactic relations may rely on certain types of contexts, such as determiners and function words, which SGNS might be better at capturing – perhaps due to the way it assigns weights to different examples, or because it also captures negative correlations which are filtered by PPMI.

A deeper look into Levy and Goldberg’s (2014b) experiments reveals the use of PPMI with positionalcontexts (i.e. each context is a conjunc- tion of a word and its relative position to the target word), whereas SGNS was employed with regular bag-of-words contexts. Positional contexts might contain relevant information for recovering syntactic analogies, explaining PPMI’s relatively high score on MSR’s analogy task in (Levy and Gold- berg, 2014b).

Does 3CosMul recover more analogies than 3CosAdd? Levy and Goldberg (2014b) show that using similarity multiplication (3CosMul) rather than addition (3CosAdd) improves results on all methods and on every task. This claim is consistent with our findings; indeed, 3CosMul dominates 3CosAdd in every case. The improvement is particularly noticeable for SVD and PPMI, which considerably underperform other methods when using 3CosAdd.

5.4 Comparison with CBOW

Another algorithm featured in word2vec is CBOW. Unlike the other methods, CBOWcannot be easily expressed as a factorization of a word- context matrix; it ties together the tokens of each context window by representing the context vector as the sum of its words’ vectors. It is thus more expressive than the other methods, and has a potential of deriving better word representations.

While Mikolov et al. (2013b) found SGNS to outperform CBOW, Baroni et al. (2014) reports that CBOW had a slight advantage. We com-

win eig Average Performance

0 .612

2 0.5 .611

1 .551

0 .616

5 0.5 .612

1 .534

0 .584

10 0.5 .567

1 .484

Table 6: The average performance of SVD on word similarity tasks given different values ofeig, in the vanilla scenario.

pared CBOW to the other methods when setting all the hyperparameters to the defaults provided by word2vec (Table 3). With the exception of MSR’s analogy task, CBOW is not the best- performing method of any other task in this scenario. Other scenarios showed similar trends in our preliminary experiments.

While CBOW can potentially derive better representations by combining the tokens in each context window, this potential is not realized in practice. Nevertheless, Melamud et al. (2014) show that capturing joint contexts can indeed improve performance on word similarity tasks, and we be- lieve it is a direction worth pursuing.

6 Hyperparameter Analysis

We analyze the individual impact of each hyperparameter, and try to characterize the conditions in which a certain setting is beneficial.

6.1 Harmful Configurations

Certain hyperparameter settings might cripple the performance of a certain method. We observe two scenarios in which SVD performs poorly.

SVD does not benefit from shifted PPMI. Set- tingneg>1consistently deteriorates SVD’s performance. Levy and Goldberg (2014c) made a similar observation, and hypothesized that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix. SVD’sL2ob- jective is unweighted, and it does not distinguish between observed and unobserved matrix cells.

Using SVD “correctly” is bad. The traditional way of representing words with SVD uses the eigenvalue matrix (eig= 1):W =U_d·Σ_d. De- spite being theoretically well-motivated, this setting leads to very poor results in practice, when compared to other settings (eig= 0.5or0). Ta- ble 6 demonstrates this gap.

(12)

The drop in average accuracy when setting eig = 1 is astounding. The performance gap persists under different hyperparameter settings as well, and drops in performance of over15points (absolute) when usingeig= 1instead ofeig= 0.5or0are not uncommon. This setting is one of the main reasons for SVD’s inferior results in the study by Baroni et al. (2014), and also the reason we chose to useeig = 0.5as the default setting for SVD in the vanilla scenario.

6.2 Beneficial Configurations

To identify which hyperparameter settings are beneficial, we looked at the best configuration of each method on each task. We then counted the number of times each hyperparameter setting was chosen in these configurations (Table 7). Some trends emerge, such as PPMI and SVD’s prefer- ence towards shorter context windows¹⁰ (win = 2), and that SGNS always prefers numerous negative samples (neg>1).

To get a closer look and isolate the effect of each hyperparameter, we controlled for said hyperparameter, and compared the best configurations given each of the hyperparameter’s settings.

Table 8 shows the difference between default and non-default settings of each hyperparameter.

While many hyperparameter settings can improve performance, they may also degrade it when chosen incorrectly. For instance, in the case of shifted PMI (neg), SGNS consistently profits fromneg> 1, while SVD’s performance is dra- matically reduced. For PPMI, the utility of applying neg > 1 depends on the type of task:

word similarity or analogy. Another example is dynamic context windows (dyn), which is beneficial for MSR’s analogy task, but largely detrimen- tal to other tasks.

It appears that the only hyperparameter that can be “blindly” applied in any situation is context distribution smoothing (cds = 0.75), yielding a consistent improvement at an insignificant risk.

Note thatcdshelps PPMI more than it does other methods; we suggest that this is because it reduces the relative impact of rare words on the distributional representation, thus addressing PMI’s

“Achilles’ heel”.

10This might also relate to PMI’s bias towards infrequent events (see Section 2.1). Broader windows create more random co-occurrences with rare words, “polluting” the distributional vector with random words that have high PMI scores.

7 Practical Recommendations

It is generally advisable to tune all hyperparameters, as well as algorithm-specific hyperparameters, for the task at hand. However, this may be computationally expensive. We thus provide some

“rules of thumb”, which we found to work well in our setting:

• Always use context distribution smoothing (cds = 0.75) to modify PMI, as described in Section 3.2. It consistently improves performance, and is applicable to PPMI, SVD, and SGNS.

•Do not use SVD “correctly” (eig= 1). Instead, use one of the symmetric variants (Section 3.3).

•SGNS is a robust baseline. While it might not be the best method for every task, it does not significantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory con- sumption.

•With SGNS, prefer many negative samples.

•for both SGNS and GloVe, it is worthwhile to experiment with thew~+~cvariant, which is cheap to apply (does not require retraining) and can result in substantial gains (as well as substantial losses).

8 Conclusions

Recent embedding methods introduce a plethora of design choices beyond network architecture and optimization algorithms. We reveal that these seemingly minor variations can have a large impact on the success of word representation methods. By showing how to adapt and tune these hyperparameters in traditional methods, we allow a proper comparison between representations, and challenge various claims of superiority from the word embedding literature.

This study also exposes the need for more controlled-variable experiments, and extending the concept of “variable” from the obvious task, data, and method to the often ignored prepro- cessing steps and hyperparameter settings. We also stress the need for transparent and repro- ducible experiments, and commend authors such as Mikolov, Pennington, and others for making their code publicly available. In this spirit, we make our code available as well.¹¹

11http://bitbucket.org/omerlevy/

hyperwords

(13)

Method win dyn sub neg cds w+c 2 : 5 : 10 none : with none : dirty 1 : 5 : 15 1.00 : 0.75 onlyw:w+c

PPMI 7 : 1 : 0 4 : 4 4 : 4 2 : 6 : 0 1 : 7 —

SVD 7 : 1 : 0 4 : 4 1 : 7 8 : 0 : 0 2 : 6 7 : 1

SGNS 2 : 3 : 3 6 : 2 4 : 4 0 : 4 : 4 3 : 5 4 : 4

GloVe 1 : 3 : 4 6 : 2 7 : 1 — — 4 : 4

Table 7: The impact of each hyperparameter, measured by the number of tasks in which the best configuration had that hyperparameter setting. Non-applicable combinations are marked by “—”.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul

PPMI +0.5% –1.0% 0.0% +0.1% +0.4% –0.1% –0.1% +1.2%

SVD –0.8% –0.2% 0.0% +0.6% +0.4% –0.1% +0.6% +2.1%

SGNS –0.9% –1.5% –0.3% +0.1% –0.1% –0.1% –1.0% +0.7%

GloVe –0.8% –1.2% –0.9% –0.8% +0.1% –0.9% –3.3% +1.8%

(a) Performance difference between best models withdyn=withanddyn=none.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul PPMI +0.6% +1.9% +1.3% +1.0% –3.8% –3.9% –5.0% –12.2%

SVD +0.7% +0.2% +0.6% +0.7% +0.8% –0.3% +4.0% +2.4%

SGNS +1.5% +2.2% +1.5% +0.1% –0.4% –0.1% –4.4% –5.4%

GloVe +0.2% –1.3% –1.0% –0.2% –3.4% –0.9% –3.0% –3.6%

(b) Performance difference between best models withsub=dirtyandsub=none.

Method WordSim WordSim Bruni et al. Radinsky et al. Luong et al. Hill et al. Google MSR Similarity Relatedness MEN M. Turk Rare Words SimLex Mul Mul PPMI +0.6% +4.9% +1.3% +1.0% +2.2% +0.8% –6.2% –9.2%

SVD –1.7% –2.2% –1.9% –4.6% –3.4% –3.5% –13.9% –14.9%

SGNS +1.5% +2.9% +2.3% +0.5% +1.5% +1.1% +3.3% +2.1%

GloVe — — — — — — — —

(c) Performance difference between best models withneg>1andneg= 1.

PPMI +1.3% +2.8% 0.0% +2.1% +3.5% +2.9% +2.7% +9.2%

SVD +0.4% –0.2% +0.1% +1.1% +0.4% –0.3% +1.4% +2.2%

SGNS +0.4% +1.4% 0.0% +0.1% 0.0% +0.2% +0.6% 0.0%

GloVe — — — — — — — —

(d) Performance difference between best models withcds= 0.75andcds= 1.

PPMI — — — — — — — —

SVD –0.6% –0.2% –0.4% –2.1% –0.7% +0.7% –1.8% –3.4%

SGNS +1.4% +2.2% +1.2% +1.1% –0.3% –2.3% –1.0% –7.5%

GloVe +2.3% +4.7% +3.0% –0.1% –0.7% –2.6% +3.3% –8.9%

(e) Performance difference between best models withw+c=w+candw+c=onlyw.

Table 8: The added value versus the risk of setting each hyperparameter. The figures show the differences in performance between the best achievable configurations when restricting a hyperparameter to different values. This difference indicates the potential gain of tuning a given hyperparameter, as well as the risks of decreased performance when not tuning it. For example, an entry of +9.2% in Table (d) means that the best model withcds= 0.75is 9.2% more accurate (absolute) than the best model withcds= 1; i.e. on MSR’s analogies, usingcds= 0.75instead ofcds= 1improved PPMI’s accuracy from .443 to .535.