Building and processing a dataset containing articles related to food adulteration

(1)

Building and processing a dataset containing articles

related to food adulteration

by

Deepak Narayanan

Submitted to the Department of Electrical Engineering and Computer

Science

in partial fulfillment of the requirements for the degree of

Masters of Engineering, Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2015

c

○ Massachusetts Institute of Technology 2015. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

May 22, 2015

Certified by . . . .

Regina Barzilay

Professor

Thesis Supervisor

Accepted by . . . .

Albert Meyer

Chairman, Masters of Engineering Thesis Committee

(2)

(3)

Building and processing a dataset containing articles related

to food adulteration

by

Deepak Narayanan

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 2015, in partial fulfillment of the

requirements for the degree of

Masters of Engineering, Computer Science and Engineering

Abstract

In this thesis, I explored the problem of building a dataset containing news articles related to adulteration, and curating this dataset in an automated fashion. In partic-ular, we looked at food-adulterant co-existence detection, query reforumulation, and entity extraction and text deduplication. All proposed algorithms were implemented in Python, and performance was evaluated on multiple datasets. Methods described in this thesis can be generalized to other applications as well.

Thesis Supervisor: Regina Barzilay Title: Professor

(4)

(5)

Acknowledgments

First and foremost, I would like to thank my advisor, Prof. Regina Barzilay, for giving me my first real opportunity at academic research. Without her feedback, this thesis would not be possible.

I would also like to thank Prof. Tommi Jaakola for his insightful comments – a lot of the initial work in this thesis was a bi-product of suggestions made by him.

Rushi Ganmukhi has been a major contributor to the FDA project over the last year, and a lot of the work in this thesis was done in close collaboration with him.

To the other members of the RBG group – thank you! Attending weekly group meetings has given me a glimpse into how academic research is done. In particu-lar, Karthik Narasimhan and Tao Lei were always extremely willing to answer my questions, however dumb.

Last, I would like to thank my mom for always pushing me to become better, both as a person and as a student. I would not be here without her support and dedication.

(6)

(7)

List of Figures

1-1 Different data sources that can be used to extract information about adulteration from . . . 17 3-1 Example of query refining . . . 32 3-2 Example of patterns extracted from tweets describing music events in

New York . . . 39 3-3 Design of snowball system . . . 40 4-1 CRF model – here 𝑥 is the sequence of observed variables and 𝑠 is the

sequence of hidden variables . . . 44 4-2 Example clustering of documents using HCA . . . 51 4-3 Pseudocode of HCA algorithm, with termination inter-cluster

(10)

(11)

List of Tables

2.1 F-measures obtained with collaborative filtering with different 𝑑 values 27 2.2 MAP scores on adulteration dataset . . . 28 4.1 Precision, recall and F1 of NER on training data . . . 47 4.2 Precision, recall and F1 of NER on testing data . . . 47 4.3 F-measure of clustering algorithms with and w/o keyword extraction

on Google News articles . . . 53 4.4 F-measure of clustering algorithms with and w/o keyword extraction

on set of 53 adulteration-related articles . . . 54 4.5 F-measure of clustering algorithms with and w/o keyword extraction

on set of 116 adulteration-related articles . . . 54 4.6 Recall and precision of record extraction on Twitter corpus . . . 64 4.7 Precision, recall and F1 of NER on Twitter event data . . . 64 4.8 Recall and precision of record extraction on adulteration datasets . . 64 4.9 Recall, precision and F-measure of clustering using combined approach

(12)

(13)

Chapter 1 Introduction

Over the last couple of years, instances of food adulteration have been on the rise. Here in the United States, the Food and Drug Administration (FDA) regulates and enforces laws over food safety. With food-related illnesses becoming increasingly com-mon, the FDA has recently become more interested in identifying and understanding the risks associated with imported FDA-regulated products.

The reasons for increased risk are often varying and hard to pinpoint. For instance, a couple of years ago, milk imported from China contained an increased amount of melamine (a chemical that can be extremely harmful to human beings but often added to foods to increase their apparent protein content) – this later was attributed to the increased grain prices in China. In order to make ends meet and cut costs, farmers added melamine to heavily watered-down milk to keep the protein content of the diluted milk high.

In addition, since food adulteration is largely decentralized, one of the primary challenges with understanding the causes of such risks is being able to piece together a number of disparate data sources. With the unified data, it would be possible to run inference models to obtain timely prospective information about instances of increased risk.

The main goals of this thesis are two-fold – one, building a system capable of re-trieving documents (in particular, news articles) that talk about specific adulteration incidents, and two, developing algorithms that are capable of curating this dataset

(14)

and making it more usable in an automated way.

Retrieval of relevant news articles

Our approach to retrieving all news articles related to food adulteration largely relies on the ability to determine if a particular food product and adulterant can co-exist with each other. To this end, much of the intial work presented in this thesis is con-centrated on building a classifier that is capable of determining if a food product and an adulterant can co- exist. We have information about some cases of adulteration, but have no information about most (food, adulterant) pairs. Given that similar food products are more likely to be adulterated by the same adulterants, and sym-metrically, similar adulterants are more likely to adulterate the same foods, it seems possible to build a classifier that predicts such co-existences, using information about foods and adulterants that is easily available.

One approach that we considered involved using a standard classification algo-rithm (such as SVM) to classify pairs of foods and adulterants – here each pair has some feature representation, and we try to classify each such vector as positive or negative. One inherent problem with this method is that it doesn’t really take into account interactions between different food and adulterant features (even though some of these relationships can potentially be exposed by using kernels). It seems much more natural to solve this problem using collaborative filtering – a technique used by many recommender systems.

Collaborative filtering is very loosely defined as the process of filtering for informa-tion or patterns using techniques that combine multiple view points or data sources. Netflix extensively uses collaborative filtering for movie recommendations. Most peo-ple have not watched all movies, so a good way of predicting a particular person’s reaction to a movie is to first find people with similar movie tastes, and then use the reviews of these like-minded people to generate a predicted rating.

The two approaches we’ve described so far face the same inherent problem – for our particular task there are no negative training examples (because we only have

(15)

information about a food product and adulterant co-existing, we don’t have any information about a food product and adulterant not co- existing) – a naive way of overcoming this problem is to cast all (or just a subset of) food - adulterant pairs not seen in our datasources as negative in the training step.

However, this seems like a poor way to solve this problem since some of these (food, adulterant) pairs may in fact co-exist (and with this approach we would mark them as negative in training). Moreover, our main goal remains trying to make predictions about unseen food and adulterant pairs – labeling all of these pairs as negative in our training seems at odds with the goals of our task, and makes it less likely that we will in fact be able to discover new co-existing (food, adulterant) pairs. Later in this thesis, we present an approach that attempts to circumvent this problem in a more natural manner, by looking at the problem from the perspective of ranking.

Once we have a sufficiently robust classifier which can predict co- relations with reasonable accuracy and precision, the next important step is to develop an automated process that not only generates queries given food products and adulterants that co-exist with each other, but also retrieves documents relevant to the generated query. We assume here that we have access to a stanndard information retrieval system like Google that is capable of retrieving documents relevant to a particular query.

Our query formulation and information retrieval tasks are in fact intertwined, since the only way we can really evaluate the performance of the queries generated by the above process is by evaluating the quality of the documents generated by running a retrieval task with the given query. We talk about some ways of generating suitable queries given metadata about the incident in Chapter 3 of this thesis.

At the end of chapter 3, we briefly discuss another way of looking at the article retrieval problem.

Curation of dataset

Creating a dataset of incidents that is easily usable isn’t confined to merely retrieving documents that describe an adulteration incident. In order for these articles to be

(16)

“understandable” by a computer system, some additional steps are imperative – for example, we often need to extract recognized entities from an unstructured blob of text, because computer programs understand structured information a lot better than unstructured information.

Furthermore, extracting well-defined fields from unstructured text allows us to access subsets of our dataset in much easier ways. For example, extracting the food and adulterant from every adulteration-related article allows us to look at adulteration incidents involving “milk” and “melamine” easily. We present results of running the current state-of-the-art entity recognition system, which makes use of a CRF, on our dataset of adulteration articles.

In addition, since articles are often extracted from a number of diverse sources, articles describing the same incident may in fact be lexically very different. Figuring out that these articles are “duplicates” is not as trivial as comparing two strings. We model deduplication as a clustering task, where we’re trying to classify articles related to the same incident into the same cluster, and classify unrelated articles into different clusters.

One can also think of making this task online – as soon as we retrieve an article, we can run our duplicate detection algorithm to determine if the new article is a duplicate of an existing article. This seems especially important when we’re using our algorithms to detect potential new adulteration incidents in real time.

The majority of chapter 4 describes an approach using variational inference to combine the above two tasks – record extraction and clustering – to reduce the noise associated with running each of these tasks individually. Our hope is that articles talking about the same incident must contain the same entitites, and articles with the same entities should be classified into the same cluster.

Use cases of this dataset

Once we have a database of news articles related to cases of adulteration, we can move on to building models that attempt to understand the types of events that may

(17)

Figure 1-1: Different data sources that can be used to extract information about adulteration from

lead to elevated risks. These indicators are most likely to be some complex pattern at the intersection of multiple data sources (for example, events now could lead to an increased number of adulteration incidents four months down the road), which is why its imperative to first build a reliable retrieval system that is capable of collecting data from these multiple data sources.

Note that even though the techniques presented in this thesis were tuned with the specific application of retrieving adulteration-related articles, the overall tech-niques are easily generalizable and can be applied to other, more generic, information retrieval tasks.

(18)

(19)

Chapter 2 Food and adulterant co-existence

Our approach to retrieving articles requires a relatively comprehensive list of co-existing foods and adulterants. Keeping this in mind, the first task we look at is the task of detecting whether a food and adulterant can co-exist with each other. With a classifier capable of predicting food-adulterant co-existence with reasonable accuracy, we can retrieve news articles that describe an incident involving the particular food and adulterant.

2.1 SVM classifier

Our first approach, which is also the most naive approach, is to build a simple Support Vector Machine classifier that classifies (food, adulterant) pairs as positive or negative depending on whether the food and adulterant indeed co-exist or not.

Given a feature vector representation 𝑥 for a food, and a feature vector represen-tation 𝑧 for an adulterant, we construct a feature vector represenrepresen-tation for a (food, adulterant) pair as vec(𝑥𝑧𝑇_{), which allows food features and adulterant features to}

interact jointly (unlike concatenating 𝑥 and 𝑧 for example) – for example, if 𝑥 has a feature which takes value 1 if the food is high in protein content and 0 otherwise, and 𝑧 has a feature which takes value 1 if the adulterant is melamine and 0 otherwise, then the (food, adulterant) pair would have a feature which has value 1 if the food is high in protein content and the adulterant is melamine – such a representation gives

(20)

the classifier more information to work with.

Given the feature vector representation of a (food, adulterant) pair, we try classi-fying different pairs as positive or negative by attempting to find a separator between the positive and negative pairs seen in the training set. Note that this separator need not be a straight line and can take different shapes depending on the type of kernel used.

However, we observed that this approach has the following fundamental problems, ∙ Our training data fundamentally doesn’t have negative examples, because we

can only observe positive examples.

Not observing a (food, adulterant) pairs can mean one of two things – one, the food and adulterant indeed can’t co-exist, or two, the food and adulterant co-exist but we just don’t have evidence of them co-existing.

In our linear classifier approach, we are forced to have some negative examples provided to the classifier as part of the training data; otherwise our classifier will learn to classify every example as positive! We can choose a subset of (food, adulterant) pairs that we haven’t seen to be “negative examples” – it is possible with this approach to label examples that are truly positive to be negative, which would lead the classifier to learn incorrect relationships.

∙ A standard linear classifier is also not able to take full advantage of the similari-ties between different foods and different adulterants. It intuitively makes sense that similar foods are adulterated by similar adulterants and similar adulterants adulterate similar foods, but a standard maximum-margin classifier is not able to leverage this information effectively.

2.2 Collaborative filtering classifier

Collaborative filtering seems more compatible with the task at hand, since collabora-tive filtering inherently attempts to leverage similarity in different domains to make better classification decisions.

(21)

Our collaboratively-filtered classifier can be mathematically formulated as finding a linear classifier of the form 𝑥𝑇_{𝑈 𝑉}𝑇_{𝑧, where 𝑥 is the feature vector representation}

of a food product and 𝑧 is the feature vector representation of an adulterant, and 𝑈 and 𝑉 are weight parameter matrices.

Here, we assume that the food vector 𝑥 has dimension 𝑑1×1, the adulterant vector

𝑧 has dimension 𝑑2× 1, and the paramters 𝑈 and 𝑉 have dimensions 𝑑1× 𝑑 and 𝑑2× 𝑑

respectively (𝑑 is an adjustable parameter).

Training this classifier involves finding the optimal parameters 𝑈 and 𝑉 that minimize the loss function. In this case, we will use the quadratic loss function, but other loss functions can be used as well. Mathematically, our task is to find parameters 𝑈 and 𝑉 that minimize total loss given as,

𝐿(𝑈, 𝑉 ) = ∑︁

(𝑖,𝑗)∈𝑇

(𝑦𝑖𝑗 − 𝑥𝑇𝑖 𝑈 𝑉 𝑇

𝑧𝑗)2

Here, 𝑇 is the set of food-adulterant pairs (𝑖, 𝑗) in our training set, and 𝑦𝑖𝑗 is the

gold label of the pair (𝑖, 𝑗) – as with the linear classifier approach described above, we need to choose a subset of (food, adulterant) pairs to be “negative”. As is evident, this formulation is trying to move predictions as close to true labels as possible.

We can think of this entire setup as a low rank decomposition of a weight matrix 𝑊 .

To prevent overfitting, we introduce a regularization term (the sum of the Frobe-nius norms of 𝑈 and 𝑉 ), to obtain the following expression,

𝐿(𝑈, 𝑉 ) = ∑︁ (𝑖,𝑗)∈𝑇 𝐶 · (𝑦𝑖𝑗− 𝑥𝑇𝑖 𝑈 𝑉 𝑇_𝑧 𝑗)2+ |𝑈 |2𝐹 + 𝑉 | 2 𝐹

Here, 𝐶 is a constant used to trade off between the loss function term and the regularization term – the higher the value of 𝐶, the more the classifier listens to training examples.

We want to find values of 𝑈 and 𝑉 that minimize 𝐿(𝑈, 𝑉 ).

(22)

expression, 𝐿(𝑈, 𝑉 ) = ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 𝐶 · (𝑦𝑖𝑗 − vec(𝑈 )𝑇vec(𝑥𝑖𝑣𝑗𝑇)) 2 ⎞ ⎠+ |𝑈 |2_𝐹 + 𝑉 |2_𝐹 Consider the partial derivative of 𝐿(𝑈, 𝑉 ) with respect to 𝑈 ,

𝜕𝐿 𝜕𝑈 = 0 ⇒ 2𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇

(𝑦𝑖𝑗 − vec(𝑈 )𝑇vec(𝑥𝑖𝑣𝑇𝑗)) · −vec(𝑥𝑖𝑣𝑇𝑗)

⎞ ⎠+ 2 · vec(𝑈 ) = 0 ⇒ vec(𝑈 ) + 𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 vec(𝑥𝑖𝑣𝑇𝑗)vec(𝑥𝑖𝑣𝑇𝑗) 𝑇 ⎞ ⎠· vec(𝑈 ) = 𝐶( ∑︁ (𝑖,𝑗)∈𝑇 𝑦𝑖𝑗vec(𝑥𝑖𝑣𝑗𝑇))

Now, let’s define 𝐴 as

𝐴 = 𝐼 + 𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 vec(𝑥𝑖𝑣𝑗𝑇)vec(𝑥𝑖𝑣𝑗𝑇) 𝑇 ⎞ ⎠ and 𝑏 as 𝑏 = 𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 𝑦𝑖𝑗vec(𝑥𝑖𝑣𝑗𝑇) ⎞ ⎠

Note that 𝐴 is a matrix of dimension 𝑑1𝑑 × 𝑑1𝑑 and 𝑏 is a vector of dimension

𝑑1𝑑 × 1.

With these definitions, we can see that

𝐴 · vec(𝑈 ) = 𝑏 ⇒ vec(𝑈 ) = 𝐴−1𝑏

We can compute the updated value of parameter 𝑉 similarly, with 𝐴′ defined as 𝐴′ = 𝐼 + 𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 vec(𝑧𝑗𝑤𝑖𝑇)vec(𝑧𝑗𝑤𝑇𝑖 ) 𝑇 ⎞ ⎠

(23)

and 𝑏′ defined as 𝑏′ = 𝐶 ⎛ ⎝ ∑︁ (𝑖,𝑗)∈𝑇 𝑦𝑖𝑗vec(𝑧𝑗𝑤𝑇𝑖 ) ⎞ ⎠

(here, again for simplicity of notation, we define 𝑤𝑖 as 𝑈𝑇𝑥𝑖). 𝐴′ is of dimension

𝑑2𝑑 × 𝑑2𝑑 and 𝑏′ is of dimension 𝑑2𝑑 × 1.

With these definitions, we see that

𝐴′ · vec(𝑉 ) = 𝑏′ ⇒ vec(𝑉 ) = 𝐴′−1𝑏′

As with any collaboratively trained model, 𝑈 and 𝑉 are initialized as random matrices of the right dimension. Then, to obtain optimal values of 𝑈 and 𝑉 , we first fix 𝑈 and find optimal 𝑉 , and then fix 𝑉 and find optimal 𝑈 . We repeat the above two steps until convergence.

Since we can have vastly different number of positive and negative examples, we set different values of 𝐶 for examples with positive and negative labels. We set 𝐶+

to be equal to 1/𝑛− and 𝐶− to be equal to 1/𝑛+. This ensures that if we have less

positive examples, mislabeling a positive example is penalized more heavily. This is a fairly standard technique used in other learning tasks as well to account for a disparity in number of different labels available in training.

2.3 Passive-aggressive ranker

Both approaches we’ve described so far fail to deal with the lack of negative examples effectively. We now describe an approach that doesn’t make assumptions about pairs that we haven’t seen so far.

Instead of fundamentally treating this task as a classification problem where we need to predict a label given some properties about a food and adulterant, we can think of this task as a ranking problem where we try to rank pairs that we’ve seen before higher than pairs we haven’t seen so far.

(24)

Let 𝐴 = {𝑦1, 𝑦2, . . . , 𝑦𝑚} be the set of all adulterants we have ever seen and

𝐹 = {𝑥1, 𝑥2, . . . , 𝑥𝑛} be the set of all foods we have ever seen. Then, since we

have seen an incomplete set of (food, adulterant) pairs in our training data, we can represent our training data as,

𝑥1 {𝑦𝑖11, 𝑦𝑖12, . . . , 𝑦𝑖_1𝑠1}

𝑥2 {𝑦𝑖21, 𝑦𝑖22, . . . , 𝑦𝑖_2𝑠2}

..

. ...

𝑥𝑛 {𝑦𝑖𝑛1, 𝑦𝑖𝑛2, . . . , 𝑦𝑖𝑛𝑠𝑛}

where {𝑦𝑖11, 𝑦𝑖12, . . . , 𝑦𝑖_1𝑠1} is the set of adulterants that we know co-exist with the

food 𝑥1, {𝑦𝑖21, 𝑦𝑖22, . . . , 𝑦𝑖_2𝑠2} is the set of adulterants that we know co-exist with the

food 𝑥2, etc. For simplicity of notation, let us denote the set of adulterants that

co-exist with food 𝑥𝑗 as the set 𝑆𝑗.

We now want to calculate a weight vector 𝑊 such that for all 𝑘 ∈ 𝑆𝑗,

𝑥𝑇_𝑗𝑊 𝑦𝑘≥ max 𝑙̸∈𝑆𝑗

𝑥𝑇_𝑗𝑊 𝑦𝑙+ 1

The main intuition here is that all adulterants that are known to co-exist with a particular food should rank (or score) higher than all other adulterants that we have no information about in the context of its co-existence with the particular food.

Note that we have no common scale across foods – we only make comparisons between adulterants for a particular food. Also note that this formulation does not need to make any assumptions about negative examples in the dataset – this ensures that the model doesn’t see incorrect relationships in the training dataset.

There are two ways of going about solving this system of inequations. The first way involves minimizing the norm of the weight matrix, subject to the specified constraints – similar to the way a maximum margin classifier is formulated. This

(25)

however, can blow up computationally.

We can also think of the above system of constraint equations in terms of loss functions. Instead of satisfying the following constraint equation,

𝑥𝑇_𝑗𝑊 𝑦𝑘≥ max 𝑙̸∈𝑆𝑗

𝑥𝑇_𝑗𝑊 𝑦𝑙+ 1

we can think about minimizing the loss,

Loss(𝑥𝑇_𝑗𝑊 𝑦𝑘− max 𝑙̸∈𝑆𝑗

𝑥𝑇_𝑗𝑊 𝑦𝑙)

where Loss(𝑧) = max{1 − 𝑧, 0} (the hinge loss function).

We can now use a passive-aggressive like algorithm to update the weight parame-ters with every encountered training example. If our initial weight vector is 𝑊 , then for every (food, adulterant) pair (𝑥𝑗, 𝑦𝑘) a new weight vector 𝑊′ can be obtained by

minimizing,

Loss(𝑥𝑇_𝑗𝑊 𝑦𝑘− 𝑥𝑇𝑗𝑊 𝑦𝑚) +

𝜆 2 · |𝑊

′ _{− 𝑊 |}2

where 𝜆 is a parameter that controls how close 𝑊′is to 𝑊 , and 𝑚 = argmax_{𝑙̸∈𝑆}_𝑗𝑥𝑇 𝑗𝑊 𝑦𝑙.

(We can think of the second term as a regularization term that attempts to prevent overfitting)

This results in the following update,

𝑊′ = 𝑊 + 𝜂(vec{𝑥𝑗𝑦𝑘𝑇} − vec{𝑥𝑗𝑦𝑚𝑇})

Here, 𝜂 is the learrning rate – the smaller the value of 𝜂, the closer 𝑊′ remains to 𝑊 .

We perform such updates to 𝑊 for every (food, adulterant) pair that exists in our training data, and repeat until convergence.

(26)

2.4 Evaluation

2.4.1 Collaborative filtering classifier

We measure the performance of our collaborative filtering classifier using a metric called F-measure which is commonly used in information retrieval.

Before defining what we mean by F-measure, we define two other metrics, called precision and recall.

∙ Precision: In information retrieval, precision is defined as the fraction of re-trieved documents that are relevant. In our task, we represent a (food, adulter-ant) pair as a document. Mathematically, we can express this as,

Precision = Number of relevant and retrieved documents Number of retrieved documents

∙ Recall: Recall is defined as the fraction of relevant documents that are re-trieved. Mathematically, this can be expressed as,

Recall = Number of relevant and retrieved documents Number of relevant documents

∙ F-measure: F-measure is the harmonic mean of precision and recall. That is,

F-measure = 2 · Recall · Precision Recall + Precision

We present the results of our collaborative filtering model below.

We ran our experiment on a European database of food adulteration incidents. We extracted all foods and adulterants involved in adulteration incidents from the database, and then split all such extracted foods 80 − 20 into training and testing sets. Training and testing foods were then combined with all known adulterants to get training (food, adulterant) pairs and testing (food, adulterant) pairs. Our training set contained 280,000 pairs and our testing set contained 68,000 pairs.

(27)

We use a food feature vector representation that consists of a binary expansion of the food’s food category along with a meta feature vector that is a binary expansion of the origin country and country in which the incident was reported.

The adulterant feature vector consists of a binary expansion of the adulterant’s adulterant ID.

Average F-measure over 5 runs Collaborative filtering with 𝑑 = 1 0.29

Collaborative filtering with 𝑑 = 3 0.36 Collaborative filtering with 𝑑 = 5 0.38

Table 2.1: F-measures obtained with collaborative filtering with different 𝑑 values

We do not present results obtained with our SVM classifier because the results obtained are extremely low, due to the reasons discussed earlier in this chapter.

The results of our collaboratively filtering approach, even though somewhat promis-ing, can still be improved. We hypothesize that not being able to deal with negative examples in a clean way is the main reason why.

2.4.2 Passive-aggressive ranker

We measure the performance of our passive-aggressive ranker using a different metric called Mean Average Precision (MAP). As with F-measure, the MAP score ranges between 0.0 and 1.0, with a higher score indicating better ranking performance.

Before defining what we mean by Mean Average Precision, we explain Average Precision.

Precision @ 𝐾 is defined as the fraction of relevant documents in the top 𝐾 documents retrieved by a query.

Mathematically,

Average Precision = ∑︀𝑅

𝑖=1Precision @ 𝐾𝑖

𝑅

where 𝑅 is the total number of relevant documents, and 𝐾𝑖 is the rank of the 𝑖th

(28)

Mean average precision is the average precision averaged across multiple queries / rankings.

Mean Average Precision is a metric for ranking performance that takes into ac-count percentage of relevant documents retrieved at various cutoff points – it is order-sensitive, in that an ordering with five relevant documents followed by five irrelevant documents would have a much higher score than an ordering with five irrelevant documents followed by five relevant documents.

We present the performance of our passive-aggressive ranker below. We use the same dataset as above, (all foods are split 80 − 20 into training and testing sets). The number of foods in the training set is 1273, and the number of foods in the testing set is 337. The number of adulterants is 124. We train our ranker on the training set, and then rank all adulterants for foods in the test set, given the gold list of adulterants co-existing with the particular food.

As with our collaborative filtering classifier, our food vector representation consists of a binary representation of the food’s category along with metadata information about the incident. Our adulterant vector representation, as before, consists of a binary representation of the adulterant’s adulterant ID.

To evaluate performance, we rank all adulterants for every food in our test set, and compute the MAP score given the gold list of adulterants co-existing with a food in the testing set, extracted from the European dataset.

Average MAP Passive aggressive ranker on training data 0.84

Passive aggressive ranker on testing data 0.59 Passive aggressive ranker on both training and testing data 0.75

Table 2.2: MAP scores on adulteration dataset

MAP has a relatively simple interpretation – the reciprocal of the MAP score gives the rank of the returned, relevant document on average. For example, a MAP score of 0.5 means that on average, returned relevant documents had a rank of 2; a MAP score of 0.25 means that on average, returned relevant documents had a rank of 4, etc.

(29)

Given this interpretation, our MAP score of around 0.59 indicates relatively good performance since it means on average, co-existing adulterants ranked in the top 2 among all adulterants for a previously unseen food product.

(30)

(31)

Chapter 3 Query refinement and relevance

feedback

Query refinement is a problem that is of much interest in the field of Information Retrieval. Given that people often find it hard to accurately express their search needs in the form of a short and concise query, popular search engines such as Google and Bing use query refinement techniques to improve the quality of search results returned to the user. Spelling correction and query expansion (Google’s "did you mean" feature) are popular applications of query refinement.

Very simplistically, query refinement can be thought of as a ranking procedure – given a query, potential query replacements are generated, which can then be sorted using a scoring function to generate the “best” refined queries.

The scoring function used is heavily dependent on the task at hand – for exam-ple, for spelling correction, a good scoring function is the edit distance between the candidate query and the original query, since common misspellings can often be at-tributed to typos. Better scoring functions also take into account context to improve the quality of results – for example, even though “there” and “their” are both valid words with similar spellings, they obviously cannot be used interchangeably.

The major problem with query refinement is that for it to be effective, large amounts of data are needed. In practice, both spelling correction and query ex-pansion depend on query logs (extensive records of how users have searched in the

(32)

Figure 3-1: Example of query refining

past) for both a good list of possible candidates, as well as accurate enough data to meaningfully rank the generated list of candidates.

We don’t really have query logs at our disposal in this problem, which means it is unclear how to use standard query refinement techniques to solve our problem – in fact, the problem we are trying to solve involves generating queries that would retrieve better data, so using a technique that requires a lot of data creates a classic chicken-and-egg problem.

Relevance feedback is a technique frequently used by information retrieval systems to improve the quality of search results returned to a user, and does not require a large amount of data. Relevance feedback uses user input to get a better idea of the type of results the user is looking. Relevance feedback algorithms usually work well when all relevant documents can be clustered together, and when it’s relatively straight forward to distinguish between relevant and irrelevant documents for a particular query.

Since it’s impractical to expect someone to manually annotate all retrieved doc-uments as good or bad in an offline system that does not accept user feedback in an easy way, it is necessary to devise a method to use relevance feedback in a more automated way.

Given a classifier capable of determining if a particular document is a news article reporting an incident of adulteration, we can use relevance feedback on our problem – start with some initial version of a query, rate each document in the set of documents retrieved by that query as appropriate or not using the classifier and then reiterate on the query until a reasonable number of relevant documents are retrieved. We call

(33)

this modified relevance feedback approach pseudo-relevance feedback.

Standard classification algorithms, however, often require a fair amount of training data to generate reasonably correct results, which increases the amount of annotation required. In this chapter, we describe a transductive classification approach that attempts to produce a generalizable classifier with minimal training data.

We begin by describing the general transductive learning setup with pseudo-relevance feedback, and then discuss alternate ways of retrieving relevant documents.

3.1 Pseudo-relevance feedback with a transductive

classifier

We first discuss our transductive classification scheme, and then discuss a potential pseudo-relevance feedback approach.

3.1.1 Transductive learning

In transductive learning, we try to predict the labels of test examples directly by ob-serving training examples. This is in contrast to inductive learning where we attempt to infer general rules from training examples, and then apply these general rules to test examples.

Since transductive learning is trying to solve an easier task (by not trying to learn a generalizable mapping), transductive tasks most often require less training data (and hence less annotation) than inductive tasks.

In a transductive inference setup [6], we are given 𝑛 points −→𝑥1, −→𝑥2, . . . , −→𝑥𝑛, with

true labels 𝑦1, 𝑦2, . . . , 𝑦𝑛.

At training time, we are given the true labels of a subset of points 𝑌 ⊂ {𝑦1, 𝑦2, . . . , 𝑦𝑛}

(known as the training points), and we want to predict the labels 𝑦 ∈ {−1, 1} of all other points as accurately as possible.

The main difference between this setup and that of a standard inductive learning task is that we have access to the testing examples during training.

(34)

We want to train a transductive learner that is self-consistent – that is, our classi-fier should not be overly affected by any one training example. If a classiclassi-fier is overly influenced by a couple of training examples, then it is less generalizable and more likely to make mistakes on examples it hasn’t seen before. A metric for self-consistency is leave-one-out error.

Consider a 𝑘-nearest neighbor classifier that determines the label of an unseen point 𝑥 based on the labels of the majority of 𝑥’s 𝑘 nearest neighbors.

Define 𝛿𝑖,

𝛿𝑖 = 𝑦𝑖

∑︀

𝑗∈𝐾𝑁 𝑁 (−→𝑥𝑖)𝑦𝑗

𝑘

If 𝛿𝑖 < 0, then clearly we will make a leave-one-out error on 𝑥𝑖, since 𝛿𝑖 being less

than 0 implies that a majority of 𝑥𝑖’s 𝑘 nearest neighbors have a label opposite to

𝑥𝑖’s true label.

We can modify the standard 𝑘-nearest neighbor classifier presented above to in-clude weights, creating a weighted 𝑘-nearest neighbor classifier. If 𝑤𝑖𝑗 is the similarity

between data points 𝑖 and 𝑗, then if

𝛿𝑖 = 𝑦𝑖

∑︀

𝑗∈𝐾𝑁 𝑁 (−→𝑥𝑖)𝑦𝑗𝑤𝑖𝑗

∑︀

𝑚∈𝐾𝑁 𝑁 (−→𝑥𝑖)𝑤𝑖𝑚

we will have a leave-one-out error on 𝑥𝑖 if 𝛿𝑖 < 0.

Given that we want to keep leave-one-out error as small as possible, our goal should be to make 𝛿𝑖 as large as possible (larger 𝛿𝑖 implies lower error). We can

formulate this as the following optimization problem.

max₋_→ 𝑦 𝑛 ∑︁ 𝑖=1 ∑︁ 𝑗∈𝑘𝑁 𝑁 (−→𝑥𝑖) 𝑦𝑖𝑦𝑗 𝑤𝑖𝑗 ∑︀ 𝑚∈𝐾𝑁 𝑁 (−→𝑥𝑖)𝑤𝑖𝑚 s.t. 𝑦𝑖 = 1, if 𝑖 ∈ 𝑌 and positive s.t. 𝑦𝑖 = −1, if 𝑖 ∈ 𝑌 and negative ∀𝑗, 𝑦𝑗 ∈ {−1, 1}

(35)

Let us define 𝐴𝑖𝑗 as

𝑤𝑖𝑗

∑︀

𝑚∈𝐾𝑁 𝑁 (−→𝑥𝑖)𝑤𝑖𝑚

Then we see that for our objective to be maximized, we need to maximize

𝑛 ∑︁ 𝑖=1 ∑︁ 𝑗∈𝑘𝑁 𝑁 (𝑥𝑖) 𝑦𝑖𝑦𝑗𝐴𝑖𝑗

for all (𝑖, 𝑗) for which 𝑦𝑖𝑦𝑗 = −1.

We can now think of the 𝑥𝑖’s as vertices in a graph, and 𝐴𝑖𝑗 as the weight of the

edge from 𝑖 to 𝑗. Then if 𝐺+ represents the sub-graph with all positive vertices (all

vertices with label of +1), and 𝐺− represents the sub-graph with all negative vertices

(all vertices with label −1), we need to maximize −cut(𝐺+, 𝐺−) (the sum of edge

weights between vertices in 𝐺+ and 𝐺−), or minimize cut(𝐺+, 𝐺−).

We can think of this classifier now as a modified clustering task, where we’re trying to classify unlabeled examples close to positive training examples as positive and classify unlabeled examples close to negative training examples as negative.

Note that the problem as currently formulated, closely resembles the classic min-cut problem. However, simply finding a solution to this problem may lead to poten-tially highly undesirable solutions – for example, a viable partition could involve one sub-graph having just one vertex, and the other sub-graph having all other vertices. This is a degenerate solution.

Part of the reason for this problem is the fact that our above formulation doesn’t take into account the number of edges crossing the cut – the degenerate solution which includes one vertex on one side, and all other vertices on the other side has a very small number of edges passing through the cut, which automatically prefers cuts of smaller weight.

This problem can be overcome by instead trying to minimize cut(𝐺+, 𝐺−) |𝐺+||𝐺−|

(the average cut weight) subject to the same constraints as before.

Obtaining an exact solution to this problem is extremely hard. However, there exist efficient spectral methods that approximate the solution to this problem. We

(36)

describe one such method below. [4]

Before proceeding, let us define the Laplacian 𝐿 of a graph 𝐺 as the 𝑛 × 𝑛 sym-metric matrix whose (𝑖, 𝑗)th entry is given by,

𝐿𝑖𝑗 = ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ ∑︀ 𝑘𝐸𝑖𝑘, if 𝑖 = 𝑗

−𝐸𝑖𝑗, 𝑖 ̸= 𝑗 and edge (𝑖, 𝑗) exists

0 otherwise

We claim that for any arbitrary 𝑛 × 1 vector 𝑥, 𝑥𝑇𝐿𝑥 =∑︀

(𝑖,𝑗)∈𝐸(𝑥𝑖− 𝑥𝑗) 2_𝐸

𝑖𝑗.

This is easy to observe from the fact that the Laplacian matrix 𝐿 can be expressed as 𝐼𝐺𝐼𝐺𝑇, where 𝐼𝐺is the incidence matrix of the graph 𝐺 and has dimensions of 𝑛×𝑚.

The column in matrix 𝐼𝐺 representing the edge between vertices 𝑖 and 𝑗 has all

zeros except at the 𝑖th and 𝑗th rows, which has values√︀𝐸𝑖𝑗 and −√︀𝐸𝑖𝑗 respectively.

𝑥𝑇_{𝐿𝑥 can now be expressed as (𝐼}𝑇

𝐺𝑥)𝑇(𝐼𝐺𝑇𝑥).

Observe that 𝐼_𝐺𝑇𝑥 is a column-vector with 𝑒th row value√︀𝐸𝑖𝑗(𝑥𝑖−𝑥𝑗) where 𝑒 is an

edge between vertices 𝑖 and 𝑗. This implies that (𝐼𝑇

𝐺𝑥)𝑇(𝐼𝐺𝑇𝑥) =

∑︀

(𝑖,𝑗)∈𝐸(𝑥𝑖− 𝑥𝑗)2𝐸𝑖𝑗.

If we constrain 𝑧𝑖 to only take values {𝛾+, 𝛾−}, where 𝛾+ =

|{𝑖 : 𝑧𝑖 < 0}| |{𝑖 : 𝑧𝑖 > 0}| and 𝛾− = |{𝑖 : 𝑧𝑖 > 0}| |{𝑖 : 𝑧𝑖 < 0}|

, we see that 𝑧𝑇_{𝐿𝑧 is equal to}

𝑛2 cut(𝐺+, 𝐺−) |{𝑖 : 𝑧𝑖 < 0}||{𝑖 : 𝑧𝑖 > 0}|

provided 𝑧𝑖 = 𝛾+ for all positive points and 𝑧𝑖 = 𝛾− for all negative points.

𝑛 is a constant (total number of examples), which implies that minimizing the average cut subject to constraints is equivalent to minimizing 𝑧𝑇_{𝐿𝑧 subject to the}

constraints.

In addition, observe that from the way 𝑧 was constructed above, 𝑧𝑇_{𝑧 = 𝑛 and}

𝑧𝑇1 = 0

Using all this information, we can formulate our objective as the following opti-mization problem (note that we’re still ignoring our original constraint equations),

(37)

min

𝑧 𝑧 𝑇_𝐿𝑧

s.t. 𝑧𝑇1 = 0 and 𝑧𝑇𝑧 = 𝑛

We try to ensure that the predictions −→𝑧 stay close to the expected predictions −→𝛾 (that is, try to ensure that the constraints are satisfied) by adding in a loss term to the formulation above,

min

𝑧 𝑧

𝑇_{𝐿𝑧 + 𝑐(𝑧 − 𝛾)}𝑇_{𝐶(𝑧 − 𝛾)}

s.t. 𝑧𝑇1 = 0 and 𝑧𝑇𝑧 = 𝑛

Here, C is a matrix that allows us to weight the importance of different labels differently, and −→𝛾 is a vector with the 𝑖th element equal to 𝛾+ if the 𝑖th element has

a positive label, 𝛾− if the 𝑖th element has a negative label, and 0 otherwise.

We can find an approximate solution to the above system by performing an eigen-decomposition on 𝐿, which is possible since 𝐿 is a symmetric matrix. Let’s express 𝐿 as 𝑈 Σ𝑈𝑇_{, where Σ is a diagonal matrix containing the eigen values of 𝐿, and 𝑈 is}

the eigenvalue matrix and is unitary. Also from the way 𝐿 is formulated, we see that −

→

1 is an eigenvector of 𝐿, with a corresponding eigenvalue of 0.

Let 𝑉 be the eigenvector matrix with−→1 removed, and 𝐷 be the eigenvalue matrix with 0 removed.

Now

𝑈 Σ𝑈𝑇 ≈ 𝑉 𝐷𝑉𝑇

.

(38)

min

𝑧 𝑤

𝑇_{𝐿𝑤 + 𝑐(𝑉 𝑤 − 𝛾)}𝑇_{𝐶(𝑉 𝑤 − 𝛾)}

s.t. 𝑤𝑇𝑤 = 𝑛

[8] specifies how to obtain the optimal solution to this problem.

Given an adjacency matrix 𝐴 of a graph, the following steps can be performed to obtain label assignments that minimize the total leave-one-out error,

∙ Compute the Laplacian 𝐿 of the matrix 𝐴.

∙ Compute the smallest 2 to 𝑑 + 1 eigenvalues and eigenvectors of 𝐿 – these form 𝐷 and 𝑉 respectively.

∙ Compute 𝐺 = (𝐷 + 𝑐𝑉𝑇_{𝐶𝑉 ) and 𝑏 = 𝑐𝑉}𝑇_𝐶−→_{𝛾 and find smallest eigen vector}

𝜆* of the matrix, ⎡ ⎣ 𝐺 −𝐼 −1 𝑛 − → 𝑏 −→𝑏 𝑇 𝐺 ⎤ ⎦ ∙ Compute predictions as 𝑧* _{= 𝑉 (𝐺 − 𝜆}*_𝐼)−1−→

𝑏 . These 𝑧’s can be used for ranking.

∙ An appropriate threshold can be selected to get discretized class labels of +1 and −1.

3.1.2 Relevance feedback

The technique described above allows us to easily determine which articles are “posi-tive” and which articles are “nega“posi-tive” given minimum annotation (one or two exam-ples of both classes). A pseudo-relevance feedback algorithm can then make use of these predictions to improve the quality of search results produced.

However, if our classifier isn’t accurate enough, it is possible that we suffer from a problem known as query drift, where the queries obtained from moving queries based

(39)

<ARTIST> performing Live at <VENUE> So excited for <ARTIST> today at <VENUE>

<VENUE> is going down tonight with <ARTIST> performing!

Figure 3-2: Example of patterns extracted from tweets describing music events in New York

on classifier output are in fact worse than the initial queries we started off with, because of incorrect predictions made by the classifier.

In practice, we have observed that there is a fair amount of noise in our classifier output, leading to sub-optimal results. Making the classifier more robust remains an area of future work.

3.2 Query template selection

Instead of relying on a classifier’s performance on unseen data, we can try to improve the retrieval process by exclusively making conclusions from training data.

We use techniques borrowed from a system called Snowball [1] to detect query templates from text. These extracted query templates can be subsequently used to create queries that can be fed through a retrieval system to retrieve articles.

In this task, we want to extract generic phrases that are representative of the contexts records are usually found in unstructured text. For example, let’s say we were trying to extract contexts in which (artist, venue) tuples are found in a corpus of tweets referring to music events in New York. Some of the templates that we would like to extract from these tweets are shown in figure 3-2.

We start off with a list of “seed tuples”. We go through the corpus of documents, and extract all phrases that contain occurrences of all fields in the record or tuple.

These phrases can be converted to generic patterns – every specific occurrence of a record field is replaced by its field type. For example, “John Lennon performed in Madison Square Garden today" can be converted to “<ARTIST> performed in <VENUE> today".

(40)

“performed”, and any arbitrary word to precede “today” – the words in the sentence must be tagged with the tag ARTIST and VENUE respectively. We use a Named-Entity Recognizer (discussed more in the next chapter) to extract tags for every word in the corpus.

We can now use the patterns just extracted to obtain more tuples. These new tuples can subsequently be used to generate more patterns – the process continues until we extract enough records / patterns. The overall system can be understood better from Figure 3-2.

Figure 3-3: Design of snowball system

We use a vector space model to ensure that patterns similar to each other, but not completely identical, are not considered to be the same “pattern” by the system. Once we have a list of possible query templates that capture a number of the contexts in which the records are found in unstructured text, we can combine our templates with metadata about the incidents (in this case, foods and adulterants involved in the adulteration incident) to generate queries that can subsequently be

(41)

fed into a vanilla retrieval system to retrieve new articles that can be added to the dataset.

3.3 Filtering for adulteration-related articles using a

Language Model

Another approach to retrieving new adulteration-related articles is by filtering through a stream of Google News articles for adulteration-related articles. This approach is fundamentally different from the one we’ve discussed so far, where we’re looking for articles describing specific incidents.

There are a couple of different ways we can try filtering for adulteration-related articles. One way is to build a standard classifier that attempts to predict if the topic of the article is “food adulteration”.

Another way is to build a language model [7] that attempts to differentiate between the language used in news articles that talk about adulteration and the language used in news articles that don’t talk about adulteration.

Our hypothesis here is that articles that talk about adulteration use a lot more technical and domain-specific language than articles related to other topics like sports or business.

We use a standard bigram language model with Kneisser-Ney smoothing. One can easily extend these ideas to higher 𝑛-grams.

Smoothing is a technique used in Language Models to deal with data sparsity. On many occasions, we don’t see a 𝑛-gram in its entirety in our training set. To overcome this problem, with smoothing, we backoff to a simpler model.

For the sake of simplicity, we discuss Kneisser-Ney smoothing in the context of bigrams. In a standard backoff model, if we didn’t see a bigram in our training set, we backoff to using a standard unigram model. This works well sometimes; however, it often doesn’t work well since a unigram model completely neglects the context inherent in a sentence.

(42)

Instead, Kneisser-Ney attempts to capture how likely a word appears in an unseen word context, by counting the number of unique words that precede the given word.

Mathematically, we can write this as,

Pr𝐶𝑂𝑁 𝑇(𝑤𝑖) ∝ |{𝑤𝑖−1 : 𝑐(𝑤𝑖−1, 𝑤𝑖) > 0}|

After normalizing the above expression appropriately so that Pr𝐶𝑂𝑁 𝑇 sums to 1

for all 𝑤𝑖, we obtain the following expression for Pr𝐾𝑁,

Pr𝐾𝑁(𝑤𝑖|𝑤𝑖−1) =

max(𝑐(𝑤𝑖, 𝑤𝑖−1) − 𝛿, 0)

𝑐(𝑤𝑖−1)

+ 𝜆|{𝑤𝑖−1: 𝑐(𝑤𝑖−1, 𝑤𝑖) > 0}| |{𝑤𝑗−1 : 𝑐(𝑤𝑗−1, 𝑤𝑗) > 0}|

where 𝜆 is a normalizing constant.

3.3.1 Evaluation

We train our Language model on a subset of our corpus of adulteration-related articles (200 articles), and then evaluate performance on the remaining subset of adulteration-related articles along with non-adulteration related articles scraped from Google News.

We decide an average negative log likelihood to decide whether a news articles talks about adulteration or not. Using this approach, we obtain an F-measure of 0.61. We hypothesize that this approach will do a lot better with a more extensive training dataset.

(43)

Chapter 4 Article clustering and record

extraction

For the purposes of our dataset creation task, the mere retrieval of relevant news articles is not sufficient, simply because raw articles are unstructured and hence dif-ficult to understand for a computer program. An important goal of this project is to extract fields from the articles we manage to retrieve. In addition, it is useful to de-duplicate articles on the basis of the incident they are describing. Later in this section, we will describe an approach that attempts to accomplish both these tasks in a single, unified manner.

Before describing our combined approach, we describe some well-known ways of going about solving the individual problems described above – one, the problem of extracting records from unstructured text, and two, clustering texts according to their content.

4.1 Named Entity Recognition using Conditional

Ran-dom Fields

Conditional Random Fields are a well-established way of extracting records from text. CRFs are a class of machine learning algorithms that are used for structured

(44)

Figure 4-1: CRF model – here 𝑥 is the sequence of observed variables and 𝑠 is the sequence of hidden variables

prediction.

CRF aims at utilizing information inherently prevalent in a sequence to make better classification decisions. It is possible to perform a task such as Part-of-Speech tagging by trying to predict the part-of-speech tag of each word in a sentence in isolation. But this approach won’t work well, because there exists a number of words that function as different parts-of-speech in different contexts – for example, the word “address” can function as both a noun and an adjective, so without the context in which the word is used, it would be impossible to always predict the right tag associated with the word.

4.1.1 Mathematical formulation

Mathematically, we can think of this as a task where we try to predict the optimum label sequence 𝑙 given a sentence 𝑠. In other words, we want to find the label sequence 𝑙 that maximizes the conditional probability Pr(𝑙|𝑠).

Given a sentence 𝑠, we can score a labeling 𝑙 of 𝑠 using a scoring function,

score(𝑙|𝑠) = 𝑚 ∑︁ 𝑗=1 𝑛 ∑︁ 𝑖=1 𝜆𝑗𝑓𝑗(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) = 𝑛 ∑︁ 𝑖=1 ¯ 𝜆 · ¯𝑓 (𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1)

Observe from this formulation that a label depends on its preceding label, the word it corresponds to and the entire sentence the word belongs to.

Here, 𝜆𝑗 is used to weight the different 𝑓𝑗 differently.

(45)

Pr(𝑙|𝑠) = _∑︀exp[score(𝑙|𝑠)] 𝑙′exp[score(𝑙′|𝑠)] = exp[ ∑︀𝑚 𝑗=1 ∑︀𝑛 𝑖=1𝜆𝑗𝑓𝑗(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1)] ∑︀ 𝑙′exp[ ∑︀𝑚 𝑗=1 ∑︀𝑛 𝑖=1𝜆𝑗𝑓𝑗(𝑠, 𝑖, 𝑙 ′ 𝑖, 𝑙′𝑖−1)]

For a classic part-of-speech tagging problem, possible 𝑓𝑗’s that can be used are

∙ 𝑓1(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) = 1 if 𝑙𝑖 = VERB; 0 otherwise.

∙ 𝑓2(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) = −1 if 𝑙𝑖 = NOUN and 𝑙𝑖−1= ADJECTIVE; 0 otherwise.

∙ 𝑓3(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) = 1 if 𝑠[𝑖] = THE; 0 otherwise.

∙ 𝑓4(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) = tf(𝑠[𝑖]) · idf(𝑠[𝑖]), where tf(𝑤) is the number of times the word

𝑤 occurs in the sentence 𝑠, and idf(𝑤) is the inverse document frequency of the word 𝑤.

4.1.2 Training

A problem that we skirted in the previous section was how to set the feature weights 𝜆𝑗. It is clear that setting 𝜆𝑗 to arbitrary values will not work well, since we may be

giving importance to features that are in reality not that important.

It turns out that we can find optimal weight parameters using gradient ascent. Gradient ascent is an optimization algorithm that attempts to find the local max-imum of a multivariable function 𝐹 (𝑥), making use of the fact that the function 𝐹 (𝑥) at some point 𝑡 increases fastest in the direction of the gradient of 𝐹 (𝑥) at the point 𝑡.

Essentially, we start off with a randomly assigned initial weight vector, and then try to move towards more optimal weight vectors by moving in the direction of the gradient 𝜕 log Pr(𝑙|𝑠)

𝜕𝜆 .

For every feature 𝑗 we can compute the gradient of log Pr(𝑙|𝑠) with respect to the corresponding weight 𝜆𝑗, 𝜕 log Pr(𝑙|𝑠) 𝜕𝜆𝑗 = 𝑚 ∑︁ 𝑖=1 𝑓𝑗(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) − ∑︁ 𝑙′ Pr(𝑙′|𝑠) 𝑚 ∑︁ 𝑖=1 𝑓𝑗(𝑠, 𝑖, 𝑙′𝑖, 𝑙 ′ 𝑖−1)

(46)

However, note that the number of possible 𝑙′ is exponential in the length of the sentence, so this update looks intractable. However, observe that

𝜕 log Pr(𝑙|𝑠) 𝜕𝜆𝑗 = 𝑚 ∑︁ 𝑖=1 𝑓𝑗(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) − 𝑚 ∑︁ 𝑖=1 ∑︁ 𝑙′ Pr(𝑙′|𝑠)𝑓𝑗(𝑠, 𝑖, 𝑙𝑖′, 𝑙 ′ 𝑖−1) = 𝑚 ∑︁ 𝑖=1 𝑓𝑗(𝑠, 𝑖, 𝑙𝑖, 𝑙𝑖−1) − 𝑚 ∑︁ 𝑖=1 ∑︁ 𝑙′_𝑖−1,𝑙_𝑖′ Pr(𝑙′_𝑖−1, 𝑙_𝑖′|𝑠)𝑓𝑗(𝑠, 𝑖, 𝑙′𝑖, 𝑙 ′ 𝑖−1)

Here we marginalized Pr(𝑙′|𝑠) over all labels in the label sequence 𝑙′ _{that are not}

at indices 𝑖 − 1 and 𝑖. Computing the probability of all pairs of labels (𝑙′_𝑖−1, 𝑙_𝑖′) requires 𝑂(𝑘2) time where 𝑘 is the number of possible labels – this is clearly tractable.

We can now move 𝜆𝑗 in the direction of this gradient, to obtain a new weight

value 𝜆′_𝑗,

𝜆′_𝑗 = 𝜆𝑗+ 𝛼

𝜕 log Pr(𝑙|𝑠) 𝜕𝜆𝑗

Here, as stated before, we are moving the weight vector in the direction of max-imum change of the log-probability of a label sequence given a sentence. 𝛼 is the learning rate. The larger the value of 𝛼, the faster 𝜆 moves in the direction of the gradient; if the value of 𝛼 is very high however, the updates will be inaccurate, and 𝜆 will move far away from its true optimal value.

We make updates to ¯𝜆 until convergence, or for a fixed number of iterations.

4.1.3 Finding optimal label sequence

Given a trained CRF, we want to find the optimal label sequence of a given sentence 𝑠. One way of doing this is to compute Pr(𝑙′|𝑠) for every possible label sequence 𝑙′ _of

a sentence 𝑠 to figure out which label sequence 𝑙′ is most likely given a sentence 𝑠. However, for a sentence of length 𝑛, if the total number of labels for each word is 𝑘, then the total number of labeling sequences for the sentence 𝑠 is 𝑘𝑛_{, which is}

exponential in the length of the sentence.

(47)

algorithm [2] to cut the complexity down to 𝑂(𝑛𝑘2).

4.1.4 Entity recognition

We can utilize the above framework to recognize entities within an unstructured blob of text. We can tag all entities that interest us with their respective tags, and all words encountered in our training corpus that aren’t specific entities can be labeled with a generic tag – let us call this generic tag a ‘O’.

Now, when we run our trained model on our testing corpus, any contiguous se-quence of words tagged with the same tag is treated as a recognized entity of that particular tag.

4.1.5 Evaluation

We run Stanford’s NER code [5] on our dataset of adulteration-related news articles, trying to recognize foods and adulterants. We split our dataset of approximately 150 articles 70 − 30 into training and testing sets.

We obtain the following precision, accuracy and recall for the different fields.

Entity Precision Recall F-measure ADULTER 0.846 0.653 0.737

FOOD 0.998 0.973 0.985

Total 0.923 0.796 0.854

Table 4.1: Precision, recall and F1 of NER on training data

Entity Precision Recall F-measure ADULTER 0.522 0.224 0.313

FOOD 0.798 0.326 0.463

Total 0.667 0.279 0.393

Table 4.2: Precision, recall and F1 of NER on testing data

We see that on examples not yet seen, there is still a fair amount of noise in the accuracy of the fields extracted, especially for the adulterant field. As is expected,

(48)

performance on the training set is significantly better than performance on the testing set.

Slightly more surprising is the difference in performance of foods versus adulter-ants. This can be probably attributed to the fact that the names of adulterants are often extremely technical, which means the number of unseen adulterant tokens in the test dataset is a lot higher than the number of unseen food tokens.

(49)

4.2 Clustering

Our dataset consists of a number of articles that describe instances of adulteration. These articles are collected from a number of very different sources, and are often de-scribing the same adulteration incident, even though the exact content of the articles may not be completely identical.

Since we can have multiple duplicates for every incident, it makes sense to cast this as a clustering problem, where ideally, two articles are in the same cluster if they describe the same incident, and articles in different clusters describe different incidents.

4.2.1 Clustering using k-means

K-means is one of the more popular clustering algorithms. Like most clustering algorithms, it is completely unsupervised.

Mathematical formulation

Given 𝑛 points (𝑥1, 𝑥2, . . . , 𝑥𝑛) where each 𝑥𝑖 is a 𝑑-dimensional vector, we want to

assign each 𝑥𝑖 to a set 𝑆𝑗 where 1 ≤ 𝑗 ≤ 𝑘, such that 𝑆1 ∪ 𝑆2 ∪ . . . 𝑆𝑘−1 ∪ 𝑆𝑘 =

{𝑥1, 𝑥2, . . . , 𝑥𝑛} and each of the sets 𝑆𝑗 are disjoint (that is 𝑥𝑖 can belong to exactly

one cluster).

We can mathematically formulate this task as trying to find the set 𝑆 = {𝑆1, 𝑆2, . . . , 𝑆𝑘}

that minimizes, 𝑘 ∑︁ 𝑖=1 ∑︁ 𝑥∈𝑆𝑖 ||𝑥 − 𝜇𝑖||2

Here, we can think of 𝜇𝑖 as the representative element of the cluster 𝑖.

Minimizing the above expression with respect to 𝜇𝑖 shows us that 𝜇𝑖 must be equal

(50)

Description

Every iteration of k-means clustering involves two important steps,

1. Assignment step: Each point 𝑥1, 𝑥2, . . . , 𝑥𝑛 must be assigned to one of 𝑘

clusters. This is done by assigning each point to the cluster with nearest repre-sentative point. 𝑆_𝑖(𝑡) = {𝑥𝑝 : ||𝑥𝑝− 𝜇 (𝑡) 𝑖 || 2 _{≤ ||𝑥} 𝑝− 𝜇 (𝑡) 𝑗 || 2 _{∀𝑗, 1 ≤ 𝑗 ≤ 𝑘}}

2. Update step: Once every point is assigned to a cluster, we need to update the centroid of each cluster.

Each new centroid is given by the expression,

𝜇𝑡+1_𝑖 = 1 |𝑆_𝑖(𝑡)|

∑︁

𝑥𝑗∈𝑆𝑖(𝑡)

𝑥𝑗

The above two steps are repeated until convergence. Every point 𝑥𝑖 is initially

assigned to a random cluster.

4.2.2 Clustering using HCA

Hierarchical clustering is a method of clustering that tries to build a hierarchy of clusters. It circumvents the problem of initializing data points to clusters by starting every point as its own cluster of size 1, and then collapsing clusters that are similar to obtain larger clusters.

Formulation

As with k-means, the first step in this process involves vectorizing the news article. Once we have a vector representation of each news article, we can run HCA on the resulting vectors to obtain clusters, such that all vectors in a cluster are similar to each other.

(51)

(52)

In Hierarchical Agglomerative clustering, every article starts off as its own cluster – that is, if we start with 𝑛 articles, we initially start with 𝑛 different clusters.

Every iteration of the algorithm, we attempt to merge the two closest clusters. Note that we can have different inter-cluster similarity metrics. We describe them in greater detail below,

∙ Single-link clustering: The distance between two clusters is the distance between their most similar members. This inter-cluster similarity metric’s main disadvantage is that it disregards the fact that the furthest points in clusters that are going to be merged may in fact be really far apart.

𝑑𝐶𝐿𝑈 𝑆(𝐴, 𝐵) = min

𝑥∈𝐴,𝑦∈𝐵𝑑(𝑥, 𝑦)

∙ Complete-link clustering: The distance between two clusters is the distance between their most dissimilar members.

𝑑𝐶𝐿𝑈 𝑆(𝐴, 𝐵) = max

𝑥∈𝐴,𝑦∈𝐵𝑑(𝑥, 𝑦)

∙ Average-link clustering: The distance between two clusters is the average of the distances between all pairs of points in the two clusters. Mathematically,

𝑑𝐶𝐿𝑈 𝑆(𝐴, 𝐵) = 1 |𝐴| · |𝐵| ∑︁ 𝑥∈𝐴 ∑︁ 𝑦∈𝐵 𝑑(𝑥, 𝑦)

(53)

Pseudocode HAC(𝑥1, 𝑥2, . . . , 𝑥𝑛, 𝑡) 1 Initialize 𝐶 2 for 𝑖 = 1 to 𝑛 3 𝐶[𝑖] = 𝑥𝑖 4 _{maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐶)} 5 while maxClusterSimilarity < 𝑡 6 _{𝑖, 𝑗 = FindMostSimilarClusters(𝐶)} 7 _{MergeClusters(𝐶, 𝑖, 𝑗)} 8 _{maxClusterSimilarity = ComputeMaxClusterSimilarity(𝐶)}

Figure 4-3: Pseudocode of HCA algorithm, with termination inter-cluster similarity threshold

4.2.3 Evaluation

We run our two clustering algorithms on two datasets – a set of news articles scraped from Google news, and our dataset of adulteration-related articles.

We first present results obtained on the Google news corpus. We use tf-idf vector representations of the text, with and without keyword extraction. We chose 20 dis-tinct incidents scraped from the landing page of news.google.com, and then tried clustering the 76 obtained articles into 20 clusters.

Clustering algorithm F-measure with keywords F-measure w/o keywords

HCA (Complete-link) 0.83 0.69

HCA (Single-link) 0.67 0.25

HCA (Average-link) 0.87 0.63

K-means 0.84 0.77

Table 4.3: F-measure of clustering algorithms with and w/o keyword extraction on Google News articles

We next present results obtained on adulteration-related articles. We run our experiments on two subsets of the adulteration article dataset at our disposal – one with 52 articles and another with 116 articles. Both times, we tried to cluster the dataset of articles into 15 clusters.

(54)

Clustering algorithm F-measure with k/w extr. F-measure w/o k/w extr.

K-means 0.13 0.65

Table 4.4: F-measure of clustering algorithms with and w/o keyword extraction on set of 53 adulteration-related articles

Clustering algorithm F-measure with k/w extr. F-measure w/o k/w extr.

Table 4.5: F-measure of clustering algorithms with and w/o keyword extraction on set of 116 adulteration-related articles

Most of the clustering algorithms seem to perform well on the Google news cor-pus dataset – this may partially be due to the fact that a number of these articles come from very different domains (Sports, business, current affairs) so distinguishing between them using word frequencies actually works pretty well in practice.

Adulteration articles on the other hand are probably pretty similar on a word token level, which means performance on the adulteration dataset is far worse. More clever feature representations probably exist, but we hypothesize that the best possible performance of these algorithms is not much better than that presented in Table 4.4 and Table 4.5.

(55)

4.3 Combining entity recognition and clustering

Both approaches that we described above are not perfect and include a lot of noise in their results – sometimes, the NER algorithm extracts incorrect fields from the article and sometimes, the clustering algorithm inserts articles into the wrong cluster. With the knowledge that articles in the same cluster should have the same entities and articles with the same entities should belong to the same clusters, we claim that precision of both clustering and entity recognition can be improved [3].

Before, describing our approach in detail, we describe some tools needed to un-derstand the approach better.

4.3.1 Mathematical tools

Coordinate ascent

Coordinate ascent is a technique used to optimize a multivariable function 𝐹 (𝑥). We can maximize a multivariable function 𝐹 (𝑥) by maximizing along one dimension at a time.

Given a setting of variables 𝑥𝑘 _{at the end of some iteration 𝑘, we can find the}

optimal setting of variables 𝑥𝑘+1 at the end of iteration 𝑘 + 1 by finding 𝑥𝑘+1_𝑖 for each dimension 𝑖, by computing, 𝑥𝑘+1_𝑖 = arg max 𝑦∈ℛ 𝑓 (𝑥 𝑘+1 1 , 𝑥𝑘+12 , . . . , 𝑥𝑘+1𝑖−1, 𝑦, 𝑥 𝑘 𝑖+1. . . , 𝑥𝑘𝑛)

We then start off with some arbitrary initial vector 𝑥0, and subsequently compute 𝑥1_{, 𝑥}2_{, . . ..}

By doing line search, we get 𝐹 (𝑥0) ≤ 𝐹 (𝑥1) ≤ 𝐹 (𝑥2) ≤ . . .. Note that this is slightly different from gradient ascent where we attempt to modify all parameters in one go, instead of in a sequential manner.

Building and processing a dataset containing articles related to food adulteration

Building and processing a dataset containing articles

related to food adulteration

by

Deepak Narayanan

Submitted to the Department of Electrical Engineering and Computer

Science

in partial fulfillment of the requirements for the degree of

Masters of Engineering, Computer Science and Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2015

c

○ Massachusetts Institute of Technology 2015. All rights reserved.

Author . . . .

Department of Electrical Engineering and Computer Science

May 22, 2015

Certified by . . . .

Regina Barzilay

Professor

Thesis Supervisor

Accepted by . . . .

Albert Meyer

Chairman, Masters of Engineering Thesis Committee

Building and processing a dataset containing articles related

to food adulteration

by

Deepak Narayanan

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

Chapter 1

Introduction

Retrieval of relevant news articles

Curation of dataset

Use cases of this dataset

Chapter 2

Food and adulterant co-existence

2.1

SVM classifier

2.2

Collaborative filtering classifier

2.3

Passive-aggressive ranker

2.4

Evaluation

2.4.1

Collaborative filtering classifier

2.4.2

Passive-aggressive ranker

Chapter 3

Query refinement and relevance

feedback

3.1

Pseudo-relevance feedback with a transductive

classifier

3.1.1

Transductive learning

3.1.2

Relevance feedback

3.2

Query template selection

3.3

Filtering for adulteration-related articles using a

Language Model

3.3.1

Evaluation

Chapter 4

Article clustering and record

extraction

4.1

Named Entity Recognition using Conditional

Ran-dom Fields

4.1.1

Mathematical formulation

4.1.2

Training

4.1.3