Latent variable modelling of user interaction in image retrieval


Thesis


MORRISON, Donn Alexander

Abstract

This thesis studies latent variable models of user interaction with the goal of improving image retrieval. Search histories, called query logs, in which the interaction between users and the retrieval system is recorded, often contain indications of intent in the form of relevance judgements given for documents in the context of a search. Depending on the nature of the retrieval system and the interaction it allows, these judgements can be explicit or implicit and, once aggregated over a large number of searches performed by many users, can be exploited to improve various aspects of the retrieval system. This thesis proposes a model of query logs, the User Relevance Model, in which relevance judgements arise from a generative process whereby the user judges (either implicitly or explicitly) a document as relevant if it shares a degree of concept overlap with the query, and non-relevant otherwise.

MORRISON, Donn Alexander. Latent variable modelling of user interaction in image retrieval. Doctoral thesis: Univ. Genève, 2011, no. Sc. 4305

URN : urn:nbn:ch:unige-159470

DOI : 10.13097/archive-ouverte/unige:15947

Available at:

http://archive-ouverte.unige.ch/unige:15947


DEPARTMENT OF COMPUTER SCIENCE

DR. STÉPHANE MARCHAND-MAILLET

Latent variable modelling of user interaction in image retrieval

THESIS

PRESENTED TO THE FACULTY OF SCIENCES OF THE UNIVERSITY OF GENEVA TO OBTAIN THE DEGREE OF DOCTOR OF SCIENCES

SPECIALISATION IN COMPUTER SCIENCE

by

Donn Morrison of Comox (Canada)

THESIS No. 4305

Geneva 2011


Abstract

This thesis studies latent variable modelling of user interaction with the goal of improving image retrieval.

User interaction in information retrieval systems can take many forms, but this thesis focuses on interaction that can be viewed as relevance judgements. Query logs, where the interaction between users and the retrieval system is stored, often contain indications of user intent in the form of relevance judgements provided for documents in the context of a query. Depending on the nature of the retrieval system and the interaction it affords, these judgements can be explicit or implicit, and, when collected from many users performing many queries, can be exploited to improve various aspects of the retrieval system. This thesis proposes a query log model, called the User Relevance Model, where relevance judgements result from a generative process in which a user rates (implicitly or explicitly) a document as relevant if it contains a degree of concept overlap with the query, and non-relevant otherwise. The model explicitly accounts for noise, sparsity, and distributional constraints and forms the basis of a framework that can be used to simulate user interaction for the evaluation of long-term learning methods. By borrowing the bag-of-words assumption from text modelling for document-query relevance co-occurrences derived from query logs, the benefits of latent variable models in the long-term learning domain are demonstrated. These latent variable models provide a means of inferring the underlying concepts posited to exist in the documents and queries as specified by the User Relevance Model. A probabilistic inference model, named the Probabilistic User Relevance Model, is introduced to address limitations of existing inference models and is evaluated on both artificial and real-world query logs.


Résumé

This thesis studies latent variable models of user interaction with the goal of improving image retrieval. User interaction can take several forms, but this thesis concentrates on relevance judgements. Search histories, called query logs, in which the interaction between users and the retrieval system is recorded, often contain indications of intent in the form of relevance judgements given for documents in the context of a search. Depending on the nature of the retrieval system and the interaction it allows, these judgements can be explicit or implicit and, once aggregated over a large number of searches performed by many users, can be exploited to improve various aspects of the retrieval system. This thesis proposes a model of query logs, the User Relevance Model, in which relevance judgements arise from a generative process whereby the user judges (either implicitly or explicitly) a document as relevant if it shares a degree of concept overlap with the query, and non-relevant otherwise. The model explicitly accounts for noise, sparsity, and distributional constraints and provides a framework that can be used to simulate user interaction for the evaluation of long-term learning methods. Under the bag-of-words assumption from text modelling, the benefits of latent variable models in the long-term learning domain are demonstrated for the document-query co-occurrences derived from query logs. These latent variable models provide a way of inferring the underlying concepts present in the documents and in the search contexts. A probabilistic inference model, the Probabilistic User Relevance Model, is presented to address the limitations of existing inference models and is evaluated on both artificial and real-world query logs.


List of Figures

1.1 General layout of the thesis.
2.1 The classical IR interaction scenario.
2.2 An example query-by-example IR interface.
2.3 An example web IR image search interface.
2.4 An example document-query matrix.
2.5 Bipartite graph representation of the document-query matrix.
2.6 An example of the power law.
2.7 Power law observed in the 20 Newsgroups dataset.
2.8 Example message from the 20 Newsgroups corpus.
2.9 Example images with meta-data from the GIFT dataset.
2.10 Power law observed in the GIFT dataset.
2.11 Relevance judgements of a typical query from the GIFT dataset.
2.12 Example query from the Belga dataset.
2.13 Power law observed in the Belga dataset.
3.1 Example of alternate query suggestions from Google.
4.1 Salton’s vector space model.
4.2 LSA viewed as a factor decomposition.
4.3 Toy example demonstrating the benefits of LSA.
4.4 The graphical model of PLSA.
4.5 The graphical models for LDA.
4.6 An illustration of the topic simplex with 3 topics.
4.7 Model selection and average precision per category for the 20 Newsgroups data.
5.1 Effect of the power law scale parameter a on sparsity.
5.2 The power law observed in relevance data generated using the URM.
5.3 The graphical model for the probabilistic URM.
5.4 Latent factors uncovered from artificial data using PLSA.
5.5 Latent factors uncovered from artificial data using P-URM.
5.6 Experimental results on artificial relevance feedback data (1/2).
5.7 Experimental results on artificial relevance feedback data (2/2).
5.8 Illustration of SVM for sparsity reduction.
6.1 Model selection for document ranking on the Belga corpus.
6.2 Average precision per concept for the merged Belga data.
6.3 Model selection for document ranking on the Belga corpus.
6.4 Average precision per concept for the largest connected component of the Belga data.
6.5 Model selection for document ranking on the GIFT corpus.
6.6 Queries in the latent space for the GIFT dataset.
6.7 Model selection for query recommendation on the Belga corpus.
6.8 Topics from the merged Belga corpus.
6.9 Topics from the Belga corpus.
6.10 Topics from the GIFT dataset.


List of Tables

2.1 Modes of user interaction in information retrieval and collaborative filtering.
2.2 Datasets used in the experiments.
2.3 The categories of the 20 Newsgroups dataset.
2.4 Corel image categories used in the GIFT retrieval demonstration.
2.5 The 25 concepts manually annotated from the Belga corpus.
4.1 Analogous components of the latent variable models.
4.2 PLSA topics sorted by word proportions.
5.1 Fixed parameter values for artificially generated data.
6.1 Results on weighting implicit relevance judgements in the Belga corpus.
6.2 Query recommendation examples.
A.1 Results for model selection for the 20 Newsgroups dataset.
A.2 Results for model selection for artificial data.
A.3 Results showing the effect of increasing query sessions for artificial data.
A.4 Results showing the effect of increasing sparsity for artificial data.
A.5 Results showing the effect of increasing noise for artificial data.
A.6 Document ranking results for the unmerged Belga dataset.
A.7 Document ranking results for the merged Belga dataset.
A.8 Document ranking results for the largest connected component of the unmerged Belga dataset.
A.9 Document ranking results for the sparse GIFT dataset.
A.10 Document ranking results for the sparsity-reduced GIFT dataset.
A.11 Query recommendation results for the unmerged Belga dataset.
A.12 Query recommendation results for the merged Belga dataset.


Chapter 1

Introduction

1.1 Problem definition and motivation

The growth of the Internet and the technology explosion have contributed to a high demand for efficient methods of information filtering and retrieval. Information retrieval (IR) of multimedia documents such as text, image, video and audio is a research area with problems spanning many disciplines including psychology, mathematics and computer science. The central concern is two-fold. First, there exists the burden of “information overload,” a term coined in Alvin Toffler’s 1970 book Future Shock, which describes the problem of having too much information to process efficiently (Toffler, 1970; Frankowski et al., 2007). The scale of multimedia data on the Internet motivates the need for efficient filtering methods in order to help users find what they are looking for quickly and effortlessly.

The second concern is the inherent dichotomy between computational and human intelligence. In computer vision, the oft-cited term the semantic gap (Wenyin et al., 2001; Fournier and Cord, 2002; Kanade and Uchihashi, 2004; Cord and Gosselin, 2006) describes the lack of a principled and accurate mapping between the low-level information computers can understand and process and the high-level abstract concepts which humans can perceive, understand, and describe. A litany of research in the last decade has attempted to narrow the semantic gap by automatic means (Jeon et al., 2003; Gao et al., 2006; Tang et al., 2006; Monay and Gatica-Perez, 2007; Ismail et al., 2008). Lately, however, it has become obvious that fully automatic means are still many years away from being reliable enough to have a meaningful impact in wider application domains such as information retrieval on the web.

Instead, the focus has shifted to include semantic information gleaned from human users. The most well-known example is manual annotation or tagging of multimedia documents with meta-data or keywords. Many online web services such as Flickr,¹ YouTube,² and Vimeo,³ among others, rely on manual tagging to index content and encourage users to provide meaningful tags while or after content is added. However, given the large quantities of multimedia documents generated and indexed, it is often unrealistic to expect users to dedicate time and resources to manual annotation and indexing of document collections.

¹ http://www.flickr.com

² http://www.youtube.com

³ http://www.vimeo.com


A number of novel methods for eliciting semantic data from users have been proposed recently in the literature. For example, von Ahn and Dabbish (2004) created an incentive to label images on the Internet using a game-based approach which paired random users together to find agreement on semantic labels. Users are motivated to participate in the game through the use of a high score ranking for successful players. Other crowdsourcing games in the same vein include Peekaboom, which aims to aid object recognition by having randomly paired users guess objects incrementally uncovered by their opponent (von Ahn et al., 2006b), and Phetch, which aims to semantically describe images to improve web accessibility for blind users (von Ahn et al., 2006a). In the LabelMe project (Russell et al., 2008), the incentive for scientists to use a large-scale region-level labelled image database for computer vision research encourages participation in the labelling process itself. Users can elect to annotate image regions by drawing boundaries around objects and concepts and submitting relevant keywords for each region.

While explicit user involvement such as in these examples can potentially yield high-quality semantic data, the drawback is that the incentive must surpass the reluctance of users to participate in order to make the technique worthwhile. If, however, the interaction can be engineered in such a way as to make the process less explicit, users can provide the semantic data as a byproduct of another task. A successful incarnation that follows this strategy is the reCAPTCHA system of von Ahn et al. (2008), where optical character recognition (OCR) errors in digitised books are corrected by posing the misunderstood words as a human verification test used in spam prevention online. A user wishing to register for a web-based email account is presented with a test where two words are displayed, one of which is known, the other of which is unknown (i.e. it has previously confused the OCR software). By asking the user to enter both words, the unknown word can be reliably corrected. A similarly motivated approach is TagCaptcha (Morrison et al., 2009b), which aims to incrementally annotate an image database using a test composed of two annotated (known) images and one unannotated (unknown) image. By describing the concepts contained in all images in order to pass the test, the user unwittingly annotates the latter image.

In information retrieval, the action of clicking on a document or providing relevance feedback during a search for information, while imposing little cognitive load on the user, signifies a preference for that document over other documents in the result set (Joachims, 2002b). This preference can be seen as a co-occurrence between the query and the selected document. When these co-occurrences are collected from many users performing many queries and stored in search engine transaction logs, they can be exploited to improve various aspects of the IR system on which they are based. This thesis advocates a generative view of relevance whereby relevance judgements for selected documents are generated by mixtures of latent topics underlying the queries. Given this generative nature and the underlying causes (the topics), a class of mixture models termed latent variable models (LVMs) provides a convenient means of topic inference.
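The query-document co-occurrences described above can be tallied directly from a click log. The following is a minimal sketch rather than the pipeline used in this thesis: the log format, the identifiers, and the function names `cooccurrence_counts` and `to_matrix` are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical click log: (query text, clicked document id) pairs.
click_log = [
    ("eiffel tower", "img_12"),
    ("eiffel tower", "img_40"),
    ("paris landmarks", "img_12"),
    ("eiffel tower", "img_12"),
]


def cooccurrence_counts(log):
    """Count how often each (document, query) pair appears in the log."""
    return Counter((doc, query) for query, doc in log)


def to_matrix(counts):
    """Arrange the counts as a dense document-by-query matrix (list of rows)."""
    docs = sorted({doc for doc, _ in counts})
    queries = sorted({query for _, query in counts})
    matrix = [[counts.get((d, q), 0) for q in queries] for d in docs]
    return docs, queries, matrix


counts = cooccurrence_counts(click_log)
docs, queries, matrix = to_matrix(counts)
```

Each row of `matrix` corresponds to a document and each column to a query, so a real log with many documents and queries would yield a matrix that is mostly zeros; a sparse representation would be the more practical choice at that scale.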

Latent variable models applied in the IR domain are not new. For example, Deerwester et al. (1990) developed latent semantic analysis (LSA) for ad-hoc text retrieval to alleviate the problem of polysemy and synonymy in IR, and subsequently proposed models such as probabilistic LSA (PLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al., 2003) have also been adapted for the same purpose. These models are not only applicable to word co-occurrences for text, but also feature co-occurrences for images and multimedia (Lee and Seung, 1999; Monay and Gatica-Perez, 2003, 2007) and binary presence co-occurrences in other domains (Bingham et al., 2009). LVMs have also seen limited yet promising use on query log data (Heisterkamp, 2002; Lin et al., 2005); however, the capabilities of these models have not been fully explored on implicit and explicit feedback, and more specifically, in the image retrieval domain.
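For orientation, the two best-known of these models can be summarised in their standard textbook forms. The symbols below are assumptions made for this sketch (X for the co-occurrence matrix, k for the number of latent dimensions, d, w and z for document, word and latent topic) and need not match the notation used in later chapters.

```latex
% LSA: rank-k truncated singular value decomposition of the co-occurrence matrix X
X \approx U_k \Sigma_k V_k^{\top}

% PLSA: each observed co-occurrence (d, w) is generated through a latent topic z
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z)
```

In both cases the latent dimension (k, or the number of values z can take) is much smaller than the number of documents or words, which is what allows similarities to be computed in a compact latent space.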

1.2 Thesis statement and contributions

It is in this realm that I position my contribution. My aim is to improve image retrieval by examining and exploiting the underlying structure and nature of search transaction logs. This aim is realised by modelling the processes that govern decisions and the corresponding actions taken by users during the query. Furthermore, the methods explored in this thesis are designed to be content-ignorant. In other words, the sole source of information used in the learning process is the interaction data provided by users performing queries in a typical retrieval system. The only divergence from this is where sparsity reduction is employed in an attempt to show that further performance improvements can be realised.

This thesis examines the following research hypotheses (denoted RH1-RH5):

1. Similarities calculated in latent spaces are more accurate than those calculated in high dimensional spaces (Chapters 4, 5 and 6);

2. Negative relevance judgements are beneficial to long-term learning (Chapter 5);

3. Different implicit user actions in IR, when placed on an ordinal (Likert) scale, improve document rankings (Chapters 5 and 6);

4. Merging identical queries based on query text improves document rankings (Chapter 5); and

5. Sparsity reduction directly improves document rankings that stem from both high dimensional and latent spaces (Chapters 5 and 6).

In addition to answering the above hypotheses, this thesis brings the following contributions to the domain of query log mining:

1. A User Relevance Model (URM) that assumes that observed relevance judgements stem from noisy linear combinations of underlying factors or concepts (Morrison et al., 2009a);

2. A probabilistic user relevance model that provides a generative view of relevance feedback and a principled statistical handling of negative relevance feedback; and

3. A framework for generating artificial query log data for IR evaluation.
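To make the idea of generating artificial relevance judgements concrete, here is a deliberately simplified sketch. It is not the URM itself: it assumes, purely for illustration, that concepts are plain sets, that any shared concept makes a document relevant, and that noise is a fixed probability of flipping a judgement. The function name `simulate_judgements` and its parameters are hypothetical.

```python
import random


def simulate_judgements(doc_concepts, query_concepts, noise=0.05, seed=0):
    """Return +1 (relevant) or -1 (non-relevant) per document for one query.

    Simplifying assumption: a document is judged relevant when it shares at
    least one concept with the query; each judgement is then flipped with
    probability `noise` to mimic unreliable users.
    """
    rng = random.Random(seed)
    judgements = {}
    for doc_id, concepts in doc_concepts.items():
        relevant = bool(concepts & query_concepts)
        if rng.random() < noise:
            relevant = not relevant
        judgements[doc_id] = 1 if relevant else -1
    return judgements


docs = {"d1": {"beach", "sunset"}, "d2": {"city"}, "d3": {"sunset", "city"}}
clean = simulate_judgements(docs, {"sunset"}, noise=0.0)
```

With `noise=0.0` the output reflects concept overlap only, which gives a known ground truth against which an inference model can be scored; raising `noise` then shows how robust that model is.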

1.3 Layout of the thesis

This thesis is organised as follows and is presented as a flow diagram in Figure 1.1. The background serving as the foundation for the contributions outlined above is contained in Chapters 2, 3, and 4. Chapter 2 discusses the nature of user interaction in information retrieval. More specifically, the chapter details user interaction stemming from implicit and explicit sources and how these modes of interaction relate to the notion of relevance in IR. The chapter goes on to formalise different interpretations and scales of relevance used in previous work, and furthermore introduces the fundamental representations and pre-processing procedures of IR system query logs. These aspects represent the first steps of the greater goal of improving information retrieval by exploiting historical queries contained in such logs. The chapter then illustrates two problematic issues of implicit and explicit relevance provided by users: sparsity, the problem stemming from the limited view of the document collection; and noise, which has causes including user disagreement and subjectivity and spurious and exploratory clicks or interactions. Finally, the chapter introduces the datasets that are used throughout the remainder of the thesis.

Figure 1.1: The general layout of the thesis. (Flow diagram; node labels: users, IR system, real-world interaction, URM simulation, query logs, topic inference with LVMs and P-URM, latent space (Φ, Θ), indexing, document ranking, query recommendation.)

Chapter 3 examines ways in which both classical and web IR systems can benefit from query log mining by reviewing related work. The literature review is divided into two sections. The first section examines research on implicit and explicit interaction and its application to improving IR while the second investigates recent research on user modelling and simulation in IR with an emphasis on query logs.

In Chapter 4, several latent variable models (LVM) are introduced. Specifically, I review the evolution in text modelling from Salton’s early vector space model to latent semantic analysis (LSA) by Deerwester et al. (1990), probabilistic LSA (PLSA) by Hofmann (1999), non-negative matrix factorisation (NMF), and latent Dirichlet allocation (LDA) by Blei et al. (2003). The chapter attempts to formalise these models under a common representation that will be extended and used in the experiments contained in the second portion of the thesis. Common LVM evaluation strategies are then reviewed followed by experiments on a popular text collection showing the efficacy of calculating document similarities in the latent space.

Chapter 5 focuses on the main contribution, the User Relevance Model (URM). The model is built on several assumptions that govern the generation of relevance judgements, either implicit or explicit, observed in query logs. The URM provides a flexible framework from which artificial query log data can be generated, enabling principled and controlled benchmarking of various models. The chapter then introduces the Probabilistic User Relevance Model (P-URM), an extension of Hofmann’s PLSA model where both positive and negative relevance ratings are treated in a generative, probabilistic fashion.

Experimental results are presented showing the efficacy of the proposed method over the LVMs introduced in Chapter 4. The chapter also introduces two methods for alleviating the problem of sparsity in relevance judgements contained in query logs.


Chapter 6 comprises experiments highlighting how LVMs can be applied to improve some of the IR tasks introduced in Chapter 3. More precisely, I present experiments detailing the utility of adapting the latent spaces inferred by LVMs for use on document ranking and query suggestion. In addition to the quantitative results, I show visual samples of the topics inferred by the various models.

Finally, I conclude the thesis in Chapter 7 and summarise future extensions to this research.


Chapter 2

User interaction in information retrieval

2.1 Introduction

In this chapter, user interaction as it is known in IR is formalised and the contexts in which it can be captured are introduced. A number of different scales and gradients have been proposed to record a relevance judgement from both users of IR systems and assessors creating ground truths for evaluation of IR systems. In addition to having implications for the granularity of the data captured, and imposing varying levels of cognitive load on users, different scales also have different underlying assumptions that have been debated since the early years of IR. I introduce data structures for storing and manipulating the data, as well as pre-processing steps that are typically undertaken before any data mining is performed. This chapter also introduces the datasets acquired and used throughout the remainder of this thesis.

2.2 The nature of user interaction

User interaction in its most basic interpretation as the “exchange of information between users and a system” (Saracevic, 1997; Jansen, 2006) has been studied extensively in previous research (Efthimiadis and Robertson, 1989; Bates, 1990; Belkin et al., 1995; Hancock-Beaulieu et al., 2000; Saracevic, 1997; Jansen, 2006). In the context of IR, user interaction encompasses the initial query, the results returned by the retrieval system, and subsequent query refinement and any clicking, viewing, or selecting of the search results that is communicated back to the retrieval system (Jansen, 2006).

User interaction can be implicit or explicit in nature and is distinguished principally by whether the user has knowledge that the interaction will be used by the system (Kelly and Teevan, 2003), the existence of a rating scale (whether binary or graded), and whether or not the data extracted from the interaction is inferred from the user’s behaviour. Table 2.1 lists the modes and attributes of user interaction and gives examples of each.


Table 2.1: Modes of user interaction in information retrieval and collaborative filtering

Mode: Implicit
Characteristics: Inferred from user behaviour; user not necessarily informed that behaviour is used as relevance feedback (RF)
Examples: Clickthrough in IR (web, text, image, video); purchase history (Amazon, etc.); browsing, viewing, printing, selecting

Mode: Explicit
Characteristics: Obtained from assessors of relevance; defined as explicit only when users know it will be interpreted as relevance judgements; binary or graded
Examples: Profile creation to alleviate cold start; relevance judgements in IR evaluation; film ratings for recommendation and other forms of collaborative filtering

2.2.1 Implicit user interaction

Implicit user interaction is distinguished by an absence of cognitive effort by users to make rating or preference information known to the system (Kelly and Teevan, 2003). In other words, the user is not necessarily aware that the interaction will be used by the system and thus the resulting data is considered to be inferred from user behaviour. Examples of implicit interaction in IR are clicking, selecting, viewing, printing, and watching (video); in collaborative filtering, examples are purchase choices and browsing patterns. The interaction is often a byproduct of some other sequence of actions or choices made by the user. In the example of purchase choice, implicit data would be items purchased by a user over a period of time.

In web IR, a well-known form of implicit interaction is clickthrough or click data. At its simplest the data comprises clicks on search results following the execution of a query. These clicks are widely seen as carrying some degree of relevance information, often referred to as “weak” relevance judgements (Smith and Ashman, 2009; Craswell and Szummer, 2007). Furthermore, some research has provided evidence that implicit and explicit behaviour can be substituted for one another (Joachims, 2003; White et al., 2001), although implicit feedback has been shown to be less reliable (Joachims, 2003).

The abundance of implicit interaction, especially in IR systems and collaborative filtering (CF) contexts, as well as the convenient lack of cognitive overhead from the user’s perspective, gives the processing of implicit interaction a high standing among techniques aimed at improving IR and recommendation systems. Commercial search engines see millions of queries per day, and by extension collect implicit interaction data surrounding those queries.

The nature of implicit data carries with it privacy implications; many users are uneasy about their behaviour being recorded and subsequently used without their explicit consent, even if it is beneficial. Worse yet is the fear that their search histories may be published and de-anonymised, as was the case with the bungled AOL data release to researchers interested in mining search engine query logs (Hafner, 2006). The effect is that real-world query logs from search engines are notoriously difficult for researchers to acquire.

2.2.2 Explicit user interaction

Explicit interaction is differentiated from implicit interaction by the characteristic that a user is consciously making preference or rating information known to the system. While the films that a person watches may be implicitly chosen, the action of recording those choices constitutes explicit interaction.

Explicit interaction, while more cognitively taxing on the user, is generally accepted to be a more accurate assessment of user intent and choice due to the fact that the users have elected to provide this information. Because the purpose is normally to improve some sort of recommendation (in the case of collaborative filtering), the user makes this information available with full knowledge and the intention of improving recommendations from the system.

In the form of relevance judgements, explicit interaction is often used as the evaluation ground truth for retrieval systems (White et al., 2001; Joachims et al., 2007a). Assessors are presented with example queries paired with a set of results. The assessors create the evaluation ground truth by marking documents relevant or non-relevant (according to a predefined scale; see Section 2.4). In this scenario, the cost of creating a ground truth for evaluation of a retrieval system is very high, e.g. the manual assessments required for the annual Text REtrieval Conferences (TREC).
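As an illustration of how such a binary ground truth is typically consumed, the sketch below computes average precision for one ranked result list. Average precision is a standard IR measure; the function name `average_precision` and the document identifiers are assumptions made for this example, and the code is not taken from the experiments in this thesis.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision of one ranked list against a binary ground truth."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            # Precision at this rank, accumulated only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


# Hypothetical ranking: relevant documents appear at ranks 1 and 4.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})
```

Here the relevant documents are found at ranks 1 and 4, so the precisions 1/1 and 2/4 are averaged. Every relevant document missing from the assessments lowers the score, which is one reason assessor coverage of the collection matters.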

Examples of explicit user interaction include profile creation, where a new user indicates preferences manually from a list (alleviating the cold start problem in CF), relevance judgements in IR evaluation and film ratings for recommendation and other forms of CF.

A distinction must be made between explicit relevance judgements elicited for assessment purposes and relevance judgements gleaned from retrieval systems where users are actively seeking to satisfy a legitimate information need. In the former, the judgements provided must cover a significant portion of the dataset in order to be of use for evaluation. The latter kind of explicit judgements will only involve documents which were deemed useful for improving the query, and thus will not have the same level of coverage. For the explicit judgements used in this work, I consider (and simulate) only those elicited from users interacting with retrieval systems. Many studies use assessments for evaluation, however (White et al., 2001; Macdonald and Ounis, 2009).

2.2.3 Correlations between explicit and implicit interaction

Recent research has shown that implicit relevance judgements are correlated with explicit relevance judgements (Joachims, 2003). The idea is then to use these implicitly collected examples as a replacement for, or supplement to, the ground truth, reducing the cost of evaluation. Joachims et al. (2005) showed that differences between implicit and explicit relevance judgements were smaller than originally thought (see also Joachims et al., 2007a). This has paved the way for research that treats implicit relevance judgements as training data for various machine learning-based improvements to IR (Tsikrika et al., 2009; Joachims, 2003; Macdonald and Ounis, 2009; Joachims et al., 2007b) (see Chapter 3).

2.3 User interaction in information retrieval

User interaction in IR is inherently different from that in collaborative filtering due to the nature of the information exchange. Users purchasing or rating items for recommender systems provide implicit or explicit interaction without the same immediacy as in IR. Users explicitly rating films usually do so in a long-term sense, i.e., to build a preference profile in order to have films they may enjoy recommended to them. In IR, the information need is an immediate goal, meaning users click on results (implicit) or refine the query by providing relevance feedback (explicit).

2.3.1 Queries, sessions, and the information need

The information need, a term first coined in the context of asking questions by Taylor (1962), defines the underlying motivation of an observed quest for knowledge. For example, prior to using a commercial web search engine to find a particular recipe, a user will have first formed an information need (e.g. “how many eggs are required for a crème brûlée?”). The precise nature of the information need is never explicitly defined; research generally assumes the notion exists within the consciousness of the user and is expressed through a query submitted to a retrieval system. In this work, I am only interested in the observed expression of the information need and the associated implicit or explicit relevance judgements.

Figure 2.1: A typical classical IR interaction scenario showing the initial information need, the specification of the need as a query to the retrieval system, evaluation of the results, and the query reformulation loop (from Baeza-Yates and Berthier (1999), p. 263).

The expression of the information need is the query submitted to the IR system by the user (Crestani et al., 1998). A query may be specified in many ways, including query-by-example (QBE), typically seen in image retrieval (Squire et al., 2000), or by keywords (regardless of whether the material sought is text or multimedia), where the query is specified as a string of text or keywords which the user deems suitably descriptive of the information need.

Jansen (2006) defines a session as comprising one or more queries, each of which may not necessarily reflect the same search topic. In the context of interaction, a session simply comprises a set of queries submitted consecutively by a unique user. The duration of the search session is the time the user spends interacting with the search engine (from the first query submitted until the user stops interacting) and is generally short; Jansen (2006) remarks that this value is typically between five and 120 minutes.

Conflicting nomenclature in the literature adds to the ambiguity around the definitions of queries and sessions. For example, Baeza-Yates et al. (2005) and Dupret and Mendoza (2006) define a query session as consisting of the query text and any subsequent documents clicked as a result of the query:

$$\mathit{QuerySession} := (\mathit{query}, (\mathit{clickedURL})^{*}). \tag{2.1}$$

In Dupret et al. (2006), query sessions are defined as repeated queries (queries sharing identical query terms) submitted by different users at different times.

In this thesis, I maintain the notation described by Jansen (2006), where a session is defined as a set of queries submitted by a user during the quest for one or more information needs. Hence, the query itself corresponds to the query session of Eq. (2.1) defined by Baeza-Yates et al. (2005) and is coupled with relevance judgements, explicit or implicit, specified for a set of documents in the collection. More formally, for a keyword-based image retrieval system:

$$\mathit{Query}_{kw} := (\mathit{queryText}, (\mathit{clickedImage})^{*}), \tag{2.2}$$

and for a QBE image retrieval system:

$$\mathit{Query}_{qbe} := (\mathit{queryText}^{*}, (\mathit{relevantImage})^{*}, (\mathit{nonrelevantImage})^{*}). \tag{2.3}$$

In the latter QBE scenario, relevance feedback can be viewed as an ad-hoc query construction process: the resultant query is the product of one or more iterations of relevance feedback, and so the recorded user interaction is intimately related to whether or not the user found the documents that were sought. In this case, the existence of queryText depends on whether the system allows queries to be specified with keywords, or if an example image must be provided to start the query.
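The query records of Eqs. (2.2) and (2.3) can be pictured as simple data structures. The following is only an illustrative sketch: the class and field names (`KeywordQuery`, `QBEQuery`, `Session`, `judgements`) are assumptions made for this example and do not come from any system described in this thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KeywordQuery:
    """Eq. (2.2): query text plus the images clicked in its result list."""
    query_text: str
    clicked_images: List[str] = field(default_factory=list)


@dataclass
class QBEQuery:
    """Eq. (2.3): optional query text plus explicit positive/negative examples."""
    query_text: Optional[str] = None  # absent if the system is example-only
    relevant_images: List[str] = field(default_factory=list)
    nonrelevant_images: List[str] = field(default_factory=list)

    def judgements(self) -> Dict[str, int]:
        """Return {image_id: +1 or -1} for the explicit feedback given."""
        out = {img: 1 for img in self.relevant_images}
        out.update({img: -1 for img in self.nonrelevant_images})
        return out


@dataclass
class Session:
    """A session in the sense of Jansen (2006): queries submitted by one user."""
    user_id: str
    queries: list = field(default_factory=list)


# Hypothetical usage: one QBE query with one positive and one negative example.
q = QBEQuery(relevant_images=["img_17"], nonrelevant_images=["img_03"])
session = Session(user_id="u1", queries=[q])
```

Keeping the session as a plain container of queries mirrors the convention adopted above, in which the relevance judgements are attached to the query rather than to the session.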

Figure 2.1 shows a classical IR scenario beginning with the information need and including a reformulation loop where the user is afforded the chance to refine the query by some means such as relevance feedback or query reformulation (i.e. manually adding or removing query terms and resubmitting the query).

2.3.2 User models underlying search

Different retrieval domains often cater to different user models, and these different models have implications for the interaction data obtained from the system. In addition to the three retrieval needs (informational, navigational, transactional) commonly seen in web IR and discussed in Chapter 3 (Broder, 2002), there exists a wide range of user models restricted to much narrower search domains. These models can be broadly classed into two groups: precision- and recall-motivated (Keskustalo et al., 2008).

Precision-motivated search is often seen in areas where the searcher does not have time to exhaustively inspect each document in the ranking, but rather requires the most relevant documents to be ranked first in order to save time. Keskustalo et al. (2008) give an example of a family practice physician who does not have time during a patient visit to scan through long lists of results. Often, precision-motivated searchers know that a certain document exists in the collection, and wish to retrieve only that document. This can be likened to the navigational need defined by Broder (2002), also known as the known item search task.

Recall-motivated search is found in areas such as patent and law case retrieval (Magdy and Jones, 2010) among others. In these domains users need to examine longer lists of results in order to find or rule out similar patents or to determine precedent in law. User interaction in a setting where high recall is preferred may be less sparse as the user inspects more documents by clicking on results.

2.3.3 Search interfaces

The search interface, where the user interacts with the retrieval system, largely dictates what interaction may be captured between the user and the system. Classical IR, with its experimental nature, may allow much more interaction to be captured than, for example, web IR, where simplicity and efficiency are important factors.

The nature of search interaction has evolved (with the shift from classical IR to web IR) from explicit to implicit relevance judgements, due in part to the fact that traditional explicit relevance feedback was under-utilised and awkward in the web IR setting (Bernard et al., 1999).

Figure 2.2 shows an example image retrieval interface from the GNU Image Finding Tool (GIFT) (Squire et al., 2000). Figure 2.3 shows the limited form of relevance feedback typical in the domain of web IR. Such simplistic forms of relevance feedback are common in web IR because most users are neither familiar with the concept of relevance feedback nor how the reformulation involved affects the results (Baeza-Yates and Berthier, 1999). In the explicit sense, a user can use the “Find similar images” link to find similar images.

However, in the implicit sense, the query logs of the IR system may be examined to determine which images were clicked by the user in the context of the query (Baeza-Yates and Berthier, 1999).

2.4 Scales of relevance

In many classical IR systems, particularly query-by-example in image retrieval, a user will have the opportunity to specify from the result set a subset of documents that are relevant or irrelevant to the current query. The earliest and still the most pervasive (Ruthven and Lalmas, 2003) iterative relevance feedback algorithm, proposed by Rocchio (1971), allows the reformulation of a query based on positive and negative examples from the result set:

Figure 2.2: An example image retrieval interface following the query-by-example (QBE) paradigm. The query image appears at top-left with a red text caption. A user may iteratively provide relevant and/or non-relevant examples from the result list, or implicitly specify neutral relevance by leaving the judgement at its default value (pictured). The interface is a web-based client for the GNU Image Finding Tool (GIFT) developed by the Viper group at the University of Geneva, Switzerland (Squire et al., 2000).

Figure 2.3: An example Google Images search showing the typical form of relevance feedback in web IR. A user may click on the “Find similar images” link (highlighted in red) in order to specify relevance feedback limited to only one instance of positive document relevance, but may do the same on the following list of results, allowing a traversal of similar documents. Relevance feedback in web IR is usually implemented in such a rudimentary form due to the fact that typical users are not familiar with the concept and more useful alternatives exist, such as query modification (by the user adding or deleting query terms) and query suggestion.

$$Q_1 = Q_0 + \beta \frac{1}{|R|} \sum_{i=1}^{|R|} R_i - \gamma \frac{1}{|\bar{R}|} \sum_{i=1}^{|\bar{R}|} \bar{R}_i, \tag{2.4}$$

where $R$ and $\bar{R}$ represent the vectors of positively and negatively marked example documents, respectively, $Q_0$ is the original query and $\beta$, $\gamma$ weight the positively and negatively marked example documents, respectively.
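As a minimal sketch of Eq. (2.4), the update can be written with plain Python lists standing in for document and query vectors. The function names and the toy two-dimensional vectors are assumptions made for illustration; the default weights are the values β = 0.75 and γ = 0.25 attributed to Salton and Buckley (1990) in Section 2.4.3.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def rocchio(q0, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """One iteration of Eq. (2.4): move the query towards the mean of the
    positive examples and away from the mean of the negative examples."""
    dim = len(q0)
    pos = mean_vector(relevant, dim)
    neg = mean_vector(nonrelevant, dim)
    return [q0[k] + beta * pos[k] - gamma * neg[k] for k in range(dim)]


# Toy example (hypothetical feature vectors).
q1 = rocchio(
    [1.0, 0.0],
    relevant=[[1.0, 1.0], [0.0, 1.0]],
    nonrelevant=[[0.0, 2.0]],
)
print(q1)  # [1.375, 0.25]
```

In this toy case the positive mean is (0.5, 1.0) and the negative mean is (0.0, 2.0), so the reformulated query is (1 + 0.375, 0.75 − 0.5).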

2.4.1 Binary versus graded ratings

The binary nature of Rocchio’s algorithm reflects the dichotomy of relevance in early IR literature, although some early work was published proposing graded relevance scales (Cooper, 1971). Positive and negative relevance judgements easily fit standard evaluation measures such as precision and recall, where documents are assumed to be either relevant or non-relevant with respect to a query. Furthermore, many recent machine learning (ML) approaches to learning from RF pose the problem as a two-class classification problem with positive and negative examples¹ (Bruno et al., 2008).

Formally, a binary relevance judgement between a document $d_i$ and a query $q_j$, denoted $r_{ij}$, is represented as:

$$r_{ij} = \begin{cases} 1 & \text{where document } d_i \text{ is relevant to query } q_j \\ -1 & \text{where document } d_i \text{ is not relevant to query } q_j. \end{cases}$$

The notation $r_{ij}$ and $\bar{r}_{ij}$ is often used (respectively) for the above possible values on the binary scale. It should be noted that while the theory of relevance in IR does not allow for missing values (i.e. a document is either relevant or it is not), missing values frequently occur in ratings given by users and assessors in the context of user interaction and relevance assessments. I describe missing data in detail in Section 2.5.3.

One criticism that is often levelled at the traditional binary relevance scale is that it assumes an unrealistic relevant/non-relevant dichotomy in the document collection with respect to the query (Kekäläinen, 2005; Järvelin, 2009). Researchers argue that the binary scale encourages liberal judgements. That is, there is no difference between a marginally relevant document and a highly relevant document on the binary scale. Realistically, the relevance of documents is more naturally represented by different grades of relevance (Ruthven and Lalmas, 2003; Järvelin, 2009). In the early TREC evaluations, only binary judgements were considered (Voorhees, 1998), but tracks added later, such as the Web IR and relevance feedback tracks, moved to a three-point scale: “irrelevant”, “partially relevant”, “relevant” (Sormunen, 2002).

Other variations of graded relevance scales have been proposed in the literature, such as that by Rui et al. (1998): a 5-valued relevance scale where a relevance score $S_l \in \{+3, +1, 0, -1, -3\}$ corresponds to “highly relevant”, “relevant”, “no opinion”, “non-relevant”, and “highly non-relevant”, respectively. The authors postulate that while more levels of relevance would lead to more accurate feedback, it becomes more of a burden to the user, and that they “find 5 levels is a good trade-off between convenience and accuracy”.

¹ Support vector machines (SVM), which aim to draw an optimal margined hyperplane between classes, are an example of 2-class classifiers.

For the INitiative for the Evaluation of XML Retrieval (INEX) tasks, a four-point relevance scale is employed: “none”, “marginally”, “fairly”, and “highly” (Lalmas and Tombros, 2007). Kekäläinen (2005) compared the effects of the binary scale and the same four-point scale on rankings in IR systems, and found that the correlation between rankings from binary and graded scales decreases when fairly and highly relevant documents are weighted higher. Based on a reassessment of TREC-7 and TREC-8 relevance judgements using the same graded scale, Sormunen (2002) found that only a small percentage of documents (16%) are highly relevant, while about 50% of the relevant documents are only marginally relevant, supporting the argument in favour of graded relevance judgements. In a later study, Vakkari and Sormunen (2004) reached similar conclusions. Among their recommendations is a suggestion to the IR community to think about “the consequences of using liberal relevance criterion in IR evaluation”. If liberal assessments are assumed in conjunction with the binary scale, it is likely that the majority of relevant documents are actually only marginally relevant.

As Kekäläinen (2005) shows, ranking measures such as cumulative gain (CG) (Järvelin and Kekäläinen, 2000) and its variants (discounted (DCG) and normalised discounted cumulative gain (NDCG)), which put greater emphasis on highly relevant documents, support the notion of graded relevance scales, but the authors remark that evaluators may find it difficult to choose a successful weighting scheme. Cumulative gain is analogous to precision when used with binary relevance judgements.
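To make the relationship between these measures concrete, the sketch below computes CG, DCG and NDCG for a ranked list of graded gains. It assumes the widely used log2(rank + 1) discount, which is one common variant and not necessarily the exact formulation of Järvelin and Kekäläinen (2000); the gain values in the example are hypothetical.

```python
import math


def cumulative_gain(gains):
    """CG: plain sum of the gains of the ranked documents."""
    return float(sum(gains))


def dcg(gains):
    """DCG with a log2(rank + 1) discount (an assumed, commonly used variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


graded = [3, 2, 0, 1]   # hypothetical graded judgements, in ranked order
binary = [1, 0, 1, 1]   # the same idea on a binary scale

# With binary gains, CG divided by the cut-off is precision at that cut-off.
precision_at_4 = cumulative_gain(binary) / len(binary)
```

The last line illustrates the analogy noted above: on a binary scale, cumulative gain at a cut-off is simply the number of relevant documents retrieved, so dividing by the cut-off yields precision.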

Documents marked weakly relevant have been shown to result from users who are unsure about their information need or who have little knowledge of the domain in which they are searching (Ruthven and Lalmas, 2003; Janes, 1991), the conclusion being that weak relevance judgements are often associated with a change in search topic (Ruthven and Lalmas, 2003). In a study examining the relationship between term specificity and users’ relevance judgements, Kim (2006) noted that three-levelled relevance yielded a stronger statistical significance compared to a binary relevance scale.

Although rating scales are normally associated with explicit feedback, it is also possible to adapt scales (either binary or graded) to implicit feedback by weighting certain actions over others to infer more or less relevance. For example, printing or saving a document could be seen as implying more relevance than simply viewing a document because there is more commitment to the document in the former action compared to the latter, where the user might simply be inspecting the document to assess its relevance.
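One way to picture this weighting is a lookup table from observed actions to grades. The sketch below is purely illustrative: the action names and the grade assigned to each action are assumptions and are not values proposed in this thesis.

```python
# Hypothetical mapping: stronger commitment to a document implies a higher grade.
ACTION_GRADES = {"view": 1, "save": 2, "print": 2}


def implicit_grade(actions):
    """Grade a document by the strongest action observed on it (0 if none)."""
    return max((ACTION_GRADES.get(a, 0) for a in actions), default=0)


grade = implicit_grade(["view", "print"])  # the document was viewed, then printed
```

Taking the maximum rather than the sum is a design choice made here so that repeated viewing alone never outranks a single save or print.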

Affording the user greater choice associated with a graded relevance scale can have negative effects with respect to the cost of eliciting responses (Sormunen, 2002). The added cognitive effort required by the user in allowing graded relevance judgements may negatively affect the retrieval process by overwhelming the user with choice (Rui et al., 1998; Baeza-Yates and Berthier, 1999), and the difficulty in accounting for user rating biases should not be overlooked. It has been noted that users tend to use the ends of graded scales (Janes, 1993; Lavrenko, 2004) and that agreement between assessors is higher for binary judgements (Lavrenko, 2004), which suggests that users reduce cognitive effort by ignoring the additional choices. Furthermore, as Lavrenko (2004) points out, the framework for evaluating retrieval effectiveness using graded relevance scales is much less mature than that for the binary scale (Kekäläinen, 2005).

For these simplifying reasons, many researchers prefer the well-established dichotomous nature of relevance, i.e. that documents are either relevant or they are not (Harman, 1992; Lavrenko, 2004).


2.4.2 Discrete versus continuous scales

The majority of research on the effectiveness of relevance feedback employs discrete rating scales (Rocchio, 1971; Salton and Buckley, 1990; Harman, 1992; Rui et al., 1998; Kekäläinen, 2005; Järvelin, 2009). The alternative is the continuous scale, defined on the line of real numbers, which was explored in the early years of IR. This notion of an underlying continuous relevance scale, or “synthema”, is discussed by Robertson (1977), extending the model proposed by Cook (1975), who posits that it is the experimenter who defines discrete relevance categories on this continuous line. The translation from continuous to discrete is then left to the user (or juror), who assesses “the position of each document on the continuous underlying scale, and then compares this assessment with his understanding of the relevance intervals”.

Ruthven et al. (2003) investigated the incorporation of implicit user behaviour into relevance feedback. A continuous slider was used which corresponded to a graded relevance score between 0 and 10, and this score was incorporated into the relevance feedback algorithm as a fraction of a relevance assessment. Thus, a rating of 10 would correspond to fully relevant (1) whereas a rating of 5 would correspond to half relevant (0.5). Despite increased flexibility in allowing the user to give ratings on a continuous scale, the authors do not justify its use over a simple discrete scale. When comparing the (continuous) graded scale with binary relevance judgements, the authors found no statistically significant difference. They did note, however, that users preferred the terms suggested by the algorithm using the graded relevance scale over the binary one.

Affording the user a continuous rating scale is generally seen as a further complication in the query process which does not have enough of a beneficial effect to be warranted (Lavrenko, 2004). Furthermore, given the relative and subjective nature of relevance feedback, it is unclear why such precise scales are sought. A discrete scale not only suits discrete latent variable models (Chapters 4 and 5), but also unifies implicit and explicit interaction into the discrete realm. This thesis advocates the simplified view of relevance and considers only discrete scales of relevance.

2.4.3 Negative relevance judgements

Relevance feedback in IR often allows the user to select documents which are explicitly non-relevant to the query. It has been suggested that negative relevance feedback examples may not have a beneficial effect for traditional RF algorithms such as Rocchio’s algorithm (Baeza-Yates and Berthier, 1999; Dunlop, 1997; Ruthven and Lalmas, 2003). This can be attributed to the lack of concerted discriminative information contained in negative examples. Groups of positively marked documents usually share traits in common, such as visual concepts, keywords, or topics. However, groups of negatively marked documents may be non-relevant for a wide variety of reasons (X. S. Zhou, 2000; Ruthven and Lalmas, 2003; Müller et al., 2004).

Further evidence of the minimal effect of negative relevance judgements can be seen in a study by Salton and Buckley (1990), where several RF algorithms were compared. The authors discovered that the optimal weight for negative feedback in Rocchio’s algorithm (Eq. (2.4)) was γ = 0.25, whereas that for positive feedback was β = 0.75, suggesting that negative feedback should have less of an effect on the retrieval process than positive feedback (Harman, 1992).

Some studies have even reported that users may fear marking documents as non-relevant due to the negative effects it may have on the search process. Ruthven and Lalmas (2003) note that “the potential harm that a negative assessment may do to a search is not apparent because the user cannot see what documents have been suppressed by the feedback action”. Because of these issues pertaining to negative relevance judgements, Ruthven and Lalmas (2003) and others suggest that graded relevance scales be used instead of binary.

Other research highlights the importance of negative relevance judgements. Bruno et al. (2008) show that in a learning context, where RF will be used to classify documents according to relevance, negative judgements are beneficial for boundary classifiers such as support vector machines. In this thesis, I consider negative RF as important with respect to explicit feedback and support this with the experiments in Chapters 5 and 6.

2.4.4 Implicit feedback and the problem of position bias

In text web IR, it has been argued that implicit user interaction such as clickthrough data can only be interpreted as relative relevance (Joachims, 2002a). Joachims (2002b) and Joachims et al. (2005) showed that clicks on documents in text IR are biased in favour of those documents higher in the ranking. In other words, a document lower in the ranking will typically receive fewer clicks than a document of equal relevance higher in the ranking. Research has attempted to negate the effects of rank bias by directly perturbing search rankings (Radlinski and Joachims, 2006) or to account for its effect by explicitly including position information in the interaction model (Dupret et al., 2006).

This contrasts with most research on explicit interaction, which assumes absolute relevance (White et al., 2001; Heisterkamp, 2002; Macdonald and Ounis, 2009), e.g. assessments used for evaluation or relevance feedback provided to a QBE system. Joachims (2002a) posits that because a user only sees a small portion of the documents indexed by the IR system, those results that are clicked cannot represent absolute judgements. The view is shared by other researchers in text IR (Yue et al., 2010), but little research has made either assumption explicit in the image retrieval domain.

Position bias, as described above, which leads to the relative relevance assumption described by Joachims et al. (2005), has been shown to occur in web image IR (Poblete et al., 2010). Although this finding is important, in this thesis I posit that clicks can in fact be considered as weak indications of absolute relevance, consistent with other research (Baeza-Yates and Tiberi, 2007; Craswell and Szummer, 2007; Tsikrika et al., 2009; Poblete et al., 2010).

2.5 Transaction logs

User interaction, such as the relevance judgements detailed above, whether implicit or explicit, can be captured at some stage between the user’s computer (e.g. web browser or client interface), the communication subsystem or network (e.g. proxy server), or by the retrieval system itself (e.g. web server or system-side logs). These interactions are often stored as transaction log files (Müller et al., 2004; Jansen and Spink, 2006) whose entries consist of a timestamp, an IP address, session identifier and/or user identifier, the action performed (e.g. query), and data relevant to that action (e.g. query text). The study of these transaction logs is called transaction log analysis (TLA) (Jansen, 2006) and the goal is to gain a better understanding of the interaction between the users and the system (Spink and Jansen, 2004).
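As an illustration of the entry structure just described, the sketch below parses one log line into a record. The tab-separated layout, the ISO 8601 timestamp, and the names `LogEntry` and `parse_entry` are all assumptions made for this example; real transaction logs vary widely in format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    timestamp: datetime
    ip_address: str
    session_id: str
    action: str   # e.g. "query" or "click"
    data: str     # e.g. the query text or the clicked document identifier


def parse_entry(line):
    """Parse one tab-separated log line (hypothetical five-field layout)."""
    ts, ip, session_id, action, data = line.rstrip("\n").split("\t", 4)
    return LogEntry(datetime.fromisoformat(ts), ip, session_id, action, data)


# Hypothetical log line; 192.0.2.1 is a documentation-only IP address.
entry = parse_entry("2010-05-01T12:30:00\t192.0.2.1\ts42\tquery\tred sunset")
```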


Document \ Query
        q1   q2   q3   q4   q5   q6   q7   q8   q9  ...   qN
d1       0    1   −1    0    0    1    0    0    1  ...    0
d2      −1    0    1    1    0    0    0    0    0  ...    1
d3       0    0    0    0    0    0    0    0    0  ...   −1
d4       0   −1    1    0    1   −1    1   −1   −1  ...    0
d5       1    0    0    0    0    0    0    1    0  ...    0
d6       1    0    0    0   −1    0    0    0    1  ...    0
d7      −1    0   −1   −1    0    0    0    1    0  ...    0
d8       0    0    0    0    0    0    0    0    1  ...   −1
d9       0    0    0    0    0    1    1    1    0  ...    0
...    ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
dM       0   −1    1    1    0    0   −1    0    0  ...    0

Figure 2.4: An example document-query matrix comprising relevance judgement co-occurrences on a binary scale. Zeros represent missing values, which may or may not have underlying causes (see Section 2.5.3).

2.5.1 Formal representations of search log data

Depending on the intended analysis, transaction logs can be stored and represented in a variety of formats. For example, Jansen (2006) describes the use of a relational database system to store, process, and retrieve the entries of search logs. In this section, I describe two representations used in this thesis: matrices and undirected bipartite graphs. Both representations implicitly carry the bag-of-words assumption. Simply put, the bag-of-words model ignores word or feature order. This is equivalent to treating the observations as exchangeable random variables. I will discuss the implications of this assumption in detail in Chapter 4.

Matrices

As I will demonstrate in Chapter 5, the modelling of interaction in search log data is naturally suited to representations in matrix format. Throughout this thesis, this representation is called the document-query matrix. The matrix has dimensions M × N, with M equal to the number of documents and N equal to the number of queries. Each element corresponds to a relevance judgement between a document i and a query j, which yields a representation such as that in Figure 2.4. Such a representation is also discussed by Fuhr (1992) in the context of an event space between queries and documents, as well as by several works on long-term learning involving latent semantic analysis (Heisterkamp, 2002; He et al., 2003; Koskela and Laaksonen, 2003; Kanade and Uchihashi, 2004; Morrison et al., 2009a).
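A small sketch may help to fix the convention (rows are documents, columns are queries, zeros are missing values). The function names `build_matrix` and `density` are assumptions for this example, and a dense list of lists is used only for readability; log data at realistic scale would call for a sparse format.

```python
def build_matrix(judgements):
    """Build an M x N document-query matrix from (doc, query, rating) triples,
    where rating is +1 or -1 and unobserved pairs are left as 0."""
    judgements = list(judgements)
    docs = sorted({d for d, _, _ in judgements})
    queries = sorted({q for _, q, _ in judgements})
    row = {d: i for i, d in enumerate(docs)}
    col = {q: j for j, q in enumerate(queries)}
    matrix = [[0] * len(queries) for _ in docs]
    for d, q, r in judgements:
        matrix[row[d]][col[q]] = r
    return docs, queries, matrix


def density(matrix):
    """Fraction of non-zero (observed) entries in the matrix."""
    cells = sum(len(r) for r in matrix)
    observed = sum(1 for r in matrix for v in r if v != 0)
    return observed / cells if cells else 0.0


# The observed entries of d1 and d2 over q1..q3 from Figure 2.4.
observations = [("d1", "q2", 1), ("d1", "q3", -1), ("d2", "q1", -1), ("d2", "q3", 1)]
docs, queries, matrix = build_matrix(observations)
print(matrix)  # [[0, 1, -1], [-1, 0, 1]]
```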

Graphs

A representation equivalent to the above-mentioned matrix form is the bipartite graph (Beeferman and Berger, 2000; Craswell and Szummer, 2007; Baeza-Yates and Tiberi, 2007; Hosseini and Abolhassani, 2009). A graph G = (D, Q, E) is defined where the documents and queries are represented (respectively) by vertices D, Q which are connected by edges E (see Figure 2.5). Each edge in E indicates a weighted (in the case of graded relevance) or binary (for binary relevance or implicit interaction such as clicks) rating between a document and a query, with the number of edges |E| equal to the number of relevance judgements. For relevance scales where negative relevances are afforded, a unified graphical representation is not straightforward because of the differences in meanings between positive and negative relevance scores. This is often solved by representing the positive and negative ratings as separate graphs (Clements et al., 2009).

Figure 2.5: Bipartite graph representation of the document-query matrix. Edges represent (weighted) relevance judgements for documents (D) over queries (Q).

A graphical representation of the data is useful for determining characteristics such as the number of connected components. A connected component is defined as a connected subgraph of G which is not part of a larger connected component (Tutte, 1947). The graph can be stored either as an adjacency matrix or as an adjacency list. The latter is more storage efficient: like the sparse matrix format, its storage requirement grows linearly with the number of observations (relevance judgements) in the data.
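The adjacency-list representation and the connected-component count can be sketched as follows. This is an illustrative example under simplifying assumptions: the function names are hypothetical, each judgement contributes one undirected edge regardless of its sign, and components are found with a standard breadth-first search.

```python
from collections import defaultdict, deque


def build_graph(judgements):
    """Adjacency list of the bipartite graph; vertices are tagged ("d", id) or
    ("q", id) so that document and query identifiers cannot collide."""
    adj = defaultdict(set)
    for doc, query, _rating in judgements:
        adj[("d", doc)].add(("q", query))
        adj[("q", query)].add(("d", doc))
    return adj


def connected_components(adj):
    """Return the connected components as a list of vertex sets (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(comp)
    return components


# Hypothetical judgements: d1 and d2 share query q1; d3 is linked only to q2.
edges = [("d1", "q1", 1), ("d2", "q1", 1), ("d3", "q2", 1)]
components = connected_components(build_graph(edges))
print(len(components))  # 2
```

Storage here is one set entry per edge endpoint, so it grows linearly with the number of relevance judgements, as noted above.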

2.5.2 Data preprocessing

Given the representations of relevance data introduced in the previous section, I will now briefly describe preprocessing steps commonly used to normalise and clean user interaction data extracted from search logs.

1. Removal of non-human interaction

One of the first preprocessing steps undertaken is to filter any non-human interaction from the data. Robots and automated scripts can perform a large number of queries in a short amount of time, and when studying transaction logs for user behaviour, it is desirable to remove all possible interaction that may be non-human. Jansen (2006) states that sessions involving more than 100 queries may be removed from the data because it is unlikely that they were performed by a human searcher.

2. Removal of junk or nonsense queries


Meaningless queries such as those consisting of only numbers (with the exception of numerically formatted dates) or punctuation may be stripped from the search logs. If the goal is modelling search semantics, this should have little or no effect.

3. Cleaning of query text

The query text associated with each query may be cleaned to remove numbers (again, with the exception of numerically formatted dates), stopwords, and conjunctions such as “and” and “or” (Tsikrika et al., 2009). This is mainly useful if the queries are to be subsequently processed, such as in the merging of queries described in Chapter 5, Section 5.5.2.

4. Removal of documents and queries with no relevance judgements

Any documents or queries existing in the data that have no relevance judgements can also be safely removed without any effect on the system. Since the goal will be to process user interaction, empty documents and queries are of little use. In the matrix format described in Section 2.5.1, this corresponds to removing every row (document) and every column (query) whose entries are all zero.
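The four steps above can be sketched as a small pipeline. Everything named here is an assumption for illustration: the stopword list, the regular expression used to recognise numerically formatted dates, and the function names. Only the 100-query threshold comes from the text, where it is attributed to Jansen (2006).

```python
import re

MAX_QUERIES_PER_SESSION = 100  # threshold attributed to Jansen (2006) above
STOPWORDS = {"and", "or", "the", "a", "of"}  # illustrative list only
DATE_PATTERN = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")  # assumed formats


def is_junk(query_text):
    """Step 2: no alphabetic content and not a numerically formatted date."""
    text = query_text.strip()
    if DATE_PATTERN.match(text):
        return False
    return not any(ch.isalpha() for ch in text)


def clean_query(query_text):
    """Step 3: drop stopwords, conjunctions and bare numbers, but keep dates."""
    tokens = query_text.lower().split()
    kept = [t for t in tokens
            if DATE_PATTERN.match(t) or (t not in STOPWORDS and not t.isdigit())]
    return " ".join(kept)


def preprocess(sessions):
    """sessions: {session_id: [query_text, ...]} -> cleaned copy."""
    cleaned = {}
    for sid, queries in sessions.items():
        if len(queries) > MAX_QUERIES_PER_SESSION:
            continue  # step 1: probably a robot or automated script
        kept = [clean_query(q) for q in queries if not is_junk(q)]
        kept = [q for q in kept if q]  # cleaning may leave nothing behind
        if kept:
            cleaned[sid] = kept
    return cleaned


def drop_empty(matrix):
    """Step 4: remove all-zero rows (documents) and columns (queries)."""
    rows = [r for r in matrix if any(v != 0 for v in r)]
    width = len(rows[0]) if rows else 0
    keep = [j for j in range(width) if any(r[j] != 0 for r in rows)]
    return [[r[j] for j in keep] for r in rows]
```

For example, a session of 101 identical queries would be discarded as probable robot traffic, while the query "red and blue" would be reduced to "red blue".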

2.5.3 Sparsity, missing values, and implicit non-relevance

The interaction data provided by users is inherently sparse. Ignoring for a moment the underlying user model (i.e. recall- or precision-oriented), there are two causes of sparsity in interaction data, particularly relevance feedback:

• the documents in question were not ranked high enough in the result set and therefore were not available to be interacted with by the user, known as rank bias (Joachims, 2002a);

• the documents were ranked high enough to be visible in the result set, but were not relevant enough to elicit a rating (either implicit or explicit) from the user.

Consider a retrieval system with a large index of documents. At any time, a user querying this system will see results that comprise only a small fraction of the total number of documents, even after several iterations of relevance feedback. Therefore, the total number of document ratings for a given query will be very small. Craswell and Szummer (2007) discuss the sparsity problem in MSN image query logs, stating that “documents that are relevant but not clicked constitute sparsity in the click data”. Concerning the second point, there are documents that are included in the result set but which are not relevant enough to elicit a rating from the user. These documents can be considered implicitly non-relevant in a binary scale, but, according to a graded scale, may in fact be minimally or somewhat relevant.

Because the theory of relevance in IR assumes that every document has a relevance rating² with respect to a given query (Crestani et al., 1998; Lavrenko, 2004), sparsity can be viewed as missing values in the document-query relevance matrix. Filling in missing values is the goal of ratings prediction, such as in collaborative filtering (Marlin, 2004, 2008).

² A document is either relevant (binary or graded) or it is not.

2.5.4 Noise in relevance data

Noise in relevance assessments may arise due to several causes. First, queries sharing similar search terms may simply be polysemic: users are searching for unrelated topics using the same query terms (e.g., “bank” as in river, and “bank” as in the monetary institution), and therefore the relevance judgements will not be consistent for both queries.

Second, different users searching for the same topic may disagree on what constitutes a relevant document. In studies on clickthrough data, noise is often seen as uncertainty between the implicit judgements from clicks and explicit ground truth assessments provided by assessors (Joachims et al., 2005, 2007a; Smith and Ashman, 2009). He et al. (2004) describe noise as those relevance judgements which differ from the ground truth (i.e. are “incorrect”). Similarly, Craswell and Szummer (2007) define noise in click data as documents that are clicked but that are not relevant to the query. Auer and Leung (2009) model noise from user feedback in an active learning context and define noise as erroneous feedback from example images lying on the decision boundary that are difficult for the user to classify. As will be discussed in Chapter 3, research indicates that spurious clicks on otherwise irrelevant documents are more prevalent in text search than in image search due to the high information content of thumbnails in result sets (i.e. overall, clicks in image search more closely match explicit relevance judgements than clicks in text search) (Smith and Ashman, 2009; Tsikrika et al., 2009).

Radlinski (2007) highlights another growing cause of noise in implicit feedback: “clickspam”. As the use of clickthrough data to enhance web search results grows, so too will the exploitation of such data for commercial means. Radlinski categorises clickspam as noise because it is an unwanted component of the data, and thus the objective is to counteract it by filtering it from the transaction logs prior to using the click data.

Other explanations for noise include, for example, situations in which a user finds an irrelevant document summary interesting enough to click on or view, or in which a user clicks on the wrong result by mistake. However, these cases of noise can be expected to be negligible. Furthermore, as Dupret et al. (2006) point out, “occasional user selection mistakes will be cancelled out while averaging over a large number of selections”, depending on the quantity of queries available in the query log. This assumption, that the signal outweighs the noise, is also made in this thesis. I specifically examine the quantity of queries required to achieve reasonable results for document rankings under noisy conditions in Chapter 5, Section 5.4.1.
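The averaging argument can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the experiment of Section 5.4.1: each selection is a mistaken, uniformly random click with probability `error_rate` and otherwise a click on one of the relevant documents. All names and parameter values are hypothetical.

```python
import random
from typing import List, Set


def rank_from_noisy_clicks(
    n_selections: int,
    relevant: Set[int],
    n_docs: int,
    error_rate: float,
    seed: int = 0,
) -> List[int]:
    """Rank documents by click count accumulated over noisy selections."""
    rng = random.Random(seed)
    counts = [0] * n_docs
    relevant_list = sorted(relevant)
    for _ in range(n_selections):
        if rng.random() < error_rate:
            doc = rng.randrange(n_docs)      # assumed: mistakes are uniform
        else:
            doc = rng.choice(relevant_list)  # intended click on a relevant doc
        counts[doc] += 1
    # Most-clicked documents first.
    return sorted(range(n_docs), key=lambda d: counts[d], reverse=True)


def precision_at_k(ranking: List[int], relevant: Set[int], k: int) -> float:
    """Share of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


relevant_docs = {0, 1, 2}
for n in (10, 100, 10000):
    ranking = rank_from_noisy_clicks(n, relevant_docs, n_docs=50, error_rate=0.3)
    print(n, precision_at_k(ranking, relevant_docs, k=3))
```

With few selections a mistaken click can still displace a relevant document from the top of the ranking; as the number of selections grows, the mistakes are spread thinly over all documents while the relevant ones accumulate counts.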

2.5.5 Observed probability distributions

Implicit feedback (e.g. clicks), like many other forms of user behaviour, has been observed to be globally distributed according to a power law (Lucchese et al., 2007; Baeza-Yates and Tiberi, 2007; Hardtke et al., 2009). The power law is defined as:

y=x^a, (2.5)

and states that the frequency of an item y is proportional to a power of an attribute x, with the constant a being the scaling exponent. Under a power law, a small number of attributes account for the majority of the observations, while the remainder make up the tail. Figure 2.6 shows an example power law curve where the y-axis denotes the frequency of a search and the x-axis denotes the search topic. Power law distributed data
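A brief numerical sketch of Equation 2.5 may help. Taking logarithms gives log y = a log x, so the exponent a is the slope of a straight line on log-log axes. The example assumes that x is the rank of a search topic and that a = -1.5; both choices are illustrative assumptions, not values reported in the cited studies.

```python
import math
from typing import List, Sequence


def power_law(x: float, a: float) -> float:
    """Equation 2.5: y = x^a."""
    return x ** a


def fit_exponent(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of log(y) against log(x), i.e. an estimate of a."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    den = sum((u - mean_x) ** 2 for u in lx)
    return num / den


# Assumed setup: 100 topics ordered by rank, frequencies decaying with a = -1.5.
ranks: List[int] = list(range(1, 101))
freqs = [power_law(r, -1.5) for r in ranks]

a_hat = fit_exponent(ranks, freqs)           # recovers the exponent
head_share = sum(freqs[:10]) / sum(freqs)    # mass held by the top 10 topics
print(round(a_hat, 3), round(head_share, 3))
```

In this synthetic setting the ten highest-ranked topics account for most of the total frequency, which is the head-and-tail behaviour described above.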


Abstract

R´esum´e

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1 Problem definition and motivation

1.2 Thesis statement and contributions

1.3 Layout of the thesis

Chapter 2

User interaction in information retrieval

2.1 Introduction

2.2 The nature of user interaction

2.2.1 Implicit user interaction

2.2.2 Explicit user interaction

2.2.3 Correlations between explicit and implicit interaction

2.3 User interaction in information retrieval

2.3.1 Queries, sessions, and the information need

2.3.2 User models underlying search

2.3.3 Search interfaces

2.4 Scales of relevance

2.4.1 Binary versus graded ratings

2.4.2 Discrete versus continuous scales

2.4.3 Negative relevance judgements

2.4.4 Implicit feedback and the problem of position bias

2.5 Transaction logs

2.5.1 Formal representations of search log data

2.5.2 Data preprocessing

2.5.3 Sparsity, missing values, and implicit non-relevance

2.5.4 Noise in relevance data

2.5.5 Observed probability distributions