HAL Id: hal-01567065
https://hal.archives-ouvertes.fr/hal-01567065
Submitted on 21 Jul 2017
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Informativeness for Adhoc IR evaluation: a measure that prevents assessing individual documents
Romain Deveaud, Véronique Moriceau, Josiane Mothe, Eric San Juan
To cite this version:
Romain Deveaud, Véronique Moriceau, Josiane Mothe, Eric San Juan. Informativeness for Adhoc IR
evaluation: a measure that prevents assessing individual documents. 38th European Conference on
Information Retrieval (ECIR 2016), Mar 2016, Padoue, Italy. pp. 818-823. ⟨hal-01567065⟩
Open Archive TOULOUSE Archive Ouverte (OATAO)
OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.
This is an author-deposited version published in: http://oatao.univ-toulouse.fr/
Eprints ID: 16883
The contribution was presented at ECIR 2016: http://ecir2016.dei.unipd.it/
To cite this version: Deveaud, Romain and Moriceau, Véronique and Mothe, Josiane and San Juan, Eric. Informativeness for Adhoc IR evaluation: a measure that prevents assessing individual documents.
(2016) In: 38th European Conference on Information Retrieval (ECIR 2016), 20 March 2016 - 23 March 2016 (Padoue, Italy).
Any correspondence concerning this service should be sent to the repository
administrator: [email protected]
Informativeness for Adhoc IR Evaluation:
A measure that prevents assessing individual documents
Romain Deveaud¹, Véronique Moriceau², Josiane Mothe³, and Eric SanJuan¹
¹ LIA, Univ. Avignon, France, [email protected] [email protected]
² LIMSI-CNRS, Univ. Paris-Sud, Univ. Paris-Saclay, France, [email protected]
³ IRIT UMR5505, Univ. Toulouse, France, [email protected]
Abstract. Informativeness measures have been used in interactive information retrieval and automatic summarization evaluation. Indeed, as opposed to adhoc retrieval, these two tasks cannot rely on the Cranfield evaluation paradigm, in which retrieved documents are compared to static query relevance document lists. In this paper, we explore the use of informativeness measures to evaluate adhoc tasks. The advantage of the proposed evaluation framework is that it does not rely on an exhaustive reference and can be used in a changing environment in which new documents occur, and for which relevance has not been assessed.
We show that the correlation between the official system ranking and the informativeness measure is particularly high for most of the TREC adhoc tracks.
Keywords: Information retrieval, Evaluation, Informativeness, Adhoc retrieval
1 Introduction
Information Retrieval (IR) aims at retrieving the relevant information from a large volume of available documents. Evaluating IR requires defining evaluation frameworks. In adhoc retrieval, the Cranfield framework is the prevailing one [1]; it is composed of documents, queries, relevance assessments and measures.
Moreover, document relevance is considered independent of the document rank and is generally treated as a Boolean function (a document is either relevant or not to a given query), even though graded levels of relevance can be used [7]. Effectiveness measurement is based on comparing the retrieved documents with the reference list of relevant documents. It therefore relies on the assessment assumption, that is, the relevance of documents is known in advance for each query. This implies that the collection is static, since it is assessed by humans. The Cranfield paradigm facilitates reproducibility of experiments: at any time it is possible to evaluate a new IR method and to compare it against previous results; this is one of its main strengths. However, such a framework is not usable in changing environments where new documents are continuously added.
As opposed to the Cranfield assumption that document relevance is independent, informativeness expresses the dependency of document relevance and takes into account the interactive nature of IR [8]. Indeed, one limitation of Cranfield-based evaluation is that relevance is encoded by documents [4]. Moreover, document relevance assessment is a clear limitation in dynamic contexts, when new documents are continuously added.
Nugget-based evaluation has been introduced to tackle this problem: rather than considering document relevance, it considers information relevance [4]. This method makes it possible to label documents that have not been assessed as relevant or not, simply according to whether they contain relevant information. A similar assumption is made in automatic translation and automatic summarization evaluation. However, this type of measure has not been intensively used in adhoc retrieval evaluation.
Our goal is to develop a method to evaluate adhoc IR using an informativeness measure, so as to ensure reproducibility in dynamic document collections. To evaluate our method, we compare the system rankings we obtained using the informativeness measure proposed in [6] with the official system rankings based on document relevance, considering various TREC collections on adhoc tasks.
We show that the correlation between the official system rankings and the informativeness measure is particularly high for most of the TREC adhoc tracks.
The rest of the paper is organized as follows. Section 2 presents our evaluation framework, which makes use of n-grams for informativeness-based evaluation applied to adhoc retrieval. We also present the adhoc retrieval collections we will be using. Section 3 presents and discusses the comparison of system rankings when using informativeness-based measures with the official ranking for the various adhoc retrieval collections/sub-tasks. Finally, Section 4 concludes this paper.
2 N-gram based measure for Adhoc Retrieval
The evaluation method we developed makes an informativeness measure usable in the case of adhoc retrieval. We use it on various TREC adhoc tracks.
We use the generic Log Similarity (LogSim) informativeness measure, initially introduced to evaluate tweet contextualization in the 2011 CLEF-INEX QA task [5]. LogSim is based on pools of relevant passages, called t-rels, extracted from the document collection (Wikipedia in the CLEF/INEX lab case). t-rels are chunks of text that are marked up as relevant by human assessors. By considering each word n-gram in these t-rels as a relevant item, the normalized LogSim measure is based on n-gram precision and graded using log frequencies.
Given a reference $R$ and a summary $S$, the Log Similarity on n-grams (LogSim) measure stands as:

$$\mathrm{LogSim}(S|R) = \sum_{w \in F_{Stn}(R) \cap F_{Stn}(S)} \frac{\log\big(\min(P(w|S), P(w|R)) \cdot |R| + 1\big)}{\log\big(\max(P(w|S), P(w|R)) \cdot |R| + 1\big)} \cdot P(w|R) \qquad (1)$$

where $P(w|X) = \frac{f_X(w)}{|X|}$ corresponds to the frequency $f_X(w)$ of n-gram $w$ in $X$ over the length $|X|$ of text $X$, $F_{Stn}(X)$ is the set of n-grams of stem words from $X$, and $X$ is either $R$ or $S$.
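As an illustration only, Eq. (1) can be sketched in Python under a few assumptions that are not stated in the text: texts are given as lists of tokens that have already been stemmed and stripped of stop words, $|X|$ is taken to be the number of tokens in $X$, and n-grams are contiguous. The names `ngrams` and `logsim` are hypothetical helpers, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n=1):
    """Contiguous n-grams of a token list, returned as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def logsim(summary_tokens, reference_tokens, n=1):
    """Sketch of LogSim(S|R) from Eq. (1).

    Assumption: |X| is the token count of X and P(w|X) = f_X(w) / |X|.
    """
    f_s = Counter(ngrams(summary_tokens, n))
    f_r = Counter(ngrams(reference_tokens, n))
    len_s, len_r = len(summary_tokens), len(reference_tokens)
    if len_s == 0 or len_r == 0:
        return 0.0
    score = 0.0
    # Only n-grams shared by the summary and the reference contribute.
    for w in f_s.keys() & f_r.keys():
        p_s = f_s[w] / len_s
        p_r = f_r[w] / len_r
        num = math.log(min(p_s, p_r) * len_r + 1)
        den = math.log(max(p_s, p_r) * len_r + 1)  # > 0 because p_s, p_r > 0
        score += (num / den) * p_r
    return score


reference = ["a", "b", "a", "c"]
print(logsim(reference, reference))    # a text compared with itself
print(logsim(["x", "y"], reference))   # no shared unigram
print(logsim(["a", "b"], reference))   # partial overlap
```

Under these assumptions, a text compared with itself on unigrams obtains a score of 1, since every log ratio equals 1 and the probabilities $P(w|R)$ sum to 1, while a text sharing no n-gram with the reference obtains 0.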
To build such textual references over document adhoc q-rels, in order to easily apply informativeness to adhoc IR tasks, one approach consists in extracting information nugget candidates from documents; it has been shown that this is possible over non-spammed document collections like the TREC Robust track or Gov collections [4]. This paper aims at showing that similar results can be obtained without requiring a prior extraction of relevant nuggets. Indeed, we propose a direct conversion of relevant documents into a textual reference and experiment with plain informativeness measures over it.
For that, we introduce the concept of content interpolated precision at length $\lambda$ ($cP_\lambda$). Assuming that a user reads the retrieved documents following the ranking given by an IR system, $cP_\lambda$ evaluates the informativeness of the reading after $\lambda$ words.
2.1 Evaluating Precision based on Document Content
Consider an adhoc task and its document q-rels. Assume that runs are ranked according to Mean Average Precision or Interpolated Precision at several recall levels. Runs can be converted into textual outcomes by concatenating ranked documents, and q-rels can be converted into a textual reference by merging together all relevant documents per topic. Runs can then be evaluated by applying informativeness metrics to measure the overlap between submission and reference at various recall levels.
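The conversion described above can be sketched as follows. This is a minimal illustration, assuming the usual trec_eval file layouts (run lines of the form `topic Q0 docid rank score tag`, q-rels lines of the form `topic iteration docid relevance`) and a hypothetical in-memory mapping `doc_tokens` from document identifiers to token lists; none of these helper names come from the paper.

```python
from collections import defaultdict


def parse_run(lines):
    """Map each topic to its document ids, ordered by rank."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip malformed lines
        topic, _, docid, rank, _score, _tag = parts
        ranked[topic].append((int(rank), docid))
    return {t: [d for _, d in sorted(pairs)] for t, pairs in ranked.items()}


def parse_qrels(lines):
    """Map each topic to the ids of documents judged relevant (relevance > 0)."""
    relevant = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed lines
        topic, _, docid, rel = parts
        if int(rel) > 0:
            relevant[topic].append(docid)
    return dict(relevant)


def to_text(docids, doc_tokens):
    """Concatenate the token lists of the given documents, in order.

    Documents missing from doc_tokens (hypothetical store) are skipped.
    """
    text = []
    for docid in docids:
        text.extend(doc_tokens.get(docid, []))
    return text


run = parse_run(["301 Q0 docB 2 0.5 demo", "301 Q0 docA 1 0.9 demo"])
qrels = parse_qrels(["301 0 docA 1", "301 0 docB 0"])
doc_tokens = {"docA": ["inform", "retriev"], "docB": ["evalu"]}
print(to_text(run["301"], doc_tokens))    # textual outcome of the run
print(to_text(qrels["301"], doc_tokens))  # textual reference for the topic
```

The textual outcome of a run and the textual reference of a topic are then two plain token sequences, to which an informativeness measure can be applied directly.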
Let $D = (D_i)_{1 \le i \le d}$ be a ranked list of $d$ documents. We consider as text $T = (t_1, ..., t_n)$ the concatenation of these documents (where $n = \sum_{i=1}^{d} |D_i|$).
For each integer $\lambda$, we denote by $T_\lambda$ the truncated text $T_\lambda = (t_1, ..., t_\lambda)$ and by $D^{n,k}_\lambda$ the set of n-grams with gap $k$, $D^{n,0}_\lambda$ or $D^{n}_\lambda$ being the set of n-grams.
Similarly, given a set $R = \{R_i : 1 \le i \le r\}$ of $r$ relevant documents, we shall consider $R^{n,k}$, the set of n-grams with gap $k$ occurring in at least one reference.
In the case of TREC tracks, $D$ is a run and each $D^{n}_\lambda$ is the set of n-grams occurring in one of the $m$ top ranked documents such that $\sum_{i=1}^{m} |D_i| \le \lambda$, while $R^{n}$ is the set of n-grams appearing at least once in the relevant documents from the corresponding q-rels.
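The sets defined above can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are lists of preprocessed tokens, and "gap $k$" is read as allowing up to $k$ tokens to be skipped between consecutive items of an n-gram, an interpretation the text does not spell out. The function names are hypothetical.

```python
def truncate_run(ranked_docs, lam):
    """Keep the m top ranked documents whose cumulated length is <= lam."""
    kept, total = [], 0
    for doc in ranked_docs:
        if total + len(doc) > lam:
            break
        kept.append(doc)
        total += len(doc)
    return kept


def gap_ngrams(tokens, n=1, k=0):
    """Set of n-grams of one document, skipping at most k tokens between
    consecutive items (assumed reading of 'gap k'; k=0 gives plain n-grams)."""
    found = set()

    def extend(prefix, last):
        if len(prefix) == n:
            found.add(tuple(prefix))
            return
        # The next item may sit 1 to k+1 positions after the last one.
        for nxt in range(last + 1, min(last + k + 2, len(tokens))):
            extend(prefix + [tokens[nxt]], nxt)

    for start in range(len(tokens)):
        extend([tokens[start]], start)
    return found


def run_ngrams(ranked_docs, lam, n=1, k=0):
    """Sketch of D^{n,k}_lambda: n-grams found in one of the kept documents."""
    grams = set()
    for doc in truncate_run(ranked_docs, lam):
        grams |= gap_ngrams(doc, n, k)
    return grams


def reference_ngrams(relevant_docs, n=1, k=0):
    """Sketch of R^{n,k}: n-grams found in at least one relevant document."""
    grams = set()
    for doc in relevant_docs:
        grams |= gap_ngrams(doc, n, k)
    return grams


run = [["a", "b"], ["c", "d", "e"], ["f"]]
print(len(truncate_run(run, 5)))             # number of documents kept
print(sorted(run_ngrams(run, 5, n=2, k=0)))  # bigrams of the kept documents
```

In this sketch n-grams are collected per document, so none of them spans a document boundary, which matches the wording "occurring in one of the $m$ top ranked documents".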
We apply the English Porter stemming algorithm⁴ to all documents after removing all stop words and all document identifiers such as TREC doc-IDs. This is not only to reduce data, but also to convert q-rels into reusable textual relevance judgments (t-rels) that can be applied to non-official runs, including documents not in the initial collection.
Given a run $D$ and a reference $R$, we define for $n \in \{1, 2\}$, $k \in \{0, 2\}$ the content interpolated precision $cP^{n,k}_\lambda$ as:

$$cP^{n,k}_\lambda(D, R) = LS(D^{n,k}_\lambda | R^{n,k}) \qquad (2)$$

Observe that, if $D_j \in R$, then:

$$cP^{1,0}_{\sum_{i=1}^{d} |D_i|}(D, R) \ge \frac{|D_j|}{\sum_{i=1}^{d} |D_i|}$$
4