Report from TREC-9


Donna Harman, Ellen Voorhees
Retrieval Group, Information Access Division
National Institute of Standards and Technology

TREC Tasks

[Figure: timeline of TREC tasks, 1992-2001. Task groupings: Static text; Streamed text; Human-in-the-loop; Beyond just English; Beyond text; Web searching; Answers, not documents. Tracks shown: Ad Hoc, Routing, Filtering, Interactive, Spanish, Chinese, X → {Euro., Chin., Arab.}, OCR, Speech, Video, Very large collection, Web, Q&A.]

TREC-9 Tracks

• Cross-language (English to Chinese)

• Filtering

• Interactive

• Query

• Question Answering

• Spoken Document Retrieval

• Web

Cross Language Track

• Task: ad hoc search for documents written in one language using topics in another language

– 25 topics in English created by bilingual assessors; Chinese version also available

– 126,937 documents; 188 MB in BIG5

– Hong Kong newspapers donated by Wiser Ltd.

• Hong Kong Commercial Daily (Aug 98-Jul 99)

• Hong Kong Daily News (Feb 99-Jul 99)

• Takungpao (Oct 98-Mar 99)

Relevance Judgments

• Judged highest priority mono- and crosslingual run from each group

– 39 cross (75%) / 13 mono (25%)

– 51 auto / 1 manual (Thank you, Berkeley!)

• Added top 50 documents from each judged run to the pool

• Mean actual pool size = 598 (39% of max) within expected range
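The pooling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling actually used for the track: the function name `build_pool`, the run names, and the document IDs are all hypothetical, and the toy depth of 2 stands in for the track's depth of 50.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 50) -> Set[str]:
    """Union of the top-`depth` documents from each judged run (one topic)."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        # Only the highest-ranked `depth` documents of each run are pooled.
        pool.update(ranked_docs[:depth])
    return pool


# Toy data: two hypothetical runs for a single topic.
runs = {
    "runA": ["d1", "d2", "d3"],
    "runB": ["d2", "d4", "d5"],
}
pool = build_pool(runs, depth=2)

# The maximum pool size is depth * number of runs; overlap between
# runs is what makes the actual pool smaller than that maximum.
max_size = 2 * len(runs)
print(sorted(pool), len(pool) / max_size)
```

Because runs tend to retrieve many of the same documents, the actual pool for a topic is usually well below the maximum, which is the effect the mean pool size above reflects.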

[Figure: % contributions to pool by run type (relevant documents). Labels: Monolingual, Crosslingual. Values shown: 13, 59, 28.]

Participants

BBN Technologies

Fudan University

IBM T.J. Watson Research Center

Johns Hopkins University

Korea Advanced Institute of Science and Technology

Microsoft Research, China

MNIS-TextWise Labs

National Taiwan University

More participants

Queens College, CUNY

RMIT University

Telcordia Technologies, Inc.

The Chinese University of Hong Kong

Trans-EZ Inc.

University of California at Berkeley

University of Maryland

University of Massachusetts

Resources: dictionaries/word lists

– LDC English-Mandarin word list (~120,000 pairs)

– Chinese-English Translation Assistance (CETA) dictionary

– KingSoft online bilingual dictionary

– WordNet

– other local (proprietary) dictionaries
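One way a bilingual word list like those above might be used is plain term-by-term query translation. The sketch below is only an illustration of that general idea, not any participant's system: the function `translate_query`, the `max_alternatives` cut-off, and the three-entry lexicon are hypothetical.

```python
from typing import Dict, List


def translate_query(
    terms: List[str],
    lexicon: Dict[str, List[str]],
    max_alternatives: int = 3,
) -> List[str]:
    """Replace each English term with up to `max_alternatives` translations."""
    translated: List[str] = []
    for term in terms:
        candidates = lexicon.get(term.lower())
        if candidates:
            translated.extend(candidates[:max_alternatives])
        else:
            # Out-of-vocabulary terms (often proper names) are kept as-is.
            translated.append(term)
    return translated


# Hypothetical toy lexicon: English term -> candidate Chinese translations.
lexicon = {
    "bank": ["銀行", "河岸"],
    "robbery": ["搶劫"],
}
query = translate_query(["bank", "robbery", "Berkeley"], lexicon)
print(query)
```

Keeping several candidates per term trades precision for coverage; how many to keep, and how to weight them, is a design choice this sketch does not address.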

Resources: software & services

• MT

– HuaJian MT system

– IBM AlphaWorks translation server

– Alis Gist-in-Time MT system

• English analysis

– InXight LinguistX (English linguistic analysis)

– Apple Pie parser

– Brill’s POS tagger

• Chinese analysis/conversion

– Various Chinese segmenters (e.g., NMSU’s ch_seg)

– BIG5->GB converters (e.g., NJStar’s)

• Miscellaneous

– CMU’s WEAVER translation-pair extraction

– Yahoo search

English to Chinese Results

[Figure: recall-precision curves (x-axis: Recall, y-axis: Precision, both 0 to 1) for runs BBN9XLA, msrcn1, fdut9xl2, CHUHK00XEC1, pir0XHxD, INQ7XL3, ibmcl9a, KAIST9xlqm.]
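Recall-precision curves of this kind are conventionally drawn from interpolated precision at the recall levels 0, 0.1, ..., 1. The sketch below shows that standard computation for a single topic; the function name and the toy ranking are hypothetical, and the curves on the slide are presumably averages over all topics rather than a single-topic result.

```python
from typing import List, Set


def interpolated_precision(
    ranking: List[str], relevant: Set[str], levels: int = 11
) -> List[float]:
    """Interpolated precision at `levels` evenly spaced recall points."""
    # (recall, precision) observed each time a relevant document is retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    curve = []
    for i in range(levels):
        recall_level = i / (levels - 1)
        # Interpolation: best precision at any recall >= this level.
        # The small tolerance guards against floating-point rounding.
        reachable = [p for r, p in points if r >= recall_level - 1e-9]
        curve.append(max(reachable, default=0.0))
    return curve


# Toy example: two relevant documents, found at ranks 1 and 3.
curve = interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
print([round(p, 3) for p in curve])
```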

Cross-language vs. Monolingual

[Figure: recall-precision curves (x-axis: Recall, y-axis: Precision, both 0 to 1) for runs BBN9MONO, BBN9XLA, msrcn3, msrcn1, pir0Xori, pir0XHxD, ibmcl9m, ibmcl9a.]

Average Precision by Topic: Crosslingual

[Figure: average precision (0 to 1) per topic for topics 55-79, crosslingual runs.]
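For reference, the per-topic measure plotted here is (non-interpolated) average precision: the mean of the precision values at the rank of each relevant document, with relevant documents that are never retrieved counting as zero. A minimal sketch, using a hypothetical function name and toy data:

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list for one topic."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Dividing by all relevant documents penalises those never retrieved.
    return precision_sum / len(relevant)


# Toy example: three relevant documents, two retrieved at ranks 1 and 3.
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"})
print(round(ap, 4))
```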

What was learned from the Chinese CLIR track?

• Many approaches to English to Chinese topic translation, including use of various dictionaries, word lists, parallel text, and commercial MT systems

• Extensive set of Chinese retrieval experiments performed, ranging from various n-gram methods to word-based methods to complete language modeling

• Because of the tight focus of this track, cross-system comparison is possible
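As an illustration of the simplest of the n-gram methods mentioned above, overlapping character bigrams can serve as index terms for Chinese text without any word segmenter. This is a generic sketch, not a specific participant's method; the function name `char_ngrams` and the choice of n = 2 are assumptions.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        # Text shorter than n is kept as a single term (or dropped if empty).
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# "Hong Kong economy" as four characters yields three overlapping bigrams.
terms = char_ngrams("香港經濟")
print(terms)
```

Word-based indexing would instead rely on a segmenter to split the same string into words, which is the contrast the experiments above explored.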

TREC 2001

• Cross language

– Chinese → NTCIR workshop (NII, Japan)

– TREC task will be English, French → Arabic

• Filtering track using new Reuters corpus

• Interactive to investigate live web

• Expanded web and QA tracks

• New video track