Structured and Collaborative Search: An integrated approach to share documents among users

Faculty of Applied Sciences, Department of Analytical Mechanics and CAD/CAM

Supervisor: Alain Delchambre

Original dissertation presented by Pascal Francq for the degree of Doctor of Applied Sciences

Academic Year 2002–2003

Acknowledgments

I want to first thank Professor Alain Delchambre, the director of the CAD/CAM department, for his help. The present work would not exist without his encouragement. Professor Marco Saerens made multiple suggestions concerning the ideas, and our long discussions have greatly influenced the results of this work.

Thomas L’Eglise read this work carefully and proposed many improvements to the presentation of the concepts. Anything that remains unclear is due to my inability to manipulate English correctly.

This work is part of the HyperPRISME¹ and GALILEI² projects funded by the “Region wallonne”. All the people who have worked on these projects must be thanked for their help, and particularly Professor Hervé Gilson, whose tragic demise took place during the GALILEI project. He left us with an immense gap that we cannot fill.

All the members of the CAD/CAM department contributed indirectly to this work by ensuring an enjoyable working environment and stoically enduring me during the writing of this thesis.

I want to thank all those people who have agreed to form part of my jury, namely Professors Hugues Bersini, Pierrette Bouillon (Université de Genève), Alain Delchambre, Pierre Gaspart, François Heinderyckx and Marco Saerens (Université Catholique de Louvain).

All the software used during the realization of this thesis is open source or free software. All the contributors to these projects must be congratulated for the quality of their work. The open source and free software community proposes a realistic alternative to the ultra-liberal society in which we presently live.

Finally, I am grateful to L EO F ENDER and G EORGES J ENNY , the fathers of electric guitars and electronic keyboards respectively. Without them, many bands such as Deep Purple, Rainbow, Led Zeppelin, Black Sabbath, the Ramones, Pink Floyd, Whitesnake, die Totenhosen, UFO, the Who, Kraftwerk, Van Halen, Rammstein and Diary of Dreams would not be able to express their creativity, which would decrease a large part of my inspiration and sanity.

¹ Contract 9613457

² Contract 01/1/4675

Preface

One of the most important problems today in the field of computer science is document management. Several approaches such as search engines, text categorization or collaborative filtering exist to help users find relevant documents. The aim of this thesis is to develop a set of tools to tackle the problem of collaborative search in document-oriented contexts. The users of the system are described in terms of profiles, with each profile corresponding to one area of interest. These profiles are based on assessments made by the users on the documents consulted and their content. The users’ profiles are then grouped together to form “virtual communities” so that information such as documents of interest can be shared between users of the same community.

The first chapters study all the concepts of “structured and collaborative search” and present a method of validation. After a brief overview of the systems most frequently used today to search for documents, in particular search engines, the philosophy behind the thesis is explained. The different modules of the system are then presented together with the principal concepts such as documents, profiles and virtual communities. An introduction to genetic algorithms precedes the chapter on the virtual communities because the method used to build the virtual communities is based on them.

In the last chapters, the author sets out the applications of the thesis. An implementation of the system in a “real” environment then illustrates how an unsophisticated user may interact with the concepts. Finally, the author draws some conclusions in terms of achievements and of the limits of the algorithms and methodologies, introduces some possible solutions and presents future work.

All the algorithms and methods described in this thesis are implemented in a ready-to-use system. The development platform chosen is a GNU/Linux³ operating system running the graphical environment KDE, using the GNU development tools and KDevelop. The CVS system manages the source code, and the Bugzilla system manages the submission of bug reports.

The developed software currently consists of about […] lines of C++ source code, mostly developed by the author. The project is largely based on the basic R Library developed by the author, which consists of about […] lines of C++ source code. By way of comparison, the complete multi-platform Linux 2.2.18 kernel and drivers consist of about […] lines of C source code.

³ In fact, Linux is not an operating system but a kernel. The other tools, such as compilers or shells, needed to form an operating system are developed by the Free Software Foundation in the GNU project. So, when talking about the operating system based on the Linux kernel, it is more accurate to talk about the GNU/Linux system.

Chapter 1

Introduction

This introductory chapter sets out the problem treated in this thesis. It is organized as follows. The scope of the research is presented in section 1.1. The different elements of information retrieval environments, which deal with the description, storage and representation of, and access to, items of information, are presented in section 1.2. A comparison between information retrieval and traditional data retrieval is given in section 1.3. Section 1.4 presents the aim of the thesis. Section 1.5 introduces the mathematical notations used throughout this thesis. A simple document set that will be used as a case study in the thesis is presented in section 1.6. Finally, the structure of the present document as well as a guide for the reader are presented in section 1.7.

1.1 Scope of the Research

During the nineties, with the generalization of document-oriented systems such as the World Wide Web (WWW) [15], the amount of information available through electronic documents exploded [132, 133]. In 2000, 2.1 billion documents were available on the Internet, and this number was growing at a rate of more than 7 million pages each day. The gap between the development of “information technologies” for the retrieval and organization of electronic information and the techniques used to store and transfer this information has also increased.

Finding pertinent documents or relevant information within a set of documents is, generally speaking, a time-consuming process: the larger the set, the more complex the operation. It is however vital for document-oriented systems that information can be retrieved when necessary.

Paper libraries have faced this problem for ages: books need to be correctly stored, arranged and indexed in order to be found by the reader. Because the amount of information has become so large, users are often unable to find all the relevant information they are searching for. A recent article [59] underlined these problems by reporting the death of a volunteer in a study on asthma because one of the researchers had not found information relating to the potential dangers of an inhalation.

1.2 Information Retrieval Environments

This section presents the main tasks of a user in information retrieval environments as well as the role played by the systems.

1.2.1 User Part

In information retrieval systems the user translates his interest through a set of interactions. In the case of search engines, the user constructs a query, which is most frequently a logical expression of words representing the corresponding semantics, and receives a list of documents similar to the semantics of the query. Another method is for the user to rate a given set of documents according to his interest and to receive a list of documents similar to those with the highest rating.

In such cases, we say that the user performs a retrieval task. Nowadays, this is the most frequently employed operation to find pertinent information.

When retrieving information about “genetic algorithms”, the user may find a document dealing with the most recent discoveries in “bio-genetic techniques” and start to focus on this theme. In this case, the user performs a browsing task, which is also a way to retrieve information.

Each time a user performs a series of tasks and waits to receive pertinent information, these tasks are called pull actions. Figure 1.1 shows the different tasks involved in pull actions and used by users to interact with the system.

Figure 1.1: Basic interaction between users and retrieval systems. (Diagram elements: User, Retrieval, Browsing, Database.)

Another type of action is the so-called push action, where a user employs software agents and filtering techniques to automatically retrieve information, such as extracting all the documents discussing sport from newspaper sites. In this case, the user does not need to perform any sort of action to receive information.

1.2.2 Role played by the System

To offer users pertinent information, the systems need to “understand” the electronic documents, i.e. they need a logical view of them. Historically, these documents were represented by a set of keywords automatically derived from their content. Such a representation implies a loss of semantic information because the sequences of the words in the documents are not stored. The full text is, of course, the most complete logical view of a document, even though most information retrieval systems apply some operations on the documents to reduce the amount of information needed to represent their logical view.

Once the logical views are known, an information retrieval system needs a model to represent the documents and the users’ interests, and to make comparisons to match them. For the search engines, the users’ interests are queries and the model is known as an Information Retrieval Model.

1.3 Information versus Data Retrieval

In data retrieval systems the information, referred to as data, is defined using a precise structure and semantics. For example, in relational databases the information is stored in tables and is described by a finite set of fields, while each entry in a table is called a record and represents a piece of information. To query these systems, users make use of a data retrieval language such as the Structured Query Language (SQL) to formulate a query that finds the information exactly matching a set of given criteria. For example, to find all the documents containing the words “genetic” and “algorithms”, a pseudo-SQL query could be:

SELECT doc WHERE word_in='genetic' AND word_in='algorithms'

Some problems appear when this query is applied to a database containing a bibliography. In some books, “genetic algorithms” are referred to as “evolutionary meta-heuristics”, and the records without the words “genetic” and “algorithms” would not be selected by this query. It would nevertheless be nice if documents containing “evolutionary meta-heuristics” were retrieved as well¹. This method is therefore not suited to finding bibliographical entries dealing with specific subjects or topics, because it is hard to express a subject by using such types of queries. The reason is that natural languages are not as well structured as databases, and can be semantically ambiguous. Information retrieval systems must therefore “understand” the content of the documents to catch their semantic load, and this requires their analysis.
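The limitation of exact-match queries can be reproduced with a few lines of Python and an in-memory SQLite database. This is only an illustrative sketch: the table `doc_words`, its columns and the three sample records are hypothetical and do not come from the thesis. The pseudo-SQL query is expressed here with `GROUP BY` and `HAVING`, because a single row cannot equal two different words at once.

```python
import sqlite3

# Hypothetical bibliography: one row per (document, word) pair.
ROWS = [
    ("book_a", "genetic"), ("book_a", "algorithms"),
    ("book_b", "evolutionary"), ("book_b", "meta-heuristics"),
    ("book_c", "genetic"), ("book_c", "engineering"),
]

def docs_with_all_words(words):
    """Return the documents whose word list contains every given word."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE doc_words (doc TEXT, word TEXT)")
    con.executemany("INSERT INTO doc_words VALUES (?, ?)", ROWS)
    placeholders = ", ".join("?" for _ in words)
    query = (
        "SELECT doc FROM doc_words "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY doc HAVING COUNT(DISTINCT word) = ? "
        "ORDER BY doc"
    )
    result = [row[0] for row in con.execute(query, (*words, len(words)))]
    con.close()
    return result

matches = docs_with_all_words(["genetic", "algorithms"])
print(matches)
```

Only `book_a` is returned: `book_b`, which deals with the same subject under another name, is missed because none of its words matches the criteria exactly.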

Despite their disadvantages, data retrieval systems are widely used nowadays, particularly in libraries and bibliographical systems, because their type of information can easily be circumscribed with a small amount of precise information, such as the authors, titles, keywords and abstracts. Because the information is stored in a static structure², an advantage of data retrieval systems is that they can be optimized by using techniques like index keys. Data retrieval systems are therefore faster than information retrieval systems, which is the main reason why they are still used even if they are not the most efficient solution from the retrieval point of view.

1.4 Aim of the Thesis

This thesis lies within the scope of the research projects HyperPRISME and GALILEI. These projects propose an integrated approach for sharing documents among users called structured and collaborative search [142, 67, 68]. The philosophy of this approach is to describe users of the system in terms of profiles, with each profile corresponding to one area of interest. While users browse through a collection of documents, their profiles are computed on the basis of both the content of the retrieved documents and a relevance feedback process. These profiles are then grouped into virtual communities so that documents of interest can be shared between members of the same community. Section 2.5 details these concepts. This approach is the result of an analysis of information retrieval systems [81] and must be seen as a complementary solution to those already existing.

The aim of this thesis is to establish the feasibility of developing an information retrieval system based on this approach. Related to this objective, the following main tasks can be identified:

Computing methods and algorithms must be proposed for the different elements of the system, in particular for the clustering of profiles. The integration of these methods is also an important step towards a complete methodology.

A validation method must be developed to test these methods and algorithms.

An architecture suited for a complete system must be proposed.

1.5 Notations and Terminologies

The purpose of this section is to introduce most of the notations and terminologies used in this thesis.

Let us first introduce a basic notation that will be used to represent a set of elements: a set, $X$, consists of a given number, $n_X$, of entities, $x_i$. The indexes change according to the nature of the entities in the set.

¹ If it is known that genetic algorithms are evolutionary meta-heuristics, the query could be adapted. But such adaptations are limited by the user’s knowledge of the field of research and by the way the information is indexed.

² It is, of course, possible to change the structure of a database, but such changes are very rare in a database life cycle once it is in production.

As explained earlier, most of the time the logical views of documents are described by sets of index terms representing their semantic load. Note that not all the words in a document are necessarily used as index terms; the description of the different techniques used to compute the semantic load of a document is postponed until chapter 5 (these techniques are not necessary for the comprehension of the models presented in chapter 2).

The index terms describing a given document do not all have the same importance for that document, and determining the importance of each term is not an easy task.

Consider a large collection of documents. A word, $t_a$, contained in 80% of them is not very discriminatory, because a query based on this word will retrieve too many documents. Conversely, a word, $t_b$, found in only 1% of the documents is highly discriminatory. If the words $t_a$ and $t_b$ are contained in the same document, each of them has a different importance from the discriminatory point of view. To capture this importance, a numerical weight is associated with each index term in a document. Each model uses a specific method to compute these weights.
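The 80% and 1% figures can be made concrete by measuring the fraction of documents in which a word occurs. The sketch below is only an illustration: the function name `document_frequency` and the synthetic collection are assumptions, and the fraction it computes is not one of the weighting methods of the models discussed later.

```python
def document_frequency(term, documents):
    """Fraction of the documents that contain the given term."""
    containing = sum(1 for words in documents if term in words)
    return containing / len(documents)

# Hypothetical collection of 100 documents, each reduced to a set of words.
documents = [{"the"} for _ in range(100)]
for words in documents[:80]:
    words.add("system")          # present in 80% of the documents
documents[0].add("promethee")    # present in 1% of the documents

common = document_frequency("system", documents)
rare = document_frequency("promethee", documents)
print(common, rare)
```

A query on the first word would return 80 of the 100 documents, whereas a query on the second isolates a single one.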

Definition 1.1 (Set of Index Terms)

If $n_T$ is the number of index terms in the system and $t_i$ a generic index term, $T = \{t_1, \ldots, t_{n_T}\}$ represents the set of index terms.

Definition 1.2 (Documents)

There is a collection, $D$, of $n_D$ documents $d_j$. Each document consists of a set of index terms. A weight, $w_{i,j} \geq 0$, is associated with each index term, $t_i$, for each document, $d_j$. When an index term does not appear in a document, $w_{i,j} = 0$.

With each document, $d_j$, a vector of index terms $\vec{d}_j = (w_{1,j}, \ldots, w_{n_T,j})$ is associated. The function $g_i$ is defined as $g_i(\vec{d}_j) = w_{i,j}$.
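The vector representation of Definition 1.2 can be sketched in Python. The code below is a minimal illustration under stated assumptions: the weight of a term is taken to be its raw number of occurrences, which is only one possible choice since each model computes the weights in its own way, and the helper names `build_vectors` and `g` as well as the two sample documents are hypothetical.

```python
from collections import Counter

def build_vectors(raw_documents):
    """Map each document to a weight vector over a shared, sorted term list."""
    terms = sorted({word for words in raw_documents for word in words})
    vectors = []
    for words in raw_documents:
        counts = Counter(words)
        # Assumption: the weight is the raw occurrence count (0 when absent).
        vectors.append([counts.get(term, 0) for term in terms])
    return terms, vectors

def g(i, vector):
    """Return the weight of the i-th index term (0-based here) in a vector."""
    return vector[i]

# Two hypothetical documents given as word lists.
terms, vectors = build_vectors([
    ["smoke", "water", "water"],
    ["help", "water"],
])
print(terms)
print(vectors)
```

Each vector has one component per index term of the system, so a term that does not occur in a document simply contributes a zero.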

Definition 1.3 (Documents and Index Terms)

Let $D_i$ be the set of $n_{D_i}$ documents containing the index term $t_i$, i.e. for which $w_{i,j} > 0$. Let $D_{i,l}$ be the set of $n_{D_{i,l}}$ documents containing the index terms $t_i$ and $t_l$, i.e. for which $w_{i,j} > 0$ and $w_{l,j} > 0$.

Most of the time the index terms are considered as mutually independent, i.e. knowing the weight $w_{i,j}$ for the pair $(t_i, d_j)$ says nothing about the weight $w_{i+1,j}$ of the pair $(t_{i+1}, d_j)$. In certain situations, this constitutes a simplification. For example, let us suppose that the terms “algorithm” and “genetic” are used as indexes for documents dealing with genetic algorithms. In this field, a document containing the term “algorithm” will certainly include the term “genetic”. But taking advantage of correlations is not a simple task, and it is not actually clear whether taking such dependencies into account is an advantage in information retrieval. So, in the models presented, unless otherwise specified, the assumption is made that the index terms are independent.

The task of finding interesting documents is mostly the result of finding documents relevant to a given query.

Definition 1.4 (Document Sets)

For a given query, $q$, representing a user interest, $I$, an information retrieval system retrieves a set, $R$, of $n_R$ documents from the whole collection, $D$. It is possible to define:

$D^{rel}$: the subset of the collection $D$ containing the $n_{D^{rel}}$ documents relevant to $I$.

$D^{irr}$: the subset of the collection $D$ containing the $n_{D^{irr}}$ documents irrelevant to $I$.

$R^{rel}$: the subset of the retrieved set $R$ containing the $n_{R^{rel}}$ documents relevant to $I$.

$R^{rel}_i$: the subset of the retrieved relevant documents $R^{rel}$ made up of the $n_{R^{rel}_i}$ relevant documents containing the index term $t_i$.

$R^{irr}$: the subset of the retrieved set $R$ containing the $n_{R^{irr}}$ documents irrelevant to $I$.

Subsets $D^{rel}$ and $D^{irr}$ are mostly unknown. To construct subsets $R^{rel}$ and $R^{irr}$ from $R$, the information retrieval system needs external help, for example the user giving an assessment of the relevance of the documents in $R$. In an ideal information retrieval system, subsets $R^{rel}$ and $R^{irr}$ are always identical to subsets $D^{rel}$ and $D^{irr}$.
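The subsets of Definition 1.4 map directly onto set operations. The following sketch is purely illustrative: the document identifiers, the content of the retrieved set and the user assessments are all hypothetical, and the variable names are not part of the notation of this thesis.

```python
# Whole collection and the (normally unknown) relevant subset.
collection = {f"d{j}" for j in range(10)}
relevant_in_collection = {"d0", "d1", "d2", "d3", "d4"}
irrelevant_in_collection = collection - relevant_in_collection

# Documents returned by a hypothetical system for the query.
retrieved = {"d1", "d2", "d3", "d4", "d7"}

# External help: hypothetical relevance assessments given by the user
# on the retrieved documents only.
assessments = {"d1": True, "d2": True, "d3": True, "d4": True, "d7": False}

retrieved_relevant = {d for d in retrieved if assessments[d]}
retrieved_irrelevant = retrieved - retrieved_relevant

print(sorted(retrieved_relevant))
print(sorted(retrieved_irrelevant))
```

The system only ever sees the assessments, so the two retrieved subsets are all it can build; the subsets of the whole collection stay out of reach unless every document is assessed.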

1.6 Documents Set “DB”

Since all the concepts presented in this thesis are based on sets of documents, an imaginary documents set “DB” is introduced for didactic purposes. It consists of 10 documents issued from 2 different topics (or subsets) in which 24 different index terms appear:

A topic of 5 documents ($d_0$–$d_4$) is related to the rock band “Deep Purple”.

A topic of 5 documents ($d_5$–$d_9$) is related to the rock band the “Beatles”.

Table 1.1 presents the documents, the index terms and the number of occurrences³ of the index terms in the documents.

                   Deep Purple                  Beatles
Index term         d_0  d_1  d_2  d_3  d_4      d_5  d_6  d_7  d_8  d_9
t_1: Beatles       0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_2: Blackmore     ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_3: child         ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_4: Deep          0    ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_5: Gillian       ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_6: Glover        ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_7: guitar        ≥1   ≥1   ≥1   ≥1   ≥1       ≥1   ≥1   ≥1   ≥1   ≥1
t_8: help          0    0    0    0    0        0    ≥1   0    ≥1   ≥1
t_9: hard          ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_10: Harrisson    0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_11: Lennon       0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_12: Lord         ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_13: McCartney    0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_14: Michelle     0    0    0    0    0        ≥1   ≥1   0    ≥1   0
t_15: music        ≥1   ≥1   ≥1   ≥1   ≥1       ≥1   ≥1   ≥1   ≥1   ≥1
t_16: Paice        ≥1   ≥1   0    ≥1   0        0    0    0    0    0
t_17: please       0    0    0    0    0        ≥1   0    ≥1   0    ≥1
t_18: Purple       ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_19: Ringo        0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_20: rock         ≥1   ≥1   ≥1   ≥1   ≥1       ≥1   ≥1   ≥1   ≥1   ≥1
t_21: roll         0    0    0    0    0        ≥1   ≥1   ≥1   ≥1   ≥1
t_22: smoke        ≥1   0    ≥1   0    ≥1       0    0    0    0    0
t_23: time         ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0
t_24: water        ≥1   ≥1   ≥1   ≥1   ≥1       0    0    0    0    0

Table 1.1: “DB” documents set.

The “DB” set has several interesting properties:

some index terms only appear in one of the topics and not in the other one ($t_1$ “Beatles” or $t_6$ “Glover”).

some index terms appear in both topics ($t_7$ “guitar” or $t_{20}$ “rock”).

some index terms do not appear in each document constituting a topic ($t_{16}$ “Paice” or $t_{14}$ “Michelle”).

It is also supposed that both users Pal and Popeye are interested in “Deep Purple”, and that users Alain and Marco are interested in the “Beatles”. Two queries represented by a set of index terms are used to find the documents relevant to each topic:

³ These occurrences will be used as a basis for the index term weights in the different models presented in chapter 2.

Query $q_1$ is used to find documents about “Deep Purple” and contains the four words “Deep Purple hard rock”.

Query $q_2$ is used to find documents about the “Beatles” and contains the four words “Beatles Lennon McCartney rock”.

It is interesting to note that the same index term $t_{20}$ “rock” is used in both queries.

Remark: Of course, the “DB” document set cannot be used to validate any kind of method. The purpose is only to show how each method works. It will be used for this purpose in chapter 2.

1.7 Structure of the Thesis

A flowchart of the dissertation structure is shown in Figure 1.2 and discussed below. Dotted lines represent non essential paths to the appendices.

2 Information Retrieval Systems: Chapter 2 describes the principal types of existing information retrieval systems. The information retrieval system proposed in this thesis, called GALILEI, is also described.

A crucial issue is the evaluation of the performance of GALILEI. Two chapters propose solutions for this problem:

3 Validation: One of the key topics when discussing problems and computed solutions is the evaluation of their quality. It therefore becomes necessary to introduce quality measures and a specific validation methodology that simulates real users. Chapter 3 proposes these tools.

4 Evaluation of the Subprofile Clusterings: In a real-world situation, the ideal solution is, of course, unknown and it is necessary to evaluate the quality of a solution. In chapter 4, some criteria are proposed that can be used to measure this quality.

The GALILEI system involves computing methods and algorithms. Five chapters present them and study the performance of GALILEI:

5 Documents: Chapter 5 details how the documents are handled; in particular, the important concept of information entity is introduced.

6 Profiles: Chapter 6 details how the users are modeled as profiles and subprofiles, and the different methods used to describe them.

7 Virtual Communities: Chapter 7 details an important problem of this thesis, the clustering of subprofiles into virtual communities.

8 Results: Chapter 8 presents the main results, and studies the influence of the different parameters of the proposed information retrieval system.

Since this thesis lies within the scope of a research project whose goal is the development of a complete information retrieval system, the algorithms and methods needed by GALILEI were developed as a set of C++ classes. Two chapters are dedicated to the implementation of GALILEI as a complete integrated system:

9 Architecture: To implement a system, it is necessary to develop an underlying architecture that makes the different components of the system communicate. Chapter 9 discusses such an architecture for GALILEI.

10 The Client Prototype: Chapter 10 describes how a client prototype can be developed and integrated in a user environment. A client is a user application providing an interface to a given set of services proposed by the information retrieval system of this thesis.

Figure 1.2: Flowchart of the dissertation structure. (Nodes: 1. Introduction; 2. Information Retrieval Systems; 3. Validation; 4. Evaluation of the Subprofile Clusterings; 5. Documents; 6. Profiles; 7. Virtual Communities; 8. Results; 9. Architecture; 10. The Client Prototype; 11. Conclusions. Appendices: A. Algorithmic Conventions; B. XML Language; C. DocXML DTD and DocXML Schemas; D. The PROMETHEE Method; E. Multi-Criteria Grouping Problems.)

Finally:

11 Conclusions In chapter 11, the main results of the previous chapters are summarized and discussed.

The appendices are related to: a short introduction to the algorithmic conventions used and some specific algorithms adapted for this thesis (appendix A); an overview of the XML language (appendix B); a formal definition of the internal document representation used in this thesis (appendix C); an overview of the multi-criteria decision-aid method PROMETHEE (appendix D); a general introduction to grouping and multi-criteria problems (appendix E) containing useful information to fully understand chapter 7.

Reading Suggestion

The chapters and sections listed below are essential for a good understanding of this thesis:

Chapter/Section   Title
1                 Introduction
2.4               Why Another Solution?
2.5               Structured and Collaborative Search
3.1.1             Recall and Precision in Information Retrieval
3.2               Evaluation Methodology
3.3               Validation Measures
4.4               What is a Good Solution?
5                 Documents
6                 Profiles
7.2               Genetic Virtual Communities Algorithm (GVCA)
8.10              Conclusions of the Results
9.4               Architecture
10                The Client Prototype
11                Conclusions

In addition, the reader is advised to read the introduction and the conclusion of each chapter.

Chapter 2

Information Retrieval Systems

Since it is vital for document-oriented systems that information can be retrieved when necessary, information retrieval has developed into an important field of research in computer science. This chapter is devoted to the various solutions existing today. The principal types of information retrieval systems in existence will be described:

Document Rankings (section 2.1).

Text Categorization (section 2.2).

Collaborative Filtering (section 2.3).

An overall comparison (section 2.4) of these methods reveals some good results and pertinent features, but also some important drawbacks. The work presented in this thesis aims at proposing a new approach to this difficult problem, called structured and collaborative search (section 2.5). Section 2.6 briefly introduces the semantic web. Finally, a conclusion is presented in section 2.7.

2.1 Document Rankings

This approach is the same as for search engines: the users enter a query and the systems propose a list of potentially interesting documents ranked so that the most interesting documents are at the top of the list.

As explained earlier, most information retrieval systems represent documents by means of index terms. This means that the semantic load of the documents and the users’ interests are both expressed through a set of index terms, which is clearly a simplification. Thus, when documents are retrieved, it is normal that many of them are irrelevant. For example, asking for documents containing “genetic algorithms” will retrieve a document containing the sentence “this document has nothing to do with genetic algorithms”, which is, of course, irrelevant to the subject. It is therefore important for information retrieval systems to evaluate which documents may be relevant and which may not.

To decide upon the relevance of a document, the systems usually need ranking algorithms to order their documents by relevance, i.e. the documents at the top of the list are the most relevant. These ranking algorithms are the core of the systems. There are three basic information retrieval models [92], labeled Boolean, vector, and probabilistic. In the Boolean model, documents and queries are represented through sets of index terms, and the model is said to be set theoretical. In the vector model, documents and queries are represented as vectors in the word space, and the model is said to be algebraic. In the probabilistic model, the representation of documents and queries is based on probability theory, and the model is said to be probabilistic. Through the years, the basic models have been extended.

2.1.1 Set Theoretical Models

2.1.1.1 The Boolean Model

This section presents the Boolean model, which is a very simple model based on set theory and Boolean algebra. It is easy for ordinary users of Information Retrieval (IR) systems to understand because the concept of a set is very intuitive, and queries are expressed through Boolean expressions, which have precise semantics. Moreover, Boolean operations and their implementations are well known [228] in the field of information retrieval. This model was widely used in the early years of information retrieval because of its simplicity.

The Boolean model considers that a given index term is either present or absent in a document, so its weight can only be 0 or 1, i.e. $w_{i,j} \in \{0, 1\}$. A query, $q$, consists of index terms linked by the operators or, and, and not. For example, the query $q = t_a \wedge (t_b \vee \neg t_c)$ (see Figure 2.1) retrieves all the documents that contain the term $t_a$ and, in addition, either contain the term $t_b$ or do not contain the term $t_c$. It is always possible to express a query as a disjunction of conjunctive vectors, i.e. to construct its disjunctive normal form (DNF) [116]. For example, query $q$ can be written as $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, where each component is a binary vector associated with the tuple $(t_a, t_b, t_c)$.
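The disjunctive normal form of such a small query can be obtained by enumerating its truth table. The sketch below does this for the example query on three terms; the function names are hypothetical and the brute-force enumeration is chosen for clarity, not as the way a real system would normalize queries.

```python
from itertools import product

def query(a, b, c):
    """Example query: t_a AND (t_b OR NOT t_c), on 0/1 term indicators."""
    return bool(a and (b or not c))

def disjunctive_normal_form(predicate, n_terms):
    """List every binary vector for which the predicate holds."""
    return [bits for bits in product((1, 0), repeat=n_terms) if predicate(*bits)]

dnf = disjunctive_normal_form(query, 3)
print(dnf)
```

Each tuple returned is one conjunctive component: it fixes the presence (1) or absence (0) of every term of the tuple of index terms.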

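Figure 2.1: Query $q = t_a \wedge (t_b \vee \neg t_c)$. (The diagram shows the three sets of documents associated with the three index terms of the query and marks the regions corresponding to the conjunctive components (1,1,1), (1,1,0) and (1,0,0).)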

Definition 2.1 (Index Term Weights in the Boolean Model)

In the Boolean model, the weight associated with a pair $(t_i, d_j)$ is represented by a binary variable, i.e. $w_{i,j} \in \{0, 1\}$.

Definition 2.2 (Queries in the Boolean Model)

Let $\vec{q}_{dnf}$ be the disjunctive normal form for a given query $q$. Let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. Let $n_{cc}$ be the number of conjunctive components of $\vec{q}_{dnf}$ and $\vec{q}_{cc_i}$ a reference to the i-th conjunctive component. $\vec{q}_{dnf}$ can be written

$$\vec{q}_{dnf} = \vec{q}_{cc_1} \vee \vec{q}_{cc_2} \vee \cdots \vee \vec{q}_{cc_{n_{cc}}} \qquad (2.1)$$

The model uses the similarity between a given query and the documents of the system to retrieve those that are relevant.

Definition 2.3 (Similarity in the Boolean Model)

The similarity of a document, $d_j$, to query $q$ is given by

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall t_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases} \qquad (2.2)$$

If $sim(d_j, q) = 1$, the model supposes that the document $d_j$ is relevant to the query $q$ and retrieves it; otherwise, $d_j$ is considered as not being relevant.

In other words, the system evaluates the proposition on each document and retrieves those for which the evaluation is true.
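The Boolean similarity can be sketched as a membership test: a document scores 1 when its binary vector equals one of the conjunctive components of the query, and 0 otherwise. The code below assumes that documents and components are binary tuples over the same ordered list of three index terms; the function name, the document names and the sample vectors are hypothetical.

```python
def similarity(document, dnf):
    """Return 1 if the document vector equals one conjunctive component, else 0."""
    return 1 if tuple(document) in set(dnf) else 0

# Conjunctive components of a three-term example query.
DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

# Hypothetical documents described over the same three index terms.
documents = {"d_x": (1, 0, 0), "d_y": (1, 0, 1), "d_z": (0, 1, 1)}

retrieved = [name for name, vec in documents.items() if similarity(vec, DNF) == 1]
print(retrieved)
```

The score is strictly binary, so the model cannot rank the retrieved documents against one another: a document either satisfies the expression or it does not.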

2.1.1.2 “DB” Document Sets and the Boolean Model

The similarity given for the Boolean model in Definition 2.3 shows that a document is considered to be relevant if at least one of the conjunctive components of the query is present in it. So the first question to answer is how to interpret the queries, i.e. which logical operators must be used to form the Boolean expression with these index terms. There are two trivial choices:

1. Only and operators are used.

2. Only or operators are used.

It can be seen from Table 1.1 that the index term $t_{20}$ “rock” appears in every document of the collection. Therefore, if only or operators are used, all the documents will be considered as being relevant to $q_1$ and $q_2$ because the index term $t_{20}$ “rock” is contained in both queries, and the user interested in “Deep Purple” will retrieve documents about the “Beatles” and vice versa. When only and operators are used, query $q_1$ will not retrieve document $d_0$ because the index term $t_4$ “Deep” does not appear in it. Query $q_2$ will retrieve the corresponding relevant subset because each document of the topic contains the index terms used in the corresponding query. Table 2.1 shows the documents retrieved (×) by the different combinations of queries and logical operators.

                          Deep Purple                  Beatles
                          d_0  d_1  d_2  d_3  d_4      d_5  d_6  d_7  d_8  d_9
q_1 with and operators         ×    ×    ×    ×
q_1 with or operators     ×    ×    ×    ×    ×        ×    ×    ×    ×    ×
q_2 with and operators                                 ×    ×    ×    ×    ×
q_2 with or operators     ×    ×    ×    ×    ×        ×    ×    ×    ×    ×

Table 2.1: “DB” documents set and Boolean Model.

This simple example shows the principal problem with Boolean queries, namely that it is difficult to find the right expression to retrieve the relevant sub-set of the collection.
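The four combinations of queries and operators can be replayed with a short Python sketch. It only uses the presence or absence of the seven query terms in the ten documents, as read from Table 1.1; the dictionary `OCCURS_IN`, the function `retrieve` and the use of integers 0 to 9 for the documents are assumptions made for the illustration.

```python
# Documents of the "DB" set in which each query term occurs (presence only).
OCCURS_IN = {
    "Deep": {1, 2, 3, 4},
    "Purple": {0, 1, 2, 3, 4},
    "hard": {0, 1, 2, 3, 4},
    "rock": set(range(10)),
    "Beatles": {5, 6, 7, 8, 9},
    "Lennon": {5, 6, 7, 8, 9},
    "McCartney": {5, 6, 7, 8, 9},
}

def retrieve(query_terms, operator):
    """Return the sorted document indexes matched by an all-and or all-or query."""
    postings = [OCCURS_IN[term] for term in query_terms]
    if operator == "and":
        matched = set.intersection(*postings)
    elif operator == "or":
        matched = set.union(*postings)
    else:
        raise ValueError("operator must be 'and' or 'or'")
    return sorted(matched)

Q1 = ["Deep", "Purple", "hard", "rock"]
Q2 = ["Beatles", "Lennon", "McCartney", "rock"]
for name, terms in (("q1", Q1), ("q2", Q2)):
    for operator in ("and", "or"):
        print(name, operator, retrieve(terms, operator))
```

With and operators the first query misses one document of its topic, while with or operators both queries return the whole collection because of the shared term “rock”.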
