
Structured and Collaborative Search: An integrated approach to share documents


Academic year: 2021


Faculty of Applied Sciences, Department of Analytical Mechanics and CAD/CAM

Structured and Collaborative Search: An integrated approach to share documents among users

Original dissertation presented by Pascal Francq in view of obtaining the degree of Doctor of Applied Sciences

Supervisor: Alain Delchambre

Academic Year 2002–2003

Contents

Acknowledgments

Preface

1 Introduction

1.1 Scope of the Research

1.2 Information Retrieval Environments

1.2.1 User Part

1.2.2 Role played by the System

1.3 Information versus Data Retrieval

1.4 Aim of the Thesis

1.5 Notations and Terminologies

1.6 Documents Set “DB”

1.7 Structure of the Thesis

2 Information Retrieval Systems

2.1 Document Rankings

2.1.1 Set Theoretical Models

2.1.2 Algebraic Models

2.1.3 Probabilistic Models

2.1.4 Brief Comparison of the Basic Models

2.2 Text Categorization

2.2.1 Probabilistic Classifiers

2.2.2 Artificial Neural Networks

2.2.3 Example-Based Classifiers

2.2.4 Brief Comparison of the Text Categorization Methods

2.3 Collaborative Filtering

2.4 Why Another Solution?

2.5 Structured and Collaborative Search

2.5.1 Basic Concept

2.5.2 In Practice

2.5.3 Hypotheses

2.5.4 Profiles and Assessments

2.5.5 Virtual Communities

2.5.6 Language and Subprofiles

2.6 Semantic Web

2.7 Conclusions

3 Validation

3.1 Evaluation of Document Rankings Performance

3.1.1 Recall and Precision in Information Retrieval

3.1.2 The Harmonic Mean

3.1.3 The E Measure

3.2 Validation Methodology

3.2.1 Aim of the Validation

3.2.2 Topics

3.2.3 Paradigm of the Methodology

3.2.4 Initial Process

3.2.5 Initial Assessments

3.2.6 Feedback Process

3.2.7 Remarks on the Methodology

3.3 Validation Measures

3.3.1 Adjusted Rand Index

3.3.2 Precision

3.3.3 Recall

3.3.4 Comparison of Measures

3.4 Erroneous Assessments

3.5 Automated Creation of Subprofile Clusterings

3.5.1 Paradigm of the Creation

3.5.2 Randomly Merging Virtual Communities

3.5.3 Randomly Splitting a Virtual Community

3.6 Conclusions

4 Evaluation of the Subprofile Clusterings

4.1 Exact and Approximate Multi-Criteria Problems

4.2 Subprofiles Documents Sets

4.3 Agreement and Disagreement Ratios

4.4 What is a Good Solution?

4.5 Similarity Criterion

4.5.1 Prototypes

4.5.2 The Measure of the Compactness of the Virtual Communities

4.5.3 The Measures of Quality of the Virtual Communities

4.5.4 Measures of Intra- and Inter-clusters Similarities

4.5.5 Comparisons between the Measures

4.6 Information Criteria

4.6.1 Measure of Entropy

4.6.2 Measure of Likelihood

4.6.3 The Number of Virtual Communities

4.6.4 Comparisons between the Information Criteria

4.7 Agreement and Disagreement Criteria

4.7.1 Agreement Criterion

4.7.2 Disagreement Criterion

4.8 Social Criterion

4.9 Conclusions

5 Documents

5.1 Document Preprocessing

5.1.1 Lexical Analysis

5.1.2 Elimination of Stopwords

5.1.3 Stemming

5.1.4 The Reduction of Dimensionality

5.1.5 Thesauri

5.1.6 Computational Linguistics

5.2 Multimedia Indexing and Searching

5.2.1 Spatial Access Methods

5.2.2 GEneric Multimedia object INdexIng (GEMINI)

5.3 Internal Representation of Document

5.3.1 Metadata, Content and Links

5.3.2 Dublin Core Metadata Initiative

5.3.3 DocXML Structure

5.3.4 DocXML and Non-Text Information

5.4 Information Entity

5.5 Description of Documents

5.6 Document Analysis

5.6.1 Stoplists

5.6.2 Extraction

5.6.3 Document Language

5.6.4 Index Term Selection

5.7 Inverse Document Frequency Factor

5.8 Similarities between Documents

5.8.1 Local Similarity

5.8.2 Global Similarity

5.8.3 Comparisons of Similarities

5.9 Statistics on Document Collections

5.9.1 Statistics on the Full Collections

5.9.2 Limitation of the Similarity Measure

5.9.3 Statistics on Reduced Collections

5.10 Information Entity Filtering

5.10.1 Minimum Number of Appearance Filtering

5.10.2 Minimum Number of Occurrence Filtering

5.10.3 Overall Filtering

5.11 Conclusions

6 Profiles

6.1 User Relevance Feedback

6.1.1 Query Expansion and the Vector Model

6.1.2 Term Re-weighting for the Probabilistic Model

6.1.3 A Variant of Probabilistic Term Re-weighting

6.1.4 Evaluation of Relevance Feedback Strategies

6.2 Description of the Profiles and Subprofiles

6.3 Statistical Subprofile Computing Method

6.4 Feedback Subprofile Computing Methods

6.4.1 Optimist Feedback Subprofile Computing Method

6.4.2 Pessimist Feedback Subprofile Computing Method

6.4.3 Feedback Methods and Subprofiles Descriptions

6.5 Inverse Subprofile Frequency Factor

6.6 Similarities between Subprofiles

6.6.1 Local Similarity

6.6.2 Global Similarity

6.7 “DB” Subprofiles

6.7.1 Statistical Subprofile Computing Method and “DB” Subprofiles

6.7.2 Optimist Feedback Subprofile Computing Method and “DB” Subprofiles

6.7.3 Pessimist Feedback Subprofile Computing Method and “DB” Subprofiles

6.8 Comparisons of the Subprofile Computing Methods

6.9 Similarities between Documents and Subprofiles

6.9.1 Local Similarity

6.9.2 Global Similarity

6.9.3 Comparisons of Similarity

6.9.4 Similarities and the English-language Set

6.10 Conclusions

7 Virtual Communities

7.1 What is a Good Clustering Method?

7.2 Genetic Virtual Communities Algorithm (GVCA)

7.2.1 The Structure of Genetic Algorithms

7.2.2 The Structure of the Genetic Virtual Communities Algorithm

7.2.3 Algorithm Input

7.2.4 Information Retrieval Heuristic (IRH)

7.2.5 Initialization Operator

7.2.6 Local Optimization Operator

7.2.7 Exit Condition

7.2.8 Sub-Evaluation Problem

7.3 Description of Virtual Communities

7.3.1 Prototype Method

7.3.2 Centroid Method

7.4 Inverse Community Frequency Factor

7.5 Similarities between Virtual Communities

7.5.1 Local Similarity

7.5.2 Global Similarity

7.6 Virtual Community Based Cooperation

7.6.1 Composition of the Virtual Communities

7.6.2 Suggested Documents for Subprofiles

7.6.3 Other Documents Relevant to Virtual Communities

7.7 Relevant Virtual Communities to Queries

7.7.1 Local Similarity

7.7.2 Global Similarity

7.7.3 Search for Virtual Communities

7.7.4 Automatic Query Construction

7.7.5 Search using the English-Language Set

7.8 Conclusions

8 Results

8.1 Test Procedure

8.1.1 Paradigm

8.1.2 Test Procedure Parameters

8.1.3 Grouping Parameters

8.1.4 Random Clustering Method

8.2 The Results of Clustering

8.3 Influence of the Test Procedure Parameters

8.3.1 Number of Topics

8.3.2 Erroneous Assessments

8.3.3 Initial Process

8.3.4 Feedback Process

8.3.5 Conclusions

8.4 Influence of the Subprofile Descriptions

8.5 The Influence of the Grouping Parameters

8.5.1 Importance of the Local Optimization Operator

8.5.2 Number of Generations

8.5.3 Maximum Number of k-MeansCosinus Steps

8.5.4 Minimum Agreement and Disagreement Thresholds

8.5.5 Minimum Similarity Threshold

8.5.6 Conclusions

8.6 Other Criteria

8.6.1 The Measures of Quality of the Virtual Communities

8.6.2 Measures of Intra- and Inter-clusters Similarities

8.6.3 Information Criteria

8.6.4 Regression

8.6.5 Conclusions

8.7 New Profiles

8.7.1 The Adapted Test Procedure

8.7.2 New Profiles Linked to Existing Topics

8.7.3 New Profiles Linked to New Topics

8.7.4 New Profiles Linked to Existing and New Topics

8.7.5 Simulation of a Real System

8.7.6 Conclusions

8.8 Other Document Collections

8.8.1 The “Le Soir” Document Collection

8.8.2 The “Ziff” Document Collection

8.8.3 The “AFP” Document Collection

8.8.4 Conclusions

8.9 Other Clustering Methods

8.9.1 The k-MeansProtos Algorithm

8.9.2 The CURE Algorithm

8.9.3 The SUP k-Means Algorithm

8.9.4 Comparison of Clustering Methods

8.10 Conclusions

9 Architecture

9.1 Importance of an Implementation Part

9.2 Architectural Paradigms

9.2.1 Client-Server Architecture

9.2.2 Peer-to-Peer Architecture

9.3 Parallelization

9.3.1 Document Module

9.3.2 Profile Module

9.3.3 Virtual Communities Module

9.3.4 Various Services

9.3.5 Single Identifiers and Parallelization

9.3.6 Data and Computing Processes

9.4 Architecture

9.4.1 Specifications

9.4.2 Main Architecture

9.4.3 Client-side

9.4.4 Server-side

9.4.5 Protocols

9.4.6 User Permissions

9.5 Conclusions

10 The Client Prototype

10.1 Basic Features

10.2 Integration into the Graphical User Environment

10.3 Interactions with External Software

10.3.1 File Manager

10.3.2 Web Browser

10.3.3 Universal Viewer

10.3.4 Desktop COmmunication Protocol (DCOP)

10.3.5 Non-GNU/Linux Applications

10.4 Graphical User Interface

10.4.1 Settings

10.4.2 History

10.4.3 The Composition of the Virtual Communities

10.4.4 Suggested Documents

10.5 Conclusions

11 Conclusions

11.1 Summary of the Results

11.2 Further Research

11.2.1 Validation

11.2.2 Evaluation of the Solutions

11.2.3 Documents

11.2.4 Profiles

11.2.5 Virtual Communities

11.2.6 Architecture

11.2.7 Client

11.3 Thesis::~Thesis(void)

Appendices

A Algorithmic Conventions

A.1 Sets and Entities

A.2 Basic Instructions

A.2.1 The if-then-else Construct

A.2.2 The for Loop

A.2.3 The while Loop

A.2.4 The repeat-until Loop

A.2.5 Functions

A.2.6 Comments

A.3 Random Functions

A.4 Grouping Genetic Algorithms Operators

A.4.1 Tournament Selection

A.4.2 Inversion Operator

A.4.3 Mutation Operator

A.4.4 Crossover Operator

B XML Language

B.1 XML Format

B.2 Document Type Declaration

B.3 XML Namespaces

B.4 XML Schemata

B.5 Resource Description Framework

B.6 Important XML Related Specifications

C DocXML DTD and XML Schema

C.1 DTD of DocXML

C.2 XML Schema of DocXML

D The PROMETHEE Method

D.1 Enrichment of the Preference Structure

D.2 Enrichment of the Dominance Relation

D.3 Exploitation for Decision Aid

D.4 PROMETHEE I Ranking

D.5 PROMETHEE II Ranking

D.6 An Example of PROMETHEE

E Multi-Criteria Grouping Problems

E.1 Algorithms, Heuristics and Meta-Heuristics

E.1.1 Problems

E.1.2 Algorithms

E.1.3 Heuristics

E.1.4 Meta-heuristics

E.2 What are Genetic Algorithms?

E.2.1 Definition of a Genetic Algorithm

E.2.2 Origin of Genetic Algorithms

E.2.3 Genetic Algorithms and other Methods

E.2.4 Paradigm

E.2.5 Theoretical Grounds

E.3 Grouping Problems

E.3.1 Partitional Algorithms

E.3.2 Nearest Neighbor Clustering

E.4 Grouping Genetic Algorithms

E.4.1 Encoding

E.4.2 Identical Solutions but Different Chromosomes

E.4.3 Initialization

E.4.4 Crossover

E.4.5 Mutation

E.4.6 Inversion

E.5 Multiple Objective Problems

E.5.1 Pareto Fronts

E.5.2 Use of Aggregating Functions

E.5.3 Non-Pareto Approaches

E.5.4 Pareto-based Approaches

E.6 Multiple Objective Grouping Genetic Algorithms

E.6.1 Philosophy

E.6.2 Control Strategy

E.6.3 Branching on Populations

E.6.4 Verification of the Best Solutions

E.7 Conclusions

F Acronyms

Bibliography

Acknowledgments

I first want to thank Professor Alain Delchambre, the director of the CAD/CAM department, for his help. The present work would not exist without his encouragement. Professor Marco Saerens made multiple suggestions concerning the ideas, and our long discussions have greatly influenced the results of this work.

Thomas L’Eglise read this work carefully and proposed many improvements to the presentation of the concepts. Anything that remains unclear is due to my inability to manipulate English correctly.

This work is part of the HyperPRISME¹ and GALILEI² projects funded by the “Region wallonne”. All the people who have worked on these projects must be thanked for their help, and particularly Professor Hervé Gilson, whose tragic demise took place during the GALILEI project. He left us with an immense gap that we cannot fill.

All the members of the CAD/CAM department contributed indirectly to this work by ensuring an enjoyable working environment and stoically enduring me during the writing of this thesis.

I want to thank all those who agreed to form part of my jury, namely Professors Hugues Bersini, Pierrette Bouillon (Université de Genève), Alain Delchambre, Pierre Gaspart, François Heinderyckx and Marco Saerens (Université Catholique de Louvain).

All the software used during the realization of this thesis is open source or free software. All the contributors to these projects must be congratulated for the quality of their work. The open source and free software community proposes a realistic alternative to the ultra-liberal society in which we presently live.

Finally, I am grateful to Leo Fender and Georges Jenny, the fathers of electric guitars and electronic keyboards respectively. Without them, many bands such as Deep Purple, Rainbow, Led Zeppelin, Black Sabbath, the Ramones, Pink Floyd, Whitesnake, Die Toten Hosen, UFO, the Who, Kraftwerk, Van Halen, Rammstein and Diary of Dreams would not be able to express their creativity, which would decrease a large part of my inspiration and sanity.

¹ Contract 9613457

² Contract 01/1/4675

Preface

One of the most important problems in computer science today is document management. Several approaches, such as search engines, text categorization and collaborative filtering, exist to help users find relevant documents. The aim of this thesis is to develop a set of tools to tackle the problem of collaborative search in document-oriented contexts. The users of the system are described in terms of profiles, with each profile corresponding to one area of interest. These profiles are based on the assessments made by the users on the documents they consult and on the content of these documents. The users’ profiles are then grouped together to form “virtual communities” so that information such as documents of interest can be shared between users of the same community.

The first chapters study the concepts of “structured and collaborative search” and present a method of validation. After a brief overview of the systems most frequently used today to search for documents, in particular search engines, the philosophy behind the thesis is explained. The different modules of the system then introduce the principal concepts: documents, profiles and virtual communities. An introduction to genetic algorithms precedes the chapter on virtual communities because the method used to build the virtual communities is based on them.

In the last chapters, the author sets out the applications of the thesis. An implementation of the system in a “real” environment then illustrates how an unsophisticated user may interact with the concepts. Finally, the author draws some conclusions in terms of achievements and of the limits of the algorithms and methodologies, introduces some possible solutions and presents future work.

All the algorithms and methods described in this thesis are implemented in a ready-to-use system. The development platform chosen is a GNU/Linux³ operating system running the graphical environment KDE, using the GNU development tools and KDevelop. The CVS system manages the source code, and the Bugzilla system manages the submission of bug reports.

The developed software currently consists of about […] lines of C++ source code, mostly developed by the author. The project is largely based on the basic R Library, developed by the author, which consists of about […] lines of C++ source code. By way of comparison, the complete multi-platform Linux 2.2.18 kernel and drivers consist of about […] lines of C source code.

³ In fact, Linux is not an operating system but a kernel. The other tools needed to form an operating system, such as compilers or shells, are developed by the Free Software Foundation in the GNU project. So, when talking about the operating system based on the Linux kernel, it is more accurate to talk about the GNU/Linux system.

Chapter 1

Introduction

This introductory chapter presents the problem treated in this thesis. It is organized as follows. The scope of the research is presented in section 1.1. The different elements of information retrieval environments, which deal with the description, storage and representation of, and access to, items of information, are presented in section 1.2. A comparison between information retrieval and traditional data retrieval is given in section 1.3. Section 1.4 presents the aim of the thesis. Section 1.5 introduces the mathematical notations used throughout this thesis. A simple document set that will be used as a case study in the thesis is presented in section 1.6. Finally, the structure of the present document, as well as a guide for the reader, is presented in section 1.7.

1.1 Scope of the Research

During the nineties, with the generalization of document-oriented systems such as the World Wide Web (WWW) [15], the amount of information available through electronic documents exploded [132, 133]. In 2000, 2.1 billion documents were available on the Internet, and this number was growing at an explosive rate of more than 7 million pages each day. The gap between the development of “information technologies” for retrieving and organizing electronic information and the techniques used to store and transfer this information has also widened.

Finding pertinent documents or relevant information within a set of documents is generally a time-consuming process: the larger the set, the more complex the operation. It is however vital for document-oriented systems that information can be retrieved when necessary. Paper libraries have faced this problem for ages: books need to be correctly stored, arranged and indexed in order to be found by the reader. Because the amount of information has become so large, users are often unable to find all the relevant information they are searching for. A recent article [59] underlined these problems again by reporting the death of a volunteer in a study on asthma because one of the researchers had not found information relating to the potential dangers of an inhalation.

1.2 Information Retrieval Environments

This section presents the main tasks of a user in information retrieval environments as well as the role played by the systems.

1.2.1 User Part

In information retrieval systems the user expresses his interest through a set of interactions. In the case of search engines, the user constructs a query, most frequently a logical expression of words representing the intended meaning, and receives a list of documents whose content is similar to the meaning of the query. Another method is for the user to rate a given set of documents according to his interest and to receive a list of documents similar to those with the highest ratings. In such cases, we say that the user performs a retrieval task. Nowadays, this is the most frequently employed operation to find pertinent information.
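As a rough illustration of a query built as a logical expression of words, the following Python sketch matches a conjunctive (AND) and a disjunctive (OR) query against a toy collection. The document names, their contents and the two helper functions are hypothetical, and exact word matching is only a simplification of the notion of similarity used by real search engines.

```python
# Hypothetical toy collection: each document is reduced to its set of words.
documents = {
    "doc_a": {"genetic", "algorithms", "optimization"},
    "doc_b": {"genetic", "engineering", "biology"},
    "doc_c": {"sorting", "algorithms"},
}


def retrieve_all(words, collection):
    """Return the documents containing every word of the query (logical AND)."""
    wanted = set(words)
    return sorted(name for name, content in collection.items() if wanted <= content)


def retrieve_any(words, collection):
    """Return the documents containing at least one word of the query (logical OR)."""
    wanted = set(words)
    return sorted(name for name, content in collection.items() if wanted & content)


and_result = retrieve_all(["genetic", "algorithms"], documents)
or_result = retrieve_any(["genetic", "algorithms"], documents)
print(and_result)  # ['doc_a']
print(or_result)   # ['doc_a', 'doc_b', 'doc_c']
```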

When retrieving information about “genetic algorithms”, the user may find a document dealing with the most recent discoveries in “bio-genetic techniques” and start to focus on this theme. In this case, the user performs a browsing task, which is also a way to retrieve information.

Each time a user performs a series of tasks and waits to receive pertinent information, these tasks are called pull actions. Figure 1.1 shows the different tasks involved in pull actions and used by users to interact with the system.

Figure 1.1: Basic interaction between users and retrieval systems (diagram: the user reaches the database through retrieval and browsing).

Another type of action is the so-called push action, where a user employs software agents and filtering techniques to retrieve information automatically, for example by extracting all the documents discussing sport from newspaper sites. In this case, the user does not need to perform any action to receive information.

1.2.2 Role played by the System

To offer users pertinent information, the systems need to “understand” the electronic documents, i.e. they need a logical view of them. Historically, these documents were represented by a set of keywords automatically extracted from their content. Such a representation implies a loss of semantic information because the order of the words in the documents is not stored. The full text is, of course, the most complete logical view of a document, even though most information retrieval systems apply some operations to the documents to reduce the amount of information needed to represent their logical view.
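The loss of information caused by a keyword-based logical view can be made concrete with a small Python sketch. The function `keyword_view` is a hypothetical, deliberately minimal view (a set of lower-case words) and is not the extraction method used in this thesis; the two sentences are invented for the illustration.

```python
import re


def keyword_view(text):
    """Reduce a text to its set of lower-case words (a minimal, hypothetical logical view)."""
    return set(re.findall(r"[a-z]+", text.lower()))


first = "The user rates the document"
second = "The document rates the user"

# Both sentences contain the same words in a different order,
# so their keyword views cannot be told apart.
same_view = keyword_view(first) == keyword_view(second)
print(sorted(keyword_view(first)))  # ['document', 'rates', 'the', 'user']
print(same_view)                    # True
```

The two sentences have different meanings, yet they share the same logical view once the word order is dropped.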

Once the logical views are known, an information retrieval system needs a model to represent the documents and the users’ interests, and to compare them in order to match them. For search engines, the users’ interests are queries and the model is known as an Information Retrieval Model.

1.3 Information versus Data Retrieval

In data retrieval systems the information, referred to as data, is defined using a precise structure and semantics. For example, in relational databases the information is stored in tables and is described by a finite set of fields, while each entry in a table is called a record and represents a piece of information. To query these systems, users employ a data retrieval language such as the Structured Query Language (SQL) to formulate a query that finds the information exactly matching a set of given criteria. For example, to find all the documents containing the words “genetic” and “algorithms”, a pseudo-SQL query could be:

SELECT doc WHERE word_in='genetic' AND word_in='algorithms'
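One possible concrete form of such a query is sketched below with the `sqlite3` module of the Python standard library. The table `occurrences`, its columns and the three books are assumptions made only for this illustration; they do not describe an actual bibliographical database.

```python
import sqlite3

# Hypothetical schema: one row per (document, word) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences (doc TEXT, word TEXT)")
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?)",
    [
        ("book_1", "genetic"), ("book_1", "algorithms"),
        ("book_2", "evolutionary"), ("book_2", "meta-heuristics"),
        ("book_3", "genetic"), ("book_3", "engineering"),
    ],
)

# Select the documents in which both words occur.
query = """
    SELECT doc FROM occurrences
    WHERE word IN ('genetic', 'algorithms')
    GROUP BY doc
    HAVING COUNT(DISTINCT word) = 2
    ORDER BY doc
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)  # ['book_1']
conn.close()
```

Only the record that contains both words exactly is selected; a book indexed under other words is ignored, whatever its subject.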

Some problems will appear when this query is applied to a database containing a bibliography. In some books, “genetic algorithms” are referred to as “evolutionary meta-heuristics”, and records without the words “genetic” and “algorithms” will not be selected by this query. It would nevertheless be desirable for documents containing “evolutionary meta-heuristics” to be retrieved as well¹. This method is therefore not suited to finding bibliographical entries dealing with specific subjects or topics, because it is hard to express a subject with this type of query. The reason is that natural languages are not as well structured as databases and can be semantically ambiguous. Information retrieval systems must therefore “understand” the content of the documents to catch their semantic load, which requires analyzing them.

Despite their disadvantages, data retrieval systems are widely used nowadays, particularly in libraries and bibliographical systems, because their type of information can easily be circumscribed with a small amount of precise information, such as the authors, titles, keywords and abstracts. Because the information is stored in a static structure², an advantage of data retrieval systems is that they can be optimized by using techniques like index keys. Data retrieval systems are therefore faster than information retrieval systems, which is the main reason why they are still used even if they are not the most effective solution from the retrieval point of view.

¹ If it is known that genetic algorithms are evolutionary meta-heuristics, the query can be adapted. But such adaptations are limited by the user’s knowledge of the field of research and by the way the information is indexed.

² It is, of course, possible to change the structure of a database, but such changes are very rare in the life cycle of a database once it is in production.

1.4 Aim of the Thesis

This thesis lies within the scope of the research projects HyperPRISME and GALILEI. These projects propose an integrated approach for sharing documents among users called structured and collaborative search [142, 67, 68]. The philosophy of this approach is to describe users of the system in terms of profiles, with each profile corresponding to one area of interest. While users browse through a collection of documents, their profiles are computed on the basis of both the content of the retrieved documents and a relevance feedback process. These profiles are then grouped into virtual communities so that documents of interest can be shared between members of the same community. Section 2.5 details these concepts. This approach is the result of an analysis of information retrieval systems [81] and must be seen as a complementary solution to those already existing.

The aim of this thesis is to establish the feasibility of developing an information retrieval system based on this approach. In relation to this objective, the following main tasks can be identified:

Computing methods and algorithms must be proposed for the different elements of the system, in particular for the clustering of profiles. The integration of these methods is also an important step towards a complete methodology.

A validation method must be developed to test these methods and algorithms.

An architecture suited to a complete system must be proposed.

1.5 Notations and Terminologies

The purpose of this section is to introduce most of the notations and terminologies used in this thesis.

Let us first introduce a basic notation that will be used to represent a set of elements: a set, $X$, consists of a given number, $|X|$, of entities, $x_i$. The indexes change according to the nature of the entities in the set.

As explained earlier, the logical views of documents are most often described by sets of index terms representing their semantic load. Note that not all the words in a document are necessarily used as index terms; the description of the different techniques used to compute the semantic load of a document is postponed until chapter 5, since these techniques are not necessary for understanding the models presented in chapter 2.

The terms in a given set of index terms do not all have the same importance for a given document, and determining the importance of each term in a document is not an easy task. Consider a large collection of documents. A word, $t_a$, contained in 80% of them is not very discriminatory, because a query based on this word will retrieve too many documents. Conversely, a word, $t_b$, found in only 1% of the documents is highly discriminatory. If the words $t_a$ and $t_b$ are contained in the same document, each of them has a different importance from the discriminatory point of view. To capture this importance, a numerical weight is associated with each index term in a document. Each model uses a specific method to compute these weights.
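To give an order of magnitude for this difference in discriminatory power, the Python sketch below uses one common weighting factor, the logarithm of the ratio between the size of the collection and the number of documents containing the word. This formula and the collection size of 1000 documents are assumptions made for the illustration; they are not presented here as the weighting used by any particular model.

```python
import math


def idf(total_documents, documents_with_term):
    """Logarithm of (collection size / number of documents containing the term)."""
    return math.log(total_documents / documents_with_term)


total = 1000                # hypothetical collection size
common = idf(total, 800)    # word found in 80% of the documents
rare = idf(total, 10)       # word found in 1% of the documents
print(round(common, 3))  # 0.223
print(round(rare, 3))    # 4.605
```

Under this assumption the rare word receives a factor about twenty times larger than the common one.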

Definition 1.1 (Set of Index Terms)

If $|T|$ is the number of index terms in the system and $t_i$ a generic index term, $T = \{t_1, \ldots, t_{|T|}\}$ represents the set of index terms.

Definition 1.2 (Documents)

There is a collection, $D$, of $|D|$ documents $d_j$. Each document consists of a set of index terms. A weight, $w_{i,j} > 0$, is associated with each index term, $t_i$, of each document, $d_j$. When an index term does not appear in a document, $w_{i,j} = 0$.

With each document, $d_j$, a vector of index terms $\vec{d}_j = (w_{1,j}, \ldots, w_{|T|,j})$ is associated. The function $g_i$ is defined as $g_i(\vec{d}_j) = w_{i,j}$.
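As a minimal sketch of this definition, a document vector can be stored as a list of weights, one per index term, with the function returning the weight of a given term. The four index terms, the two documents and their weights below are invented for the illustration.

```python
# Hypothetical index terms t_1..t_4 (stored at positions 0..3 of the Python lists).
index_terms = ["algorithm", "genetic", "guitar", "rock"]

# Hypothetical weights of each index term in two documents; 0.0 marks an absent term.
document_vectors = {
    "d_1": [0.5, 0.8, 0.0, 0.0],
    "d_2": [0.0, 0.0, 0.7, 0.3],
}


def g(i, vector):
    """Return the weight of index term t_i (1-based, as in the definition) in a document vector."""
    return vector[i - 1]


weight = g(2, document_vectors["d_1"])   # weight of "genetic" in d_1
absent = g(3, document_vectors["d_1"])   # "guitar" does not appear in d_1
print(weight)  # 0.8
print(absent)  # 0.0
```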

Definition 1.3 (Documents and Index Terms)

Let $D_i$ be the set of $|D_i|$ documents containing the index term $t_i$, i.e. the documents $d_j$ for which $w_{i,j} > 0$. Let $D_{i,l}$ be the set of $|D_{i,l}|$ documents containing the index terms $t_i$ and $t_l$, i.e. the documents $d_j$ for which $w_{i,j} > 0$ and $w_{l,j} > 0$.
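These two sets can be obtained directly from the weights, the second one being the intersection of two sets of the first kind. The Python sketch below assumes a small, invented collection in which only the non-zero weights are stored; the function name is hypothetical.

```python
# Hypothetical weights: for each document, only the terms with a non-zero weight are stored.
weights = {
    "d_1": {"algorithm": 0.5, "genetic": 0.8},
    "d_2": {"algorithm": 0.4},
    "d_3": {"genetic": 0.6, "guitar": 0.2},
}


def documents_containing(term, weights):
    """Return the set of documents in which the weight of the term is strictly positive."""
    return {doc for doc, terms in weights.items() if terms.get(term, 0.0) > 0.0}


d_algorithm = documents_containing("algorithm", weights)
d_genetic = documents_containing("genetic", weights)
# Documents containing both terms: the intersection of the two previous sets.
d_both = d_algorithm & d_genetic
print(sorted(d_algorithm))  # ['d_1', 'd_2']
print(sorted(d_genetic))    # ['d_1', 'd_3']
print(sorted(d_both))       # ['d_1']
```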

Most of the time the index terms are considered as mutually independent, i.e. knowing the weight $w_{i,j}$ for the pair $(t_i, d_j)$ says nothing about the weight $w_{i+1,j}$ of the pair $(t_{i+1}, d_j)$. In certain situations, this is a simplification. For example, let us suppose that the terms “algorithm” and “genetic” are used as indexes for documents dealing with genetic algorithms. In this field, a document containing the term “algorithm” will almost certainly also include the term “genetic”. Taking advantage of such correlations is however not a simple task, and it is currently not clear whether taking such dependencies into account is an advantage in information retrieval. So, in the models presented, unless otherwise specified, the assumption is made that the index terms are independent.

Finding interesting documents mostly comes down to finding documents relevant to a given query.

DEFINITION 1.4 (DOCUMENT SETS)

For a given query, $q$, representing a user interest, $u$, an information retrieval system retrieves a set, $R$, of $|R|$ documents from the whole collection, $D$. It is possible to define:

$D_{rel}$: The subset of the collection $D$ containing the $|D_{rel}|$ documents relevant to $u$.

$D_{irr}$: The subset of the collection $D$ containing the $|D_{irr}|$ documents irrelevant to $u$.

$R_{rel}$: The subset of the retrieved set $R$ containing the $|R_{rel}|$ documents relevant to $u$.

$R_{rel,i}$: The subset of the retrieved relevant documents $R_{rel}$ containing the $|R_{rel,i}|$ relevant documents that contain the index term $t_i$.

$R_{irr}$: The subset of the retrieved set $R$ containing the $|R_{irr}|$ documents irrelevant to $u$.

Subsets $D_{rel}$ and $D_{irr}$ are mostly unknown. To construct subsets $R_{rel}$ and $R_{irr}$ from $R$, the information retrieval system needs external help, for example the user giving an assessment of the relevance of the documents in $R$. In an ideal information retrieval system, subsets $R_{rel}$ and $R_{irr}$ are always identical to subsets $D_{rel}$ and $D_{irr}$.
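The document sets of this definition reduce to elementary set operations once relevance assessments are available. The sketch below assumes hypothetical document identifiers and assumes that the relevant subset is known, which, as noted above, is rarely the case in practice.

```python
# Hypothetical collection, relevance assessments and retrieved set.
collection = {"d0", "d1", "d2", "d3", "d4", "d5"}  # whole collection
relevant = {"d0", "d1", "d2"}                      # relevant documents (usually unknown)
retrieved = {"d1", "d2", "d4"}                     # documents returned for the query

irrelevant = collection - relevant           # irrelevant documents of the collection
retrieved_relevant = retrieved & relevant    # retrieved documents that are relevant
retrieved_irrelevant = retrieved - relevant  # retrieved documents that are irrelevant

print(sorted(irrelevant))            # ['d3', 'd4', 'd5']
print(sorted(retrieved_relevant))    # ['d1', 'd2']
print(sorted(retrieved_irrelevant))  # ['d4']
```

In this toy case the relevant document `d0` is missed and the irrelevant document `d4` is returned, so the retrieved subsets differ from the subsets of the collection.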

1.6 Documents Set “DB”

Since all the concepts presented in this thesis are based on sets of documents, an imaginary documents set “DB” is introduced for didactic purposes. It consists of 10 documents issued from 2 different topics (or subsets) in which 24 different index terms appear:

A topic of 5 documents ($d_0, \ldots, d_4$) is related to the rock band “Deep Purple”.

A topic of 5 documents ($d_5, \ldots, d_9$) is related to the rock band the “Beatles”.

Table 1.1 presents the documents, the index terms and the number of occurrences³ of the index terms in the documents.

                   Deep Purple            Beatles
                   d_0 d_1 d_2 d_3 d_4    d_5 d_6 d_7 d_8 d_9
t_1:  Beatles       h   h   h   h   h      i   i   j   i   i
t_2:  Blackmore     i   i   j   k   k      h   h   h   h   h
t_3:  child         j   i   j   i   i      h   h   h   h   h
t_4:  Deep          h   i   i   i   i      h   h   h   h   h
t_5:  Gillian       i   j   i   i   j      h   h   h   h   h
t_6:  Glover        j   k   i   i   i      h   h   h   h   h
t_7:  guitar        i   k   k   i   i      i   i   i   i   i
t_8:  help          h   h   h   h   h      h   i   h   i   i
t_9:  hard          k   i   k   i   i      h   h   h   h   h
t_10: Harrisson     h   h   h   h   h      i   i   j   i   i
t_11: Lennon        h   h   h   h   h      k   i   i   k   k
t_12: Lord          k   i   i   j   i      h   h   h   h   h
t_13: McCartney     h   h   h   h   h      i   k   i   j   k
t_14: Michelle      h   h   h   h   h      i   j   h   i   h
t_15: music         k   i   k   k   k      i   j   i   i   j
t_16: Paice         i   k   h   k   h      h   h   h   h   h
t_17: please        h   h   h   h   h      i   h   k   h   k
t_18: Purple        j   i   j   i   j      h   h   h   h   h
t_19: Ringo         h   h   h   h   h      i   j   i   k   j
t_20: rock          i   k   i   j   k      k   i   k   i   i
t_21: roll          h   h   h   h   h      i   k   i   k   i
t_22: smoke         k   h   k   h   i      h   h   h   h   h
t_23: time          k   i   i   k   k      h   h   h   h   h
t_24: water         k   k   k   k   k      h   h   h   h   h

Table 1.1: “DB” documents set.

The “DB” set has several interesting properties:

some index terms only appear in one of the topics and not in the other one ($t_1$ “Beatles” or $t_6$ “Glover”).

some index terms appear in both topics ($t_7$ “guitar” or $t_{20}$ “rock”).

some index terms do not appear in each document constituting a topic ($t_{16}$ “Paice” or $t_{14}$ “Michelle”).

It is also supposed that both users Pal and Popeye are interested in “Deep Purple”, and that users Alain and Marco are interested in the “Beatles”. Two queries, each represented by a set of index terms, are used to find the documents relevant to each topic:

Query $q_1$ is used to find documents about “Deep Purple” and contains the four words “Deep Purple hard rock”.

Query $q_2$ is used to find documents about the “Beatles” and contains the four words “Beatles Lennon McCartney rock”.

It is interesting to note that the same index term $t_{20}$ “rock” is used in both queries.

³ These occurrences will be used as a basis for the index term weights in the different models presented in chapter 2.

Remark: Of course, the “DB” document set cannot be used to validate any kind of method. The purpose is only to show how each method works. It will be used for this purpose in chapter 2.

1.7 Structure of the Thesis

A flowchart of the dissertation structure is shown in Figure 1.2 and discussed below. Dotted lines represent non-essential paths to the appendices.

2 Information Retrieval Systems. Chapter 2 describes the principal types of existing information retrieval systems. The information retrieval system proposed in this thesis, called GALILEI, is also described.

A crucial issue is the evaluation of the performance of GALILEI. Two chapters propose solutions for this problem:

3 Validation. A key issue when discussing problems and computed solutions is evaluating their quality. It is therefore necessary to introduce quality measures and a specific validation methodology that simulates real users. Chapter 3 proposes these tools.

4 Evaluation of the Subprofile Clusterings. In a real-world situation, the ideal solution is, of course, unknown and it is necessary to evaluate the quality of a solution. In chapter 4, some criteria are proposed that can be used to measure this quality.

The GALILEI system involves computing methods and algorithms. Five chapters present them and study the performance of GALILEI:

5 Documents. Chapter 5 details how the documents are handled; in particular, the important concept of information entity is introduced.

6 Profiles. Chapter 6 details how the users are modeled as profiles and subprofiles, and the different methods used to describe them.

7 Virtual Communities. Chapter 7 details an important problem of this thesis, the clustering of subprofiles into virtual communities.

8 Results. Chapter 8 presents the main results and studies the influence of the different parameters of the proposed information retrieval system.

Since this thesis lies within the scope of a research project whose goal is the development of a complete information retrieval system, the algorithms and methods needed by GALILEI were developed as a set of C++ classes. Two chapters are dedicated to the implementation of GALILEI as a complete integrated system:

9 Architecture. To implement a system, it is necessary to develop an underlying architecture that makes the different components of the system communicate. Chapter 9 discusses such an architecture for GALILEI.

10 The Client Prototype. Chapter 10 describes how a client prototype can be developed and integrated in a user environment. A client is a user application providing an interface to a given set of services proposed by the information retrieval system of this thesis.


Figure 1.2: Flowchart of the dissertation structure.

Finally:

11 Conclusions. In chapter 11, the main results of the previous chapters are summarized and discussed.

The appendices provide: a short introduction to the algorithmic conventions used and some specific algorithms adapted for this thesis (appendix A); an overview of the XML language (appendix B); a formal definition of the internal document representation used in this thesis (appendix C); an overview of the multi-criteria decision-aid method PROMETHEE (appendix D); and a general introduction to grouping and multi-criteria problems (appendix E), which contains useful information to fully understand chapter 7.

Reading Suggestion

The chapters and sections listed below are essential for a good understanding of this thesis:

Chapter/Section   Title
1                 Introduction
2.4               Why Another Solution?
2.5               Structured and Collaborative Search
3.1.1             Recall and Precision in Information Retrieval
3.2               Evaluation Methodology
3.3               Validation Measures
4.4               What is a Good Solution?
5                 Documents
6                 Profiles
7.2               Genetic Virtual Communities Algorithm (GVCA)
8.10              Conclusions of the Results
9.4               Architecture
10                The Client Prototype
11                Conclusions

In addition, the reader is advised to read the introduction and the conclusion of each chapter.


Chapter 2

Information Retrieval Systems

Since it is vital for document-oriented systems that information can be retrieved when necessary, information retrieval has developed into an important field of research in computer science. This chapter is devoted to the various solutions existing today. The principal types of existing information retrieval systems are described:

Document Rankings (section 2.1).

Text Categorization (section 2.2).

Collaborative Filtering (section 2.3).

An overall comparison (section 2.4) of these methods reveals some good results and pertinent features, but also some important drawbacks. The work presented in this thesis aims at proposing a new approach to this difficult problem, called structured and collaborative search (section 2.5).

Section 2.6 briefly introduces the semantic web. Finally, a conclusion is presented in section 2.7.

2.1 Document Rankings

This approach is the one used by search engines: users enter a query and the system proposes a list of potentially interesting documents, ranked so that the most interesting ones are at the top of the list.

As explained earlier, most information retrieval systems represent documents by means of index terms. This means that the semantic load of the documents and the users' interests are expressed through a set of index terms, which is clearly a simplification. Thus, when documents are retrieved, it is normal that many of them are irrelevant. For example, asking for documents containing “genetic algorithms” will retrieve a document containing the sentence “this document has nothing to do with genetic algorithms”, which, of course, is not relevant to the subject. It is therefore important for information retrieval systems to evaluate which documents may be relevant and which may not.

To decide upon the relevance of a document, the systems usually need ranking algorithms to order their documents by relevance, i.e. the documents at the top of the list are the most relevant. These ranking algorithms are the core of the systems. There are three basic information retrieval models [92], labeled Boolean, vector, and probabilistic. In the Boolean model, documents and queries are represented through sets of index terms, and the model is said to be set theoretical. In the vector model, documents and queries are represented as vectors in the word space, and the model is said to be algebraic. In the probabilistic model, the representation of documents and queries is based on probability theory, and the model is said to be probabilistic. Through the years, the basic models have been extended.


2.1.1 Set Theoretical Models

2.1.1.1 The Boolean Model

This section presents the Boolean model, which is a very simple model based on set theory and Boolean algebra. It is easy for ordinary users of Information Retrieval (IR) systems to understand because the concept of a set is very intuitive, and queries are expressed through Boolean expressions, which have precise semantic loads. Moreover, Boolean operations and their implementations are well known [228] in the field of information retrieval. This model was widely used in the early years of information retrieval because of its simplicity.

The Boolean model considers that a given index term is either present or absent in a document, so its weight can only be $0$ or $1$, i.e. $w_{i,j} \in \{0, 1\}$. A query, $q$, consists of index terms linked by the operators or, and, and not. For example, the query $q = t_a \wedge (t_b \vee \neg t_c)$ (see Figure 2.1) retrieves all the documents that contain the term $t_a$ and that either contain the term $t_b$ or do not contain the term $t_c$. It is always possible to express a query as a disjunction of conjunctive vectors, i.e. to construct its disjunctive normal form (DNF) [116]. For example, query $q$ can be written as $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$, where each component is a binary vector associated with the tuple $(t_a, t_b, t_c)$.

Figure 2.1: Query $q = t_a \wedge (t_b \vee \neg t_c)$ and its three conjunctive components $(1,1,1)$, $(1,1,0)$ and $(1,0,0)$.
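The conjunctive components of the example query (term a and (term b or not term c)) can be checked by enumerating every binary tuple and keeping those for which the expression holds. This is a brute-force sketch for illustration only; the function names `query` and `dnf_components` are assumptions and not the method used in the thesis.

```python
from itertools import product


def query(ta, tb, tc):
    """Example query: term a AND (term b OR NOT term c), on 0/1 values."""
    return bool(ta and (tb or not tc))


def dnf_components(predicate, arity):
    """Enumerate the binary tuples for which the predicate is true."""
    return [bits for bits in product((1, 0), repeat=arity) if predicate(*bits)]


print(dnf_components(query, 3))  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

The enumeration is exponential in the number of terms of the query, which is acceptable here because a query usually involves only a handful of index terms.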

DEFINITION 2.1 (INDEX TERM WEIGHTS IN THE BOOLEAN MODEL)

In the Boolean model, the weight associated with a pair $(t_i, d_j)$ is represented by a binary variable, i.e. $w_{i,j} \in \{0, 1\}$.

DEFINITION 2.2 (QUERIES IN THE BOOLEAN MODEL)

Let $\vec{q}_{dnf}$ be the disjunctive normal form for a given query $q$. Let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. Let $n_{cc}$ be the number of conjunctive components of $\vec{q}_{dnf}$ and $\vec{q}_{cc,i}$ a reference to the i-th conjunctive component. $\vec{q}_{dnf}$ can be written

$$\vec{q}_{dnf} = \vec{q}_{cc,1} \vee \vec{q}_{cc,2} \vee \cdots \vee \vec{q}_{cc,n_{cc}} \qquad (2.1)$$

The model uses the similarity between a given query and the documents of the system to retrieve those that are relevant.

DEFINITION 2.3 (SIMILARITY IN THE BOOLEAN MODEL)

The similarity of a document, $d_j$, to query $q$ is given by

$$sim(d_j, q) = \begin{cases} 1 & \text{if } \exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall t_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases} \qquad (2.2)$$

If $sim(d_j, q) = 1$, the model supposes that the document $d_j$ is relevant to the query $q$ and retrieves it; otherwise, $d_j$ is considered as not being relevant.

In other words, the system evaluates the proposition on each document and retrieves those for which the evaluation is true.
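The Boolean similarity can be sketched directly from its definition: a document scores 1 when its binary weights equal those of at least one conjunctive component of the query, and 0 otherwise. The sketch assumes that the document vector is restricted to the terms of the query tuple; the name `sim` and the sample vectors are illustrative.

```python
def sim(doc_bits, q_dnf):
    """Return 1 if doc_bits matches one conjunctive component of q_dnf, else 0.

    doc_bits: binary weights of the document for the terms of the query tuple.
    q_dnf: list of conjunctive components, each a binary tuple of the same length.
    """
    for component in q_dnf:  # does a matching conjunctive component exist?
        if all(d == c for d, c in zip(doc_bits, component)):  # equal on every term
            return 1
    return 0


# Disjunctive normal form of the example query, for the tuple (term a, term b, term c).
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(sim((1, 0, 0), Q_DNF))  # 1
print(sim((1, 0, 1), Q_DNF))  # 0
print(sim((0, 1, 1), Q_DNF))  # 0
```

The result is binary by construction, so no ordering of the retrieved documents can be derived from it.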

2.1.1.2 “DB” Document Sets and the Boolean Model

The similarity given for the Boolean model in DEFINITION 2.3 shows that a document is considered to be relevant if it matches at least one of the conjunctive components of the query. So the first question to answer is how to interpret the queries, i.e. which logical operators must be used to form the Boolean expression with these index terms. There are two trivial choices:

1. Only and operators are used.

2. Only or operators are used.

It can be seen from Table 1.1 that the index term $t_{20}$ “rock” appears in every document of the collection. Therefore, if only or operators are used, all the documents will be considered as being relevant to $q_1$ and $q_2$ because the index term $t_{20}$ “rock” is contained in both queries, and the user interested in “Deep Purple” will retrieve documents about the “Beatles” and vice versa. When only and operators are used, query $q_1$ will not retrieve document $d_0$ because the index term $t_4$ “Deep” does not appear in it. Query $q_2$ will retrieve the corresponding relevant sub-set because each document of the topic contains the index terms used in the query. Table 2.1 shows the documents retrieved (x) by the different combinations of queries and logical operators.

                           Deep Purple            Beatles
                           d_0 d_1 d_2 d_3 d_4    d_5 d_6 d_7 d_8 d_9
q_1 with and operators      -   x   x   x   x      -   -   -   -   -
q_1 with or operators       x   x   x   x   x      x   x   x   x   x
q_2 with and operators      -   -   -   -   -      x   x   x   x   x
q_2 with or operators       x   x   x   x   x      x   x   x   x   x

Table 2.1: “DB” documents set and Boolean Model.

This simple example shows the principal problem with Boolean queries, namely that it is difficult to find the right expression to retrieve the relevant sub-set of the collection.
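The two trivial interpretations can be reproduced with set intersections and unions. The sketch below only records which documents contain each query term (presence or absence, as described in the text and in Table 1.1), not the number of occurrences; the names `PRESENT`, `QUERIES` and `retrieve` are illustrative assumptions.

```python
DOCS = [f"d{j}" for j in range(10)]  # d0..d4: "Deep Purple", d5..d9: "Beatles"

# Documents containing each query term (presence only, no occurrence counts).
PRESENT = {
    "Deep": set(DOCS[1:5]),      # absent from d0
    "Purple": set(DOCS[:5]),
    "hard": set(DOCS[:5]),
    "Beatles": set(DOCS[5:]),
    "Lennon": set(DOCS[5:]),
    "McCartney": set(DOCS[5:]),
    "rock": set(DOCS),           # appears in every document
}

QUERIES = {
    "q1": ["Deep", "Purple", "hard", "rock"],
    "q2": ["Beatles", "Lennon", "McCartney", "rock"],
}


def retrieve(terms, mode):
    """Retrieve documents with only 'and' operators or only 'or' operators."""
    sets = [PRESENT[term] for term in terms]
    hits = set.intersection(*sets) if mode == "and" else set.union(*sets)
    return sorted(hits)


for name, terms in QUERIES.items():
    for mode in ("and", "or"):
        print(name, mode, retrieve(terms, mode))
# q1 and ['d1', 'd2', 'd3', 'd4']
# q1 or  all ten documents
# q2 and ['d5', 'd6', 'd7', 'd8', 'd9']
# q2 or  all ten documents
```

Neither interpretation returns exactly the “Deep Purple” topic for the first query: the and form misses one relevant document, and the or form returns the whole collection.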

2.1.1.3 Trends and Research for Set Theoretic Models

While the Boolean model is simple, it does have its disadvantages. Firstly, a Boolean expression is either true or false, i.e. a document either matches the whole expression and is retrieved, or it is considered as being irrelevant. All index terms have the same importance (present or not in a document), whether they are highly discriminatory or not. This means that the model does not include a notion of scoring documents or approximate matching, and can be seen more as a data retrieval method. The set of documents retrieved may be too large or too small, depending on the query. Secondly, constructing a Boolean query becomes very difficult when the requests are complex. In fact, many studies have shown [19, 88, 147, 239] that users have difficulty in mastering Boolean expressions because of the differences between normal language and the Boolean
