Precise information retrieval in semantic scientific digital libraries : CEUR doctoral symposium proceedings


Conference Proceedings

Reference


DE RIBAUPIERRE, Hélène


DE RIBAUPIERRE, Hélène. Precise information retrieval in semantic scientific digital libraries : CEUR doctoral symposium proceedings. Galway : 2012

Available at:

http://archive-ouverte.unige.ch/unige:27232


Precise information retrieval in semantic scientific digital libraries

Hélène de Ribaupierre

ICLE, Centre Universitaire d’Informatique Information System Department, University of Geneva

Geneva, Switzerland {Helene.deribaupierre}@unige.ch http://www.unige.ch/icle/index.html

Abstract. When scientists or engineers are looking for information in document collections, or on the web, they generally have a precise objective in mind. Instead of looking for documents ”about a topic T”, they try to answer specific needs such as finding the definition of a concept, finding results for a particular problem, checking whether an idea has already been tested, or comparing the scientific conclusions of two articles. One objective of this thesis is to build an indexing model which includes the decomposition of documents into fragments that correspond to discourse elements (definition, hypothesis, method, findings, etc.). The division of documents into fragments should allow scientists to retrieve more pertinent information and to make queries more precise. Each type of discourse element will be modeled by defining specific characteristics.

Keywords: Knowledge representation, Ontologies, Science model

1 Problem

In their activity, scientists deal with an extremely large amount of complex information in their domain and in adjacent domains. The number of publications increases every year (Medline has a growth rate of 0.5 million items per year [1]), and so does the time needed to search and read this information. Researchers need tools to help them search, organize and read scientific papers, and to avoid wasting time looking for and reading irrelevant articles.

Recently, systems supporting electronic scientific publications have introduced some improvements compared with simple search of printed edition indices, but the revolution heralded by electronic publishing has not happened yet.

The aim of this research is to improve tools for searching for information in scientific documents. Currently, information retrieval systems for scientific documents work with metadata such as the author’s name and the title, or with the full text. This kind of indexing is efficient when users know exactly what they are looking for, but different factors can influence the difficulty of this task. In [4], the authors define some factors that can render problem solving more difficult: ill-structured problem spaces, low domain knowledge, unsystematic steps, etc.

Moreover, it is not possible to search for precise questions or precise needs. For example, it is difficult and time-consuming to gather all the different definitions of ”gender” in the different schools of thought, or to search for changes in the meaning of a term or concept over time and between authors. It could also be interesting to search for all research results that use ontology-based named-entity annotation.

A second part of the research will be dedicated to the creation of a ”reading interface” for the query results. In general, the user’s task is not simply to find relevant documents. The user must read them, or read the most important fragments, compare them, create summaries, etc. Hence, it is necessary to process the retrieved documents further, or to break them into fragments according to the user’s needs.

2 State of the art

We can find different annotation models for scientific writing. Some authors propose using rhetorical structure or discursive categories to manually or automatically annotate scientific writing in order to improve summarization or information retrieval systems [7], [8], [9], [10], [11], [12], [13]. These works focus mainly on the ”hard” sciences like biology, where there is relatively little variation in describing the results, hypotheses, conclusions, etc. In [10], the authors use rhetorical status (Background, Other, Own, Aim, Textual) to annotate documents for classifying the purpose of citations [14] and improving summarization. In [15], the authors propose an approach to extract claims and hypotheses from scientific papers in biology. For a more complete survey of the literature, see [16]. In [13], the authors use patterns to automatically annotate discursive categories that allow the user to search for definitions, quotations, causality, etc. The model presented here is based on these studies and combines them while also suggesting some new concepts.

Publisher models like Dublin Core, TEI, and the DTDs for Science, Nature, PubMed, etc., define metadata to identify scientific documents according to criteria such as author information, title, year of publication, etc. Additional information associated with metadata, such as relations between authors or the research groups they belong to, can contextualize a document and an author, and could serve as a criterion for scientists choosing which papers to read. In [17], the authors use an explicit ontology to improve search in strategic monitoring, in astronomy, using concepts such as laboratory, affiliation, and the authors’ country of residence.

3 Proposed Approach

Our assumption is that scientists (when they are not reading the whole article) are looking for very specific information, and that they want to compare, follow, analyze, etc. information within and across scientific documents. Understanding what kind of information scientists are looking for when they search and read scientific documents can be a way to build a model for annotating these documents. We assume that when scientists are looking for this very specific information, they focus on a specific part of a document. A document can be divided into different parts; we call each such part a Fragment. A fragment can help the scientist answer a specific need or question. Returning to the example above, if a definition of the term ”gender” is annotated in a document, it becomes easier to find it and to gather all the different definitions of the term in the future. Currently, it is not possible to search within articles by the type of information a fragment provides. It is possible to query scientific documents with the word ”gender”, but the documents returned do not necessarily contain a definition of the concept ”gender”. Moreover, even when they do, the scientist has to read or scan the whole article to find it. In the process of comparing and clustering articles, readers introduce relations between articles. Those relations are established for different reasons and goals, to achieve certain tasks or to solve problems. Continuing the previous example on gender, readers can relate articles that belong to the same school of thought or were written during the same period of time, and exclude all the definitions of ”gender” that do not correspond to those criteria.

To define a relevant document model for precise information retrieval in scientific documents we need to know how scientists, in different domains, access information. The aim of this first phase is thus to better understand the strategic reading needs of scientists in general and to define a set of test cases. To analyze these needs, we have set up an on-line survey that asks for the purpose of reading scientific writing and whether scientists focus on specific parts when reading.

We are also conducting interviews with scientists in different areas of research to understand more precisely the behavior of scientists and to define some use cases that we can use to test the corpus.

Taking as a starting point the results from the first step and the analysis of models already existing in the literature, we will define an ontology of discourse elements (definition, hypothesis, findings, methodology, . . . ).

The query model will be based on the document model. Each basic query will thus be expressed in terms of the discourse element model (e.g. find fragments with discourse element type ”definition” and general content model ”gender is social construction” and not ”biological difference”). The matching between queries and documents will be based on a similarity measure defined on the document model. The definition of this measure will re-use results obtained in defining retrieval systems for structured documents (in particular for XML documents) [19].
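A basic query of this kind could look roughly as follows. This is only a minimal sketch in Python, not the query model of the thesis: the names `Fragment`, `Query`, `matches` and `search`, and the representation of fragment content as a set of concept labels, are assumptions made for the illustration. A plain Boolean match stands in for the similarity measure, which has not been defined yet.

```python
from dataclasses import dataclass, field


@dataclass
class Fragment:
    """Hypothetical fragment: a discourse type plus the concepts it mentions."""
    doc_id: str
    discourse_type: str  # e.g. "definition", "hypothesis", "findings"
    concepts: set = field(default_factory=set)


@dataclass
class Query:
    """Discourse element type plus concepts that must / must not be present."""
    discourse_type: str
    required: set = field(default_factory=set)
    excluded: set = field(default_factory=set)

    def matches(self, fragment: Fragment) -> bool:
        # Boolean stand-in for the (still undefined) similarity measure.
        return (
            fragment.discourse_type == self.discourse_type
            and self.required <= fragment.concepts
            and not (self.excluded & fragment.concepts)
        )


def search(query, fragments):
    """Return the fragments that satisfy the query, in corpus order."""
    return [f for f in fragments if query.matches(f)]


# Illustrative corpus; the documents and annotations are invented.
corpus = [
    Fragment("doc1", "definition", {"gender", "social construction"}),
    Fragment("doc2", "definition", {"gender", "biological difference"}),
    Fragment("doc3", "findings", {"gender", "social construction"}),
]

query = Query(
    "definition",
    required={"gender", "social construction"},
    excluded={"biological difference"},
)
results = search(query, corpus)
print([f.doc_id for f in results])
```

With this toy corpus only the fragment from `doc1` is returned: `doc2` is a definition but mentions the excluded concept, and `doc3` mentions the right concepts but is not a definition.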

In this step, we will define and implement primitive operations that take as input documents represented in the document model, and produce a result document in this same model. The goal is to define a set of operations that can be combined to produce sophisticated document generation processes. Typical operations are: filtering (selecting fragments that satisfy a Boolean condition); merging documents at the fragment level; computing differences; or inferring links. Using the primitive document operations, we will develop user interfaces based on the generation of derived documents. The goals of this implementation are 1) to evaluate the usability of the operations for developing document-based interfaces and 2) to evaluate the usability of different types of document-based interfaces.
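The closure property of these operations (documents in, document out) is what makes them composable. The following Python sketch illustrates the idea under strong simplifying assumptions: a document is modeled as a plain list of fragments, and the names `Fragment`, `filter_fragments`, `merge` and `difference` are hypothetical. Link inference is left out because it depends on the relation model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment; frozen so that fragments can be compared and hashed."""
    doc_id: str
    discourse_type: str
    text: str


def filter_fragments(document, condition):
    """Filtering: keep the fragments that satisfy a Boolean condition."""
    return [f for f in document if condition(f)]


def merge(doc_a, doc_b):
    """Merging at the fragment level: concatenate and drop exact duplicates."""
    seen = set()
    merged = []
    for fragment in doc_a + doc_b:
        if fragment not in seen:
            seen.add(fragment)
            merged.append(fragment)
    return merged


def difference(doc_a, doc_b):
    """Difference: fragments of doc_a that do not occur in doc_b."""
    other = set(doc_b)
    return [f for f in doc_a if f not in other]


# Two invented documents, each a list of fragments.
a = [
    Fragment("A", "definition", "Gender is a social construction."),
    Fragment("A", "findings", "The survey confirms the hypothesis."),
]
b = [
    Fragment("B", "definition", "Gender is a biological difference."),
    Fragment("B", "method", "We interviewed twenty participants."),
]

# Operations compose because each one returns a document in the same model.
merged = merge(a, b)
definitions = filter_fragments(merged, lambda f: f.discourse_type == "definition")
print([f.doc_id for f in definitions])
```

Here the derived document `definitions` gathers the definition fragments of both articles side by side, which is the kind of result a reading interface could display for comparison.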

For the evaluation, we will test our system against a classical information retrieval system and a semantic information retrieval system with some pre-defined queries, and ask users which system is the most precise and accurate for them. For example, we will ask users to find all the definitions of the term ”gender” and to compare the definitions of the term between two schools of thought.

4 Current results

The results from our pilot survey and the first interviews allowed us to redefine the fragment-based annotation model [20]. This model (figure 1) will be included in the indexing model. Each fragment will be annotated semantically and indexed with relevant domain concepts and the role they play in the fragment, as well as metadata. The proposed model is intended to annotate the different fragments of a document along four characteristics: bibliographic data (metadata), scientific (sub)domain, relations between fragments, and discourse type.

The only characteristic of this model that depends on the domain is the domain ontology; the rest is generic. We will use a standard model such as Dublin Core to represent bibliographic data (metadata) such as author, year, source, editor, publisher, type and title [21]. Different ontologies will be used to define the (sub)domain of the fragment. Each fragment will have a list of domain concepts. For the relations, we currently use CiTO¹ [22] to link fragments to one another.
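To make the four characteristics concrete, the structure of an annotated fragment can be sketched as a small set of Python data classes. This is an illustrative assumption, not the OWL formalization of the model: the class names, the field names, the role label ”definiendum”, the fragment identifiers and the example values are all hypothetical. Only the Dublin Core element names (`dc:creator`, `dc:date`, `dc:title`) and the CiTO property `cito:agreesWith` are taken from the existing vocabularies.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptAnnotation:
    """A domain concept and the role it plays in the fragment."""
    concept: str  # identifier of a concept in a domain ontology
    role: str     # role label; the set of roles is an assumption here


@dataclass
class Relation:
    """A typed link from this fragment to another fragment."""
    predicate: str  # e.g. a CiTO property such as "cito:agreesWith"
    target: str     # identifier of the related fragment


@dataclass
class AnnotatedFragment:
    """One fragment described along the four characteristics of the model."""
    fragment_id: str
    metadata: dict       # 1) bibliographic data, keyed by Dublin Core elements
    discourse_type: str  # 4) discourse type, e.g. "definition"
    concepts: list = field(default_factory=list)   # 2) (sub)domain concepts
    relations: list = field(default_factory=list)  # 3) relations between fragments


# Invented example values, for illustration only.
fragment = AnnotatedFragment(
    fragment_id="doc1#f3",
    metadata={
        "dc:creator": "Author X",
        "dc:date": "2010",
        "dc:title": "An example article",
    },
    discourse_type="definition",
    concepts=[ConceptAnnotation("gender", "definiendum")],
    relations=[Relation("cito:agreesWith", "doc2#f1")],
)
print(fragment.discourse_type, [r.predicate for r in fragment.relations])
```

Only the `concepts` field depends on the research domain; the other three fields keep the same shape whatever the discipline, which mirrors the claim that the rest of the model is generic.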

Based on the results of the survey and the first interviews, a discourse element model has been defined.

For the moment, we have defined more precisely the case of the definition. A definition is composed of two other concepts: the definiendum, which is the term being defined, and the definiens, which is the sentence or the word that defines it. For example, “cat (Felis catus), also known as the domestic cat or housecat to distinguish it from other felines and felids, is a small, furry, domesticated, carnivorous mammal that is valued by humans for its companionship and for its ability to hunt vermin and household pests”². The definiendum is the term cat and the definiens is the sentence that follows this term. In addition to providing more precise indexing, separating the definiens and the definiendum is a way to solve homonymy and synonymy problems. The definiendum can be associated with several linguistic forms (synonyms) and with a precise concept in a domain ontology.

With these models it becomes possible to answer very precise user requests such as “find all the definitions of the term gender (in a collection of articles) that agree with the definition of author X”.
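A request of this kind can be sketched in Python as follows. The sketch is a simplified assumption rather than the actual model: the class `Definition`, its fields, the helper functions and the sample data (including authors X, Y, Z and W) are hypothetical, and agreement between fragments is stored as a plain set of fragment identifiers, as it could be derived from relations such as `cito:agreesWith`.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """Hypothetical definition fragment split into definiendum and definiens."""
    fragment_id: str
    author: str
    concept: str            # concept in a domain ontology (handles homonymy)
    definiendum_forms: set  # linguistic forms of the term (handles synonymy)
    definiens: str
    agrees_with: set = field(default_factory=set)  # ids of fragments agreed with


def definitions_of(term, definitions):
    """All definitions whose definiendum has `term` among its linguistic forms."""
    wanted = term.lower()
    return [
        d for d in definitions
        if wanted in {form.lower() for form in d.definiendum_forms}
    ]


def agreeing_with_author(term, author, definitions):
    """Definitions of `term` by other authors that agree with one by `author`."""
    candidates = definitions_of(term, definitions)
    reference_ids = {d.fragment_id for d in candidates if d.author == author}
    return [
        d for d in candidates
        if d.author != author and d.agrees_with & reference_ids
    ]


# Invented sample data, for illustration only.
defs = [
    Definition("d1", "X", "gender", {"gender"}, "a social construction"),
    Definition("d2", "Y", "gender", {"gender"}, "a socially constructed role",
               agrees_with={"d1"}),
    Definition("d3", "Z", "gender", {"gender"}, "a biological difference"),
    Definition("d4", "W", "Felis catus", {"cat", "domestic cat", "housecat"},
               "a small, furry, domesticated, carnivorous mammal"),
]

print([d.fragment_id for d in agreeing_with_author("gender", "X", defs)])
print([d.fragment_id for d in definitions_of("housecat", defs)])
```

The second call shows the benefit of attaching several linguistic forms to one definiendum: a search for “housecat” retrieves the definition whose text only names the term cat.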

1 http://purl.org/spar/cito

2 http://en.wikipedia.org/wiki/Cat


Fig. 1. Fragment-based annotation model

The annotation model is formalized in an OWL ontology. The core of this ontology is thus a set of OWL classes, properties and axioms corresponding, among others, to the abstract schema shown in figure 1. For the bibliographic part of the model (metadata), we simply import the Dublin Core ontology. The domain ontologies are also imported from existing ontology repositories that are now emerging on the web (figure 2).

Fig. 2. Protégé, extract of the annotation ontology


5 Conclusion and Further work

To improve the process of search and analysis of information in scientific papers, we have to create new models for indexing these articles, because generic full-text indexing is not sufficient. To define these models of annotation and indexing, we need to know the needs of users. We have initiated an analysis of the search and indexing behavior of an initial set of scientists, which will be enlarged through further studies. It is also important to develop a model that is generic enough to be suitable for annotating different research areas and to cover broad classes of interdisciplinary research.

Currently, we have defined this fragment-based annotation model, but depending on the needs of users and the results of the interviews, we may refine or modify it.

We have also started to annotate scientific papers manually, which allows us to see the difficulties in implementing this model. This annotated corpus will be used as a training corpus for automatic annotation using existing techniques and algorithms.

We will test this corpus with different queries and some use cases that we have distilled from the interviews. Subsequently, we will build an interface that facilitates strategic and parallel reading.

References

1. Nováček, V., Groza, T., Handschuh, S., Decker, S.: Coraal - dive into publications, bathe in the knowledge. J. Web Sem. 8(2-3) (2010) 176–181

2. Tenopir, C., King, D., Edwards, S.: Electronic journals and changes in scholarly article seeking and reading patterns. (2009)

3. Hannay, T.: What can the web do for science? Computer. (2010)

4. Palmer, C., Cragin, M., Hogan, T.: Weak information work in scientific discovery. Information Processing & Management 43(3) (2007) 808–820

5. Nicholas, D., Huntington, P., Jamali, H., Dobrowolski, T.: Characterising and evaluating information seeking behaviour in a digital environment: Spotlight on the ’bouncer’. Information Processing & Management 43(4) (2007) 1085–1102

6. Kazai, G.: Search and navigation in structured document retrieval: comparison of user behaviour in search on document passages and XML elements. In: Proceedings of the Twelfth Australasian Document Computing Symposium (ADCS’07). (December 2007)

7. Groza, T., Muller, K., Handschuh, S., Trif, D., Decker, S.: Salt: Weaving the claim web. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC2007), Busan, South Korea. Berlin, Heidelberg (2007)

8. Muller, H.M., Kenny, E.E., Sternberg, P.W.: Textpresso: An ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2, e309 (2004)

9. Harmsze, F.: A modular structure for scientific articles in an electronic environment. PhD thesis (January 2000)

10. Teufel, S., Moens, M.: Summarizing scientific articles: experiments with relevance and rhetorical status. Computational Linguistics 28(4) (2002) 409–445


11. Teufel, S.: Argumentative Zoning: Information Extraction from Scientific Text. PhD thesis, University of Edinburgh (1999)

12. Ibekwe-Sanjuan, F., Silvia, F., Eric, S., Eric, C.: Annotation of Scientific Summaries for Information Retrieval. In Zaragoza, O.A..H., ed.: ECIR’08 Workshop on Exploiting Semantic Annotations for Information Retrieval, Glasgow, United Kingdom (March 2008) 70–83

13. Djioua, B., Descles, J.: Indexing documents by discourse and semantic contents from automatic annotations of texts. (2007)

14. Teufel, S., Siddharthan, A., Tidhar, D.: An annotation scheme for citation function. In: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, SigDIAL06, Association for Computational Linguistics. (2006) 80–87

15. de Waard, A., Shum, S.B., Carusi, A., Park, J., Samwald, M., Sándor, Á.: Hypotheses, evidence and relationships: The hyper approach for representing scientific knowledge claims. In: Proceedings 8th International Semantic Web Conference, Workshop on Semantic Web Applications in Scientific Discourse. Lecture Notes in Computer Science, Springer Verlag: Berlin. (October–Autumn 2009)

16. Buckingham Shum, S., Clark, T., de Waard, A., Groza, T., Handschuh, S., Sandor, A.: Scientific discourse on the semantic web: A survey of models and enabling technologies. Semantic Web Journal: Interoperability, Usability, Applicability (2010)

17. Hernandez, N., Mothe, J.: Ontologies pour l’aide à l’exploration d’une collection de documents. In: Veille stratégique, scientifique et technologique. (2005)

18. Ivanov, R., Raae, L.: Inspire: a new scientific information system for hep. (2010)

19. Harrathi, R.: Recherche d’information conceptuelle dans les documents semi-structurés. PhD thesis, LIRIS (2010)

20. de Ribaupierre, H., Falquet, G.: New trends for reading scientific documents. In: Proceedings of the 4th ACM workshop on Online books, complementary social media and crowdsourcing. BooksOnline ’11, New York, NY, USA, ACM (2011) 19–24

21. Ghoula, N., de Ribaupierre, H., Tardy, C., Falquet, G.: Opérations sur des ressources hétérogènes dans un entrepôt de données à base d’ontologie. In: Journées Francophones sur les Ontologies, Montréal. (2011)

22. Shotton, D.: Cito, the citation typing ontology, and its use for annotation of reference lists and visualization of citation networks. The 12th Annual BioOntologies Meeting (2009) 1–4