Text mining strategies to support literature-based biocuration

Thesis

Reference

VISHNYAKOVA, Dina

Abstract

Despite progress in the bioNLP domain, the majority of text mining methods and techniques are not used in real curation tasks. This can be explained partly by the lack of precision and the opacity of the results they provide. The functionality and adequacy of systems designed for curation are also important. In this thesis, original methods for biocuration assistance are first explored and developed. More specifically, the thesis focuses on how text-mining systems are used in curation processes.

Finally, the results of assisted curation are compared with those of manual curation. The improvement achieved in assisted curation accuracy is +9.3% on average. Although the evaluation of the curation tasks is based on a small sample of participants and a small benchmark, the results suggest that improving curation quality is possible. Moreover, this improvement also depends on the skills of the expert.

VISHNYAKOVA, Dina. Text mining strategies to support literature-based biocuration. Thèse de doctorat : Univ. Genève, 2014, no. Sc. 4725

URN : urn:nbn:ch:unige-465681

DOI : 10.13097/archive-ouverte/unige:46568

Available at:

http://archive-ouverte.unige.ch/unige:46568

Disclaimer: layout of this document may differ from the published version.


UNIVERSITÉ DE GENÈVE
FACULTÉ DES SCIENCES
Département d’informatique
Professeur Bastien Chopard

HAUTE ÉCOLE DE GESTION DE GENÈVE
Département des sciences de l’information
Professeur Patrick Ruch

Text Mining Strategies to Support Literature-based Biocuration

THESIS

Presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science (Docteur ès sciences), specialisation in computer science

by

Dina VISHNYAKOVA

of

Armavir, Russia

Thesis No. 4725

GENEVA
Atelier ReproMail
2014

To my parents

ACKNOWLEDGMENTS

Many people have contributed to the production of this thesis. I owe my gratitude to all those people who have made this dissertation possible and because of whom my graduate experience has been one that I will cherish forever.

My deepest gratitude is to my supervisor, Prof. Dr. Patrick Ruch. I have been amazingly fortunate to have an advisor who gave me the freedom to explore on my own, and at the same time the guidance to recover when my steps faltered. He taught me how to question thoughts and express ideas. His patience and support helped me overcome many crisis situations and finish this dissertation. I am indebted to him for his continuous encouragement and guidance.

I would also like to thank Prof. Dr. Christian Lovis for the chance to work under his guidance and to experience the world of medical informatics. His insightful comments and assistance at different stages of my research career were thought-provoking and helped me focus my ideas.

I had a chance to gain my professional experience in research groups such as BiTeM and SIMED and I am thankful to all my colleagues for their encouragement, especially Emilie Pasche, Julien Gobeill and Philippe Baumann.

My thesis hypotheses could not have been fully confirmed without the essential help of the experts of the Swiss Institute of Bioinformatics: Anne-Lise Veuthey, Pascale Gaudet, Isabelle Cusin, Jonas Cicenas, Alain Gateau, Valerie Hinard. Thank you all for your patience and for the time you invested in testing and evaluating the system. You gave me valuable insight into the biocurators’ work.

Many friends have helped me stay sane through these challenging years, but especially, Ricardo Machado who spent hours cheerfully spell-checking while sipping caipirinhas together; Huyen Do and Jee-Hyub Kim who welcomed me as my first colleagues and were always there for me; Valentina Shcherba and Marguerite Deluze who cheered me up. All their support and care helped me stay focused on my graduate study. I greatly value their friendship and I deeply appreciate their belief in me.

Most importantly, none of this would have been possible without the love and patience of my family. My amazing parents, to whom this dissertation is dedicated, and my brother have been a constant source of support and strength all these years. I would like to express my deepest gratitude to them.

ABSTRACT

Although text mining seems to be gaining momentum in bioinformatics, we observe that the majority of text mining techniques are not exploited in real curation tasks. This is due to their insufficient accuracy and the opaque nature of the results they provide.

In this thesis, original methods for biocuration assistance are explored and developed. More specifically, we investigate the use of the text-mining pipelines we developed for particular tasks of the curation workflow. Additionally, we compare the results of assisted curation with those of manual curation.

In Chapter 2 of the thesis we focus on the development of original text-mining techniques such as gene and species name recognition and normalisation, since the gene detection and normalisation tasks form an essential part of the mainstream curation process. We propose an original approach based on well-known named entity recognition techniques along with systems that were not originally designed for the gene normalisation task. Then, in Chapters 3 and 4, we develop automated curation systems based on the integration of independent modules. These systems automate the process by extracting relevant scientific data from published literature and classifying it along multiple qualitative dimensions. Just as an airplane does not try to copy the flight of a bird, the idea is not to reproduce the processing followed by experts but rather to propose an original processing pipeline in which humans and computers can interact optimally.

RÉSUMÉ

Although text mining seems to be gaining ground in bioinformatics, we observe that, in practice, the majority of text-mining techniques are not exploited in curation tasks. The lack of precision of these methods and the opacity of the results they provide are the main causes of this lack of interest.

In this thesis, original methods used in the context of biocuration assistance are first explored and then developed. More specifically, we focus on how text-mining systems are used in the usual curation processes. We also compare the results obtained through assisted curation with those obtained through manual curation.

In Chapter 2 of the thesis, we focus on the development of original text-mining techniques such as the recognition of gene names and species names, as well as their normalisation. Gene detection and normalisation generally constitute the essential part of the curation process. We propose an original approach based on well-known named entity recognition techniques and on the use of systems that were not originally designed for the gene normalisation task.

In Chapters 3 and 4, we develop automated curation systems based on the integration of independent modules. These systems automate the process by extracting relevant scientific data from the literature and classifying them along several qualitative dimensions. Just as an airplane does not seek to copy the flight of a bird, the idea is not to reproduce the methodology used by experts, but rather to propose an original approach in which humans and computers can interact optimally.

TABLE OF CONTENTS

Acknowledgements
Abstract
Résumé
Abbreviations
List of Figures
List of Tables
Introduction
  EVOLUTION OF BIOMEDICAL NLP TECHNIQUES
  PROBLEM OUTLINE/BACKGROUND
  RESEARCH HYPOTHESES
  THESIS OUTLINE
1 State of the art
  1.1 BIOMEDICAL DATA/RESOURCES
  CORPORA
  ANNOTATED CORPORA AND ANNOTATIONS
  KNOWLEDGE SOURCES/TERMINOLOGIES
    1.1.1 UNIFIED MEDICAL LANGUAGE SYSTEM
    1.1.2 SPECIES TAXONOMY DATABASE
    1.1.3 MEDICAL SUBJECT HEADINGS
    1.1.4 ONLINE MENDELIAN INHERITANCE IN MAN
    1.1.5 NATIONAL CANCER INSTITUTE THESAURUS – NCI THESAURUS
    1.1.6 MERGED DISEASE VOCABULARY - MEDIC
    1.1.7 GENE ONTOLOGY
      1.1.7.1 Gene Ontology terms similarity measurement
      1.1.7.2 Resnik’s measure
      1.1.7.3 Lin’s and Jiang’s measures
      1.1.7.4 Wang’s measure
    1.1.8 GENE/PROTEIN ANNOTATION DATA BASES
      1.1.8.1 Gene and Protein Synonyms Database - GPSDB
      1.1.8.2 UniProtKB
      1.1.8.3 EntrezGene
  INTEROPERABILITY FORMAT FOR BIOMEDICAL TEXT PROCESSING - BIOC
  BIOMEDICAL TEXT MINING
    1.1.9 INFORMATION RETRIEVAL
    1.1.10 NAMED ENTITY RECOGNITION
      1.1.10.1 Types of Biomedical NERs
    1.1.11 INFORMATION EXTRACTION
    1.1.12 DATA
    1.1.13 EVALUATION METRICS
    1.1.14 GENE/PROTEIN NER
      1.1.14.1 Concepts, meaning and representation
      1.1.14.2 Dictionary-based NER
      1.1.14.3 Rule-based approach
      1.1.14.4 Classification-based approach
      1.1.14.5 Sequence-based approach
    1.1.15 COMMON WORDS IN BIOMEDICAL TEXT
    1.1.16 ENTITIES NORMALIZATION
      1.1.16.1 Methods
      1.1.16.2 Problems
      1.1.16.3 Evaluation methods
      1.1.16.4 Existing Approaches
    1.1.17 IDENTIFICATION OF ENTITIES RELATIONS
    1.1.18 DOCUMENTS CLASSIFICATION
      1.1.18.1 k-Nearest Neighbour (k-NN) approach
      1.1.18.2 SVM approach
    1.1.19 DOCUMENTS RANKING
    1.1.20 QUESTION-ANSWERING
    1.1.21 INTEROPERABILITY OF TEXT MINING TOOLS
    1.1.22 EXISTING TEXT MINING SYSTEMS FOR BIOMEDICAL LITERATURE
  UTILITY AND USABILITY OF TEXT MINING SYSTEMS
    1.1.23 UTILITY
    1.1.24 USABILITY
  SUMMARY
2 Gene and Gene Product Name Detection and Normalization
  AIMS AND BACKGROUND (NORMAGENE)
  RESOURCES
    2.1.1 BIOMEDICAL CORPUS
    2.1.2 CONTROLLED VOCABULARIES - GPSDB
    2.1.3 GENE ONTOLOGY CATEGORIZER
    2.1.4 ONTOLOGY LOOK-UP SERVICE (OLS)
  METHODS
    2.1.5 SPECIES AND GENES NER
    2.1.6 WORDS DISAMBIGUATION
    2.1.7 GN TASK/SCORE FUNCTION
    2.1.8 IMPACT OF THE GENE ONTOLOGY CATEGORIZER
    2.1.9 EVALUATION METHODS
    2.1.10 RESULTS
  COMPARISON WITH BIOCREATIVE III RESULTS
  DISCUSSION
  CONCLUSION
3 Biomedical Curation Systems
  BACKGROUND AND AIMS
  TOXICAT EVOLUTION: FROM TOXICAT I TO TOXICAT II
    3.1.1 RESOURCES
      3.1.1.1 BC-triage data
      3.1.1.2 CTD-track data
    3.1.2 METHODS
      3.1.2.1 Toxicat-I
      3.1.2.2 Toxicat-II or NER-services
    3.1.3 TOXICAT GRAPHICAL USER INTERFACE
    3.1.4 RESULTS
      3.1.4.1 Toxicat-I
      3.1.4.2 Toxicat-II
    3.1.5 DISCUSSION
    3.1.6 CONCLUSION
4 System Functionality and Adequacy
  4.1 BACKGROUND
  KINASE CATEGORIZER SYSTEM - KICAT
  RESOURCES
    4.1.1 HUMAN RESOURCES
    4.1.2 TUNING DATA SET
    4.1.3 EVALUATION TASKS AND TESTING DATA
      4.1.3.1 Classification task – Test set I
      4.1.3.2 Coverage task – Test set II
    4.1.4 SUMMARY
  METHODS AND DESIGN
    4.1.5 SYSTEM DESIGN
      4.1.5.1 Ad-hoc Gene Ontology entity recognition and GOCat module
      4.1.5.2 Protein-Protein Interaction Module
      4.1.5.3 Scoring Function
    4.1.6 GRAPHICAL USER INTERFACE
    4.1.7 WORKSPACE
    4.1.8 EVALUATION OF KICAT SYSTEM
    4.1.9 CLASSIFICATION TASK
    4.1.10 EVALUATION OF KICAT AS A DECISION-SUPPORT TOOL FOR BIOCURATION
  RESULTS
    4.1.11 CLASSIFICATION TASK RESULTS
    4.1.12 EFFICIENCY OF ASSISTED CURATION
    4.1.13 COMPARISON OF TIME AND VOLUME OF THE CLASSIFICATION WORKLOAD
  DISCUSSION
  CONCLUSION
5 Conclusion
  RECOGNITION AND NORMALIZATION OF INFORMATION
  DOCUMENTS CLASSIFICATION AND PRIORITIZATION
    5.1.1 PRIORITISATION OF DOCUMENTS IN TOXICAT-I SYSTEM
    5.1.2 PRIORITISATION OF DOCUMENTS IN KICAT SYSTEM
  TEXT MINING TOOLS TO SUPPORT CURATION
References
Appendix
  RESOURCES
  TEST SET I – KICAT
  TEST SET II – MANUAL CURATION

Abbreviations

BC or BioCreative       Critical Assessment of Information Extraction systems in Biology
CTD                     The Comparative Toxicogenomics Database
EAGLi                   Engine for question-Answering in Genomics Literature
EntrezGene              NCBI's repository for gene-specific information
GN                      Gene Normalization
GO                      Gene Ontology
GOCat                   Gene Ontology Categorizer
GPSDB                   Gene/Protein Synonyms Database
IR                      Information Retrieval
KiCat                   Kinase Categorizer
LingPipe                Tool kit for processing text using computational linguistics
MEDLINE                 Medical Literature Analysis and Retrieval System Online
MeSH                    Medical Subject Headings
NER                     Named-entity recognition(-er)
NLP                     Natural Language Processing
OLS                     Ontology Look-up service
PMID                    PubMed identifier
PPI                     Protein-Protein Interaction
PubMed                  Free search engine primarily accessing MEDLINE
QA                      Question-Answering
SVM                     Support Vector Machine
TAP-k                   Threshold Average Precision
TREC                    Text REtrieval Conference
UMLS                    Unified Medical Language System
UniProtKB/Swiss-Prot    Central source of functional information on proteins

List of Figures

FIGURE 2.1 THE STATISTICS OF ORGANISMS’ NAMES DISTRIBUTION.
FIGURE 2.2 EXAMPLE OF GPSDB OUTPUT.
FIGURE 2.3 AN EXAMPLE OF THE ORGANISM NAME VARIATIONS, TAKEN FROM THE BCIII BENCHMARK.
FIGURE 2.4 THE OUTPUT OF THE OLS ON THE GIVING TERM STREPTOCOCCUS PNEUMONIAE.
FIGURE 2.5 A DOCUMENTS’ WORKFLOW OF NORMAGENE.
FIGURE 2.6 DISTRIBUTION OF FEATURE FACTORS OF THE SCORE FUNCTION.
FIGURE 2.7 COMPARISONS OF NORMAGENE AND BIOCREATIVE RESULTS.
FIGURE 3.1 THE EXAMPLE OF HOW DATA COLLECTION IS REPRESENTED IN XML FORMAT.
FIGURE 3.2 THE TOXICAT’S GENE DETECTION WORKFLOW.
FIGURE 3.3 WORKFLOW OF TOXICAT-I COMPONENTS [VISHNYAKOVA, PASCHE ET AL. 2011].
FIGURE 3.4 THE WEB-BASED GUI OF GOCAT [GOBEILL, PASCHE ET AL. 2013].
FIGURE 3.5 THE RANKED OUTPUT OF THE GOCAT (HTTP://EAGL.UNIGE.CH/GOCAT/).
FIGURE 3.6 THE INPUT STATE OF THE TOXICAT’S WEB-GUI.
FIGURE 3.7 THE TOXICAT’S GUI-OUTPUT.
FIGURE 3.8 EXPANDED-MODE OF GUI.
FIGURE 3.9 THE TOXICAT’S INTEGRATION WITH EAGLI SERVICES.
FIGURE 3.10 INTEGRATION OF TOXICAT SERVICE IN EAGLI.
FIGURE 3.11 AGGREGATED SCORES OF THE BC-TRIAGE TASK PARTICIPANTS [VISHNYAKOVA, PASCHE ET AL. 2011], [WIEGERS, DAVIS ET AL. 2012].
FIGURE 3.12 MAP SCORES OF PARTICIPANTS [WIEGERS, DAVIS ET AL. 2012].
FIGURE 3.13 GENE RECALLS OF BC-TRIAGE TASK PARTICIPANTS [WIEGERS, DAVIS ET AL. 2012].
FIGURE 3.14 CHEMICAL RECALLS OF BC-TRIAGE TASK PARTICIPANTS [WIEGERS, DAVIS ET AL. 2012].
FIGURE 3.15 DISEASE RECALLS OF BC-TRIAGE TASK PARTICIPANTS [WIEGERS, DAVIS ET AL. 2012].
FIGURE 3.16 GENE NER RESULTS SUCH AS RECALL, PRECISION, F-SCORE AND THEIR AVERAGE SCORES OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 3.17 CHEMICAL NER RESULTS SUCH AS RECALL, PRECISION, F-SCORE AND THEIR AVERAGE SCORES OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 3.18 DISEASE NER RESULTS SUCH AS RECALL, PRECISION, F-SCORE AND THEIR AVERAGE SCORES OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 3.19 ACTION TERM NER RESULTS SUCH AS RECALL, PRECISION, F-SCORE AND THEIR AVERAGE SCORES OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 3.20 BALANCED F-SCORES OF NER CATEGORIES AND THEIR AVERAGE SCORES OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 3.21 RESPONSE TIME OF EACH NER WEB-SERVICE AND THEIR COMBINED AVERAGE SCORE OF BCIV-CTD TRACK PARTICIPANTS [WIEGERS T. 2013].
FIGURE 4.1 DISTRIBUTION OF JOURNAL NAMES AND PMIDS ACCORDING TO THE THREE ASPECTUAL CLASSES.
FIGURE 4.2 OVERVIEW OF THE JOURNAL NAMES OVERLAPPING, REGARDING PAIRWISE TOPICS SUCH AS PPI, GO AND DISEASE.
FIGURE 4.3 OVERLAPPING RESPONSES OF DISEASE AXIS.
FIGURE 4.4 OVERLAPPING RESPONSES OF PPI AXIS.
FIGURE 4.5 OVERLAPPING RESPONSES TO ANNOTATE PROTEIN WITH GO DESCRIPTORS.
FIGURE 4.6 EXTRACT OF THE DIRECTED ACYCLIC GRAPH OF GO DESCRIPTORS ASSIGNED BY OUR CURATORS.
FIGURE 4.7 DISTRIBUTION OF GO TERMS IN THE INTERSECTED PMIDS OF CURATOR1 (RED COLOUR) AND CURATOR3 (GREEN COLOUR).
FIGURE 4.8 DISTRIBUTION OF THE ASSIGNED GO TERMS PER PMID
FIGURE 4.9 KICAT COMPONENTS, RESPONSIBLE FOR EACH STEP OF THE BIOCURATION PROCESS: RETRIEVAL, SELECTION, READING, EXTRACTION AND NORMALIZATION.
FIGURE 4.10 THE WORKFLOW OF THE KICAT COMPONENTS.
FIGURE 4.11 FEATURES COMBINATION FOR THE DISEASE AXIS.
FIGURE 4.12 FEATURES COMBINATION FOR THE GO AXIS.
FIGURE 4.13 FEATURES COMBINATION FOR THE PPI AXIS.
FIGURE 4.14 INPUT FIELDS OF KICAT’S GUI.
FIGURE 4.15 KICAT’S LIST OF RESULTS.
FIGURE 4.16 THE EXPANSION OF KICAT RESULTS.
FIGURE 4.17 KICAT’S WORKSPACE.
FIGURE 4.18 KICAT’S DIALOG/RECORDING WINDOW.
FIGURE 4.19 THE CLASSIFICATION CAPACITY IN TIME AND VOLUME.
FIGURE 4.21 COMPARISON OF DOCUMENTS’ VOLUME (ABSTRACTS) CLASSIFIED PER WORK YEAR BY A CURATOR AND KICAT SYSTEM.

List of Tables

TABLE 2.1 SPECIES NAMES DISTRIBUTION IN THE BCIII BENCHMARK.
TABLE 2.2 STATISTICS OF ANNOTATED GENE ID OVER THE SETS OF BC III
TABLE 2.3 DESCRIPTION OF SCORE FUNCTION FEATURES.
TABLE 2.4 COMPARISON OF GENE NORMALIZATION RESULTS WHERE THE SCORE FUNCTION - F1 IS OF TWO TYPES: 1) WITH GOCAT; 2) WITHOUT GOCAT
TABLE 2.5 COMPARISON OF GENE NORMALIZATION RESULTS WHERE THE SCORE FUNCTION - F2 IS OF TWO TYPES: 1) WITH GOCAT 2) WITHOUT GOCAT
TABLE 2.6 THE OFFICIAL NORMAGENE RESULTS ACHIEVED DURING THE BCIII CONTEST.
TABLE 2.7 COMPARISON OF THE BEST-ACHIEVED RESULTS OF BIOCREATIVE III WITH THE NORMAGENE BEST-ACHIEVED RESULTS.
TABLE 3.1 THE INPUT FORMAT OF A TRAINING SET FOR THE BC-TRIAGE TASK.
TABLE 3.2 DISTRIBUTION OF ENTITIES IN BIOCREATIVE-TRIAGE TRAINING SET.
TABLE 3.3 DISTRIBUTION OF CURATED ARTICLES PER CHEMICAL.
TABLE 3.4 COMPONENTS USED IN TOXICAT SYSTEM DESIGN AND THEIR REFERENCES TO THE BIOCURATION STEPS
TABLE 3.5 FEATURE SET OF THE TOXICAT’S SVM CLASSIFIER AND CONTRIBUTION OF EACH FEATURE TO THE F-SCORE
TABLE 3.6 SEMANTIC TYPES EXCLUDED FROM THE DISEASE AND CHEMICAL LIST OF CANDIDATES
TABLE 3.7 RESULTS OF TOXICAT (TEAM 120) FOR THE TASK-I OF BIOCREATIVE 2012
TABLE 3.8 RESULTS OF TOXICAT-II ON BCIV TEST DATA.
TABLE 3.9 MACRO-AVERAGE RECALL (MAR) OF TOXICAT I-II ON DIFFERENT TEST DATASETS.
TABLE 4.1 DISTRIBUTION OF ARTICLES IN THE TRAINING SET. ALL ARTICLES ARE DIVIDED INTO CATEGORIES/TOPICS. EACH SUBJECT HAS THREE DIMENSIONS: GO, PPI, DISEASE.
TABLE 4.2 DISTRIBUTION OF PMIDS PER CONCEPT DIMENSION.
TABLE 4.3 DISTRIBUTION OF RESPONSES PER AXIS IN TEST DATA.
TABLE 4.4 DISTRIBUTION OF POSITIVE RESPONSES FOR EACH AXIS BY EACH CURATOR.
TABLE 4.5 MANUAL CLASSIFICATION RESPONSES ON DISEASE AXIS
TABLE 4.6 MANUAL CLASSIFICATION RESPONSES ON GO AXIS
TABLE 4.7 MANUAL CLASSIFICATION RESPONSES ON PPI AXIS
TABLE 4.8 INTERPRETATION OF K VALUES (INTER-RATER AGREEMENT) ACHIEVED BY PARTICIPATING EXPERTS ON TEST SET I
TABLE 4.9 DISTRIBUTION OF CONCEPTS IN THE DATA SET PERFORMED BY MANUAL CURATION.
TABLE 4.10 DESCRIPTION OF FEATURES FOR DISEASE DIMENSION SCORE FUNCTION.
TABLE 4.11 DESCRIPTION OF FEATURES FOR THE SCORE FUNCTION OF THE GO DIMENSION.
TABLE 4.12 DESCRIPTION OF FEATURES FOR THE SCORE FUNCTION OF THE PPI DIMENSION.
TABLE 4.13 KICAT’S ACCURACY, RECALL, PRECISION AND F SCORE EVALUATIONS ON TEST SET I FOR THE PPI AXIS
TABLE 4.14 KICAT’S ACCURACY, RECALL, PRECISION AND F-SCORE EVALUATIONS ON TEST SET I FOR THE GO AXIS.
TABLE 4.15 KICAT’S RESULTS ON DISEASE AXIS: ACCURACY, RECALL, PRECISION AND F-SCORE EVALUATIONS ON TEST SET I
TABLE 4.16 MACRO-AVERAGE STATISTICS OF (P) PRECISION, (R) RECALL, (F) F SCORE.
TABLE 4.17 DISTRIBUTION OF KICAT-ASSISTED CURATED GO TERMS PER CURATOR.
TABLE 4.18 IMPROVEMENTS MEASURED PER EXPERT IN DISEASE TERMS COVERAGE.
TABLE 4.19 IMPROVEMENTS MEASURED PER EXPERT IN GO TERMS COVERAGE.
TABLE 4.20 GO TERMS CURATION IMPROVEMENTS PER EXPERT
TABLE 4.21 CLASSIFICATION PERFORMANCE

Introduction

A recurrent theme in the daily (professional) life of a biocurator is the quest for trustworthiness, or the lack thereof, when working with text-mining applications.

Nowadays, almost all documents produced are available in electronic format and are distributed through different information channels. The information in these documents is typically available as free text, which is easily understood by humans but difficult for machines to interpret. Because of the increasing number of electronically available publications stored in databases and digital libraries such as MEDLINE, there is growing interest in text mining technology. Such technology is broadly applied to a wide variety of research and business needs.

Biomedical text mining is a rather recent research and development field, at the convergence of natural language processing, data mining, and bio- and medical informatics. As a consequence, there is increasing interest in developing systems and methods that are able to detect and normalize entities, in order to build knowledge repositories and provide accurate, well-structured information on demand.

Furthermore, the attention of the biomedical community has been moving towards research in document retrieval, information extraction and automated text mining. In these fields, different research groups have each addressed a different problem, often reporting results on a different data set. Because of this diversity of systems, it was challenging to assess, evaluate and select the most suitable applications and methods for a given task. Consequently, domain-specific evaluation and assessment campaigns such as BioCreative [Hirschman, Yeh et al. 2005] and TREC [Voorhees and Harman 2005] emerged in the information retrieval and bioinformatics communities.

BioCreative aims to provide a state-of-the-art set of common evaluation tasks for text mining applied to biological problems, although, surprisingly, the biocuration of protein functions has usually been regarded as a task unlikely to be performed by machines. Certainly, it is not yet time for a computer to read and extract all the relevant data on its own. However, campaigns such as BioCreative show the readiness of the community to apply automated text mining systems to real or quasi-real tasks.

The success of the campaign tasks shows the progress in application development, from Named Entity Recognition (NER) systems to user-friendly interfaces for text-mining systems. It should be noted, however, that most of the developed methods are strictly concerned with the recognition, organization and retrieval of information. Semantics depends on the domain of interest, which makes both the curation process and information retrieval challenging, so that developments are often too specialized to be reusable. Moreover, well-known search engines offer only simple representation levels of the results (e.g. at the level of a library notice) and usually provide overly generic metadata (e.g. authors, journals, dates of publication, ideally indexing keywords…), which then must be carefully read. The lack of data at the source (authors, publishers…) dramatically affects the tasks related to literature curation in general, and especially in the life and health sciences, which require careful examination by domain experts. As a consequence, experts lack suitable systems to support and speed up the process of curation.

Evolution of Biomedical NLP techniques

Natural Language Processing (NLP) techniques now have more power than ever to provide their users with efficient insights into large datasets. To judge the evolution of Biomedical NLP (BioNLP) techniques, as well as their success and prevalence, it is useful to split the field into subtopics or trends.

The analysis in [Thamrongrattanarit, Shafir et al. 2012] provides an overview of trends within the subdisciplines of the BioNLP literature. This analysis revealed only small changes in topic frequency over time, which can be explained by the relative youth of the BioNLP field. However, the authors identified a few topics that have undergone notable changes in the past decade. Among those are three topics that gained popularity recently: event extraction, event triggers and their categorization. At the same time, [Thamrongrattanarit, Shafir et al. 2012] pointed out that two topics that were highly popular in the early years of BioNLP are much less popular today. These topics are named-entity recognition (sharp drop in publications after 2004)¹ and protein interaction (decreasing interest since 2005). The most likely explanation is that these tasks reached their optimum with the early BioCreative campaigns, whereas since 2011 the tracks have become more complex and user-oriented [Lu and Hirschman 2012], [Wiegers, Davis et al. 2012].

Problem outline/Background

It has been argued that advances in NLP applied to the life sciences have been only remotely connected with the real needs of end users [Hirschman, Burns et al. 2012]. In general, few text-mining systems have succeeded in supporting biocuration, with the notable exception of TextPresso [Müller, Kenny et al. 2004]. We believe it is time to address the current obstacles to using text mining for biocuration. Moving forward also means developing systems that meet user requirements and advancing standards by re-using and (re-)integrating text-mining tools.

¹ Trend analysis was done at http://albator.hesge.ch/medlinetrends/

The objective of this thesis is to develop a successful text mining system to assist the biocurator. The system should meet the needs of biocurators by automating the curation process. The idea is to draft a unified system able to search the relevant literature, prioritize it and finally support curation along multiple qualitative dimensions. We intend to push the model beyond named-entity recognition as performed by TextPresso and, above all, we hope to explore how text mining systems can be integrated as deeply as possible with curation workflows and instruments.

Research Hypotheses

Starting with the needs of the biomedical community and taking into account the lack of suitable curation systems, the current PhD thesis puts forward the following three hypotheses about what tentatively defines an optimal automation system:

Hypothesis 1: Automation systems are capable of improving the quality of biocuration.

While working in the biomedical domain, we often use Web content to obtain information or data sets that meet our needs. Many biological databases have been created over the last 20 years. Behind all these data and information resources there are real people who organize this information and put it into formats that allow users to work with it easily.

As the volume of biological literature increases, biocurators need help in keeping up with it. Automated or semi-automated systems for biocuration support would be an ideal way to accelerate traditional curation, or even to provide biologists and experimentalists with instruments to make sense of the literature.

However, most of the resources are highly heterogeneous, task-specific and multi-format, and are not easily accessible or adaptable [Neves and Leser 2012]. Many text-mining tools are freely available online but were not developed for the purpose of annotation.

Reviews of biocurators’ work and needs (see [Howe, Costanzo et al. 2008] and, more recently, [Li, Liakata et al. 2013]) outline the following key problems, which influence the quality of the biocuration process and which have to be considered in a supportive system:

• Connection of information from different sources in a rational and comprehensible way;

• Development and management of controlled vocabularies that are crucial to build data relationships;

• Ability to inspect and correct entities and their respective associations so that inconsistencies or errors can be corrected;

• Integration of knowledge bases for representing complex data e.g. protein interactions.

Although automated systems are not able to replace domain experts in discovering biological relationships, they are able to provide evidence about possible action terms and to help experts hypothesize these relationships. [Li, Liakata et al. 2013] give the following example, which illustrates the above-mentioned gaps: when an automated text-mining system finds a molecular relation in a document and this relation is not referenced in a knowledge base, it will most likely be considered either a mistake or an unknown relationship. Nevertheless, such cases may point to the existence of a new relationship, which in turn holds the potential of expressing some biological novelty. Although the human by definition establishes the ground truth, it is therefore important to construct experimental settings directly with the end-users instead of recycling database contents.

Hypothesis 2: Automated systems are capable of speeding up the biocuration process.

As already mentioned, behind high-quality data there are domain experts (the so-called “mechanical Turks”) who do the work. The human factor is always the bottleneck of biocuration work, making the whole process time-consuming, extremely labour-intensive and therefore expensive. Text-mining tools can thus potentially speed up the curation process, provided that their efficiency and effectiveness are adequate for the workflow in which they must operate.

The literature reviews describing biocuration needs [Alex, Grover et al. 2008], [Voorhees and Harman 2005], [Howe, Costanzo et al. 2008], [Lu and Hirschman 2012] showed that an automated system capable of supporting biocuration has to be able to answer the following questions, which underlie Hypothesis 2:

1) Can an automated system be good enough to assist curators in keeping up with the overflowing data?

2) Is an automated system able to solve (some of) the bottlenecks of a curation process?

In order to provide answers, it is necessary to quantify the curators’ work. According to [Wiegers, Davis et al. 2009], it was easy for biocurators to identify articles not appropriate for the curation of the Comparative Toxicogenomics Database (CTD). It is estimated that CTD biocurators spent only 7% of their time on this task (on average 2.5 min per rejected article and 21 min for a curatable article), with 40% of articles labelled as ‘not appropriate’. Obviously, the time saving depends greatly on the ratio of curatable to non-curatable documents in a given set [Lu and Hirschman 2012], hence the importance of the initial search/fetch steps, which must be able to take into account relevance as well as date-related information. Thus, in situations where it is difficult and time-consuming to identify papers with curatable content, document-ranking tools can be extremely valuable, as can systems able to take novelty into account.
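The figures quoted above from [Wiegers, Davis et al. 2009] can be cross-checked with a short calculation. The sketch below is only an illustration of the arithmetic: the function name and the second, more selective scenario (90% of articles rejected) are our own assumptions and are not taken from the cited study.

```python
# Figures quoted in the text from [Wiegers, Davis et al. 2009].
MIN_PER_REJECTED = 2.5    # minutes spent on an article judged 'not appropriate'
MIN_PER_CURATABLE = 21.0  # minutes spent on a curatable article
REJECTED_FRACTION = 0.40  # share of articles labelled 'not appropriate'


def triage_time_share(rejected_fraction, min_rejected, min_curatable):
    """Fraction of the total curation time that is spent rejecting articles."""
    rejected_time = rejected_fraction * min_rejected
    curatable_time = (1.0 - rejected_fraction) * min_curatable
    return rejected_time / (rejected_time + curatable_time)


# Scenario reported for CTD.
ctd_share = triage_time_share(REJECTED_FRACTION, MIN_PER_REJECTED, MIN_PER_CURATABLE)

# Hypothetical scenario (assumption): a noisier collection, 90% of articles rejected.
noisy_share = triage_time_share(0.90, MIN_PER_REJECTED, MIN_PER_CURATABLE)

print(f"CTD scenario:   {ctd_share:.1%} of time spent on rejection")
print(f"Noisy scenario: {noisy_share:.1%} of time spent on rejection")
```

With the reported values the share is 1.0 / 13.6, which rounds to the 7% quoted above. In the hypothetical noisier collection, rejection takes up more than half of the curators' time, which illustrates why the ratio of curatable to non-curatable documents matters so much.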

The statistics in [Alex, Grover et al. 2008] and [Névéol, Islamaj Doğan et al. 2011] provide information about the processing time and show that manual curation is more time-consuming than assisted curation. It may seem obvious that assisted curation speeds up the work of a curator, but the gain is in fact relatively modest: it is estimated that a maximum reduction of 1/3 in manual curation time could be expected if a Natural Language Processing (NLP) pipeline with perfect accuracy were available, and perfect systems are not on the shelf yet.

Hypothesis 3: Altogether, automated systems are capable of improving the productivity of biocurators.

The analysis of the biocurators’ work revealed several commonalities in the curation process. These commonalities can be conventionally split into three typical steps of a biocuration process [Hirschman, Burns et al. 2012], [Vishnyakova, Pasche et al. 2011]:

1) Retrieval of documents for the curation;

2) Selection of documents with relevant entities;

3) Detailed curation of specific relations.

To our knowledge, there is no common computerized solution that covers these steps in a single workflow to support biocurators. Existing solutions such as Google, PubMed, EBIMed [Rebholz-Schuhmann, Kirsch et al. 2007], EAGLi [Gobeill, Pasche et al. 2012], [Ruch 2006], [Zhou, Smalheiser et al. 2006] offer a generic solution for step 1, where biocurators have to construct a particular query for a particular repository to retrieve documents; step 2 is almost always in the hands of professional biologists, who manually process the set of documents returned by step 1; step 3 may involve an automated annotation system, usually provided by third-party applications, which is sometimes poorly connected to, or even inconsistent with, the previous steps.

An analysis of the activity of biocurators in [Névéol, Islamaj Doğan et al. 2011] revealed that manual curation is less productive in identifying information than the same task performed with NLP assistance (+16.5% records). This suggests that NLP assistance helps to reach better annotation coverage.

By bringing together into a pipeline an information retrieval engine, interactive access to knowledge bases and task-specific NLP techniques, it should be possible to improve the overall productivity (efficiency, effectiveness, consistency, …) of the biocurators.
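As a purely illustrative sketch, the three steps listed earlier (retrieval, selection, detailed curation) could be chained in a single workflow as shown below. Everything in this example is an assumption made for illustration: the `Document` class, the three function names, the naive substring matching and the toy collection are hypothetical and do not describe the systems developed in this thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    entities: list = field(default_factory=list)


def retrieve(collection, query):
    """Step 1: keep documents whose text mentions every query term."""
    terms = query.lower().split()
    return [d for d in collection if all(t in d.text.lower() for t in terms)]


def select(documents, lexicon):
    """Step 2: keep documents that mention at least one entity of interest."""
    selected = []
    for doc in documents:
        lowered = doc.text.lower()
        found = [e for e in lexicon if e.lower() in lowered]
        if found:
            selected.append(Document(doc.doc_id, doc.text, found))
    return selected


def curate(documents, action_terms):
    """Step 3: propose (doc_id, entity, action term) candidates for expert review."""
    candidates = []
    for doc in documents:
        lowered = doc.text.lower()
        for term in action_terms:
            if term in lowered:
                candidates.extend((doc.doc_id, e, term) for e in doc.entities)
    return candidates


# Toy collection (invented sentences, for illustration only).
collection = [
    Document("d1", "Arsenic increases expression of TP53 in liver cells."),
    Document("d2", "A survey of hospital staffing levels."),
    Document("d3", "Expression profiling methods reviewed."),
]

hits = retrieve(collection, "expression")
shortlist = select(hits, ["TP53", "arsenic"])
candidates = curate(shortlist, ["increases"])
print(candidates)
```

The output of each step is the input of the next, so the expert only has to review the final candidates. A real pipeline would replace each naive function with a search engine, an entity recognizer and a relation extractor, respectively.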

Thesis Outline

This thesis is organized as follows:

Chapter 1 provides a general survey (the state of the art) of the existing information resources, biological Natural Language Processing techniques, information retrieval systems, evaluation methods and challenges. In this chapter we also provide a definition of a biocuration workflow, as well as an overview of previous investigations into assisted curation.

Chapter 2 describes the development and the evaluation of an original gene/protein-entity recognition and normalisation system designed to identify a wide range of biological entities. This system exploits well-known techniques of Named Entity Recognition (NER) along with systems that were not originally designed for entity recognition (e.g. functional annotation of proteins with Gene Ontology categories). Further, we compare the achieved results with the results obtained during the BioCreative III campaign.

Chapter 3 focuses on the development of a text-mining pipeline for a specific database, the Comparative Toxicogenomics Database (CTD). Moreover, we describe the add-on web services of this pipeline, which are based on NER techniques. We first define the steps of the biocuration workflow and then construct the pipeline to cover these analytical steps. Further, we compare the measured results with the results derived from the BioCreative’12 and BioCreative IV campaigns. Additionally, we describe an original Graphical User Interface (GUI), which manages all steps of the defined workflow.

Chapter 4 reports on the design and evaluation of a user-centric curation pipeline designed for a specific database, the neXtProt resource, maintained by the CALIPHO² group. First, we focus on the topic-dependent representation of results to end-users. We then propose a user workspace based on a light web GUI. Finally, we report on the evaluation conducted by trained curators of the CALIPHO group.

2 http://www.isb-sib.ch/groups/geneva/calipho-bairoch.html


1 State of the art

1.1 Biomedical Data/Resources

“We are drowning in information but starved for knowledge.”

John Naisbitt

The primary resource for exchanging knowledge-intensive biological content is obviously text. Every year, the amount of electronic biomedical resources available as text (literature, narratives in health records, patents, web…) grows at an increasing speed [Wilczynski, McKibbon et al. 2013], [Pathak, Kho et al. 2013], [Pasche, Gobeill et al. 2014].

Some of these documents are stored in digital libraries. It is thus possible to extract a collection from a biomedical literature database such as MEDLINE [Lindberg, Siegel et al. 1993], [Lindberg, Humphreys et al. 1993], PubMed Central [Roberts 2001] or, for instance, Cochrane [Collaboration 2000]. For example, the original concept of literature-based discovery was further facilitated by the controlled vocabulary terms added to bibliographic citations when indexing MEDLINE. The development of such resources has triggered the development of interchangeable annotation formats, guidelines and standards.

Corpora

In linguistics, ‘corpora’ is the plural form of ‘corpus’ (or text corpus), which is a large and structured set of texts.

The most commonly used resource for biomedical text mining is by far MEDLINE. This bibliographic database contains references to journal articles focused mainly on health and life sciences in a broad sense. It is maintained by the U.S. National Library of Medicine (NLM) and, to date, contains over 21 million references. Its oldest publications date from 1809.

It is possible to obtain abstracts of documents from MEDLINE via several channels. The majority of text-mining solutions rely on PubMed, from which one can download a set of records by using PubMed tools such as the Entrez Programming Utilities [Wheeler, Barrett et al. 2007]. Another way to obtain records from MEDLINE is to use corpora provided by evaluation campaigns such as TREC, BioCreative, etc. It should be noted that TREC collections are usually larger than the collections provided by other competitions.
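To make the first channel more concrete, the sketch below builds request URLs for two Entrez Programming Utilities endpoints, `esearch.fcgi` (search for record identifiers) and `efetch.fcgi` (download records). It is a minimal illustration rather than a description of the tools used in this thesis: the helper names, the example query and the identifiers are hypothetical placeholders, and the actual network call is deliberately left out.

```python
from urllib.parse import urlencode

# Base address of the NCBI Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL returning up to `retmax` PubMed identifiers for `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL returning the abstracts of the given PubMed identifiers."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical query and placeholder identifiers, for illustration only.
search = esearch_url("toxicogenomics arsenic", retmax=5)
fetch = efetch_url(["123", "456"])
print(search)
print(fetch)
```

The resulting URLs can be opened with any HTTP client, for instance `urllib.request.urlopen`. Large downloads should respect the usage guidelines published by NCBI.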