7.6.4 Standardization of the annotations in literature

The many challenges we faced in extracting this information from the literature also changed our perception of the way information is provided to the scientific community.

On the one hand, we realized that automatic methods are still far from handling all the problems involved in the PPI extraction process as accurately as a human can. On the other hand, we also noticed that concentrating all the curation work on the database curators is not a scalable solution, given the amount of literature that must be processed (Baumgartner, Cohen et al. 2007). Consequently, it is worth considering alternative ways to address the annotation problem, for instance by shifting part of the effort to the authors and giving them some curation responsibilities. Indeed, no one is better placed than the authors themselves to identify and annotate the information of interest in their papers. Unfortunately, involving the authors in integrating controlled information into their articles is not straightforward. It requires adopting strong standards for the way the information is structured and stored in the documents. Some researchers have already proposed solutions (Rebholz-Schuhmann, Kirsch et al. 2006; Seringhaus and Gerstein 2007; Ceol, Chatr-Aryamontri et al. 2008; Rebholz-Schuhmann and Nenadic 2008), but they all require closer cooperation among all the actors of the publishing microcosm.
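To make the idea of author-supplied, structured annotation more concrete, the following is a minimal sketch of what a machine-readable interaction record could look like. It is not taken from any of the proposals cited above: the class name `InteractionAnnotation`, its fields, the helper `to_json` and the example values are all hypothetical and serve only as an illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InteractionAnnotation:
    """One author-supplied interaction statement (hypothetical schema)."""
    interactor_a: str      # database accession of the first protein
    interactor_b: str      # database accession of the second protein
    detection_method: str  # controlled-vocabulary term for the experiment
    sentence: str          # sentence of the article supporting the claim

    def validate(self) -> None:
        # Reject records with empty fields so that curators never
        # receive partially filled annotations.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"missing value for field '{name}'")


def to_json(annotations):
    """Validate the annotations and serialize them to a JSON string."""
    for annotation in annotations:
        annotation.validate()
    return json.dumps([asdict(a) for a in annotations], indent=2, sort_keys=True)


# Illustrative record only; the values are placeholders for what an
# author might enter at submission time.
example = InteractionAnnotation(
    interactor_a="P04637",
    interactor_b="Q00987",
    detection_method="two hybrid",
    sentence="p53 binds MDM2 in a yeast two-hybrid assay.",
)
digest = to_json([example])
```

A record of this kind would have to be agreed upon by authors, publishers and database curators alike, which is precisely the cooperation issue raised above.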

Figures

FIGURE 1 COMPLETE PROTEIN-PROTEIN EXTRACTION PROCESS

FIGURE 2 PROPORTION OF POSITIVE AND NEGATIVE INSTANCES IN THE TWO DATA SETS

FIGURE 3 DISTRIBUTION OF THE TWO CLASSES GIVEN NUMBER OF MESH TERMS FOR THE TRAINING SET

FIGURE 4 DISTRIBUTION OF THE NUMBER OF MESH TERMS IN THE TWO CLASSES FOR THE TEST SET

FIGURE 5 COVERAGE AND AVERAGE ACCURACY OBTAINED BY CLASSIFYING DOCUMENTS OF THE TRAINING SET USING THE MOST REPRESENTATIVE CLASS OF THE MESH TERMS

FIGURE 6 COVERAGE AND AVERAGE ACCURACY OBTAINED BY CLASSIFYING THE DOCUMENTS OF THE TEST SET USING THE MOST REPRESENTATIVE CLASS OF THE MESH TERMS ON THE TEST SET

FIGURE 7 COVERAGE AND AVERAGE ACCURACY OBTAINED BY CLASSIFYING THE DOCUMENTS OF THE TEST SET USING THE MOST REPRESENTATIVE CLASS OF THE MESH TERMS ON THE TRAINING SET

FIGURE 8 PROPORTION OF POSITIVE AND NEGATIVE INSTANCES GIVEN PROTEINS COUNT

FIGURE 9 PROPORTION OF POSITIVE AND NEGATIVE INSTANCES GIVEN THE INTERACTORS COUNT

FIGURE 10 ACCURACY GIVEN C AND EPSILON PARAMETER VARIATION ON THE TRAINING SET

FIGURE 11 RECALL AND PRECISION GRAPH OF THE DIFFERENT APPROACHES USED AT BIOCREATIVE II

FIGURE 12 ACCURACY OBTAINED BY THE DIFFERENT TEAMS IN BIOCREATIVE II

FIGURE 13 NUMBER OF GENES PER DOCUMENT

FIGURE 14 NUMBER OF WORDS COMPOSING GENE NAMES IN THE TRAINING SET

FIGURE 15 COMMON AND EXCLUSIVE WORDS IN THE TRAINING AND TEST SET OF THE GENE MENTION TASK

FIGURE 16 NUMBER OF TERMS PER UNIQUE ID IN GPSDB

FIGURE 17 FREQUENCIES OF THE WORDS IN THE EXPERT CURATORS AND WALL STREET JOURNAL LEXICON

FIGURE 18 PROPORTION OF WORDS IN THE LEXICON GIVEN THE PERCENTAGE OF UPPERCASE LETTERS

FIGURE 19 PROPORTION OF WORDS IN THE LEXICON GIVEN THE PERCENTAGE OF DIGITS

FIGURE 20 PERFORMANCE VARIATION GIVEN DIFFERENT TEMPLATE PARAMETERS

FIGURE 21 F-MEASURE EVOLUTION GIVEN THE VARIATION OF COST, ETA AND MAX ITER

FIGURE 22 COMPARISON OF THE DIFFERENT APPROACHES

FIGURE 23 RECALL AND PRECISION OBTAINED FOR THE RUNS OF THE DIFFERENT GROUPS

FIGURE 24 F-MEASURE OBTAINED BY THE DIFFERENT GROUPS DURING BIOCREATIVE II

FIGURE 25 NUMBER OF TERMS RELATED TO AN ID

FIGURE 26 NUMBER OF WORDS PER TERM CONTAINED IN ENTREZGENE

FIGURE 27 NUMBER OF WORD COMBINATIONS PRODUCED GIVEN THE NUMBER OF SELECTED WORDS FOUND IN SEQUENCE

FIGURE 28 RELATIVE FREQUENCIES OF THE SHARED TERMS BETWEEN THE ENTREZGENE AND THE WALL STREET JOURNAL CORPUSES

FIGURE 29 WORDS AND TERMS PRODUCED BY THE DIFFERENT VARIANT GENERATION TECHNIQUES

FIGURE 30 LEVENSHTEIN DISTANCE COMPUTED BETWEEN IFI-17 AND IFI16

FIGURE 31 PERFORMANCE OBTAINED GIVEN THE MINIMUM RELATIVE FREQUENCY OF THE SHARED WORDS

FIGURE 32 EVOLUTION OF THE PERFORMANCE GIVEN THE MINIMUM SIMILARITY LEVEL REQUIRED TO PERFORM A MATCH

FIGURE 33 COMPARISON OF THE PERFORMANCE OF THE DIFFERENT APPROACHES IN BIOCREATIVE II

FIGURE 34 COMPONENTS OF THE PROTEIN-PROTEIN INTERACTIONS EXTRACTION PROCESS

FIGURE 35 NUMBER OF INTERACTIONS PER DOCUMENT IN THE TRAINING SET

FIGURE 36 PROPORTION OF ENTREZGENE TERMS FOUND IN GPSDB

FIGURE 37 NUMBER OF TERMS PER ID GIVEN THE GPSDB AND ENTREZGENE LEXICON

FIGURE 38 CHI2 FEATURE SELECTION SCORE OF THE TRANSITIVE VERBS BASED ON THE DATA OF THE BINARY CLASSIFICATION TASK

FIGURE 39 CUMULATIVE FREQUENCY OF THE MOST COMMON SPECIES IN GPSDB

FIGURE 40 NUMBER OF SPECIES INVOLVED IN ONE INTERACTION

FIGURE 41 DISTRIBUTION OF THE NUMBER OF RETURNED PROTEIN MENTIONS PER DOCUMENT GIVEN THE EXTRACTION METHOD

FIGURE 42 DISTRIBUTION OF THE INITIAL SET OF PROTEINS VS. THE PROTEINS BELONGING TO AN INTERACTION PATTERN GIVEN THE SELECTION USING FUZZY METRIC

FIGURE 43 NUMBER OF SPECIES FOUND IN DOCUMENTS

FIGURE 44 NUMBER OF POSSIBLE SPECIES PER DOCUMENT REGARDING THE IDENTIFIED PROTEIN NAMES

FIGURE 45 PROPORTION OF POSITIVE AND NEGATIVE INTERACTIONS CREATED DEPENDING ON THE TRANSITIVE VERB CHOSEN AS TRIGGER

FIGURE 46 PERFORMANCE ON SPECIES RETRIEVAL GIVEN THE NUMBER OF RETURNED SPECIES

FIGURE 47 PERFORMANCE ON PROTEIN MENTIONS IDENTIFICATION GIVEN THE NUMBER OF RETURNED SPECIES

FIGURE 48 DIFFERENCE IN RECALL DEPENDING ON WHETHER THE CANDIDATE PROTEINS ARE FILTERED GIVEN THEIR BELONGING TO AN INTERACTION PATTERN

FIGURE 49 COMPARISON OF THE RECALL AND PRECISION ON PROTEIN NORMALIZATION GIVEN THE NUMBER OF RETURNED SPECIES BETWEEN THE DICTIONARY METHOD AND THE USE OF FUZZY MAPPING

FIGURE 50 PERFORMANCE COMPARISONS OF THE DIFFERENT APPROACHES IN BIOCREATIVE II

Tables

TABLE 1 UMLS SEMANTIC TYPES

TABLE 2 DATABASES EMPLOYED TO POPULATE GPSDB

TABLE 3 SIZE OF THE FEATURE SPACE OF THE TRAINING SET WITH AND WITHOUT STEMMING

TABLE 4 THE 68 WORDS CONSIDERED AS INTERACTORS

TABLE 5 NUMBER OF FEATURES USED TO REPRESENT THE DOCUMENTS IN THE SPECIFIC FEATURE SPACE

TABLE 6 PERFORMANCE OBTAINED BY USING OR NOT A COST MATRIX FOR THE CLASSIFICATION

TABLE 7 NUMBER OF SELECTED FEATURES ON THE TRAINING AND TEST SET FOLLOWING THE FEATURE SELECTION PROCESS

TABLE 8 ACCURACY AND F-MEASURE OBTAINED BY A CLASSIFICATION ON THE FEATURE SPACE OBTAINED FOLLOWING THE FEATURE SELECTION PROCESS

TABLE 9 ACCURACY AND F-MEASURE OBTAINED BY A CLASSIFICATION (LIBSVM C=3, GAMMA=0.009) ON THE FEATURES WEIGHTED USING THE GIVEN WEIGHTING SCHEMA

TABLE 10 SUMMARY OF THE MACRO AVERAGED PERFORMANCE ON THE TRAINING SET

TABLE 11 SUMMARY OF THE MACRO AVERAGED PERFORMANCE ON THE TEST SET

TABLE 12 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 13 CLASSIFICATION PERFORMANCE ON THE TRAINING SET BASED ON THE PROTEIN COUNT AND INTERACTORS IN THE DOCUMENTS

TABLE 14 CLASSIFICATION PERFORMANCE ON TEST SET BASED ON THE PROTEIN COUNT AND INTERACTORS IN THE DOCUMENTS

TABLE 15 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 16 CLASSIFICATION PERFORMANCES ON THE TRAINING SET BASED ON THE MIXTURE OF MESH TERM AND SEMANTIC TYPE FEATURES

TABLE 17 CLASSIFICATION PERFORMANCES ON TEST SET BASED ON THE MIXTURE OF MESH TERM AND SEMANTIC TYPE FEATURES

TABLE 18 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 19 CLASSIFICATION PERFORMANCES ON THE TRAINING SET BASED ON THE MIXTURE OF MESH TERM AND INTERACTORS FEATURES

TABLE 20 CLASSIFICATION PERFORMANCES ON TEST SET BASED ON THE MIXTURE OF MESH TERM AND INTERACTORS FEATURES

TABLE 21 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 22 CLASSIFICATION PERFORMANCES ON THE TRAINING SET BASED ON THE MIXTURE OF MESH TERM AND TEXTUAL FEATURES

TABLE 23 CLASSIFICATION PERFORMANCES ON TEST SET BASED ON THE MIXTURE OF MESH TERM AND TEXTUAL FEATURES

TABLE 24 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 25 CLASSIFICATION PERFORMANCES ON THE TRAINING SET BASED ON THE MIXTURE OF PROTEIN COUNT AND TEXTUAL FEATURES

TABLE 26 CLASSIFICATION PERFORMANCES ON TEST SET BASED ON THE MIXTURE OF PROTEIN COUNT AND TEXTUAL FEATURES

TABLE 27 SUMMARY OF THE ACCURACY ON TRAINING AND TEST SET USING THE DIFFERENT FEATURES FOR THE MODEL

TABLE 28 ORTHOGRAPHIC FEATURES

TABLE 29 SYNTACTICAL FEATURES

TABLE 30 LEXICAL FEATURES

TABLE 31 CLASSIFICATION OF THE SENTENCES OF THE TEST SET USING CROSS VALIDATION

TABLE 32 PERFORMANCE ON THE TRUE INSTANCES GIVEN DIFFERENT MISCLASSIFICATION COSTS USING CROSS VALIDATION

TABLE 33 CLASSIFICATION OF THE SENTENCES OF THE TEST SET USING THE MODEL BUILT ON THE TRAINING SET

TABLE 34 PERFORMANCE ON THE TRUE INSTANCES GIVEN DIFFERENT MISCLASSIFICATION COSTS ON THE TEST SET

TABLE 35 COMPOSITION OF THE DIFFERENT SENTENCE SETS

TABLE 36 PERFORMANCE MEASURES OBTAINED BY EXTRACTING THE NON-"COMMON" ENGLISH WORDS

TABLE 37 PERFORMANCE MEASURES OBTAINED BY EXTRACTING ONLY THE "UNCOMMON" ENGLISH TERMS

TABLE 38 CLASSIFICATION RESULTS USING ORTHOGRAPHIC FEATURES

TABLE 39 PERFORMANCE MEASUREMENT WITH THE USE OF SYNTACTICAL FEATURES

TABLE 40 CLASSIFICATION PERFORMANCE ON THE TEST SET USING DIFFERENT CONTROLLED VOCABULARIES

TABLE 41 CLASSIFICATION PERFORMANCE OBTAINED BY MIXING THE FEATURES OF ALL CORPUSES

TABLE 42 SUMMARY OF THE PERFORMANCE OBTAINED ON THE TEST SET GIVEN THE DIFFERENT APPROACHES

TABLE 43 PERFORMANCE OBTAINED IN PROTEIN/GENE RECOGNITION USING CRF WITH MORPHOLOGICAL FEATURES

TABLE 44 NUMBER OF CANDIDATE WORDS GIVEN THE DIFFERENT CANDIDATE GENERATION APPROACHES

TABLE 45 FREQUENCY OF THE DIFFERENT PATTERNS IN THE ENTREZGENE LEXICON

TABLE 46 NUMBER OF GENERATED TERMS AND WORDS GIVEN THE VOCABULARY VARIATION

TABLE 47 PERFORMANCE OBTAINED ON THE TRAINING SET DEPENDING ON THE METHOD EMPLOYED TO EXTRACT THE CANDIDATES FROM THE ARTICLES

TABLE 48 PERFORMANCE OBTAINED ON THE TEST SET DEPENDING ON THE METHOD EMPLOYED TO EXTRACT THE CANDIDATES FROM THE ARTICLES

TABLE 49 PERFORMANCE OBTAINED ON THE TRAINING SET GIVEN DIFFERENT VARIANT LEXICONS

TABLE 50 RESULT BY KEEPING ALL CANDIDATES

TABLE 51 SELECTED VERBS REQUIRED TO TRIGGER AN INTERACTION

TABLE 52 PERFORMANCE ON PPI EXTRACTION GIVEN THE TECHNIQUES

References

DIP Link.

EBI Link.

"Expasy - SwissProt and TrEMBL."

"The Gene Ontology categorizer aims at helping functional annotation of proteins."

Adar, E. (2004). "SaRAD: a Simple and Robust Abbreviation Dictionary." Bioinformatics 20(4).

Akbani, R., S. Kwek, et al. (2004). Applying Support Vector Machines to Imbalanced Datasets. ECML 2004, Pisa, Italy, Springer Berlin.

Akgöbeka, Ö., Y. S. Aydinb, et al. (2006). "A new algorithm for automatic knowledge acquisition in inductive learning." Knowledge-Based Systems 19(6): 388-395.

Al-Mubaid, H. and R. K. Singh (2005). "A new text mining approach for finding protein-to-disease associations." American Journal of Biochemistry and Biotechnology.

Alberts, B. and R. Miake-Lye (1992). "Unscrambling the puzzle of biological machines: the importance of the details." Cell 68(3): 415-420.

Ananiadou, S., C. Friedman, et al. (2004). "Introduction: named entity recognition in biomedicine." Journal of Biomedical Informatics 37(6): 393-395.

Andrade, M. and A. Valencia (1997). Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system. International Conference on Intelligent Systems for Molecular Biology, AAAI Press.

Attwood, T. K., P. Bradley, et al. (2003). "PRINTS and its automatic supplement, prePRINTS." Nucleic Acids Research 31(1): 400-402.

Auerbach, D., M. Fetchko, et al. (2003). "Proteomic approaches for generating comprehensive protein interaction maps." TARGETS 2(3): 85-92.

Augen, J. (2001). "Information technology to the rescue!" Nature Biotechnology 19.

Bader, G., I. Donaldson, et al. (2001). "BIND--The Biomolecular Interaction Network Database." Nucleic Acids Research (29).

Bairoch, A. and R. Apweiler (2000). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2000." Nucleic Acids Research 28(1): 45-48.

Bairoch, A., B. Boeckmann, et al. (2004). "Swiss-Prot: Juggling between evolution and stability." Briefings in Bioinformatics 5(1): 39-55.

Baker, L. D. and A. K. McCallum (1998). Distributional clustering of words for text classification. Annual ACM Conference on Research and Development in Information Retrieval, Melbourne, Australia.

Baumgartner, W. A., B. Cohen, et al. (2007). "Manual curation is not sufficient for annotation of genomic databases." Bioinformatics 23: 41-48.

Bernardi, L., E. Ratsch, et al. (2002). "Mining information for functional genomics." IEEE Intelligent Systems 17(3): 66-80.

Blaschke, C., M. Andrade, et al. (1999). Automatic extraction of biological information from scientific text: protein-protein interactions. International Conference on Intelligent Systems for Molecular Biology.

Blaschke, C., L. Hirschman, et al. (2002). "Information extraction in molecular biology." Briefings in Bioinformatics 3(2): 154-165.

Blaschke, C. and A. Valencia (2001). "Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study." Comparative and Functional Genomics 2(4): 196-206.

Blaschke, C. and A. Valencia (2002). "The Frame-Based Module of the SUISEKI Information Extraction System." IEEE Intelligent Systems 17(2).

Blumer, A., A. Ehrenfeucht, et al. (1987). "Occam's Razor." Information Processing Letters 24(6): 377-380.

Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003." Nucleic Acids Research 31(1): 365-370.

Bourne, P. E. and J. McEntyre (2006). "Biocurators: Contributors to the World of Science." PLoS Computational Biology 2(10).

Bowers, P. M., M. Pellegrini, et al. (2004). "Prolinks: a database of protein functional linkages derived from coevolution." Genome Biology.

Brank, J., M. Grobelnik, et al. (2003). Training text classifiers with SVM on very few positive examples.

Breitkreutz, B., C. Stark, et al. (2008). "The BioGRID Interaction Database: 2008 update." Nucleic Acids Research (36).

Camon, E., M. Magrane, et al. (2003). "The Gene Ontology Annotation Project Implementation of GO In SWISSPROT TrEMBL And InterPro." Genome Research.

Caspi, R., H. Foerster, et al. (2006). "MetaCyc: a multiorganism database of metabolic pathways and enzymes."

International Symposium on Reference Resolution for NLP.

Castellano, M., G. Mastronardi, et al. (2008). Biomedical Text Mining Using a Grid Computing Approach. 4th International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications - with Aspects of Artificial Intelligence, Shanghai, China, Springer Berlin.

Ceol, A., A. Chatr-Aryamontri, et al. (2008). "Linking entries in protein interaction database to structured text: The FEBS Letters experiment." FEBS Letters 582: 1171-1177.

Chang, J., S. Raychaudhuri, et al. (2001). Including biological literature improves homology search.

Chawla, N. V., N. Japkowicz, et al. (2004). "Editorial: Special Issue on Learning from Imbalanced Data Sets."

6th conference on Message understanding, Columbia, Maryland.

Claverie, J.-M. (2001). "Gene number. What if there are only 30,000 human genes." Science.

Cohen, A. M. and W. R. Hersh (2005). "A Survey of Current Work in Biomedical Text Mining."

Springer Netherlands.

Consortium, T. U. (2007). "The Universal Protein Resource (UniProt)." Nucleic Acids Research 35: 139-197.

Cooper, J. and A. Kershenbaum (2005). "Discovery of protein-protein interactions using a combination of linguistic, statistical and graphical information." BMC Bioinformatics 7(6).

Cortes, C. and V. Vapnik (1995). "Support-vector networks." Machine Learning 20(3).

Cotton, R. G. H., V. McKusick, et al. (1998). "The HUGO Mutation Database Initiative."

AAAI Workshop on Machine Learning for Information Extraction, Orlando, Florida.

Craven, M. and J. Kumlien (1999). Constructing biological knowledge bases by extracting information from text sources. International Conference on Intelligent Systems for Molecular Biology.

Dandekar, T., B. Snel, et al. (1998). "Conservation of gene order: a fingerprint of proteins that physically interact." Trends in Biochemical Sciences 23(9): 324-8.

Daraselia, N., A. Yuryev, et al. (2004). "Extracting human protein interactions from MEDLINE using a full-sentence parser." Bioinformatics 20(5): 604-611.

De Bruijn, B. and J. Martin (2002). "Getting to the (c)ore of knowledge: mining biomedical Language Processing in Biomedical Applications, Nicosia, Cyprus.

Assessment of the reliability of protein-protein interactions and protein function prediction. Pacific Symposium on Biocomputing.

Donaldson, I., J. Martin, et al. (2003). "PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine." BMC Bioinformatics 4(11).

Dougherty, J., R. Kohavi, et al. (1995). Supervised and unsupervised discretization of continuous features. Twelfth International Conference on Machine Learning, Tahoe City, California, USA.

Dunning, T. (1993). "Accurate methods for the statistics of surprise and coincidence." Computational Linguistics 19(1): 61-74.

Edgar, R. C. and K. Sjölander (2004). "A Comparison of Scoring Functions for Protein Sequence Profile Alignment." Bioinformatics 20: 1301-1308.

Edwards, A., B. Kus, et al. (2002). "Bridging structural biology and genomics: assessing protein interaction data with known complexes." Trends in Genetics 18(10): 529-536.

Ehrler, F., A. J. Yepes, et al. (2005). "Data-poor Categorization and Passage Retrieval for Gene Ontology Annotation in Swiss-Prot." Bioinformatics 6(1).

Enault, F., K. Suhre, et al. (2004). "Phydbac2: improved inference of gene function using interactive phylogenomic profiling and chromosomal location analysis." Nucleic Acids Research 32.

Enright, A., I. Iliopoulos, et al. (1999). "Protein interaction maps for complete genomes based on gene fusion events." Nature 402(6757): 86-90.

Eriksson, G., K. Franzén, et al. (2002). Exploiting Syntax when Detecting Protein Names in Text. EFMI Workshop on Natural Language Processing in Biomedical Applications, Nicosia, Cyprus.

Fellenberg, M., K. Albermann, et al. (2000). Integrative Analysis of Protein Interaction Data. Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press.

Fernández, J. M., R. Hoffmann, et al. (2007). "iHOP web services." Nucleic Acids Research 35: 21-26.

Fields, S. and O. Song (1989). "A novel genetic system to detect protein-protein interactions." Nature 340: 245-246.

Fillmore, C. (1968). "The Case for Case." Universals in Linguistic Theory: 1-90.

Forman, G. (2003). "An extensive empirical study of feature selection metrics for text classification." The Journal of Machine Learning Research 3: 1289-1305.

Fragoudis, D., D. Meretakis, et al. (2002). Integrating Feature and Instance Selection for Text Classification. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.

Franzén, K., G. Eriksson, et al. (2002). "Protein names and how to find them." International Journal of Medical Informatics 67(1): 49-61.

Friedman, C., P. Kra, et al. (2001). "GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles." Bioinformatics 17(1): 74-82.

Fry, D. C. (2006). "Protein-protein interactions as targets for small molecule drug discovery."

ACM/IEEE-CS joint conference on Digital libraries, Houston, Texas.

Fukuda, K., T. Tsunoda, et al. (1998). Toward Information Extraction: Identifying protein names from biological papers. 3rd Pacific Symposium on Biocomputing.

Fuller, M. and J. Zobel (1998). Conflation-based Comparison of Stemming Algorithms. Third Australian Document Computing Symposium.

Gabrilovich, E. and S. Markovitch (2004). Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5. Twenty-first International Conference on Machine Learning, Banff, Alberta, Canada.

Gasperin, Karamanis, et al. (2007). Annotation of anaphoric relations in biomedical full-text articles using a domain-relevant scheme. DAARC.

Gasperin, C. (2006). Semi-supervised anaphora resolution in biomedical texts. BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL, New York City.

Gasteiger, E., E. Jung, et al. (2001). "The beginnings of a database." Current Issues in Molecular Biology 3(3): 47-55.

Gasteiger, E., E. Jung, et al. (2001). "SWISS-PROT: Connecting Biomolecular Knowledge Via a Protein Database." Current Issues in Molecular Biology 3(3): 47-55.

Gavin, A.-C., M. Bösche, et al. (2001). "Functional organisation of the yeast proteome by systematic analysis of protein complexes." Nature 415: 141-147.

Geer, R. C. and E. W. Sayers (2003). "Entrez: Making use of its power." Briefings in Bioinformatics 4(2): 179-184.

Gobeill, J., I. Tbahriti, et al. (2008). "Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction." Bioinformatics 9(3).

Greenbaum, D., R. Jansen, et al. (2002). "Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts." Bioinformatics 18(4): 585-596.

Grigoriev, A. (2001). "A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae." Nucleic Acids Research 29(17): 3513-3519.

Grzymala-Busse, J. W., J. Stefanowski, et al. (2005). "A comparison of two approaches to data mining from imbalanced data." Journal of Intelligent Manufacturing 16(6): 565-573.

Guo, H. and H. L. Viktor (2004). "Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach." ACM SIGKDD Explorations Newsletter 6(1): 30-39.

Hakenberg, J., S. Bickel, et al. (2005). "Systematic feature evaluation for gene name recognition." BMC Bioinformatics 6.

Hamosh, A., A. F. Scott, et al. (2002). "Online Mendelian Inheritance in Man (OMIM) a knowledgebase of human genes and genetic disorders." Nucleic Acids Research 30(1): 52-55.

Hanisch, D., J. Fluck, et al. (2003). Playing biology's name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing, World Scientific.

Hanisch, D., K. Fundel, et al. (2005). "ProMiner: rule-based protein and gene entity recognition." BMC Bioinformatics 6.

Hao, Y., X. Zhu, et al. (2005). "Discovering Patterns to Extract Protein-Protein Interactions from the Literature: Part II." Bioinformatics 21(15): 3294-3300.

Hayes, P. J. and S. P. Weinstein (1990). CONSTRUE/TIS: A System for Content-Based Indexing of a Database of News Stories. The Second Conference on Innovative Applications of Artificial Intelligence.

Hirschman, L., A. A. Morgan, et al. (2002). "Rutabaga by any other name: extracting biological names." Journal of Biomedical Informatics 35(4): 247-259.

Ho, Y., A. Gruhler, et al. (2001). "Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry." Nature

Hoffmann, R. and A. Valencia (2004). "A gene network for navigating the literature."

Hoffmann, R. and A. Valencia (2004). "A gene network for navigating the literature."