Named entity recognition/normalization problems

2 State of the Art

2.2 Named entity recognition and normalization

2.2.1.3 Named entity recognition/normalization problems

Research in recent years has shown that recognizing biological objects in written language remains a difficult task for many reasons. These include a general lack of naming conventions or a failure to follow the conventions that exist, excessive use of abbreviations, and frequent use of synonyms and homonyms (Leser and Hakenberg 2005). The problem is further complicated by the need to determine name boundaries and to resolve overlaps between candidate names. Indeed, as biological objects often have names consisting of several words, it is difficult to say clearly where a name starts or ends. For instance, even human readers struggle to reach consensus on which words are part of a protein name such as human T-cell leukaemia lymphotropic virus type 1 Tax protein. Whereas many biologists would argue that the first word ‘human’ and the last word ‘protein’ are not elements of the protein name itself, others would claim the contrary (Leser and Hakenberg 2005).

2.2.1.3.1 Dynamic nature of terminology

The complexity of protein and gene name identification is inherent to the dynamic nature of gene-name usage and name creation (Krallinger and Valencia 2005). Given the lack of standards governing the construction of words and the ways of combining them, the language used to describe biological knowledge is constantly changing. Conventions, if they exist, may differ between sub-disciplines and between scholars of different traditions (Franzén, Eriksson et al. 2002). Researchers are not necessarily aware of similar findings by others and tend to use their favorite name. They usually choose a name for a gene that clearly differentiates it from other entities according to their own conventions and background (Fukuda, Tsunoda et al. 1998; Krallinger and Valencia 2005). These practices lead to a situation where most entities have many aliases, sometimes six or seven synonyms (Bernardi, Ratsch et al. 2002). When many synonyms exist for a given object, it becomes unlikely that mentions appearing in different articles will be linked together.

2.2.1.3.2 Homonymy

There are three main sources of homonymy in the vocabulary of genes and proteins: the widespread use of acronyms, the use of common English terms to name genes, and the use of similar technical terms. Acronyms are abbreviations of names and are very popular in scientific writing because they shorten texts (Leser and Hakenberg 2005). Unfortunately, it is difficult to determine the true meaning of an acronym because acronyms are often homonyms. For instance, the acronym ACE stands for angiotensin converting enzyme, affinity capillary electrophoresis, acetylcholinesterase and several other things (Adar 2004).

The second source of homonymy is the significant overlap between gene and protein names and common English words. Especially at the beginning of the genomic era, gene names were not distinguished from ordinary language. The use of common English words to name genes and proteins makes their identification very difficult without contextual information. The fly poses the most complicated problem because of the significant overlap of fly gene names with common English words. Furthermore, a number of gene names differ only in their case from standard English words, such as light, map, complement, and Sonic hedgehog (Hanisch, Fundel et al. 2005).

The last source of homonymy is the propensity of a single gene or protein name to refer to several entities. The most central problem of this kind concerns genes or proteins that occur in several species. Indeed, proteins are commonly named independently of the species from which the protein or gene is extracted. In such cases, gene name disambiguation requires distinguishing between proteins that have the same name but belong to different genomes (Krallinger and Valencia 2005). Additionally, it is sometimes unclear whether a name refers to the gene or to the gene product. Protein names overlap with gene names (myc-c gene and myc-c protein) and cell cultures (CD4+-cells and CD4 protein), and may be rather similar to the names of chemical compounds (Caeridin and Cantharidin) (Mika and Rost 2004). A given string, such as ‘rhodopsin’, can refer to a number of different genes. Some names for different proteins in DIP are so similar that reliable distinction between them becomes impossible; for example, p52shc and p52(Shc) are a mouse and a human protein that take part in different interactions in DIP (Blaschke and Valencia 2001). A study by Tuason et al. (Tuason, Chen et al. 2004) showed that from 2.4% to 32.9% of gene names are highly ambiguous. This is especially cumbersome in the case of mouse and human genes: the same gene symbol is often used in both species, and both names are mentioned in the same textual passage.
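The p52shc / p52(Shc) case can be illustrated with a minimal Python sketch. It assumes a toy lexicon of (name, species) pairs taken from the examples in the text and a crude normalization (lowercasing and dropping non-alphanumeric characters); the function name `ambiguous_names` is hypothetical and is not part of any cited system.

```python
import re
from collections import defaultdict

def ambiguous_names(entries):
    """Group (name, species) pairs by a crude normalized key and
    return only the keys shared by more than one entry."""
    groups = defaultdict(set)
    for name, species in entries:
        # Assumption: case and punctuation are ignored when comparing names.
        key = re.sub(r"[^a-z0-9]", "", name.lower())
        groups[key].add((name, species))
    return {key: sorted(members)
            for key, members in groups.items() if len(members) > 1}

# Toy lexicon built from names mentioned in the text.
entries = [("p52shc", "mouse"), ("p52(Shc)", "human"), ("YSY6", "yeast")]
collisions = ambiguous_names(entries)
print(collisions)
```

Under this normalization the mouse and human names collapse onto the same key, so string matching alone cannot tell them apart; some additional signal, such as the species discussed in the surrounding text, would be needed.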

2.2.1.3.3 Compound words

The difficulty of defining simple rules to identify the beginning and the end of multi-word gene and protein names makes the identification of protein name boundaries challenging for general linguistic analysis software (Fu, Mostafa et al. 2003). Even for human experts, it is sometimes difficult to agree on the exact borders of such names. Gene and protein names can contain verbs and other parts of speech that are hard to distinguish from the surrounding text, as in deleted in azoospermia-like, son of sevenless, ran, man, young arrest and never in mitosis. Genes can be transfected into cells, or combined with chemicals, resulting in ambiguous terms like CHO-A(3) and ca2+/calmodulin (Tanabe and Wilbur 2002). It is also difficult to decide which constituents of a name can be omitted and which are necessary to distinguish different proteins (Cdc7 and Cdc7 protein kinase are different proteins, but immunoglobulin enhancer binding factor e12 and e12 are not). All these problems not only make such terms harder to retrieve, but also make evaluation more complex. Indeed, if one word of a term does not clearly belong to the name, it is hard to say whether an answer that misses it is correct or wrong (Leser and Hakenberg 2005). This can be particularly difficult when a name contains verbs and adjectives (mullerian inhibiting substance, deleted in azoospermia-like) (Tanabe and Wilbur 2002) or when it is not clear which words must be taken into consideration. For example, there are four ways to tag the name in the phrase yeast YSY6 protein: yeast YSY6 protein, yeast YSY6, YSY6 protein or YSY6. This ambiguity implies that annotators may include yeast today and exclude it a year later, unless given some ‘annotation rules’ (Mika and Rost 2004).
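The four taggings of yeast YSY6 protein, and their effect on evaluation, can be sketched in Python as follows. This is only an illustration: it assumes that YSY6 is the core token every tagging must contain, and the helper names (`candidate_spans`, `strict_match`, `relaxed_match`) are hypothetical rather than taken from any cited work.

```python
def candidate_spans(tokens, core_index):
    """Return every contiguous token span that contains the core token."""
    spans = []
    for start in range(core_index + 1):
        for end in range(core_index + 1, len(tokens) + 1):
            spans.append(" ".join(tokens[start:end]))
    return spans

def strict_match(gold, answer):
    """Count an answer as correct only if it equals the gold annotation."""
    return answer == gold

def relaxed_match(gold, answer, core):
    """Count an answer as correct if it contains the core token and lies
    inside the gold annotation (a crude character-level containment test)."""
    return core in answer.split() and answer in gold

tokens = ["yeast", "YSY6", "protein"]
candidates = candidate_spans(tokens, core_index=1)

gold = "yeast YSY6 protein"  # assumption: what one annotator happened to tag
strict_hits = [c for c in candidates if strict_match(gold, c)]
relaxed_hits = [c for c in candidates if relaxed_match(gold, c, "YSY6")]
print(candidates)
print(len(strict_hits), len(relaxed_hits))
```

With this gold annotation, strict matching accepts only one of the four candidate taggings while the relaxed criterion accepts all four, which is why the annotation rules in force can change the reported performance of the same system.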

2.2.2 Evaluation

NER systems are usually evaluated on a standard text collection called a “gold standard”, in which all the entities of interest are tagged by human experts. Such a collection makes it easy to tune an individual system and to compare it with other systems working on the same data. The performance of NER tools is typically measured in terms of precision (the percentage of true entity names among all entity names found; sometimes also called specificity, although in the strict sense it is the positive predictive value) and recall (the percentage of true entity names found relative to all true entity names; also called sensitivity) (Leser and Hakenberg 2005). These two figures are combined into the so-called F-measure, defined as the harmonic mean of precision P and recall R, $F = 2PR / (P + R)$, or summarized by reporting the balanced precision and recall, defined as the point where precision and recall are equal.
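As a worked illustration of these measures, the following Python sketch scores a set of predicted entity spans against a gold standard. It assumes exact matching of (start, end) character offsets, which is only one possible matching criterion, and the offsets themselves are made-up toy data.

```python
def precision_recall_f(gold, predicted):
    """Exact-match scoring of predicted entity spans against gold spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision: share of the predicted names that are true names.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the true names that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy data: four gold entities, three predictions, two of them correct.
gold = {(0, 4), (10, 18), (25, 30), (40, 47)}
predicted = {(0, 4), (10, 18), (33, 36)}
p, r, f = precision_recall_f(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here two of the three predictions are correct (precision 2/3) and two of the four gold entities are found (recall 1/2), giving an F-measure of 4/7, or roughly 0.571.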

2.2.3 Approaches

Most of the methods employed to retrieve protein and gene mentions can be grouped into three categories. The first family, the dictionary methods, relies mostly on controlled lexicons. The principle is to search the document collection for the terms contained in a dictionary of genes using exact or inexact pattern matching (Krauthammer, Rzhetsky et al. 2000). Secondly, the rule-based approach (Fukuda, Tsunoda et al. 1998) groups the methods that attempt to define patterns to recognize the entities. These patterns can either be constructed by hand or learned automatically from examples provided by experts. Once the patterns are generated, they are used to scan each document of the collection and select every sequence of words that matches. Finally, the statistical approach covers various machine learning (ML) techniques that classify every word in order to decide whether it belongs to a target entity. There are also many hybrid approaches that attempt to combine the advantages of the different techniques (Krallinger, Erhardt et al. 2005; Tanabe and Wilbur 2002).
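The first two families can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited system: the lexicon, the test sentences, the normalization used for inexact matching and the single hand-written pattern are all assumptions chosen for the example.

```python
import re

def normalize(name):
    """Crude inexact matching: lowercase and drop non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def dictionary_tag(text, lexicon, max_words=4, inexact=True):
    """Dictionary approach: greedy longest-match lookup of word n-grams."""
    key = normalize if inexact else (lambda s: s)
    index = {key(entry): entry for entry in lexicon}
    tokens = [t.strip(".,;:") for t in text.split()]
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            entry = index.get(key(surface))
            if entry is not None:
                hits.append((surface, entry))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return hits

# Rule-based approach: one hypothetical hand-written surface pattern,
# letters followed by digits (matches names such as YSY6, Cdc7, p52shc).
NAME_PATTERN = re.compile(r"\b[A-Za-z]+\d+[A-Za-z0-9]*\b")

def rule_tag(text):
    return NAME_PATTERN.findall(text)

lexicon = ["Cdc7", "angiotensin converting enzyme"]  # toy lexicon
text = "The text mentions Cdc-7 and angiotensin converting enzyme."
exact_hits = dictionary_tag(text, lexicon, inexact=False)
inexact_hits = dictionary_tag(text, lexicon, inexact=True)
rule_hits = rule_tag("yeast YSY6 protein binds Cdc7")
print(exact_hits)
print(inexact_hits)
print(rule_hits)
```

In this toy run, exact matching misses the spelling variant Cdc-7 while the normalized lookup recovers it, and the surface pattern finds YSY6 and Cdc7 without any lexicon. The same normalization would also conflate distinct names that differ only in case or punctuation, so the choice between exact and inexact matching trades recall against precision.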

From the document Modular text mining for protein-protein interactions extraction (pages 50-53)