Text Mining in Medicine

4.3 Text Mining Platforms and Tools

During text mining research on biomedical data, a number of techniques have been developed. Some of this research has evolved into publicly available services, libraries, or larger projects. In this section, we present general text mining tools that have been successfully applied to the biomedical domain, as well as tools that were developed from the start to solve a biomedical problem.

Researchers from the field of general text mining often retrain their approaches on a different domain and evaluate them without deeper knowledge of the data [1]. Some general tools that have been successfully applied to the biomedical domain are listed in Sect. 4.3.1. The presented frameworks also include special biomedical plug-ins to handle specific data formats and visualizations.

In Sect. 4.3.2, we present text mining tools that were initially developed to work with biomedical data. In contrast to the general approaches, most of them include large corpora of processed data and data manipulation tools.

4.3.1 General Tools

General Architecture for Text Engineering (GATE) [2] is an open-source platform containing a number of text analysis tools. It has been actively developed for more than 20 years by academic and industry contributors. The community continues to add new state-of-the-art techniques and technologies. Recently, a cloud-based service for large-scale text processing has also been provided.

The GATE family of products offers a variety of tools. GATE Developer (Fig. 4.1) is an integrated development environment for end users. It provides a graphical user interface that enables the user to read new data, construct a pipeline of text processing modules, and visualize and analyze the input data. Each component also allows customization of its parameters or the definition of custom extraction rules. Next, GATE Teamware provides a web collaboration framework for annotating and curating new text corpora. Having a large number of tagged corpora is essential for improving existing text mining techniques or building new ones, and it is therefore very useful for a group of manual annotators to share the same semiautomatic tagging technique. Lastly, GATE Embedded provides an open-source library that software developers can use directly when building new systems supported by text analysis. Moreover, new custom plug-ins for the GATE Developer tool can be implemented using the library.
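The idea of a pipeline of text processing modules that add annotations to a document can be illustrated with a minimal sketch. It is written in plain Python and does not use the GATE API (GATE itself is a Java platform); the names `Document`, `Annotation`, `tokenizer`, `make_gazetteer`, and `run_pipeline`, as well as the example sentence and gene list, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str          # annotation type, e.g. "Token" or "Gene"
    start: int         # character offset where the span starts
    end: int           # character offset where the span ends


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def tokenizer(doc):
    """Processing module: add one 'Token' annotation per word."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))
    return doc


def make_gazetteer(kind, terms):
    """Build a module that tags tokens found in a term list (illustrative)."""
    lowered = {term.lower() for term in terms}

    def gazetteer(doc):
        tokens = [a for a in doc.annotations if a.kind == "Token"]
        for token in tokens:
            if doc.text[token.start:token.end].lower() in lowered:
                doc.annotations.append(Annotation(kind, token.start, token.end))
        return doc

    return gazetteer


def run_pipeline(doc, modules):
    """Apply each module in order; later modules see earlier annotations."""
    for module in modules:
        doc = module(doc)
    return doc


# Hypothetical input sentence and gene list, for illustration only.
doc = run_pipeline(
    Document("RAD51 interacts with BRCA2 in human cells."),
    [tokenizer, make_gazetteer("Gene", ["RAD51", "BRCA2"])],
)
genes = [doc.text[a.start:a.end] for a in doc.annotations if a.kind == "Gene"]
```

In this sketch the order of modules matters, because the gazetteer module relies on the token annotations produced by the tokenizer. Real frameworks add configuration, annotation sets, and richer features on top of this basic pattern.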

In the field of biomedicine, GATE has been used to detect associations between mutations and head and neck cancer, to analyze medical records, to support richer drug-related search, to extract protein-protein interactions, etc. [3]. Many biomedical-specific plug-ins have already been developed. Furthermore, GATE can structure the extracted data using an ontology language or use an ontology to guide the extraction process.

Unstructured Information Management Architecture (UIMA) [4] is a framework similar to GATE, but it is less focused on providing a polished graphical user interface (Fig. 4.2). By itself it is an empty framework for the analysis of unstructured data such as video, audio, and text. It provides a general framework for data acquisition, representation, processing, and storage. The main goal of UIMA was to enable the development of many reusable components, for example, annotators or external resources, that can be easily plugged into the system. In addition to general components, specialized biological annotators and medical knowledge extraction components have been developed. The framework became better known after IBM's Watson system, which was built on top of UIMA, won the 2011 Jeopardy challenge.

In contrast to the GATE and UIMA frameworks, a number of natural language processing libraries exist. Their main focus is to solve a specific task and to be used in third-party applications. Stanford CoreNLP [5] is one of the most comprehensive sets of tools; next to text preprocessing techniques, it provides a named entity recognizer, a simple relationship extractor, and a coreference resolution system. A library with similar functionality is Breeze [6], and other, more general libraries include OpenNLP [7], NLTK [8], DepPattern [9], and FreeLing [10]. One of the key differences among them is the programming language in which a specific library is available.

Fig. 4.1 GATE Developer. On the left side, selected applications are shown with a number of processing resources that can process language resources. On the right side, a number of known annotation sets are listed. In the center, a processed document is visualized, and the user is given the possibility to refine the results

4.3.2 Medicine-Specialized Tools

Turku Event Extraction System (TEES) is one of the best-performing systems in biological shared tasks [11]. It uses a classification-based machine-learning approach to detect events, from which it further identifies relationships. The whole TEES text mining process was designed to uncover biomedical interactions (i.e., relationships between proteins/genes and corresponding processes) in research articles. Additionally, the system includes standard text preprocessing tools and is adaptable to various tasks, such as speculation and negation detection (which is similar to general sentiment analysis), protein/gene coreference resolution, or synonym detection.

The Extraction of Classified Entities and Relations from Biomedical Texts (EXCERBT) [12] system is the result of a Ph.D. thesis in the field of bioinformatics. Its main goal is to use a simple machine-learning algorithm with shallow semantic role labeling instead of slow full-text parsing. In addition to EXCERBT, the authors perform large-scale network analysis to rank the extracted relationships using similarities between neighboring documents. Lastly, they use big data database technologies to efficiently store, index, and retrieve data from a few tens of millions of PubMed documents. In Fig. 4.3, we show a graphical user interface built on top of EXCERBT with a simple information retrieval example.

Fig. 4.2 UIMA example annotation results. On the left side, the source document is annotated with selected tags, and on the right side, annotation details are shown

Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) is an integration and information retrieval tool along with a database of known and predicted protein interactions [13]. The tool offers a successful confidence scoring function and focuses mainly on an interactive user interface. The STRING database covers more than a thousand organisms, ranging from bacteria to humans, which are represented by fully sequenced genomes. In Fig. 4.4, we show a sample STRING result when searching for the RAD51 protein in the context of the human genome.

The integration methods underneath the graphical user interface use standard approaches with some minor domain modifications.

Multiple Association Network Integration Algorithm (GeneMANIA) [14] is a real-time network integration algorithm for gene function prediction. The first part of the algorithm creates a single functional association network of genes using other genomic data sources. In the second part, label propagation (i.e., an approach that spreads node values along their connections until the network converges) is used to identify gene functions from the network built in the previous step. On top of the proposed algorithm, a network-based web interface has been developed that enables the user to find association data for a set of input genes. The association data include protein/gene interactions, protein domain similarity, pathways, co-expression, and co-localization.
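The label propagation step can be illustrated with a short, generic sketch. It is not the GeneMANIA implementation: the update rule (a mix of each node's seed value and the weighted average of its neighbors), the damping parameter `alpha`, and the toy network with genes `geneA` to `geneE` are all assumptions made for illustration.

```python
def propagate_labels(weights, seeds, alpha=0.8, tol=1e-9, max_iter=1000):
    """Generic iterative label propagation on a weighted, undirected graph.

    weights: dict mapping node -> dict of neighbor -> edge weight
    seeds:   dict mapping node -> initial label score (missing nodes get 0.0)
    alpha:   share of the score taken from neighbors (assumed value)
    """
    nodes = sorted(weights)
    scores = {node: seeds.get(node, 0.0) for node in nodes}
    for _ in range(max_iter):
        updated = {}
        for node in nodes:
            total = sum(weights[node].values())
            if total:
                neighbor_avg = sum(
                    w * scores[nb] for nb, w in weights[node].items()
                ) / total
            else:
                neighbor_avg = 0.0  # isolated node receives nothing
            updated[node] = (1 - alpha) * seeds.get(node, 0.0) + alpha * neighbor_avg
        change = max(abs(updated[node] - scores[node]) for node in nodes)
        scores = updated
        if change < tol:  # stop once the scores have converged
            break
    return scores


# Hypothetical toy network; gene names and weights are illustrative only.
network = {
    "geneA": {"geneB": 1.0, "geneC": 0.5},
    "geneB": {"geneA": 1.0, "geneC": 1.0},
    "geneC": {"geneA": 0.5, "geneB": 1.0, "geneD": 0.2},
    "geneD": {"geneC": 0.2},
    "geneE": {},
}
seeds = {"geneA": 1.0}  # geneA is assumed to be known to have the function
scores = propagate_labels(network, seeds)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Genes that are strongly connected to the seed gene end up with higher scores, so the ranked list can be read as candidate genes for the same function. With `alpha` below 1 the update is a contraction, which is why the loop converges.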

Genie [15] is a tool that evaluates biomedical literature and identifies genes within the texts from a number of databases. Its goal is to provide a ranked set of genes given a target genome and a biomedical topic. Moreover, it supports natural language query input and provides high-precision results. Researchers can then use this information to focus more thoroughly on the higher-ranked genes.

Fig. 4.3 EXCERBT search engine user interface. On the left side, we select the source entity type, followed by the relationship and the target entity type. On the right side, documents with short text snippets, from which the whole relationship was extracted, are presented

GENIA Tagger [16] is an example of an information extraction tool that provides text preprocessing taggers (i.e., a part-of-speech tagger and a parser) together with a named entity recognizer. It differs from general tools in that it is trained on biomedical data with accordingly tuned features.