
4.6 Data Integration

4.6.2 General Data Integration Framework

In Fig. 4.10, we show a general end-to-end integration framework [37]. At first, the data needs to be preprocessed and transformed into a network-based format as described in Sect. 4.6.1. Then, attribute resolution is performed, followed by entity resolution (i.e., merging). The entity resolution module takes a network with duplicated nodes as input and returns a merged network, where each new node (i.e., cluster) consists of a set of old duplicated nodes. The attribute resolution technique uses the same approach as entity resolution but works on attributes from different data sources and identifies which attributes represent the same data. Lastly, the redundancy elimination step selects one representative value from each cluster and returns a cleaned network. The post-processing step transforms the resulting network into the selected format (e.g., attribute-value pairs, ontology-based) and returns it as the final result of the integration execution.
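To illustrate the order of these stages, a minimal Python skeleton is sketched below. The function names (preprocess, resolve_attributes, resolve_entities, eliminate_redundancy, integrate), the dictionary-based network representation, and the placeholder bodies are illustrative assumptions only and are not prescribed by the framework in [37].

def preprocess(sources):
    # Transform raw records into one network: node id -> {attribute: value}
    # (relations between nodes are left out of this tiny skeleton).
    return {nid: dict(attrs) for src in sources for nid, attrs in src.items()}

def resolve_attributes(network):
    # Placeholder: assume attribute names are already aligned across sources.
    return network

def resolve_entities(network, theta_s):
    # Placeholder: every node stays in its own cluster (no duplicates merged yet).
    return [{nid} for nid in network]

def eliminate_redundancy(network, clusters):
    # Placeholder: keep the attributes of each cluster's members unchanged.
    return [{nid: network[nid] for nid in cluster} for cluster in clusters]

def integrate(sources, theta_s=0.5):
    network = preprocess(sources)                     # network-based format (Sect. 4.6.1)
    network = resolve_attributes(network)             # align attributes across sources
    clusters = resolve_entities(network, theta_s)     # merge duplicated nodes
    merged = eliminate_redundancy(network, clusters)  # one representative value per node
    return merged                                     # post-processing would export this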

Entity Resolution. A naive approach to entity resolution is a simple pairwise comparison of attribute values among different entities. Although such an approach may already be sufficient for flat data, this is not the case for network data, as it completely discards the relations between entities. For instance, when two entities are related to similar entities, they are more likely to represent the same entity. Moreover, comparing only the attributes of the related entities is not enough: two entities are even more likely to be the same when their related entities resolve not merely to similar but to the same entities. An approach that uses this relational information, and thus resolves all entities jointly, is denoted a collective entity resolution algorithm.
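To make the contrast concrete, a minimal sketch of the naive pairwise approach follows, assuming entities are given as sets of attribute values and using Jaccard similarity with an illustrative threshold; the entity ids and values are made up for the person domain of Fig. 4.11.

def jaccard(a, b):
    # Similarity of two sets of attribute values.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def naive_pairwise_matches(entities, threshold=0.7):
    # Compare every pair of entities on attribute values only; relations are ignored.
    matches = []
    ids = list(entities)
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            sim = jaccard(entities[x], entities[y])
            if sim >= threshold:
                matches.append((x, y, sim))
    return matches

# Two of these person records come from different sources but describe the same person.
entities = {
    "src1:p1": {"John", "Smith", "1970"},
    "src2:p7": {"John", "Smith", "1970", "Boston"},
    "src3:p2": {"Jane", "Doe", "1982"},
}
print(naive_pairwise_matches(entities))  # [('src1:p1', 'src2:p7', 0.75)]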

As an example, we show a state-of-the-art collective entity resolution algorithm, proposed by Bhattacharya and Getoor [38]. The algorithm (Table 4.6) is in fact a greedy agglomerative clustering. Entities are represented as a set of clusters C, where each cluster represents a set of entities that resolve to the same entity.

Fig. 4.10 General end-to-end integration framework (with user, data, and trust contexts)

Table 4.6 Collective entity resolution algorithm
1 Initialize clusters as C = {{k} | k ∈ K}
2 Initialize priority queue Q = ∅
3 for ci, cj ∈ C and sim(ci, cj) ≥ θS do
...

At the beginning, each entity resides in a separate cluster. Then at each step, the algorithm merges the two clusters in C that are most likely to represent the same entity. During the algorithm, the similarity of clusters is computed using a joint similarity measure, combining attribute similarity and related-data similarity. The first is a basic pairwise comparison of attribute values, while the second introduces relational information into the computation of similarity (i.e., data accessible through cluster neighbors in the network).

The algorithm (Table 4.6) first initializes the clusters C and the priority queue of similarities Q, considering the current set of clusters (lines 1–5). Each cluster represents at most one entity, as it is composed of a single knowledge chunk.

The algorithm then, at each iteration, retrieves the currently most similar clusters and merges them (i.e., matching of resolved entities) when their similarity is greater than the threshold θS (lines 7–11), which represents the minimum similarity for two clusters to be considered to represent the same entity. In line 11, the clusters are simply concatenated. Next, lines 12–17 update the similarities in the priority queue Q, and lines 18–22 also insert (or update) the neighbors' similarities (required due to the relational similarity measure). When the algorithm terminates, the clusters C represent sets of data resolved to the same entities. These clusters are then used to merge the data in the redundancy elimination step.
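Below is a minimal runnable sketch of this greedy procedure in Python. The cluster representation, the Jaccard-based joint similarity with weight alpha, and the lazy recomputation of queue entries are illustrative simplifications of the bookkeeping in Table 4.6, not the exact algorithm of [38].

import heapq

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def joint_sim(c1, c2, alpha=0.5):
    # Joint similarity: weighted combination of attribute and neighbor similarity.
    return ((1 - alpha) * jaccard(c1["attrs"], c2["attrs"])
            + alpha * jaccard(c1["nbrs"], c2["nbrs"]))

def collective_er(entities, relations, theta_s=0.5):
    # Line 1: every knowledge chunk starts in its own cluster.
    clusters = {k: {"attrs": set(v), "nbrs": set(relations.get(k, ())), "members": {k}}
                for k, v in entities.items()}
    # Lines 2-5: fill the priority queue with sufficiently similar cluster pairs.
    queue, ids = [], list(clusters)
    for i, ci in enumerate(ids):
        for cj in ids[i + 1:]:
            s = joint_sim(clusters[ci], clusters[cj])
            if s >= theta_s:
                heapq.heappush(queue, (-s, ci, cj))
    # Lines 7-22: repeatedly merge the currently most similar pair of clusters.
    while queue:
        _, ci, cj = heapq.heappop(queue)
        if ci not in clusters or cj not in clusters:
            continue  # stale entry: one of the clusters was already merged away
        if joint_sim(clusters[ci], clusters[cj]) < theta_s:
            continue  # recomputed similarity dropped below the threshold
        clusters[ci] = {"attrs": clusters[ci]["attrs"] | clusters[cj]["attrs"],
                        "nbrs": clusters[ci]["nbrs"] | clusters[cj]["nbrs"],
                        "members": clusters[ci]["members"] | clusters[cj]["members"]}
        del clusters[cj]
        # Refresh similarities that involve the merged cluster (lazy stand-in for
        # the explicit queue updates in lines 12-22 of Table 4.6).
        for ck in clusters:
            if ck != ci:
                s = joint_sim(clusters[ci], clusters[ck])
                if s >= theta_s:
                    heapq.heappush(queue, (-s, ci, ck))
    return [c["members"] for c in clusters.values()]

# Person-domain example: "src1:p1" and "src2:p7" share a neighbor and similar attributes.
entities = {"src1:p1": {"John", "Smith"}, "src2:p7": {"J.", "Smith"}, "src3:p2": {"Jane", "Doe"}}
relations = {"src1:p1": {"src3:p2"}, "src2:p7": {"src3:p2"}, "src3:p2": {"src1:p1", "src2:p7"}}
print(collective_er(entities, relations, theta_s=0.4))  # two clusters: {src1:p1, src2:p7} and {src3:p2}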

After the entities have been resolved by entity resolution, the next step is to eliminate the redundancy and merge the data. Let c ∈ C be a cluster representing some entity, k1, k2, ..., kn ∈ c be its merged references, and kc ∈ KC be the merged data within the cluster. Furthermore, for some attribute a ∈ A, we have precalculated values per data source. The algorithm (Table 4.7) first initializes the merged network KC. Then, for each attribute kc.a, it finds the most probable value among all given references ki within the cluster c (line 3). When the algorithm unfolds, KC represents a merged dataset with resolved entities and eliminated redundancy.

In Fig. 4.11, we show an example of data integration execution. The first part represents the input data in the form of networks from three different data sources. Second, the result of entity resolution is a merged network in which some nodes contain multiple values (one from each data source). Lastly, after the redundancy elimination step, the final result contains a cleaned network with the most appropriate value for each node.

Table 4.7 Redundancy elimination algorithm
1 Initialize merged cluster nodes KC
2 for c ∈ C and a ∈ A do
...
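A minimal sketch of this step follows, assuming that the most probable value for an attribute is simply the most frequent one among the merged references; this selection strategy, and the function interface, are illustrative choices rather than the exact rule of Table 4.7.

from collections import Counter

def eliminate_redundancy(clusters, references):
    # clusters: iterable of sets of reference ids resolved to the same entity;
    # references: reference id -> {attribute: value} as found in each source.
    merged_network = []
    for cluster in clusters:
        node = {}
        attributes = {a for k in cluster for a in references[k]}
        for a in attributes:
            values = [references[k][a] for k in cluster if a in references[k]]
            node[a] = Counter(values).most_common(1)[0][0]  # most frequent value wins
        merged_network.append(node)
    return merged_network

# Three references to the same person, already resolved into a single cluster.
references = {
    "src1:p1": {"name": "John Smith", "born": "1970"},
    "src2:p7": {"name": "J. Smith", "born": "1970", "city": "Boston"},
    "src3:p4": {"name": "John Smith", "born": "1971"},
}
print(eliminate_redundancy([{"src1:p1", "src2:p7", "src3:p4"}], references))
# e.g. [{'name': 'John Smith', 'born': '1970', 'city': 'Boston'}] (key order may vary)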

4.7 Summary

Many medical applications and current ongoing medical research depend on text mining techniques. A lot of research work has already been done, and therefore in this chapter we have overviewed some methods that enable researchers to automatically retrieve, extract, and integrate unstructured medical data. Due to the increasing number of unstructured documents, automatic text mining methods ease access to relevant data and to already conducted research along with its results, and save money by helping to eliminate repeated research experiments.

Fig. 4.11 Example of data integration execution on the person domain

In the last decade, the text mining field has been evolving fast, and there is still a lot of research to be done. In information retrieval, biomedical language resources typically use simple query models, which seem sufficient when enough relevant data is extracted. Information extraction is currently receiving a lot of attention because researchers are trying to adapt techniques from other domains to work on biomedical data. Furthermore, these techniques are essential for automatic processing of research texts and extraction of findings from the research literature. Lastly, the very important topic of data integration still needs improved models to merge data and to select representative values. The latter is especially important as a reference to the same entity can be represented in many different forms.

References

1. Nguyen NLT, Kim JD, Miwa M et al (2012) Improving protein coreference resolution by simple semantic classification. BMC Bioinformatics 13:304–325
2. Cunningham H, Maynard D, Bontcheva K et al (2011) Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science, Sheffield
3. Cunningham H, Tablan V, Roberts A, Bontcheva K (2013) Getting More Out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics. PLoS Computational Biology 9:1–16
4. Ferrucci D, Lally A (2004) UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering 10:327–348
5. Toutanova K, Klein D, Manning C et al (2011) Stanford Core NLP. The Stanford Natural Language Processing Group. http://nlp.stanford.edu/software/corenlp.shtml. Accessed 20 March 2013
6. Hall D, Ramage D (2013) Breeze. Berkeley NLP Group. http://www.scalanlp.org. Accessed 20 March 2013
7. Kottmann J, Margulies B, Ingersoll G et al (2010) Apache OpenNLP. The Apache Software Foundation. http://opennlp.apache.org. Accessed 20 March 2013
8. Bird S, Loper E, Klein E (2009) Natural Language Processing with Python. O'Reilly Media, Sebastopol
9. Gamalo P (2009) DepPattern. Grupo de Gramatica do Espanol. http://gramatica.usc.es/pln/tools/deppattern.html. Accessed 20 March 2013
10. Padró L, Stanilovsky E (2012) FreeLing 3.0: Towards Wider Multilinguality. Proceedings of the Language Resources and Evaluation Conference. Istanbul, Turkey 2473–2479
11. Björne J, Ginter F, Salakoski T (2012) University of Turku in the BioNLP'11 Shared Task. BMC Bioinformatics 13:1–13
12. Barnickel T, Weston J, Collobert R et al (2009) Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts. PLoS ONE 4:1–6
13. Szklarczyk D, Franceschini A, Kuhn M et al (2010) The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research 39:561–568
14. Mostafavi S, Ray D, Warde-Farley D et al (2008) GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology 9:1–15
15. Fontaine JF, Priller F, Barbosa-Silva A, Andrade-Navarro MA (2011) Genie: literature-based gene prioritization at multi genomic scale. Nucleic Acids Research 39:455–461
16. Tsuruoka Y, Tsujii J (2005) Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of Human Language Technology Conference/EMNLP 2005. Vancouver, Canada 467–474
17. Allison JJ, Kiefe CI, Carter J, Centor RM (1999) The art and science of searching MEDLINE to answer clinical questions. Finding the right number of articles. International Journal of Technology Assessment in Health Care 15:281–296
18. Hamosh A, Scott AF, Amberger JS et al (2005) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 33:514–517
19. Gruber TR (1993) A translation approach to portable ontologies. Knowledge Acquisition 5:199–220
20. Berners-Lee T, Hendler J, Lassila O (2001) The Semantic Web. Scientific American 284:28–37
21. Kim JD, Ohta T, Tateisi Y, Tsujii J (2003) GENIA corpus - a semantically annotated corpus for bio-text mining. Bioinformatics 19:180–182
22. Pyysalo S, Ginter F, Heimonen J et al (2007) BioInfer: a corpus for information extraction in the biomedical domain. BMC Bioinformatics 8:50–74
23. Rogers FB (1963) Medical Subject Headings. Bulletin of the Medical Library Association 51:114–116
24. Spackman KA, Campbell KE (1998) Compositional concept representation using SNOMED: towards further convergence of clinical terminologies. Proceedings of the AMIA Symposium. Orlando, Florida 740–744
25. Ashburner M, Ball CA, Blake JA et al (2000) Gene Ontology: tool for the unification of biology. Nature Genetics 25:1–25
26. Xie B, Ding Q, Han H, Wu D (2013) miRCancer: a microRNA–cancer association database constructed by text mining on literature. Bioinformatics 29:638–644
27. Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval. Cambridge University Press, Cambridge
28. Sarawagi S (2008) Information Extraction. Foundations and Trends in Databases 1:261–377
29. Bush V (1945) As We May Think. The Atlantic Monthly 176:101–108
30. Fallows D (2004) The internet and daily life. Pew/Internet and American Life Project. http://www.pewinternet.org/Reports/2004/The-Internet-and-Daily-Life.aspx. Accessed 21 March 2013
31. Witten IH, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann Publishers, San Francisco
32. Broder A (2002) A taxonomy of web search. ACM SIGIR Forum 36:3–10
33. Newman MEJ (2010) Networks: an introduction. Oxford University Press, Oxford
34. Trček D, Trobec R, Pavešić N, Tasić J (2007) Information systems security and human behaviour. Behaviour and Information Technology 26:113–118
35. Nagy M, Vargas-Vera M, Motta E (2008) Managing conflicting beliefs with fuzzy trust on the semantic web. Proceedings of the Mexican International Conference on Advances in Artificial Intelligence 827–837
36. Richardson M, Agrawal R, Domingos P (2003) Trust management for the semantic web. Proceedings of the International Semantic Web Conference 351–368
37. Žitnik S, Šubelj L, Lavbič D et al (2013) General Context-Aware Data Matching and Merging Framework. Informatica 24:1–34
38. Bhattacharya I, Getoor L (2007) Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data 1:5–40
39. Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco 282–289
40. Soon WM, Ng HT, Lim DCY (2001) A machine learning approach to coreference resolution of noun phrases. Computational Linguistics 27:521–544
41. Ng V, Cardie C (2002) Improving machine learning approaches to coreference resolution. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics 104–111
42. Bengtson E, Roth D (2008) Understanding the value of features for coreference resolution. Proceedings of the Conference on Empirical Methods in Natural Language Processing 294–303
43. Miller GA (1995) WordNet: A Lexical Database for English. Communications of the ACM 38:39–41
44. Grishman R, Sundheim B (1996) Message understanding conference-6: A brief history. Proceedings of the 16th Conference on Computational Linguistics. Morristown, USA 466–471
45. NIST (1998-present) Automatic Content Extraction (ACE) Program
46. Recasens M, Marquez L, Sapena E et al (2010) Semeval-2010 task 1: Coreference resolution in multiple languages. Proceedings of the 5th International Workshop on Semantic Evaluation. Uppsala, Sweden 1–8
47. Pradhan S, Moschitti A, Xue N, Uryupina O, Zhang Y (2012) CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. Proceedings of the CoNLL '12 Joint Conference on EMNLP and CoNLL - Shared Task. Pennsylvania, USA 129–135
48. Chinchor N (1991) MUC-3 Evaluation metrics. Proceedings of the 3rd conference on Message understanding. Pennsylvania, USA 17–24
49. Chinchor N, Sundheim B (1993) MUC-5 Evaluation metrics. Proceedings of the 5th conference on Message understanding. Pennsylvania, USA 69–78
50. Vilain M, Burger J, Aberdeen J, Connolly D, Hirschman L (1995) A model-theoretic coreference scoring scheme. Proceedings of the sixth conference on Message understanding. Pennsylvania, USA 45–52
51. Bagga A, Baldwin B (1998) Algorithms for scoring coreference chains. The first international conference on language resources and evaluation workshop on linguistics coreference. Pennsylvania, USA 563–566
52. Luo X (2005) On coreference resolution performance metrics. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. Vancouver, Canada 25–32
53. Recasens M, Hovy E (2011) BLANC: Implementing the Rand index for coreference evaluation. Natural Language Engineering 17:485–510
54. Rabiner L (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE 77:257–286
55. McCallum A, Freitag D, Pereira F (2000) Maximum entropy markov models for information extraction and segmentation. Proceedings of the International Conference on Machine Learning. Palo Alto, USA 591–598
56. Klein D, Manning CD (2002) Conditional structure versus conditional estimation in NLP models. Workshop on Empirical Methods in Natural Language Processing. Philadelphia, USA 1–8
57. DeRose SJ (1988) Grammatical category disambiguation by statistical optimization. Computational Linguistics 14:31–39
58. Verspoor KM, Cohn JD, Ravikumar KE, Wall ME (2012) Text Mining Improves Prediction of Protein Functional Sites. PLoS ONE 7:e32171
59. Park J, Costanzo MC, Balakrishnan R et al (2012) CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations. Database, doi:10.1093/database/bas001
60. Krallinger M, Leitner F, Vazquez M et al (2012) How to link ontologies and protein–protein interactions to literature: text-mining approaches and the BioCreative experience. Database, doi:10.1093/database/bas017
