• Aucun résultat trouvé

A similarity measure based on semantic, terminological and linguistic information

N/A
N/A
Protected

Academic year: 2022

Partager "A similarity measure based on semantic, terminological and linguistic information"

Copied!
2
0
0

Texte intégral

(1)

A Similarity Measure based on Semantic, Terminological and Linguistic Information

Nitish Aggarwal, Tobias Wunner∗◦, Mihael Arcan, Paul Buitelaar, Se´an O’Riain

Unit for Natural Language Processing andeBusiness Unit Digital Enterprise Research Institute,

National University of Ireland, Galway [email protected]

Introduction

The fundamental task of ontology matching is based on measuring the similarity between two ontology concepts [1]. However, we argue that a deeper semantic, terminological and linguistic analysis of specialized domain vocabularies is needed in order to establish a more sophisticated similarity measure that caters for the specific characteristics of this data.

In particular we propose ’STL’, a novel similarity measure that takes semantic, terminological and linguistic variation into account.

The STL Similarity

We base our approach on the three-faceted STL ontology enrichment process introduced in [3]. We calculate similarity according to semantic, terminological and linguistic variation and then take a linear combination by using linear regression, called STL similarity, which we describe as follows:

Semantic similarity (simS) is calculated based on semantic (taxo- nomic or ontological) structure. For our purposes we used a recently pro- posed semantic similarity measure proposed by Pirro Sim [2], which uses intrinsic information content, i.e. the information content of a concept defined by the number of its subconcepts.

Terminological similarity (simT) is defined by maximal subterm overlap, i.e. we calculate simT between two concepts c1 and c2 as the number of subterms ti in a termbase that can be matched on the labels of c1 and c2. A term ti is said to match on a concept when no other longer term tj can be matched on the same concept (label). To calcu- late simT we use monolingual as well as multilingual termbases as the latter reflect terminological similarities that may be available in one lan- guage but not in others, e.g. there is no terminological similarity between

(2)

the English terms ”Property Plant and Equipment” and ”Tangible Fixed Asset”, whereas in German these concepts are actually identical on the terminological level (they both translate into ”Sachanlagen”).

Linguisticsimilarity (simL) is defined as the Dice coefficient applied on the head&modifier syntactical arguments of two terms, i.e., the ratio of common modifiers to all modifiers of two concepts. For instance the concepts ”Financial Income” and ”Net Financial Income” have 3 modi- fiers ”financial” ”net” and ”net financial”, whereby only ”financial” is a common modifier.

Putting it all together we define STL similarity as a linear combination of the sub-measures where the weights wS,T,L are their contributions on the data set:

simS,T,L=wS∗simS+wT ∗simT +wL∗simL+constant We evaluated our approach on a data set of 59 financial term pairs, drawn from the xEBR (European Business Registry) vocabulary, that were an- notated by four human annotators. Table 1 shows the correlations ρ of all measures on the data set and that STL outperforms all of its S,T and L contributions.

Measure ρ Type Measure ρ Type Measure ρ Type PathLength 0.16 S UnigMulti 0.72 T SubtermMulti 0.75 T

Wu−Palmer 0.18 S BigMono 0.53 T Lemmatized 0.70 L

Pirro Sim 0.20 S Bi Multi 0.54 T Head&Mod 0.51 L

UnigramMono 0.72 T SubtermMono 0.74 T STL 0.78 S,T,L

Table 1.Correlation of STL similarity measures with human evaluator scores

References

1. Euzenat, J., Meilicke, C., Stuckenschmidt, H., Shvaiko, P., dos Santos, C.T.: Ontol- ogy alignment evaluation initiative: Six years of experience. J. Data Semantics 15, 158–192 (2011)

2. Pirr´o, G.: A semantic similarity metric combining features and intrinsic information content. Data Knowl. Eng. 68, 1289–1308 (November 2009)

3. Wunner, T., Buitelaar, P., O’Riain, S.: Semantic, terminological and linguistic in- terpretation of xbrl. In: In Reuse and Adaptation of Ontologies and Terminologies Workshop at 17th EKAW (2010)

Acknowledgements

This work is supported in part by the European Union under Grant No.

248458 for the Monnet project and by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

Références

Documents relatifs

Variation of the global F-score values when (A) varying the number of gen- erated fuzzy clusters K with the fuzzy C-Means algorithm using IntelliGO similarity measure and (B)

With respect to our 2013 partici- pation, we included a new feature to take into account the geographical context and a new semantic distance based on the Bhattacharyya

The similarity measure introduced by Zadeh and Reformat (2013) is similar to ours in the sense that it uses Tversky’s feature-based model for calculating similarity and then

Using this idea, a text that was not written by an author, would not exceed the average of similarity with known texts and a text of unknown authorship would be

The task 3a of the BioASQ challenge consists in indexing new biomedical papers with MeSH concepts.. The need of document indexed by terms of a thesaurus like the MeSH has already

Semantic similarity is used to include similarity measures that use an ontology’s structure or external knowledge sources to determine similarity between

In Section 3, we give a new similarity measure that characterize the pairs of attributes which can increase the size of the concept lattice after generalizing.. Section 4 exposes

For finding this out, we: (i) have chosen the four candidate measures from the broader variety of the available, based on the specifics of term comparison; (ii) devel- oped the