• Aucun résultat trouvé

The importance of cross-lingual information for matching Wikipedia with the Cyc ontology

N/A
N/A
Protected

Academic year: 2022

Partager "The importance of cross-lingual information for matching Wikipedia with the Cyc ontology"

Copied!
2
0
0

Texte intégral

(1)

The importance of cross-lingual information for matching Wikipedia with the Cyc ontology.

?

Aleksander Smywi«ski-Pohl1,2 and Krzysztof Wróbel1,2

1 Chair in Computational Linguistics, Jagiellonian University, ul. Šojasiewicza 4, 30-348 Kraków, Poland,

http://www.klk.uj.edu.pl

2 Faculty of Computer Science, Electronics and Telecommunications, AGH University of Science and Technology,

al. Mickiewicza 30, Kraków, Poland, http://www.dsp.agh.edu.pl

Abstract. In this paper we try to answer the question how cross-lingual evidence may improve matching between dierent classication schemas.

We concentrate specically on the task of mapping between Wikipedia categories and Cyc terms as well as the classication of Wikipedia articles to the Cyc taxonomy and show how this process may be improved by con- suming the evidence that is available in dierent editions of Wikipedia.

The results show that the performance of the mapping procedure may be improved from 0.6 to 4.9 percentage points, depending on the number of external Wikipedia editions and the given task.

Keywords: Ontology, ontology mapping, classication, multilingual data, Wikipedia, Cyc

1 Approach

To answer the question how the additional Wikipedia editions inuence the per- formance of the mapping between Wikipedia and Cyc (cf. [2]) we have dened the following tasks: 1) mapping of the Wikipedia categories to Cyc terms; 2) classication of the Wikipedia articles to the Cyc ontology based on the rst sentences. In each case the decision of selecting the corresponding Cyc term re- quires disambiguation of some English expressions against the Cyc ontology. This decision is based on the contextual data that are available for each Wikipedia ar- ticle and category. Consulting of the supplementary Wikipedia editions extends the context available when making the decision and in general should improve the performance of the corresponding algorithms.

In case of the category mapping (based on the identication of plural head nouns in category names cf. [3]), when an English category is mapped, the cor- responding Dutch, German, etc. categories are inspected. Then the parent and

?This work was partly supported by Stuctured Dynamics LLC, partly by the Polish National Center for Research and Development under LIDER/37/69/L- 3/11/NCBR/2012 grant and partly by the Faculty of Management and Social Com- munication, Jagiellonian University in Krakow.

(2)

2 Aleksander Smywi«ski-Pohl and Krzysztof Wróbel

child categories as well as articles of the corresponding categories in the other editions are looked up in a reverse interlingual mapping index and if there is an English Wikipedia page, that was not present in the original context, it is in- cluded in the new, extended context. Then a support value used to disambiguate the category is computed against the extended context.

In case of article classication (based on the rst sentence parsing, cf. [1]) the supplementary Wikipedia editions provide additional categories for the clas- sied article, that are used to verify the disambiguation decision. The manner of operation is similar to that from the previous task the corresponding articles in other Wikipedia editions are consulted, their categories are translated back to English and these new categories are included in the extended context.

2 Results

There was a small improvement (F1 increased from 86.8% to 78.4%) in the performance of the category mapping when the English Wikipedia is supported by three other Wikipedias (de,nl,sv). However providing the algorithm with more data from other Wikipedia editions, increased the computation time, but did not further improve the results.

On the other hand the inuence of the additional Wikipedia editions in the task of the classication of the articles into the Cyc ontology was much stronger.

Not only the additional Wikipedia editions improved the recall, but also the pre- cision. The maximum precision was achieved for 5 and 6 additional Wikipedias (96.6% compared to 95.8% for the sole English Wikipedia). The F1 was the largest for the 8 additional Wikipedias resulting in an increase from 66.2% to 71.1%.

The overall conclusion from the results is that the inuence of the supple- mentary Wikipedias is task dependant and in general the extra time necessary to pre-process the data and the increase of the computation time may not be justied. However the task of articles classication shows also that such supple- mentary data may be very valuable and may increase both the precision and the recall of the results.

References

1. Gangemi, A., Nuzzolese, A.G., Presutti, V., Draicchio, F., Musetti, A., Ciancarini, P.: Automatic typing of DBpedia entities. In: The Semantic WebISWC 2012, pp.

6581. Springer (2012)

2. Pohl, A.: Classifying the Wikipedia Articles into the OpenCyc Taxonomy. In: Rizzo, G., Mendes, P., Charton, E., Hellmann, S., Kalyanpur, A. (eds.) Proceedings of the Web of Linked Entities Workshop in conjuction with the 11th International Semantic Web Conference. pp. 516 (2012)

3. Suchanek, F., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In:

Proceedings of the 16th international conference on World Wide Web. pp. 697706.

ACM (2007)

Références

Documents relatifs

In [3] we evaluated T`ıpalo by extracting the types for a sample of 627 resources, while in this work we want to extract the ontology of Wikipedia by running T`ıpalo on

However, in contrast to approaches using co-occurence analysis on Wikipedia [18] or Google Distance [6], which require a quadratic number of search requests, our approach only issues

DBPO classes are insufficient for typing all DBpedia entities, which in turn results in lack of classification knowledge that can be used by automatic techniques for identifying

Характеристики узлов ядра включая разбиение на связные компоненты можно посмотреть в таблице указанной

Unlike the other algorithms that were used to classify Wikipedia articles into an ontology using only one method, this algorithm tries to maximise its coverage by combining

In recent years different IR tasks have used Wikipedia as a ba- sis for evaluating algorithms and interfaces for various types of search tasks, including Question Answering,

We hypothesize that a greater un- derstanding of what makes certain Wikipedia articles more attractive to the serendipitously browsing user than others, will enable us to

Figures 7 and 8 present the monthly evolution of edit requests, edit operations, history, submit and search requests for the considered Wikipedias.. Although these figures are