A Context-based Approach for Complex Semantic Matching

Youssef Bououlid Idrissi and Julie Vachon
DIRO, University of Montreal, Montreal, QC, Canada

{bououlii,vachon}@iro.umontreal.ca

Abstract. Semantic matching¹ is a fundamental step in implementing data sharing applications. Most systems automating this task, however, limit themselves to finding simple (one-to-one) matchings. In fact, complex (many-to-many) matching raises a far more difficult problem, as the search space of concept combinations is often very large. This article presents Indigo, a system discovering complex matchings in two steps. First, it semantically enriches data sources with complex concepts extracted from their development artifacts. It then proceeds to the alignment of the data sources thus enhanced.

Key words: complex matching, semantic matching, context analysis

1 Introduction

Semantic matching consists in finding semantic correspondences between heterogeneous sources. When done manually, this task can prove very tedious and error-prone [3]. To date, many systems [1, 4] have addressed the automation of this stage. However, most solutions confine themselves to simple (one-to-one) matching, although complex (many-to-many) matching is frequently required in practice. Linking the concept address to the concatenation of the concepts street + city is a typical example of a complex match. The scarcity of work addressing complex matching can be explained by the fact that finding complex matches is a far harder problem than discovering simple ones. To cope with this challenging problem, this article introduces the solution implemented by our matching system Indigo². Indigo avoids searching such large spaces of possible concept combinations. It rather implements an innovative solution based on the exploration of the data sources' informational context, which can indeed contain very useful semantic hints about concept combinations. The informational context of a data source is composed of all the available textual and formal artifacts documenting, specifying, and implementing this data source. Indigo distinguishes two main sets of documents in the informational context (cf. [2] for details). The first set, called the descriptive context, gathers all the available data source specification and documentation files produced during the different development stages.

¹ Also called semantic alignment, or simply matching.

² INteroperability and Data InteGratiOn.


The second set is called the operational context. It is composed of formal artifacts such as programs, forms or XML files. In formal settings, significant combinations of concepts are more easily located (e.g., they can be found in formulas, function declarations, etc.). Indigo thus favors the exploration of the operational context to identify relevant concept combinations that can form new complex concepts.

Complex concepts are added to data sources as new candidates for the matching phase. As an experiment, Indigo was used for the semantic matching of two database schemas taken from two open-source e-commerce applications, Java Pet Store [5] and eStore [6], provided with all their source code files.
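To fix intuitions, here is a minimal sketch (ours, not Indigo's code) of how such a complex match can be represented as a data structure; all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ComplexMatch:
    """A correspondence between one concept of a target source and a
    combination of concepts of another source (illustrative names)."""
    target: str            # e.g. "address"
    sources: list[str]     # e.g. ["street", "city"]
    combinator: str = "concat"

# The running example from the introduction:
match = ComplexMatch(target="address", sources=["street", "city"])
print(f"{match.target} <-> {match.combinator}({', '.join(match.sources)})")
# prints: address <-> concat(street, city)
```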

2 Indigo's architecture

To handle both context analysis and semantic matching, Indigo presents an architecture composed of two main modules: a Context Analyzer and a Mapper module. The Context Analyzer takes the data sources to be matched along with their related contexts and proceeds to their enrichment before delivering them to the Mapper module for their effective matching.

2.1 Context Analyzer

The Context Analyzer comprises two main modules, each one specialized in a specific type of concept extraction. 1) The Concept name collector explores the descriptive context of a data source to find (simple) concept names related to the ones found in the data source's schema. 2) The Complex concept extractor analyzes the operational context to extract complex concepts. The left side of Figure 1 shows the current architecture of our Context Analyzer. Modules are either basic or meta analyzers. In all cases, the analysis performed is based on heuristic rules.

[Figure 1 (block diagram): the Context Analyzer (left) comprises a Head meta-analyzer coordinating the Concept name collector and the Complex concept extractor meta-analyzer (with its Complex concept generator, Arithmetic parser and Concepts linker), above the basic analyzers: Form, Schema, DTD, XSD, SQL and Program analyzers. The Mapper (right) comprises a Supervisor and a Content-based coordinator above the Name-based, Whirl-based and Statistic-based aligners.]

Fig. 1. The Context Analyzer and the Mapper modules

Basic analyzers. Basic analyzers (depicted by white boxes in Fig. 1) are responsible for the effective mining of complex concepts. The extraction is performed as dictated by heuristic rules of the following shape: ruleName :: SP1 || SP2 || ... || SPn → extractionAction. The left part is a disjunction of syntactic patterns (noted SP) that basic analyzers try to match³ when parsing a document. A SP is a regular expression which can contain pattern variables name, type, exp1, exp2, ..., expn (e.g., for an accessor method in a Java program: SPi = {public type get name * return exp1}). When a basic analyzer recognizes one of the SPs appearing on the left-hand side of a rule, pattern variables are assigned values (by pattern matching) and the corresponding right-hand side action of the rule is executed. This action builds a complex concept < name, type, concept combination > using the pattern variables' values. Each basic analyzer applies its own set of heuristic extraction rules over each of the artifacts it is assigned. Our current basic analyzers deal with forms, programs, SQL requests, DTD and XSD schemas.
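As an illustration of such a rule, here is a minimal sketch assuming Python regular expressions as the pattern language (the paper does not prescribe one). The SP recognizes a Java accessor whose body returns a concatenation, and the extraction action builds the < name, type, concept combination > triple:

```python
import re

# One heuristic rule: a syntactic pattern (SP) whose pattern variables
# are captured as named groups, plus an action building a complex concept.
# The SP targets a Java getter returning a '+' concatenation, e.g.
#   public String getAddress() { return street + " " + city; }
ACCESSOR_SP = re.compile(
    r"public\s+(?P<type>\w+)\s+get(?P<name>\w+)\s*\(\s*\)\s*\{"
    r"[^}]*?return\s+(?P<exp>[^;]+);"
)

def extraction_action(m: re.Match) -> tuple[str, str, str]:
    """Build the <name, type, concept combination> triple from the
    pattern variables bound by the match."""
    exp = m.group("exp").strip()
    # keep identifiers only, dropping string literals and '+' operators
    parts = re.findall(r"[A-Za-z_]\w*", exp)
    return (m.group("name").lower(), m.group("type"),
            f"concat({', '.join(parts)})")

java_source = 'public String getAddress() { return street + " " + city; }'
m = ACCESSOR_SP.search(java_source)
if m:
    print(extraction_action(m))
# ('address', 'String', 'concat(street, city)')
```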

Meta-analyzers. Each meta-analyzer is in charge of a set of artifacts composing the informational context. Its role essentially consists in classifying these artifacts and assigning each of them to a relevant child. To do so, it applies heuristics such as checking file name extensions or parsing files' internal structure.
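A minimal sketch of the extension-based classification heuristic; the extension-to-analyzer table is our assumption, while the analyzer names come from Figure 1:

```python
from pathlib import Path

# Extension -> basic analyzer name (the mapping itself is illustrative).
ANALYZER_BY_EXTENSION = {
    ".sql": "SQL Analyzer",
    ".dtd": "DTD Analyzer",
    ".xsd": "XSD Analyzer",
    ".java": "Program Analyzer",
    ".jsp": "Form Analyzer",
    ".html": "Form Analyzer",
}

def classify(artifacts: list[str]) -> dict[str, list[str]]:
    """Assign each artifact of the informational context to a basic analyzer."""
    assignment: dict[str, list[str]] = {}
    for artifact in artifacts:
        analyzer = ANALYZER_BY_EXTENSION.get(Path(artifact).suffix.lower())
        if analyzer:  # unknown files would need deeper structural parsing
            assignment.setdefault(analyzer, []).append(artifact)
    return assignment

print(classify(["schema.sql", "Cart.java", "checkout.jsp"]))
```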

The meta-analyzer module at the head of the Context Analyzer is in charge of coordinating the Concept name collector and the Complex concept extractor. It enhances data sources with the simple and complex concepts respectively delivered by these two modules. For complex concepts, this enrichment step requires not only the names of the enriching concepts but also the values⁴ associated with them. Regarding the Complex concept extractor, let us mention that it coordinates the actions of the basic analyzers responsible for complex concept extraction.

In addition, it relies on an internal module, called the Complex concept generator, which validates discovered concept combinations and generates complex concepts by replacing expressions (coming from source code) with the appropriate concepts of the data source.
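Footnote ⁴ mentions that the values associated to complex concepts are assessed with SQL SELECT statements. A minimal sketch of that step, assuming a SQLite database and hypothetical table and column names:

```python
import sqlite3

def assess_values(db: str, table: str, columns: list[str]) -> list[str]:
    """Materialize the values of a concat(...) complex concept by
    querying the database. Table and column names are hypothetical."""
    con = sqlite3.connect(db)
    try:
        cols = " || ' ' || ".join(columns)  # SQL string concatenation
        rows = con.execute(f"SELECT {cols} FROM {table}").fetchall()
        return [r[0] for r in rows]
    finally:
        con.close()

# e.g. the values of the complex concept concat(street, city):
# assess_values("estore.db", "customer", ["street", "city"])
```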

2.2 Mapper module

As shown on the right side of Figure 1, the current architecture of the Mapper is composed of several matching modules organized hierarchically. Each aligner supports a given matching strategy and is responsible for generating a mapping between data sources. On top, each coordinator supervises a given set of aligners and combines their returned results. The current implementation of the Mapper comprises the three following aligners: (1) the Name-based aligner proposes matches between concepts having similar names with regard to the Jaro-Winkler lexical similarity metric; (2) the Whirl-based aligner uses an adapted version of the WHIRL technique developed by Cohen and Hirsh [7] to match concepts having similar instances; finally, (3) the Statistic-based aligner compares concepts' content, represented in this case by a normalized vector describing seven characteristics (e.g., minimum value, maximum, variance, etc.).
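A self-contained sketch of the metric used by the Name-based aligner; this is the standard Jaro-Winkler definition, not Indigo's code, and the example concept names are ours:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # characters match within the window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # transpositions: matched characters out of order, counted in pairs
    k = t = 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score by the common prefix length (capped at 4)."""
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    j = jaro(s1, s2)
    return j + prefix * p * (1 - j)

# e.g. a plausible name-based match candidate:
print(round(jaro_winkler("cust_address", "customer_address"), 3))
```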

³ N.b. this kind of text matching is called "pattern matching".

⁴ These values are assessed by querying the database using SQL SELECT statements.


3 Experimental results

Indigo's Context Analyzer has been applied to match the data sources of the eStore and PetStore applications. Two measures, respectively called significance and relevance, were defined to evaluate its performance. The significance measure indicates the percentage of extracted complex concepts presenting a semantically sound combination of concepts (e.g., concat(first name, last name)). The relevance measure computes the proportion of significant complex concepts which effectively appear in the final mapping of the two data sources. Overall, the Context Analyzer discovered 31 complex concepts, of which 87% were significant. Of course, not all of them were relevant. A manual examination of the data sources revealed that eStore's data source contained only two complex concepts while PetStore's contained none. Indigo nevertheless succeeded in discovering both relevant complex concepts of eStore. After their enhancement with complex concepts, the PetStore and eStore data sources were matched by the Mapper module, which was able to discover all complex matches.
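Reading these definitions literally (the set representation and the rounding are ours), the two measures amount to:

```python
def significance(extracted: set[str], significant: set[str]) -> float:
    """Share of extracted complex concepts that are semantically sound."""
    return len(significant & extracted) / len(extracted)

def relevance(significant: set[str], final_mapping: set[str]) -> float:
    """Share of significant complex concepts appearing in the final mapping."""
    return len(final_mapping & significant) / len(significant)

# Reported figures: 31 extracted concepts, 87% significant (i.e. about 27),
# of which the 2 relevant eStore concepts appeared in the final mapping.
print(round(27 / 31, 2))  # ~0.87 (significance)
print(round(2 / 27, 2))   # ~0.07 (relevance)
```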

4 Conclusion

We proposed Indigo, an innovative solution for the discovery of complex matches between data sources. Avoiding the search of the unbounded space of possible concept combinations, Indigo discovers complex concepts by searching the operational context artifacts of data sources. Newly discovered complex concepts are added to data sources as new candidates for complex matching. Indigo implements a Context Analyzer and a Mapper module, both offering a flexible and extensible hierarchical architecture. Specialized analyzers and aligners can be added to allow the application of new mining and matching strategies. Extensibility and adaptability are undoubtedly appreciable qualities of Indigo. Our first experiments with Indigo showed the pertinence of this approach and suggest promising future results.

References

1. Rahm, E., Bernstein, P.A.: A Survey of Approaches to Automatic Schema Matching. VLDB Journal Vol. 10, Issue 4 (2001) 334–350.

2. Bououlid, I.Y., Vachon, J.: Context Analysis for Semantic Mapping of Data Sources Using a Multi-Strategy Machine Learning Approach. In Proc. of the International Conf. on Enterprise Information Systems (ICEIS05), Miami, USA (2005) 445–448.

3. Li, W.S., Clifton, C.: Semantic Integration in Heterogeneous Databases Using Neural Networks. In Proc. of the 20th Conf. on Very Large Databases (VLDB) (1994) 1–12.

4. Euzenat, J., et al.: State of the Art on Ontology Alignment. Part of a research project funded by the IST Program of the Commission of the European Communities, project number IST-2004-507482. Knowledge Web Consortium (2004).

5. Sun Microsystems: Java Pet Store. http://java.sun.com/developer/releases/petstore/ (2005).

6. McUmber, R.: Developing Pet Store Using RUP and XDE. Web site (2003).

7. Cohen, W., Hirsh, H.: Joins that Generalize: Text Classification Using WHIRL. In Proc. of the Fourth Int. Conf. on Knowledge Discovery and Data Mining (1998).

