
ShExML: An heterogeneous data mapping language based on ShEx

Herminio Garcia-Gonzalez1,2, Daniel Fernandez-Alvarez1, and Jose Emilio Labra-Gayo1

1 Department of Computer Science, University of Oviedo, Oviedo, Asturias, Spain [email protected], [email protected], [email protected]

2 Inria Lille Nord Europe, Villeneuve-d’Ascq, France [email protected]

Abstract. Data interoperability is a problem that we currently face more intensely due to the emergence of fields such as Big Data and IoT. Much data is persisted in information silos with neither interconnection nor format homogenisation. Our proposal to alleviate this problem is ShExML, a language based on ShEx that can map and merge heterogeneous data formats into a single RDF representation. We advocate the creation of this type of tool, which can facilitate the migration of non-semantic data to the Semantic Web.

Keywords: data · interoperability · RDF · ShEx · ShExML

1 Introduction

Mapping and merging heterogeneous data sources is a task that has gained importance in recent years. With improved hardware support, the development of new technological areas such as Big Data and the Internet of Things (IoT), and deeper interconnection between heterogeneous devices, a huge amount of data is generated every second. However, this data is created in various formats and persisted using different technologies. Therefore, understanding and exploiting this data becomes hard work because of the information silos model.

One of the goals of the Semantic Web was the interconnection of data sources and the avoidance of the aforementioned information silos. Many technologies were therefore proposed to support that objective. However, migrating non-semantic data to the new semantic technologies is a hard task that many individuals and companies cannot face because of the time and resources it consumes. Migrating all the databases in a company to their Semantic Web counterparts involves not only migrating the platforms but also the data, with ad-hoc solutions developed for every dataset. Therefore, solutions that ease this translation can contribute to the adoption of semantic technologies or, at least, facilitate it.

We propose a language to map and merge heterogeneous data into its Resource Description Framework (RDF) counterpart, while also taking usability and ease of use into account.


2 Related work

Many mapping languages and tools have been proposed to map a non-semantic format to its RDF counterpart. This is the case of XSPARQL [1], which converts XML to RDF based on XQuery and SPARQL queries; R2RML [2], which allows mappings from relational databases to RDF graphs to be defined; and CSV2RDF [4], which converts CSV to RDF.

However, none of these works tackles both the mapping and the merging of heterogeneous datasets in the same solution. This is addressed by RML [3], which extends the R2RML language to support formats such as JSON, CSV or XML in addition to relational databases. Another alternative is YARRRML [5], a text-based language that is intended to be easily readable by humans. YARRRML is based on YAML and can be used to represent RML and R2RML rules.

ShExML shares the same goal as RML and YARRRML. However, because it is based on ShEx, validation of the generated data can be done faster, since the gap between ShExML and ShEx is small. Moreover, it is designed to keep the same simplicity and ease of use that ShEx has.

3 ShExML at a glance

ShExML³ is based on ShEx [6], which means that the language constructs of ShExML are similar to those of ShEx. Therefore, it uses the shape as the main foundation for every transformation.

Listing 1.1. ShExML example for films

PREFIX : <http://example.com/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbr: <http://dbpedia.org/resource/>
SOURCE films_xml <https://example.com/films.xml>
SOURCE films_json <https://example.com/films.json>
QUERY film_ids_xml <//film/@id>
QUERY film_names_xml <//film/name>
QUERY film_years_xml <//film/year>
QUERY film_directors_xml <//film/director>
QUERY film_ids_json <$.films[*].id>
QUERY film_names_json <$.films[*].name>
QUERY film_years_json <$.films[*].year>
QUERY film_directors_json <$.films[*].director>
EXPRESSION film_ids <$films_xml.film_ids_xml UNION $films_json.film_ids_json>
EXPRESSION film_names <$films_xml.film_names_xml UNION $films_json.film_names_json>
EXPRESSION film_years <$films_xml.film_years_xml UNION $films_json.film_years_json>
EXPRESSION film_directors <$films_xml.film_directors_xml UNION $films_json.film_directors_json>

:Films :[film_ids] {
    foaf:name [film_names] ;
    dbo:year dbr:[film_years] ;
    dbo:director [film_directors] ;
}

We can see ShExML as a combination of declarations followed by a set of shapes: the declarations are a collection of variable definitions, and the shapes are the core mechanism for defining and executing the mappings.

3 ShExML on Github: https://github.com/herminiogg/ShExML


The set of declarations contains prefixes, sources, queries and expressions. Prefixes work like Turtle prefixes; sources define the URL at which a file is hosted; queries define reusable queries for the previously defined sources (normally written in a query language, e.g., JSONPath or XPath); and expressions run the queries over a source, make unions among queries and transform them.
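To make the roles of sources, queries and expressions more concrete, the following Python sketch mimics them with the standard library only. It is an illustration under stated assumptions, not the ShExML engine: the inline strings stand in for the hosted files, the helper names (`query_xml`, `query_json`, `union`) are hypothetical, ElementTree supports only a limited subset of XPath, and the JSONPath query is emulated with plain dictionary access.

```python
import json
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical inline stand-ins for the two SOURCE declarations (trimmed film data).
FILMS_XML = """<films>
  <film id="1"><name>Dunkirk</name></film>
  <film id="2"><name>Interstellar</name></film>
</films>"""
FILMS_JSON = '{"films": [{"id": 3, "name": "Inception"}, {"id": 4, "name": "The Prestige"}]}'


def query_xml(source: str, path: str, attribute: Optional[str] = None) -> List[str]:
    """Run an ElementTree path (a limited XPath subset) over an XML source."""
    nodes = ET.fromstring(source).findall(path)
    if attribute is not None:
        # Stands in for an attribute step such as //film/@id.
        return [node.get(attribute, "") for node in nodes]
    return [(node.text or "").strip() for node in nodes]


def query_json(source: str, key: str) -> List[str]:
    """Emulate the JSONPath $.films[*].<key> with plain dictionary access."""
    return [str(film[key]) for film in json.loads(source)["films"]]


def union(*results: List[str]) -> List[str]:
    """Concatenate query results, keeping the order of the sources (an assumption)."""
    return [value for result in results for value in result]


# Counterparts of EXPRESSION film_ids and EXPRESSION film_names.
film_ids = union(query_xml(FILMS_XML, ".//film", "id"), query_json(FILMS_JSON, "id"))
film_names = union(query_xml(FILMS_XML, ".//film/name"), query_json(FILMS_JSON, "name"))

print(film_ids)
print(film_names)
```

Keeping queries separate from sources, as the declarations do, means the same query helper can be reused against any source of the matching format.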

Listing 1.2. JSON films file

{
    "films": [
        {
            "id": 3,
            "name": "Inception",
            "year": "2010",
            "director": "Christopher Nolan"
        },
        {
            "id": 4,
            "name": "The Prestige",
            "year": "2006",
            "director": "Christopher Nolan"
        }
    ]
}

Listing 1.3. XML films file

<films>
    <film id="1">
        <name>Dunkirk</name>
        <year>2017</year>
        <director>Christopher Nolan</director>
    </film>
    <film id="2">
        <name>Interstellar</name>
        <year>2014</year>
        <director>Christopher Nolan</director>
    </film>
</films>

Thus, imagine that we want to transform two lists of films: one in JSON and the other in XML (see Listings 1.2 and 1.3). We define a ShExML document that converts both files to RDF and merges them into a single RDF file (see Listing 1.1). This conversion has a single shape called :Films, which holds the main conversion for the films. To construct each triple, the :[film_ids] directive defines the subject of every triple generated by this shape. Then, predicates and objects are generated for those ids using the expressions enclosed between braces. For example, foaf:name [film_names] will generate a triple of the form subject foaf:name object. Notice that every expression enclosed in square brackets may be preceded by a prefix, which tells the compiler whether the generated term will be a node or a literal. For instance, dbr:[film_years] yields nodes such as dbr:2017, whereas [film_names] yields literals. Moreover, if a query produces a list of results instead of a single one, the ShExML engine performs the mapping taking into account how each result relates to its entity, which makes it possible to merge files containing several entities. Finally, the result of this example is shown in Listing 1.4.

Listing 1.4. Result of mapping with ShExML in Turtle format

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix : <http://example.com/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:4 dbo:director "Christopher Nolan" ;
    dbo:year dbr:2006 ;
    foaf:name "The Prestige" .

:3 dbo:director "Christopher Nolan" ;
    dbo:year dbr:2010 ;
    foaf:name "Inception" .

:2 dbo:director "Christopher Nolan" ;
    dbo:year dbr:2014 ;
    foaf:name "Interstellar" .

:1 dbo:director "Christopher Nolan" ;
    dbo:year dbr:2017 ;
    foaf:name "Dunkirk" .
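The way a shape turns lists of query results into triples can be approximated with a short Python sketch. It rests on assumptions that the text does not confirm: values are matched to entities purely by position in each list, literal escaping is omitted, and the names `term` and `apply_shape` are hypothetical. The real engine may relate values to entities differently.

```python
from typing import List, Optional, Tuple

# (predicate, values, object prefix); a prefix of None means "emit a literal".
Property = Tuple[str, List[str], Optional[str]]


def term(value: str, prefix: Optional[str] = None) -> str:
    """With a prefix, build a prefixed name (a node); without one, a quoted literal."""
    if prefix is None:
        return f'"{value}"'  # escaping of special characters is omitted in this sketch
    return f"{prefix}:{value}"


def apply_shape(subject_prefix: str, ids: List[str], properties: List[Property]) -> List[str]:
    """Emit one Turtle block per entity, pairing values with ids by position (an assumption)."""
    blocks = []
    for position, entity_id in enumerate(ids):
        subject = term(entity_id, subject_prefix)
        pairs = [
            f"{predicate} {term(values[position], object_prefix)}"
            for predicate, values, object_prefix in properties
        ]
        blocks.append(subject + " " + " ;\n    ".join(pairs) + " .")
    return blocks


# Values as they would come out of the UNION expressions over both film files.
film_ids = ["1", "2", "3", "4"]
film_names = ["Dunkirk", "Interstellar", "Inception", "The Prestige"]
film_years = ["2017", "2014", "2010", "2006"]
film_directors = ["Christopher Nolan"] * 4

# Counterpart of the :Films shape: the empty prefix stands for the default ":" prefix.
blocks = apply_shape("", film_ids, [
    ("foaf:name", film_names, None),
    ("dbo:year", film_years, "dbr"),
    ("dbo:director", film_directors, None),
])
print("\n\n".join(blocks))
```

The sketch prints the triples without the @prefix header and in source order, so its output differs from Listing 1.4 in ordering but describes the same four films.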

4 Conclusions

In this work, we have presented ShExML, a language for mapping and merging heterogeneous data into its RDF counterpart. This tool helps the migration of semi-structured data to a semantic data format, improving its interoperability and searchability. With the development of this solution, integrating data into the Semantic Web becomes an easier task that can be adapted to different scenarios. We plan to include some extra features in future versions, such as the unification of URIs between different representations, the matching between generated URIs and existing ones in the Linked Open Data cloud, and the conversion of streaming sources.

Acknowledgments. This work has been partially funded by the Vicerectorate for Research of the University of Oviedo under the call of "Plan de Apoyo y Promoción de la Investigación" and by the Ministerio de Economía, Industria y Competitividad under the call of "Programa Estatal de I+D+i Orientada a los Retos de la Sociedad" (project TIN2017-88877-R).

References

1. Bischof, S., Decker, S., Krennwallner, T., Lopes, N., Polleres, A.: Mapping between RDF and XML with XSPARQL. Journal on Data Semantics 1(3), 147–185 (2012)

2. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF Mapping Language. https://www.w3.org/TR/r2rml/ (2012), W3C Recommendation 27 September 2012

3. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In: LDOW. Seoul, Korea (2014)

4. Ermilov, I., Auer, S., Stadler, C.: CSV2RDF: User-driven CSV to RDF mass conversion framework. In: Proceedings of the ISEM. vol. 13, pp. 04–06. Graz, Austria (2013)

5. Heyvaert, P., De Meester, B., Dimou, A., Verborgh, R.: Declarative Rules for Linked Data Generation at your Fingertips! In: Proceedings of the 15th ESWC: Posters and Demos. Heraklion, Greece (2018)

6. Prud’hommeaux, E., Labra Gayo, J.E., Solbrig, H.: Shape Expressions: An RDF Validation and Transformation Language. In: Proceedings of the 10th International Conference on Semantic Systems. pp. 32–40. SEM ’14, ACM, New York, NY, USA (2014)
