
Inferring missing schema from linked data using formal concept analysis (FCA)


Academic year: 2021


INFERRING MISSING SCHEMA FROM LINKED DATA USING FORMAL CONCEPT ANALYSIS (FCA)

MASTER THESIS PRESENTED AS A PARTIAL REQUIREMENT FOR THE MASTER IN COMPUTER SCIENCE

BY RAZIEH MEHRI DEHNAVI

Library Services

Notice

This thesis is distributed with respect for the rights of its author, who signed the form Autorisation de reproduire et de diffuser un travail de recherche de cycles supérieurs (SDU-522 - Rév.01-2006). This authorization stipulates that "in accordance with Article 11 of Regulation No. 8 on graduate studies, [the author] grants the Université du Québec à Montréal a non-exclusive licence to use and publish all or a substantial part of [their] research work for educational and non-commercial purposes. More specifically, [the author] authorizes the Université du Québec à Montréal to reproduce, distribute, lend, or sell copies of [their] research work for non-commercial purposes on any medium whatsoever, including the Internet. This licence and this authorization do not entail a waiver by [the author] of [their] moral rights or of [their] intellectual property rights. Unless otherwise agreed, [the author] remains free to distribute and to commercialize, or not, this work, of which [they] hold a copy."

INFÉRENCE DU SCHÉMA MANQUANT À PARTIR DE DONNÉES LIÉES À L'AIDE DE L'ANALYSE DE CONCEPTS FORMELS (ACF)

THESIS PRESENTED AS A PARTIAL REQUIREMENT FOR THE MASTER'S IN COMPUTER SCIENCE

BY RAZIEH MEHRI DEHNAVI

I am deeply thankful to my supervisor, Dr. Petko Valtchev, for his guidance and encouragement throughout my Master's studies at UQAM. He has been an excellent advisor and a constant source of knowledge, motivation, and encouragement during this dissertation work.

I would like to extend my thanks to Dr. Fatiha Sadat, my co-supervisor, for her guidance throughout this research work.

I am continuously grateful to my family, especially my parents, for their support, love, and encouragement.

Finally, I would like to thank all the staff members of the Computer Science department at UQAM for their direct and indirect help during my studies at UQAM.

LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
RÉSUMÉ
ABSTRACT
INTRODUCTION

CHAPTER I
MAIN CONCEPTS
1.1 RDF
1.2 RDF Schema (RDFS)
1.3 FCA
1.3.1 Introduction to Formal Concept Analysis
1.3.2 Concept Lattice
1.4 DBpedia
1.4.1 Data Source
1.4.2 Data Structure
1.4.3 Data Access
1.5 Summary

CHAPTER II
REVIEW OF THE LITERATURE
2.1.1 Name-based Techniques
2.1.2 Structure-based Techniques
2.1.3 Extensional Techniques
2.1.4 Semantic-based Techniques
2.2 RDF and Similarity
2.2.1 Similarity of RDF Graphs on Linked Open Data (Interlinking Tools)
2.2.2 Finding Similarity between RDF Individuals Using FCA
2.3 FCA and Semantic Web Applications
2.4 Summary

CHAPTER III
METHODOLOGY AND IMPLEMENTATION
3.1 Approach
3.1.1 Converting RDF to FCA Input
3.1.2 Converting FCA Output to RDFS
3.1.3 Choosing Plausible Names for RDFS Classes Using DBpedia
3.2 Implementation
3.2.1 Java Frameworks and APIs
3.2.1.1 Jena
3.2.1.1.1 RDF API
3.2.1.1.2 SPARQL
3.2.1.2 Galicia
3.2.1.3 RDF Gravity
3.2.2 Generating RDFS from RDF data
3.2.2.1 Step One: Converting .rdf to .rcf.xml
3.2.2.2 Step Two: Converting lat.xml to RDFS
3.2.2.3 Step Three: Naming Classes Using DBpedia
3.3 Summary

CHAPTER IV
EXPERIMENTS AND RESULTS
4.1 Dataset
4.2 Results
4.2.1 Binary Relation Table
4.2.2 Concept Lattice
4.2.3 RDFS Graph
4.3 Discussion of the Experiments
4.4 Summary

CONCLUSION
BIBLIOGRAPHY

Figure

1.1 RDF graph example
1.2 RDFS graph example
1.3 Formal context example
1.4 Concept lattice example
1.5 Reduced labeling diagram of concept lattice example
1.6 Infobox of Portugal
1.7 The DBpedia dataset for Barack Obama
2.1 Similarity techniques
3.1 RDF graph of music dataset
3.2 Lattice of music dataset
3.3 Reduced labeling diagram of music dataset lattice
3.4 RDFS graph of music dataset
3.5 RDF Gravity representation of music dataset's RDFS graph
3.6 Objects of rdf:type predicate with dbpedia-owl prefix of Lake Onega and Neva River in DBpedia
3.7 Objects of dbpedia-owl:Wikipagesdirect predicate for Neva River in DBpedia
3.8 Jena framework
3.9 SPARQL Example
3.10 Galicia v.2 beta view
3.12 SNORQL
4.1 Lattice of Russia dataset by Galicia
4.2 Full RDFS graph of Russia dataset by RDF Gravity
4.3 Parts of RDFS graph of Russia dataset

Table

1.1 Classes in DBpedia ontology
2.1 Tool comparison
3.1 Binary relation table of music dataset
3.2 RDF Gravity notations

FCA          Formal Concept Analysis
RDF          Resource Description Framework
RDFS         RDF Schema
LOD          Linked Open Data
SW           Semantic Web
Galicia      GAlois LattIce-based Incremental Closed Itemset Approach
SPARQL       SPARQL Protocol and RDF Query Language
RDF Gravity  RDF GRAph VIsualization Tool
XML          Extensible Markup Language
W3C          World Wide Web Consortium

RÉSUMÉ

With the massive increase in the amount of data available on the web, detecting and analyzing information in web content is becoming very profitable. The deployment of structured data based on Semantic Web technologies has grown significantly online over the last two decades. Information extraction is therefore becoming a major problem among Semantic Web researchers. To publish structured data on the Web, data sources are described with the Resource Description Framework (RDF).

In this thesis, we seek to extract the conceptual structure of the Web of Data, that is, of RDF data in the Web of documents. The main objective is to learn the schema level from the instance level; in other words, we try to convert RDF data to RDF Schema (RDFS) by learning the conceptual structure induced by individuals described in RDF.

To build the concept lattice from RDF data, concepts are identified using Formal Concept Analysis (FCA). The number of concepts is based on the number of possible subsets containing similar RDF resources. By similar RDF resources, we mean the set of RDF resources that share a common set of attributes. After building the concept lattice, we take into account the properties and the data properties deduced from the RDF data in order to build the schema.

Another challenge in building the RDFS model is naming the RDFS classes. For this, we use DBpedia. DBpedia contains the structured information of Wikipedia, which holds very useful information that lets us learn the type of the output instances in the RDF data.

The methodology presented in this thesis extracts the maximum possible schema from the instance level of RDF data. By following the steps mentioned above, we gain the ability to exploit conceptual structure from the Web of Data.

ABSTRACT

The amount of data available on the web has increased considerably in recent years, so detecting and analyzing useful information in its content is very profitable. The deployment of structured data based on Semantic Web technologies has grown significantly online in the past two decades. Therefore, information extraction has become a major concern among Semantic Web researchers. To publish structured data on the web, data sources are published using the Resource Description Framework (RDF) data model.

This thesis aims at extracting conceptual structures from the Web of Data, i.e., RDF data in the Web of documents. The main objective is to learn the schema level from the instance level of a dataset; in other words, we try to convert RDF data into data with an RDF Schema (RDFS) model by learning the conceptual structure among RDF individuals at the instance level.

To construct a concept lattice from the RDF data, concepts are identified via Formal Concept Analysis (FCA). The number of concepts is based on the number of possible subsets containing similar RDF individuals. By similar RDF individuals we mean the set of RDF resources which share a common set of attributes. After detecting the concepts of the concept lattice (the classes of the RDFS model) and the hierarchical relations between them, we take into account the properties and the inferred data properties from the RDF data in order to construct the schema level.

Another challenge in building the RDFS model from data is naming the RDFS classes. We overcome this issue by using DBpedia. DBpedia contains the structured information from Wikipedia, which holds very useful information allowing us to learn the type of existing instances in the RDF data.

The methodology proposed in the thesis extracts the maximum possible schema from the instance level of RDF data. By adopting the aforementioned steps, we achieved the capability to exploit conceptual structure from the Web.

INTRODUCTION

Since the appearance of the Semantic Web, the Web of documents has expanded to the Web of Data. The Web of Data is described as graphs of data. It rapidly produces large datasets containing billions of RDF triples from different domains of knowledge. Thus, with the fast-growing availability of structured data on the web, exploiting it becomes ever more interesting.

XML and HTML are more readable by humans than RDF, since RDF data does not explicitly follow hierarchical and sequential structure formats. Therefore, the RDF model lacks simple human readability and writability for its documents. We believe that concept extraction from the Web of Data provided in RDF helps fulfil the user's requirement of having a better understanding of heterogeneous data on the web. Implementing this idea could improve the readability of RDF statements by ordering and grouping them.

FCA is a key technique for formally discovering and representing concept hierarchies as well as for clustering knowledge found on the web.

Motivation

Even though data sources are structurally defined on the Web of Data, non-conceptualized data suffers from a lack of vocabulary, and the effort to reduce the resulting decentralization of data is interesting. In other words, extracting a schema from data becomes more interesting when it comes to data without explicit conceptualization. RDF describes resources without considering taxonomies of their classes and properties. Discovering conceptual structure from the Web of Data represented as RDF triples is possible by using FCA.

Objective

Extracting a schema from RDF data could lead to the construction of an RDFS model, which contains richer vocabularies for describing the data. RDFS is an extension of the RDF model which allows the description of RDF terms in the form of class (types of the instances), subClass (relation between classes), property (properties which describe classes) and subProperty (relation between properties), as well as the domain and range of the properties. Obtaining an RDFS model from the RDF data helps us solve the problems of heterogeneity in raw data of the web.

Structure of this dissertation

This dissertation is organized as follows:

In Chapter 1, we define the basic concepts which are used throughout this thesis. The main concepts explained in the chapter are: Resource Description Framework (RDF), RDF Schema (RDFS), Formal Concept Analysis (FCA) and DBpedia.

Chapter 2 presents a review of the literature. It presents works related to our thesis and compares our work to them. First, the chapter discusses the basic similarity methods that exist for ontologies. The similarity methods are used for building interlinking tools, which are then introduced briefly. Finally, we present our approach in comparison to the other works for extracting similar RDF individuals.

Chapter 3 describes the full implementation of our methodology, in addition to an introduction to some Java platforms and APIs required during implementation.

Finally, in Chapter 4, the methodology is evaluated with three metrics: precision, recall and f-measure.

MAIN CONCEPTS

The current chapter provides background information on the technologies our approach relies on. The first two sections give brief introductions to the RDF and RDFS models. The third section introduces FCA, which plays an important role in our methodology. Finally, DBpedia, which contains useful knowledge for generating our final output, is presented.

1.1 RDF

The Resource Description Framework (RDF) is a fundamental data model in Semantic Web technology [MM04]. It is designed to be read and understood by machines. As a generic data model, RDF represents the information on the web in the form of <subject-predicate-object> triples. Each triple is a sentence describing a resource. A resource is an entity which can be a subject, predicate or object in an RDF triple. Each resource on the web is uniquely identified by a Uniform Resource Identifier (URI). A URI identifies a resource via a location, a name, or both.

The subject, or first part of an RDF triple, is the resource which the statement describes. The predicate, or second part of a triple, is a property or aspect which relates the resource to an object. The object, or third part of a triple, can be another resource or a literal value defined as a string, a number, a date, etc [1899].

An RDF graph consists of a set of triples where each triple represents an arc. Therefore, each RDF statement is a subgraph where each node is a subject or object whereas arcs are predicates (the arc starts from the subject and is directed to the object). Further, RDF can use an XML-based syntax, i.e., RDF/XML, to create or modify RDF graphs [RDF04].

An example of an RDF graph is given in the following [Li13]. Suppose that a student named James Anderson has professor Paul Jones as his supervisor. The statements related to this information are represented as the RDF graph shown in Figure 1.1.

[Figure 1.1: RDF graph example [Li13]. Nodes: http://www.mydomain.org/uni-ns/PaulJones, http://www.mydomain.org/uni-ns/JamesAnderson, http://www.mydomain.org/uni-ns/Professor]

The XML syntax of the RDF data is:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:uni="http://www.mydomain.org/uni-ns">
  <rdf:Description rdf:about="http://www.mydomain.org/uni-ns/PaulJones">
    <rdf:type rdf:resource="http://www.mydomain.org/uni-ns/Professor"/>
    <uni:advises rdf:resource="http://www.mydomain.org/uni-ns/JamesAnderson"/>
  </rdf:Description>
</rdf:RDF>

http://www.w3.org/1999/02/22-rdf-syntax-ns# and http://www.mydomain.org/uni-ns are XML namespaces. An XML namespace is used in RDF to denote a collection of names of resources and properties. For example, the xmlns:rdf declaration specifies that the elements with the rdf prefix come from the namespace http://www.w3.org/1999/02/22-rdf-syntax-ns#, which is known as the namespace for the RDF vocabulary. Moreover, an XML qualified name is a shortcut for a URI; for example, we can use the prefix uni instead of the full URI http://www.mydomain.org/uni-ns.

In the graph, Paul Jones is connected to James Anderson by the predicate advises. Besides, there is another relation which connects Paul Jones to the class Professor at the schema level: Paul Jones is connected to the class Professor by the predicate type. Again, the prefix rdf is used instead of writing the full URI http://www.w3.org/1999/02/22-rdf-syntax-ns#. The full example of the schema level is given in the following section.
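To make the triple view of the RDF/XML example concrete, the following is a minimal sketch (not the tooling used in the thesis) that reads the same description with Python's standard XML parser and lists its subject-predicate-object triples. The function name extract_triples is hypothetical, and the trailing "/" added to the uni namespace is an assumption made here so that prefix plus local name forms a complete URI.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Same data as the example above; the trailing "/" on the uni namespace is an
# assumption made here so that namespace + local name yields a full URI.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:uni="http://www.mydomain.org/uni-ns/">
  <rdf:Description rdf:about="http://www.mydomain.org/uni-ns/PaulJones">
    <rdf:type rdf:resource="http://www.mydomain.org/uni-ns/Professor"/>
    <uni:advises rdf:resource="http://www.mydomain.org/uni-ns/JamesAnderson"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) URI triples from simple RDF/XML.

    Only rdf:Description elements whose children point to other resources
    through rdf:resource are handled; literals and nested nodes are ignored.
    """
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for child in description:
            # ElementTree stores tags as "{namespace}local"; joining the two
            # parts gives the predicate URI.
            namespace, local = child.tag[1:].split("}")
            obj = child.get(f"{{{RDF_NS}}}resource")
            if obj is not None:
                triples.append((subject, namespace + local, obj))
    return triples


TRIPLES = extract_triples(RDF_XML)
for triple in TRIPLES:
    print(triple)
```

Each printed tuple corresponds to one arc of Figure 1.1: one arc for the type predicate and one for advises.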

1.2 RDF Schema (RDFS)

RDF by itself does not provide significant semantics. Built on top of it, RDFS is an extensible knowledge representation language which adds vocabulary to RDF in order to express information about classes, subclasses and properties (relationships between classes) [BVM04]. This vocabulary covers classes, subclasses, relationships between classes, properties, subproperties, relationships between properties, domains and ranges, etc.

The RDFS level of the example described in the previous section is given in the following graph [Li13] (drawn as Figure 1.2).

The schema part of the RDF/XML syntax of the RDFS data is:

<rdfs:Class rdf:ID="Person"/>
<rdfs:Class rdf:ID="Student">
  <rdfs:subClassOf rdf:resource="#Person"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Professor">
  <rdfs:subClassOf rdf:resource="#Person"/>
</rdfs:Class>
<rdf:Property rdf:ID="advises">
  <rdfs:domain rdf:resource="#Professor"/>
  <rdfs:range rdf:resource="#Student"/>
</rdf:Property>

[Figure 1.2: RDFS graph example [Li13], showing the RDFS level above the RDF level, with edges such as advises and rdfs:range]

"#" is used instead of writing the full URI reference. The rdfs:domain and rdfs:range predicates relate a predicate to the class of instances which can be the subject or the object of that predicate, respectively. rdfs:subClassOf identifies the hierarchical relationship between classes at the schema level. In the above example, Professor and Student are subclasses of the Person class. advises is a property which has the classes Professor and Student as its domain and range, respectively.
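The effect of rdfs:domain, rdfs:range and rdfs:subClassOf can be illustrated with a small sketch. It assumes short names instead of full URIs and a single parent per class, and it hard-codes the schema of the running example; the function name infer_types is hypothetical and this is not the procedure implemented in the thesis.

```python
RDF_TYPE = "rdf:type"

# Schema level of the running example: class hierarchy plus the domain and
# range of advises. Short names stand in for full URIs (an assumption).
SUBCLASS_OF = {"Student": "Person", "Professor": "Person"}
DOMAIN = {"advises": "Professor"}
RANGE = {"advises": "Student"}

# Instance level: one statement, with no explicit rdf:type.
DATA = [("PaulJones", "advises", "JamesAnderson")]


def infer_types(data):
    """Derive rdf:type statements from domain, range and subClassOf."""
    types = set()
    for subject, predicate, obj in data:
        if predicate in DOMAIN:
            # The subject of a predicate belongs to the predicate's domain.
            types.add((subject, RDF_TYPE, DOMAIN[predicate]))
        if predicate in RANGE:
            # The object of a predicate belongs to the predicate's range.
            types.add((obj, RDF_TYPE, RANGE[predicate]))
    # Propagate every type upward along the subclass hierarchy.
    for resource, _, cls in list(types):
        while cls in SUBCLASS_OF:
            cls = SUBCLASS_OF[cls]
            types.add((resource, RDF_TYPE, cls))
    return types


INFERRED = infer_types(DATA)
for statement in sorted(INFERRED):
    print(statement)
```

From the single advises statement, the sketch derives that Paul Jones is a Professor, James Anderson is a Student, and both are Persons.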

1.3 FCA

1.3.1 Introduction to Formal Concept Analysis

FCA stands for Formal Concept Analysis, a formal representation of data that can be presented as a conceptual structure [GW99]. FCA is a data analysis technique that helps to identify the conceptual structure of data using formal contexts and concept lattices. Every dataset which consists of a binary relation between a set of objects and a set of attributes can be introduced as a formal context in FCA [WB04].

Definition 1 (Formal Context): A formal context is a triple $K := (G, M, I)$, where $G$ and $M$ are sets and $I \subseteq G \times M$ is a relation between $G$ and $M$. The elements of $G$ and $M$ are called objects and attributes, respectively, and $(g, m) \in I$ is read as "the object $g$ has the attribute $m$".

A set of objects, their corresponding attributes, and the relations that exist between those objects and attributes can be shown in a formal context. A formal context can be represented as a table in which rows are objects, columns are attributes, and each cross in the table is a relation between an object and the corresponding attribute.

An example of a formal context can be seen in Figure 1.3. The example includes four objects and four attributes.

[Figure 1.3: Formal context example (cross table with objects Obj 1 to Obj 4 as rows and attributes Att 1 to Att 4 as columns)]

Further, the formal context can be represented as a conceptual structure, which is explained in the next section.
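As a minimal sketch of Definition 1, a formal context can be stored as a set of (object, attribute) pairs and rendered as a cross table. The incidences below are hypothetical, since the crosses of Figure 1.3 are not reproduced here, and the function name cross_table is an illustrative choice.

```python
# Hypothetical context: the incidences below are illustrative and are NOT
# the crosses of Figure 1.3.
OBJECTS = ["Obj1", "Obj2", "Obj3", "Obj4"]
ATTRIBUTES = ["Att1", "Att2", "Att3", "Att4"]
INCIDENCE = {
    ("Obj1", "Att1"), ("Obj1", "Att3"),
    ("Obj2", "Att1"), ("Obj2", "Att2"),
    ("Obj3", "Att1"), ("Obj3", "Att4"),
    ("Obj4", "Att1"), ("Obj4", "Att4"),
}


def cross_table(objects, attributes, incidence):
    """Render K = (G, M, I) as text: one row per object, 'X' where (g, m) is in I."""
    header = "     " + " ".join(f"{m:<4}" for m in attributes)
    rows = [header.rstrip()]
    for g in objects:
        marks = ["X" if (g, m) in incidence else "." for m in attributes]
        cells = " ".join(f"{mark:<4}" for mark in marks)
        rows.append(f"{g:<5}{cells}".rstrip())
    return "\n".join(rows)


TABLE = cross_table(OBJECTS, ATTRIBUTES, INCIDENCE)
print(TABLE)
```

Storing $I$ as a set of pairs keeps the membership test `(g, m) in incidence` identical to the notation $(g, m) \in I$ of the definition.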

1.3.2 Concept Lattice

A formal context can be represented as a lattice of concepts in which each concept includes a set of objects and the related attributes. The definitions of formal concept and concept lattice are given in the following [WH06].

Definition 2 (' Operation): For a set $A \subseteq G$ of objects, we define:

$A' = \{ m \in M \mid \forall g \in A : (g, m) \in I \}$

Correspondingly, for a set $B \subseteq M$ of attributes, we define:

$B' = \{ g \in G \mid \forall m \in B : (g, m) \in I \}$

The formal concept is defined as:

Definition 3 (Formal Concept): A formal concept $C$ in the formal context $(G, M, I)$ is a pair $(A, B)$, where $A \subseteq G$, $B \subseteq M$, $A' = B$ and $B' = A$. The set $A$ is called the extent and the set $B$ the intent of the concept $C$.

In other words, each concept is represented by a pair consisting of an extension and an intension, which are a set of objects and a set of attributes, respectively. As a general rule, the objects in the extension have all the attributes of the intension in common and have no other attributes in common. Further, all the attributes in the intension are shared by all the objects in the extension and by no other object outside of the extension.
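Definitions 2 and 3 translate almost directly into code. The sketch below uses a hypothetical four-object context stored as a mapping from each object to its attribute set; the names common_attributes, common_objects and is_formal_concept are illustrative and do not come from the thesis or from any FCA library.

```python
# Hypothetical context (G, M, I), stored as object -> set of attributes.
CONTEXT = {
    "Obj1": {"Att1", "Att3"},
    "Obj2": {"Att1", "Att2"},
    "Obj3": {"Att1", "Att4"},
    "Obj4": {"Att1", "Att4"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())


def common_attributes(objects):
    """A': attributes shared by every object in A (all of M when A is empty)."""
    result = set(ALL_ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def common_objects(attributes):
    """B': objects that have every attribute in B."""
    return {g for g, attrs in CONTEXT.items() if set(attributes) <= attrs}


def is_formal_concept(extent, intent):
    """(A, B) is a formal concept iff A' = B and B' = A."""
    return (common_attributes(extent) == set(intent)
            and common_objects(intent) == set(extent))


print(common_attributes({"Obj3", "Obj4"}))
print(is_formal_concept({"Obj3", "Obj4"}, {"Att1", "Att4"}))
print(is_formal_concept({"Obj3"}, {"Att1", "Att4"}))
```

In this context the pair ({Obj3}, {Att1, Att4}) is not a concept, because Obj4 also has both attributes and therefore belongs to the extent.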

A concept lattice arises on top of the concepts derived from a formal context.

Definition 4 (Concept Lattice): For a formal context $K := (G, M, I)$ and two concepts $C_1 = (A_1, B_1)$ and $C_2 = (A_2, B_2)$, a hierarchical subconcept-superconcept relation is given by

$(A_1, B_1) \le (A_2, B_2) \iff A_1 \subseteq A_2 \ (\iff B_2 \subseteq B_1)$

The set of all concepts in $K$ ordered by the $\le$ relation is called the concept lattice of $K$.

The concept lattice of the above example is shown in Figure 1.4.
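The following sketch enumerates all formal concepts of a small hypothetical context by brute force and compares them with the subconcept order. It is only meant to illustrate Definition 4: closing every subset of objects is exponential in the number of objects, and it is not the lattice construction algorithm used by dedicated FCA tools. All names are illustrative.

```python
from itertools import combinations

# Hypothetical context (G, M, I), stored as object -> set of attributes.
CONTEXT = {
    "Obj1": {"Att1", "Att3"},
    "Obj2": {"Att1", "Att2"},
    "Obj3": {"Att1", "Att4"},
    "Obj4": {"Att1", "Att4"},
}
ALL_OBJECTS = set(CONTEXT)
ALL_ATTRIBUTES = set().union(*CONTEXT.values())


def intent_of(objects):
    """A': attributes common to all objects in A."""
    result = set(ALL_ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return frozenset(result)


def extent_of(attributes):
    """B': objects having all attributes in B."""
    return frozenset(g for g, attrs in CONTEXT.items() if attributes <= attrs)


def all_concepts():
    """Brute force: close every subset of objects (tiny contexts only)."""
    concepts = set()
    for size in range(len(ALL_OBJECTS) + 1):
        for subset in combinations(sorted(ALL_OBJECTS), size):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts


def is_subconcept(c1, c2):
    """C1 <= C2 iff the extent of C1 is contained in the extent of C2."""
    return c1[0] <= c2[0]


CONCEPTS = all_concepts()
TOP = (frozenset(ALL_OBJECTS), intent_of(ALL_OBJECTS))
for extent, intent in sorted(CONCEPTS, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```

For this context the lattice has a bottom concept with an empty extent, three intermediate concepts, and a top concept whose intent is the attribute shared by every object.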

[Figure 1.4: Concept lattice example]

The reduced labeling diagram shows each attribute and each object only once in the lattice diagram (Figure 1.5); it therefore makes data analysis easier for some applications.

1.4 DBpedia

DBpedia is a project that aims at extracting structured information from Wikipedia content. This open-source dataset is available on the web as linked data (RDF triples) for human and machine usage. Since DBpedia is provided in structured form, it allows users to query and explore Wikipedia content much more easily through a SPARQL endpoint. So far, DBpedia is known as a central interlinking hub for published data on the web, and it evolves with changes in Wikipedia [ABKLC08]. DBpedia includes around 3.5 million instances that belong to different categories. Also, DBpedia is available in 97 different languages. More information is given later in this section.

In the following, the structure of DBpedia and the source of its data are described. Finally, the methods for accessing DBpedia are discussed.
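As a hedged illustration of querying DBpedia through a SPARQL endpoint, the sketch below only builds the request URL and does not contact the server. The endpoint address is the public DBpedia one; the "format" parameter is an assumption about the server's HTTP interface, and the helper names build_request_url and query_from_url are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Public DBpedia endpoint; the "format" parameter is an assumption about the
# server's HTTP interface and may need adjusting.
ENDPOINT = "https://dbpedia.org/sparql"

# Ask for the classes (rdf:type values) of the resource describing Portugal.
QUERY = """
SELECT ?type WHERE {
  <http://dbpedia.org/resource/Portugal>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type .
} LIMIT 20
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as an HTTP GET URL for the given endpoint."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


def query_from_url(url):
    """Decode the SPARQL query back out of a request URL (round-trip check)."""
    return parse_qs(urlsplit(url).query)["query"][0]


URL = build_request_url(ENDPOINT, QUERY)
print(URL)
```

The resulting URL could be fetched with any HTTP client; keeping URL construction separate from the network call makes the query easy to inspect before it is sent.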

[Figure 1.5: Reduced labeling diagram of concept lattice example (node labels include Att 1, Att 3, Att 4, Obj 1, Obj 3 and Obj 4)]

1.4.1 Data Source

DBpedia is a cross-domain ontology which has been built manually by the members of the DBpedia community. DBpedia uses Wikipedia as its source of knowledge. Wikipedia is one of the fastest-growing and largest collections of human knowledge ever assembled. Since some of the information in Wikipedia is unstructured, querying it requires a full-text search. The DBpedia community found a way to convert the contents of Wikipedia into RDF triples.

In addition to free-text information, DBpedia also uses the different types of structured information in Wikipedia, including infobox templates [1], titles, abstracts, categorization information, images, geographic information, and external URL links, and converts them into RDF triples. Figure 1.6 shows an example of extracting semantics from a Wikipedia infobox for Portugal's content in DBpedia. Currently, the DBpedia ontology is built from several Wikipedia infobox templates, which are mapped to 359 classes with 1,775 properties.

[1] http://en.wikipedia.org/wiki/Help:Infobox

[Figure 1.6: Infobox of Portugal (Wikipedia infobox markup for the Algarve region, with fields such as map_caption, subdivision_type, subdivision_name, timezone and utc_offset)]

As mentioned earlier, DBpedia includes 3.5 million instances, 2.35 million of which are classified in the DBpedia ontology, including persons, places, works (which contain music albums, films and video games), organizations (which contain companies and educational institutions), and species. Table 1.1 shows the number of instances per class in the DBpedia ontology.

Class               Instances
Resource (overall)  2,350,000
Place               573,000
Person              764,000
Work                333,000
Species             202,000
Organisation        192,000

Table 1.1: Classes in DBpedia ontology
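The idea of turning infobox fields into RDF triples can be sketched as follows. This is a deliberately simplified illustration and not DBpedia's actual extraction framework: the field names and values are cleaned-up guesses in the style of Figure 1.6, the function name infobox_to_triples is hypothetical, and only wiki links and plain literals are handled.

```python
import re

RESOURCE_NS = "http://dbpedia.org/resource/"
PROPERTY_NS = "http://dbpedia.org/property/"

# A few cleaned-up fields in the style of Figure 1.6; the exact field names
# and values are assumptions made for illustration.
INFOBOX = """
| name = Algarve
| subdivision_name3 = [[Faro, Portugal|Faro]]
| area_total_km2 = 5412
"""

# Matches [[Target]] or [[Target|Label]] and captures the link target.
WIKILINK = re.compile(r"^\[\[([^|\]]+)(?:\|[^\]]*)?\]\]$")


def infobox_to_triples(subject, markup):
    """Map each 'key = value' infobox line to one (s, p, o) triple."""
    triples = []
    for line in markup.splitlines():
        if "=" not in line:
            continue
        key, value = line.lstrip("| ").split("=", 1)
        key, value = key.strip(), value.strip()
        link = WIKILINK.match(value)
        if link:
            # Wiki links become resource URIs (spaces replaced by underscores).
            obj = RESOURCE_NS + link.group(1).strip().replace(" ", "_")
        else:
            obj = value  # everything else is kept as a plain literal
        triples.append((RESOURCE_NS + subject, PROPERTY_NS + key, obj))
    return triples


TRIPLES = infobox_to_triples("Algarve", INFOBOX)
for triple in TRIPLES:
    print(triple)
```

The subject is the page's resource URI, each infobox key becomes a property, and a linked value becomes another resource, which is what lets the extracted data connect into a graph.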

1.4.2 Data Structure

In contrast to Wikipedia, which lacks a structural representation of data, DBpedia uses Semantic Web technologies for extracting structured information from Wikipedia to facilitate querying and searching tasks. The DBpedia ontology is based on the OWL language [PHH04].
