Ontologies, Knowledge Bases, Wikidata MPRI 2.26.2: Web Data Management

Academic year: 2022

Ontologies, Knowledge Bases, Wikidata

MPRI 2.26.2: Web Data Management

Antoine Amarilli

Friday, January 11th

1/31

(2)

Reminder

• Ontology: vocabulary (classes and relations) to describe things

• Knowledge base: set of facts in one or several ontologies

• Focus on Wikidata: a general-purpose knowledge base and ontology

2/31

(3)

Ontologies

(4)

Ontologies

• Various domain-specific vocabularies used across knowledge bases

• One general-purpose ontology used by Google, Microsoft, Yahoo, Yandex: schema.org

• Other ontologies that come together with a knowledge base

3/31

(5)

Friend of a friend (FOAF)

Describes people, relationships, profiles, and activities (social networks)

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <#JW>
        a foaf:Person ;
        foaf:name "Jimmy Wales" ;
        foaf:mbox <mailto:[email protected]> ;
        foaf:homepage <http://www.jimmywales.com> ;
        foaf:nick "Jimbo" ;
        foaf:depiction <http://www.jimmywales.com/aus_img_small.jpg> ;
        foaf:interest <http://www.wikimedia.org> ;
        foaf:knows [
            a foaf:Person ;
            foaf:name "Angela Beesley"
        ] .
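To make the triple structure behind the Turtle syntax explicit, here is a minimal Python sketch that expands a person description into (subject, predicate, object) triples. The helper names `person_triples` and `new_blank_node` are illustrative assumptions, not part of FOAF or of any RDF library, and the sketch does not distinguish literals from IRIs.

```python
from itertools import count

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

_blank_ids = count(1)


def new_blank_node():
    """Return a fresh blank-node label such as '_:b1' (hypothetical helper)."""
    return f"_:b{next(_blank_ids)}"


def person_triples(subject, name, knows=()):
    """Expand a foaf:Person description into (subject, predicate, object) triples."""
    triples = [
        (subject, RDF_TYPE, FOAF + "Person"),  # Turtle's "a" abbreviates rdf:type
        (subject, FOAF + "name", name),
    ]
    for friend_name in knows:
        friend = new_blank_node()  # "[ ... ]" in Turtle introduces an anonymous node
        triples.append((subject, FOAF + "knows", friend))
        triples.extend(person_triples(friend, friend_name))
    return triples


triples = person_triples("#JW", "Jimmy Wales", knows=["Angela Beesley"])
for triple in triples:
    print(triple)
```

The nested `[ ... ]` block of the Turtle example becomes a blank node that is the object of `foaf:knows` and the subject of its own two triples.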

4/31

(6)

Creative Commons

Describes the license and rights on documents

    <div about="http://lessig.org/blog/"
         xmlns:cc="http://creativecommons.org/ns#">
      This page, by <a property="cc:attributionName"
                       rel="cc:attributionURL"
                       href="http://lessig.org/">Lawrence Lessig</a>,
      is licensed under a <a rel="license"
        href="http://creativecommons.org/licenses/by/3.0/">
        Creative Commons Attribution License</a>.
    </div>

• Many content providers add this kind of markup (e.g., Flickr)

• Search engines can use it (e.g., Google)
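As an illustration of how a consumer might read this markup, the following sketch collects the targets of `rel="license"` links using only the Python standard library. The class name `LicenseFinder` is an assumption made for this example, and the approach ignores the rest of the ccREL vocabulary.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect the href of every <a> element whose rel attribute contains 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel_values = (attrs.get("rel") or "").split()  # rel is a space-separated list
        if tag == "a" and "license" in rel_values and attrs.get("href"):
            self.licenses.append(attrs["href"])


SAMPLE = """<div about="http://lessig.org/blog/" xmlns:cc="http://creativecommons.org/ns#">
This page, by <a property="cc:attributionName" rel="cc:attributionURL"
href="http://lessig.org/">Lawrence Lessig</a>, is licensed under a
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/">Creative
Commons Attribution License</a>.</div>"""

finder = LicenseFinder()
finder.feed(SAMPLE)
print(finder.licenses)
```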

5/31

(7)

Other domain-specific ontologies

• Dublin Core (DC): Describe digital resources (videos, images, etc.) and physical resources (books, CDs, etc.)

• Simple knowledge organization system (SKOS): describe thesauri, taxonomies, etc.

• Open Graph Protocol: metadata for Web pages to be integrated in Facebook’s social graph; also Twitter Cards for Twitter

• DOAP (Description of a Project): describe software projects

• VoID (Vocabulary of Interlinked Datasets): describe a linked dataset

• Countless others

6/31

(8)

Schema.org: a general-purpose ontology

• General-purpose ontology: 598 types and 862 properties in version 3.5

• Intended to be used on Web pages to annotate the semantics of elements

• Used by search engines for rich search results

• Used on over 10 million sites [1]

[1] Source: https://schema.org/

7/31

(9)

Format: Microdata

    <div class="event-wrapper" itemscope itemtype="http://schema.org/Event">
      <div class="event-date" itemprop="startDate"
           content="2013-09-14T21:30">Sat Sep 14</div>
      <div class="event-title" itemprop="name">
        Typhoon with Radiation City</div>
      <div class="event-venue" itemprop="location"
           itemscope itemtype="http://schema.org/Place">
        <span itemprop="name">The Hi-Dive</span>
        <div class="address" itemprop="address" itemscope
             itemtype="http://schema.org/PostalAddress">
          <span itemprop="streetAddress">7 S. Broadway</span><br>
          <span itemprop="addressLocality">Denver</span>,
          <span itemprop="addressRegion">CO</span>
          <span itemprop="postalCode">80209</span>
        </div>
      </div>
      <div class="event-time">9:30 PM</div>
    </div>

• itemscope creates an item and itemtype gives its type

• itemprop gives values for properties of the item

8/31
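The roles of itemscope, itemtype and itemprop can be sketched with a small extractor built on the standard-library HTML parser. This is a simplified illustration under stated assumptions: the class `MicrodataParser` is hypothetical, a property keeps only its last value, and features of the Microdata specification such as `itemref` and URL-valued properties are ignored.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # elements with no end tag


class MicrodataParser(HTMLParser):
    """Collect Microdata items as nested dictionaries (simplified sketch)."""

    def __init__(self):
        super().__init__()
        self.items = []    # top-level items found in the page
        self._scopes = []  # items currently open, innermost last
        self._open = []    # open elements: (tag, item started here, text property, chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        item, text_prop = None, None
        if "itemscope" in attrs:  # itemscope creates an item, itemtype types it
            item = {"@type": attrs.get("itemtype")}
            if prop and self._scopes:
                self._scopes[-1][prop] = item  # nested item used as a property value
            else:
                self.items.append(item)
            self._scopes.append(item)
        elif prop and self._scopes:
            if "content" in attrs:
                self._scopes[-1][prop] = attrs["content"]  # machine-readable value
            else:
                text_prop = prop  # value is the element's text, known at the end tag
        if tag not in VOID_TAGS:
            self._open.append((tag, item, text_prop, []))
        elif item is not None:
            self._scopes.pop()  # a void element cannot contain properties

    def handle_data(self, data):
        for _tag, _item, text_prop, chunks in self._open:
            if text_prop:
                chunks.append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self._open]:
            return  # stray end tag: ignore it
        while True:
            open_tag, item, text_prop, chunks = self._open.pop()
            if item is not None:
                self._scopes.pop()
            if text_prop:
                self._scopes[-1][text_prop] = " ".join("".join(chunks).split())
            if open_tag == tag:
                break


SAMPLE = """
<div itemscope itemtype="http://schema.org/Event">
  <div itemprop="startDate" content="2013-09-14T21:30">Sat Sep 14</div>
  <div itemprop="name">
    Typhoon with Radiation City</div>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">The Hi-Dive</span>
    <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="streetAddress">7 S. Broadway</span><br>
      <span itemprop="postalCode">80209</span>
    </div>
  </div>
</div>
"""

parser = MicrodataParser()
parser.feed(SAMPLE)
print(parser.items[0])
```

Note that the `content` attribute takes precedence over the visible text, which is how the slide's example gives a machine-readable date while displaying "Sat Sep 14".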

(10)

Format: RDFa

A format competing with Microdata; it seems less common [2]

    <div vocab="http://schema.org/" class="event-wrapper" typeof="Event">
      <div class="event-date" property="startDate"
           content="2013-09-14T21:30">Sat Sep 14</div>
      <div class="event-title" property="name">
        Typhoon with Radiation City</div>
      <div class="event-venue" property="location" typeof="Place">
        <span property="name">The Hi-Dive</span>
        <div class="address" property="address" typeof="PostalAddress">
          <span property="streetAddress">7 S. Broadway</span><br>
          <span property="addressLocality">Denver</span>,
          <span property="addressRegion">CO</span>
          <span property="postalCode">80209</span>
        </div>
      </div>
      <div class="event-time">9:30 PM</div>
    </div>

[2] http://webdatacommons.org/structureddata/index.html#toc2

9/31

(11)

Format: JSON-LD

Alternative approach: give the structured data separately in JSON

    <script type="application/ld+json">
    {
      "@context": "http://schema.org",
      "@type": "Event",
      "location": {
        "@type": "Place",
        "address": {
          "@type": "PostalAddress",
          "addressLocality": "Denver",
          "addressRegion": "CO",
          "postalCode": "80209",
          "streetAddress": "7 S. Broadway"
        },
        "name": "The Hi-Dive"
      },
      "name": "Typhoon with Radiation City",
      "startDate": "2013-09-14T21:30"
    }
    </script>

• The @context attribute gives the namespace for the @type

• The structured data is no longer linked to the visible page contents

• Also @id to give a URI to a node

• Many other features (editor’s draft of the spec is 167 pages)
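Because the JSON-LD block is plain JSON inside a script element, a consumer can pull it out without interpreting the rest of the page. The sketch below shows one way to do this with the Python standard library; the class name `JsonLdExtractor` is an assumption for this example, and no JSON-LD processing (context expansion, @id resolution) is attempted.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Parse the contents of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self.documents = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


PAGE = """<html><body>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Event",
 "name": "Typhoon with Radiation City",
 "startDate": "2013-09-14T21:30",
 "location": {"@type": "Place", "name": "The Hi-Dive"}}
</script>
<p>Visible page content, unrelated to the structured data.</p>
</body></html>"""

extractor = JsonLdExtractor()
extractor.feed(PAGE)
event = extractor.documents[0]
print(event["@type"], "at", event["location"]["name"])
```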

10/31

(12)

Web Data Commons Structured Data

• Extraction of semantic content from the Common Crawl

• Also useful to measure usage of structured data:

• In November 2017, the Common Crawl contained 66 TB (compressed), 260 TB (uncompressed), 3.2G pages

• 39% of pages (and 28% of domains) contained semantic data

• 9G entities and 38G triples

• http://webdatacommons.org/structureddata/

11/31

(13)

Knowledge bases

(14)

Common Knowledge bases

• General-purpose: DBpedia, YAGO, Freebase (defunct), Wikidata

• Proprietary: Google Knowledge Graph, Bing Knowledge Graph (aka Satori)

• Domain-specific

• We will focus on Wikidata afterwards

12/31

(15)

DBpedia

• Started in 2007

• License: CC-BY-SA

• Code license: GPLv2

• Actors: Leipzig University, University of Mannheim, OpenLink Software

• Latest release: 2016-10

• Extracted from Wikimedia projects

• 6M entities and 10G triples in 2016-04 [3]

[3] https://blog.dbpedia.org/2016/10/19/yeah-we-did-it-again-new-2016-04-dbpedia-release/

13/31

(16)

YAGO

• Started in 2008

• License: CC-BY

• Code license: GPLv3

• Actors: Max Planck Institute for Informatics, Télécom ParisTech

• Latest release: YAGO 3.1 (2017)

• Extracted from Wikipedias and other sources; manual evaluation

• 10M entities and 120M triples [4]

[4] http://yago-knowledge.org/

14/31

(17)

Freebase

• Started in 2007, discontinued in 2016

• License: CC-BY

• Code license: Apache2 (provided after-the-fact by Google)

• Actors: Metaweb, acquired by Google in 2010

• Initially imported from various sources

• Could be edited by anyone

• Partially imported into Wikidata

• Last release: 2016

• Last dump has 1.9G triples

15/31

(18)

Wikidata

• Started in 2012

• License: public domain

• Code license: GPLv2

• Actors: Wikimedia Deutschland, Wikimedia

• Last release: weekly

• Around 650M statements and 54M items

• Can be edited by anyone! Around 20k active users.

16/31

(19)

Domain-specific

• MusicBrainz, for CDs and music in general (20 million recordings)

• British National Bibliography: bibliographic details about books published in the UK since 1950

• data.bnf.fr, data from the French national library

• OpenStreetMap and GeoNames

• Medicine and chemistry with SNOMED CT, and other databases: DrugBank, KEGG, UniProt, ChEMBL, etc.

• Linguistic resources, e.g., BabelNet

• Bibliography, e.g., DBLP, Crossref

17/31

(20)

Linked Open Data

[Figure: the Linked Open Data Cloud, from lod-cloud.net. Legend: Cross Domain, Geography, Government, Life Sciences, Linguistics, Media, Publications, Social Networking, User Generated]

18/31

(21)

Gathering Semantic Web Data

• Browsing online versions of KBs

• Using ad-hoc APIs to retrieve relevant triples

• Using a SPARQL endpoint

• Downloading a dump

• Crawling other knowledge bases, e.g., dereferencing Cool URIs
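Dereferencing a Cool URI typically relies on HTTP content negotiation: the client asks for an RDF serialization through the Accept header. The sketch below only builds such a request, without sending it; the function name `rdf_request` is illustrative, and whether a given server honors `text/turtle` is an assumption to check for each knowledge base.

```python
import urllib.request


def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP request asking for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


request = rdf_request("http://www.wikidata.org/entity/Q42")
print(request.full_url, request.get_header("Accept"))

# To actually fetch the data (requires network access):
# with urllib.request.urlopen(request) as response:
#     turtle = response.read().decode("utf-8")
```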

19/31

(22)

Systems

• RDF stores (triplestores) with relational or native backend, open-source or commercial, related to graph databases

• Apache Jena

• Virtuoso

• Blazegraph, essentially acquired by Amazon

• Amazon Neptune

• SPARQL engines, usually on top of a triplestore: http://en.wikipedia.org/wiki/SPARQL

• Tool to view semantic data in Web pages: http://www.google.com/webmasters/tools/richsnippets
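As a rough illustration of what a triplestore offers, here is a toy in-memory store that answers single triple patterns, the building block that SPARQL engines combine into full queries. The class `TinyTripleStore` and the `ex:` identifiers are hypothetical; real systems such as those listed above use far more elaborate indexes and storage.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy triplestore: a set of triples plus an index on the subject."""

    def __init__(self):
        self._triples = set()
        self._by_subject = defaultdict(set)

    def add(self, s, p, o):
        triple = (s, p, o)
        self._triples.add(triple)
        self._by_subject[s].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        candidates = self._by_subject.get(s, set()) if s is not None else self._triples
        return sorted(
            t for t in candidates
            if (p is None or t[1] == p) and (o is None or t[2] == o)
        )


store = TinyTripleStore()
store.add("ex:jimmy", "rdf:type", "foaf:Person")
store.add("ex:angela", "rdf:type", "foaf:Person")
store.add("ex:jimmy", "foaf:knows", "ex:angela")

print(store.match(s="ex:jimmy", p="foaf:knows"))  # whom does ex:jimmy know?
print(store.match(p="rdf:type"))                  # all typing triples
```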

20/31

(23)

Semantic Web challenges

• Complexity:

• Writing structured content is harder than writing text!

• Using structured content (with heterogeneous schema) is complicated!

• Discoverability problem for knowledge bases, vocabularies

• Performance:

• Data is large

• Running queries on graphs is tricky

• Reasoning makes it even worse

• Federation makes things worse again

21/31

(24)

Semantic Web challenges, cont’d

• Data quality:

• Vagueness and modeling issues

• Trust (anyone can add a triple)

• Canonicity and alignment

• Temporality, sources often complicated to represent

• Open-world semantics: missing values vs no values

• Incentives: many data providers do not want to be eaten by others

22/31

(25)

Wikidata

(26)

Why Wikidata matters

• Backed by the Wikimedia Foundation: credible and noncommercial

• Not run by academics, but some academics are involved

• Genuine uses on Wikipedia (to some extent)

• Centralized model, which is a good idea for now

• Good tradeoffs in terms of expressiveness, scope...

• Uses the successful wiki model

23/31

(27)

Wikidata basics

• Entities: Q1, Q2, Q3, ..., Q60527475 and beyond

• Properties: P1, P2, P3, ..., P6343 and beyond

• Entities and properties have a label and short description in each language, along with aliases (search engine)

• Entities can also have sitelinks to Wikimedia projects (e.g., the corresponding Wikimedia pages)

• For each entity and property, we can have facts (or claims) with different objects

• Everyone can create and edit entities and facts

• Discussion is needed before creating a property

• Software: Wikibase, a set of extensions to MediaWiki
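The data model described above can be sketched as a small Python class. This is an illustrative simplification, not Wikibase's actual schema: the class `Entity`, its field names, and the placeholder object identifiers `Q_COLLEGE` and `Q_SCHOOL` are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str                                            # e.g. "Q42" or "P69"
    labels: dict = field(default_factory=dict)         # language code -> label
    descriptions: dict = field(default_factory=dict)   # language code -> short description
    aliases: dict = field(default_factory=dict)        # language code -> list of aliases
    sitelinks: dict = field(default_factory=dict)      # site id -> page title
    claims: dict = field(default_factory=dict)         # property id -> list of objects

    def add_claim(self, prop, value):
        """Add a fact; the same property may have several different objects."""
        self.claims.setdefault(prop, []).append(value)

    def matches(self, text, lang):
        """Search-engine style lookup on the label and aliases in one language."""
        names = list(self.aliases.get(lang, []))
        if lang in self.labels:
            names.append(self.labels[lang])
        return text.lower() in (name.lower() for name in names)


entity = Entity(
    id="Q42",
    labels={"en": "Douglas Adams"},
    descriptions={"en": "English writer and humorist"},
    aliases={"en": ["Douglas Noel Adams"]},
    sitelinks={"enwiki": "Douglas Adams"},
)
entity.add_claim("P69", "Q_COLLEGE")  # placeholder object ids, not real items
entity.add_claim("P69", "Q_SCHOOL")

print(entity.matches("douglas noel adams", "en"))
print(entity.claims["P69"])
```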

24/31

(32)

Qualifiers, references, ranks, data types

• Each fact can have qualifiers to indicate things like start/end time, details (e.g., major/degree for P69 “educated at”)

• Each fact can also have sources to indicate where it comes from (a source is a set of key–value pairs)

• Each fact can have a rank among “normal”, “preferred” (e.g., for the current value), or “deprecated”.

• Literal values can have data types: https://www.wikidata.org/wiki/Special:ListDatatypes

• Also two special values

• “unknown value” (a value exists but is unknown)

• “no value” (it is known that there is no value)
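A fact with qualifiers, references, a rank and the two special values could be modeled as follows. This is a hedged sketch: the `Statement` class, the property identifier `P_MAYOR`, the item names and the qualifier keys are made up for illustration (Wikidata itself uses property identifiers as qualifier keys). The selection rule in `best_statements` (preferred statements if any, otherwise normal ones, never deprecated ones) reflects how ranks are commonly used to pick the current value.

```python
from dataclasses import dataclass, field

UNKNOWN_VALUE = object()  # a value exists but is unknown
NO_VALUE = object()       # it is known that there is no value


@dataclass
class Statement:
    prop: str
    value: object
    rank: str = "normal"                             # "normal", "preferred" or "deprecated"
    qualifiers: dict = field(default_factory=dict)   # e.g. start/end time
    references: list = field(default_factory=list)   # each reference is a dict of key-value pairs


def best_statements(statements):
    """Return the preferred statements if any, else the normal ones; never deprecated ones."""
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


old = Statement("P_MAYOR", "Q_ALICE", qualifiers={"end time": "2014"})
current = Statement("P_MAYOR", "Q_BOB", rank="preferred",
                    qualifiers={"start time": "2014"},
                    references=[{"stated in": "Q_GAZETTE"}])
wrong = Statement("P_MAYOR", "Q_CAROL", rank="deprecated")

print([s.value for s in best_statements([old, current, wrong])])

# The special values are explicit statements, distinct from a missing statement
childless = Statement("P_CHILD", NO_VALUE)
print(childless.value is NO_VALUE)
```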

25/31

(33)

Constraints

• Wikidata has constraints which are only advisory (= you can create violations) and are quite simple. Main ones:

• “single (best) value constraint”

• “inverse constraint” (mother vs child), “symmetric constraint”

• “type constraint”, or requiring/disallowing certain facts

• “range constraint”, “contemporary constraint”, “format constraint”

• “one-of/none-of constraint” (list of allowed/forbidden values)

• Requiring/allowing qualifiers or units

• Allowing use as a qualifier/unit

• There is a mechanism for exceptions

• Many constraint violations in practice
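Since constraints are advisory, checking them amounts to listing violations rather than rejecting edits. The following sketch does this for two of the constraints above on a toy set of facts; the function names and the plain-string property names `mother` and `child` are assumptions for this example, not Wikidata's implementation.

```python
def single_value_violations(claims, prop):
    """Subjects that have more than one distinct object for prop."""
    by_subject = {}
    for s, p, o in claims:
        if p == prop:
            by_subject.setdefault(s, set()).add(o)
    return sorted(s for s, objects in by_subject.items() if len(objects) > 1)


def inverse_violations(claims, prop, inverse_prop):
    """Facts (s, prop, o) for which the inverse fact (o, inverse_prop, s) is missing."""
    facts = set(claims)
    return sorted(
        (s, p, o) for s, p, o in facts
        if p == prop and (o, inverse_prop, s) not in facts
    )


claims = [
    ("alice", "child", "bob"),
    ("bob", "mother", "alice"),
    ("carol", "child", "dan"),   # inverse fact (dan, mother, carol) is missing
    ("dan", "mother", "eve"),    # dan has two mothers: single-value violation
    ("dan", "mother", "frank"),
]

print(single_value_violations(claims, "mother"))
print(inverse_violations(claims, "child", "mother"))
```

Both functions only report; nothing prevents the violating facts from being stored, which matches the advisory nature of the constraints.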

26/31

(34)

Usage on Wikipedia

• Used for interwiki links, i.e., the links between Wikipedia pages across languages

• Used in some infoboxes on Wikipedia, e.g., to automatically populate some fields

• Can be used for other things, e.g., filling tables, or external links to other sources

• Policy depends on each Wikipedia: some communities are more welcoming than others...

27/31

(35)

Ongoing Wikidata discussions

• Project scope: what belongs in Wikidata?

• The public domain license is a strong requirement

• Concerns, e.g., about the high number of bibliographic entities (almost half of the entities)

• Some external datasets are imported, but Wikipedia (historically) placed great importance on human validation of imports

• Some support for federation in queries; and many external links

• Notability: essentially no policy currently

• Managing vandalism?

• Importance of references?

28/31

(36)

Accessing Wikidata data

• Simply by browsing

• Can retrieve in multiple formats, e.g., https://www.wikidata.org/wiki/Special:EntityData/Q42.json

• For simple queries (triple patterns), Linked data fragments https://query.wikidata.org/bigdata/ldf

• Wikimedia API, e.g., API for recent changes

• SPARQL queries, https://query.wikidata.org/ (and API)

• Weekly dumps in JSON, RDF, XML (around 50 GB compressed)
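Two of these access paths can be sketched as URL builders. The entity URL follows the Special:EntityData pattern shown above; the SPARQL endpoint path `https://query.wikidata.org/sparql` with `query` and `format` parameters is not given on the slide and is stated here as an assumption, as are the function names and the example query.

```python
from urllib.parse import urlencode


def entity_data_url(entity_id, fmt="json"):
    """URL of one entity in a given serialization (pattern from the slide)."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.{fmt}"


def sparql_url(query):
    """GET URL for a SPARQL query; the endpoint path is an assumption."""
    params = urlencode({"query": query, "format": "json"})
    return "https://query.wikidata.org/sparql?" + params


# Illustrative query: five items that are instances (P31) of the class Q146
QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 5"

print(entity_data_url("Q42"))
print(sparql_url(QUERY))
```

For bulk processing the weekly dumps are usually a better fit than many individual requests.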

29/31

(37)

Other cool Wikidata stuff

• Distributed Wikidata Game: crowdsourcing edits on Wikidata https://tools.wmflabs.org/wikidata-game/distributed/

• Reasonator: automatically generate a Wikipedia-like page from a Wikidata entity https://tools.wmflabs.org/reasonator/

• Lexemes: ongoing effort to add linguistic data to Wikidata

• OWL ontology: http://wikiba.se/ontology

• askplatyp.us: natural language question answering tool

• File captions on Wikimedia Commons to have a structured way to give labels to images (deployed on January 10)

• OpenRefine to reconcile datasets with Wikidata and add Wikidata facts https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Video

30/31

(38)

Slide acknowledgements

• Many thanks to Thomas Pellissier-Tanon for his helpful feedback

• Slide 4: https://en.wikipedia.org/wiki/FOAF_(ontology)

• Slide 5: https://www.w3.org/Submission/ccREL/

• Slides 8–10: https://schema.org/Event

• Slide 13: https://commons.wikimedia.org/wiki/File:DBpediaLogo.svg

• Slide 14: https://en.wikipedia.org/wiki/File:YAGO.svg

• Slide 15: https://commons.wikimedia.org/wiki/File:Freebase_Logo_optimised.svg

• Slides 16, 23: https://en.wikipedia.org/wiki/File:Wikidata-logo-en.svg

31/31
