Ontologies, Knowledge Bases, Wikidata
MPRI 2.26.2: Web Data Management
Antoine Amarilli
Friday, January 11th
Reminder
• Ontology: vocabulary (classes and relations) to describe things
• Knowledge base: set of facts in one or several ontologies
→ Focus on Wikidata: a general-purpose knowledge base and ontology
Ontologies
• Various domain-specific vocabularies used across knowledge bases
• One general-purpose ontology used by Google, Microsoft, Yahoo, Yandex: schema.org
• Other ontologies that come together with a knowledge base
Friend of a friend (FOAF)
Describe people, relationships, profiles, activities (social networks)
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<#JW>
    a foaf:Person ;
    foaf:name "Jimmy Wales" ;
    foaf:mbox <mailto:[email protected]> ;
    foaf:homepage <http://www.jimmywales.com> ;
    foaf:nick "Jimbo" ;
    foaf:depiction <http://www.jimmywales.com/aus_img_small.jpg> ;
    foaf:interest <http://www.wikimedia.org> ;
    foaf:knows [
        a foaf:Person ;
        foaf:name "Angela Beesley"
    ] .
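To make the triple structure of this description concrete, here is a minimal Python sketch that stores a few of the statements above as (subject, predicate, object) tuples and follows foaf:knows to a name. The blank-node label `_:b0` and the helper names are illustrative choices, not part of FOAF, and the sketch ignores the distinction between IRIs and literals that a real RDF library would keep.

```python
# Namespace IRIs taken from the @prefix lines of the example.
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # "a" in Turtle

# A subset of the triples from the example.
# "_:b0" is an arbitrary label for the blank node written as [ ... ].
triples = [
    ("#JW", RDF_TYPE, FOAF + "Person"),
    ("#JW", FOAF + "name", "Jimmy Wales"),
    ("#JW", FOAF + "nick", "Jimbo"),
    ("#JW", FOAF + "knows", "_:b0"),
    ("_:b0", RDF_TYPE, FOAF + "Person"),
    ("_:b0", FOAF + "name", "Angela Beesley"),
]


def objects(subject, predicate):
    """Return every o such that (subject, predicate, o) is a stored triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def names_of_acquaintances(person):
    """Follow foaf:knows from `person`, then read foaf:name on each target."""
    return [
        name
        for friend in objects(person, FOAF + "knows")
        for name in objects(friend, FOAF + "name")
    ]


print(names_of_acquaintances("#JW"))  # prints ['Angela Beesley']
```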
Creative Commons
Describe the license and rights on documents
<div about="http://lessig.org/blog/"
     xmlns:cc="http://creativecommons.org/ns#">
  This page, by
  <a property="cc:attributionName"
     rel="cc:attributionURL"
     href="http://lessig.org/">Lawrence Lessig</a>,
  is licensed under a
  <a rel="license"
     href="http://creativecommons.org/licenses/by/3.0/">Creative Commons Attribution License</a>.
</div>
• Many content providers add this kind of markup (e.g., Flickr)
• Search engines can use it (e.g., Google)
Other domain-specific ontologies
• Dublin Core (DC): Describe digital resources (videos, images, etc.) and physical resources (books, CDs, etc.)
• Simple knowledge organization system (SKOS): describe thesauri, taxonomies, etc.
• Open Graph Protocol: metadata for Web pages to be integrated in Facebook’s social graph; also Twitter Cards for Twitter
• DOAP (Description of a Project): describe software projects
• VoID (Vocabulary of Interlinked Datasets): describe a linked dataset
• Countless others
Schema.org: a general-purpose ontology
• General-purpose ontology: 598 types and 862 properties in version 3.5
• Intended to be used on Web pages to annotate the semantics of elements
• Used by search engines for rich search results
• Used in over 10 million sites (source: https://schema.org/)
Format: Microdata
<div class="event-wrapper" itemscope itemtype="http://schema.org/Event">
<div class="event-date" itemprop="startDate"
content="2013-09-14T21:30">Sat Sep 14</div>
<div class="event-title" itemprop="name">
Typhoon with Radiation City</div>
<div class="event-venue" itemprop="location"
itemscope itemtype="http://schema.org/Place">
<span itemprop="name">The Hi-Dive</span>
<div class="address" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">7 S. Broadway</span><br>
<span itemprop="addressLocality">Denver</span>,
<span itemprop="addressRegion">CO</span>
<span itemprop="postalCode">80209</span>
</div>
</div>
<div class="event-time">9:30 PM</div>
</div>
• itemscope creates an item and itemtype gives its type
• itemprop gives values for properties of the item
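The roles of itemscope, itemtype and itemprop can be sketched with a small extractor built on Python's standard html.parser module. This is a simplified illustration and not a conforming Microdata processor: it keeps a single value per property, reads the value from a content attribute when one is present (as the example above does) and from the element text otherwise, and ignores features such as itemref. The class name `MicrodataSketch` and the shortened HTML input are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class MicrodataSketch(HTMLParser):
    """Collects nested items as {"type": ..., "properties": {...}} dicts."""

    def __init__(self):
        super().__init__()
        self.root_items = []  # items that are not the value of a property
        self.stack = []       # one frame per currently open element

    def _current_item(self):
        # The nearest enclosing element that carries itemscope.
        for frame in reversed(self.stack):
            if frame["item"] is not None:
                return frame["item"]
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        parent = self._current_item()
        item = None
        if "itemscope" in attrs:  # itemscope creates a new item
            item = {"type": attrs.get("itemtype"), "properties": {}}
            if "itemprop" not in attrs:
                self.root_items.append(item)
        prop = attrs.get("itemprop")
        collect = None
        if prop and parent is not None:
            if item is not None:          # the value is a nested item
                parent["properties"][prop] = item
            elif "content" in attrs:      # explicit machine-readable value
                parent["properties"][prop] = attrs["content"]
            else:                         # the value is the element text
                collect = (parent, prop)
        if tag not in VOID:
            self.stack.append({"item": item, "collect": collect, "text": []})

    def handle_data(self, data):
        for frame in self.stack:
            if frame["collect"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        frame = self.stack.pop()
        if frame["collect"]:
            parent, prop = frame["collect"]
            parent["properties"][prop] = "".join(frame["text"]).strip()


html = """
<div itemscope itemtype="http://schema.org/Event">
  <div itemprop="startDate" content="2013-09-14T21:30">Sat Sep 14</div>
  <div itemprop="name">
  Typhoon with Radiation City</div>
  <div itemprop="location" itemscope itemtype="http://schema.org/Place">
    <span itemprop="name">The Hi-Dive</span>
    <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
      <span itemprop="streetAddress">7 S. Broadway</span><br>
      <span itemprop="addressLocality">Denver</span>
    </div>
  </div>
</div>
"""

parser = MicrodataSketch()
parser.feed(html)
event = parser.root_items[0]
print(event["type"])
print(event["properties"]["name"])
print(event["properties"]["location"]["properties"]["name"])
```

The nested Place and PostalAddress items end up as values of the location and address properties, which mirrors the nesting of the itemscope elements in the HTML.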
Format: RDFa
Competing format to Microdata; seems less common (see footnote 2)
<div vocab="http://schema.org/" class="event-wrapper" typeof="Event">
<div class="event-date" property="startDate"
content="2013-09-14T21:30">Sat Sep 14</div>
<div class="event-title" property="name">
Typhoon with Radiation City</div>
<div class="event-venue" property="location" typeof="Place">
<span property="name">The Hi-Dive</span>
<div class="address" property="address" typeof="PostalAddress">
<span property="streetAddress">7 S. Broadway</span><br>
<span property="addressLocality">Denver</span>,
<span property="addressRegion">CO</span>
<span property="postalCode">80209</span>
</div>
</div>
<div class="event-time">9:30 PM</div>
</div>
2 http://webdatacommons.org/structureddata/index.html#toc2
Format: JSON-LD
Alternative approach: give the structured data separately in JSON
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Event",
"location": {
"@type": "Place",
"address": {
"@type": "PostalAddress",
"addressLocality": "Denver",
"addressRegion": "CO",
"postalCode": "80209",
"streetAddress": "7 S. Broadway"
},
"name": "The Hi-Dive"
},
"name": "Typhoon with Radiation City",
"startDate": "2013-09-14T21:30"
}
</script>
• The @context attribute gives the namespace for the @type.
• No longer gives any link to the page contents
• Also @id to give a URI to a node
• Many other features (editor’s draft of the spec is 167 pages)
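Because the JSON-LD block is ordinary JSON, a consumer can read it with any JSON parser before doing anything RDF-specific. The sketch below, using only Python's standard json module, walks the event description above and lists every nested node that declares an @type. It is not a JSON-LD processor: it does not resolve @context or expand terms to full IRIs, and the function name `typed_nodes` is an illustrative choice.

```python
import json

# The JSON-LD payload from the <script> element above.
doc = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "Event",
  "location": {
    "@type": "Place",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Denver",
      "addressRegion": "CO",
      "postalCode": "80209",
      "streetAddress": "7 S. Broadway"
    },
    "name": "The Hi-Dive"
  },
  "name": "Typhoon with Radiation City",
  "startDate": "2013-09-14T21:30"
}
""")


def typed_nodes(node, path=""):
    """Yield (path, type) for every nested object that declares an @type."""
    if isinstance(node, dict):
        if "@type" in node:
            yield (path or "(root)", node["@type"])
        for key, value in node.items():
            if not key.startswith("@"):  # skip JSON-LD keywords
                child = f"{path}/{key}" if path else key
                yield from typed_nodes(value, child)
    elif isinstance(node, list):
        for value in node:
            yield from typed_nodes(value, path)


for path, node_type in typed_nodes(doc):
    print(path, "->", node_type)
```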
Web Data Commons Structured Data
• Extraction of semantic content from the Common Crawl
• Also useful to measure usage of structured data:
• In November 2017, the Common Crawl contained 66 TB (compressed), 260 TB (uncompressed), 3.2G pages
• 39% of pages (and 28% of domains) contained semantic data
• 9G entities and 38G triples
• http://webdatacommons.org/structureddata/
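As a quick sanity check on these orders of magnitude, the figures above can be combined into a few derived quantities. The sketch assumes decimal units (1 TB = 10^12 bytes, 1G = 10^9) and reads the 39% figure as applying to the 3.2G crawled pages; the variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes
G = 10**9    # "G" on the slide read as 10^9

pages = 3.2 * G
uncompressed_bytes = 260 * TB
compressed_bytes = 66 * TB
entities = 9 * G
triples = 38 * G

avg_page_kb = uncompressed_bytes / pages / 1000   # average page size in KB
compression_ratio = uncompressed_bytes / compressed_bytes
triples_per_entity = triples / entities
pages_with_data = 0.39 * pages                    # pages containing semantic data

print(f"average page size: {avg_page_kb:.0f} KB")
print(f"compression ratio: {compression_ratio:.1f}x")
print(f"triples per entity: {triples_per_entity:.1f}")
print(f"pages with semantic data: {pages_with_data / G:.2f}G")
```

Under these assumptions an average page weighs roughly 81 KB uncompressed, and each extracted entity carries about four triples.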
Knowledge bases
Common knowledge bases
• General-purpose: DBpedia, YAGO, Freebase (defunct), Wikidata
• Proprietary: Google Knowledge Graph, Bing Knowledge Graph (aka Satori)
• Domain-specific
• We will then focus on Wikidata
DBpedia
• Started in 2007
• License: CC-BY-SA
• Code license: GPLv2
• Actors: Leipzig University, University of Mannheim, OpenLink Software
• Latest release: 2016-10
• Extracted from Wikimedia projects
• 6M entities and 10G triples in 2016-04 (source: https://blog.dbpedia.org/2016/10/19/yeah-we-did-it-again-new-2016-04-dbpedia-release/)
YAGO
• Started in 2008
• License: CC-BY
• Code license: GPLv3
• Actors: Max Planck Institute for Informatics, Télécom ParisTech
• Latest release: YAGO 3.1 (2017)
• Extracted from Wikipedias and other sources; manual evaluation
• 10M entities and 120M triples (source: http://yago-knowledge.org/)
Freebase
• Started in 2007, discontinued in 2016
• License: CC-BY
• Code license: Apache2 (provided after-the-fact by Google)
• Actors: Metaweb, acquired by Google in 2010
• Initially imported from various sources
• Could be edited by anyone
• Partially imported into Wikidata (but not completely)
• Last release: 2016
• Last dump has 1.9G triples
Wikidata
• Started in 2012
• License: public domain
• Code license: GPLv2
• Actors: Wikimedia Deutschland, Wikimedia
• Last release: weekly
• Around 650M statements and 54M items
• Can be edited by anyone! Around 20k active users.
Domain-specific
• MusicBrainz, for CDs and music in general (20 million recordings)
• British National Bibliography: bibliographic details about books published in the UK since 1950
• data.bnf.fr, data from the French national library
• OpenStreetMap and GeoNames
• Medicine and chemistry with SNOMED CT, and other databases:
DrugBank, KEGG, UniProt, ChEMBL, etc.
• Linguistic resources, e.g., Babelnet
• Bibliography, e.g., DBLP, Crossref
Linked Open Data
[Figure: the Linked Open Data cloud diagram, a graph of interlinked datasets colored by domain. Legend: Cross Domain, Geography, Government, Life Sciences, Linguistics, Media, Publications, Social Networking, User Generated]
The Linked Open Data Cloud from lod-cloud.net