Technologies du Web Master COMASIC
Information Extraction and the Semantic Web
Antoine Amarilli 1
December 2, 2014
1 Course material adapted from Fabian Suchanek’s slides:
http://suchanek.name/work/teaching/IESW2010.pdf.
SPARQL example from:
Motivation
We have seen how search engines work at the level of words.
Sometimes, this works...
Sometimes, it doesn’t:
Those hard queries would be easy on RDBMSes!
→ We need to extract structured information.
→ We would like to understand its semantics.
Other motivations: job offerings Motivation: Examples
Title Type Location
Business strategy Associate Part time Palo Alto, CA
Registered Nurse Full time Los Angeles
Other motivations: scientific papers
Motivation: Examples
Author Publication Year Grishman Information Extraction... 2006
... ... ... 9
Other motivations: price comparison
Motivation: Examples
Product Type Price
Dynex 32” LCD TV $1000
Table of contents
1 Introduction
2 Information Extraction
3 Semantic Web
Roadmap Information Extraction
Instance Extraction
Fact Extraction
Ontological Information Extraction
and beyond Information Extraction (IE) is the process
of extracting structured information (e.g., database tables) from unstructured machine-readable documents
(e.g., Web documents).
Person Nationality Elvis Presley American
nationality
Elvis Presley singer
Angela Merkel politician
Instance extraction with Hearst patterns
Entities can be extracted and categorized by automatic extraction of simple patterns:
Many scientists, including Einstein, believed...
France, Germany and other countries have been plagued with...
Other forms of government such as constitutional monarchy...
Difficulties:
→ Must parse correctly.
→ Must be resilient to noise.
Instance extraction with set expansion
Start with a seed set of entities of a certain type.
Find occurrences of them at specific position in documents:
Lists.
Table columns.
Assume that other items are other entities of the same nature.
→ Once again, this is noisy...
→ Precision and recall, see previous slides.
Set expansion example Instance Extraction: Set Expansion
Seed set: {Russia, USA, Australia}
Result set: {Russia, Canada, China, USA, Brazil, Australia, India, Argentina,Kazakhstan, Sudan}
17
Fact extraction with wrapper induction Fact Extraction: Wrapper Induction
Observation: On Web pages of a certain domain, the information is
often in the same spot.
Specifying a wrapper
A wrapper can be expressed:
As a path in the DOM (usually XPath).
Extensions to multiple pages, e.g., OXPath.
As a regular expression.
A wrapper can be produced:
Through manual annotation of the relevant fields.
Using specific knowledge of the source.
→ Wikipedia categories and infoboxes.
By comparison between similar pages to find what changed.
Using seed pairs (known facts).
Possibility to iterate between patterns and facts.
→ Risk of semantic drift.
Fact extraction on text
Entity extraction:
→ find entities in the text.
Named entity recognition:
→ identify the type of entities.
→ person
→ organization
→ quantity
→ address
→ etc.
NLP patterns to extract facts
→ POS patterns
→ Parse trees.
Fact extraction on text Fact Extraction: Pattern Matching
Einstein ha scoperto il K68, quando aveva 4 anni.
Bohr ha scoperto il K69 nel anno 1960.
Person Discovery
Einstein K68
X ha scoperto il Y
Person Discovery
Bohr K69
The patterns can be more complex, e.g.
• regular expressions
X discovered the .{0,20} Y
• POS patterns
X discovered the ADJ? Y
• Parse trees
X discovered Y PN
NP S VP
V PN
NP
34
Try
Ontologies
Ontology: a set of entities and relations.
Name Ment. Mfact Domain Publisher Start Update
YAGO >10 >120 general MPI 2008 2012
DBPedia 2 4.6 3 general OpenLink et al 2007 2014
Wikidata 3 15 30 general Wikimedia 2012 2014
Freebase 4 46 2 680 general Metaweb (Google) 2007 2014 Knowl. Vault 5 >570? >1 600 general Google 2014 2014 MusicBrainz 6 >35 >180 music MetaBrainz 2003 2013
WordNet 0.4 2 English Princeton 1985 2006
ConceptNet 7 5.3 13 obvious MIT 2000 2014
2 http://blog.dbpedia.org/2014/09/09/dbpedia-version-2014-released/
3 http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext#body-9 4 http://freebase.com/
5 http://www.newscientist.com/article/mg22329832.
Introduction Information Extraction Semantic Web
Entity disambiguation
Map mentions in text to entities.
Problem: mentions are ambiguous!
→ Use the importance of entities.
→ Use the likelihood that a term refers to an entity.
→ Use semantic consistency on the mappings of a document.
Entity disambiguation
Map mentions in text to entities.
Problem: mentions are ambiguous!
→ Use the importance of entities.
→ Use the likelihood that a term refers to an entity.
→ Use semantic consistency on the mappings of a document.
(Demo: https://gate.d5.mpi-inf.mpg.de/webaida/)
Ontological IE
Use the existing ontology as reference.
Extract information from additional documents.
Use fuzzy rules to extend the ontology (often manual):
→ Extraction rules
→ Logical constraints
→ Common sense Reasoning:
→ Datalog
→ Weighted MAX-SAT
→ Markov Logic Networks
Ontological IE
Ontology
NL documents
Elvis was born in 1935
Semantic Rules
birthdate<deathdate
type(Elvis_Presley,singer) subclassof(singer,person) ...
appears(“Elvis”,”was born in”, ”1935”)
...
means(“Elvis”,Elvis_Presley,0.8) means(“Elvis”,Elvis_Costello,0.2) ...
born(X,Y) & died(X,Z) => Y<Z appears(A,P,B) & R(A,B) => expresses(P,R) appears(A,P,B) & expresses(P,R) => R(A,B)
...
First Order Logic Formulae
1935 born New facts with 1. canonic relations 2. canonic entities 3. consistency with the ontology
Ontological IE: Reasoning
SOFIE
system
Introduction Information Extraction Semantic Web
Open IE
Use the entire Web as corpus.
Crawl the Web for new facts.
Create new rules from what is extracted.
Examples:
Open IE, University of Washington.
Read the Web, CMU.
→ Demo: http://rtw.ml.cmu.edu/rtw/
Introduction Information Extraction Semantic Web
Open IE
Use the entire Web as corpus.
Crawl the Web for new facts.
Create new rules from what is extracted.
Examples:
Open IE, University of Washington.
→ Demo: http://openie.cs.washington.edu/
→ Demo: http://rtw.ml.cmu.edu/rtw/
Open IE
Use the entire Web as corpus.
Crawl the Web for new facts.
Create new rules from what is extracted.
Examples:
Open IE, University of Washington.
→ Demo: http://openie.cs.washington.edu/
Read the Web, CMU.
→ Demo: http://rtw.ml.cmu.edu/rtw/
Table of contents
1 Introduction
2 Information Extraction
3 Semantic Web
Motivation
Having structured data is nice.
However, independent sources are not useful.
We need to create links between data sources.
→ Run a query across multiple relevant data stores.
→ Perform complex transactions (booking a flight, a hotel...).
→ Rich data visualization (integrating e.g. maps and statistics).
We need to define the semantics of the data.
We need to enforce constraints.
We need to evaluate complex queries over multiple sources.
The Semantic Web
URIs Globally unique identification of entities and relations.
OWL Constraint language over structured data.
RDF Storage format for structured data.
SPARQL Query language for structured data.
LOD Linked Open Data: draw links between data sources.
URIs
Uniform Resource Identifier Like URLs.
Not always dereferenceable.
URNs: urn:isbn:0486415864 URLs, often with namespaces:
→ dbp:Paris for http://dbpedia.org/resource/Paris
RDF
Resource Description Framework.
Triples: Subject predicate object.
<dbp:Paris> <dbp:country> <dbp:France>
→ The country of Paris (DBPedia resources).
<dbp:Paris> <foaf:homepage> <http://www.paris.fr>
→ The homepage of Paris (FOAF relation, website).
<dbp:Paris> <foaf:name> "Paris"@en
→ The name of Paris (FOAF relation, literal value).
Multiple serializations.
RDFS
RDF Schema.
<dbp:Paris> <rdf:type> <dbp:Settlement>
→ Paris is a settlement.
<dbp:Settlement> <rdfs:subclassOf> <dbp:Place>
→ If I am a Settlement then I am a Place.
<dbp:writer> <rdfs:subPropertyOf> <y:created>
→ If you are the writer of something (for DBPedia)
then you are the creator of that thing (for YAGO).
OWL
Ontology Web Language.
<dbp:birthPlace> <rdf:type> <owl:FunctionalProperty>
→ People are born in at most one place.
<dbp:Person> <owl:disjointWith> <dbp:Settlement>
→ Something cannot be both a Person and a Settlement.
<schema:spouse> <owl:equivalentProperty> <dbp:spouse>
→ spouse in Schema.org in DBPedia are equivalent properties.
<myonto:p4242> <owl:sameAs> <dbp:Douglas_Adams>
→ Assert equalities between resources.
Introduction Information Extraction Semantic Web
SPARQL
SPARQL Protocol And RDF Query Language.
Query language for RDF.
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country WHERE {
?x abc:cityname ?capital ; abc:isCapitalOf ?y .
?y abc:countryname ?country ;
abc:isInContinent abc:Africa .
}
SPARQL
SPARQL Protocol And RDF Query Language.
Query language for RDF.
PREFIX abc: <http://example.com/exampleOntology#>
SELECT ?capital ?country WHERE {
?x abc:cityname ?capital ; abc:isCapitalOf ?y .
?y abc:countryname ?country ; abc:isInContinent abc:Africa . }
(Demo: http://dbpedia-live.openlinksw.com/sparql/)
Linked Open Data
Linked Datasets as of August 2014 Uniprot
Alexandria Digital Library Gazetteer
lobid Organizations
chem2 bio2rdf
Multimedia Lab UniversityGhent
Open Data Ecuador Geo Ecuador
Serendipity UTPL LOD
GovAgriBus Denmark
DBpedia
live URI
Burner
Linguistics Social Networking Life Sciences Cross-Domain
Government
User-Generated Content Publications
Geographic
Media Identifiers
Eionet RDF
lobid Resources Wiktionary
DBpedia
Viaf Umthes
RKB Explorer Courseware
Opencyc
Olia Gem.
Thesaurus Audiovisuele Archieven
Diseasome FU-Berlin Eurovoc
in SKOS
DNB GND Cornetto
Bio2RDF Pubmed Bio2RDF
NDC Bio2RDF
Mesh IDS
Ontos News Portal AEMET
ineverycrea Linked User Feedback
Museos Espania GNOSS
Europeana
Nomenclator Asturias
Red Uno Internacional GNOSS
Geo Wordnet
Bio2RDF HGNC Ctic
Public Dataset
Bio2RDF Homologene Bio2RDF Affymetrix
Muninn World War I
CKAN
Government Web IntegrationLinkedfor Data
Universidad de Cuenca Linkeddata
Freebase
Linklion
Ariadne
Organic Edunet
Gene Expression Atlas RDF
Chembl RDF
Biosamples RDF Identifiers
Org Biomodels
RDF
Reactome RDF Disgenet
Semantic Quran
IATI as Linked Data
Dutch Ships and Sailors Verrijktkoninkrijk
IServe Arago-
dbpedia
Linked TCGA ABS
270a.info
RDF License EnvironmentalApplications
Reference Thesaurus
Thist
JudaicaLink
BPR
OCD
Shoah Victims Names Reload
Data for Tourists in Castilla y Leon
2001 Spanish Census to RDF
RKB Explorer Webscience
RKB Explorer Eprints Harvest
NVS EU Agencies
Bodies EPO LinkedNUTS
RKB Explorer Epsrc
Open Mobile Network
RKB Explorer Lisbon
RKB Explorer Italy CE4R
EnvironmentAgency Bathing WaterQuality
RKB Explorer Kaunas
Open Data Thesaurus RKB Explorer Wordnet
RKB Explorer ECS Austrian
Ski Racers Social-
semweb Thesaurus
Data Open Ac Uk
RKB Explorer IEEE
ExplorerRKB LAAS
RKB Explorer Wiki RKB Explorer JISC
RKB Explorer Eprints
RKB Explorer Pisa
RKB Explorer Darmstadt RKB Explorer unlocode
RKB Explorer Newcastle
RKB Explorer OS
RKB Explorer Curriculum
RKB Explorer Resex
RKB Explorer Roma
RKB Explorer Eurecom RKB Explorer IBM RKB
Explorer NSF
RKB Explorerkisti RKB Explorer DBLP RKB ExplorerACM
RKB Explorer Citeseer
RKB Explorer Southampton RKB Explorer Deepblue
RKB Explorer Deploy RKB Explorer Risks RKB ExplorerERA
RKB Explorer OAI
RKB ExplorerFT RKB Explorer Ulm
RKB Explorer Irit
RKB Explorer RAE2001
RKB Explorer Dotac
RKB Explorer Budapest
Swedish Open Cultural Heritage Radatana Courts
Thesaurus
German Labor Law Thesaurus
GovUK TransportData
GovUK Education Data
Enakting Mortality Enakting
Energy
Enakting Crime
Enakting Population Enakting CO2Emission EnaktingNHS
RKB Explorer Crime
RKB Explorer cordis
Govtrack
Geological Survey of Austria Thesaurus
Geo Linked Data
Gesis Thesoz
Bio2RDF Pharmgkb
Bio2RDF Sabiork Bio2RDF
Ncbigene Bio2RDF Irefindex
Bio2RDF Iproclass
Bio2RDF GOA Bio2RDF Drugbank
Bio2RDF CTD
Bio2RDF Biomodels Bio2RDF DBSNP
Bio2RDF Clinicaltrials
Bio2RDF LSR Bio2RDF Orphanet
Bio2RDF Wormbase BIS
270a.info
DM2E
DBpedia PT DBpedia
ES
DBpedia CS DBnary
Alpino RDF YAGO
Pdev Lemon Lemonuby
Isocat Ietflang
Core
KUPKB
Getty AAT Semantic Web Journal OpenlinkSW
Dataspaces MyOpenlink Dataspaces Jugem
Typepad
Aspire Harper Adams
NBN Resolving Worldcat
Bio2RDF
Bio2RDF ECO Taxon- concept Assets Indymedia
GovUK Societal Wellbeing Deprivation imdEmployment Rank La 2010
GNU Licenses Greek Wordnet
DBpedia
CIPFA
Yso.fi Allars
Glottolog
StatusNet Bonifaz StatusNet
shnoulle
Revyu
StatusNet Kathryl Charging
Stations
Aspire UCL
Tekord Didactalia
Artenue Vosmedios GNOSS
Linked Crunchbase
ESD Standards
VIVO University of Florida
Bio2RDF SGD Resources
Product Ontology
Datos Bne.es
StatusNet Mrblog
Bio2RDF Dataset EUNIS
GovUK Housing Market
LCSH GovUK
TransparencyImpact ind.
HouseholdsIn temp.Accom.
Uniprot KB StatusNet
Timttmy
SemanticWeb Grundlagen
GovUK Input ind.
Local Authority Funding From GovernmentGrant
StatusNet Fcestrada
JITA
StatusNet Somsants
StatusNet Ilikefreedom
Drugbank FU-Berlin Semanlink
StatusNet Dtdns
StatusNet Status.net
DCS Sheffield Athelia
RFID
StatusNet Tekk
Lista Encabeza Mientos Materia
StatusNet Fragdev
Morelab DBTune
John Peel Sessions
RDFize last.fm
Open Data Euskadi
GovUK Transparency Input ind.
Local auth.
Funding f.
Gvmnt. Grant
MSC Lexinfo
StatusNet Equestriarp
Asn.us GovUK
Societal Wellbeing Deprivation Imd Health Rank la2010
StatusNet Macno Oceandrilling
Borehole
Aspire Qmul
GovUK Impact Indicators Planning Applications Granted
Loius
Datahub.io
StatusNet Maymay
Prospects and Trends GNOSS
GovUK Transparency Impact Indicators Energy Efficiencynew Builds
DBpedia EU
Bio2RDF Taxon
StatusNet Tschlotfeldt Jamendo DBTune
Aspire NTU
GovUK Societal Wellbeing Deprivation Imd Health Score2010
Lotico
GNOSS
Uniprot Metadata Linked
Eurostat
Aspire Sussex Lexvo
Linked Geo Data
StatusNet Spip
SORS
GovUK Homeless-ness Accept. per 1000
TWC IEEEvis
Aspire Brunel
PlanetData Project Wiki
StatusNet Freelish Statistics
data.gov.uk
StatusNet Mulestable
Enipedia UK
Legislation API
Linked MDB
StatusNet Qth
Sider FU-Berlin
DBpedia DE GovUK
Households Social lettings General NeedsLettings Prp Number Bedrooms
Agrovoc Skos My Experiment Proyecto
Apadrina GovUK
Imd Crime Rank 2010
SISVU GovUK
Societal Wellbeing Deprivation Imd Housing Rank la2010
StatusNet Uni Siegen Opendata
Scotland SimdEducation Rank
StatusNet Kaimi GovUK
Households Accommodatedper 1000
StatusNet Planetlibre
DBpedia EL
Sztaki LOD
DBpedia Lite
Drug Interaction Knowledge Base StatusNet
Qdnx
Amsterdam Museum AS EDN LOD
RDF Ohloh DBTune
artists last.fm
Aspire Uclan Hellenic
Fire Brigade
Bibsonomy
Nottingham Trent ResourceLists Opendata
Scotland Simd Income Rank
Randomness Guide London Opendata Scotland Simd Health Rank
SouthamptonECS Eprints
FRB 270a.info
StatusNet Sebseb01 StatusNet
Bka ESD
Toolkit
Hellenic Police
StatusNet Ced117 Open
Energy Info Wiki
StatusNet Lydiastench Open Data RISP
Taxon- concept Occurences
Bio2RDF SGD UIS
270a.info
NYTimes Linked Open Data
Aspire Keele
GovUK Households Projections Population
W3C Opendata
Scotland Simd HousingRank
ZDB
StatusNet 1w6
StatusNet Alexandre Franke
Dewey Decimal Classification
StatusNet Status StatusNet
doomicile Currency
Designators
StatusNet Hiico
Linked Edgar
GovUK Households 2008
DOI
StatusNet Pandaid Brazilian
Politicians
NHS Jargon
Theses.fr
Linked Life Data Semantic WebDogFood
UMBEL Openly
Local
StatusNet Ssweeny
Linked Food Interactive
Maps GNOSS
OECD 270a.info
Sudoc.fr Green
Competitive-ness GNOSS
StatusNet Integralblue
WOLD
Linked Stock Index
Apache
KDATA Linked
Open Piracy GovUK
Societal Wellbeing Deprv. Imd Empl. RankLa 2010
BBC Music
StatusNet Quitter
StatusNet Scoffoni Open
ElectionData Project
Reference data.gov.uk
StatusNet Jonkman
Project Gutenberg FU-Berlin DBTropes
StatusNet Spraci
Libris
ECB 270a.info
StatusNet Thelovebug Icane
Greek Administrative Geography
Bio2RDF OMIM StatusNet
Orangeseeds
National Diet Library WEB NDL Authorities
Uniprot Taxonomy DBpedia
NL
L3S DBLP FAO Geopolitical Ontology
GovUK Impact Indicators Housing Starts
Deutsche Biographie
StatusNet ldnfai StatusNet
Keuser
StatusNet Russwurm GovUK SocietalWellbeing
Deprivation Imd Crime Rank 2010
GovUK Imd Income Rank La 2010
StatusNet Datenfahrt StatusNet Imirhil
Southampton ac.uk LOD2 Project Wiki
DBpedia KO
Dailymed FU-Berlin WALS
DBpedia IT
StatusNet Recit
Livejournal StatusNet
Exdc Elviajero
Aves3D Open
Calais Zaragoza
Turruta
Aspire Manchester Wordnet
(VU)
GovUK Transparency Impact Indicators NeighbourhoodPlans
StatusNet David Haberthuer
B3Kat Pub Bielefeld Prefix.cc
NALT Vulnera-
pedia
GovUK Impact Indicators Affordable Housing Starts
GovUK Wellbeing lsoaHappy YesterdayMean
Flickr Wrappr
Yso.fi YSA
Open Library
Aspire Plymouth
StatusNet Johndrink Water StatusNet Gomertronic
Tags2con Delicious
StatusNet tl1n
StatusNet Progval
Testee
World Factbook FU-Berlin
DBpedia JA
StatusNet Cooleysekula
Product DB IMF
270a.info
StatusNet Postblue
StatusNet Skilledtests Nextweb GNOSS
Eurostat FU-Berlin GovUK Households Social Lettings General Needs Lettings PrpHousehold Composition
StatusNet Fcac
DWS Group OpendataScotland
Graph Simd Rank
DNB
Clean Energy Data Reegle
Opendata Scotland Simd Employment Rank
Chronicling America
GovUK Societal Wellbeing Deprivation Imd Rank 2010
StatusNet Belfalas
Aspire MMU
StatusNet Legadolibre
Bluk BNB
StatusNet Lebsanft
GADM Geovocab
GovUK Imd Score2010
Semantic XBRL
UK Postcodes
Geo Names
EEARod
Aspire Roehampton BFS
270a.info
Camera Deputati Linked Data
Bio2RDF GeneID GovUK
Transparency Impact IndicatorsApplicationsPlanning Granted
StatusNet Sweetie Belle
O'Reilly
GNI City
Lichfield
GovUK Imd Rank 2010
Bible Ontology Idref.fr
StatusNet Atari Frosch
Dev8d
Nobel Prizes StatusNet
Soucy
Archiveshub Linked Data
Linked Railway Data Project FAO 270a.info GovUK
Wellbeing Worthwhile Mean
Bibbase Semantic-
web.org
British Museum Collection
GovUK Dev Local Authority Services
Code Haus
Lingvoj
OrdnanceSurvey Linked Data
Wordpress
Eurostat RDF
StatusNet Kenzoid GEMET
GovUK Societal Wellbeing Deprv. imd Score '10
Mis Museos GNOSS
GovUK Households Projectionstotal Houseolds
StatusNet 20100 EEA
Ciard Ring
Opendata Scotland GraphEducationPupils bySchool and
Datazone
VIVO Indiana University Pokepedia
Transparency 270a.info
StatusNet Glou GovUK
Homelessness Households AccommodatedTemporary Housing Types
STW Thesaurus for Economics Debian
Package Tracking System
DBTune Magnatune
NUTS Geo- vocab GovUK
Societal Wellbeing Deprivation Imd Income Rank La2010
BBC Wildlife Finder
StatusNet Mystatus
Miguiad Eviajes GNOSS Acorn
Sat
Data Bnf.fr GovUK
imd env.
rank 2010
StatusNet Opensimchat
Open Food Facts GovUK
Societal Wellbeing Deprivation Imd Education Rank La2010
LOD ACBDLS FOAF-
Profiles
StatusNet Samnoble GovUK
Transparency Impact IndicatorsAffordableHousing Starts
StatusNet Coreyavis Enel Shops
DBpedia FR
StatusNet Rainbowdash
StatusNet Mamalibre
Princeton Library Findingaids
WWW Foundation
Bio2RDFOMIM Resources Opendata
Scotland Simd Geographic Access Rank
Gutenberg
StatusNet Otbm ODCL
SOA
StatusNet Ourcoffs
Colinda Web
Nmasuno Traveler
StatusNet Hackerposse
LOV Garnica Plywood
GovUK wellb. happy yesterday std. dev.
StatusNet Ludost
BBC Program- mes GovUK
Societal Wellbeing Deprivation ImdEnvironmentRank 2010
Bio2RDF Taxonomy Worldbank
270a.info
OSM
DBTune Music- brainz
Linked Mark Mail
StatusNet Deuxpi GovUK
TransparencyIndicatorsImpact Housing Starts
Bizkai Sense GovUK impact indicators energyefficiency new builds
StatusNet Morphtown GovUK
Transparency Input indicators Local authoritiesWorking w. tr.Families
ISO 639 Oasis
Aspire Portsmouth Zaragoza
Datos Abiertos Opendata ScotlandSimd Crime Rank
Berlios
StatusNet piana GovUK
Net Add.
Dwellings
Bootsnall
StatusNet chromic
Geospecies
linkedct Wordnet (W3C)
StatusNet thornton2 StatusNet
mkuttner
StatusNet linuxwrangling Eurostat
Linked Data
GovUK societal wellbeing deprv. imdrank '07 GovUK societal wellbeing deprv. imd rank la '10
Linked Open Data of Ecology
StatusNet chickenkiller
StatusNet gegeweb
Deusto Tech
StatusNet schiessle GovUK
transparencyimpact indicators tr. families
Taxon concept
GovUK service expenditure GovUK societal wellbeing deprivation imdemploymentscore 2010