• Aucun résultat trouvé

October10,2013 AntoineAmarilli TechnologiesduWebMasterCOMASICInformationExtractionandtheSemanticWeb .......

N/A
N/A
Protected

Academic year: 2022

Partager "October10,2013 AntoineAmarilli TechnologiesduWebMasterCOMASICInformationExtractionandtheSemanticWeb ......."

Copied!
35
0
0

Texte intégral

(1)

.

...

Technologies du Web Master COMASIC

Information Extraction and the Semantic Web

Antoine Amarilli 1

October 10, 2013

1 Course material adapted from Fabian Suchanek’s slides:

http://suchanek.name/work/teaching/IESW2010.pdf.

SPARQL example from:

en.wikipedia.org/w/index.php?title=SPARQL&oldid=575552762.

(2)

Motivation

We have seen how search engines work at the level of words.

Sometimes, this works...

Sometimes, it doesn’t:

Those hard queries would be easy on RDBMSes!

We need to extract structured information.

We would like to understand its semantics.

(3)

Other motivations: job offerings Motivation: Examples

Title Type Location

Business strategy Associate Part time Palo Alto, CA

Registered Nurse Full time Los Angeles

(4)

Other motivations: scientific papers

Motivation: Examples

Author Publication Year Grishman Information Extraction... 2006

... ... ... 9

(5)

Other motivations: price comparison

Motivation: Examples

Product Type Price

Dynex 32” LCD TV $1000

(6)

Table of contents

1 ... Introduction

2 ... Information Extraction

3 ... Semantic Web

(7)

Roadmap Information Extraction

Instance Extraction

Fact Extraction

Ontological Information Extraction

and beyond Information Extraction (IE) is the process

of extracting structured information (e.g., database tables) from unstructured machine-readable documents

(e.g., Web documents).

Person Nationality Elvis Presley American

nationality

Elvis Presley singer

Angela Merkel politician

(8)

Instance extraction with Hearst patterns

Entities can be extracted and categorized by automatic extraction of simple patterns:

Many scientists, including Einstein, believed...

France, Germany and other countries have been plagued with...

Other forms of government such as constitutional monarchy...

Difficulties:

Must parse correctly.

Must be resilient to noise.

(9)

Instance extraction with set expansion

Start with a seed set of entities of a certain type.

Find occurrences of them at specific position in documents:

Lists.

Table columns.

Assume that other items are other entities of the same nature.

Once again, this is noisy...

Precision and recall, see previous slides.

(10)

Set expansion example Instance Extraction: Set Expansion

Seed set: {Russia, USA, Australia}

Result set: {Russia, Canada, China, USA, Brazil, Australia, India, Argentina,Kazakhstan, Sudan}

17

(11)

Fact extraction with wrapper induction Fact Extraction: Wrapper Induction

Observation: On Web pages of a certain domain, the information is

often in the same spot.

(12)

Specifying a wrapper

A wrapper can be expressed:

As a path in the DOM (usually XPath).

Extensions to multiple pages, e.g., OXPath.

As a regular expression.

A wrapper can be produced:

Through manual annotation of the relevant fields.

Using specific knowledge of the source.

Wikipedia categories and infoboxes.

By comparison between similar pages to find what changed.

Using seed pairs (known facts).

Possibility to iterate between patterns and facts.

Risk of semantic drift.

(13)

Fact extraction on text

Entity extraction:

find entities in the text.

Named entity recognition:

identify the type of entities.

person

organization

quantity

address

etc.

NLP patterns to extract facts

POS patterns

Parse trees.

(14)

Fact extraction on text Fact Extraction: Pattern Matching

Einstein ha scoperto il K68, quando aveva 4 anni.

Bohr ha scoperto il K69 nel anno 1960.

Person Discovery

Einstein K68

X ha scoperto il Y

Person Discovery

Bohr K69

The patterns can be more complex, e.g.

•  regular expressions

X discovered the .{0,20} Y

•  POS patterns

X discovered the ADJ? Y

•  Parse trees

X discovered Y PN

NP S VP

V PN

NP

34

Try

(15)

Ontologies

Ontology: a set of entities and relations.

Name Ment. Mfact Domain Publisher Start Update

YAGO >10 >120 general MPI 2008 2012

DBPedia 4 2 460 general Openlink etc. 2007 2013

Wikidata 14 121 general Wikimedia 2012 2013

Freebase 40 1 200 general Metaweb (Google) 2007 2013 Google KG 570 18 000 general Google 2012 N/A MusicBrainz 35 180 music MetaBrainz 2003 2013

WordNet 0.4 2 English Princeton 1985 2006

ConceptNet 3 4 obvious MIT 2000 2013

(16)

Introduction

. . . . Information Extraction

. . . . Semantic Web

Entity disambiguation

Map mentions in text to entities.

Problem: mentions are ambiguous!

Use the importance of entities.

Use the likelihood that a term refers to an entity.

Use semantic consistency on the mappings of a document.

(17)

Entity disambiguation

Map mentions in text to entities.

Problem: mentions are ambiguous!

Use the importance of entities.

Use the likelihood that a term refers to an entity.

Use semantic consistency on the mappings of a document.

(Demo: https://d5gate.ag5.mpi-sb.mpg.de/webaida/)

(18)

Ontological IE

Use the existing ontology as reference.

Extract information from additional documents.

Use fuzzy rules to extend the ontology (often manual):

Extraction rules

Logical constraints

Common sense Reasoning:

Datalog

Weighted MAX-SAT

Markov Logic Networks

(19)

Ontological IE

Ontology

NL documents

Elvis was born in 1935

Semantic Rules

birthdate<deathdate

type(Elvis_Presley,singer) subclassof(singer,person) ...

appears(“Elvis”,”was born in”, ”1935”)

...

means(“Elvis”,Elvis_Presley,0.8) means(“Elvis”,Elvis_Costello,0.2) ...

born(X,Y) & died(X,Z) => Y<Z appears(A,P,B) & R(A,B) => expresses(P,R) appears(A,P,B) & expresses(P,R) => R(A,B)

...

First Order Logic Formulae

1935 born New facts with 1. canonic relations 2. canonic entities 3. consistency with the ontology

Ontological IE: Reasoning

SOFIE

system

(20)

Introduction

. . . . Information Extraction

. . . . Semantic Web

Open IE

Use the entire Web as corpus.

Crawl the Web for new facts.

Create new rules from what is extracted.

Examples:

Open IE, University of Washington.

Read the Web, CMU.

Demo: http://rtw.ml.cmu.edu/rtw/

(21)

Introduction

. . . . Information Extraction

. . . . Semantic Web

Open IE

Use the entire Web as corpus.

Crawl the Web for new facts.

Create new rules from what is extracted.

Examples:

Open IE, University of Washington.

Demo: http://openie.cs.washington.edu/

Demo: http://rtw.ml.cmu.edu/rtw/

(22)

Open IE

Use the entire Web as corpus.

Crawl the Web for new facts.

Create new rules from what is extracted.

Examples:

Open IE, University of Washington.

Demo: http://openie.cs.washington.edu/

Read the Web, CMU.

Demo: http://rtw.ml.cmu.edu/rtw/

(23)

Table of contents

1 ... Introduction

2 ... Information Extraction

3 ... Semantic Web

(24)

Motivation

Having structured data is nice.

However, independent sources are not useful.

We need to create links between data sources.

Run a query across multiple relevant data stores.

Perform complex transactions (booking a flight, a hotel...).

Rich data visualization (integrating e.g. maps and statistics).

We need to define the semantics of the data.

We need to enforce constraints.

We need to evaluate complex queries over multiple sources.

(25)

The Semantic Web

URIs Globally unique identification of entities and relations.

OWL Constraint language over structured data.

RDF Storage format for structured data.

SPARQL Query language for structured data.

LOD Linked Open Data: draw links between data sources.

(26)

URIs

Uniform Resource Identifier Like URLs.

Not always dereferenceable.

URNs: urn:isbn:0486415864 URLs, often with namespaces:

dbp:Paris for http://dbpedia.org/resource/Paris

(27)

RDF

Resource Description Framework.

Triples: Subject predicate object.

<dbp:Paris> <dbp:country> <dbp:France> .

The country of Paris (DBPedia resources).

<dbp:Paris> <foaf:homepage>

<http://www.paris.fr/> .

The homepage of Paris (FOAF relation, website).

<dbp:Paris> <foaf:name> "Paris"@en .

The name of Paris (FOAF relation, literal value).

Multiple serializations.

(28)

RDFS

RDF Schema.

<dbp:Paris> <rdf:type> <dbp:Settlement>

Paris is a settlement.

<dbp:Settlement> <rdfs:subclassOf> <dbp:Place>

If I am a Settlement then I am a Place.

<dbp:writer> <rdfs:subPropertyOf> <y:created>

If you are the writer of something (for DBPedia)

then you are the creator of that thing (for YAGO).

(29)

OWL

Ontology Web Language.

<dbp:birthPlace> <rdf:type> <owl:FunctionalProperty>

People are born in at most one place.

<dbp:Person> <owl:disjointWith> <dbp:Settlement>

Something cannot be both a Person and a Settlement.

<schema:spouse> <owl:equivalentProperty> <dbp:spouse>

spouse in Schema.org in DBPedia are equivalent properties.

<myonto:p4242> <owl:sameAs> <dbp:Douglas_Adams>

Assert equalities between resources.

(30)

Introduction

. . . . Information Extraction

. . . . Semantic Web

SPARQL

SPARQL Protocol And RDF Query Language.

Query language for RDF.

PREFIX abc: <http://example.com/exampleOntology#>

SELECT ?capital ?country WHERE {

?x abc:cityname ?capital ; abc:isCapitalOf ?y .

?y abc:countryname ?country ;

abc:isInContinent abc:Africa .

}

(31)

SPARQL

SPARQL Protocol And RDF Query Language.

Query language for RDF.

PREFIX abc: <http://example.com/exampleOntology#>

SELECT ?capital ?country WHERE {

?x abc:cityname ?capital ; abc:isCapitalOf ?y .

?y abc:countryname ?country ; abc:isInContinent abc:Africa . }

(Demo: http://dbpedia-live.openlinksw.com/sparql/)

(32)

Linked Open Data

World Fact- book John (DBTune)Peel

Pokedex

Pfam

US SEC (rdfabout)

Linked LCCN

Europeana EEA

IEEE

ChEMBL Semantic

XBRL

SW FoodDog

CORDIS (FUB)

AGROVOC Openly

Local

Discogs (Data Incubator)

DBpedia yovisto

Tele- graphis

tags2con delicious

NSF

MediCare Brazilian

Poli- ticians

dotAC ERA

Open Cyc

Italian public schools

UB Mann- heim

JISC MoseleyFolk

Semantic Tweet

OS GTAA

totl.net

OAI Portu-

guese DBpedia

LOCAH

KEGG Glycan CORDIS

(RKB Explorer)

UMBEL

Affy- metrix riese

business.

data.gov.uk

OpenData Thesau-rus Geo

Linked Data UK Post-

codes

Smart Link

ECCO- TCP UniProt

(Bio2RDF) SSW Thesau- rus RDF

ohloh

Freebase

London Gazette

Open Corpo- rates Airports

GEMET

P20

TCM GeneDIT

Source Code Ecosystem Linked Data

OMIM Hellenic

FBD

Data Gov.ie

Music Brainz (DBTune)

data.gov.uk intervals

LODE Climbing

SIDER Project Guten- berg Music

Brainz (zitgist)

ProDom HGNC

SMC Journals

Reactome National

Radio- activity JP legislation

data.gov.uk

AEMET

Product Types Ontology

Linked FeedbackUser

Revyu

Gene Ontology NHS

(En- AKTing)

URI Burner DB

Tropes

Eurécom ISTAT

Immi- gration

LichfieldSpen- ding

Surge Radio

Euro-stat (FUB)

Piedmont Accomo- dations

New York Times

Klapp- stuhl-club

EUNIS

Bricklink

reegle

CO2 Emission AKTing)(En-

Audio Scrobbler (DBTune)

GovTrack GovWILD

ECS South- ampton EPrints

KEGG Reaction Linked

EDGAR (Ontology Central)

LIBRIS Open

Library

KEGG Drug research.

data.gov.uk

VIVO Cornell

UniRef WordNet(RKB

Explorer) Cornetto

medu- cator

DDC DeutscheBio-

graphie

Wiki Ulm

NASA (Data Incu-

bator) BBC Music

Drug Bank

Turismode Zaragoza

Plymouth ReadingLists

education.

data.gov.uk

KISTI

Uni Pathway Eurostat

(OntologyCentral)

OGOLOD Twarql

Music Brainz (Data Incubator)

Geo Names

Pub Chem

Italian Museums

Good-win Family flickr

wrappr

Eurostat

Thesau- rus W Open Library (Talis)

LOIUS

Linked GeoData

Linked Open Colors WordNet

(VUA) patents.

data.gov.

uk

Greek DBpedia

Sussex Reading Lists

Metoffice Weather Forecasts

GND

LinkedCT

SISVU transport.

data.gov.

uk

Didac- talia

dbpedia lite

BNB Ontos News Portal

LAAS

Product DB

iServe Recht-

spraak.

nl

KEGG Com- pound Geo

Species

VIVO UF Linked

Sensor Data (Kno.e.sis)

lobid Organi- sations LEM Linked

Crunch- base

FTS

Ocean Drilling Codices

Janus AMP

ntnusc

Weather Stations

Amster- dam Museum lingvoj Crime

AKTing)(En-

Course-ware

PubMed

ACM BBC

Wildlife Finder

Calames Chronic-

ling America

data- open- ac- uk ElectionOpen

ProjectData

Slide- share2RDF

Finnish Munici- palities

OpenEI

CodesMARC List

VIVO Indiana HellenicPD

LCSH FanHubz

bible ontology IdRef

Sudoc

KEGG Enzyme NTU Resource

Lists

PRO- SITE

Linked Open Numbers Energy

(En- AKTing)

Roma Open

Calais

data bnf.fr

lobid Resources

IRIT theses.

fr LOV

Rådata nå!

Daily Med

Taxo- nomy

New- castle

Google Art wrapper Poké- pédia

EURES

BibBase

RESEX

STITCH PDB

EARTh

IBM Last.FM

artists (DBTune)

YAGO

ECS(RKB Explorer) Event

Media

STW my

Experi- ment BBC

Program- mes

NDL subjects

Taxon Concept

Pisa

KEGG Pathway

UniParc Jamendo (DBtune) Popula-

tion (En- AKTing)

Geo- WordNet

RAMEAU SH

UniSTS Mortality

AKTing)(En-

Alpine AustriaSki

DBLP (RKB Explorer)

Chem2 Bio2RDF MGI

DBLP (L3S)

Yahoo!

Geo Planet

GeneID RDF Book Mashup

El Viajero Tourism

Uberblic

SwedishOpen Cultural Heritage

GESIS

datadcs Last.FM

(rdfize) Ren.

Energy Genera- tors

Sears

RAE2001 NSZL Catalog

Homolo- Gene Ord-

nance Survey

TWC LOGD

Disea- some Produc-EUTC

tions

PSH

WordNet (W3C)

semantic web.org

Scotland graphyGeo-

Magna-tune

Norwe- MeSHgian

SGD Traffic

Scotland

statistics.

data.gov.

uk Crime Reports UK

UniProt

US Census (rdfabout)

Man- chester Reading Lists

EU Insti- tutions

PBAC VIAF

LOCODEUN/

Lexvo Linked MDB

stan-ESD dards reference.

data.gov.uk

t4gm info

Sudoc

ECS South- ampton

ePrints Classical

(DB Tune)

DBLP (FU Berlin)

Scholaro- meter St.

Andrews Resource Lists

NVD Fishes

of Texas Scotland

Pupils &

Exams

RISKS gnoss

DEPLOY

InterPro Lotico

Ox Points

Enipedia

ndlna

Budapest

CiteSeer

(33)

Linked Open Data (zoom)

Pfam DBpedia

ERA OS

TCM Gene DIT

SIDER Project

Guten- berg

Eurécom Drug Bank

LinkedCT dbpedia

lite

data- open- ac- Daily uk

Med

BibBase

PDB

DBLP (L3S) data dcs

Disea- some

UniProt

UN/

LOCODE DBLP

Berlin) (FU

Enipedia

(34)

Linked Open Data

Integrate many sources:

General ontologies.

Domain-specific ontologies.

Open data dumps.

Existing relational databases.

Find links

Automatically.

By hand (relations, rules...).

Statistics:

Hundreds of ontologies.

504 million RDF links (2011).

52 billion triples (2012). 2

5 billion entities (2012, incl., e.g., 854 million people).

Wealth of data, still underused

2 http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/

(35)

Challenges

Manage trust and attribution.

Manage time.

Manage uncertainty throughout the pipeline.

Extraction.

Trust.

Intrisic uncertainty.

Manage complex facts.

Reification.

Manage noise.

Compute alignments.

Références

Documents relatifs

In this paper, we present a system which makes scientific data available following the linked open data principle using standards like RDF und URI as well as the popular

This paper has shown how LODeX is able to provide a visual and navigable summary of a LOD dataset including its inferred schema starting from the URL of a SPARQL Endpoint3. The

The information ex- tracted by [7] overlaps the intensional knowledge that can be present within the triples of a dataset; LODeX is able to extract the triples belonging the

The main contribution of this paper is the definition of a system that achieves high precision/sensitivity in the tasks of filtering by entities, using semantic technologies to

For example, given two snapshots of a dataset captured at two different points in time, the change analysis at the triple level includes which triples from the previous snapshot

The LOD Russia research project funded by the Ministry of Educa- tion aims to create a first Linked Open Data Set in Russia enabling scientists, researchers and commercial users

A data integration subsystem will provide other subsystems or external agents with uniform access interface to all of the system’s data sources. Requested information

The given LOS GeoNames Weather service (getNearByWeather) augments a callers knowledge base with Linked Data according to the output pattern which is specified by means of a