Data Acquisition and Extraction from the Variety of Web Sources

10 September 2014, Yves Rocher

Pierre Senellart

Télécom ParisTech

Outline

The World Wide Web

Acquiring Various Forms of Web Content

Exploiting Acquired Information

Opportunities for Market Insights

Internet and the Web

Internet: physical network of computers (or hosts)

World Wide Web, Web, WWW: logical collection of hyperlinked documents

static and dynamic

public Web and private Webs

each document (or Web page, or resource) identified by a URL

Uniform Resource Locators

https://www.example.com:443/path/to/doc?name=foo&town=bar#para

scheme: https
hostname: www.example.com
port: 443
path: /path/to/doc
query string: name=foo&town=bar
fragment: para

scheme: way the resource can be accessed; generally http or https

hostname: domain name of a host (cf. DNS); the hostname of a website may start with www., but this is not a rule

port: TCP port; defaults: 80 for http and 443 for https

path: logical path of the document

query string: additional parameters (dynamic documents), optional

fragment: subpart of the document, optional

Relative URIs are resolved with respect to a context (e.g., the URI above):

/titi resolves to https://www.example.com/titi

tata resolves to https://www.example.com/path/to/tata
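The decomposition and the resolution of relative URIs can be checked with Python's standard urllib.parse module. This is only a small sketch; the variable names are chosen for illustration. Note that urljoin keeps the explicit :443 port of the context, which is equivalent to the port-free form shown above since 443 is the https default.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

# Context URL taken from the slide.
BASE = "https://www.example.com:443/path/to/doc?name=foo&town=bar#para"

parts = urlsplit(BASE)
components = {
    "scheme": parts.scheme,          # how the resource is accessed
    "hostname": parts.hostname,      # domain name of the host
    "port": parts.port,              # TCP port
    "path": parts.path,              # logical path of the document
    "query": parse_qs(parts.query),  # optional parameters, as a dict of lists
    "fragment": parts.fragment,      # optional subpart of the document
}

# Relative URIs are resolved against the context URL.
absolute_titi = urljoin(BASE, "/titi")  # absolute path: replaces the whole path
absolute_tata = urljoin(BASE, "tata")   # relative path: replaces the last segment
```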

(X)HTML

Format of choice for Web pages

Dialect of SGML (the ancestor of XML), but seldom parsed as is

HTML 4.01: most common version, W3C recommendation

XHTML 1.0: XML-ization of HTML 4.01, minor differences

HTML5: most recent version, still in development, adds some better structuring

Actual situation of the Web: tag soup

XHTML example

<!DOCTYPE html PUBLIC
  "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8" />
    <title>Example XHTML document</title>
  </head>
  <body>
    <p>This is a
      <a href="http://www.w3.org/">link to the
      <strong>W3C</strong>!</a></p>
  </body>
</html>

HTTP

Client-server protocol for the Web, on top of TCP/IP

Example request/response:

GET /myResource HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

<html>
<head><title>myResource</title></head>
<body><p>Hello world!</p></body>
</html>

HTTPS: secure version of HTTP
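To make the message format concrete, here is a minimal sketch that builds the text of a GET request and splits a raw response into status line, headers and body. It works on strings only, with no network access, and the function names build_get_request and parse_response are illustrative, not part of any library.

```python
# Raw response mirroring the example exchange (CRLF line endings as in HTTP/1.1).
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=ISO-8859-1\r\n"
    "\r\n"
    "<html><head><title>myResource</title></head>"
    "<body><p>Hello world!</p></body></html>"
)


def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"


def parse_response(raw):
    """Split a raw response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # an empty line ends the headers
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return int(code), reason, headers, body
```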

Features of HTTP/1.1

virtual hosting: different Web content for different hostnames on a single machine

login/password protection

content negotiation: same URL identifying several resources, client indicates preferences

cookies: chunks of information persistently stored on the client

keep-alive connections: several requests using the same TCP connection

etc.

Outline

The World Wide Web

Acquiring Various Forms of Web Content

    Regular Web Content
    CMS-based Web Content
    Social Networking Sites
    The Deep Web
    The Semantic Web

Exploiting Acquired Information

Opportunities for Market Insights

Web Crawlers

crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web

Basics of crawling:

1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL

No real termination condition (virtually unlimited number of Web pages!)

Graph-browsing problem

depth-first: not well adapted, can be lost in robot traps

best: breadth-first, with limited-depth depth-first on each discovered website
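The four steps can be sketched as a breadth-first traversal with a depth limit. This is a simplification (plain breadth-first, not the mixed strategy recommended above), and it assumes a caller-supplied fetch_links function, hypothetical here, that retrieves a page and returns the URLs found in it. The toy "Web" is an in-memory dictionary so that the sketch runs offline.

```python
from collections import deque


def crawl(seeds, fetch_links, max_depth=2, max_pages=1000):
    """Breadth-first crawl from the seeds; returns URLs in visiting order."""
    seen = set(seeds)                                # URLs already queued
    frontier = deque((url, 0) for url in seeds)      # step 1: start set
    visited = []
    while frontier and len(visited) < max_pages:     # artificial termination
        url, depth = frontier.popleft()
        visited.append(url)                          # step 2: retrieve/process
        if depth == max_depth:
            continue                                 # depth limit: robot traps
        for link in fetch_links(url):                # step 3: discover URLs
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))   # step 4: repeat
    return visited


# Toy stand-in for the Web: page -> outgoing links (illustrative data).
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"], "e": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []), max_depth=2)
```

With max_depth=2, page "e" (three links away from the seed) is never retrieved.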

Sources of new URLs

From HTML pages:

hyperlinks: <a href="...">...</a>

media: <img src="..."> <embed src="..."> <object data="...">

frames: <frame src="..."> <iframe src="...">

JavaScript links: window.open("...") etc.

Other hyperlinked content (e.g., PDF files)

Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them

Referrer URLs

Sitemaps [sitemaps.org, 2008]
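A minimal sketch of URL discovery from an HTML page, using only the Python standard library: tag attributes are collected with html.parser, and non-hyperlinked URLs in the text are found with a deliberately simple regular expression. The class and function names are illustrative; JavaScript links, referrers and sitemaps are not handled.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attribute carrying a URL for each tag listed on the slide.
URL_ATTRIBUTES = {
    "a": "href", "img": "src", "embed": "src",
    "object": "data", "frame": "src", "iframe": "src",
}
# Rough pattern for URLs appearing in plain text (an assumption, not a standard).
BARE_URL = re.compile(r"https?://[^\s\"'<>]+")


class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRIBUTES.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page's own URL.
                self.urls.add(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.urls.update(BARE_URL.findall(data))


def extract_urls(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```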

Scope of a crawler

Web-scale

    The Web is infinite! Avoid robot traps by putting depth or page number limits on each Web server
    Focus on important pages [Abiteboul et al., 2003]

Web servers under a list of DNS domains: easy filtering of URLs

A given topic: focused crawling techniques [Chakrabarti et al., 1999, Diligenti et al., 2000, Gouriten et al., 2014] based on classifiers of Web page content and predictors of the interest of a link.

The national Web (cf. public deposit, national libraries): what is this? [Abiteboul et al., 2002]

A given Web site: what is a Web site? [Senellart, 2005]

Identification of duplicate Web pages

Problem

Identifying duplicates or near-duplicates on the Web to prevent multiple indexing

trivial duplicates: same resource at the same canonicalized URL:

http://example.com:80/toto
http://example.com/titi/../toto

exact duplicates: identification by hashing

near-duplicates (timestamps, tip of the day, etc.): more complex!
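The first two cases can be sketched as follows, under simplifying assumptions: canonicalization only lower-cases the scheme and host, drops default ports and fragments, and normalizes the path (real crawlers apply more rules), and exact duplicates are detected by comparing SHA-256 digests of the page bytes. The function names are illustrative.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a canonical form of the URL (simplified rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host                       # drop the default port
    else:
        netloc = f"{host}:{parts.port}"
    # Resolve "." and ".." segments; note that this also drops a trailing slash.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def fingerprint(content):
    """Digest of the page bytes; equal digests indicate exact duplicates."""
    return hashlib.sha256(content).hexdigest()
```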

Crawling ethics

Standard for robot exclusion: robots.txt at the root of a Web server [Koster, 1994].

User-agent: *
Allow: /searchhistory/
Disallow: /search

Per-page exclusion.

<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">

Per-link exclusion.

<a href="toto.html" rel="nofollow">Toto</a>

Avoid Denial of Service (DoS): wait 1 s between two repeated requests to the same Web server
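The robots.txt rules above can be evaluated with Python's standard urllib.robotparser module. In this sketch the rules are parsed from a string instead of being downloaded, and the user agent name "ExampleCrawler" and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in the robots.txt example above.
ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleCrawler"):
    """True if the parsed robots.txt lets this user agent fetch the URL."""
    return rules.can_fetch(user_agent, url)
```

A crawler would call allowed(url) before each request, in addition to honoring the per-page and per-link exclusions and the delay between requests.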

Parallel processing

Network delays, waits between requests:

Per-server queue of URLs

Parallel processing of requests to different hosts:

multi-threaded programming

asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead

Use of keep-alive to reduce connection overheads
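One possible shape for the per-server queues is sketched below: URLs are grouped by hostname, and a URL is handed out only if its host has not been contacted during the last delay seconds. The class name PoliteFrontier is hypothetical, and the current time is passed in explicitly so that the example stays deterministic; the threading or asynchronous I/O layer that would consume these URLs is left out.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """One FIFO queue of URLs per host, with a minimum delay per host."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.queues = defaultdict(deque)         # hostname -> pending URLs
        self.next_allowed = defaultdict(float)   # hostname -> earliest next time

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def pop_ready(self, now):
        """Return a URL whose host may be contacted at time `now`, or None."""
        for host, queue in self.queues.items():
            if queue and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None
```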

General Architecture [Chakrabarti, 2003]

(Figure: crawler architecture diagram, not reproduced.)

Refreshing URLs

Content on the Web changes

Different change rates:

online newspaper main page: every hour or so

published article: virtually no change

Continuous crawling, and identification of change rates for adaptive crawling: how to know the time of last modification of a Web page?

Estimating the Freshness of a Page

1. Check HTTP timestamp.

2. Check content timestamp.

3. Compare a hash of the page with a stored hash.

4. Non-significant differences (ads, fortunes, request timestamp):

only hash text content, or “useful” text content;

compare distribution of n-grams (shingling);

or even compute edit distance with previous version.

Adapting strategy to each different archived website?
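Steps 3 and 4 can be sketched as follows. The comparison of n-gram distributions is implemented here as the Jaccard similarity between sets of word 3-grams, which is one common choice and an assumption on our part, as are the 0.9 default threshold and the function names.

```python
import hashlib


def content_hash(text):
    """Hash of the page text, for exact comparison with a stored hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of word n-grams (shingles) of the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shingle_similarity(old, new, n=3):
    """Jaccard similarity between the shingle sets of two versions."""
    a, b = shingles(old, n), shingles(new, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def has_changed(old, new, threshold=0.9):
    """True if the new version differs significantly from the old one."""
    if content_hash(old) == content_hash(new):
        return False                      # identical content
    return shingle_similarity(old, new) < threshold
```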

Crawling Modern Web Sites

Some modern Web sites only work when cookies are activated (session cookies), or when JavaScript code is interpreted

Regular Web crawlers (wget, Heritrix, Apache Nutch) usually don’t do cookie management and don’t interpret JavaScript code

Crawling some Web sites therefore requires more advanced tools

Advanced crawling tools

Web scraping frameworks such as scrapy (Python) or WWW::Mechanize (Perl) simulate a Web browser interaction and cookie management (but no JS interpretation)

Headless browsers such as htmlunit simulate a Web browser, including simple JavaScript processing

Browser instrumentors such as Selenium allow full instrumentation of a regular Web browser (Chrome, Firefox, Internet Explorer)

OXPath: a full-fledged navigation and extraction language for complex Web sites [Sellers et al., 2011]

Templated Web Site

Many Web sites (especially Web forums and blogs) use one of a few content management systems (CMS)

Web sites that use the same CMS will be similarly structured, present a similar layout, etc.

Information is somewhat structured in CMSs: publication date, author, tags, forums, threads, etc.

Some structure differences may exist when Web sites use different versions, or different themes, of a CMS

Crawling CMS-Based Web Sites

Traditional crawling approaches crawl Web sites independently of the nature of the sites and of their CMS

When the CMS is known:

Potential for much more efficient crawling strategies (avoid pages with redundant information, uninformative pages, etc.)

Potential for automatic extraction of structured content

Two ways of approaching the problem:

Have a handcrafted knowledge base of known CMSs, their characteristics, how to crawl and extract information [Faheem and Senellart, 2013b,a] (AAH)

Automatically infer the best way to crawl a given CMS [Faheem and Senellart, 2014] (ACE)

Need to be robust w.r.t. template change

Detecting CMSs

One main challenge in intelligent crawling and content extraction is to identify the CMS and then apply the best crawling strategy accordingly

Detecting a CMS using:

1. URL patterns,
2. HTTP metadata,
3. textual content,
4. XPath patterns, etc.

These can be manually described (AAH), or automatically inferred (ACE)

For instance, the vBulletin Web forum content management system can be identified by searching for a reference to a vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
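The vBulletin check can be sketched with the standard library alone. xml.etree.ElementTree supports only a subset of XPath and cannot return attribute values directly, so the sketch walks the tree and reads the src attribute of every script element, which has the same effect as //script/@src on well-formed (X)HTML. Real pages are often tag soup and would need a tolerant HTML parser instead; the function names are illustrative.

```python
import xml.etree.ElementTree as ET


def script_sources(xhtml):
    """Values of the src attribute of all script elements (like //script/@src)."""
    root = ET.fromstring(xhtml)
    sources = []
    for element in root.iter():
        # Compare local names so that namespaced XHTML elements also match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "script" and element.get("src"):
            sources.append(element.get("src"))
    return sources


def looks_like_vbulletin(xhtml):
    """Heuristic from the slide: a script reference to vbulletin_global.js."""
    return any("vbulletin_global.js" in src for src in script_sources(xhtml))
```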

Crawling http://www.rockamring-blog.de/

[Faheem and Senellart, 2014]

(Plot: number of distinct 2-grams (×1,000; axis from 0 to 300) against number of HTTP requests (axis from 0 to 6,000), for ACE, AAH, and wget.)

Most popular Web sites

1. google.com
2. facebook.com
3. youtube.com
4. yahoo.com
5. baidu.com
6. wikipedia.org
7. live.com
8. twitter.com
9. qq.com
10. amazon.com
11. blogspot.com
12. linkedin.com
13. google.co.in
14. taobao.com
15. sina.com.cn
16. yahoo.co.jp
17. msn.com
18. wordpress.com
19. google.com.hk
20. t.co
21. google.de
22. ebay.com
23. google.co.jp
24. googleusercontent.com
25. google.co.uk
26. yandex.ru
27. 163.com
28. weibo.com

(Source: Alexa)

Among these: social networking sites, and sites with social networking features (friends, user-shared content, user profiles, etc.)

Social data on the Web

Huge numbers of users (2012):

Facebook    900 million
QQ          540 million
W. Live     330 million
Weibo       310 million
Google+     170 million
Twitter     140 million
LinkedIn    100 million

Huge volume of shared data:

250 million tweets per day on Twitter (3,000 per second on average!)...

... including statements by heads of state, revelations of political activists, etc.

Crawling Social Networks

Theoretically possible to crawl social networking sites using a regular Web crawler

Sometimes not possible: https://www.facebook.com/robots.txt

Often very inefficient, considering politeness constraints

Better solution: use provided social networking APIs

https://dev.twitter.com/docs/api/1.1

https://developers.facebook.com/docs/graph-api/reference/v2.1/

https://developer.linkedin.com/apis

https://developers.google.com/youtube/v3/

Also possible to buy access to the data, directly from the social network or from brokers such as http://gnip.com/

Social Networking APIs

Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content

Usually a RESTful API, occasionally SOAP-based

Usually require a token identifying the application using the API, sometimes a cryptographic signature as well

May access the API as an authenticated user of the social network, or as an external party

APIs seriously limit the rate of requests:

https://dev.twitter.com/docs/api/1.1/get/search/tweets

REST

Mode of interaction with a Web service

Follow the KISS (Keep it Simple, Stupid) principle

Each request to the service is a simple HTTP GET method

Base URL is the URL of the service

Parameters of the service are sent as HTTP parameters (in the URL)

HTTP response code indicates success or failure

Response contains structured output, usually as JSON or XML

No side effect, each request independent of previous ones

Example: http://graph.facebook.com:80/?ids=7901103
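A minimal sketch of this interaction pattern, kept offline: the request URL is assembled from the base URL and the parameters, and a response is interpreted from its status code and JSON body. The JSON document below is invented for illustration and is not the actual output of the Facebook Graph API; the function names are also illustrative.

```python
import json
from urllib.parse import urlencode


def build_request_url(base_url, **params):
    """Base URL of the service plus its parameters, sent in the URL."""
    return f"{base_url}?{urlencode(params)}" if params else base_url


def parse_response(status_code, body):
    """The HTTP status code signals success or failure; the body is JSON."""
    if status_code != 200:
        raise RuntimeError(f"request failed with HTTP status {status_code}")
    return json.loads(body)


url = build_request_url("http://graph.facebook.com:80/", ids=7901103)

# Hypothetical response body, for illustration only.
SAMPLE_BODY = '{"7901103": {"id": "7901103", "name": "Example"}}'
```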

The Case of Twitter

Two main APIs:

REST APIs, including search, getting information about a user, a list, followers, etc. https://dev.twitter.com/docs/api/1.1

Streaming API, providing real-time results

Very limited history available

Search can be on keywords, language, geolocation (for a small portion of tweets)

Cross-Network Crawling

Often useful to combine results from different social networks

Numerous libraries facilitating SN API accesses (twipy, Facebook4J, FourSquare VP C++ API...) incompatible with each other...

Some efforts at generic APIs (OneAll, APIBlender [Gouriten et al., 2014])

Example use case: No API to get all check-ins from FourSquare, but a number of check-ins are available on Twitter; given results of Twitter Search/Streaming, use FourSquare API to get information about check-in locations.

The Deep Web

Definition (Deep Web, Hidden Web, Invisible Web)

All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.

Size estimate: 500 times more content than on the surface Web! [BrightPlanet, 2000]

Hundreds of thousands of deep Web databases [Chang et al., 2004]

Sources of the Deep Web

Example

Yellow Pages and other directories;

Library catalogs;

Weather services;

US Census Bureau data;

etc.

Discovering Knowledge from the Deep Web [Nayak et al., 2012]

Content of the deep Web hidden to classical Web search engines (they just follow links)

But very valuable and high quality!

Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed from the deep Web

How to benefit from this information?

How to analyze, extract and model this information?

Focus here: automatic, unsupervised methods, for a given domain of interest

Extensional Approach

(Figure: pipeline from the WWW through discovery, siphoning (with a bootstrap loop), and indexing into an index.)

Notes on the Extensional Approach

Main issues:

Discovering services

Choosing appropriate data to submit forms

Use of data found in result pages to bootstrap the siphoning process

Ensure good coverage of the database

Approach favored by Google, used in production [Madhavan et al., 2006]

Not always feasible (huge load on Web servers)
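The bootstrapping idea can be illustrated on a toy example. In this sketch, which is not the production approach cited above, a deep Web source is modeled as a function taking a search term and returning matching records; terms found in the results are fed back as new queries. The names siphon, toy_form and DATABASE are hypothetical.

```python
def siphon(query_form, seed_terms, extract_terms, max_queries=100):
    """Repeatedly submit terms to the form, reusing terms found in results."""
    pending = list(seed_terms)
    tried = set()
    records = set()
    while pending and len(tried) < max_queries:   # bound the load on the server
        term = pending.pop(0)
        if term in tried:
            continue
        tried.add(term)
        for record in query_form(term):
            if record not in records:
                records.add(record)
                pending.extend(extract_terms(record))  # bootstrap new queries
    return records


# Toy database hidden behind a keyword search form (illustrative data).
DATABASE = ["web data", "data mining", "mining graphs", "cooking pasta"]


def toy_form(term):
    return [record for record in DATABASE if term in record.split()]


siphoned = siphon(toy_form, ["web"], str.split)
```

The record "cooking pasta" shares no term with the others and is never reached from the seed "web", which illustrates the coverage issue.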

Intensional Approach

(Figure: pipeline from the WWW through discovery, probing, and analyzing, yielding a form wrapped as a Web service that can then be queried.)

Notes on the Intensional Approach

More ambitious [Chang et al., 2005, Senellart et al., 2008]

Main issues:

Discovering services

Understanding the structure and semantics of a form

Understanding the structure and semantics of result pages

Semantic analysis of the service as a whole

Query rewriting using the services

No significant load imposed on Web servers

The Semantic Web

A Web in which the resources are semantically described

annotations give information about a page, explain an expression in a page, etc.

More precisely, a resource is anything that can be referred to by a URI

a web page, identified by a URL

a fragment of an XML document, identified by an element node of the document,

a web service,

a thing, an object, a concept, a property, etc.

Semantic annotations: logical assertions that relate resources to some terms in associated ontologies

Ontologies

Formal descriptions providing human users a shared understanding of a given domain

A controlled vocabulary

Formally defined so that it can also be processed by machines

Logical semantics that enables reasoning

Reasoning is the key for different important tasks of Web data management, in particular:

to answer queries (over possibly distributed data)

to relate objects in different data sources, enabling their integration

to detect inconsistencies or redundancies

to refine queries with too many answers, or to relax queries with no answer

Where Do Ontologies Come From?

Manually crafted to represent the knowledge of a specific domain (e.g., life sciences)

Exported from classical Web databases

Through information extraction from the Web, Wikipedia, etc. (e.g., DBpedia, YAGO)

Private to a company or public

Some ontologies focus on instances, others on a schema (see further)

Value of the Semantic Web: bits of ontologies can be re-used in another, and ontologies can be mapped through an owl:sameAs link

Linking Open Data cloud diagram (as of September 2011), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

(Figure: diagram of interlinked datasets, grouped by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.)

Classes and class hierarchy

Backbone of the ontology

AcademicStaff is a Class (a class will be interpreted as a set of objects)

AcademicStaff isa Staff (isa is interpreted as set inclusion)

FacultyComponent
    Course
        MathCourse
            Probabilities
            Algebra
            Logic
        CSCourse
            DB
            AI
            Java
    Student
        UndergraduateStudent
        MasterStudent
        PhDStudent
    Department
        PhysicsDept
        MathsDept
        CSDept
    Staff
        AcademicStaff
            Lecturer
            Researcher
            Professor
        AdministrativeStaff

Relations

Declaration of relations with their signature

(Relations will be interpreted as binary relations between objects)

TeachesIn(AcademicStaff, Course)

if one states that “X TeachesIn Y”, then X belongs to AcademicStaff and Y to Course

TeachesTo(AcademicStaff, Student)

Leads(Staff, Department)

Instances

Classes have instances

Dupond is an instance of the class Professor

corresponds to the fact: Professor(Dupond)

Relations also have instances

(Dupond, CS101) is an instance of the relation TeachesIn

corresponds to the fact: TeachesIn(Dupond, CS101)

The instance statements can be seen as (and stored in) a database

Ontology = schema + instance

Schema (TBox)

The set of class and relation names

The signatures of relations and also constraints

The constraints are used for two purposes:

checking data consistency (like dependencies in databases)

inferring new facts

Instance (ABox)

The set of facts

The set of base facts together with the inferred facts should satisfy the constraints

Ontology (i.e., Knowledge Base) = Schema + Instance
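How a schema lets new facts be inferred from base facts can be shown on the running example. The sketch below uses plain Python sets rather than an RDF or OWL toolkit; the data structures and the function name saturate are illustrative, and only two kinds of constraints are handled (class inclusion and relation signatures).

```python
# Schema (TBox): part of the class hierarchy and one relation signature.
SUBCLASS_OF = {"Professor": "AcademicStaff", "AcademicStaff": "Staff"}
SIGNATURES = {"TeachesIn": ("AcademicStaff", "Course")}

# Instance (ABox): base facts.
CLASS_FACTS = {("Professor", "Dupond")}
RELATION_FACTS = {("TeachesIn", "Dupond", "CS101")}


def ancestors(cls):
    """All superclasses of a class, following isa links upwards."""
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        yield cls


def saturate(class_facts, relation_facts):
    """Base class facts plus those inferred from signatures and isa links."""
    facts = set(class_facts)
    for relation, x, y in relation_facts:
        domain, range_ = SIGNATURES[relation]
        facts.add((domain, x))   # X TeachesIn Y implies X in AcademicStaff
        facts.add((range_, y))   # ... and Y in Course
    for cls, obj in list(facts):
        facts.update((parent, obj) for parent in ancestors(cls))
    return facts


all_facts = saturate(CLASS_FACTS, RELATION_FACTS)
```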

Where can Semantic Content be Found?

In the linked data, through Web-available RDF data:

dumps of an entire ontology, in one of the RDF serialization formats (RDF/XML, Turtle, N-Triples)

crawlable RDF content, with small fragments pointing to other fragments

a SPARQL endpoint

HTML annotated with RDFa, cf. http://www.w3.org/TR/rdfa-syntax/

Other popular semantic content embedded in Web pages: microformats (hCard, vCard, etc.), microdata (cf. http://www.schemas.org/). Not directly in the spirit of the Semantic Web, but heavily used.

RDF content used internally in a company

How to Acquire Semantic Content?

Much easier to exploit, as it is already semantically described

Individual resources (dumps, SPARQL endpoints) that have been identified as valuable can be directly exploited

RDFa content, microformats, microdata, can be discovered from regular Web crawls

Not perfect! There are errors, lies, etc.

Outline

The World Wide Web

Acquiring Various Forms of Web Content

Exploiting Acquired Information

    Information Extraction
    Graph Mining
    Opinion Mining

Opportunities for Market Insights

Information Extraction

See Parts “Instance Extraction” and “Fact Extraction” from my colleague Fabian Suchanek’s lecture

http://suchanek.name/work/teaching/IE2010a.pdf

The Web Graph

The World Wide Web seen as a (directed) graph:

Vertices: Web pages

Edges: hyperlinks

Same for other interlinked environments:

dictionaries

encyclopedias

scientific publications

social networks

Google’s PageRank [Brin and Page, 1998]

Idea

Important pages are pages pointed to by important pages.

$$
g_{ij} =
\begin{cases}
0 & \text{if there is no link between page } i \text{ and } j;\\
\frac{1}{n_i} & \text{otherwise, with } n_i \text{ the number of outgoing links of page } i.
\end{cases}
$$

Definition (Tentative)

Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future.

$$
\mathrm{pr}(i) = \left( \lim_{k \to +\infty} (G^T)^k v \right)_i
$$

where v is some initial column vector.
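The tentative definition amounts to repeatedly multiplying a score vector by the transposed transition matrix. Here is a minimal pure-Python sketch on a three-page graph chosen for illustration; it takes the uniform vector as v and assumes that every page has at least one outgoing link.

```python
def pagerank_no_damping(links, iterations=100):
    """Approximate (G^T)^k v by iterating; `links` maps a page to its targets."""
    pages = sorted(links)
    score = {page: 1.0 / len(pages) for page in pages}   # initial vector v
    for _ in range(iterations):
        new_score = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                # g_ij = 1 / n_i: a page shares its score among its links.
                new_score[target] += score[page] / len(targets)
        score = new_score
    return score


# Illustrative graph: A links to B and C, B links to C, C links to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank_no_damping(LINKS)
```

On this graph the iteration converges to scores of 0.4 for A and C and 0.2 for B, which can be checked by hand against the fixed-point equations.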

Illustrating PageRank Computation

(Animation on a 10-node example graph: all scores start at 0.100 and, after successive iterations, stabilize around 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, and 0.019.)

PageRank With Damping

May not always converge, or convergence may not be unique.

To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).

$$\mathrm{pr}(i) = \left( \lim_{k \to +\infty} \left( (1-d)\, G^{T} + d\, U \right)^{k} v \right)_{i}$$

where U is the matrix with all values equal to 1/N, with N the number of vertices.
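The limit above can be approximated by iterating the product a fixed number of times. The following is a minimal sketch, not a reference implementation: it assumes that G is the row-normalized adjacency matrix (each page splits its score evenly among its outgoing links), that v starts uniform, and that pages without outgoing links spread their score uniformly. The function name `pagerank`, the `links` dictionary, and the default d = 0.15 are illustrative choices, not taken from the slides.

```python
def pagerank(links, d=0.15, iterations=100):
    """Approximate PageRank with random jumps.

    links: dict mapping each page to the list of pages it links to.
    d: probability of jumping to a random page (1 - d is the damping factor).
    """
    nodes = sorted(set(links) | {w for out in links.values() for w in out})
    n = len(nodes)
    v = {u: 1.0 / n for u in nodes}  # uniform start vector (assumption)
    for _ in range(iterations):
        # d * U v: since v sums to 1, every page receives d / N from random jumps
        nxt = {u: d / n for u in nodes}
        for u in nodes:
            out = links.get(u, [])
            # assumption: a page without outgoing links spreads its score uniformly
            targets = out if out else nodes
            share = (1.0 - d) * v[u] / len(targets)
            for w in targets:
                nxt[w] += share  # (1 - d) * G^T v
        v = nxt
    return v


scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Because the random-jump term gives every page a nonzero score at each step, the iteration no longer depends on the graph being strongly connected or aperiodic.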


Using PageRank to Score Search Results

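PageRank: global score, independent of the query

Can be used to raise the weight of important pages, in association with some scoring function that depends on the query (here d denotes a document, not the jump probability):

$$\mathrm{final}(q, d) = \mathrm{score}(q, d) \times \mathrm{pr}(d)$$

PageRank is only useful in directed graphs! In an undirected graph, it is proportional to the degree.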
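As an illustration of this combination, here is a small sketch that multiplies a query-dependent score by a precomputed PageRank and sorts the documents. The function name `rerank` and the numeric values are made up for the example; any query scoring function (tf-idf, for instance) could supply `query_scores`.

```python
def rerank(query_scores, pagerank_scores):
    """Order documents by score(q, d) * pr(d), best first.

    query_scores: dict document -> query-dependent score for one query.
    pagerank_scores: dict document -> precomputed PageRank.
    """
    final = {
        doc: score * pagerank_scores.get(doc, 0.0)
        for doc, score in query_scores.items()
    }
    # ties are broken by document name so that the order is deterministic
    return sorted(final, key=lambda doc: (-final[doc], doc))


# hypothetical scores for three pages matching a query
query_scores = {"p1": 0.9, "p2": 0.8, "p3": 0.1}
pagerank_scores = {"p1": 0.05, "p2": 0.20, "p3": 0.10}
print(rerank(query_scores, pagerank_scores))  # ['p2', 'p1', 'p3']
```

Page p2 overtakes p1 even though it matches the query slightly less well, because its global importance is four times higher.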


HITS [Kleinberg, 1999]

Idea

Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.

G′: adjacency matrix (with 0 and 1 values) of a subgraph of the Web.

We use the following iterative process (starting with a and h vectors of norm 1):

$$\begin{cases} a := \dfrac{1}{\lVert {G'}^{T} h \rVert}\, {G'}^{T} h \\[1ex] h := \dfrac{1}{\lVert G' a \rVert}\, G' a \end{cases}$$

Converges under some technical assumptions to authority and hub scores.
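The two update rules can be sketched directly on an edge list, without building the matrix G′ explicitly. This is a minimal illustration under assumptions: the Euclidean norm is used, the number of iterations is fixed rather than tested for convergence, and the names `hits` and `_normalize` are introduced here for the example.

```python
import math


def _normalize(vec):
    """Scale a dict of scores to Euclidean norm 1 (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {k: x / norm for k, x in vec.items()} if norm > 0 else vec


def hits(edges, iterations=50):
    """edges: list of (source, target) links of the subgraph.

    Returns (authority, hub) score dictionaries.
    """
    nodes = sorted({u for edge in edges for u in edge})
    a = _normalize({u: 1.0 for u in nodes})
    h = _normalize({u: 1.0 for u in nodes})
    for _ in range(iterations):
        new_a = {u: 0.0 for u in nodes}
        for src, dst in edges:
            new_a[dst] += h[src]  # a := G'^T h, then normalized
        a = _normalize(new_a)
        new_h = {u: 0.0 for u in nodes}
        for src, dst in edges:
            new_h[src] += a[dst]  # h := G' a, then normalized
        h = _normalize(new_h)
    return a, h


authority, hub = hits([("h1", "x"), ("h2", "x"), ("h1", "y")])
print(max(authority, key=authority.get), max(hub, key=hub.get))  # x h1
```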


Using HITS to Order Web Query Results

1. Retrieve the set D of Web pages matching a keyword query.

2. Retrieve the set D* of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.

3. Build from D* the corresponding subgraph G′ of the Web graph.

4. Compute iteratively hub and authority scores.

5. Sort documents from D by authority scores.

Less efficient than PageRank, because the scores are local (they depend on the query, so they cannot be precomputed once for the whole Web).
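Steps 2 and 3 above can be sketched as follows, assuming the link structure is available as an in-memory dictionary (in practice it would come from a crawl or a link index). The names `base_subgraph`, `root` and `web_links` are hypothetical; the returned edge list is in the form expected by an iterative hub and authority computation.

```python
def base_subgraph(root, web_links):
    """Expand the result set D and extract the induced subgraph.

    root: pages matching the query (the set D).
    web_links: dict page -> collection of pages it links to.
    Returns (expanded set, sorted list of (source, target) edges).
    """
    root = set(root)
    base = set(root)
    for page, targets in web_links.items():
        targets = set(targets)
        if page in root:
            base |= targets  # pages linked from a page of D
        elif targets & root:
            base.add(page)  # pages linking to a page of D
    edges = sorted(
        (u, v) for u in base for v in web_links.get(u, ()) if v in base
    )
    return base, edges


web = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": {"a"}, "e": {"c"}}
pages, edges = base_subgraph({"a"}, web)
print(sorted(pages))  # ['a', 'b', 'c', 'd']
```

Page e is left out: it links to c, which is in the expanded set, but not to any page of D itself.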


Discovery of communities

Classical problem in social networks: identifying communities of users (or of content) using the graph structure

Two subproblems:

1. Given some initial vertex or vertex set, finding the corresponding community

2. Given the graph as a whole, finding a partition into communities


Maximum Flow / Minimum Cut

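[Figure: flow network between a source and a sink, with edge capacities /6, /2, /1, /5, /2, /3 and /4; later frames of the animation add the flow carried by each edge.]

Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the rest of the graph

Complexity: $O(n^2 m)$ (n: vertices, m: edges)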
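To make the idea concrete, the sketch below computes a maximum flow and the corresponding minimum cut on a small friendship graph. Note that it uses the simpler Edmonds-Karp algorithm (shortest augmenting paths found by breadth-first search, complexity O(n m²)) rather than the push-relabel algorithm of Goldberg and Tarjan cited above. The function name `min_cut`, the unit capacities, and the example graph are assumptions made for illustration; the graph is not the one drawn on the slide.

```python
from collections import deque


def min_cut(capacity, source, sink):
    """capacity: dict (u, v) -> capacity of the directed edge u -> v.

    Returns (maximum flow value, set of vertices on the source side of a minimum cut).
    """
    residual, neighbours = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge used to undo flow
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    flow = 0
    while True:
        # breadth-first search for an augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            # no augmenting path left: reachable vertices form the source side
            return flow, set(parent)
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# two triangles of friends joined by a single friendship a3 - b1
friendships = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
               ("b1", "b2"), ("b2", "b3"), ("b1", "b3"), ("a3", "b1")]
capacity = {}
for u, v in friendships:
    capacity[(u, v)] = 1  # an undirected edge becomes two directed edges
    capacity[(v, u)] = 1
value, community = min_cut(capacity, "a1", "b3")
print(value, sorted(community))  # 1 ['a1', 'a2', 'a3']
```

With the seed user a1 as the source, the minimum cut removes only the bridge between the two triangles, so the source side is the community of the seed.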


Markov Cluster Algorithm (MCL) [van Dongen, 2000]

Graph clustering algorithm

Based as well on maximum flow simulation, in the whole graph

Iteration of a matrix computation alternating:

Expansion (matrix multiplication, corresponding to flow propagation)

Inflation (non-linear operation to increase heterogeneity)

Complexity: $O(n^3)$ for an exact computation, $O(n)$ for an approximate one

[van Dongen, 2000]


Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]

Top-down graph clustering algorithm

Betweenness of an edge: number of shortest paths between pairs of vertices that go through this edge

General principle:

1. Compute the betweenness of each edge in the graph

2. Remove the edge with the highest betweenness

3. Redo the whole process, betweenness computation included

Complexity: $O(n^3)$ for a sparse graph

[Newman and Girvan, 2004]
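The three steps can be sketched as below for a small undirected graph without self-loops. Two choices are assumptions of this sketch rather than statements from the slide: when several shortest paths join a pair of vertices, each path counts fractionally (a Brandes-style accumulation), and the loop stops as soon as a requested number of connected components is reached. The names `edge_betweenness`, `components` and `girvan_newman` are introduced for the example.

```python
from collections import deque


def edge_betweenness(adj):
    """adj: dict vertex -> set of neighbours. Returns dict frozenset({u, v}) -> betweenness."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, order = {s: 0}, []
        sigma = {v: 0.0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate from the farthest vertices back to s
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((u, w))] += c
                delta[u] += c
    return {e: b / 2.0 for e, b in bet.items()}  # each pair was counted twice


def components(adj):
    """Connected components as a sorted list of sorted vertex lists."""
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(sorted(comp))
    return sorted(result)


def girvan_newman(adj, target=2):
    """Remove highest-betweenness edges until `target` components appear."""
    adj = {u: set(vs) for u, vs in adj.items()}
    while True:
        comps = components(adj)
        bet = edge_betweenness(adj)
        if len(comps) >= target or not bet:
            return comps
        u, v = tuple(max(bet, key=bet.get))
        adj[u].discard(v)
        adj[v].discard(u)


# two triangles joined by the edge 2 - 3
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(girvan_newman(graph))  # [[0, 1, 2], [3, 4, 5]]
```

In this example all nine shortest paths between the two triangles go through the edge 2 - 3, so it is removed first and the graph splits into the two triangles.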


Outline

The World Wide Web

Acquiring Various Forms of Web Content

Exploiting Acquired Information

Information Extraction

Graph Mining

Opinion Mining

Opportunities for Market Insights


Opinion Mining

See my colleague Chloé Clavel’s lecture: http://pierre.senellart.com/enseignement/2013-2014/inf344/10-opinion-mining.pdf


Opportunities for Market Insights

Crawl a competitor’s Web site, apply a wrapper to extract structured information, regularly refresh this crawl ⇒ a local database of a competitor’s products and prices, ready to be analyzed

Crawl Web forums, blogs, social networking sites, for opinions about a brand, and mine the obtained social network ⇒ identify opinion leaders, and target them for marketing

Exploit Deep Web forms to crawl all patents pertaining to a particular topic, perform instance extraction to identify all molecules cited in the patents, use linked open data ontologies to connect these molecules to known metabolic pathways ⇒ get more insight into which biological phenomena are targeted by competitors’ inventions
