10 September 2014, Yves Rocher
Data Acquisition and Extraction from the Variety of Web Sources
Pierre Senellart
2 / 74 Télécom ParisTech Pierre Senellart
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
Opportunities for Market Insights
Internet and the Web
Internet: physical network of computers (or hosts)
World Wide Web, Web, WWW: logical collection of hyperlinked documents
static and dynamic
public Web and private Webs
each document (or Web page, or resource) identified by a URL
Uniform Resource Locators
https://www.example.com:443/path/to/doc?name=foo&town=bar#para
(scheme: https; hostname: www.example.com; port: 443; path: /path/to/doc; query string: name=foo&town=bar; fragment: para)
scheme: way the resource can be accessed; generally http or https
hostname: domain name of a host (cf. DNS); the hostname of a website may start with www., but this is not a rule
port: TCP port; defaults: 80 for http and 443 for https
path: logical path of the document
query string: additional parameters (dynamic documents), optional
fragment: subpart of the document, optional
Relative URIs with respect to a context (e.g., the URI above):
/titi → https://www.example.com/titi
tata → https://www.example.com/path/to/tata
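The decomposition and the resolution of relative URIs above can be checked with Python's standard `urllib.parse` module. This is only an illustrative sketch; the variable and function names (`components`, `resolve`) are ours, not part of the slides.

```python
from urllib.parse import urlsplit, urljoin

url = "https://www.example.com:443/path/to/doc?name=foo&town=bar#para"

# Split the URL into the components named on the slide.
parts = urlsplit(url)
components = {
    "scheme": parts.scheme,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

def resolve(base: str, relative: str) -> str:
    """Resolve a relative URI against a base (context) URI."""
    return urljoin(base, relative)

absolute_from_root = resolve(url, "/titi")   # path replaced from the root
absolute_sibling = resolve(url, "tata")      # last path segment replaced
```

Note that `urljoin` keeps the explicit `:443` of the base URL in its result; since 443 is the default port for https, this designates the same resource as the port-less form shown above.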
(X)HTML
Format of choice for Web pages
Dialect of SGML (the ancestor of XML), but seldom parsed as is
HTML 4.01: most common version, W3C recommendation
XHTML 1.0: XML-ization of HTML 4.01, minor differences
HTML5: most recent version, still in development, adds some better structuring
Actual situation of the Web: tag soup
XHTML example
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
lang="en" xml:lang="en">
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=utf-8" />
<title>Example XHTML document</title>
</head>
<body>
<p>This is a
<a href="http://www.w3.org/">link to the
<strong>W3C</strong>!</a></p>
</body>
</html>
HTTP
Client-server protocol for the Web, on top of TCP/IP
Example request/response:
GET /myResource HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=ISO-8859-1

<html>
<head><title>myResource</title></head>
<body><p>Hello world!</p></body>
</html>
HTTPS: secure version of HTTP
Features of HTTP/1.1
virtual hosting: different Web content for different hostnames on a single machine
login/password protection
content negotiation: same URL identifying several resources, client indicates preferences
cookies: chunks of information persistently stored on the client
keep-alive connections: several requests using the same TCP connection
etc.
Outline
The World Wide Web
Acquiring Various Forms of Web Content
  Regular Web Content
  CMS-based Web Content
  Social Networking Sites
  The Deep Web
  The Semantic Web
Exploiting Acquired Information
Opportunities for Market Insights
Web Crawlers
Crawlers, (Web) spiders, (Web) robots: autonomous user agents that retrieve pages from the Web
Basics of crawling:
1. Start from a given URL or set of URLs
2. Retrieve and process the corresponding page
3. Discover new URLs (cf. next slide)
4. Repeat on each found URL
No real termination condition (virtually unlimited number of Web pages!)
Graph-browsing problem
depth-first: poorly suited, the crawler can get lost in robot traps
best: breadth-first, with limited-depth depth-first on each discovered website
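The crawling loop above can be sketched as a breadth-first traversal with a depth limit. To stay self-contained, this sketch does not touch the network: `fetch_links` is a hypothetical callback standing in for "retrieve the page and extract its URLs", and `TOY_WEB` is a made-up link graph.

```python
from collections import deque

def crawl(seeds, fetch_links, max_depth=2, max_pages=100):
    """Breadth-first crawl. fetch_links(url) returns the URLs found on a page."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)              # step 2: retrieve and process the page
        if depth == max_depth:
            continue                   # depth limit: protects against robot traps
        for link in fetch_links(url):  # step 3: discover new URLs
            if link not in seen:       # step 4: repeat, but never revisit a URL
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph used instead of real HTTP requests.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["a", "d"],
    "d": ["e"],
    "e": ["f"],
}
visited = crawl(["a"], lambda u: TOY_WEB.get(u, []), max_depth=2)
```

The `max_depth` and `max_pages` bounds play the role of the missing termination condition: without them the loop would only stop when no new URL is found.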
Sources of new URLs
From HTML pages:
hyperlinks: <a href="...">...</a>
media: <img src="..."> <embed src="..."> <object data="...">
frames: <frame src="..."> <iframe src="...">
JavaScript links: window.open("...") etc.
Other hyperlinked content (e.g., PDF files)
Non-hyperlinked URLs that appear anywhere on the Web (in HTML text, text files, etc.): use regular expressions to extract them
Referrer URLs
Sitemaps [sitemaps.org, 2008]
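As an illustration of the HTML case, the following sketch collects URLs from the tag/attribute pairs listed above, using Python's standard `html.parser`. The class and function names are ours; JavaScript links, PDF content and sitemaps are not handled here.

```python
from html.parser import HTMLParser

# (tag, attribute) pairs that carry a URL, as listed on the slide.
URL_ATTRIBUTES = {
    ("a", "href"), ("img", "src"), ("embed", "src"),
    ("object", "data"), ("frame", "src"), ("iframe", "src"),
}

class LinkExtractor(HTMLParser):
    """Collects the URLs found in the attributes listed above."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in URL_ATTRIBUTES and value:
                self.urls.append(value)

def extract_urls(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.urls
```

The extracted values are usually relative; a crawler would resolve them against the URL of the page before queuing them.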
Scope of a crawler
Web-scale
The Web is infinite! Avoid robot traps by putting depth or page-number limits on each Web server
Focus on important pages [Abiteboul et al., 2003]
Web servers under a list of DNS domains: easy filtering of URLs
A given topic: focused crawling techniques [Chakrabarti et al., 1999, Diligenti et al., 2000, Gouriten et al., 2014] based on classifiers of Web page content and predictors of the interest of a link.
The national Web (cf. public deposit, national libraries): what is this? [Abiteboul et al., 2002]
A given Web site: what is a Web site? [Senellart, 2005]
Identification of duplicate Web pages
Problem
Identifying duplicates or near-duplicates on the Web to prevent multiple indexing
trivial duplicates: same resource at the same canonicalized URL, e.g.,
http://example.com:80/toto
http://example.com/titi/../toto
exact duplicates: identification by hashing
near-duplicates (timestamps, tip of the day, etc.): more complex!
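The first two cases can be sketched as follows: URL canonicalization for trivial duplicates, and a content hash for exact duplicates. The rules chosen here (lower-casing, dropping default ports, resolving `..`, dropping the fragment) are a simplified assumption, not a complete canonicalization procedure; in particular `posixpath.normpath` also removes trailing slashes, which a real crawler may want to keep.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Map equivalent spellings of a URL to a single canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default one for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((scheme, host, path, parts.query, ""))

def content_hash(body: bytes) -> str:
    """Fingerprint used to detect exact duplicates."""
    return hashlib.sha256(body).hexdigest()
```

Near-duplicates cannot be caught this way, since a single changed byte yields a different hash.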
Crawling ethics
Standard for robot exclusion: robots.txt at the root of a Web server [Koster, 1994].
User-agent: *
Allow: /searchhistory/
Disallow: /search
Per-page exclusion.
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW">
Per-link exclusion.
<a href="toto.html" rel="nofollow">Toto</a>
Avoid Denial of Service (DoS): wait 1 s between two successive requests to the same Web server
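Python's standard library ships a robots.txt parser, `urllib.robotparser`, which can be used to check the example file above. In this sketch the rules are parsed from a string rather than fetched, and the user-agent name `MyCrawler` and the host `www.example.com` are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /searchhistory/
Disallow: /search
"""

def build_parser(text):
    """Parse robots.txt rules given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

robots = build_parser(ROBOTS_TXT)
allowed_history = robots.can_fetch("MyCrawler", "http://www.example.com/searchhistory/")
allowed_search = robots.can_fetch("MyCrawler", "http://www.example.com/search")
allowed_other = robots.can_fetch("MyCrawler", "http://www.example.com/about")
```

A polite crawler would call `can_fetch` before every request and additionally sleep between two requests to the same server; the per-page and per-link exclusions have to be honoured separately when parsing the HTML.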
Parallel processing
Network delays, waits between requests:
Per-server queue of URLs
Parallel processing of requests to different hosts:
multi-threaded programming
asynchronous inputs and outputs (select, classes from java.util.concurrent): less overhead
Use of keep-alive to reduce connection overheads
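The per-server queue idea can be illustrated without any networking: URLs are grouped by host, and the scheduler takes one URL from each host in turn, so that consecutive requests to the same server are naturally spaced out. This is a minimal sketch of the scheduling only; the function name is ours, and a real crawler would also issue the requests concurrently and enforce the waiting time.

```python
from collections import deque
from urllib.parse import urlsplit

def round_robin_by_host(urls):
    """Order URLs so that hosts are visited in turn, one URL per host per round."""
    queues = {}
    for url in urls:
        # One FIFO queue per server, created on first use.
        queues.setdefault(urlsplit(url).hostname, deque()).append(url)
    schedule = []
    while queues:
        for host in list(queues):
            schedule.append(queues[host].popleft())
            if not queues[host]:
                del queues[host]   # this server has no pending URL left
    return schedule
```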
General Architecture [Chakrabarti, 2003]
Refreshing URLs
Content on the Web changes
Different change rates:
online newspaper main page: every hour or so
published article: virtually no change
Continuous crawling, and identification of change rates for adaptive crawling: how to know the time of last modification of a Web page?
Estimating the Freshness of a Page
1. Check HTTP timestamp.
2. Check content timestamp.
3. Compare a hash of the page with a stored hash.
4. Non-significant differences (ads, fortunes, request timestamp):
only hash text content, or “useful” text content;
compare distribution of n-grams (shingling);
or even compute edit distance with previous version.
Adapting strategy to each different archived website?
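Steps 3 and 4 can be sketched as follows: hash the normalized text first, and fall back on a comparison of word n-gram sets (shingles) when the hashes differ. The Jaccard coefficient and the 0.9 threshold are our assumptions for the illustration; the slides do not prescribe a particular similarity measure.

```python
import hashlib
import re

def shingles(text, n=3):
    """Set of word n-grams of the text (case and punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def text_hash(text):
    """Hash of the normalized text content only."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def has_changed(old, new, threshold=0.9):
    """True when the new version differs significantly from the old one."""
    if text_hash(old) == text_hash(new):
        return False
    return jaccard(shingles(old), shingles(new)) < threshold
```

The threshold would typically be tuned per website, which relates to the question of adapting the strategy to each archived site.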
Crawling Modern Web Sites
Some modern Web sites only work when cookies are activated (session cookies), or when JavaScript code is interpreted
Regular Web crawlers (wget, Heritrix, Apache Nutch) usually don’t do cookie management and don’t interpret JavaScript code
Crawling some Web sites therefore requires more advanced tools
Advanced crawling tools
Web scraping frameworks such as scrapy (Python) or WWW::Mechanize (Perl) simulate a Web browser interaction and cookie management (but no JS interpretation)
Headless browsers such as htmlunit simulate a Web browser, including simple JavaScript processing
Browser instrumentors such as Selenium allow full instrumentation of a regular Web browser (Chrome, Firefox, Internet Explorer)
OXPath: a full-fledged navigation and extraction language for complex Web sites [Sellers et al., 2011] (demo)
Templated Web Site
Many Web sites (especially Web forums and blogs) use one of a few content management systems (CMS)
Web sites that use the same CMS will be similarly structured, present a similar layout, etc.
Information is somewhat structured in CMSs: publication date, author, tags, forums, threads, etc.
Some structure differences may exist when Web sites use different versions, or different themes, of a CMS
Crawling CMS-Based Web Sites
Traditional crawling approaches crawl Web sites independently of the nature of the sites and of their CMS
When the CMS is known:
Potential for much more efficient crawling strategies (avoid pages with redundant information, uninformative pages, etc.)
Potential for automatic extraction of structured content
Two ways of approaching the problem:
Have a handcrafted knowledge base of known CMSs, their characteristics, how to crawl and extract information [Faheem and Senellart, 2013b,a] (AAH) (demo)
Automatically infer the best way to crawl a given CMS [Faheem and Senellart, 2014] (ACE)
Need to be robust w.r.t. template change
Detecting CMSs
One main challenge in intelligent crawling and content extraction is to identify the CMS and then apply the best crawling strategy accordingly
Detecting the CMS using:
1. URL patterns
2. HTTP metadata
3. textual content
4. XPath patterns, etc.
These can be manually described (AAH), or automatically inferred (ACE)
For instance, the vBulletin Web forum content management system can be identified by searching for a reference to a vbulletin_global.js JavaScript script, using a simple //script/@src XPath expression.
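The vBulletin test can be approximated with the standard library alone. Python's `html.parser` has no XPath support, so this sketch swaps the `//script/@src` expression for an equivalent event-based scan of `<script src="...">` attributes; the class and function names are ours, and this is not the AAH or ACE implementation.

```python
from html.parser import HTMLParser

class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> element."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def looks_like_vbulletin(html):
    """Heuristic: the page references the vbulletin_global.js script."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    collector.close()
    return any("vbulletin_global.js" in src for src in collector.sources)
```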
Crawling http://www.rockamring-blog.de/
[Faheem and Senellart, 2014]
[Figure: number of distinct 2-grams (×1,000; axis from 0 to 300) against the number of HTTP requests (axis from 0 to 6,000), for ACE, AAH, and wget.]
Most popular Web sites
1 google.com, 2 facebook.com, 3 youtube.com, 4 yahoo.com, 5 baidu.com, 6 wikipedia.org, 7 live.com,
8 twitter.com, 9 qq.com, 10 amazon.com, 11 blogspot.com, 12 linkedin.com, 13 google.co.in, 14 taobao.com,
15 sina.com.cn, 16 yahoo.co.jp, 17 msn.com, 18 wordpress.com, 19 google.com.hk, 20 t.co, 21 google.de,
22 ebay.com, 23 google.co.jp, 24 googleusercontent.com, 25 google.co.uk, 26 yandex.ru, 27 163.com, 28 weibo.com
(Alexa)
Social networking sites
Sites with social networking features (friends, user-shared content, user profiles, etc.)
Social data on the Web
Huge numbers of users (2012):
Facebook: 900 million
QQ: 540 million
W. Live: 330 million
Weibo: 310 million
Google+: 170 million
Twitter: 140 million
LinkedIn: 100 million
Huge volume of shared data:
250 million tweets per day on Twitter (about 3,000 per second on average!), including statements by heads of state, revelations of political activists, etc.
Crawling Social Networks
Theoretically possible to crawl social networking sites using a regular Web crawler
Sometimes not possible: https://www.facebook.com/robots.txt
Often very inefficient, considering politeness constraints
Better solution: use provided social networking APIs
https://dev.twitter.com/docs/api/1.1
https://developers.facebook.com/docs/graph-api/reference/v2.1/
https://developer.linkedin.com/apis
https://developers.google.com/youtube/v3/
Also possible to buy access to the data, directly from the social network or from brokers such as http://gnip.com/
Social Networking APIs
Most social networking Web sites (and some other kinds of Web sites) provide APIs to effectively access their content
Usually a RESTful API, occasionally SOAP-based
Usually require a token identifying the application using the API, sometimes a cryptographic signature as well
May access the API as an authenticated user of the social network, or as an external party
APIs seriously limit the rate of requests:
https://dev.twitter.com/docs/api/1.1/get/search/tweets
REST
Mode of interaction with a Web service
Follow the KISS (Keep it Simple, Stupid) principle
Each request to the service is a simple HTTP GET method
Base URL is the URL of the service
Parameters of the service are sent as HTTP parameters (in the URL)
HTTP response code indicates success or failure
Response contains structured output, usually as JSON or XML
No side effect, each request independent of previous ones
Example: http://graph.facebook.com:80/?ids=7901103
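The interaction pattern above can be sketched in a few lines: build the GET URL from the base URL and the parameters, then interpret the status code and decode the structured output. No request is sent here; the function names are ours, and the JSON payload used in the usage lines is a made-up placeholder, not an actual response of the service.

```python
import json
from urllib.parse import urlencode

def build_request_url(base_url, **params):
    """Service parameters are sent as HTTP parameters in the URL."""
    return f"{base_url}?{urlencode(params)}"

def parse_response(status, body):
    """The HTTP response code indicates success or failure; the body is JSON."""
    if status != 200:
        raise RuntimeError(f"service returned HTTP {status}")
    return json.loads(body)

url = build_request_url("http://graph.facebook.com:80/", ids=7901103)

# Hypothetical payload, for illustration only.
decoded = parse_response(200, '{"7901103": {"id": "7901103"}}')
```

Because each request is independent and side-effect free, such calls are easy to cache, retry and parallelize, subject to the rate limits imposed by the API.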
The Case of Twitter
Two main APIs:
REST APIs, including search, getting information about a user, a list, followers, etc. https://dev.twitter.com/docs/api/1.1
Streaming API, providing real-time results
Very limited history available
Search can be on keywords, language, geolocation (for a small portion of tweets)
Cross-Network Crawling
Often useful to combine results from different social networks
Numerous libraries facilitating SN API accesses (twipy, Facebook4J, FourSquare VP C++ API...), incompatible with each other...
Some efforts at generic APIs (OneAll, APIBlender [Gouriten et al., 2014]) (demo)
Example use case: No API to get all check-ins from FourSquare, but a number of check-ins are available on Twitter; given results of Twitter Search/Streaming, use FourSquare API to get information about check-in locations.
The Deep Web
Definition (Deep Web, Hidden Web, Invisible Web)
All the content on the Web that is not directly accessible through hyperlinks. In particular: HTML forms, Web services.
Size estimate: 500 times more content than on the surface Web! [BrightPlanet, 2000]
Hundreds of thousands of deep Web databases [Chang et al., 2004]
Sources of the Deep Web
Example
Yellow Pages and other directories;
Library catalogs;
Weather services;
US Census Bureau data;
etc.
Discovering Knowledge from the Deep Web [Nayak et al., 2012]
Content of the deep Web hidden to classical Web search engines (they just follow links)
But very valuable and high quality!
Even services allowing access through the surface Web (e.g., e-commerce) have more semantics when accessed from the deep Web
How to benefit from this information?
How to analyze, extract and model this information?
Focus here: automatic, unsupervised methods, for a given domain of interest
Extensional Approach
[Diagram: discovery of services on the WWW, siphoning of their content (with bootstrapping), and indexing of the siphoned content into an index.]
Notes on the Extensional Approach
Main issues:
Discovering services
Choosing appropriate data to submit forms
Use of data found in result pages to bootstrap the siphoning process
Ensure good coverage of the database
Approach favored by Google, used in production [Madhavan et al., 2006]
Not always feasible (huge load on Web servers)
Intensional Approach
[Diagram: discovery of services on the WWW, probing and analyzing of each form, which is then wrapped as a Web service that can be queried.]
Notes on the Intensional Approach
More ambitious [Chang et al., 2005, Senellart et al., 2008]
Main issues:
Discovering services
Understanding the structure and semantics of a form
Understanding the structure and semantics of result pages
Semantic analysis of the service as a whole
Query rewriting using the services
No significant load imposed on Web servers
The Semantic Web
A Web in which the resources are semantically described
annotations give information about a page, explain an expression in a page, etc.
More precisely, a resource is anything that can be referred to by a URI
a web page, identified by a URL
a fragment of an XML document, identified by an element node of the document,
a web service,
a thing, an object, a concept, a property, etc.
Semantic annotations: logical assertions that relate resources to some terms in associated ontologies
Ontologies
Formal descriptions providing human users a shared understanding of a given domain
A controlled vocabulary
Formally defined so that it can also be processed by machines
Logical semantics that enables reasoning
Reasoning is the key for different important tasks of Web data management, in particular:
to answer queries (over possibly distributed data)
to relate objects in different data sources, enabling their integration
to detect inconsistencies or redundancies
to refine queries with too many answers, or to relax queries with no answer
Where Do Ontologies Come From?
Manually crafted to represent the knowledge of a specific domain (e.g., life sciences)
Exported from classical Web databases
Through information extraction from the Web, Wikipedia, etc. (e.g., DBpedia, YAGO)
Private to a company or public
Some ontologies focus on instances, others on a schema (see further)
Value of the Semantic Web: bits of ontologies can be re-used in another, and ontologies can be mapped through an owl:sameAs link
[Figure: Linking Open Data cloud diagram, as of September 2011, by Richard Cyganiak and Anja Jentzsch, http://lod-cloud.net/. Interlinked datasets (e.g., DBpedia, YAGO, Freebase, MusicBrainz, UniProt, data.gov.uk) are grouped by domain: media, geographic, publications, government, cross-domain, life sciences, user-generated content.]
Classes and class hierarchy
Backbone of the ontology
AcademicStaff is a Class (a class will be interpreted as a set of objects)
AcademicStaff isa Staff (isa is interpreted as set inclusion)
FacultyComponent
  Course
    MathCourse
      Probabilities, Algebra, Logic
    CSCourse
      DB, AI, Java
  Student
    UndergraduateStudent, MasterStudent, PhDStudent
  Department
    PhysicsDept, MathsDept, CSDept
  Staff
    AcademicStaff
      Lecturer, Researcher, Professor
    AdministrativeStaff
Relations
Declaration of relations with their signature
(Relations will be interpreted as binary relations between objects)
TeachesIn(AcademicStaff, Course)
if one states that “X TeachesIn Y”, then X belongs to AcademicStaff and Y to Course
TeachesTo(AcademicStaff, Student)
Leads(Staff, Department)
Instances
Classes have instances
Dupond is an instance of the class Professor; this corresponds to the fact: Professor(Dupond)
Relations also have instances
(Dupond, CS101) is an instance of the relation TeachesIn; this corresponds to the fact: TeachesIn(Dupond, CS101)
The instance statements can be seen as (and stored in) a database
Ontology = schema + instance
Schema (TBox)
The set of class and relation names
The signatures of relations and also constraints
The constraints are used for two purposes:
– checking data consistency (like dependencies in databases)
– inferring new facts
Instance (ABox)
The set of facts
The set of base facts together with the inferred facts should satisfy the constraints
Ontology (i.e., Knowledge Base) = Schema + Instance
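The split between schema and instance can be illustrated with a toy knowledge base built from the running example. This is a deliberately small sketch, not a description of how ontology reasoners work: the schema only holds a few isa links and one signature, the fact `Course(CS101)` is our assumption, and signatures are used here only to check consistency (one of the two uses listed above).

```python
# Schema (TBox): class hierarchy and relation signatures.
SUBCLASS_OF = {"Professor": "AcademicStaff", "AcademicStaff": "Staff"}
SIGNATURES = {"TeachesIn": ("AcademicStaff", "Course")}

# Instance (ABox): base facts. Course(CS101) is assumed for the example.
class_facts = {("Professor", "Dupond"), ("Course", "CS101")}
relation_facts = {("TeachesIn", "Dupond", "CS101")}

def infer_classes(facts):
    """Add the facts implied by isa: C(x) and C isa D give D(x)."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for cls, individual in list(inferred):
            parent = SUBCLASS_OF.get(cls)
            if parent and (parent, individual) not in inferred:
                inferred.add((parent, individual))
                changed = True
    return inferred

def violations(class_facts, relation_facts):
    """Relation facts whose arguments do not fit the declared signature."""
    closed = infer_classes(class_facts)
    errors = []
    for rel, x, y in sorted(relation_facts):
        domain, range_ = SIGNATURES[rel]
        if (domain, x) not in closed or (range_, y) not in closed:
            errors.append((rel, x, y))
    return errors
```

Here `Staff(Dupond)` is never stated but follows from the schema, which is what makes `TeachesIn(Dupond, CS101)` consistent with its signature.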
Where can Semantic Content be Found?
In the linked data, through Web-available RDF data:
dumps of an entire ontology, in one of the RDF serialization formats (RDF/XML, Turtle, N-Triples)
crawlable RDF content, with small fragments pointing to other fragments
a SPARQL endpoint
HTML annotated with RDFa, cf. http://www.w3.org/TR/rdfa-syntax/
Other popular semantic content embedded in Web pages:
microformats (hCard, vCard, etc.), microdata (cf. http://schema.org/). Not directly in the spirit of the Semantic Web, but heavily used.
RDF content used internally in a company
How to Acquire Semantic Content?
Much easier to exploit, as it is already semantically described
Individual resources (dumps, SPARQL endpoints) that have been identified as valuable can be directly exploited
RDFa content, microformats, microdata, can be discovered from regular Web crawls
Not perfect! There are errors, lies, etc.
Outline
The World Wide Web
Acquiring Various Forms of Web Content
Exploiting Acquired Information
  Information Extraction
  Graph Mining
  Opinion Mining
Opportunities for Market Insights
Information Extraction
See Parts “Instance Extraction” and “Fact Extraction” from my colleague Fabian Suchanek’s lecture
http://suchanek.name/work/teaching/IE2010a.pdf
The Web Graph
The World Wide Web seen as a (directed) graph:
Vertices: Web pages
Edges: hyperlinks
Same for other interlinked environments:
dictionaries
encyclopedias
scientific publications
social networks
Google’s PageRank [Brin and Page, 1998]
Idea
Important pages are pages pointed to by important pages.
G: transition matrix of the Web graph, with entries
$$ g_{ij} = \begin{cases} 0 & \text{if there is no link from page } i \text{ to page } j; \\ \dfrac{1}{n_i} & \text{otherwise, with } n_i \text{ the number of outgoing links of page } i. \end{cases} $$
Definition (Tentative)
Probability that the surfer following the random walk in G has arrived on page i at some distant given point in the future.
$$ \mathrm{pr}(i) = \left( \lim_{k \to +\infty} (G^T)^k v \right)_i $$
where v is some initial column vector.
Illustrating PageRank Computation
[Figure: PageRank values on a 10-vertex example graph over successive iterations, starting from a uniform value of 0.100 on every vertex and stabilizing around 0.050, 0.234, 0.091, 0.149, 0.058, 0.065, 0.095, 0.142, 0.097, 0.019.]
PageRank With Damping
May not always converge, or convergence may not be unique.
To fix this, the random surfer can at each step randomly jump to any page of the Web with some probability d (1 − d: damping factor).
$$ \mathrm{pr}(i) = \left( \lim_{k \to +\infty} \bigl( (1 - d)\, G^T + d\, U \bigr)^k v \right)_i $$
where U is the matrix with all values equal to 1/N, with N the number of vertices.
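The limit above is usually approximated by iterating the multiplication a fixed number of times (power iteration). The sketch below follows the formula with d as the jump probability; the default d = 0.15, the uniform initial vector v, the number of iterations, and the treatment of pages without outgoing links (their weight is spread uniformly) are our assumptions. Every link target is expected to appear as a key of `links`.

```python
def pagerank(links, d=0.15, iterations=100):
    """links maps each page to the list of pages it points to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # initial vector v (uniform)
    for _ in range(iterations):
        new_rank = {p: d / n for p in pages}      # random jump: the d * U term
        for p in pages:
            out = links[p]
            if not out:
                # Assumption: a page without outgoing links jumps anywhere.
                for q in pages:
                    new_rank[q] += (1 - d) * rank[p] / n
            else:
                share = (1 - d) * rank[p] / len(out)   # g_pq = 1 / n_p
                for q in out:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
example = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(example)
```

With d = 0 the function computes the undamped iteration of the tentative definition, which may oscillate instead of converging on some graphs.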
Using PageRank to Score Search Results
PageRank: global score, independent of the query
Can be used to raise the weight of important pages, combined with some scoring function dependent on the query (here d denotes a document, not the jump probability):
$$ \mathrm{final}(q, d) = \mathrm{score}(q, d) \times \mathrm{pr}(d) $$
PageRank is only useful in directed graphs! In an undirected graph, it is proportional to the degree
HITS [Kleinberg, 1999]
Idea
Two kinds of important pages: hubs and authorities. Hubs are pages that point to good authorities, whereas authorities are pages that are pointed to by good hubs.
G′: adjacency matrix (with 0 and 1 values) of a subgraph of the Web.
We use the following iterative process (starting with a and h vectors of norm 1):
$$ \begin{cases} a := \dfrac{1}{\lVert {G'}^{T} h \rVert}\, {G'}^{T} h \\[2ex] h := \dfrac{1}{\lVert G' a \rVert}\, G' a \end{cases} $$
Converges under some technical assumptions to authority and hub scores.
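The iterative process can be sketched directly on an adjacency-list representation of the subgraph. The Euclidean norm, the fixed number of iterations and the function name are our choices for the illustration; the toy graph is made up.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the pages it points to. Returns (authority, hub)."""
    pages = sorted(set(links) | {q for out in links.values() for q in out})
    start = 1.0 / math.sqrt(len(pages))           # vectors of norm 1
    auth = {p: start for p in pages}
    hub = {p: start for p in pages}
    for _ in range(iterations):
        # a := G'^T h, then normalize: authorities are pointed to by good hubs.
        new_auth = {p: 0.0 for p in pages}
        for p, out in links.items():
            for q in out:
                new_auth[q] += hub[p]
        norm = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
        auth = {p: x / norm for p, x in new_auth.items()}
        # h := G' a, then normalize: hubs point to good authorities.
        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
        hub = {p: x / norm for p, x in new_hub.items()}
    return auth, hub

# Hypothetical subgraph: three hub-like pages pointing to two targets.
authority, hubness = hits({"h1": ["x", "y"], "h2": ["x", "y"], "h3": ["x"]})
```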
Using HITS to Order Web Query Results
1. Retrieve the set D of Web pages matching a keyword query.
2. Retrieve the set D* of Web pages obtained from D by adding all linked pages, as well as all pages linking to pages of D.
3. Build from D* the corresponding subgraph G′ of the Web graph.
4. Compute iteratively hubs and authority scores.
5. Sort documents from D by authority scores.
Less efficient than PageRank, because the scores are local (they depend on the query).
Discovery of communities
Classical problem in social networks: identifying communities of users (or of content) using the graph structure
Two subproblems:
1. Given some initial vertex or vertex set, finding the corresponding community
2. Given the graph as a whole, finding a partition in communities
Maximum Flow / Minimum Cut
[Figure: flow network from a source to a sink, with edge capacities between 1 and 6.]
Use of a maximum flow computation algorithm [Goldberg and Tarjan, 1988] to separate a seed of users from the remainder of the graph
Complexity O(n²m) (n: vertices, m: edges)
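To illustrate how a maximum flow yields a separation, the sketch below uses the simpler Edmonds-Karp algorithm (shortest augmenting paths) rather than the push-relabel algorithm of Goldberg and Tarjan cited above; its complexity is different, but it computes the same maximum flow value. The vertices still reachable from the source in the final residual graph form the source side of a minimum cut, i.e., the community around the seed. The graphs in the usage lines are made up.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """capacity: {(u, v): c}. Returns (flow value, source side of a minimum cut)."""
    residual, neighbours = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)            # reverse edge for cancellations
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    def reachable():
        """BFS in the residual graph; returns the parent of each reached vertex."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbours.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable()
        if sink not in parent:
            return flow, set(parent)              # no augmenting path is left
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical graph: a tightly connected seed, weakly linked to the rest.
chain = {("s", "u1"): 10, ("u1", "u2"): 10, ("u2", "u3"): 1, ("u3", "t"): 10}
value, community = max_flow_min_cut(chain, "s", "t")
```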
Markov Cluster Algorithm (MCL) [van Dongen, 2000]
Graph clustering algorithm
Also based on maximum flow simulation, in the whole graph
Iteration of a matrix computation alternating:
Expansion (matrix multiplication, corresponding to flow propagation)
Inflation (non-linear operation to increase heterogeneity)
Complexity: O(n³) for an exact computation, O(n) for an approximate one
[van Dongen, 2000]
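The alternation of expansion and inflation can be sketched on a dense column-stochastic matrix. This is the exact, cubic variant, written for readability rather than speed; the inflation exponent r = 2, the added self-loops, the number of iterations and the way clusters are read off the final matrix (columns sharing the same set of non-negligible rows) are our assumptions, and the example graph is made up.

```python
def normalize_columns(m):
    """Rescale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]

def expand(m):
    """Expansion: matrix product m * m (flow propagation)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def inflate(m, r=2.0):
    """Inflation: entry-wise power followed by column normalization."""
    return normalize_columns([[x ** r for x in row] for row in m])

def mcl(adjacency, r=2.0, iterations=30):
    n = len(adjacency)
    # Assumption: add self-loops before normalizing.
    m = normalize_columns(
        [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        m = inflate(expand(m), r)
    # Vertices whose columns are supported by the same rows share a cluster.
    clusters = {}
    for j in range(n):
        support = frozenset(i for i in range(n) if m[i][j] > 1e-6)
        clusters.setdefault(support, []).append(j)
    return sorted(clusters.values())

# Hypothetical graph: two triangles (0, 1, 2) and (3, 4, 5) joined by edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
```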
Deletion of the edges with the highest betweenness [Newman and Girvan, 2004]
Top-down graph clustering algorithm
Betweenness of an edge: number of shortest paths between two arbitrary vertices going through this edge
General principle:
1. Compute the betweenness of each edge in the graph
2. Remove the edge with the highest betweenness
3. Redo the whole process, betweenness computation included
Complexity: O(n³) for a sparse graph
[Newman and Girvan, 2004]
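Steps 1 and 2 can be sketched for an unweighted, undirected graph using breadth-first search from every vertex and a backward accumulation of path counts (the approach commonly attributed to Brandes). When several shortest paths connect a pair of vertices, each path counts for an equal fraction. Function names and the example graph are ours.

```python
from collections import deque

def edge_betweenness(graph):
    """graph: {vertex: set of neighbours}, undirected. Returns {edge: betweenness}."""
    score = {}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        preds = {v: [] for v in graph}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating path fractions.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    # Each pair of vertices was counted once from each endpoint.
    return {edge: c / 2 for edge, c in score.items()}

def remove_most_central_edge(graph):
    """One step of the top-down clustering: delete the edge of highest betweenness."""
    scores = edge_betweenness(graph)
    edge = max(scores, key=scores.get)
    u, v = edge
    graph[u].discard(v)
    graph[v].discard(u)
    return edge

# Hypothetical graph: two triangles joined by the bridge 2-3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(graph)
removed = remove_most_central_edge(graph)
```

Repeating the removal (with betweenness recomputed each time, as in step 3) progressively splits the graph into communities.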
Opinion Mining
See my colleague Chloé Clavel’s lecture http://pierre.senellart.com/enseignement/2013-2014/inf344/10-opinion-mining.pdf
Opportunities for Market Insights
Crawl a competitor’s Web site, apply a wrapper to extract structured information, regularly refresh this crawl ⇒ a local database of a competitor’s products and prices, ready to be analyzed
Crawl Web forums, blogs, social networking sites, for opinions about a brand, and mine the obtained social network ⇒ identify and follow opinion leaders, and target them for marketing
Exploit Deep Web forms to crawl all patents pertaining to a particular topic, perform instance extraction to identify all molecules cited in the patents, use linked open data ontologies to connect these molecules to known metabolic pathways ⇒ get more insight into which biological phenomena are targeted by competitors’ inventions