
(1)

DIADEM 1.0 Workshop, October 3, 2011

Intelligent Archiving of the Social Web

Pierre Senellart, Télécom ParisTech

(2)

The ARCOMEM project (FP7 IP, 2011–2013)

Objective

Leverage the social Web to build rich Web archives about specific topics, using intelligent mechanisms: semantic analysis, focused crawling, social network mining, etc.

Examples

Understand and preserve for future generations:

What people on the Web think about the Greek economic crisis
How a music festival is perceived in social networks

Remark: Archivists are concerned with preserving Web content

(3)


The ARCOMEM consortium

(4)

Architecture of a classical crawler

Crawl Definition
Queue Management
Page Fetching
Link Extraction
URL Selection
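The classical pipeline above can be written as a single loop. A minimal, illustrative sketch (not ARCOMEM code; `fetch` and `select` are injected callbacks standing in for the page-fetching and URL-selection components):

```python
import re
from collections import deque
from urllib.parse import urljoin

def crawl(seeds, select, fetch, max_pages=100):
    """Minimal classical crawler loop: queue management, page
    fetching, link extraction, URL selection."""
    queue = deque(seeds)                    # queue management
    seen = set(seeds)
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)                   # page fetching
        archive[url] = html
        # naive link extraction (a real crawler would parse the DOM)
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link not in seen and select(link):   # URL selection
                seen.add(link)
                queue.append(link)
    return archive
```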

(5)


Architecture of the ARCOMEM crawler

[Figure: the classical pipeline (Crawl Definition, Queue Management, Page Fetching, Link Extraction, URL Selection) extended with ARCOMEM components:]

Intelligent Crawl Definition
ARCOMEM Database
Queue Management
Resource Fetching
Application-Aware Helper
Resource Selection & Prioritization
Heritrix Crawl Definition
Offline Module
Online Module

(6)

Analysis modules

Entity recognition
Event extraction
Topic recognition
Sentiment analysis
Social graph mining

The modules are responsible for focusing the crawl: they change the crawl definition and give (asynchronous) scored feedback on crawled Web content and on crawling actions to be performed.
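The scored-feedback idea can be sketched as a priority queue over crawling actions; the class below is illustrative only (names and scoring scheme are assumptions, not the ARCOMEM implementation):

```python
import heapq

class FocusedQueue:
    """Priority queue for a focused crawl: analysis modules post
    (asynchronous) relevance scores, and higher-scored URLs are
    fetched first."""
    def __init__(self):
        self._heap = []
        self._counter = 0   # tie-breaker for equal scores
    def push(self, url, score=0.0):
        # heapq is a min-heap, so negate the score
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
    def feedback(self, url, score):
        # an analysis module (entity, topic, sentiment...) rescores a URL
        self.push(url, score)
    def pop(self):
        return heapq.heappop(self._heap)[2]
```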

(7)


Crawling Web applications

The optimal strategy for crawling different kinds of Web sites and Web applications varies a lot:

In a wiki, do not follow the “Edit this page” link
On Twitter, you should follow AJAX events to build full pages
On Facebook, nothing can be crawled:

User-agent: *
Disallow: /

but the Facebook API can be used to get part of the content
A Dotclear blog can be archived efficiently by following the monthly archives and disregarding other links
Certain Web sites need to be accessed through Web forms

In addition, social Web sites contain Web objects (blog posts, tweets, user profiles, etc.) that should be archived as objects in their own right.
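A polite crawler checks such a robots.txt before fetching. With Python's standard library (independent of ARCOMEM; the user-agent string is invented), the directives above deny every URL:

```python
from urllib.robotparser import RobotFileParser

# Feed the robots.txt shown above to the standard parser
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# "Disallow: /" for every agent blocks the whole site
rp.can_fetch("ARCOMEM-crawler", "http://www.facebook.com/somepage")  # False
```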

(8)

Application-aware helper

Crawlers should be aware of the Web application they are crawling!

We assume a knowledge base describing a hierarchy of Web applications (site type, CMS used, specific sites), featuring:

detection of the type of the currently crawled Web application
description of the crawling actions to perform, depending on the Web application

This KB can be manually created (by developers? by archivists?) or, ultimately, mined from the Web.
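As a toy sketch of such a knowledge base, each entry pairs a detection test with crawling actions; the schema, site types, and detection rules below are invented for illustration, not the ARCOMEM KB format:

```python
# Toy knowledge base for an application-aware helper.
KNOWLEDGE_BASE = [
    {
        "application": "MediaWiki",
        # detection: MediaWiki pages advertise themselves in a meta tag
        "detect": lambda html: 'content="MediaWiki' in html,
        # actions: links to skip during the crawl
        "avoid_links": ["action=edit", "action=history"],
    },
    {
        "application": "Dotclear blog",
        "detect": lambda html: "Dotclear" in html,
        # actions: follow the monthly archives only
        "follow_links": ["/archive/"],
    },
]

def identify(html):
    """Return the first matching Web application, or None."""
    for entry in KNOWLEDGE_BASE:
        if entry["detect"](html):
            return entry["application"]
    return None
```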

(9)


A navigation and extraction language for ARCOMEM

Requirements for a language describing crawling actions:

URLs to follow
JavaScript and AJAX actions to perform
Web forms to submit
parts of Web pages to extract
unbounded number of iterations
(possibly) RESTful API interaction

The language should remain declarative (we hope to have expressions written by non-experts, or learned from examples)
Efficient processing (no slowing down of the crawler, at least not more than needed by layout rendering)

OXPath is our best bet!

(12)

Minor things we miss in OXPath

Fine control over AJAX waiting delays
POST requests without form submissions
XPath navigation in JSON documents (does it work in XML documents?)
Regular expressions to break up text nodes?
More generally, what features (say, from XPath 2.0) can be added without hurting efficiency?
A way to feed Web content back into OXPath patterns
A browser less buggy than HTMLUnit; we are eager to get our hands on the Mozilla/WebKit version!

We can also help!
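The text-node point can be illustrated with plain Python: a regular expression splitting a single text node into fields that an extraction language would want to expose separately (the string and its format are invented for illustration):

```python
import re

# One text node mixing a date and a title, as a blog page might render it
node = "2011-10-03 - Intelligent Archiving of the Social Web"
m = re.match(r"(\d{4}-\d{2}-\d{2}) - (.*)", node)
date, title = m.groups()
```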

(14)

Marrying OXPath with a crawler

Ideally, every network interaction should go through the crawler; this helps with:

1. archival of content
2. prioritization of crawling actions
3. policy management (crawling ethics)
4. parallelization

A few directions to make this work:

The crawler acts as an HTTP proxy for the OXPath processor
The state of the OXPath processor can be dumped and restored at a later point
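A minimal sketch of the first direction, assuming the processor can be pointed at a fetch wrapper (class and method names are invented; a real proxy would speak HTTP):

```python
from urllib.request import urlopen

class ArchivingFetcher:
    """Every network interaction goes through the crawler: the
    processor calls fetch() instead of talking to the network
    directly, so the crawler can archive content and apply policy."""
    def __init__(self, opener=None):
        self.archive = {}                          # 1. archival of content
        self.opener = opener or (lambda url: urlopen(url).read())
    def allowed(self, url):
        return True                                # 3. hook for crawling ethics
    def fetch(self, url):
        if not self.allowed(url):
            raise PermissionError(url)
        body = self.opener(url)
        self.archive[url] = body
        return body
```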

(15)


The OXPath hammer

OXPath is a pretty big hammer...

It is not always needed: some crawling actions can be described simply by URLs and XPath
We also want to avoid the cost of the layout engine as much as we can

This raises the following two questions:

Can we detect when a layout engine is not needed? (No DOM-modifying JavaScript code? No automatically launched DOM-modifying code?)
Can we detect whether an OXPath expression can actually be decomposed into URL fetching and XPath evaluation sequences, without any JavaScript execution?
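A crude static test for the first question, using only the standard library: a page with no <script> element and no on* event-handler attribute cannot run DOM-modifying JavaScript, so a plain fetch plus XPath may suffice. This is a heuristic sketch only (external resources, frames, etc. need more care):

```python
from html.parser import HTMLParser

class _ScriptDetector(HTMLParser):
    """Flags <script> tags and on* event-handler attributes."""
    def __init__(self):
        super().__init__()
        self.needed = False
    def handle_starttag(self, tag, attrs):
        if tag == "script" or any(name.startswith("on") for name, _ in attrs):
            self.needed = True

def needs_layout_engine(html):
    """True if the page might run DOM-modifying JavaScript."""
    p = _ScriptDetector()
    p.feed(html)
    return p.needed
```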

(16)

Other interests of mine relevant to DIADEM

JavaScript analysis for understanding Web sites
Probabilistic databases for information extraction
Optimizing the number of requests to a given Web site
Extraction of the main content of a Web page using external semantic clues (RSS feeds, anchor text, etc.)
Combining form analysis, probing, and result page analysis into a deep Web understanding loop
