• Aucun résultat trouvé

LiWA: Living Web Archives Timestamping Pages on on the Web

N/A
N/A
Protected

Academic year: 2022

Partager "LiWA: Living Web Archives Timestamping Pages on on the Web"

Copied!
7
0
0

Texte intégral

(1)

LiWA: Living Web Archives Timestamping Pages on on the Web

Pierre Senellart

May 2008

(2)

HTTP Timestamping

Two timestamping mechanism in HTTP: entity tagsand modification dates. Potentially provided with all requests:

ETag: "497bef-1fcb-47f20645"

Last-Modified: Tue, 01 Apr 2008 09:54:13 GMT

Etag: unique identifier for the provided document, should change if the document changes; can be used in requests withIf-Matchand If-None-Match.

Lat-Modified: last modification time; can be used in requests with If-Modified-Sinceand If-Unmodified-Since.

Information generally provided and very reliable for static content (often includes media files in CMS).

Information hardly ever provided (or with dummy nowdate) for

(3)

HTTP Cache and Proxy Control Mechanisms

Two additional (redundant) items of freshness information, for caches and proxies:

Cache-Control: max-age=60, private Expires: Tue, 01 Apr 2008 13:25:55 GMT

max-age: maximum time in second a document remains fresh.

Expires: time when a document stops being fresh.

Often provided. . .

. . . but with 0 or very low expiration delay.

) does not give any interesting information.

(4)

Timestamps in HTML Web page Content

Very frequentin CMS:

either as aglobaltimestamps (Last modified:);

or onindividualitems: news stories, blogs, etc. (is the global timestamp the max of the individual ones?);

possibly also in meta-data on the Web page: comments, Dublin Core<meta>tags.

Quite easy to identify and extract from the Web page (keywords, date recognition).

Informal: sometimes partial (no time indication), often without timezone.

Might not always be trustworthy.

(5)

Additional semantic timestamps

Files of other types than HTML may have semantic timestamping mechanism:

PDF, Office suite documents, etc.: both creation and modification date available in metadata. Quite reliable.

RSS feeds: reliablesemantic timestamps.

Images, Sounds: EXIF (or similar) metadata. Not always reliable, and the capture date of a picture may have nothing in

common with its publication date.

Semantic external content used for dating a HTML Web page:

Possibility of mapping aRSS feed to a Web page content to date individual items.

Sitemapsprovided by the Web site owner. Allows for providing both timestamps and change rate indications (hourly,

monthly. . . ), but these functionalities are not often used. A few CMS produce all of this: ideal case!

(6)

Additional semantic timestamps

Files of other types than HTML may have semantic timestamping mechanism:

PDF, Office suite documents, etc.: both creation and modification date available in metadata. Quite reliable.

RSS feeds: reliablesemantic timestamps.

Images, Sounds: EXIF (or similar) metadata. Not always reliable, and the capture date of a picture may have nothing in

common with its publication date.

Semantic external content used for dating a HTML Web page:

Possibility of mapping aRSS feed to a Web page content to date individual items.

Sitemapsprovided by the Web site owner. Allows for providing both timestamps and change rate indications (hourly,

monthly. . . ), but these functionalities are not often used. A few

(7)

Estimating Freshness of a Page

1 Check HTTP timestamp.

2 Check content timestamp.

3 Compare a hash of the page with a stored hash.

4 Non-significant differences (ads, fortunes, request timestamp):

only hash text content, or “useful” text content;

compare distribution ofn-grams (shingling);

or even compute edit distance with previous version.

Adapting strategy to each different archived website?

Conclusion

Estimating freshness and timestamp of a Web page: possible in most cases. May require to retrieve the entire page multiple times (not always possible to do conditional GETs).

Références

Documents relatifs

Similar to the solution presented in [1], we rely on the Semantic Assistants framework [2] for brokering text mining pipelines as web services, but our new architecture is based on

Therefore we chose to use the Automated Readability Index, which was designed for real-time computation of readability on elec- tronic typewriters and does not use the number

From our point of view, the Web page is entity carrying information about these communities and this paper describes techniques, which can be used to extract mentioned informa- tion

Finally, when a retrieved answer is to be inserted in the user edited docu- ment ((rich) text area), in order to preserve the semantic markup, since the user produced content is

Setting up SemWIQ requires: (1) setting up a mediator host, (2) setting up wrappers, (3) specification of mappings from local concepts (i.e. database tables and attributes) to

Third, as a complementary measure to the proposed semi-automated tagging, we have shortly described an algorithm for automatically merging misspelled or similar tags and have shown

Figure 3 depicts the general term extraction process in M USEUM F INLAND. The tool Terminator extracts individual term candidates from the museum collection items presented in XML.

This integrated object schema is called the Business Information Model (BIM) and can be used as the basis for a database schema from which an underlying database can be created.