• Aucun résultat trouvé

Introduction to the Web MPRI 2.26.2: Web Data Management

N/A
N/A
Protected

Academic year: 2022

Partager "Introduction to the Web MPRI 2.26.2: Web Data Management"

Copied!
27
0
0

Texte intégral

(1)

Introduction to the Web

MPRI 2.26.2: Web Data Management

Antoine Amarilli Friday, December 7th

1/18

(2)

The old days

1969 ARPANET (ancestor of the Internet) 1974 TCP

1990 The World Wide Web, HTTP, HTML 1994 Yahoo! was founded

1995 Amazon.com, Ebay, AltaVista are founded 1998 Google are founded

2001 Wikipedia is created

2/18

(3)

Statistics

• Around330 milliondomains, including 130 million in.com1

• 54%of content inEnglishand4%in French2

• 48%of people have Internet access, and71%of ages 15–243

• Google knows over onetrillion(1012) of unique URLs4

Thesame contentcan live in many different URLs

Parts of the Web are not indexable: thehidden Webordeep Web

1https://www.businesswire.com/news/home/20170914006386/en/

Internet-Grows-331.9-Million-Domain-Registrations-Quarter

2https://w3techs.com/technologies/overview/content_language/all

3https://www.itu.int/en/ITU-D/Statistics/Documents/facts/

ICTFactsFigures2017.pdf

4https://googleblog.blogspot.fr/2008/07/we-knew-web-was-big.html

3/18

(4)

Table of Contents

Introduction

Web browsers

Other Web clients URLs

4/18

(5)

Historical web browsers

Mosaic First common graphical browser,1993–1997

• From80%in 1994 to<10%in 1996

Netscape Released in1994, based on Mosaic

• From80%in 1996 to<10%in 2001 Internet Explorer Released in1995

• Provided withWindows 95

• IE 6 released in 2001 and reaches80%market share

• Leads to anantitrustlawsuit in the USA, 1998–2001 Firefox Released in2002from Netscape (open-sourced in 1998)

• Freeandopen-source

• Popularizedtabbed browsing

• Attacked IE 6’smonopoly

5/18

(6)

Historical web browsers

Mosaic First common graphical browser,1993–1997

• From80%in 1994 to<10%in 1996 Netscape Released in1994, based on Mosaic

• From80%in 1996 to<10%in 2001

Internet Explorer Released in1995

• Provided withWindows 95

• IE 6 released in 2001 and reaches80%market share

• Leads to anantitrustlawsuit in the USA, 1998–2001 Firefox Released in2002from Netscape (open-sourced in 1998)

• Freeandopen-source

• Popularizedtabbed browsing

• Attacked IE 6’smonopoly

5/18

(7)

Historical web browsers

Mosaic First common graphical browser,1993–1997

• From80%in 1994 to<10%in 1996 Netscape Released in1994, based on Mosaic

• From80%in 1996 to<10%in 2001 Internet Explorer Released in1995

• Provided withWindows 95

• IE 6 released in 2001 and reaches80%market share

• Leads to anantitrustlawsuit in the USA, 1998–2001

Firefox Released in2002from Netscape (open-sourced in 1998)

• Freeandopen-source

• Popularizedtabbed browsing

• Attacked IE 6’smonopoly

5/18

(8)

Historical web browsers

Mosaic First common graphical browser,1993–1997

• From80%in 1994 to<10%in 1996 Netscape Released in1994, based on Mosaic

• From80%in 1996 to<10%in 2001 Internet Explorer Released in1995

• Provided withWindows 95

• IE 6 released in 2001 and reaches80%market share

• Leads to anantitrustlawsuit in the USA, 1998–2001 Firefox Released in2002from Netscape (open-sourced in 1998)

• Freeandopen-source

• Popularizedtabbed browsing

• Attacked IE 6’smonopoly

5/18

(9)

Current Web browsers

IE IE 7released in 2006, replaced byMicrosoft Edge Firefox Stillactively developed

Safari Released in 2003, default Web browser onMac OS X Opera Released in1996, initiallycommercial(until 2000) then

ad-supported(until 2005), nowfreebutproprietary.

Chrome Released in2008byGoogle, with anopen-source version(Chromium)

Mobile 55%market share according togs.statcounter.com

• Safari(iOS),Android browserorChrome(Android)

• Firefox Mobile, Blackberry, Opera, UC Browser...

To check rendering on old browsers, usebrowsershots.org

6/18

(10)

Recent market share

0 5 10 15 20 25 30 35 40

Chrome Chrome Safari FirefoxUC Browser IE Samsung Int.Other Source:gs.statcounter.com(November 2018)

7/18

(11)

Evolution

0 10 20 30 40 50 60 70

20092010 2011 2012 2013 2014 2015 2016 2017 2018

Chrome FirefoxIE Safari Opera Other Desktop Mobile, tablet, etc.

Source:gs.statcounter.com

8/18

(12)

Evolution (mobile)

0 10 20 30 40 50 60

20092010 2011 2012 2013 2014 2015 2016 2017 2018

Chrome Safari UC Browser Android Opera Samsung Internet Nokia Other

Source:gs.statcounter.com

9/18

(13)

Rendering engine

Firefox Gecko, and (work-in-progress) Servo, using Rust Safari WebKit engine

Chrome Blink (fork of Webkit, in April 2013)

IE Originally Trident, then EdgeHTML, soon Chromium5 Opera Originally Presto, then Blink

Others Dillo, KHTML, and other old/minimalistic engines

5https://www.windowscentral.com/

microsoft-building-chromium-powered-web-browser-windows-10, Dec 3, 2018

10/18

(14)

Summary and perspectives

• Webkit/Blink and Chrome/Chromium aredominant

• Only serious challenger:Firefox(but around5%globally)

• Blink isopen-sourcebutcontrolled by Google

• Differentbrowsersusing this rendering engine

• Someminimalistic, e.g.,uzbl

• Somevariantsof existing browsers, e.g., Pale Moon, Basilisk (Firefox forks)

• Somenew browsersusing Blink: Vivaldi, Brave

11/18

(15)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(16)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(17)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(18)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(19)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(20)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(21)

New topics about Web browsers

• Ecosystem ofextensions: Chrome Web Store, Mozilla Store

• Firefox: Mandatorysigning, andbackward-incompatiblechanges

• Ad blocking(27%of users6), counter-measures

uBlock Originextension, based onEasylist https://easylist.to/easylist/easylist.txt

• More generally,JavaScript blockers, e.g., uMatrix, NoScript, etc.

• Filtering outbots: robots exclusion standard, CAPTCHAs

reCAPTCHA: now volunteer work for Google

• Security: site isolation, one process/site (in Chromium)

• Web browserfingerprintinghttps://panopticlick.eff.org/

• In-browser cryptocurrency mining:CoinHive

around0.01 $/day, vs about0.10 $/dayof electricity

• TorandTor hidden services

6https://www.statista.com/topics/3201/ad-blocking/

12/18

(22)

Table of Contents

Introduction Web browsers

Other Web clients

URLs

13/18

(23)

Textual Web browsers

• lynx(still maintained), w3m, elinks

• Also:screen readersfor visually impaired users 14/18

(24)

Robots

Manyautomated programson the Web:

• Search enginecrawlers: see Pierre’s class

• RSS readersand aggregators

• Email harvesters(spammers)

• API consumers

15/18

(25)

Table of Contents

Introduction Web browsers Other Web clients URLs

16/18

(26)

URL

• Anatomy of a URL:

https://en.wikipedia.org/wiki/Telecom_ParisTech#History

protocol machine path fragment

• Uniform Resource Locator

• Identifies aresourceon the Web

17/18

(27)

Credits

• Course material inspired by course notes by Pierre Senellart

18/18

Références

Documents relatifs

SVG+JS Javascript manipulation of vector graphics WebGL Javascript API for accelerated 3D using a GPU. WebVR Describe scenes for virtual reality headsets with a-scene WebRTC Voice

JSP Integrating Java and a Web server (e.g., Apache Tomcat) node.js Chrome’s JavaScript engine (V8) plus a Web server Python Web frameworks: Django , CherryPy, Flask. Ruby

DOM (Document Object Model) gives an API for the tree representation of HTML and XML documents (independent from the programming language). Schema languages Specify the structure

• YAML: extends JSON with several features, e.g., explicit types, user-defined types, anchors and references. • Some JSON parsers are permissive and allow,

• While inside an element we can check that the word of its children satisfies the deterministic regular expression used to define it. • Can be implemented with a

descendant::C[@att1='1'] is a step which denotes all the Element nodes named C, descendant of the context node, having an Attribute node att1 with value

• Information extraction: creating structured information out of existing Web Data

• Recall , the percentage of correct matches that are extracted. → Extract everything gives 100% recall (and very