Traceroute-like exploration of unknown networks: a statistical
analysis
A. Barrat, LPT, Université Paris-Sud, France
I. Alvarez-Hamelin (LPT, France) L. Dall’Asta (LPT, France)
A. Vázquez (University of Notre Dame, USA) A. Vespignani (LPT, France)
cond-mat/0406404
http://www.th.u-psud.fr/page_perso/Barrat
Plan of the talk
• Apparent complexity of Internet’s structure
• Problem of sampling biases
• Model for traceroute
• Theoretical approach to the traceroute mapping process
• Numerical results
• Conclusions
Internet representation
Many projects (CAIDA, NLANR, RIPE, IPM, PingER...)
• Multi-probe reconstruction (router level)
• Use of BGP tables for the Autonomous System level (domains)
• Graph representation at different granularities
=> Large-scale visualizations
TOPOLOGICAL ANALYSIS by STATISTICAL TOOLS and GRAPH THEORY!
Main topological characteristics
• Small world
Distribution of shortest-path lengths (# of hops) between two nodes is peaked at small values
Main topological characteristics
• Small world: captured by Erdős–Rényi random graphs
With probability p an edge is established between each pair of vertices
=> Poisson degree distribution, average degree ⟨k⟩ = p N
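The construction is easy to sketch in code; the following pure-Python snippet (no graph library assumed, names illustrative) builds an ER graph and checks that the mean degree comes out close to pN:

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Connect each of the n(n-1)/2 vertex pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

N, p = 1000, 0.005                    # expected mean degree <k> = p*N = 5
edges = erdos_renyi(N, p, seed=42)
mean_degree = 2 * len(edges) / N      # each edge contributes to two degrees
print(mean_degree)                    # fluctuates around p*N = 5
```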
Main topological characteristics
• Small world
• Large clustering: different neighbours of a node are likely to know each other
(neighbours 1, 2, 3, …, n of a node have a higher probability of being connected to each other)
=> graph models with large clustering, e.g. Watts-Strogatz 1998
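The clustering of a node can be measured directly (a plain-Python sketch with an illustrative adjacency structure): count which pairs of its neighbours are themselves connected.

```python
def clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are connected to each other."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0                    # no neighbour pairs for degree < 2
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2 * links / (k * (k - 1))

# Triangle 0-1-2 with a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering(adj, 2))   # neighbours 0, 1, 3: only the pair (0, 1) is linked -> 1/3
```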
Main topological characteristics
• Small world
• Large clustering
• Dynamical network
• Broad connectivity distributions
• also observed in many other contexts
(from biological to social networks)
• huge modeling activity
Main topological characteristics
Broad connectivity distributions:
obtained from mapping projects (Govindan et al. 2002)
Result of a sampling: is it reliable?
Sampling biases
Internet mapping:
• sampling is incomplete (finite-size sample)
• lateral connectivity is missed (edges are underestimated)
=> the map obtained from a single source is essentially a spanning tree
Sampling biases
• Vertices and edges are best sampled in the proximity of sources
• Bad estimation of some topological properties
Statistical properties of the sampled graph may sharply differ from those of the original one
Lakhina et al. 2002; Clauset & Moore 2004;
De Los Rios & Petermann 2004; Guillaume & Latapy 2004
Bad sampling?
Evaluating sampling biases
Real graph G = (V, E) (known, with given properties)
=> simulated sampling => sampled graph G' = (V', E')
Analysis of G', comparison with G
Model for traceroute
G=(V, E)
Sources Targets
First approximation: union of shortest paths
NB: Unique Shortest Path
Model for traceroute
G’=(V’, E’)
First approximation: union of shortest paths
Very simple model, but it allows for some analytical and numerical understanding
More formally...
G = (V, E): sparse undirected graph with
a set S = {i_1, i_2, …, i_{N_S}} of N_S sources
a set T = {j_1, j_2, …, j_{N_T}} of N_T targets
randomly placed.
The sampled graph G' = (V', E') is obtained by considering the union of all the traceroute-like paths connecting source-target pairs.
PARAMETERS: ρ_S = N_S / N, ρ_T = N_T / N, ε = ρ_S ρ_T N = N_S N_T / N (probing effort)
Usually N_S = O(1), ρ_T = O(1)
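This sampling model is straightforward to simulate (a sketch under the unique-shortest-path assumption; function names are illustrative): run a BFS from each source and keep the union of the tree paths leading to the targets.

```python
import random
from collections import deque

def bfs_parents(adj, src):
    """BFS tree from src: encodes one (unique) shortest path to each reachable node."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

def traceroute_sample(adj, n_sources, n_targets, seed=None):
    """G' = union of shortest paths between random source-target pairs (USP model)."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, n_sources)
    targets = rng.sample(nodes, n_targets)
    sampled = set()
    for s in sources:
        parent = bfs_parents(adj, s)            # one probe tree per source
        for t in targets:
            v = t
            while parent[v] is not None:        # walk back from target to source
                sampled.add(frozenset((v, parent[v])))
                v = parent[v]
    return sampled

# Toy example: a 5-cycle; a single source discovers only a tree-like part of it
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
G_prime = traceroute_sample(adj, n_sources=1, n_targets=2, seed=1)
print(len(G_prime), "of 5 edges discovered")
```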
For each set Ω = {S, T}, the indicator function that a given edge (i, j) belongs to the sampled graph is

π_ij = 1 - Π_{l≠m} [1 - σ_ij^(l,m) s_l t_m]

with σ_ij^(l,m) = 1 if (i, j) belongs to the path between l and m, 0 otherwise;
s_l = 1 if l is a source (0 otherwise), t_m = 1 if m is a target (0 otherwise),
so that Σ_i s_i = N_S and Σ_j t_j = N_T.
Analysis of the mapping process
Averaging over all the possible realizations of the set Ω = {S, T}:

⟨π_ij⟩ = 1 - Π_{l≠m} [1 - ρ_S ρ_T σ_ij^(l,m)]

WE HAVE NEGLECTED CORRELATIONS !!
• usually ρ_S ρ_T << 1, and in the USP approximation
Σ_{l≠m} σ_ij^(l,m) = Σ_{(s,t)} σ_ij(s,t) / σ(s,t) = b_ij
• Mean-field statistical analysis:

⟨π_ij⟩ ≃ 1 - exp(-ρ_S ρ_T b_ij)    (b_ij : edge betweenness)
Betweenness centrality b_ij
For each pair of nodes (l, m) in the graph, there are σ_lm shortest paths between l and m, of which σ_ij^lm go through the edge (i, j).
b_ij is the sum of σ_ij^lm / σ_lm over all pairs (l, m).
Similar concept: node betweenness b_i
[Figure: edge (i, j) has large betweenness, edge (j, k) has small betweenness]
Also: flow of information if each individual sends a message to all other individuals
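For small graphs this definition can be applied directly (a brute-force sketch; production code would use Brandes' algorithm). A BFS from each node gives distances and shortest-path counts; an edge (u, v) lies on a shortest l-m path when the distances fit together, contributing σ_l(u)·σ_m(v) such paths. Pairs (l, m) are counted as ordered, matching the convention in which b_ij = 2 for an edge used only by its two endpoints.

```python
from collections import deque
from itertools import permutations

def bfs_counts(adj, src):
    """Distances and numbers of distinct shortest paths from src."""
    dist, cnt = {src: 0}, {src: 1}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], cnt[v] = dist[u] + 1, 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                cnt[v] += cnt[u]
    return dist, cnt

def edge_betweenness(adj):
    """b_ij = sum over ordered pairs (l, m) of sigma_ij^lm / sigma_lm."""
    info = {n: bfs_counts(adj, n) for n in adj}
    b = {(u, v): 0.0 for u in adj for v in adj[u] if u < v}
    for l, m in permutations(adj, 2):
        dl, cl = info[l]
        dm, cm = info[m]
        for (u, v) in b:
            paths = 0
            if dl[u] + 1 + dm[v] == dl[m]:      # edge used in the u -> v direction
                paths += cl[u] * cm[v]
            if dl[v] + 1 + dm[u] == dl[m]:      # edge used in the v -> u direction
                paths += cl[v] * cm[u]
            b[(u, v)] += paths / cl[m]          # cl[m] = sigma_lm
    return b

# Chain 0-1-2-3: the middle edge carries the most shortest paths
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(edge_betweenness(adj))   # {(0, 1): 6.0, (1, 2): 8.0, (2, 3): 6.0}
```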
Consequences of the analysis
⟨π_ij⟩ ≃ 1 - exp(-ρ_S ρ_T b_ij)
• Smallest betweenness b_ij = 2 => ⟨π_ij⟩ ≃ 2 ρ_S ρ_T
(i.e. i and j have to be source and target to discover the edge i-j)
• Largest betweenness b_ij = O(N²) => ⟨π_ij⟩ ≃ 1
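Plugging the two extreme betweenness values into the mean-field formula shows the effect numerically (a small illustration; the N and ε values are arbitrary): a maximally central edge is discovered at essentially any probing effort, while a minimally central one (b_ij = 2) needs ε = O(N).

```python
import math

def discovery_prob(eps, b, n):
    """Mean-field edge-discovery probability: 1 - exp(-eps * b / n)."""
    return 1 - math.exp(-eps * b / n)

N = 10_000
for eps in (0.1, 1.0, 10.0):
    low  = discovery_prob(eps, 2, N)           # least central edge, b_ij = 2
    high = discovery_prob(eps, N**2 / 10, N)   # very central edge, b_ij = O(N^2)
    print(f"eps={eps:>4}: b=2 edge {low:.1e}, b=O(N^2) edge {high:.3f}")
```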
Results for the vertices
⟨π_ij⟩ ≃ 1 - exp(-ε b_ij / N)
⟨π_i⟩ ≃ 1 - (1 - ρ_T) exp(-ε b_i / N)    (discovery probability)
N*_k / N_k ≃ 1 - exp(-ε b(k) / N)    (discovery frequency)
⟨k*⟩ / k ≃ ε (1 + b(k)/N) / k    (discovered connectivity)
Summary:
• discovery probability strongly related to the centrality;
• vertex discovery favoured by a finite density of targets;
• accuracy increased by increasing the probing effort ε.
Numerical simulations
1. Homogeneous graphs (ex: ER random graphs):
• peaked distributions of k and b
• narrow range of betweenness: b_max / ⟨b⟩ = O(1)
=> Good sampling expected only for high probing effort ε
Numerical simulations
2. Heavy-tailed graphs (ex: scale-free BA model):
• broad distributions of k and b: P(k) ~ k^-3
• large range of available values, k >> 1
=> Expected: hubs well sampled independently of ε (since b(k) ~ k^β grows with k)
Numerical simulations
Homogeneous graphs:
• homogeneously, and pretty badly, sampled
• bad sampling of P(k)
• no heavy-tailed P*(k), except for N_S = 1 (cf. Clauset and Moore 2004)
• cut-off around ⟨k⟩ => a large, unrealistic ⟨k⟩ would be needed
Scale-free graphs:
• hubs are well discovered
• good sampling, especially of the heavy tail;
• almost independent of N_S;
• slight bending for low-degree (less central) nodes => bad evaluation of the exponents.
Summary
• Analytical approach to a traceroute-like sampling process
• Link with the topological properties, in particular the betweenness
• Usual random graphs are more "difficult" to sample than heavy-tailed ones
• Heavy tails are well sampled
• Bias yielding a sampled scale-free network from a homogeneous network: only in few cases, and ⟨k⟩ has to be unrealistically large
=> Heavy-tail properties are a genuine feature of the Internet
however
• Quantitative analysis might be strongly biased (wrong exponents...)
Perspectives
• Optimized strategies:
• separate influence of ρ_T and ρ_S
• location of sources and targets
• Investigation of other networks
• Results on redundancy issues
• Massive deployment: traceroute@home
• The Internet is a weighted network: bandwidth, traffic, efficiency, router capacity
• Data are scarce and available only on a limited scale