Traceroute-like exploration of unknown networks: a statistical
analysis
A. Barrat, LPT, Université Paris-Sud, France
I. Alvarez-Hamelin (LPT, France) L. Dall’Asta (LPT, France)
A. Vázquez (University of Notre Dame, USA) A. Vespignani (LPT, France)
cond-mat/0406404
http://www.th.u-psud.fr/page_perso/Barrat
Plan of the talk
• Apparent complexity of Internet’s structure
• Problem of sampling biases
• Model for traceroute
• Theoretical approach to the traceroute mapping process
• Numerical results
• Conclusions
Internet representation
Many projects (CAIDA, NLANR, RIPE, IPM, PingER...)
• Multi-probe reconstruction (router level)
• Use of BGP tables for the Autonomous System level (domains)
• Graph representation at different granularities
=> Large-scale visualizations
TOPOLOGICAL ANALYSIS by STATISTICAL TOOLS and GRAPH THEORY!
Main topological characteristics
• Small world
Distribution of shortest-path lengths (# of hops) between two nodes is peaked at small values
Main topological characteristics
• Small world: captured by Erdős–Rényi random graphs
With probability p an edge is established between each pair of vertices
=> Poisson degree distribution, average degree ⟨k⟩ = p N
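The construction is easy to sketch in code; the following pure-Python snippet (no graph library assumed, names illustrative) builds an ER graph and checks that the mean degree comes out close to pN:

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Connect each of the n(n-1)/2 vertex pairs independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(n), 2) if rng.random() < p]

N, p = 1000, 0.005                    # expected mean degree <k> = p*N = 5
edges = erdos_renyi(N, p, seed=42)
mean_degree = 2 * len(edges) / N      # each edge contributes to two degrees
print(mean_degree)                    # fluctuates around p*N = 5
```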
Main topological characteristics
• Small world
• Large clustering: different neighbours of a node are likely to know each other
(neighbours 1, 2, 3, …, n of a node have a higher probability of being connected to each other)
=> graph models with large clustering, e.g. Watts-Strogatz 1998
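The clustering of a node can be measured directly (a plain-Python sketch with an illustrative adjacency structure): count which pairs of its neighbours are themselves connected.

```python
def clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are connected to each other."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0                    # no neighbour pairs for degree < 2
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2 * links / (k * (k - 1))

# Triangle 0-1-2 with a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering(adj, 2))   # neighbours 0, 1, 3: only the pair (0, 1) is linked -> 1/3
```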
Main topological characteristics
• Small world
• Large clustering
• Dynamical network
• Broad connectivity distributions
• also observed in many other contexts
(from biological to social networks)
• huge modeling activity
Main topological characteristics
Broad connectivity distributions:
obtained from mapping projects (Govindan et al. 2002)
Result of a sampling: is it reliable?
Sampling biases
Internet mapping:
• sampling is incomplete (finite-size sample)
• lateral connectivity is missed (edges are underestimated)
=> the map obtained from a single source is essentially a spanning tree
Sampling biases
• Vertices and edges are best sampled in the proximity of sources
• Bad estimation of some topological properties
Statistical properties of the sampled graph may sharply differ from those of the original one
Lakhina et al. 2002; Clauset & Moore 2004;
De Los Rios & Petermann 2004; Guillaume & Latapy 2004
Bad sampling?
Evaluating sampling biases
Real graph G = (V, E) (known, with given properties)
=> simulated sampling => sampled graph G' = (V', E')
Analysis of G', comparison with G
Model for traceroute
G=(V, E)
Sources Targets
First approximation: union of shortest paths
NB: Unique Shortest Path
Model for traceroute
G’=(V’, E’)
First approximation: union of shortest paths
Very simple model, but it allows for some analytical and numerical understanding
More formally...
G = (V, E): sparse undirected graph with
a set S = {i_1, i_2, …, i_{N_S}} of N_S sources
a set T = {j_1, j_2, …, j_{N_T}} of N_T targets
randomly placed.
The sampled graph G' = (V', E') is obtained by considering the union of all the traceroute-like paths connecting source-target pairs.
PARAMETERS: ρ_S = N_S / N, ρ_T = N_T / N, ε = ρ_S ρ_T N = N_S N_T / N (probing effort)
Usually N_S = O(1), ρ_T = O(1)
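This sampling model is straightforward to simulate (a sketch under the unique-shortest-path assumption; function names are illustrative): run a BFS from each source and keep the union of the tree paths leading to the targets.

```python
import random
from collections import deque

def bfs_parents(adj, src):
    """BFS tree from src: encodes one (unique) shortest path to each reachable node."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent

def traceroute_sample(adj, n_sources, n_targets, seed=None):
    """G' = union of shortest paths between random source-target pairs (USP model)."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, n_sources)
    targets = rng.sample(nodes, n_targets)
    sampled = set()
    for s in sources:
        parent = bfs_parents(adj, s)            # one probe tree per source
        for t in targets:
            v = t
            while parent[v] is not None:        # walk back from target to source
                sampled.add(frozenset((v, parent[v])))
                v = parent[v]
    return sampled

# Toy example: a 5-cycle; a single source discovers only a tree-like part of it
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
G_prime = traceroute_sample(adj, n_sources=1, n_targets=2, seed=1)
print(len(G_prime), "of 5 edges discovered")
```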
For each set Ω = {S, T}, the indicator function that a given edge (i, j) belongs to the sampled graph is

π_ij = 1 - Π_{l≠m} [1 - σ_ij^(l,m) s_l t_m]

with σ_ij^(l,m) = 1 if (i, j) belongs to the path between l and m, 0 otherwise;
s_l = 1 if l is a source (0 otherwise), t_m = 1 if m is a target (0 otherwise),
so that Σ_i s_i = N_S and Σ_j t_j = N_T.
Analysis of the mapping process
Averaging over all the possible realizations of the set Ω = {S, T}:

⟨π_ij⟩ = 1 - Π_{l≠m} [1 - ρ_S ρ_T σ_ij^(l,m)]

WE HAVE NEGLECTED CORRELATIONS !!
• usually ρ_S ρ_T << 1, and in the USP approximation
Σ_{l≠m} σ_ij^(l,m) = Σ_{(s,t)} σ_ij(s,t) / σ(s,t) = b_ij
• Mean-field statistical analysis:

⟨π_ij⟩ ≃ 1 - exp(-ρ_S ρ_T b_ij)    (b_ij : edge betweenness)
Betweenness centrality b_ij
For each pair of nodes (l, m) in the graph, there are σ_lm shortest paths between l and m, of which σ_ij^lm go through the edge (i, j).
b_ij is the sum of σ_ij^lm / σ_lm over all pairs (l, m).
Similar concept: node betweenness b_i
[Figure: edge (i, j) has large betweenness, edge (j, k) has small betweenness]
Also: flow of information if each individual sends a message to all other individuals
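For small graphs this definition can be applied directly (a brute-force sketch; production code would use Brandes' algorithm). A BFS from each node gives distances and shortest-path counts; an edge (u, v) lies on a shortest l-m path when the distances fit together, contributing σ_l(u)·σ_m(v) such paths. Pairs (l, m) are counted as ordered, matching the convention in which b_ij = 2 for an edge used only by its two endpoints.

```python
from collections import deque
from itertools import permutations

def bfs_counts(adj, src):
    """Distances and numbers of distinct shortest paths from src."""
    dist, cnt = {src: 0}, {src: 1}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], cnt[v] = dist[u] + 1, 0
                q.append(v)
            if dist[v] == dist[u] + 1:
                cnt[v] += cnt[u]
    return dist, cnt

def edge_betweenness(adj):
    """b_ij = sum over ordered pairs (l, m) of sigma_ij^lm / sigma_lm."""
    info = {n: bfs_counts(adj, n) for n in adj}
    b = {(u, v): 0.0 for u in adj for v in adj[u] if u < v}
    for l, m in permutations(adj, 2):
        dl, cl = info[l]
        dm, cm = info[m]
        for (u, v) in b:
            paths = 0
            if dl[u] + 1 + dm[v] == dl[m]:      # edge used in the u -> v direction
                paths += cl[u] * cm[v]
            if dl[v] + 1 + dm[u] == dl[m]:      # edge used in the v -> u direction
                paths += cl[v] * cm[u]
            b[(u, v)] += paths / cl[m]          # cl[m] = sigma_lm
    return b

# Chain 0-1-2-3: the middle edge carries the most shortest paths
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(edge_betweenness(adj))   # {(0, 1): 6.0, (1, 2): 8.0, (2, 3): 6.0}
```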
Consequences of the analysis
⟨π_ij⟩ ≃ 1 - exp(-ρ_S ρ_T b_ij)
• Smallest betweenness b_ij = 2 => ⟨π_ij⟩ ≃ 2 ρ_S ρ_T
(i.e. i and j have to be source and target to discover the edge i-j)
• Largest betweenness b_ij = O(N²) => ⟨π_ij⟩ ≃ 1
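Plugging the two extreme betweenness values into the mean-field formula shows the effect numerically (a small illustration; the N and ε values are arbitrary): a maximally central edge is discovered at essentially any probing effort, while a minimally central one (b_ij = 2) needs ε = O(N).

```python
import math

def discovery_prob(eps, b, n):
    """Mean-field edge-discovery probability: 1 - exp(-eps * b / n)."""
    return 1 - math.exp(-eps * b / n)

N = 10_000
for eps in (0.1, 1.0, 10.0):
    low  = discovery_prob(eps, 2, N)           # least central edge, b_ij = 2
    high = discovery_prob(eps, N**2 / 10, N)   # very central edge, b_ij = O(N^2)
    print(f"eps={eps:>4}: b=2 edge {low:.1e}, b=O(N^2) edge {high:.3f}")
```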
Results for the vertices
⟨π_ij⟩ ≃ 1 - exp(-ε b_ij / N)
⟨π_i⟩ ≃ 1 - (1 - ρ_T) exp(-ε b_i / N)    (discovery probability)
N*_k / N_k ≃ 1 - exp(-ε b(k) / N)    (discovery frequency)
⟨k*⟩ / k ≃ ε (1 + b(k)/N) / k    (discovered connectivity)
Summary:
• discovery probability strongly related to the centrality;
• vertex discovery favoured by a finite density of targets;
• accuracy increased by increasing the probing effort ε.
Numerical simulations
1. Homogeneous graphs (ex: ER random graphs):
• peaked distributions of k and b
• narrow range of betweenness: b_max / ⟨b⟩ = O(1)
=> Good sampling expected only for high probing effort ε
Numerical simulations
2. Heavy-tailed graphs (ex: scale-free BA model):
• broad distributions of k and b: P(k) ~ k^-3
• large range of available values, k >> 1
=> Expected: hubs well sampled independently of ε (since b(k) ~ k^β grows with k)
Numerical simulations
Homogeneous graphs:
• homogeneously, and pretty badly, sampled
• bad sampling of P(k)
• no heavy-tailed P*(k), except for N_S = 1 (cf. Clauset and Moore 2004)
• cut-off around ⟨k⟩ => a large, unrealistic ⟨k⟩ would be needed
Scale-free graphs:
• hubs are well discovered
• good sampling, especially of the heavy tail;
• almost independent of N_S;
• slight bending for low-degree (less central) nodes => bad evaluation of the exponents.
Summary
• Analytical approach to a traceroute-like sampling process
• Link with the topological properties, in particular the betweenness
• Usual random graphs are more "difficult" to sample than heavy-tailed ones
• Heavy tails are well sampled
• Bias yielding a sampled scale-free network from a homogeneous network: only in few cases, and ⟨k⟩ has to be unrealistically large
=> Heavy-tail properties are a genuine feature of the Internet
however
• Quantitative analysis might be strongly biased (wrong exponents...)
Perspectives
• Optimized strategies:
• separate influence of ρ_T and ρ_S
• location of sources and targets
• Investigation of other networks
• Results on redundancy issues
• Massive deployment: traceroute@home
• The Internet is a weighted network: bandwidth, traffic, efficiency, router capacity
• Data are scarce and available only on a limited scale