Graph visualization by organized clustering:
application to social and biological networks
Nathalie Villa-Vialaneix (http://www.nathalievilla.org) & Fabrice Rossi
IUT de Carcassonne (UPVD) & Institut de Mathématiques de Toulouse
Challenging Problems in Statistical Learning, Paris, January 28-29, 2010
Network data
Many sources of large networks:
- social networks (emails, collaborations, phone calls, etc.)
- technological networks (Internet, etc.)
- biological networks (metabolic pathways, gene regulation, gene interactions, etc.)
Scope of the talk: a graph $G$ with vertices $\{x_1, x_2, \ldots, x_n\}$, undirected and weighted with weights $W$ such that $w_{ii} = 0$ (no loop) and $w_{ij} = w_{ji} \geq 0$, where $w_{ij} > 0 \Leftrightarrow$ there is an edge between nodes $x_i$ and $x_j$.
[Figure: an example of a large network]
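As a concrete illustration (not from the slides), such a graph can be stored as a symmetric weight matrix with a zero diagonal; all the sketches below operate on this representation:

```python
import numpy as np

# Weight matrix of a small undirected weighted graph on 4 nodes:
# w[i, i] = 0 (no loop) and w[i, j] = w[j, i] >= 0.
W = np.array([
    [0.0, 1.0, 0.5, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.5, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
assert np.allclose(W, W.T) and np.all(np.diag(W) == 0)
```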
Network data analysis
Over one hundred vertices, pure manual analysis is infeasible
⇒ need for automatic support for exploratory analysis:
- node/edge measures (e.g., degree distribution, betweenness, ...)
- visualization (e.g., force-directed algorithms)
- node clustering (community extraction)
- supervised analysis
- ...
Visualization from clustering
Large graph visualization is difficult.
A standard solution: simplify the graph prior to drawing. More precisely:
1. identify dense clusters of nodes
2. draw the corresponding graph of clusters
[Newman and Girvan, 2004]
Drawing a clustered graph
Given a partition $(C_k)_{k=1,\ldots,C}$:
- represent each cluster by a glyph (e.g., a circle) with surface proportional to $|C_k|$
- draw a segment between the glyphs of $C_k$ and $C_l$ with thickness proportional to $\sum_{i \in C_k, j \in C_l} W_{ij}$
Two requirements:
- the simplification induced by the clustering has to be faithful: each cluster should be as dense as possible (i.e., $\sum_{i,j \in C_k} W_{ij}$ should be high compared to the other weights)
- the graph induced by the clustering has to be readable: edge crossings should be minimized (a sketch of this construction follows the list).
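A minimal sketch of this construction, assuming numpy, networkx, and matplotlib are available (the helper name, the spring layout, and the scaling constants are our illustrative choices, not the authors' method):

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def draw_cluster_graph(W, labels):
    """Draw the graph of clusters: glyph area ~ |C_k|,
    segment thickness ~ total weight between clusters."""
    clusters = np.unique(labels)
    C = len(clusters)
    sizes = np.array([np.sum(labels == k) for k in clusters])
    # Total weight between each pair of clusters.
    T = np.zeros((C, C))
    for a in range(C):
        for b in range(C):
            T[a, b] = W[np.ix_(labels == clusters[a], labels == clusters[b])].sum()
    G = nx.Graph()
    G.add_nodes_from(range(C))
    for a in range(C):
        for b in range(a + 1, C):
            if T[a, b] > 0:
                G.add_edge(a, b, weight=T[a, b])
    pos = nx.spring_layout(G, seed=0)
    widths = [G[u][v]["weight"] for u, v in G.edges()]
    wmax = max(widths) if widths else 1.0
    nx.draw(G, pos, node_size=300 * sizes,
            width=[3.0 * w / wmax for w in widths], with_labels=True)
    plt.show()
```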
Two approaches based on organizing maps
Main idea: organize the clustering on a grid to constrain the clusters' positions and to represent the most connected clusters close to each other.
1. Kernel SOM: generalization of Self-Organizing Maps to graphs by the use of a kernel
2. Organized modularity optimization: extension of a well-known clustering quality measure for graphs to organized clustering
Basic ideas about SOM
Project the graph on a square grid (each square of the grid is a cluster) such that:
- the nodes in a same cluster are highly connected
- the nodes in two close clusters are also (less strongly) connected
- the nodes in two distant clusters are (almost) not connected
SOM and kernel SOM
Original SOM algorithm (batch), for $x_1, \ldots, x_n \in \mathbb{R}^d$:
1. Initialization: initialize $p_1^0, \ldots, p_M^0$ randomly in $\mathbb{R}^d$
2. For $l = 1, \ldots, L$ do
3. Assignment: for all $i = 1, \ldots, n$, $f^l(x_i) = \arg\min_{j=1,\ldots,M} \|x_i - p_j^{l-1}\|_{\mathbb{R}^d}$
4. Representation: for all $j = 1, \ldots, M$, $p_j^l = \arg\min_{p \in \mathbb{R}^d} \sum_{i=1}^{n} h^l(f^l(x_i), j) \|x_i - p\|^2_{\mathbb{R}^d}$
Online versions by [Lau et al., 2006]
SOM and kernel SOM
Kernel SOM (batch): the $x_i \in G$ are defined through a kernel $K(x_i, x_j)$
$\Rightarrow \exists\, \phi: G \to (\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$
1. Initialization: initialize randomly $p_j^0 = \sum_{i=1}^{n} \gamma^0_{ji} \phi(x_i)$
2. For $l = 1, \ldots, L$ do
3. Assignment: for all $i = 1, \ldots, n$, $f^l(x_i) = \arg\min_{j=1,\ldots,M} \|\phi(x_i) - p_j^{l-1}\|_{\mathcal{H}}$
4. Representation: for all $j = 1, \ldots, M$, $\gamma_j^l = \arg\min_{\gamma \in \mathbb{R}^n} \sum_{i=1}^{n} h^l(f^l(x_i), j) \|\phi(x_i) - \sum_{k=1}^{n} \gamma_k \phi(x_k)\|^2_{\mathcal{H}}$
[Villa and Rossi, 2007, Hammer and Hasenfuss, 2007]
Online versions by [Lau et al., 2006]
SOM and kernel SOM
Using the kernel trick, both steps of kernel SOM can be computed from $K$ alone:
3. Assignment: for all $i = 1, \ldots, n$, $f^l(x_i) = \arg\min_{j=1,\ldots,M} \sum_{k,k'=1}^{n} \gamma^{l-1}_{jk} \gamma^{l-1}_{jk'} K(x_k, x_{k'}) - 2 \sum_{k=1}^{n} \gamma^{l-1}_{jk} K(x_i, x_k)$
4. Representation: for all $j = 1, \ldots, M$, $\gamma^l_{jk} = \frac{h^l(f^l(x_k), j)}{\sum_{k'=1}^{n} h^l(f^l(x_{k'}), j)}$
[Villa and Rossi, 2007, Hammer and Hasenfuss, 2007]
Online versions by [Lau et al., 2006]
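A compact batch kernel SOM following the two update rules above; this is a minimal numpy sketch (the grid shape, the Gaussian neighborhood function $h^l$, and the shrinking-radius schedule are our illustrative choices, not prescribed by the slides):

```python
import numpy as np

def batch_kernel_som(K, grid_w, grid_h, n_iter=30, seed=0):
    """Batch kernel SOM on a grid_w x grid_h grid.
    K: (n, n) kernel matrix; returns the cluster index f(i) of each node."""
    rng = np.random.default_rng(seed)
    n, M = K.shape[0], grid_w * grid_h
    # Grid coordinates of the M units, used by the neighborhood function.
    coords = np.array([(a, b) for a in range(grid_w) for b in range(grid_h)])
    d_grid = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    # Random initial prototype coefficients gamma (rows sum to 1).
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)
    for l in range(n_iter):
        # Assignment: ||phi(x_i) - p_j||^2 up to the constant K(x_i, x_i),
        # which does not change the argmin over j.
        dist = np.einsum("jk,jl,kl->j", gamma, gamma, K)[None, :] - 2 * K @ gamma.T
        f = dist.argmin(axis=1)
        # Gaussian neighborhood with a shrinking radius (illustrative schedule).
        radius = max(grid_w, grid_h) * (1 - l / n_iter) + 0.5
        h = np.exp(-(d_grid[f] ** 2) / (2 * radius ** 2))  # h[i, j] = h(f(x_i), j)
        # Representation: gamma_{jk} = h(f(x_k), j) / sum_k' h(f(x_k'), j).
        gamma = (h / h.sum(axis=0, keepdims=True)).T
    return f
```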
Which kernels?
Laplacian: $L = (L_{ij})_{i,j=1,\ldots,n}$ where
$$L_{ij} = \begin{cases} -w_{ij} & \text{if } i \neq j \\ d_i = \sum_{j \neq i} w_{ij} & \text{if } i = j \end{cases}$$
Regularized versions such as:
- Heat kernel [Kondor and Lafferty, 2002, Smola and Kondor, 2003]: for $\beta > 0$, $K^\beta = e^{-\beta L} = \sum_{k=0}^{+\infty} \frac{(-\beta L)^k}{k!}$
- Generalized inverse of the Laplacian [Fouss et al., 2007]: $K = L^+$
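Both kernels are straightforward to compute for moderate $n$; a sketch using numpy and scipy (the helper names are ours):

```python
import numpy as np
from scipy.linalg import expm

def laplacian(W):
    """Graph Laplacian L = D - W from a symmetric weight matrix W."""
    return np.diag(W.sum(axis=1)) - W

def heat_kernel(W, beta):
    """Heat kernel K_beta = exp(-beta * L)."""
    return expm(-beta * laplacian(W))

def pinv_laplacian_kernel(W):
    """Generalized (Moore-Penrose) inverse of the Laplacian, K = L+."""
    return np.linalg.pinv(laplacian(W))
```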
A first example: a medieval social network
Example from [Boulet et al., 2008]: in Cahors (Lot, France), a large corpus of 5,000 agrarian contracts has been preserved, coming from 4 seigneuries (about 25 small villages) and established between 1240 and 1520 (just before and after the Hundred Years' War).
A network of relations between peasants is built from common citations in a given contract.
Graph of clusters: the communities are related to time and space, and the leading people are emphasized.
But: the biggest communities are still very complex.
Modularity [Newman and Girvan, 2004]
Popular quality measure for graph clustering: a partition of the vertices into $C$ clusters, $(C_k)_{k=1,\ldots,C}$, has modularity
$$Q(C) = \frac{1}{2m} \sum_{k=1}^{C} \sum_{i,j \in C_k} (W_{ij} - P_{ij})$$
where the $P_{ij}$ are weights corresponding to a "null model" in which the weights only depend on the node properties and not on the clusters the nodes belong to. More precisely,
$$P_{ij} = \frac{d_i d_j}{2m}$$
where $d_i = \sum_{j \neq i} W_{ij}$ is the degree of vertex $x_i$ and $m = \frac{1}{2} \sum_i d_i$ is the total weight of the graph.
A "good" clustering should maximize $Q$.
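A direct numpy implementation of this definition (a minimal sketch; labels is an array giving the cluster index of each node):

```python
import numpy as np

def modularity(W, labels):
    """Newman-Girvan modularity of a partition of a weighted graph."""
    d = W.sum(axis=1)               # degrees
    m = d.sum() / 2.0               # total weight
    P = np.outer(d, d) / (2.0 * m)  # null-model weights
    same = labels[:, None] == labels[None, :]
    return ((W - P)[same]).sum() / (2.0 * m)
```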
Interpretation
- $Q$ increases when $(x_i, x_j)$ are in a same cluster and their true weight $W_{ij}$ is greater than the one expected under the null model, $P_{ij}$
- $Q$ increases when $(x_i, x_j)$ are in two different clusters and their true weight $W_{ij}$ is smaller than the one expected under the null model, $P_{ij}$, because
$$Q(C) + \frac{1}{2m} \sum_{k \neq k'} \sum_{i \in C_k,\, j \in C_{k'}} (W_{ij} - P_{ij}) = 0$$
- Contrary to the minimization of the number of edges between clusters, modularity can help to separate nodes with high degrees into different clusters more easily
Drawing optimized clustering
Combine:
- high modularity, to ensure high intra-cluster density and low external connectivity
- little edge crossing
by:
- the classic solution: relying on a graph drawing algorithm after maximization of the modularity
- or extending the modularity to a criterion adapted to a prior structure (like a grid)
Self Organizing Map principle
For data in $\mathbb{R}^d$, SOM minimizes (over the clustering and the prototypes $(p_k)$)
$$\sum_{k=1}^{C} \sum_{i=1}^{n} S_{f(x_i),k} \|x_i - p_k\|^2_{\mathbb{R}^d}$$
where:
- $(p_k)$ are the prototypes (one for each cluster of the grid) representing the clusters in the original space ($\mathbb{R}^d$)
- $f(x_i)$ is the cluster, on the grid, where $x_i$ is classified
- $S_{kl}$ encodes the prior structure: close to 1 for close clusters and close to 0 for distant clusters
This corresponds to a soft membership: $x_i$ belongs to $C_k$ with membership $S_{f(x_i),k}$.
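For reference, this energy in numpy form (a minimal sketch; X is the data matrix, P the prototype matrix, S the grid neighborhood matrix, and f the array of cluster assignments):

```python
import numpy as np

def som_energy(X, P, S, f):
    """SOM energy: sum_k sum_i S[f(i), k] * ||x_i - p_k||^2."""
    sq_dist = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)  # (n, C)
    return float((S[f] * sq_dist).sum())
```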
Organized modularity [Rossi and Villa-Vialaneix, 2010]
Same idea: encode a prior structure via a matrix $S$ and maximize
$$SQ = \frac{1}{2m} \sum_{i,j} S_{f(i)f(j)} (W_{ij} - P_{ij})$$
Hence:
- if a pair of vertices $(x_i, x_j)$ is such that $W_{ij} > P_{ij}$, $SQ$ increases with the closeness of $f(x_i)$ and $f(x_j)$ in the prior structure
- if a pair of vertices $(x_i, x_j)$ is such that $W_{ij} < P_{ij}$, $SQ$ increases if $f(x_i)$ and $f(x_j)$ are distant in the prior structure
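Compared to plain modularity, only the weighting by $S$ changes; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def organized_modularity(W, labels, S):
    """SQ = (1/2m) * sum_ij S[f(i), f(j)] * (W_ij - P_ij)."""
    d = W.sum(axis=1)
    m = d.sum() / 2.0
    P = np.outer(d, d) / (2.0 * m)
    return (S[np.ix_(labels, labels)] * (W - P)).sum() / (2.0 * m)
```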
Optimization
The clustering is represented by an $n \times C$ assignment matrix $M$ with $M_{ik} = \delta_{f(i)=k}$. The goal is then to maximize
$$SQ = F(M) = \frac{1}{2m} \sum_{i,j} \sum_{k,l} M_{ik} S_{kl} M_{jl} (W_{ij} - P_{ij})$$
This combinatorial problem is NP-complete ⇒ use of a deterministic annealing algorithm:
- given a temperature $\frac{1}{\beta}$, assume a Gibbs distribution on the solution space: $P(M) = \frac{1}{Z_P} e^{\beta F(M)}$
- compute the expectation $E(M)$ of $M$ with respect to $P$
- at the limit $\beta \to +\infty$, $E(M)$ converges to $M^*$, where $M^*$ realizes the maximum of $F(M)$
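In matrix form, $F(M) = \sum_{k,l} S_{kl} (M^\top B M)_{kl}$ with $B = \frac{1}{2m}(W - P)$; a minimal numpy sketch of this score, which agrees with organized_modularity above on hard assignment matrices:

```python
import numpy as np

def F(M, W, S):
    """F(M) = sum_{i,j,k,l} M_ik S_kl M_jl B_ij, with B = (W - P) / (2m)."""
    d = W.sum(axis=1)
    m = d.sum() / 2.0
    B = (W - np.outer(d, d) / (2.0 * m)) / (2.0 * m)
    return float(np.sum(S * (M.T @ B @ M)))
```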
Mean field approximation
Problem: $Z_P = \sum_M e^{\beta F(M)}$ is hard to compute ($C^n$ values for $M$), except when the distribution factorizes (use of block calculations). But $SQ$ does not factorize!
Solution: approximate $P(M)$ by a distribution that factorizes:
- $P(M)$ is approximated by
$$R(M, E) = \frac{e^{\beta \sum_{i,k} M_{ik} E_{ik}}}{\sum_N e^{\beta \sum_{i,k} N_{ik} E_{ik}}}$$
Under the distribution $R(M, E)$, the $M_{ik}$ are independent across $i = 1, \ldots, n$.
- The mean field $E$ is tuned by minimizing the Kullback-Leibler divergence $KL(R \| P) = \sum_M R(M, E) \log \frac{R(M, E)}{P(M)}$ ⇒ mean field equations:
$$\frac{\partial E_R(F(M))}{\partial E_{jl}} = \sum_k \frac{\partial E_R(M_{jk})}{\partial E_{jl}} E_{jk}$$
- $E_{ik}$ and $E_R(M_{ik})$ are iteratively estimated by an EM-like algorithm; at the limit, $E_R(M_{ik})$ gives the probability of $x_i$ to belong to cluster $k$ for the optimal $SQ$
Deterministic annealing
Algorithm: for an increasing sequence $\beta_1, \beta_2, \ldots, \beta_L$,
1. Initialize $E_R(M)$ randomly in $[0,1]$ such that $\sum_k E_R(M_{ik}) = 1$
2. Repeat for $l = 1, \ldots, L$:
   2.1 Compute $E$: $E_{ik} = 2 \sum_{j \neq i} \sum_{k'} E_R(M_{jk'}) S_{kk'} B_{ji}$, where $B = \frac{1}{2m}(W - P)$
   2.2 Compute $E_R(M)$: $E_R(M_{ik}) = \frac{e^{\beta_l E_{ik}}}{\sum_{k'} e^{\beta_l E_{ik'}}}$
3. Threshold $E_R(M_{ik})$ into a clustering: $f(i) = \arg\max_{k=1,\ldots,C} E_R(M_{ik})$
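Putting the whole scheme together; a minimal numpy sketch of this algorithm (the β schedule is left to the caller, and a few inner iterations are run at each temperature so the mean field equations can settle, which is our choice rather than the slides'):

```python
import numpy as np

def organized_modularity_annealing(W, S, betas, inner=20, seed=0):
    """Deterministic annealing for organized modularity.
    W: (n, n) weights; S: (C, C) grid neighborhood matrix;
    betas: increasing inverse temperatures. Returns cluster labels."""
    rng = np.random.default_rng(seed)
    n, C = W.shape[0], S.shape[0]
    d = W.sum(axis=1)
    m = d.sum() / 2.0
    B = (W - np.outer(d, d) / (2.0 * m)) / (2.0 * m)  # B = (W - P) / 2m
    # Random initialization of E_R(M), rows summing to 1.
    ERM = rng.random((n, C))
    ERM /= ERM.sum(axis=1, keepdims=True)
    for beta in betas:
        for _ in range(inner):
            # Mean field: E_ik = 2 * sum_{j != i} sum_{k'} ERM[j,k'] S[k,k'] B[j,i]
            Bo = B - np.diag(np.diag(B))  # exclude the j = i terms
            E = 2.0 * Bo.T @ ERM @ S.T    # shape (n, C)
            # Softmax at inverse temperature beta (numerically stabilized).
            Z = beta * (E - E.max(axis=1, keepdims=True))
            ERM = np.exp(Z)
            ERM /= ERM.sum(axis=1, keepdims=True)
    # Threshold E_R(M) into a hard clustering.
    return ERM.argmax(axis=1)
```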
A toy example
A toy example [Zachary, 1977]: Zachary's karate club (friendship social network between the 34 members of a karate club at a US university in the 1970s).
[Figure: the karate club network, nodes numbered 1 to 34]
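As a usage sketch, this network ships with networkx, so the pipeline above can be tried on it directly (the Gaussian neighborhood matrix S on a 2×2 grid is an illustrative choice; modularity and organized_modularity_annealing are the helpers defined earlier):

```python
import numpy as np
import networkx as nx

# Zachary's karate club as a weight matrix (unweighted graph: 0/1 entries).
G = nx.karate_club_graph()
W = nx.to_numpy_array(G)

# 2x2 grid: S close to 1 for close clusters, close to 0 for distant ones.
coords = np.array([(a, b) for a in range(2) for b in range(2)])
D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
S = np.exp(-D ** 2)

labels = organized_modularity_annealing(W, S, betas=np.geomspace(50, 500, 30))
print(modularity(W, labels))
```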
Optimization of organized modularity on "Karate"
For a choice of neighborhood relationship leading to 4 non-empty clusters on a square grid of size 2×2:
[Figure: evolution of the organized modularity (0.0 to 0.4) during the annealing scheme, for β ranging from 50 to 500]
[Figure: final classification and layout]
[Figure: probability of each node to be in a given class (one panel per cluster, 1 to 4), just after the first phase transition, with the final classification and layout]
[Figure: probability of each node to be in a given class (one panel per cluster, 1 to 4), just after the second phase transition, with the final classification and layout]
Comparisons on a toy example: "Karate"
Optimal solution obtained with SOM (various kernels were tested), compared to direct SQ optimization:
[Figure: final classification and layout for SOM (heat kernel) and for SQ optimization, side by side]
SOM (heat kernel): modularity = 0.4188
SQ optimization: modularity = 0.4198 (the true optimum)
The SQ optimization solution is consistent with the true division of the social network.