Pattern Matching in Protein-Protein Interaction Graphs

(1)

Pattern Matching in Protein-Protein Interaction Graphs

Gaëlle Brevier (Université de Grenoble, France), Romeo Rizzi (Università di Udine, Italy), Stéphane Vialette (Université Paris-Est, France)

Lisbon, September 19, 2008

(2)

Outline

1 Introduction

2 Exact colorful instances

3 Hardness results

4 Approximation algorithms
   Bounded degree graphs
   A randomized algorithm
   Linear forests

5 Future works

(3)

Introduction

Protein interactions identified on a genome-wide scale are commonly visualized as protein interaction graphs, where proteins are vertices and interactions are edges.

(4)

Gene or Protein Interactions Databases

BioGRID - A Database of Genetic and Physical Interactions
DIP - Database of Interacting Proteins
MINT - A Molecular Interactions Database
IntAct - EMBL-EBI Protein Interaction
MIPS - Comprehensive Yeast Protein-Protein interactions
Yeast Protein Interactions - Yeast two-hybrid results from Fields' group
PathCalling - A yeast protein interaction database by Curagen
SPiD - Bacillus subtilis Protein Interaction Database
AllFuse - Functional Associations of Proteins in Complete Genomes
BRITE - Biomolecular Relations in Information Transmission and Expression
ProMesh - A Protein-Protein Interaction Database
The PIM Database - by Hybrigenics
Mouse Protein-Protein interactions
Human herpesvirus 1 Protein-Protein interactions
Human Protein Reference Database
BOND - The Biomolecular Object Network Databank. Former BIND
MDSP - Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry
Protcom - Database of protein-protein complexes enriched with the domain-domain structures
Proteins that interact with GroEL and factors that affect their release
YPD™ - Yeast Proteome Database by Incyte
. . .

(5)

Introduction

Comparative analysis of protein-protein interaction graphs aims at finding complexes that are common to different species.

Mounting evidence suggests that proteins that function together in a pathway or a structural complex are likely to evolve in a correlated fashion.

(6)

Introduction

Pattern matching in protein-protein interaction graphs
Finding a protein complex in another protein network.

Graph matching
Focus on mappings that preserve adjacencies (to deal with interaction datasets that are missing many true protein interactions).

Injective list homomorphisms and optimization
State-of-the-art approaches to identifying orthologs (genes in different species that originate from a single gene in the last common ancestor of these species).
Putative orthologs are represented by colors.

(7)

Introduction: Searching for an exact occurrence

Pattern graph (G, λ_G), mult(G, λ_G) = 2

Target graph (H, λ_H), mult(H, λ_H) = 5

(8)

Introduction: Searching for an exact occurrence

Target graph (H, λ_H), mult(H, λ_H) = 5

$\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$

(13)

Introduction: Searching for the best occurrence

Target graph (H, λ_H), mult(H, λ_H) = 4

(19)

Problem

Max–(ρ, σ)–Matching–Colors

• Input: Two graphs G and H and the coloring mappings λ_G : V(G) → C, mult(G, λ_G) = ρ, and λ_H : V(H) → C, mult(H, λ_H) = σ.

• Solution: An injective mapping $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$.

• Measure: The number of edges of G matched by the injective mapping θ.

EXACT–(ρ, σ)–MATCHING–COLORS is the extremal problem of finding an injective mapping $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$ that matches all the edges of G.
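The measure can be made concrete with a small Python sketch. It assumes that mult(·) denotes the largest number of vertices sharing one color and that the labeled arrow means θ is color-preserving, i.e. λ_G(u) = λ_H(θ(u)); both readings are assumptions, as are all function names below.

```python
from collections import Counter

def mult(coloring):
    """Largest number of vertices sharing one color (assumed meaning of mult)."""
    return max(Counter(coloring.values()).values(), default=0)

def is_valid_mapping(theta, color_g, color_h):
    """Check that theta is total on V(G), injective, and color-preserving."""
    images = list(theta.values())
    injective = len(set(images)) == len(images)
    colors_ok = all(color_g[u] == color_h[v] for u, v in theta.items())
    return set(theta) == set(color_g) and injective and colors_ok

def matched_edges(theta, g_edges, h_edges):
    """Number of edges {u, v} of G whose image {theta(u), theta(v)} is an edge of H."""
    h_set = {frozenset(e) for e in h_edges}
    return sum(frozenset((theta[u], theta[v])) in h_set for u, v in g_edges)

# Toy instance (hypothetical data): a path a-b-c searched in a path x-y-z plus w.
color_g = {"a": "red", "b": "blue", "c": "red"}
color_h = {"x": "red", "y": "blue", "z": "red", "w": "blue"}
g_edges = [("a", "b"), ("b", "c")]
h_edges = [("x", "y"), ("y", "z")]
theta = {"a": "x", "b": "y", "c": "z"}
print(mult(color_g), mult(color_h), matched_edges(theta, g_edges, h_edges))
```

An exact occurrence is the special case where `matched_edges` equals the number of edges of G.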

(20)

Introduction

Trim instance

An instance of the MAX–(ρ, σ)–MATCHING–COLORS or the EXACT–(ρ, σ)–MATCHING–COLORS problem is said to be trim if the following conditions hold true:

1 for each color c_i ∈ C, #C_G(c_i) ≤ #C_H(c_i), and

2 for each edge {u_i, u_j} ∈ E(G), there exists an edge {v_i, v_j} ∈ E(H) such that λ_G(u_i) = λ_H(v_i) and λ_G(u_j) = λ_H(v_j).
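The two trim conditions translate directly into a check. This is a minimal sketch; the function name and the dictionary-based graph encoding are illustrative assumptions.

```python
from collections import Counter

def is_trim(g_edges, color_g, h_edges, color_h):
    """Return True if the instance satisfies both trim conditions."""
    count_g = Counter(color_g.values())
    count_h = Counter(color_h.values())
    # Condition 1: every color occurs at least as often in H as in G.
    if any(count_g[c] > count_h[c] for c in count_g):
        return False
    # Condition 2: every edge of G has an edge of H with the same pair of end colors.
    h_color_pairs = {frozenset((color_h[x], color_h[y])) for x, y in h_edges}
    return all(frozenset((color_g[u], color_g[v])) in h_color_pairs
               for u, v in g_edges)
```

Vertices and edges of G that violate these conditions can never be part of an occurrence, which is presumably why trim instances are singled out.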

(21)

Related works in the context

List injective homomorphisms for protein graphs

[Fagnot, Lelandais and V., 2007; Fertin, Rizzi and V., 2005].

Reaction motifs in metabolic networks

[Lacroix, Fernandes and Sagot, 2006; Hermelin, Fellows, Fertin and V., 2007].

QPath

[Shlomi, Segal, Ruppin and Sharan, 2006].

Path Matching and Graph Matching in Biological Networks [Yang and Sze, 2007].

(23)

Exact colorful instances

Theorem (Fagnot, Lelandais and V., 2007)

Both the EXACT–(1, σ)–MATCHING–COLORS problem for ∆(G) ≤ 2 and the EXACT–(ρ, 2)–MATCHING–COLORS problem are solvable in polynomial time for any constant ρ and σ.

Theorem (Fertin, Rizzi and V., 2005)

The EXACT–(1, 3)–MATCHING–COLORS problem for ∆(G) = 3 and ∆(H) = 4 is NP-complete.

We focus here on the EXACT–(1, σ)–MATCHING–COLORS problem.

(24)

Exact colorful instances

Algorithm 1: Rand-Exact-Matching-Colors
begin
  repeat, terminating when an occurrence of G in H w.r.t. λ_G and λ_H is found:
    Let $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$ be a random injective mapping.
    repeat up to 3n_G times, terminating when an occurrence of G in H w.r.t. λ_G and λ_H is found:
      (1) Choose at random an edge e ∈ E(G) that is not matched by θ.
      (2) Choose at random one vertex u ∈ e.
      (3) Change at random the value of θ(u) w.r.t. λ_G and λ_H.
end
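The random walk can be sketched in Python for the colorful case ρ = 1, where every color occurs once in G, so any color-preserving mapping is automatically injective. The function name, the `max_restarts` cap (the pseudocode restarts without a bound), and the uniform choice among the other same-colored vertices are assumptions made for this illustration.

```python
import random

def rand_exact_matching_colors(g_vertices, g_edges, color_g, h_edges, color_h,
                               max_restarts=1000, seed=0):
    """Random-walk search for an exact occurrence, assuming rho = 1."""
    rng = random.Random(seed)
    # Candidate images of each pattern vertex: target vertices of the same color.
    by_color = {}
    for v, c in color_h.items():
        by_color.setdefault(c, []).append(v)
    cand = {u: by_color.get(color_g[u], []) for u in g_vertices}
    if any(not cand[u] for u in g_vertices):
        return None  # some color of G does not occur in H
    h_set = {frozenset(e) for e in h_edges}

    def unmatched(theta):
        return [(u, v) for u, v in g_edges
                if frozenset((theta[u], theta[v])) not in h_set]

    for _ in range(max_restarts):
        # Restart from a random color-preserving mapping.
        theta = {u: rng.choice(cand[u]) for u in g_vertices}
        for _ in range(3 * len(g_vertices)):
            bad = unmatched(theta)
            if not bad:
                return theta
            u = rng.choice(rng.choice(bad))  # random endpoint of a random bad edge
            others = [v for v in cand[u] if v != theta[u]]
            if others:
                theta[u] = rng.choice(others)
        if not unmatched(theta):
            return theta
    return None  # no occurrence found within the restart budget
```

A `None` result only means that no occurrence was found within the budget; it is not a proof that none exists.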

(25)

Exact colorful instances: Random walk

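Particle moving along the integer line

Fix an optimal solution θ_opt.

θ_i and θ_opt agree on exactly j vertices.

e = {u, v} ∈ E(G): random edge that is not matched by θ_i.
θ_i and θ_opt disagree on exactly one of u and v.

Positions on the line: 0, . . . , j−1, j, j+1, . . . , n_G. From position j:
j → j−1 with probability (σ−1)/(2σ−2)
j → j+1 with probability 1/(2σ−2)
j → j with probability (σ−2)/(2σ−2)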

(26)

Exact colorful instances: Random walk

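Particle moving along the integer line

Fix an optimal solution θ_opt.

θ_i and θ_opt agree on exactly j vertices.

e = {u, v} ∈ E(G): random edge that is not matched by θ_i.
θ_i and θ_opt disagree on both u and v.

Positions on the line: 0, . . . , j−1, j, j+1, . . . , n_G. From position j:
j → j−1 with probability 0
j → j+1 with probability 2/(2σ−2)
j → j with probability (2σ−4)/(2σ−2)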

(27)

Exact colorful instances: Random walk

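Particle moving along the integer line

Pessimistic stochastic process (Y₁, Y₂, . . .)

Positions on the line: 0, . . . , j−1, j, j+1, . . . , n_G. From position j:
j → j−1 with probability (2σ−3)/(2σ−2)
j → j+1 with probability 1/(2σ−2)
j → j with probability 0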
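The step probabilities of the walk can be recomputed with exact fractions. The sketch assumes that each vertex of G has exactly σ candidate images, that an endpoint of the unmatched edge is picked with probability 1/2, and that its new image is uniform among the σ−1 other candidates; the function names are illustrative.

```python
from fractions import Fraction

def step_probabilities(sigma, disagreements):
    """(down, stay, up) probabilities for one step on an unmatched edge {u, v}.

    disagreements: number of endpoints (1 or 2) on which theta_i and theta_opt differ.
    """
    pick = Fraction(1, 2)         # probability of picking a given endpoint
    hit = Fraction(1, sigma - 1)  # probability that the new image equals theta_opt's
    if disagreements == 1:
        down = pick               # the agreeing endpoint is picked and changed
        up = pick * hit           # the disagreeing endpoint is picked and fixed
    elif disagreements == 2:
        down = Fraction(0)        # there is no agreement to lose
        up = 2 * pick * hit       # either endpoint may be fixed
    else:
        raise ValueError("an unmatched edge has 1 or 2 disagreeing endpoints")
    return down, 1 - down - up, up

def pessimistic_probabilities(sigma):
    """(down, stay, up) for the pessimistic process: every non-improving step moves down."""
    up = Fraction(1, 2 * sigma - 2)
    return 1 - up, Fraction(0), up

for sigma in (3, 4, 5):
    print(sigma, step_probabilities(sigma, 1), step_probabilities(sigma, 2),
          pessimistic_probabilities(sigma))
```

In both real cases the probability of moving up is at least 1/(2σ−2), which is why the pessimistic process can be used as a lower bound on progress.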

(28)

Exact colorful instances: Random walk

Useful bounds

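Let r_j be the probability of exactly k "moves down" and j+k "moves up" in a sequence of 2k+j moves:

$$r_j \ge \left(\frac{2\sigma-3}{2\sigma-2}\right)^{k} \left(\frac{1}{2\sigma-2}\right)^{j+k}$$

Let q_j be the probability that the algorithm finds an injective homomorphism within j+2k ≤ 3n_G steps, starting from a random injective mapping $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$:

$$q_j \ge \frac{\sqrt{3}}{8\sqrt{\pi j}} \left(\frac{27(2\sigma-3)}{4(2\sigma-2)^{3}}\right)^{j}$$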

(29)

Exact colorful instances: Random walk

Theorem

Algorithm Rand-Exact-Matching-Colors returns an injective homomorphism $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$ (if such a mapping exists) in $\tilde{O}(f(\sigma)^{n_G})$ expected time, where

$$f(\sigma) = \frac{4\sigma(2\sigma-2)^{3}}{4(2\sigma-2)^{3} + 27(2\sigma-3)}.$$

Notice

f(σ) < σ, for σ > 2.

f(3) < 2.279, f(4) < 3.460 and f(5) < 4.578.

ρ = 1.
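The numerical bounds can be rechecked with exact arithmetic. The sketch assumes the reading f(σ) = 4σ(2σ−2)³ / (4(2σ−2)³ + 27(2σ−3)) of the formula above; the function name `f` simply mirrors the slide.

```python
from fractions import Fraction

def f(sigma):
    """Base of the exponential running time, as an exact fraction."""
    cube = (2 * sigma - 2) ** 3
    return Fraction(4 * sigma * cube, 4 * cube + 27 * (2 * sigma - 3))

for sigma in (3, 4, 5):
    print(sigma, f(sigma), float(f(sigma)))
```

Since f(σ) equals σ multiplied by a factor strictly below 1, the base of the exponent always beats the trivial σ of exhaustive search over the candidate images.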

(31)

Hardness results

Theorem

The MAX–(3, 3)–MATCHING–COLORS problem is APX-hard even if both G and H are linear forests, and the MAX–(2, 2)–MATCHING–COLORS problem is APX-hard even if both G and H are trees.

Notice

It remains open, however, whether the MAX–(ρ, σ)–MATCHING–COLORS problem for linear forests G and H is polynomial-time solvable in case ρ < 3.

(34)

Bounded degree graphs: Intermediate problem

Max–Matching–With–Color–Constraints

• Input: A graph G together with a coloring mapping λ_G : V(G) → {c₁, c₂, . . . , c_m}, and a symmetric matrix A = [a_i,j] of order m whose entries are natural integers.

• Solution: A matching M ⊆ E(G) such that, for 1 ≤ i ≤ j ≤ m, the number of edges in M having one end-vertex colored c_i and one end-vertex colored c_j is at most a_i,j.

• Measure: The size of the matching, i.e., #M.
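A feasibility check for a candidate solution makes the constraint explicit. In this sketch colors are encoded as indices 0, . . . , m−1 and A as a nested list; these encodings and the function name are assumptions.

```python
from collections import Counter

def is_feasible(matching, g_edges, color, bound):
    """Check that `matching` is a matching of G respecting the color-pair bounds."""
    edge_set = {frozenset(e) for e in g_edges}
    used = set()
    for u, v in matching:
        if frozenset((u, v)) not in edge_set:
            return False          # not an edge of G
        if u in used or v in used:
            return False          # two chosen edges share a vertex
        used.update((u, v))
    # Count chosen edges per unordered pair of end-vertex colors (i <= j).
    pair_count = Counter(tuple(sorted((color[u], color[v]))) for u, v in matching)
    return all(n <= bound[i][j] for (i, j), n in pair_count.items())
```

The measure of a feasible solution is then simply `len(matching)`.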

(35)

Bounded degree graphs: Intermediate problem

Theorem

The MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem is NP-complete but is approximable within ratio 3/2 + ε, for any ε > 0.

Proof.

Approximation-preserving reduction to MAXIMUM B-SET PACKING.

MAXIMUM SET PACKING is defined as follows: given a collection S of finite subsets of a ground set X, find a maximum cardinality collection of pairwise disjoint sets S′ ⊆ S.

MAXIMUM B-SET PACKING is the variation of MAXIMUM SET PACKING in which the cardinality of every set in S is bounded from above by a constant B ≥ 3.

(36)

Bounded degree graphs

Chromatic index

An edge coloring of a graph G is proper if no two adjacent edges are assigned the same color.

The smallest number of colors needed in a proper edge coloring of a graph G is the chromatic index χ′(G).

Vizing's theorem states that χ′(G) ≤ ∆(G) + 1 and that such an edge coloring can be found in polynomial time.

Petersen graph: χ′(G) = ∆(G) + 1 = 4
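The definition of a proper edge coloring, and the fact that each of its color classes is a matching, can be illustrated with a short checker. This is not an implementation of Vizing's algorithm; the function names and the triangle example are illustrative assumptions.

```python
from collections import defaultdict

def is_proper_edge_coloring(edge_color):
    """edge_color maps each edge (u, v) to a color; proper = no vertex sees a color twice."""
    seen = set()
    for (u, v), c in edge_color.items():
        if (u, c) in seen or (v, c) in seen:
            return False
        seen.add((u, c))
        seen.add((v, c))
    return True

def color_classes(edge_color):
    """Group edges by color; in a proper coloring every class is a matching."""
    classes = defaultdict(list)
    for edge, c in edge_color.items():
        classes[c].append(edge)
    return dict(classes)

# A triangle has maximum degree 2 but needs 3 = Delta + 1 colors.
triangle = {("a", "b"): 0, ("b", "c"): 1, ("a", "c"): 2}
print(is_proper_edge_coloring(triangle), color_classes(triangle))
```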

(37)

Bounded degree graphs

Theorem

For any ρ and σ, the MAX–(ρ, σ)–MATCHING–COLORS problem is approximable within ratio 3/2(∆_min + 1) + ε for any ε > 0, where ∆_min = min{∆(G), ∆(H)}.

Key elements

Chromatic index.

Vizing's theorem.

Iteratively using the (3/2 + ε)-approximation algorithm for instances of the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem.

(38)

Bounded degree graphs

Proof.

1. ∆_min = ∆(H):

1 H admits a proper edge coloring with at most ∆(H) + 1 colors, say {c′₁, c′₂, . . . , c′_∆(H)+1}.

2 For 1 ≤ i ≤ ∆(H) + 1,

  1 let H_i be the graph obtained from H by deleting all edges but those colored with color c′_i; note that H_i is a matching.

  2 Using the (3/2 + ε)-approximation algorithm for the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem, we obtain a (3/2 + ε)-approximation algorithm for the new instance of the MAX–(ρ, σ)–MATCHING–COLORS problem obtained by replacing H by H_i.

3 Returning the best one of these ∆(H) + 1 mappings yields an approximation algorithm with performance ratio 3/2(∆(H) + 1) + ε.

(39)

Bounded degree graphs

Proof.

2. ∆_min = ∆(G):

1 G admits a proper edge coloring with at most ∆(G) + 1 colors, say {c′₁, c′₂, . . . , c′_∆(G)+1}.

2 For 1 ≤ i ≤ ∆(G) + 1,

  1 let G_i be the graph obtained from G by deleting all edges but those colored with color c′_i; note that G_i is a matching.

  2 Using the (3/2 + ε)-approximation algorithm for the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem, we obtain a (3/2 + ε)-approximation algorithm for the new instance of the MAX–(ρ, σ)–MATCHING–COLORS problem obtained by replacing G by G_i.

3 Returning the best one of these ∆(G) + 1 mappings yields an approximation algorithm with performance ratio 3/2(∆(G) + 1) + ε.

(40)

Bounded degree graphs

Proof.

3. Combining

∆_min = ∆(H): (3/2(∆(H) + 1) + ε)-approximation algorithm.

∆_min = ∆(G): (3/2(∆(G) + 1) + ε)-approximation algorithm.

yields

a (3/2(∆_min + 1) + ε)-approximation algorithm for any ε > 0, where ∆_min = min{∆(G), ∆(H)}.
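One way to see where the factor ∆ + 1 comes from is the following pigeonhole sketch for the case ∆_min = ∆(H); the symbols opt, opt_i, sol and ε′ are introduced here only for illustration, and the case ∆_min = ∆(G) is symmetric.

```latex
% opt   : optimum for (G, H);  opt_i : optimum for (G, H_i);  sol : value returned.
% The opt edges matched by an optimal mapping land on opt distinct edges of H,
% which are spread over at most \Delta(H)+1 color classes, so some class i satisfies
\[
  \mathrm{opt}_i \;\ge\; \frac{\mathrm{opt}}{\Delta(H)+1}.
\]
% Applying the (3/2+\varepsilon')-approximation to (G, H_i) and keeping the best mapping:
\[
  \mathrm{sol} \;\ge\; \frac{\mathrm{opt}_i}{3/2+\varepsilon'}
  \;\ge\; \frac{\mathrm{opt}}{(3/2+\varepsilon')(\Delta(H)+1)}
  \;=\; \frac{\mathrm{opt}}{\tfrac{3}{2}(\Delta(H)+1)+\varepsilon},
  \qquad \varepsilon = \varepsilon'(\Delta(H)+1).
\]
```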

(42)

A randomized algorithm

Definitions

Let G be a graph and λ_G : V(G) → C be a coloring mapping of G.

A legal (ℓ₁, ℓ₂)-labeling of G is an assignment of the labels {ℓ₁, ℓ₂} to the vertices of G such that, for each color c_i ∈ C, either ⌊#C_G(c_i)/2⌋ or ⌈#C_G(c_i)/2⌉ vertices in C_G(c_i) are labeled ℓ₁.

The cut induced by a legal (ℓ₁, ℓ₂)-labeling is the set of edges that have one end-vertex with label ℓ₁ and one end-vertex with label ℓ₂.
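These definitions can be sketched in Python. The labels ℓ₁ and ℓ₂ are encoded as 1 and 2, and the way the random labeling is drawn (a fair coin between floor and ceiling, then a random subset of each color class) is an assumption for illustration, not necessarily the distribution used in the talk.

```python
import random
from collections import defaultdict

def color_classes_of(color_g):
    """Group the vertices of G by color."""
    classes = defaultdict(list)
    for v, c in color_g.items():
        classes[c].append(v)
    return classes

def random_legal_labeling(color_g, seed=0):
    """Label floor(n/2) or ceil(n/2) vertices of each color class with 1, the rest with 2."""
    rng = random.Random(seed)
    label = {}
    for vertices in color_classes_of(color_g).values():
        rng.shuffle(vertices)
        n = len(vertices)
        k = n // 2 if rng.random() < 0.5 else n - n // 2  # floor or ceiling
        for i, v in enumerate(vertices):
            label[v] = 1 if i < k else 2
    return label

def is_legal(color_g, label):
    """Check the floor/ceiling condition for every color class."""
    for vertices in color_classes_of(color_g).values():
        n = len(vertices)
        ones = sum(label[v] == 1 for v in vertices)
        if ones not in (n // 2, n - n // 2):
            return False
    return True

def induced_cut(g_edges, label):
    """Edges with one end-vertex labeled 1 and the other labeled 2."""
    return [(u, v) for u, v in g_edges if label[u] != label[v]]
```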

(43)

A randomized algorithm

[Figure: a legal (ℓ₁, ℓ₂)-labeling of a colored graph, showing the ℓ₁-subset, the ℓ₂-subset and the (ℓ₁, ℓ₂)-cut edges.]

(47)

A randomized algorithm

Theorem

There exists a randomized algorithm for the MAX–(ρ, σ)–MATCHING–COLORS problem with expected performance ratio 4σ.

Key elements

Random (ℓ₁, ℓ₂)-labeling.

Random mapping $\theta : V(G) \xrightarrow{\lambda_G,\lambda_H} V(H)$.

Maximum weighted matching in bipartite graphs.

(48)

A randomized algorithm

[Figure sequence: the algorithm illustrated step by step on a pattern graph (G, λ_G) and a target graph (H, λ_H).]

Pattern graph (G, λ_G), split into its ℓ₁-labeled and ℓ₂-labeled vertices; target graph (H, λ_H).

Optimal solution θ_opt.

Random mapping θ_rand.

Vertex u: θ_opt(u) = θ_rand(u).

Vertex v: ∀w ∈ V(G)|ℓ₁, θ_rand(w) ≠ θ_opt(v).

Weighted matching M.

Solution θ_sol = θ_rand + M.

(56)

Linear Forests

Theorem

The MAX–(3, 3)–MATCHING–COLORS problem is APX-hard even if both G and H are linear forests.

Theorem

For any ρ and σ, the MAX–(ρ, σ)–MATCHING–COLORS problem is approximable within ratio 4 in case both G and H are linear forests.

Key elements

Balanced 2-intervals.

Weighted independent set.

(57)

Linear Forests

Definitions

A 2-interval D = (I, J) is the union of two disjoint intervals I and J defined over a single line.

A 2-interval D = (I, J) is said to be balanced if |I| = |J|.

Two 2-intervals D₁ = (I₁, J₁) and D₂ = (I₂, J₂) are disjoint if they share no common point.
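The three definitions can be sketched with a small data type. Closed integer intervals, with |I| counted as the number of integer points, are an assumption made for this illustration, as are all names below.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    start: int
    end: int  # inclusive

    def length(self):
        return self.end - self.start + 1

def overlaps(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a.start <= b.end and b.start <= a.end

class TwoInterval(NamedTuple):
    left: Interval
    right: Interval

def is_two_interval(d):
    """The two components of a 2-interval must themselves be disjoint."""
    return not overlaps(d.left, d.right)

def is_balanced(d):
    return d.left.length() == d.right.length()

def are_disjoint(d1, d2):
    """Two 2-intervals are disjoint if no component of one meets a component of the other."""
    return not any(overlaps(a, b) for a in d1 for b in d2)
```

A conflict graph whose vertices are weighted 2-intervals and whose edges join non-disjoint pairs would then turn the selection problem into a weighted independent set problem.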

(58)

Linear Forests

[Figure: pattern graph (G, λ_G) with paths P₁^G and P₂^G, and target graph (H, λ_H) with paths P₁^H, P₂^H and P₃^H; the paths are then listed in the order P₁^G, P₂^G, P₁^H, P₂^H, P₃^H.]

(59)

Linear Forests

Theorem (Crochemore, Hermelin, Landau, Rawitz and V., 2006)

There exists a polynomial-time algorithm with performance ratio 4 for finding a maximum weight subset of disjoint 2-intervals in a set of weighted balanced 2-intervals.

Key elements

Local ratio technique.

r-effective weight function.

Theorem

For any ρ and σ, the MAX–(ρ, σ)–MATCHING–COLORS problem is approximable within ratio 4 in case both G and H are linear forests.

(61)

Future works

Improve the random walk algorithm for the EXACT–(ρ, σ)–MATCHING–COLORS problem.

What about ρ ≥ 2 . . . ?

Improve the approximation ratio for bounded degree graphs.

Design a better (randomized?) approximation algorithm for the MAX–(ρ, σ)–MATCHING–COLORS problem.

Is the MAX–(ρ, σ)–MATCHING–COLORS problem approximable within ratio σ?