Pattern Matching in Protein-Protein Interaction Graphs
Gaëlle Brevier (Université de Grenoble, France), Romeo Rizzi (Università di Udine, Italy), Stéphane Vialette (Université Paris-Est, France)
Lisbon, September 19, 2008
Outline
1 Introduction
2 Exact colorful instances
3 Hardness results
4 Approximation algorithms
   Bounded degree graphs
   A randomized algorithm
   Linear forests
5 Future works
Introduction
Protein interactions identified on a genome-wide scale are commonly visualized as protein interaction graphs, where proteins are vertices and interactions are edges.
Gene or Protein Interactions Databases
BioGRID - A Database of Genetic and Physical Interactions
DIP - Database of Interacting Proteins
MINT - A Molecular Interactions Database
IntAct - EMBL-EBI Protein Interaction
MIPS - Comprehensive Yeast Protein-Protein interactions
Yeast Protein Interactions - Yeast two-hybrid results from Fields' group
PathCalling - A yeast protein interaction database by Curagen
SPiD - Bacillus subtilis Protein Interaction Database
AllFuse - Functional Associations of Proteins in Complete Genomes
BRITE - Biomolecular Relations in Information Transmission and Expression
ProMesh - A Protein-Protein Interaction Database
The PIM Database - by Hybrigenics
Mouse Protein-Protein interactions
Human herpesvirus 1 Protein-Protein interactions
Human Protein Reference Database
BOND - The Biomolecular Object Network Databank. Former BIND
MDSP - Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry
Protcom - Database of protein-protein complexes enriched with the domain-domain structures
Proteins that interact with GroEL and factors that affect their release
YPD™ - Yeast Proteome Database by Incyte
...
Introduction
Comparative analysis of protein-protein interaction graphs aims at finding complexes that are common to different species.
Mounting evidence suggests that proteins that function together in a pathway or a structural complex are likely to evolve in a correlated fashion.
Introduction
Pattern matching in protein-protein interaction graphs
Finding a protein complex in another protein network.
Graph matching
Focus on mappings that preserve adjacencies (to deal with interaction datasets that are missing many true protein interactions).
Injective list homomorphisms and optimization
State-of-the-art approaches to identifying orthologs (genes in different species that originate from a single gene in the last common ancestor of these species).
Putative orthologs are represented by colors.
Introduction: Searching for an exact occurrence
[Figure, built up over several slides: pattern graph $(G, \lambda_G)$ with $\mathrm{mult}(G, \lambda_G) = 2$ and target graph $(H, \lambda_H)$ with $\mathrm{mult}(H, \lambda_H) = 5$, together with an injective mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$.]
Introduction: Searching for the best occurrence
[Figure, built up over several slides: pattern graph $(G, \lambda_G)$ with $\mathrm{mult}(G, \lambda_G) = 2$ and target graph $(H, \lambda_H)$ with $\mathrm{mult}(H, \lambda_H) = 4$, together with an injective mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$.]
Problem
MAX–$(\rho, \sigma)$–MATCHING–COLORS
• Input: Two graphs $G$ and $H$, and coloring mappings $\lambda_G : V(G) \to C$ with $\mathrm{mult}(G, \lambda_G) = \rho$ and $\lambda_H : V(H) \to C$ with $\mathrm{mult}(H, \lambda_H) = \sigma$.
• Solution: An injective mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$.
• Measure: The number of edges of $G$ matched by the injective mapping $\theta$.
EXACT–$(\rho, \sigma)$–MATCHING–COLORS is the extremal problem of finding an injective mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$ that matches all the edges of $G$.
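The measure can be made concrete with a minimal Python sketch. It rests on assumptions that the slide does not spell out: the labeled arrow is read as "injective and color-preserving", an edge {u, v} of G counts as matched when {θ(u), θ(v)} is an edge of H, and mult is taken to be the largest number of vertices sharing one color. The function names and the toy instance are illustrative only.

```python
from collections import Counter

def mult(coloring):
    """Largest number of vertices sharing one color (assumed meaning of mult)."""
    return max(Counter(coloring.values()).values())

def is_valid_mapping(theta, color_g, color_h):
    """True if theta is injective and maps every vertex of G to a same-colored vertex of H."""
    injective = len(set(theta.values())) == len(theta)
    color_ok = all(color_g[u] == color_h[theta[u]] for u in color_g)
    return injective and color_ok

def matched_edges(theta, edges_g, edges_h):
    """Number of edges {u, v} of G such that {theta(u), theta(v)} is an edge of H."""
    h_edges = {frozenset(e) for e in edges_h}
    return sum(frozenset((theta[u], theta[v])) in h_edges for u, v in edges_g)

# Toy instance (hypothetical): G is a triangle, H is a path on four vertices.
color_g = {"u1": "red", "u2": "blue", "u3": "red"}
edges_g = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]
color_h = {"v1": "red", "v2": "blue", "v3": "red", "v4": "red"}
edges_h = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
theta = {"u1": "v1", "u2": "v2", "u3": "v3"}
```

On this toy instance the mapping is valid and matches two of the three edges of G, so it is a solution of the MAX problem but not an exact occurrence.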
Introduction
Trim instance
An instance of the MAX–$(\rho, \sigma)$–MATCHING–COLORS or the EXACT–$(\rho, \sigma)$–MATCHING–COLORS problem is said to be trim if the following conditions hold:
1 for each color $c_i \in C$, $\#C_G(c_i) \le \#C_H(c_i)$, and
2 for each edge $\{u_i, u_j\} \in E(G)$, there exists an edge $\{v_i, v_j\} \in E(H)$ such that $\lambda_G(u_i) = \lambda_H(v_i)$ and $\lambda_G(u_j) = \lambda_H(v_j)$.
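Both conditions are easy to test directly. The sketch below assumes that $C_G(c_i)$ denotes the set of vertices of G colored $c_i$ (and likewise for H); the function name `is_trim` and the dictionary-based graph encoding are choices made for this example, not part of the talk.

```python
from collections import Counter

def is_trim(color_g, edges_g, color_h, edges_h):
    """Check the two trim conditions for an instance given as color dicts and edge lists."""
    count_g = Counter(color_g.values())
    count_h = Counter(color_h.values())
    # Condition 1: every color occurs in G at most as often as in H.
    if any(count_g[c] > count_h[c] for c in count_g):
        return False
    # Condition 2: every edge of G has an edge of H with the same pair of end-vertex colors.
    color_pairs_h = {frozenset((color_h[a], color_h[b])) for a, b in edges_h}
    return all(frozenset((color_g[u], color_g[v])) in color_pairs_h
               for u, v in edges_g)
```

An edge of G that fails condition 2 can never be matched, so it can be discarded before searching.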
Related works in the context
List injective homomorphisms for protein graphs
[Fagnot, Lelandais and V., 2007; Fertin, Rizzi and V., 2005].
Reaction motifs in metabolic networks
[Lacroix, Fernandes and Sagot, 2006; Hermelin, Fellows, Fertin and V., 2007].
QPath
[Shlomi, Segal, Ruppin and Sharan, 2006].
Path Matching and Graph Matching in Biological Networks [Yang and Sze, 2007].
Exact colorful instances
Theorem (Fagnot, Lelandais and V., 2007)
Both the EXACT–$(1, \sigma)$–MATCHING–COLORS problem for $\Delta(G) \le 2$ and the EXACT–$(\rho, 2)$–MATCHING–COLORS problem are solvable in polynomial time for any constants $\rho$ and $\sigma$.
Theorem (Fertin, Rizzi and V., 2005)
The EXACT–$(1, 3)$–MATCHING–COLORS problem for $\Delta(G) = 3$ and $\Delta(H) = 4$ is NP-complete.
We focus here on the EXACT–$(1, \sigma)$–MATCHING–COLORS problem.
Exact colorful instances
Algorithm 1: Rand-Exact-Matching-Colors
begin
  Repeat, terminating when an occurrence of $G$ in $H$ w.r.t. $\lambda_G$ and $\lambda_H$ is found:
    Let $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$ be a random injective mapping.
    Repeat up to $3n_G$ times, terminating when an occurrence of $G$ in $H$ w.r.t. $\lambda_G$ and $\lambda_H$ is found:
      (1) Choose at random an edge $e \in E(G)$ that is not matched by $\theta$.
      (2) Choose at random one vertex $u \in e$.
      (3) Change at random the value of $\theta(u)$ w.r.t. $\lambda_G$ and $\lambda_H$.
end
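Steps (1) to (3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes $\rho = 1$ (no color repeats in G, so every color-preserving mapping is automatically injective), a trim instance, and a hypothetical cap `restarts` on the number of random restarts, which the slide does not specify.

```python
import random

def rand_exact_matching_colors(color_g, edges_g, color_h, edges_h,
                               restarts=200, rng=None):
    """Random-walk search for a color-preserving mapping that matches every edge of G.

    Assumes rho = 1 and a trim instance; `restarts` is a hypothetical cap.
    Returns a mapping as a dict, or None if none was found.
    """
    rng = rng or random.Random()
    candidates = {}                      # color -> vertices of H with that color
    for v, c in color_h.items():
        candidates.setdefault(c, []).append(v)
    h_edges = {frozenset(e) for e in edges_h}

    def unmatched(theta):
        return [(u, v) for u, v in edges_g
                if frozenset((theta[u], theta[v])) not in h_edges]

    for _ in range(restarts):
        # Random color-preserving starting point.
        theta = {u: rng.choice(candidates[c]) for u, c in color_g.items()}
        for _ in range(3 * len(color_g)):
            bad = unmatched(theta)
            if not bad:
                return theta
            u = rng.choice(rng.choice(bad))          # steps (1) and (2)
            others = [v for v in candidates[color_g[u]] if v != theta[u]]
            if others:                               # step (3)
                theta[u] = rng.choice(others)
        if not unmatched(theta):
            return theta
    return None

# Toy instance (hypothetical): G is a path a-b-c, H contains exactly one occurrence.
color_g = {"u1": "a", "u2": "b", "u3": "c"}
edges_g = [("u1", "u2"), ("u2", "u3")]
color_h = {"a1": "a", "a2": "a", "b1": "b", "b2": "b", "c1": "c", "c2": "c"}
edges_h = [("a1", "b1"), ("b1", "c2"), ("a2", "b2")]
```

When the search succeeds on this instance it can only return the mapping u1 to a1, u2 to b1, u3 to c2, since that is the only occurrence of G in H.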
Exact colorful instances: Random walk
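Particle moving along the integer line $0, \ldots, j-1, j, j+1, \ldots, n_G$
Fix an optimal solution $\theta_{opt}$.
$\theta_i$ and $\theta_{opt}$ agree on exactly $j$ vertices.
$e = \{u, v\} \in E(G)$ is a random edge that is not matched by $\theta_i$.
Case: $\theta_i$ and $\theta_{opt}$ disagree on exactly one of $u$ and $v$.
Transition probabilities from position $j$:
  to $j-1$: $\frac{\sigma-1}{2\sigma-2}$
  to $j+1$: $\frac{1}{2\sigma-2}$
  stay at $j$: $\frac{\sigma-2}{2\sigma-2}$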
Exact colorful instances: Random walk
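Particle moving along the integer line $0, \ldots, j-1, j, j+1, \ldots, n_G$
Fix an optimal solution $\theta_{opt}$.
$\theta_i$ and $\theta_{opt}$ agree on exactly $j$ vertices.
$e = \{u, v\} \in E(G)$ is a random edge that is not matched by $\theta_i$.
Case: $\theta_i$ and $\theta_{opt}$ disagree on both $u$ and $v$.
Transition probabilities from position $j$:
  to $j-1$: $0$
  to $j+1$: $\frac{2}{2\sigma-2}$
  stay at $j$: $\frac{2\sigma-4}{2\sigma-2}$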
Exact colorful instances: Random walk
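Particle moving along the integer line $0, \ldots, j-1, j, j+1, \ldots, n_G$
Pessimistic stochastic process $(Y_1, Y_2, \ldots)$
Transition probabilities from position $j$:
  to $j-1$: $\frac{2\sigma-3}{2\sigma-2}$
  to $j+1$: $\frac{1}{2\sigma-2}$
  stay at $j$: $0$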
Exact colorful instances: Random walk
Useful bounds
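Let $r_j$ be the probability of exactly $k$ "moves down" and $j+k$ "moves up" in a sequence of $2k+j$ moves:
$$r_j \ge \left(\frac{2\sigma-3}{2\sigma-2}\right)^{k} \left(\frac{1}{2\sigma-2}\right)^{j+k}$$
Let $q_j$ be the probability that the algorithm finds an injective homomorphism within $j+2k \le 3n_G$ steps, starting from a random injective mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$:
$$q_j \ge \frac{\sqrt{3}}{8\sqrt{\pi j}} \left(\frac{27(2\sigma-3)}{4(2\sigma-2)^3}\right)^{j}$$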
Exact colorful instances: Random walk
Theorem
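Algorithm Rand-Exact-Matching-Colors returns an injective homomorphism $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$ (if such a mapping exists) in $\tilde{O}(f(\sigma)^{n_G})$ expected time, where
$$f(\sigma) = \frac{4\sigma(2\sigma-2)^3}{4(2\sigma-2)^3 + 27(2\sigma-3)}.$$
Notice
$f(\sigma) < \sigma$ for $\sigma > 2$.
$f(3) < 2.279$, $f(4) < 3.460$ and $f(5) < 4.578$.
$\rho = 1$.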
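The numerical bounds quoted above can be checked with exact rational arithmetic. The short Python sketch below evaluates $f(\sigma)$ as given in the theorem; the function name `f` mirrors the slide, everything else is illustrative.

```python
from fractions import Fraction

def f(sigma):
    """Base of the exponential running-time bound, as an exact fraction."""
    a = 4 * (2 * sigma - 2) ** 3
    return Fraction(sigma * a, a + 27 * (2 * sigma - 3))

for sigma in (3, 4, 5):
    print(sigma, f(sigma), float(f(sigma)))
```

For instance, $f(3) = 768/337$, which lies just below 2.279 and well below the trivial base $\sigma = 3$.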
Hardness results
Theorem
The MAX–$(3, 3)$–MATCHING–COLORS problem is APX-hard even if both $G$ and $H$ are linear forests, and the MAX–$(2, 2)$–MATCHING–COLORS problem is APX-hard even if both $G$ and $H$ are trees.
Notice
It remains open, however, whether the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem for linear forests $G$ and $H$ is polynomial-time solvable in case $\rho < 3$.
Bounded degree graphs: Intermediate problem
MAX–MATCHING–WITH–COLOR–CONSTRAINTS
• Input: A graph $G$ together with a coloring mapping $\lambda_G : V(G) \to \{c_1, c_2, \ldots, c_m\}$, and a symmetric matrix $A = [a_{i,j}]$ of order $m$ whose entries are natural numbers.
• Solution: A matching $M \subseteq E(G)$ such that, for $1 \le i \le j \le m$, the number of edges in $M$ having one end-vertex colored $c_i$ and one end-vertex colored $c_j$ is at most $a_{i,j}$.
• Measure: The size of the matching, i.e., $\#M$.
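To illustrate what a feasible solution looks like, here is a small Python checker. It is only a sketch: colors are encoded as integer indices, the matrix $A$ is given as a dictionary keyed by pairs $(i, j)$ with $i \le j$, and missing entries are treated as 0. All of these encoding choices are assumptions of the example.

```python
from collections import Counter

def is_feasible(matching, color, bound):
    """True if `matching` is a matching and respects the color-pair bounds.

    matching: list of edges (u, v)
    color:    dict vertex -> color index
    bound:    dict (i, j) with i <= j -> a_ij (missing entries count as 0)
    """
    seen = set()
    for u, v in matching:
        if u == v or u in seen or v in seen:
            return False                 # two chosen edges share an end-vertex
        seen.update((u, v))
    used = Counter(tuple(sorted((color[u], color[v]))) for u, v in matching)
    return all(n <= bound.get(pair, 0) for pair, n in used.items())
```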
Bounded degree graphs: Intermediate problem
Theorem
The MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem is NP-complete but is approximable within ratio $3/2 + \varepsilon$, for any $\varepsilon > 0$.
Proof.
Approximation-preserving reduction to MAXIMUM B-SET PACKING.
MAXIMUM SET PACKING is defined as follows: given a collection $S$ of finite subsets of a ground set $X$, find a maximum-cardinality collection of pairwise disjoint sets $S' \subseteq S$.
MAXIMUM B-SET PACKING is the variation of MAXIMUM SET PACKING in which the cardinality of every set in $S$ is bounded from above by a constant $B \ge 3$.
Bounded degree graphs
Chromatic index
An edge coloring of a graph $G$ is proper if no two adjacent edges are assigned the same color.
The smallest number of colors needed in a proper edge coloring of a graph $G$ is the chromatic index $\chi'(G)$.
Vizing's theorem states that $\chi'(G) \le \Delta(G) + 1$, and an edge coloring with at most $\Delta(G) + 1$ colors can be found in polynomial time.
Petersen graph: $\chi'(G) = \Delta(G) + 1 = 4$
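The Petersen example can be checked by brute force. The sketch below uses plain backtracking, which is exponential in general and is not the polynomial-time procedure behind Vizing's theorem; it is only meant to confirm on this small graph that three colors do not suffice while four do. The vertex numbering (outer cycle 0..4, inner pentagram 5..9) is an arbitrary choice for the example.

```python
def proper_edge_coloring(edges, k):
    """Backtracking search for a proper edge coloring with k colors, or None."""
    coloring = {}

    def free(e, c):
        # Color c is free for e if no already colored adjacent edge uses it.
        return all(coloring[f] != c for f in coloring if set(e) & set(f))

    def extend(i):
        if i == len(edges):
            return True
        for c in range(k):
            if free(edges[i], c):
                coloring[edges[i]] = c
                if extend(i + 1):
                    return True
                del coloring[edges[i]]
        return False

    return coloring if extend(0) else None

# Petersen graph: outer 5-cycle, five spokes, inner pentagram.
petersen = ([(i, (i + 1) % 5) for i in range(5)]
            + [(i, i + 5) for i in range(5)]
            + [(5 + i, 5 + (i + 2) % 5) for i in range(5)])
```

In any proper edge coloring each color class is a matching, which is the property used when the target (or pattern) graph is split into matchings.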
Bounded degree graphs
Theorem
For any $\rho$ and $\sigma$, the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem is approximable within ratio $\frac{3}{2}(\Delta_{\min} + 1) + \varepsilon$ for any $\varepsilon > 0$, where $\Delta_{\min} = \min\{\Delta(G), \Delta(H)\}$.
Key elements
Chromatic index.
Vizing's theorem.
Iteratively using the $(3/2 + \varepsilon)$-approximation algorithm for instances of the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem.
Bounded degree graphs
Proof.
1. $\Delta_{\min} = \Delta(H)$:
1 $H$ admits a proper edge coloring with at most $\Delta(H) + 1$ colors, say $\{c'_1, c'_2, \ldots, c'_{\Delta(H)+1}\}$.
2 For $1 \le i \le \Delta(H) + 1$,
  1 let $H_i$ be the graph obtained from $H$ by deleting all edges but those colored with color $c'_i$; note that $H_i$ is a matching.
  2 Using the $(3/2 + \varepsilon)$-approximation algorithm for the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem, we obtain a $(3/2 + \varepsilon)$-approximation algorithm for the new instance of the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem obtained by replacing $H$ by $H_i$.
3 Returning the best of these $\Delta(H) + 1$ mappings yields an approximation algorithm with performance ratio $\frac{3}{2}(\Delta(H) + 1) + \varepsilon$.
Bounded degree graphs
Proof.
2. $\Delta_{\min} = \Delta(G)$:
1 $G$ admits a proper edge coloring with at most $\Delta(G) + 1$ colors, say $\{c'_1, c'_2, \ldots, c'_{\Delta(G)+1}\}$.
2 For $1 \le i \le \Delta(G) + 1$,
  1 let $G_i$ be the graph obtained from $G$ by deleting all edges but those colored with color $c'_i$; note that $G_i$ is a matching.
  2 Using the $(3/2 + \varepsilon)$-approximation algorithm for the MAX–MATCHING–WITH–COLOR–CONSTRAINTS problem, we obtain a $(3/2 + \varepsilon)$-approximation algorithm for the new instance of the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem obtained by replacing $G$ by $G_i$.
3 Returning the best of these $\Delta(G) + 1$ mappings yields an approximation algorithm with performance ratio $\frac{3}{2}(\Delta(G) + 1) + \varepsilon$.
Bounded degree graphs
Proof.
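3. Combining
$\Delta_{\min} = \Delta(H)$: $\left(\frac{3}{2}(\Delta(H) + 1) + \varepsilon\right)$-approximation algorithm,
$\Delta_{\min} = \Delta(G)$: $\left(\frac{3}{2}(\Delta(G) + 1) + \varepsilon\right)$-approximation algorithm,
yields a $\left(\frac{3}{2}(\Delta_{\min} + 1) + \varepsilon\right)$-approximation algorithm for any $\varepsilon > 0$, where $\Delta_{\min} = \min\{\Delta(G), \Delta(H)\}$.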
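The counting step behind the factor $\Delta + 1$ is not spelled out on the slides. One way to see it for the case $\Delta_{\min} = \Delta(H)$ is sketched below, where $\mathrm{OPT}$ denotes the optimum for $(G, H)$, $\mathrm{OPT}_i$ the optimum when $H$ is replaced by $H_i$, and $\mathrm{ALG}_i$ the value returned for that instance; these three symbols are introduced here for illustration only.

```latex
\begin{align*}
\mathrm{OPT}
  &\le \sum_{i=1}^{\Delta(H)+1} \mathrm{OPT}_i
  && \text{(each edge of $H$ used by an optimal mapping lies in exactly one $H_i$)} \\
  &\le (\Delta(H)+1)\,\max_i \mathrm{OPT}_i \\
  &\le (\Delta(H)+1)\left(\tfrac{3}{2}+\varepsilon\right)\max_i \mathrm{ALG}_i .
\end{align*}
```

Since $H_i \subseteq H$, the best of the returned mappings matches at least $\max_i \mathrm{ALG}_i$ edges in the original instance, and rescaling $\varepsilon$ by the factor $\Delta(H)+1$ gives the stated ratio. The case $\Delta_{\min} = \Delta(G)$ is symmetric, with the $G_i$ in place of the $H_i$.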
A randomized algorithm
Definitions
Let $G$ be a graph and $\lambda_G : V(G) \to C$ be a coloring mapping of $G$.
A legal $(\ell_1, \ell_2)$-labeling of $G$ is an assignment of the labels $\{\ell_1, \ell_2\}$ to the vertices of $G$ such that, for each color $c_i \in C$, either $\lfloor \#C_G(c_i)/2 \rfloor$ or $\lceil \#C_G(c_i)/2 \rceil$ vertices in $C_G(c_i)$ are labeled $\ell_1$.
The cut induced by a legal $(\ell_1, \ell_2)$-labeling is defined to be the set of edges that have one end-vertex with label $\ell_1$ and one end-vertex with label $\ell_2$.
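These two definitions translate directly into code. The Python sketch below draws a legal labeling at random and computes its cut; how the random labeling is drawn in the talk is not specified, so the choice made here (a fair coin between floor and ceiling for odd color classes, then a uniformly random subset) is an assumption, and the label strings "l1" and "l2" are placeholders.

```python
import random

def random_legal_labeling(color, rng):
    """Label about half of each color class 'l1' and the rest 'l2'."""
    by_color = {}
    for v, c in color.items():
        by_color.setdefault(c, []).append(v)
    labeling = {}
    for vertices in by_color.values():
        half = len(vertices) // 2
        if len(vertices) % 2 == 1:
            half += rng.randint(0, 1)        # floor or ceiling, chosen at random
        chosen = set(rng.sample(vertices, half))
        for v in vertices:
            labeling[v] = "l1" if v in chosen else "l2"
    return labeling

def induced_cut(edges, labeling):
    """Edges with one end-vertex labeled 'l1' and the other labeled 'l2'."""
    return [(u, v) for u, v in edges if labeling[u] != labeling[v]]
```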
A randomized algorithm
[Figure, built up over several slides: a legal $(\ell_1, \ell_2)$-labeling of a graph, showing the $\ell_1$-subset, the $\ell_2$-subset, and the $(\ell_1, \ell_2)$-cut edges between them.]
A randomized algorithm
Theorem
There exists a randomized algorithm for the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem with expected performance ratio $4\sigma$.
Key elements
Random $(\ell_1, \ell_2)$-labeling.
Random mapping $\theta : V(G) \xrightarrow{\lambda_G, \lambda_H} V(H)$.
Maximum weighted matching in bipartite graphs.
A randomized algorithm
[Figure, built up over several slides: pattern graph $(G, \lambda_G)$ split into its $\ell_1$-labeled and $\ell_2$-labeled vertices, and target graph $(H, \lambda_H)$. Successive overlays add the optimal solution $\theta_{opt}$, the random mapping $\theta_{rand}$, two vertices $u$ and $v$ with $\theta_{opt}(u) = \theta_{rand}(u)$ and $\theta_{rand}(w) \ne \theta_{opt}(v)$ for all $w \in V(G)|_{\ell_1}$, the weighted matching $M$, and the solution $\theta_{sol} = \theta_{rand} + M$.]
Linear Forests
Theorem
The MAX–$(3, 3)$–MATCHING–COLORS problem is APX-hard even if both $G$ and $H$ are linear forests.
Theorem
For any $\rho$ and $\sigma$, the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem is approximable within ratio $4$ in case both $G$ and $H$ are linear forests.
Key elements
Balanced 2-intervals.
Weighted independent set.
Linear Forests
Definitions
A 2-interval $D = (I, J)$ is the union of two disjoint intervals $I$ and $J$ defined over a single line.
A 2-interval $D = (I, J)$ is said to be balanced if $|I| = |J|$.
Two 2-intervals $D_1 = (I_1, J_1)$ and $D_2 = (I_2, J_2)$ are disjoint if they share no common point.
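A minimal Python sketch of these three definitions follows. It assumes closed intervals with numeric endpoints, stored as pairs `(start, end)` with $I$ lying entirely to the left of $J$, and takes the length of an interval to be `end - start`; the representation and function names are choices made for this example.

```python
def is_two_interval(d):
    """True if d = ((a1, b1), (a2, b2)) is a pair of disjoint intervals, I left of J."""
    (a1, b1), (a2, b2) = d
    return a1 <= b1 < a2 <= b2

def is_balanced(d):
    """True if both intervals of the 2-interval have the same length."""
    (a1, b1), (a2, b2) = d
    return b1 - a1 == b2 - a2

def are_disjoint(d1, d2):
    """True if no interval of d1 shares a point with an interval of d2."""
    def meet(p, q):
        return p[0] <= q[1] and q[0] <= p[1]
    return not any(meet(p, q) for p in d1 for q in d2)
```

Note that two disjoint 2-intervals may still interleave or nest, for example `((0, 2), (5, 7))` and `((3, 4), (8, 9))`.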
Linear Forests
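[Figure: pattern graph $(G, \lambda_G)$, a linear forest with paths $P_1^G$ and $P_2^G$, and target graph $(H, \lambda_H)$, a linear forest with paths $P_1^H$, $P_2^H$ and $P_3^H$; the paths are then listed in the order $P_1^G, P_2^G, P_1^H, P_2^H, P_3^H$.]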
Linear Forests
Theorem (Crochemore, Hermelin, Landau, Rawitz and V., 2006)
There exists a polynomial-time algorithm with performance ratio $4$ for finding a maximum weight subset of disjoint 2-intervals in a set of weighted balanced 2-intervals.
Key elements
Local ratio technique.
$r$-effective weight function.
Theorem
For any $\rho$ and $\sigma$, the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem is approximable within ratio $4$ in case both $G$ and $H$ are linear forests.
Future works
Improve the random walk algorithm for the EXACT–$(\rho, \sigma)$–MATCHING–COLORS problem. What about $\rho \ge 2$?
Improve the approximation ratio for bounded degree graphs.
Design a better (randomized?) approximation algorithm for the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem.
Is the MAX–$(\rho, \sigma)$–MATCHING–COLORS problem approximable within ratio $\sigma$?