HAL Id: inria-00090635
https://hal.inria.fr/inria-00090635v2
Submitted on 7 Sep 2006
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Lagrangian Approaches for a class of Matching Problems in Computational Biology
Nicola Yanev, Rumen Andonov, Philippe Veber, Stefan Balev
To cite this version:
Nicola Yanev, Rumen Andonov, Philippe Veber, Stefan Balev. Lagrangian Approaches for a class of Matching Problems in Computational Biology. [Research Report] RR-5973, INRIA. 2006. ⟨inria-00090635v2⟩
inria-00090635, version 2 - 7 Sep 2006
Research Report (Rapport de recherche)
ISSN 0249-6399, ISRN INRIA/RR--5973--FR+ENG
Thème BIO
N° 5973
31st August 2006
Unité de recherche INRIA Rennes
IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France)
Telephone: +33 2 99 84 71 00, Fax: +33 2 99 84 71 71
Nicola Yanev∗, Rumen Andonov†, Philippe Veber† and Stefan Balev‡
Theme BIO: Biological systems (Systèmes biologiques)
Project Symbiose
Research Report n° 5973, 31st August 2006, 18 pages
Abstract: This paper presents efficient algorithms for solving the problem of aligning a protein structure template to a query amino-acid sequence, known as the protein threading problem. We consider the problem as a special case of the graph matching problem. We give formal graph and integer programming models of the problem. After studying the properties of these models, we propose two kinds of Lagrangian relaxation for solving them. We present experimental results on real-life instances showing the efficiency of our approaches.
Key-words: sequence-structure alignment, complexity, integer programming, Lagrangian relaxation
∗ choby@math.bas.org
† prenom.nom@irisa.fr
‡ stefan.balev@univ-lehavre.fr
... of a class of matching problems in bioinformatics
Summary (translated from the French résumé): This article proposes efficient algorithms for determining the optimal alignment between a protein structure and a protein sequence, a problem known as protein threading. We pose this problem as a special case of matching. We present a formal model of the problem in the form of a family of graphs, together with the corresponding integer programs. We first study the properties of these models and then propose two Lagrangian relaxation approaches for solving them. Finally, using experimental data on real instances, we show the efficiency of these approaches.
Keywords (Mots-clés): sequence-structure alignment, complexity, integer programming, Lagrangian relaxation
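Figure 1: Matching interpretation of the sequence alignment problem. The sequences ACGCAA and AGTCT are drawn as the two sides of a bipartite graph; the alignment shown is ACG_CAA over A_GTC_T.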
1 Preliminaries
Matching is an important class of combinatorial optimization problems with many real-life applications. Matching problems involve choosing a subset of edges of a graph subject to degree constraints on the vertices. Many alignment problems arising in computational biology are special cases of matching in bipartite graphs. In these problems the vertices of the graph can be nucleotides of a DNA sequence, amino acids of a protein sequence or secondary structure elements of a protein structure. Unlike classical matching problems, alignment problems have an intrinsic order on the graph vertices, and this implies extra constraints on the edges. As an example, Fig. 1 shows an alignment of two sequences as a matching in a bipartite graph. We can see that the feasible alignments are 1-matchings without crossing edges.
In this paper we deal with the problem of aligning a protein structure template to a query protein sequence of length $N$, known as the protein threading problem (PTP). A template is an ordered set of $m$ secondary structure elements (or blocks) of lengths $l_i$, $i = 1, \dots, m$. An alignment (or threading) is a covering of contiguous sequence areas by the blocks. A threading is called feasible if the blocks preserve their order and do not overlap. A threading is completely determined by the starting positions of all blocks. For the sake of simplicity we will use relative positions. If block $i$ starts at the $j$th query character, its relative position is $r_i = j - \sum_{k=1}^{i-1} l_k$. In this way the possible (relative) positions of each segment are between 1 and $n = N + 1 - \sum_{i=1}^{m} l_i$ (see Fig. 2(b)). The set of feasible threadings is
\[ \mathcal{T} = \{ (r_1, \dots, r_m) \mid 1 \le r_1 \le \dots \le r_m \le n \}. \]
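As an illustration of the relative-position convention (not part of the original report), the following Python sketch converts absolute start positions to relative ones and tests membership in the set of feasible threadings. The function names and the sample placement `abs_starts` are hypothetical; only the sizes $N = 20$ and block lengths 3, 5, 4 are taken from Fig. 2.

```python
def to_relative(abs_starts, lengths):
    """Convert 1-based absolute start positions to relative positions.

    Implements r_i = j_i - (l_1 + ... + l_{i-1}) from the text.
    """
    rel, offset = [], 0
    for start, length in zip(abs_starts, lengths):
        rel.append(start - offset)
        offset += length
    return rel


def is_feasible(rel, n):
    """Check the condition 1 <= r_1 <= ... <= r_m <= n."""
    in_range = all(1 <= r <= n for r in rel)
    ordered = all(a <= b for a, b in zip(rel, rel[1:]))
    return in_range and ordered


N, lengths = 20, [3, 5, 4]      # sizes of the example in Fig. 2
n = N + 1 - sum(lengths)        # number of relative positions
abs_starts = [2, 6, 15]         # hypothetical placement, not from the report
rel = to_relative(abs_starts, lengths)
print(n, rel, is_feasible(rel, n))
```

With relative positions, overlapping blocks show up as a decrease in the sequence $r_1, \dots, r_m$, which is why the feasibility test reduces to a monotonicity check.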
The protein threading problem is a matching problem in a bipartite graph $(U \cup V, U \times V)$, where $U = \{u_1, \dots, u_m\}$ is the ordered set of blocks and $V = \{v_1, \dots, v_n\}$ is the ordered set of relative positions. The threading feasibility conditions can be restated in terms of matching in the following way. A matching $M \subseteq U \times V$ is feasible if:
(i) $d(u) = 1$, $u \in U$ (where $d(x)$ is the degree of $x$). This means that each block is assigned to exactly one position. Incidentally, this implies that the cardinality of each feasible matching is $m$.
(ii) There are no crossing edges, or more precisely, if $(u_i, v_j) \in M$, $(u_k, v_l) \in M$ and $i < k$, then $j \le l$. This means that the blocks preserve their order and do not overlap. The last inequality is not strict because relative positions are used.
Note that while (i) is a classical matching constraint, (ii) is specific to the alignment problems and makes them more difficult. Fig. 2(c) shows a matching corresponding to a feasible threading.
Proposition 1. The number of feasible threadings is $|\mathcal{T}| = \binom{m+n-1}{m}$.
Proof. We can define the relative positions as $r_i = j - \sum_{k=1}^{i-1} l_k + i - 1$. In this case the relative positions of the feasible threadings are related by
\[ 1 \le r_1 < \dots < r_m \le m + n - 1 \]
and a threading is determined by choosing $m$ out of $m + n - 1$ positions.
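Proposition 1 can be checked numerically on small instances. The sketch below is an illustration only, with hypothetical function names: it enumerates all nondecreasing position tuples and compares their number with the binomial coefficient. It also applies the shift $r_i \mapsto r_i + i - 1$ used in the proof and checks that it yields exactly the $m$-element subsets of $\{1, \dots, m+n-1\}$.

```python
from itertools import combinations, combinations_with_replacement
from math import comb


def feasible_threadings(m, n):
    """All tuples (r_1, ..., r_m) with 1 <= r_1 <= ... <= r_m <= n."""
    return list(combinations_with_replacement(range(1, n + 1), m))


def shifted(threading):
    """Apply the shift r_i -> r_i + i - 1 from the proof (i is 1-based)."""
    return tuple(r + i for i, r in enumerate(threading))


m, n = 3, 9                                   # sizes of the example in Fig. 2
threadings = feasible_threadings(m, n)
subsets = set(combinations(range(1, m + n), m))

print(len(threadings), comb(m + n - 1, m))    # both counts should agree
print({shifted(t) for t in threadings} == subsets)
```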
Figure 2: (a) Example of an alignment of a query sequence of length 20 and a template containing 3 segments of lengths 3, 5 and 4. (b) Correspondence between absolute and relative block positions: the absolute positions run from 1 to 20, and each of the three blocks has relative positions 1 to 9. (c) A matching, between U = {1, 2, 3} and V = {1, ..., 9}, corresponding to the alignment of (a).
One of the possible ways to deal with alignment problems is to try to adapt the existing matching techniques to the new edge constraints of type (ii). Instead of doing this, we propose a new graph model and we develop efficient matching algorithms based on this model.
We introduce an alignment graph $G = (U \times V, E)$. Each vertex of this graph corresponds to an edge of the matching graph. For simplicity we will denote the vertices by $v_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$, and draw them as an $n \times m$ grid (see Fig. 3). The vertices $v_{ij}$, $j = 1, \dots, n$, will be called the $i$th layer. A layer corresponds to a block and each vertex in a layer corresponds to a positioning of this block in the query sequence.
One can connect by edges the pairs of vertices of $G$ which correspond to pairs of noncrossing edges in the matching graph. In this case a feasible threading is an $m$-clique in $G$. A similar approach is used in [12]. We introduce only a subset of the above edges, namely the ones that connect vertices from adjacent columns and have the following regular pattern: $E = \{ (v_{ij}, v_{i+1,l}) \mid i = 1, \dots, m-1,\ 1 \le j \le l \le n \}$. We add two more vertices $S$ and $T$, and edges connecting $S$ to all vertices from the first column and $T$ to all vertices from the last column. Now it is easy to see the one-to-one correspondence between the set of feasible threadings (or matchings) and the set of $S$-$T$ paths in $G$. Fig. 3 illustrates this correspondence.
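The correspondence between $S$-$T$ paths and feasible threadings can be sanity-checked by counting paths. The following sketch is an illustration under the edge pattern $E$ given above (the function name is hypothetical): it counts $S$-$T$ paths layer by layer, and the result can be compared with the count $\binom{m+n-1}{m}$ of Proposition 1.

```python
from math import comb


def count_st_paths(m, n):
    """Count S-T paths in the alignment graph with m layers and n positions.

    paths[j] holds the number of paths from S to the vertex at position j+1
    of the current layer; an edge (v_ij, v_{i+1,l}) exists whenever j <= l.
    """
    paths = [1] * n                  # S is joined to every vertex of layer 1
    for _ in range(m - 1):
        running, nxt = 0, []
        for j in range(n):           # prefix sums implement the rule j <= l
            running += paths[j]
            nxt.append(running)
        paths = nxt
    return sum(paths)                # every vertex of layer m is joined to T


# Sizes of the graph drawn in Fig. 3: six blocks, four positions.
print(count_st_paths(6, 4), comb(6 + 4 - 1, 6))
```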
So far we have given several alternative ways to describe the feasible alignments. Alignment problems in computational biology involve choosing the best of them based on some score function. The simplest score functions associate weights to the edges of the matching graph. For example, this is the case of sequence alignment problems. By introducing alignment graphs similar to the above, classical sequence alignment algorithms, such as Smith-Waterman or Needleman-Wunsch, can be viewed as finding shortest $S$-$T$ paths. When the score functions use structural information, the problems are more difficult and the shortest path model cannot incorporate this information.
Figure 3: Example of alignment graph (blocks i = 1, ..., 6, positions j = 1, ..., 4, and the two extra vertices S and T). The path in thick lines corresponds to the threading in which the positions of the blocks are 1, 2, 2, 3, 4, 4.
The score functions in PTP evaluate the degree of compatibility between the sequence amino acids and their positions in the template blocks. The interactions (or links) between the template blocks are described by the so-called generalized contact map graph, whose vertices are the blocks and whose edges connect pairs of interacting blocks. Let $L$ be the set of these edges:
\[ L = \{ (i, k) \mid i < k \text{ and blocks } i \text{ and } k \text{ interact} \}. \]
Sometimes we need to distinguish the links between adjacent blocks from the other links. Let $R = \{ (i, k) \mid (i, k) \in L,\ k - i > 1 \}$ be the set of remote (or non-local) links. The links from $L \setminus R$ are called local links. Without loss of generality we can suppose that all pairs of adjacent blocks interact.
The links between the blocks generate scores which depend on the block positions. In this way a score function of PTP can be presented by the following sets of coefficients:
• $c_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$, the score of putting block $i$ on position $j$;
• $d_{ijkl}$, $(i, k) \in L$, $1 \le j \le l \le n$, the score generated by the interaction between blocks $i$ and $k$ when block $i$ is on position $j$ and block $k$ is on position $l$.
The coefficients $c_{ij}$ are some function (usually a sum) of the preferences of each query amino acid placed in block $i$ for occupying its assigned position, as well as the scores of pairwise interactions between amino acids belonging to block $i$. The coefficients $d_{ijkl}$ include the scores of interactions between pairs of amino acids belonging to blocks $i$ and $k$. Loops (sequences between adjacent blocks) may also have sequence-specific scores, included in the coefficients $d_{i,j,i+1,l}$.
The score of a threading is the sum of the corresponding score coefficients, and PTP is the optimization problem of finding the threading of minimum score. If there are no remote links (if $R = \emptyset$), we can put the score coefficients on the vertices and the edges of the alignment graph, and PTP is equivalent to the problem of finding the shortest $S$-$T$ path. In order to take the remote links into account, we add to the alignment graph the edges
\[ \{ (v_{ij}, v_{kl}) \mid (i, k) \in R,\ 1 \le j \le l \le n \}, \]
which we will refer to as z-edges.
An $S$-$T$ path is said to activate the z-edges that have both ends on this path. Each $S$-$T$ path activates exactly $|R|$ z-edges, one for each link in $R$. The subgraph induced by the edges of an $S$-$T$ path and the activated z-edges is called an augmented path. Thus PTP is equivalent to finding the shortest augmented path in the alignment graph (see Fig. 4).
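To make the score of a threading concrete, the following sketch sums the coefficients $c_{ij}$ and $d_{ijkl}$ along a threading and solves a tiny instance by exhaustive search. It is an illustration only: the dictionary layout, the function names and the toy coefficients are assumptions, not data from the report.

```python
from itertools import combinations_with_replacement


def threading_score(r, c, d, links):
    """Sum of the scores c[i, j] and d[i, j, k, l] selected by threading r."""
    pos = {i + 1: rj for i, rj in enumerate(r)}       # block -> position
    total = sum(c[i, j] for i, j in pos.items())
    total += sum(d[i, pos[i], k, pos[k]] for (i, k) in links)
    return total


def brute_force_ptp(m, n, c, d, links):
    """Exhaustive search over all feasible threadings (tiny instances only)."""
    return min(
        (threading_score(r, c, d, links), r)
        for r in combinations_with_replacement(range(1, n + 1), m)
    )


# Hypothetical toy instance: 3 blocks, 2 positions, one remote link (1, 3).
m, n = 3, 2
links = [(1, 2), (2, 3), (1, 3)]
c = {(i, j): abs(i - j) for i in range(1, m + 1) for j in range(1, n + 1)}
d = {(i, j, k, l): l - j
     for (i, k) in links
     for j in range(1, n + 1)
     for l in range(j, n + 1)}

print(brute_force_ptp(m, n, c, d, links))
```

Each term `d[i, pos[i], k, pos[k]]` with $(i, k) \in R$ plays the role of an activated z-edge, so the returned value is the length of the shortest augmented path for this toy instance.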
As we will see later, the main advantage of this graph is that some simple alignment problems reduce to finding the shortest $S$-$T$ path in it with some prices associated to the edges and/or vertices. The latter problem can be easily solved by a simple dynamic programming algorithm of complexity $O(mn^2)$. In order to address the general case we need to represent this graph optimisation problem as an integer programming problem.
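The $O(mn^2)$ dynamic programme mentioned above can be sketched as follows for the case without remote links. This is a minimal illustration, not the authors' implementation; the nested-list layout of the scores, the function name and the toy data are assumptions.

```python
INF = float("inf")


def shortest_st_path(c, d_local):
    """Shortest S-T path when there are no remote links (R is empty).

    c[i][j]          : score of block i+1 at position j+1
    d_local[i][j][l] : score of the local link between blocks i+1 and i+2
                       at positions j+1 <= l+1 (entries with j > l are unused)
    Returns (score, positions) with 1-based positions.
    """
    m, n = len(c), len(c[0])
    best = list(c[0])                  # best[j]: cheapest path from S to v_1j
    back = []
    for i in range(1, m):
        nxt, arg = [], []
        for l in range(n):
            # only predecessors at positions j <= l are connected to l
            j = min(range(l + 1), key=lambda j: best[j] + d_local[i - 1][j][l])
            nxt.append(best[j] + d_local[i - 1][j][l] + c[i][l])
            arg.append(j)
        best = nxt
        back.append(arg)
    l = min(range(n), key=best.__getitem__)
    score, path = best[l], [l]
    for arg in reversed(back):         # follow the stored predecessors
        l = arg[l]
        path.append(l)
    return score, [p + 1 for p in reversed(path)]


# Hypothetical toy data: 3 blocks, 2 positions, link score l - j.
c = [[0, 2], [1, 0], [2, 0]]
gap = [[0, 1], [INF, 0]]
print(shortest_st_path(c, [gap, gap]))
```

The two nested loops over positions inside the loop over layers give the $O(mn^2)$ bound.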
Figure 4: Example of augmented path (blocks i = 1, ..., 6, positions j = 1, ..., 4, vertices S and T). The generalized contact map graph is given at the bottom. The x-arcs of the S-T path are in solid lines. The activated z-arcs are in dashed lines. The length of the augmented path is equal to the score of the threading (1, 2, 2, 3, 4, 4). The arc labels shown in the figure are $c_{1122}$, $c_{2232}$, $c_{3243}$, $c_{4354}$, $c_{5464}$, $c_{1132}$, $c_{3264}$ and $c_{4364}$.
2 Integer programming formulation
Let $y_{ij}$ be binary variables associated with the vertices of $G$: $y_{ij}$ is one if block $i$ is on position $j$ and zero otherwise. Let $Y$ be the polytope defined by the following constraints:
\[ \sum_{j=1}^{n} y_{ij} = 1, \qquad i = 1, \dots, m \tag{1} \]
\[ \sum_{l=1}^{j} y_{il} - \sum_{l=1}^{j} y_{i+1,l} \ge 0, \qquad i = 1, \dots, m-1,\ j = 1, \dots, n-1 \tag{2} \]
\[ y_{ij} \ge 0, \qquad i = 1, \dots, m,\ j = 1, \dots, n \tag{3} \]
Constraints (1) ensure the feasibility condition (i), and constraints (2) are responsible for (ii). That is why $Y \cap \mathbb{B}^{mn}$ is exactly the set of feasible threadings.
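The claim that the binary points of $Y$ are exactly the feasible threadings can be verified by brute force on tiny instances. The sketch below is an illustration with hypothetical function names: it enumerates all 0/1 matrices, keeps those satisfying (1) and (2), and the resulting count can be compared with Proposition 1.

```python
from itertools import product
from math import comb


def satisfies_1_2(y):
    """Check constraints (1) and (2) for a 0/1 matrix y (m rows, n columns)."""
    m, n = len(y), len(y[0])
    if any(sum(row) != 1 for row in y):                  # constraint (1)
        return False
    for i in range(m - 1):
        for j in range(1, n):                            # j = 1, ..., n-1
            if sum(y[i][:j]) - sum(y[i + 1][:j]) < 0:    # constraint (2)
                return False
    return True


def count_binary_points(m, n):
    """Count the 0/1 matrices in Y by exhaustive enumeration (tiny m, n)."""
    total = 0
    for bits in product((0, 1), repeat=m * n):
        y = [bits[i * n:(i + 1) * n] for i in range(m)]
        total += satisfies_1_2(y)
    return total


print(count_binary_points(3, 3), comb(3 + 3 - 1, 3))
```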
In order to take into account the interaction costs, we introduce a second set of binary variables $z_{ijkl}$, $(i, k) \in L$, $1 \le j \le l \le n$. To avoid additional notation we will use vector notation for the variables $y_i = (y_{i1}, \dots, y_{in}) \in \mathbb{B}^n$ with assigned costs $c_i = (c_{i1}, \dots, c_{in}) \in \mathbb{R}^n$, and $z_{ik} = (z_{i1k1}, \dots, z_{i1kn}, z_{i2k2}, \dots, z_{i2kn}, \dots, z_{inkn}) \in \mathbb{B}^{n(n+1)/2}$ for $(i, k) \in L$ with assigned costs $d_{ik} = (d_{i1k1}, \dots, d_{i1kn}, d_{i2k2}, \dots, d_{i2kn}, \dots, d_{inkn}) \in \mathbb{R}^{n(n+1)/2}$.
Consider the $2n \times \frac{n(n+1)}{2}$ node-edge incidence matrix of the subgraph spanned by two interacting layers $i$ and $k$. The submatrix $A'$ containing the first $n$ rows (resp. $A''$ containing the last $n$ rows) corresponds to layer $i$ (resp. layer $k$).
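Now the protein threading problem can be defined as
\[ z_{IP}^{L} = v(PTP(L)) = \min \Big\{ \sum_{i=1}^{m} c_i y_i + \sum_{(i,k) \in L} d_{ik} z_{ik} \Big\} \tag{4} \]
subject to:
\[ y = (y_1, \dots, y_m) \in Y, \tag{5} \]
\[ y_i = A' z_{ik}, \qquad (i, k) \in L \tag{6} \]
\[ y_k = A'' z_{ik}, \qquad (i, k) \in L \tag{7} \]
\[ z_{ik} \in \mathbb{B}^{n(n+1)/2}, \qquad (i, k) \in L \tag{8} \]
The shortcut notation $v(\cdot)$ will be used for the optimal objective function value of a subproblem obtained from $PTP(L)$ with some $z$ variables fixed.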
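The linking constraints (6) and (7) can be illustrated on a small case. The sketch below builds $A'$ and $A''$ as unsigned 0/1 incidence matrices (an assumption that is consistent with the equalities $y_i = A' z_{ik}$ and $y_k = A'' z_{ik}$), orders the edges as in the definition of $z_{ik}$, and checks that activating one edge selects one position in each layer. The function names are hypothetical.

```python
def incidence_blocks(n):
    """Return (edges, A1, A2) for two interacting layers with n positions.

    edges lists the pairs (j, l), 1 <= j <= l <= n, in the order used for z_ik.
    A1[j-1][e] = 1 if edge e leaves position j of layer i   (matrix A').
    A2[l-1][e] = 1 if edge e enters position l of layer k   (matrix A'').
    """
    edges = [(j, l) for j in range(1, n + 1) for l in range(j, n + 1)]
    A1 = [[int(j == row) for (j, _) in edges] for row in range(1, n + 1)]
    A2 = [[int(l == row) for (_, l) in edges] for row in range(1, n + 1)]
    return edges, A1, A2


def matvec(A, z):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, z)) for row in A]


n = 4
edges, A1, A2 = incidence_blocks(n)
z = [int(e == (2, 3)) for e in edges]   # block i at position 2, block k at 3
print(len(edges), matvec(A1, z), matvec(A2, z))
```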
3 Complexity results
In this section we study the structure of the polytope defined by (5)-(7) and $z_{ik} \in \mathbb{R}_{+}^{n(n+1)/2}$, as well as the impact of the set $L$ on the complexity of the algorithms for solving the PTP problem. Throughout this section, vertex costs $c_i$ are assumed to be zero. This assumption is not restrictive because the costs $c_{ij}$ can be added to $d_{i,j,i+1,l}$, $l = j, \dots, n$. We will consider the costs $d_{ik}$ as $n \times n$ matrices containing the coefficients $d_{ijkl}$ on and above the main diagonal and arbitrarily large numbers below the main diagonal. In order to simplify the descriptions of the algorithms given in this section we introduce the following matrix operations.
Definition 1. Let $A$ and $B$ be two matrices of compatible size. $A \cdot B$ is the matrix product of $A$ and $B$ where the addition operation is replaced by min and the multiplication operation is replaced by +.
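A minimal sketch of the product of Definition 1 on nested Python lists is given below for illustration; the function name and the sample matrices are hypothetical. Infinite entries stand in for the arbitrarily large numbers placed below the main diagonal.

```python
INF = float("inf")


def min_plus(A, B):
    """Matrix product in which addition becomes min and multiplication becomes +."""
    return [
        [min(a + b for a, b in zip(row, col)) for col in zip(*B)]
        for row in A
    ]


# Hypothetical upper-triangular cost matrices.
A = [[1, 2], [INF, 3]]
B = [[0, 5], [INF, 1]]
print(min_plus(A, B))
```

Entry $(j, l)$ of such a product is $\min_r \big( A(j, r) + B(r, l) \big)$, that is, the cheapest way to go from position $j$ to position $l$ through an intermediate position $r$.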
Definition 2. Let $A$ and $B$ be two matrices of size $n \times n$. $M = A \otimes B$ is defined by $M(i, j) = \min_{i \le r \le j} A(i, r) + B(i, j)$.
Below we present four kinds of contact graphs that make PTP polynomially solvable.
3.1 Contact graph contains only local edges
As mentioned above, in this case PTP reduces to finding the shortest $S$-$T$ path in the alignment graph, which can be done by an $O(mn^2)$ dynamic programming algorithm. An important property of an alignment graph containing only local edges is that it has a tight LP description.
Theorem 1. The polytope $Y$ is integral, i.e. it has only integer-valued vertices.
Proof. Let $A$ be the matrix of the coefficients in (1)-(2) with columns numbered by the indices of the variables. One can prove that $A$ is totally unimodular (TU) by performing the following sequence of TU-preserving transformations.
for i = 1, ..., m
    delete column (i, n) (these are unit columns)
for i = 1, ..., m
    for j = n−1, ..., 1
        pivot on a_ij (A is TU if and only if the matrix obtained by a pivot operation on A is TU)
        delete column (i, j) (now this is a unit column)
The final matrix is a unit column, which is TU. Since all the transformations are TU-preserving, $A$ is TU and $Y$ is integral.
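For very small $m$ and $n$, the total unimodularity used in this proof can also be checked directly from the definition, by computing the determinant of every square submatrix of the coefficient matrix of (1)-(2). The sketch below is a brute-force illustration only and is unrelated to the pivoting argument of the proof; the function names and the column ordering are assumptions.

```python
from itertools import combinations


def constraint_matrix(m, n):
    """Coefficient matrix of (1)-(2); column i*n + j stands for y_{i+1,j+1}."""
    rows = []
    for i in range(m):                                   # constraints (1)
        rows.append([int(i * n <= col < (i + 1) * n) for col in range(m * n)])
    for i in range(m - 1):                               # constraints (2)
        for j in range(1, n):
            row = [0] * (m * n)
            for l in range(j):
                row[i * n + l] = 1
                row[(i + 1) * n + l] = -1
            rows.append(row)
    return rows


def det(M):
    """Integer determinant by cofactor expansion (fine for tiny matrices)."""
    if not M:
        return 1
    return sum(
        (-1) ** c * M[0][c] * det([r[:c] + r[c + 1:] for r in M[1:]])
        for c in range(len(M)) if M[0][c]
    )


def is_totally_unimodular(A):
    """Check that every square submatrix has determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(A), len(A[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rs in combinations(range(n_rows), k):
            for cs in combinations(range(n_cols), k):
                if det([[A[r][c] for c in cs] for r in rs]) not in (-1, 0, 1):
                    return False
    return True


print(is_totally_unimodular(constraint_matrix(2, 3)))
```

The number of square submatrices grows very quickly, so this check is only practical for instances of a few rows and columns.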
One could prove the same assertion by showing that an arbitrary feasible solution to (1)-(3) is a convex combination of some integer-valued vertices of $Y$. The best such vertex (in the sense of an objective function) might be a good approximate solution to a problem whose feasible set is an intersection of $Y$ with additional constraints.
Let $y$ be an arbitrary non-integer solution to (1)-(3). Because of (1) and (2), a unit flow¹ $f = (f_{sj}, f_{(i,k)(i+1,j)})$, $i = 1, \dots, m-1$, $j = 1, \dots, n$, in $G$ exists such that
\[ \sum_{k \le j} f_{(i,k)(i+1,j)} = y_{i+1,j}, \quad i = 1, \dots, m-1, \qquad f_{sj} = y_{1j}, \quad j = 1, \dots, n. \]
By the well-known properties of the network flow polytope, the flow $f$ can be expressed as a convex combination of integer-valued unit flows (paths in $G$). But each such flow corresponds to
¹ The four indices i, k, p, j used for labeling arcs follow the convention: tail at vertex (i, k), head at vertex (p, j). Sometimes the brackets will be dropped.