
Lagrangian Approaches for a class of Matching Problems in Computational Biology


HAL Id: inria-00090635

https://hal.inria.fr/inria-00090635v2

Submitted on 7 Sep 2006

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Nicola Yanev, Rumen Andonov, Philippe Veber, Stefan Balev

To cite this version:

Nicola Yanev, Rumen Andonov, Philippe Veber, Stefan Balev. Lagrangian Approaches for a class of Matching Problems in Computational Biology. [Research Report] RR-5973, INRIA. 2006. ⟨inria-00090635v2⟩


ISSN 0249-6399, ISRN INRIA/RR--5973--FR+ENG


N° 5973

31st August 2006


INRIA Rennes research unit

IRISA, Campus universitaire de Beaulieu, 35042 Rennes Cedex (France)

Telephone: +33 2 99 84 71 00, Fax: +33 2 99 84 71 71

Theme BIO: Biological systems

Project Symbiose

Research report no. 5973, 31st August 2006, 18 pages

Abstract: This paper presents efficient algorithms for solving the problem of aligning a protein structure template to a query amino-acid sequence, known as the protein threading problem. We consider the problem as a special case of the graph matching problem. We give formal graph and integer programming models of the problem. After studying the properties of these models, we propose two kinds of Lagrangian relaxation for solving them. We present experimental results on real-life instances showing the efficiency of our approaches.

Key-words: sequence-structure alignment, complexity, integer programming, Lagrangian relaxation

choby@math.bas.org

prenom.nom@irisa.fr

stefan.balev@univ-lehavre.fr


Figure 1: Matching interpretation of the sequence alignment problem. The sequences ACGCAA and AGTCT form the two sides of a bipartite graph; the alignment shown is ACG_CAA over A_GTC_T.

1 Preliminaries

Matching is an important class of combinatorial optimization problems with many real-life applications. Matching problems involve choosing a subset of edges of a graph subject to degree constraints on the vertices. Many alignment problems arising in computational biology are special cases of matching in bipartite graphs. In these problems the vertices of the graph can be nucleotides of a DNA sequence, amino acids of a protein sequence or secondary structure elements of a protein structure. Unlike classical matching problems, alignment problems have an intrinsic order on the graph vertices, and this implies extra constraints on the edges. As an example, Fig. 1 shows an alignment of two sequences as a matching in a bipartite graph. We can see that the feasible alignments are 1-matchings without crossing edges.

In this paper we deal with the problem of aligning a protein structure template to a query protein sequence of length $N$, known as the protein threading problem (PTP). A template is an ordered set of $m$ secondary structure elements (or blocks) of lengths $l_i$, $i = 1, \dots, m$. An alignment (or threading) is a covering of contiguous sequence areas by the blocks. A threading is called feasible if the blocks preserve their order and do not overlap. A threading is completely determined by the starting positions of all blocks. For the sake of simplicity we will use relative positions. If block $i$ starts at the $j$th query character, its relative position is $r_i = j - \sum_{k=1}^{i-1} l_k$. In this way the possible (relative) positions of each segment are between 1 and $n = N + 1 - \sum_{i=1}^{m} l_i$ (see Fig. 2(b)). The set of feasible threadings is

$$T = \{(r_1, \dots, r_m) \mid 1 \le r_1 \le \dots \le r_m \le n\}.$$
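As a minimal sketch of the relative-position convention, the following Python snippet converts absolute block starts to relative positions and tests membership in $T$. The function names and the absolute starts `[2, 6, 14]` are illustrative assumptions; only the sizes ($N = 20$, block lengths 3, 5 and 4) are taken from Fig. 2.

```python
def to_relative(abs_starts, lengths):
    """Convert 1-based absolute block starts j to r_i = j - sum_{k<i} l_k."""
    rel, offset = [], 0
    for start, length in zip(abs_starts, lengths):
        rel.append(start - offset)
        offset += length          # total length of the blocks placed so far
    return rel


def is_feasible(rel, n):
    """Check 1 <= r_1 <= ... <= r_m <= n."""
    in_range = all(1 <= r <= n for r in rel)
    ordered = all(a <= b for a, b in zip(rel, rel[1:]))
    return in_range and ordered


N, lengths = 20, [3, 5, 4]                 # sizes from Fig. 2
n = N + 1 - sum(lengths)                   # number of relative positions
rel = to_relative([2, 6, 14], lengths)     # hypothetical absolute starts
print(n, rel, is_feasible(rel, n))
```

Overlapping blocks show up as a decreasing pair of relative positions, which is why a single monotonicity test is enough.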

The protein threading problem is a matching problem in a bipartite graph $(U \cup V, U \times V)$, where $U = \{u_1, \dots, u_m\}$ is the ordered set of blocks and $V = \{v_1, \dots, v_n\}$ is the ordered set of relative positions. The threading feasibility conditions can be restated in terms of matching in the following way. A matching $M \subseteq U \times V$ is feasible if:

(i) $d(u) = 1$, $u \in U$ (where $d(x)$ is the degree of $x$). This means that each block is assigned to exactly one position. Incidentally, this implies that the cardinality of each feasible matching is $m$.

(ii) There are no crossing edges, or more precisely, if $(u_i, v_j) \in M$, $(u_k, v_l) \in M$ and $i < k$, then $j \le l$. This means that the blocks preserve their order and do not overlap. The last inequality is not strict because of using relative positions.

Note that while (i) is a classical matching constraint, (ii) is specific to the alignment problems and makes them more difficult. Fig. 2(c) shows a matching corresponding to a feasible threading.

Proposition 1. The number of feasible threadings is $|T| = \binom{m+n-1}{m}$.

Proof. We can define the relative positions as $r_i = j - \sum_{k=1}^{i-1} l_k + i - 1$. In this case the relative positions of the feasible threadings are related by

$$1 \le r_1 < \dots < r_m \le m + n - 1$$

and a threading is determined by choosing $m$ out of $m + n - 1$ positions.
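Proposition 1 can be checked numerically on small instances. The sketch below (the helper name is hypothetical) enumerates all nondecreasing position vectors, compares their number with the binomial coefficient, and verifies that the shift $r_i + i - 1$ used in the proof produces strictly increasing vectors.

```python
from itertools import combinations_with_replacement
from math import comb


def feasible_threadings(m, n):
    """All (r_1, ..., r_m) with 1 <= r_1 <= ... <= r_m <= n."""
    return list(combinations_with_replacement(range(1, n + 1), m))


m, n = 3, 9                                   # sizes of the Fig. 2 example
threadings = feasible_threadings(m, n)
assert len(threadings) == comb(m + n - 1, m)

# Shift used in the proof: r_i + i - 1 (enumerate is 0-based, hence r + i).
shifted = [tuple(r + i for i, r in enumerate(t)) for t in threadings]
assert all(all(a < b for a, b in zip(s, s[1:])) for s in shifted)
assert max(max(s) for s in shifted) == m + n - 1
```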

Figure 2: (a) Example of alignment of a query sequence of length 20 and a template containing 3 segments of lengths 3, 5 and 4. (b) Correspondence between absolute and relative block positions: absolute positions run from 1 to 20, and each of the three blocks has relative positions 1 to 9. (c) A matching corresponding to the alignment of (a), between U = {1, 2, 3} and V = {1, ..., 9}.

One of the possible ways to deal with alignment problems is to try to adapt the existing matching techniques to the new edge constraints of type (ii). Instead of doing this we propose a new graph model and we develop efficient matching algorithms based on this model.

We introduce an alignment graph $G = (U \times V, E)$. Each vertex of this graph corresponds to an edge of the matching graph. For simplicity we will denote the vertices by $v_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$ and draw them as an $n \times m$ grid (see Fig. 3). The vertices $v_{ij}$, $j = 1, \dots, n$ will be called the $i$th layer. A layer corresponds to a block and each vertex in a layer corresponds to a positioning of this block in the query sequence.

One can connect by edges the pairs of vertices of $G$ which correspond to pairs of noncrossing edges in the matching graph. In this case a feasible threading is an $m$-clique in $G$. A similar approach is used in [12]. We introduce only a subset of the above edges, namely the ones that connect vertices from adjacent columns and have the following regular pattern: $E = \{(v_{ij}, v_{i+1,l}) \mid i = 1, \dots, m-1,\ 1 \le j \le l \le n\}$. We add two more vertices $S$ and $T$ and edges connecting $S$ to all vertices from the first column and $T$ to all vertices from the last column. Now it is easy to see the one-to-one correspondence between the set of feasible threadings (or matchings) and the set of $S$-$T$ paths in $G$. Fig. 3 illustrates this correspondence.
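The correspondence between threadings and $S$-$T$ paths can be illustrated by counting paths layer by layer. This is a sketch with a hypothetical function name; it relies only on the edge pattern $(v_{ij}, v_{i+1,l})$, $j \le l$, described above.

```python
from math import comb


def count_st_paths(m, n):
    """Count S-T paths in the alignment graph with m layers and n positions."""
    paths = [1] * n                 # exactly one S -> v_{1j} path per position j
    for _ in range(m - 1):
        # v_{i+1,l} is reached from every v_{ij} with j <= l: a prefix sum.
        running, nxt = 0, []
        for p in paths:
            running += p
            nxt.append(running)
        paths = nxt
    return sum(paths)               # every vertex of the last layer is joined to T


# The count agrees with the number of feasible threadings for the Fig. 3 sizes.
assert count_st_paths(6, 4) == comb(6 + 4 - 1, 6)
```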

So far we have given several alternative ways to describe the feasible alignments. Alignment problems in computational biology involve choosing the best of them based on some score function. The simplest score functions associate weights to the edges of the matching graph. For example, this is the case of sequence alignment problems. By introducing alignment graphs similar to the above, classical sequence alignment algorithms, such as Smith-Waterman or Needleman-Wunsch, can be viewed as finding shortest $S$-$T$ paths. When the score functions use structural information, the problems are more difficult and the shortest path model cannot incorporate this information.

Figure 3: Example of alignment graph, with blocks i = 1, ..., 6, positions j = 1, ..., 4 and the terminal vertices S and T. The path in thick lines corresponds to the threading in which the positions of the blocks are 1, 2, 2, 3, 4, 4.

The score functions in PTP evaluate the degree of compatibility between the sequence amino acids and their positions in the template blocks. The interactions (or links) between the template blocks are described by the so-called generalized contact map graph, whose vertices are the blocks and whose edges connect pairs of interacting blocks. Let $L$ be the set of these edges:

$$L = \{(i, k) \mid i < k \text{ and blocks } i \text{ and } k \text{ interact}\}$$

Sometimes we need to distinguish the links between adjacent blocks and the other links. Let $R = \{(i, k) \mid (i, k) \in L,\ k - i > 1\}$ be the set of remote (or non-local) links. The links from $L \setminus R$ are called local links. Without loss of generality we can suppose that all pairs of adjacent blocks interact.

The links between the blocks generate scores which depend on the block positions. In this way a score function of PTP can be presented by the following sets of coefficients:

• $c_{ij}$, $i = 1, \dots, m$, $j = 1, \dots, n$, the score of putting block $i$ on position $j$;

• $d_{ijkl}$, $(i, k) \in L$, $1 \le j \le l \le n$, the score generated by the interaction between blocks $i$ and $k$ when block $i$ is on position $j$ and block $k$ is on position $l$.

The coefficients $c_{ij}$ are some function (usually a sum) of the preferences of each query amino acid placed in block $i$ for occupying its assigned position, as well as the scores of pairwise interactions between amino acids belonging to block $i$. The coefficients $d_{ijkl}$ include the scores of interactions between pairs of amino acids belonging to blocks $i$ and $k$. Loops (sequences between adjacent blocks) may also have sequence-specific scores, included in the coefficients $d_{i,j,i+1,l}$.

The score of a threading is the sum of the corresponding score coefficients, and PTP is the optimization problem of finding the threading of minimum score. If there are no remote links (if $R = \emptyset$) we can put the score coefficients on the vertices and the edges of the alignment graph, and PTP is equivalent to the problem of finding the shortest $S$-$T$ path. In order to take the remote links into account, we add to the alignment graph the edges

$$\{(v_{ij}, v_{kl}) \mid (i, k) \in R,\ 1 \le j \le l \le n\}$$

which we will refer to as z-edges.

An $S$-$T$ path is said to activate the z-edges that have both ends on this path. Each $S$-$T$ path activates exactly $|R|$ z-edges, one for each link in $R$. The subgraph induced by the edges of an $S$-$T$ path and the activated z-edges is called an augmented path. Thus PTP is equivalent to finding the shortest augmented path in the alignment graph (see Fig. 4).
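The length of an augmented path, i.e. the score of a threading, can be sketched as follows. The data layout (nested lists for $c$, a dictionary of matrices for $d$, 0-based indices) and the toy numbers are assumptions made for illustration only.

```python
def threading_score(r, c, d, links):
    """Score of threading r: vertex scores c[i][r_i] plus link scores d[(i,k)][r_i][r_k].

    Blocks and positions are 0-based; `links` is the set L of pairs (i, k), i < k.
    """
    total = sum(c[i][r[i]] for i in range(len(r)))
    total += sum(d[(i, k)][r[i]][r[k]] for (i, k) in links)
    return total


# Toy instance with m = 3 blocks and n = 2 positions (made-up numbers).
c = [[1, 2], [0, 3], [4, 1]]
links = {(0, 1), (1, 2), (0, 2)}          # (0, 2) is the only remote link
d = {
    (0, 1): [[5, 1], [0, 2]],             # entries with j > l are never used
    (1, 2): [[2, 7], [0, 1]],
    (0, 2): [[3, 4], [0, 6]],
}
print(threading_score((0, 0, 1), c, d, links))
```

Each link in `links` contributes exactly one term, which mirrors the statement that an $S$-$T$ path activates one z-edge per remote link.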

As we will see later, the main advantage of this graph is that some simple alignment problems reduce to finding the shortest $S$-$T$ path in it with some prices associated to the edges and/or vertices. The last problem can be easily solved by a trivial dynamic programming algorithm of complexity $O(mn^2)$. In order to address the general case we need to represent this graph optimisation problem as an integer programming problem.
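A minimal sketch of such a dynamic programming pass, assuming only local links, is given below. The function name, the 0-based indexing and the data layout (`d_local[i][j][l]` for the edge from layer $i$, position $j$ to layer $i+1$, position $l$) are illustrative choices, not notation from the report.

```python
def shortest_threading_local(c, d_local):
    """Shortest S-T path when only adjacent blocks interact, in O(m n^2) time.

    c[i][j]: vertex cost; d_local[i][j][l]: cost of edge (v_ij, v_{i+1,l}), j <= l.
    Returns (best score, list of 0-based positions).
    """
    m, n = len(c), len(c[0])
    best = list(c[0])                      # best[j]: cheapest path S -> v_{1j}
    back = []                              # back[i-1][l]: predecessor position
    for i in range(1, m):
        new, arg = [], []
        for l in range(n):
            # Only predecessors j <= l are allowed (no crossing edges).
            j = min(range(l + 1), key=lambda j: best[j] + d_local[i - 1][j][l])
            new.append(best[j] + d_local[i - 1][j][l] + c[i][l])
            arg.append(j)
        best = new
        back.append(arg)
    l = min(range(n), key=best.__getitem__)
    score, path = best[l], [l]
    for arg in reversed(back):             # walk the predecessors back to layer 1
        l = arg[l]
        path.append(l)
    return score, path[::-1]


# Toy instance: m = 3 blocks, n = 2 positions, zero vertex costs.
c = [[0, 0], [0, 0], [0, 0]]
d_local = [[[1, 4], [9, 0]], [[2, 1], [9, 5]]]
print(shortest_threading_local(c, d_local))
```

The two nested loops over `l` and `j` inside the loop over layers give the $O(mn^2)$ bound quoted above.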

Figure 4: Example of augmented path, with blocks i = 1, ..., 6, positions j = 1, ..., 4 and the terminal vertices S and T. The generalized contact map graph is given at the bottom. The x-arcs of the S-T path are in solid lines. The activated z-arcs are in dashed lines. The length of the augmented path is equal to the score of the threading (1, 2, 2, 3, 4, 4). Arc labels in the figure: c1122, c2232, c3243, c4354, c5464 on the S-T path and c1132, c3264, c4364 on the z-arcs.

2 Integer programming formulation

Let $y_{ij}$ be binary variables associated to the vertices of $G$: $y_{ij}$ is one if block $i$ is on position $j$ and zero otherwise. Let $Y$ be the polytope defined by the following constraints:

$$\sum_{j=1}^{n} y_{ij} = 1, \quad i = 1, \dots, m \tag{1}$$

$$\sum_{l=1}^{j} y_{il} - \sum_{l=1}^{j} y_{i+1,l} \ge 0, \quad i = 1, \dots, m-1,\ j = 1, \dots, n-1 \tag{2}$$

$$y_{ij} \ge 0, \quad i = 1, \dots, m,\ j = 1, \dots, n \tag{3}$$

Constraints (1) ensure the feasibility condition (i) and (2) are responsible for (ii). That is why $Y \cap \mathbb{B}^{mn}$ is exactly the set of feasible threadings.
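The claim that the binary points satisfying (1)-(2) are exactly the feasible threadings can be verified by brute force on tiny instances. The sketch below uses hypothetical helper names and enumerates all $2^{mn}$ binary vectors, so it is only meant for very small $m$ and $n$.

```python
from itertools import product


def satisfies_1_2(y, m, n):
    """Check constraints (1) and (2) for a 0/1 matrix y with m rows, n columns."""
    if any(sum(row) != 1 for row in y):                    # constraint (1)
        return False
    for i in range(m - 1):                                 # constraint (2)
        for j in range(n - 1):
            if sum(y[i][:j + 1]) - sum(y[i + 1][:j + 1]) < 0:
                return False
    return True


def binary_points(m, n):
    """1-based positions (r_1, ..., r_m) of every binary y satisfying (1)-(2)."""
    found = set()
    for bits in product((0, 1), repeat=m * n):
        y = [list(bits[i * n:(i + 1) * n]) for i in range(m)]
        if satisfies_1_2(y, m, n):
            found.add(tuple(row.index(1) + 1 for row in y))
    return found


print(sorted(binary_points(2, 3)))
```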

In order to take into account the interaction costs, we introduce a second set of binary variables $z_{ijkl}$, $(i, k) \in L$, $1 \le j \le l \le n$. To avoid added notation we will use vector notation for the variables $y_i = (y_{i1}, \dots, y_{in}) \in \mathbb{B}^n$ with assigned costs $c_i = (c_{i1}, \dots, c_{in}) \in \mathbb{R}^n$ and $z_{ik} = (z_{i1k1}, \dots, z_{i1kn}, z_{i2k2}, \dots, z_{i2kn}, \dots, z_{inkn}) \in \mathbb{B}^{n(n+1)/2}$ for $(i, k) \in L$ with assigned costs $d_{ik} = (d_{i1k1}, \dots, d_{i1kn}, d_{i2k2}, \dots, d_{i2kn}, \dots, d_{inkn}) \in \mathbb{R}^{n(n+1)/2}$.

Consider the $2n \times \frac{n(n+1)}{2}$ node-edge incidence matrix of the subgraph spanned by two interacting layers $i$ and $k$. The submatrix $A'$ containing the first $n$ rows (resp. $A''$ containing the last $n$ rows) corresponds to the layer $i$ (resp. layer $k$).

Now the protein threading problem can be defined as

$$z_{IP}^{L} = v(PTP(L)) = \min \Big\{ \sum_{i=1}^{m} c_i y_i + \sum_{(i,k) \in L} d_{ik} z_{ik} \Big\} \tag{4}$$

subject to:

$$y = (y_1, \dots, y_m) \in Y, \tag{5}$$

$$y_i = A' z_{ik}, \quad (i, k) \in L \tag{6}$$

$$y_k = A'' z_{ik}, \quad (i, k) \in L \tag{7}$$

$$z_{ik} \in \mathbb{B}^{n(n+1)/2}, \quad (i, k) \in L \tag{8}$$
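To make constraints (6)-(7) concrete, the sketch below builds the two blocks $A'$ and $A''$ for a small $n$ and applies them to a unit vector $z_{ik}$. It assumes an unsigned (0/1) incidence matrix and the column ordering of $z_{ik}$ given above; the function names are hypothetical.

```python
def incidence_blocks(n):
    """Return (A', A'', edges) for two interacting layers with n positions each.

    Columns are the pairs (j, l), 1 <= j <= l <= n, in the order used for z_ik.
    """
    edges = [(j, l) for j in range(1, n + 1) for l in range(j, n + 1)]
    a1 = [[1 if j == row else 0 for (j, l) in edges] for row in range(1, n + 1)]
    a2 = [[1 if l == row else 0 for (j, l) in edges] for row in range(1, n + 1)]
    return a1, a2, edges


def matvec(a, z):
    """Plain matrix-vector product on nested lists."""
    return [sum(x * v for x, v in zip(row, z)) for row in a]


n = 3
a1, a2, edges = incidence_blocks(n)
# Select the z-edge joining position 1 of block i to position 3 of block k.
z = [1 if e == (1, 3) else 0 for e in edges]
print(matvec(a1, z), matvec(a2, z))
```

With a single active z-edge, $A' z_{ik}$ and $A'' z_{ik}$ recover the indicator vectors $y_i$ and $y_k$ of the two endpoints, which is what (6) and (7) enforce.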

The shortcut notation $v(\cdot)$ will be used for the optimal objective function value of a subproblem obtained from $PTP(L)$ with some $z$ variables fixed.

3 Complexity results

In this section we study the structure of the polytope defined by (5)-(7) and $z_{ik} \in \mathbb{R}_{+}^{n(n+1)/2}$, as well as the impact of the set $L$ on the complexity of the algorithms for solving the PTP problem. Throughout this section, vertex costs $c_i$ are assumed to be zero. This assumption is not restrictive because the costs $c_{ij}$ can be added to $d_{i,j,i+1,l}$, $l = j, \dots, n$. We will consider the costs $d_{ik}$ as $n \times n$ matrices containing the coefficients $d_{ijkl}$ on and above the main diagonal and arbitrarily large numbers below the main diagonal. In order to simplify the descriptions of the algorithms given in this section we introduce the following matrix operations.

Definition 1. Let $A$ and $B$ be two matrices of compatible size. $A \cdot B$ is the matrix product of $A$ and $B$ where the addition operation is replaced by $\min$ and the multiplication operation is replaced by $+$.
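The operation of Definition 1 is the usual min-plus (tropical) matrix product. A minimal sketch on nested lists, with a hypothetical function name and made-up matrices:

```python
def min_plus(a, b):
    """(A . B)(i, j) = min over r of A(i, r) + B(r, j)."""
    inner = len(b)
    return [[min(a[i][r] + b[r][j] for r in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


a = [[0, 2], [5, 1]]
b = [[1, 4], [0, 3]]
print(min_plus(a, b))
```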

Definition 2. Let $A$ and $B$ be two matrices of size $n \times n$. $M = AB$ is defined by $M(i, j) = \min_{i \le r \le j} A(i, r) + B(i, j)$.

Below we present four kinds of contact graphs that make PTP polynomially solvable.

3.1 Contact graph contains only local edges

As mentioned above, in this case PTP reduces to finding the shortest $S$-$T$ path in the alignment graph, which can be done by an $O(mn^2)$ dynamic programming algorithm. An important property of an alignment graph containing only local edges is that it has a tight LP description.

Theorem 1. The polytope $Y$ is integral, i.e. it has only integer-valued vertices.

Proof. Let $A$ be the matrix of the coefficients in (1)-(2) with columns numbered by the indices of the variables. One can prove that $A$ is totally unimodular (TU) by performing the following sequence of TU preserving transformations.

for i = 1, ..., m
    delete column (i, n) (these are unit columns)
for i = 1, ..., m
    for j = n-1, ..., 1
        pivot on a_ij (A is TU iff the matrix obtained by a pivot operation on A is TU)
        delete column (i, j) (now this is a unit column)

The final matrix is a unit column, which is TU. Since all the transformations are TU preserving, $A$ is TU and $Y$ is integral.
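For very small $m$ and $n$, total unimodularity of the coefficient matrix of (1)-(2) can also be checked by brute force, computing the determinant of every square submatrix. This is only a sanity check under the assumption that the column of $y_{ij}$ has index $(i-1)n + (j-1)$; it is not the pivoting argument of the proof, and all function names are hypothetical.

```python
from itertools import combinations


def constraint_matrix(m, n):
    """Rows of (1) followed by rows of (2); y_ij sits in column (i-1)*n + (j-1)."""
    rows = []
    for i in range(m):                          # constraints (1)
        row = [0] * (m * n)
        for j in range(n):
            row[i * n + j] = 1
        rows.append(row)
    for i in range(m - 1):                      # constraints (2)
        for j in range(n - 1):
            row = [0] * (m * n)
            for l in range(j + 1):
                row[i * n + l] = 1
                row[(i + 1) * n + l] = -1
            rows.append(row)
    return rows


def det(mat):
    """Integer determinant by fraction-free (Bareiss) elimination."""
    a = [r[:] for r in mat]
    k = len(a)
    sign, prev = 1, 1
    for p in range(k - 1):
        if a[p][p] == 0:                        # look for a nonzero pivot below
            swap = next((r for r in range(p + 1, k) if a[r][p] != 0), None)
            if swap is None:
                return 0
            a[p], a[swap] = a[swap], a[p]
            sign = -sign
        for r in range(p + 1, k):
            for c in range(p + 1, k):
                a[r][c] = (a[r][c] * a[p][p] - a[r][p] * a[p][c]) // prev
        prev = a[p][p]
    return sign * a[k - 1][k - 1]


def is_totally_unimodular(rows):
    """True if every square submatrix has determinant in {-1, 0, 1}."""
    n_rows, n_cols = len(rows), len(rows[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rs in combinations(range(n_rows), k):
            for cs in combinations(range(n_cols), k):
                sub = [[rows[r][c] for c in cs] for r in rs]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


print(is_totally_unimodular(constraint_matrix(3, 3)))
```

The number of submatrices grows very quickly, so this check is practical only for instances of roughly this size.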

One could prove the same assertion by showing that an arbitrary feasible solution to (1)-(3) is a convex combination of some integer-valued vertices of $Y$. The best such vertex (in the sense of an objective function) might be a good approximate solution to a problem whose feasible set is an intersection of $Y$ with additional constraints.

Let $y$ be an arbitrary non-integer solution to (1)-(3). Because of (1), (2) a unit flow¹ $f = (f_{sj}, f_{(i,k)(i+1,j)})$, $i = 1, \dots, m-1$, $j = 1, \dots, n$, in $G$ exists s.t.

$$\sum_{k \le j} f_{(i,k)(i+1,j)} = y_{i+1,j}, \quad i = 1, \dots, m-1, \qquad f_{sj} = y_{1j}, \quad j = 1, \dots, n$$

By the well-known properties of the network flow polytope, the flow $f$ can be expressed as a convex combination of integer-valued unit flows (paths in $G$). But each such flow corresponds to

¹ The four indices $i, k, p, j$ used for labeling arcs follow the convention: tail at vertex $(i, k)$, head at vertex $(p, j)$. Sometimes the brackets will be dropped.
