
Algorithms for RNA Structure Alignment

Kaizhong Zhang

4.5 Algorithms for RNA Structure Alignment

In this section, we consider the problem of computing the edit alignment and extended edit alignment between two RNA structures R1 and R2. Since computing the structure alignment for RNA tertiary structures is MAX SNP-hard, we cannot expect to find the optimal solution in polynomial time.

However, we present algorithms that find the optimal solution when at least one of the RNA structures is a secondary structure and that find good solutions when both RNA structures are tertiary structures. Therefore, we do not assume that the input RNA structures are secondary structures, and we do not use any tree representation.

4.5.1 Edit Alignment

Since aligning crossing base pairs is difficult, we add one more condition in defining a structural alignment (R1′, R2′) of R1 and R2.

(4) If (r1′[i], r1′[j]) and (r1′[k], r1′[l]) are base pairs in R1′ and (r2′[i], r2′[j]) and (r2′[k], r2′[l]) are base pairs in R2′, then (r1′[i], r1′[j]) and (r1′[k], r1′[l]) are noncrossing in R1′ and (r2′[i], r2′[j]) and (r2′[k], r2′[l]) are noncrossing in R2′.

Therefore, even though the input RNA structures may have crossing base pairs, the aligned base pairs in them are noncrossing. We present an algorithm that computes the optimal alignment of two RNA structures based on this new alignment definition. We will show that our algorithm can be used for aligning tertiary structures in practical applications, though the alignment may not be the optimal one according to the original definition.
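The noncrossing requirement can be illustrated with a short Python sketch. It assumes base pairs are given as tuples (i, j) of positions with i < j; the function names are illustrative and do not come from the text.

```python
from itertools import combinations


def crosses(p, q):
    """Return True if base pairs p and q cross, i.e. i < k < j < l
    after ordering the two pairs by their 5' ends."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def is_noncrossing(pairs):
    """Return True if no two base pairs in the collection cross."""
    return not any(crosses(p, q) for p, q in combinations(pairs, 2))
```

Nested pairs such as (1, 8) and (3, 5) and disjoint pairs such as (1, 3) and (5, 8) are noncrossing, whereas (1, 5) and (3, 8) cross, so condition (4) allows at most one pair of each such crossing couple to be aligned.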

In extending the techniques of Gotoh [152] for handling gap initiation costs from sequence alignment to structure alignment, the main difficulty is that the deletion of a base pair may create two gaps simultaneously. We deal with this problem by treating the deletion of a base pair as two separate deletions of its two bases, each with a cost of half of the base-pair deletion cost. We use a bottom-up dynamic programming algorithm to find the optimal alignment between R1 and R2. That is, we consider the smaller substructures in R1 and R2 first and eventually consider the whole structures of R1 and R2.

Property of optimal alignments. Consider two RNA structures R1 and R2. Let γg be the gap cost. We use Γ( ) to define γ(i, j) for 0 ≤ i ≤ |R1| and 0 ≤ j ≤ |R2|.

72 Data Mining in Bioinformatics

γ(i, 0) = Γ(r1[i] → λ)    if i = pr1(i)

γ(0, i) = Γ(λ → r2[i])    if i = pr2(i)

γ(i, j) = Γ(r1[i] → r2[j])    if i = pr1(i) and j = pr2(j)

γ(i, 0) = γ(j, 0) = Γ((r1[i], r1[j]) → λ)/2    if i = pr1(j) < j

γ(0, i) = γ(0, j) = Γ(λ → (r2[i], r2[j]))/2    if i = pr2(j) < j

γ(i, j) = Γ((r1[i1], r1[i]) → (r2[j1], r2[j]))    if i1 = pr1(i) < i and j1 = pr2(j) < j

From this definition, if r1[i] is a single base, then γ(i, 0) is the cost of deleting this base, and if r1[i] is a base of a base pair, then γ(i, 0) is half of the cost of deleting the base pair. In other words, we distribute the deletion cost of a base pair evenly over its two bases. The meaning of γ(0, i) is similar. When i > 0 and j > 0, γ(i, j) is the cost of aligning the unpaired bases r1[i] and r2[j], or the cost of aligning the base pairs (r1[i1], r1[i]) and (r2[j1], r2[j]) when r1[i] and r2[j] are the 3′ ends of base pairs.
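The definition of γ can be sketched in Python as follows. The sketch makes several assumptions that are not part of the text: positions are 1-based (index 0 of each sequence and partner array is a placeholder), pr[i] == i marks an unpaired base, None stands for the empty symbol λ, and `base_cost` and `pair_cost` are hypothetical callables playing the role of Γ for bases and base pairs.

```python
def make_gamma(r1, pr1, r2, pr2, base_cost, pair_cost):
    """Build gamma(i, j) from 1-based sequences and partner arrays.

    base_cost(a, b) and pair_cost(p, q) are hypothetical scoring
    callables; None plays the role of the empty symbol (lambda).
    """
    def pair(r, pr, i):
        # Return the base pair containing position i, 5' end first.
        lo, hi = sorted((i, pr[i]))
        return (r[lo], r[hi])

    def gamma(i, j):
        if i == 0 and j == 0:
            raise ValueError("gamma(0, 0) is not defined")
        if j == 0:  # deletion of r1[i]
            if pr1[i] == i:
                return base_cost(r1[i], None)
            return pair_cost(pair(r1, pr1, i), None) / 2  # half per base
        if i == 0:  # insertion of r2[j]
            if pr2[j] == j:
                return base_cost(None, r2[j])
            return pair_cost(None, pair(r2, pr2, j)) / 2
        if pr1[i] == i and pr2[j] == j:  # two unpaired bases
            return base_cost(r1[i], r2[j])
        if pr1[i] < i and pr2[j] < j:  # two 3' ends of base pairs
            return pair_cost(pair(r1, pr1, i), pair(r2, pr2, j))
        raise ValueError("gamma(i, j) is undefined for this combination")

    return gamma
```

Raising an error for the remaining combinations (for example, a paired base against an unpaired one) is a design choice of this sketch; it reflects the fact that the definition above assigns no cost to them.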

We now consider the optimal alignment between R1[i1, i2] and R2[j1, j2].

We use A(i1, i2 ; j1, j2) to represent the optimal alignment cost between R1[i1, i2] and R2[j1, j2]. We use D(i1, i2 ; j1, j2) to represent the optimal alignment cost such that r1[i2] is aligned to a gap. If i1 ≤ pr1(i2) < i2, then by the definition of alignment, in the optimal alignment of D(i1, i2 ; j1, j2), r1[pr1(i2)] also has to be aligned to a gap. We use I(i1, i2 ; j1, j2) to represent the optimal alignment cost such that r2[j2] is aligned to a gap. If j1 ≤ pr2(j2) < j2, then in the optimal alignment of I(i1, i2 ; j1, j2), r2[pr2(j2)] also has to be aligned to a gap.

In computing A(i1, i2 ; j1, j2), D(i1, i2 ; j1, j2), and I(i1, i2 ; j1, j2), we apply the following rule: for any i1 ≤ i ≤ i2, if pr1(i) < i1 or i2 < pr1(i), then r1[i] is forced to be aligned to a gap; for any j1 ≤ j ≤ j2, if pr2(j) < j1 or j2 < pr2(j), then r2[j] is forced to be aligned to a gap. It will be clear from Lemmas 4.5.3, 4.5.4, and 4.5.5 that this proposition is used to deal with two situations: aligning one base pair among crossing base pairs and deleting a base pair.
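The forced-gap rule amounts to a simple test on the partner of each base. A minimal sketch, assuming a 1-based partner array `pr` in which pr[i] == i marks an unpaired base (the function name is illustrative):

```python
def forced_gaps(pr, lo, hi):
    """Return the positions in [lo, hi] whose partner lies outside
    the interval; these bases must be aligned to a gap."""
    return [i for i in range(lo, hi + 1) if pr[i] < lo or pr[i] > hi]
```

For a structure with base pairs (1, 8), (2, 7), and (4, 6), restricting attention to the interval [2, 6] forces position 2 to a gap, because its partner 7 lies outside the interval.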

We can now consider how to compute the optimal alignment between R1[i1, i2] and R2[j1, j2]. The first two lemmas are trivial, so we omit their proofs.

Lemma 4.5.1.

A(∅ ; ∅) = 0

D(∅ ; ∅) = γg

I(∅ ; ∅) = γg

Lemma 4.5.2. For i1 ≤ i ≤ i2 and j1 ≤ j ≤ j2,

D(i1, i ; ∅) = D(i1, i−1 ; ∅) + γ(i, 0)

I(∅ ; j1, j) = I(∅ ; j1, j−1) + γ(0, j)

A(i1, i ; ∅) = D(i1, i ; ∅)

A(∅ ; j1, j) = I(∅ ; j1, j)

I(i1, i ; ∅) = D(i1, i ; ∅) + γg

D(∅ ; j1, j) = I(∅ ; j1, j) + γg
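As a small worked illustration of Lemmas 4.5.1 and 4.5.2, the boundary values for aligning R1[i1, i] against the empty structure can be computed with a running sum. The sketch assumes the per-base deletion costs γ(i, 0) are supplied as a list; the function and parameter names are illustrative.

```python
def boundary_costs(gamma_del, gap_open):
    """Boundary values for R1[i1, i] versus the empty structure.

    gamma_del lists gamma(i, 0) for i = i1..i2 and gap_open is the
    gap cost. Index 0 of each returned list is the empty prefix.
    """
    D = [gap_open]                      # D(empty; empty) = gap cost
    for g in gamma_del:
        D.append(D[-1] + g)             # D(i1, i) = D(i1, i-1) + gamma(i, 0)
    A = [0.0] + D[1:]                   # A(empty; empty) = 0, else A = D
    I = [gap_open] + [d + gap_open for d in D[1:]]  # I = D + gap cost
    return A, D, I
```

With deletion costs 1, 0.5, 0.5 and a gap cost of 2, deleting all three bases costs D = 4: one gap initiation plus the three per-base costs.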

Lemma 4.5.3. For i1 ≤ i ≤ i2 and j1 ≤ j ≤ j2,

Proof. Similar to Lemma 4.5.3.

Lemma 4.5.5. For i1 ≤ i ≤ i2 and j1 ≤ j ≤ j2, minimum of the three cases.


For case 2, since i1 ≤ pr1(i) < i and j1 ≤ pr2(j) < j, both (r1[pr1(i)], r1[i]) and (r2[pr2(j)], r2[j]) are base pairs. In the optimal alignment, (r1[pr1(i)], r1[i]) may be aligned to a pair of gaps, (r2[pr2(j)], r2[j]) may be aligned to a pair of gaps, or (r1[pr1(i)], r1[i]) may be aligned to (r2[pr2(j)], r2[j]).

If (r1[pr1(i)], r1[i]) is aligned to a pair of gaps, then A(i1, i ; j1, j) = D(i1, i ; j1, j). If (r2[pr2(j)], r2[j]) is aligned to a pair of gaps, then A(i1, i ; j1, j) = I(i1, i ; j1, j).

If (r1[pr1(i)], r1[i]) is aligned to (r2[pr2(j)], r2[j]), then the optimal alignment between R1[i1, i] and R2[j1, j] is divided into three parts: (1) the optimal alignment between R1[i1, pr1(i) − 1] and R2[j1, pr2(j) − 1], (2) the optimal alignment between R1[pr1(i) + 1, i − 1] and R2[pr2(j) + 1, j − 1], and (3) the alignment of (r1[pr1(i)], r1[i]) to (r2[pr2(j)], r2[j]). This is true since any base pair across (r1[pr1(i)], r1[i]) or (r2[pr2(j)], r2[j]) must be aligned to gaps, and the cost of such an alignment has already been included in parts 1 and 2. Hence we have

A(i1, i ; j1, j) = A(i1, pr1(i) − 1 ; j1, pr2(j) − 1) + A(pr1(i) + 1, i − 1 ; pr2(j) + 1, j − 1) + γ(i, j).

In case 3, we consider all the other possibilities in which we cannot align r1[i] to r2[j]. We examine several subcases involving base pairs.

- Subcase 1: pr1(i) > i. This means that r1[pr1(i)] is outside the interval [i1, i], and we have to align r1[i] to a gap.

- Subcase 2: pr2(j) > j. This is similar to subcase 1. Together with subcase 1, this implies that when pr1(i) > i and pr2(j) > j, even if r1[i] = r2[j], we cannot align them to each other.

- Subcase 3: pr1(i) < i1. This is similar to subcase 1. Together with subcase 1, we know that if a base pair is across an aligned base pair, then it has to be aligned to gaps.

- Subcase 4: pr2(j) < j1. This is similar to subcase 3.
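The case analysis above can be put together in a hedged Python sketch of the dynamic program for A(i1, i2 ; j1, j2). Several points are assumptions of this sketch rather than statements of the text: the recurrences used for D and I are Gotoh-style affine-gap forms chosen to agree with the boundary values of Lemmas 4.5.1 and 4.5.2 and are not quoted from Lemmas 4.5.3 and 4.5.4; case 1 is taken to be the alignment of two unpaired bases; positions are 1-based with pr[i] == i for an unpaired base; and `gamma` is any callable implementing γ(i, j). Inner intervals are obtained by memoized recursion instead of an explicit bottom-up order.

```python
from functools import lru_cache


def make_aligner(pr1, pr2, gamma, gap_open):
    """Return a function cost(i1, i2, j1, j2) giving A(i1, i2 ; j1, j2)."""

    @lru_cache(maxsize=None)
    def tables(i1, i2, j1, j2):
        m, n = i2 - i1 + 1, j2 - j1 + 1
        inf = float("inf")
        A = [[inf] * (n + 1) for _ in range(m + 1)]
        D = [[inf] * (n + 1) for _ in range(m + 1)]
        I = [[inf] * (n + 1) for _ in range(m + 1)]
        A[0][0], D[0][0], I[0][0] = 0.0, gap_open, gap_open   # Lemma 4.5.1
        for a in range(1, m + 1):                              # Lemma 4.5.2
            D[a][0] = D[a - 1][0] + gamma(i1 + a - 1, 0)
            A[a][0] = D[a][0]
            I[a][0] = D[a][0] + gap_open
        for b in range(1, n + 1):
            I[0][b] = I[0][b - 1] + gamma(0, j1 + b - 1)
            A[0][b] = I[0][b]
            D[0][b] = I[0][b] + gap_open
        for a in range(1, m + 1):
            i = i1 + a - 1
            for b in range(1, n + 1):
                j = j1 + b - 1
                # Assumed Gotoh-style forms: extend a gap or open a new one.
                D[a][b] = gamma(i, 0) + min(D[a - 1][b], A[a - 1][b] + gap_open)
                I[a][b] = gamma(0, j) + min(I[a][b - 1], A[a][b - 1] + gap_open)
                best = min(D[a][b], I[a][b])                   # case 3
                p, q = pr1[i], pr2[j]
                if p == i and q == j:                          # case 1 (assumed)
                    best = min(best, A[a - 1][b - 1] + gamma(i, j))
                elif i1 <= p < i and j1 <= q < j:              # case 2
                    inner = tables(p + 1, i - 1, q + 1, j - 1)[-1][-1]
                    best = min(best, A[p - i1][q - j1] + inner + gamma(i, j))
                A[a][b] = best
        return A

    return lambda i1, i2, j1, j2: tables(i1, i2, j1, j2)[-1][-1]
```

In case 2 the entry A[p - i1][q - j1] is the cost of the prefixes ending just before the two 5′ ends, and `inner` is the cost of the region enclosed by the two aligned base pairs, mirroring the three-part decomposition described above. Any base whose partner falls outside the current interval only ever reaches case 3, which implements the forced-gap rule.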

Basic algorithm. From Lemmas 4.5.1 to 4.5.5, we can compute A(R1, R2) = A(1, |R1| ; 1, |R2|) using a bottom-up approach. Moreover, it is clear that we do not need to compute all A(i1, i2 ; j1, j2). From Lemma 4.5.5, we need to compute only the A(i1, i2 ; j1, j2) such that (r1[i1 − 1], r1[i2 + 1]) is a base pair in R1 and (r2[j1 − 1], r2[j2 + 1]) is a base pair in R2.

Given R1 and R2, we can first compute sorted base-pair lists L1 for R1 and L2 for R2. This sorted order is in fact a bottom-up order since, for two base pairs s and t in R1, if s is before or inside t, then s is before t in the sorted list L1. For each pair of base pairs L1[i] = (i1, i2) and L2[j] = (j1, j2), we use Lemma 4.5.1 to Lemma 4.5.5 to compute A(i1 + 1, i2 − 1 ; j1 + 1, j2 − 1). We use the procedure in Figure 4.9 to compute A(R1[i1, i2], R2[j1, j2]). Figure 4.10 shows the algorithm.
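A sketch of building such a sorted base-pair list, assuming the structure is given as a 1-based partner array `pr` with pr[i] == i for an unpaired base (the function name is illustrative):

```python
def sorted_base_pairs(pr):
    """Return base pairs (i, j) with i < j, ordered by 3' end j.

    Scanning positions from left to right and emitting a pair when
    its 3' end is reached yields the bottom-up order directly.
    """
    return [(pr[j], j) for j in range(1, len(pr)) if pr[j] < j]
```

For the nested pairs (1, 8), (2, 7), and (4, 6), the list is (4, 6), (2, 7), (1, 8), so every pair appears after all pairs inside it; this is what allows the inner intervals to be solved before the intervals that enclose them.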

Let R1 and R2 be the two given RNA structures, and let P1 and P2 be the numbers of base pairs in R1 and R2, respectively. The time to compute A(i1, i2 ; j1, j2) is O((i2 − i1)(j2 − j1)), which is bounded by O(|R1| × |R2|). The time complexity of the algorithm in the worst case is therefore O(P1 P2 |R1| |R2|). We can improve our algorithm so that the worst-case running time is O(S1 S2 |R1| |R2|), where S1 and S2 are the numbers of stems, i.e., stacked pairs of maximal length, in R1 and R2, respectively. The space complexity of the algorithm is O(|R1| |R2|).

To compute A(R1[i1, i2], R2[j1, j2]):

compute A(0, 0), D(0, 0), and I(0, 0) as in Lemma 4.5.1;
for i := i1 to i2
    compute A(i, 0), D(i, 0), and I(i, 0) as in Lemma 4.5.2;
for j := j1 to j2
    compute A(0, j), D(0, j), and I(0, j) as in Lemma 4.5.2;
for i := i1 to i2
    for j := j1 to j2
        compute A(i, j), D(i, j), and I(i, j) as in Lemma 4.5.3, Lemma 4.5.4, and Lemma 4.5.5.

Fig. 4.9. Procedure for computing A(R1[i1, i2], R2[j1, j2]).

Input: R1[1..m] and R2[1..n]

compute a sorted (by 3′ end) base-pair list L1 for R1;
compute a sorted (by 3′ end) base-pair list L2 for R2;
for i := 1 to |L1|
    for j := 1 to |L2|
        let L1[i] = (r1[i1], r1[i2]);
        let L2[j] = (r2[j1], r2[j2]);
        compute A(R1[i1 + 1, i2 − 1], R2[j1 + 1, j2 − 1]);
compute A(R1[1, m], R2[1, n]);
trace back to find the optimal alignment between R1 and R2.

Fig. 4.10. Algorithm for computing A(R1, R2).

Notice that when one of the input RNA structures is a secondary structure, this algorithm computes the optimal solution of the problem. Also, since the number of tertiary interactions is relatively small compared with the number of secondary interactions, we can use this algorithm to compute the alignment between two RNA tertiary structures. Essentially, the algorithm tries to find the best sets of noncrossing base pairs to align and deletes the tertiary interactions. Although this is not an optimal solution, in practice it would produce a reasonable result by aligning most of the base pairs in the two RNA tertiary structures.
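To make the quantities in the running-time bounds concrete, the number of stems of a structure can be counted from its base pairs. This sketch only counts stems; it is not the improved algorithm. It assumes base pairs are given as tuples (i, j) with i < j, and the function name is illustrative.

```python
def count_stems(pairs):
    """Count maximal runs of stacked base pairs (i, j), (i+1, j-1), ...

    A pair opens a new stem exactly when the pair stacked directly
    outside it, (i - 1, j + 1), is absent from the structure.
    """
    pair_set = set(pairs)
    return sum((i - 1, j + 1) not in pair_set for i, j in pair_set)
```

A structure with base pairs (1, 10), (2, 9), (3, 8), (12, 20), and (13, 19) has five base pairs but only two stems, which indicates why a bound in terms of stems can be much smaller than one in terms of base pairs.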

4.5.2 Extended Edit Alignment

From the algorithm for computing the edit alignment of two RNA structures, it is easy to develop an algorithm for computing the extended edit alignment between the RNA structures. To begin with, we make the following modifications. A base in a base pair can be deleted, inserted, or aligned with another base. In these situations, the bond between the two bases in the base pair is broken, and therefore there is a base-pair bond-breaking cost.

The simplest way is to evenly distribute this cost to the two bases of the base pair.

Let Γg be the cost of a base-pair bond-breaking operation. Suppose that r1[i] is a base in R1 and r2[j] is a base in R2. Then the cost of deleting r1[i] is Γ(r1[i] → λ) if r1[i] is an unpaired base in R1 and Γ(r1[i] → λ) + Γg/2 if r1[i] is a base in a base pair in R1; the cost of inserting r2[j] is Γ(λ → r2[j]) if r2[j] is an unpaired base in R2 and Γ(λ → r2[j]) + Γg/2 if r2[j] is a base in a base pair in R2; the cost of aligning r1[i] to r2[j] is Γ(r1[i] → r2[j]) if both r1[i] and r2[j] are unpaired bases, Γ(r1[i] → r2[j]) + Γg/2 if exactly one of r1[i] and r2[j] is an unpaired base, and Γ(r1[i] → r2[j]) + Γg if both r1[i] and r2[j] are bases in base pairs.

After performing these changes, Lemmas 4.5.1 to 4.5.4 remain the same as before, and Lemma 4.5.5 needs to be changed so that r1[i] can be aligned with r2[j] regardless of whether or not they are unpaired bases.
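The modified per-base costs can be sketched as follows. The sketch assumes a hypothetical callable `base_cost(a, b)` playing the role of Γ for single bases, with None standing for λ, and a numeric `bond_break` for Γg; the names are illustrative.

```python
def extended_costs(base_cost, bond_break):
    """Per-base costs for the extended edit alignment.

    Each paired base involved in a single-base operation contributes
    half of the bond-breaking cost.
    """
    half = bond_break / 2

    def delete(a, paired):
        return base_cost(a, None) + (half if paired else 0.0)

    def insert(b, paired):
        return base_cost(None, b) + (half if paired else 0.0)

    def align(a, a_paired, b, b_paired):
        # 0, 1, or 2 paired bases add 0, half, or all of the bond cost.
        return base_cost(a, b) + half * (int(a_paired) + int(b_paired))

    return delete, insert, align
```

With a mismatch cost of 1 and a bond-breaking cost of 3, aligning an unpaired A to a paired G costs 1 + 1.5 = 2.5, and aligning two paired but identical bases individually costs 3, the full bond-breaking cost.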