Algorithms for RNA Structure Alignment

Kaizhong Zhang

4.5 Algorithms for RNA Structure Alignment

In this section, we consider the problem of computing the edit alignment and extended edit alignment between two RNA structures R₁ and R₂. Since computing the structure alignment for RNA tertiary structures is Max SNP-hard, we cannot expect to find the optimal solution in polynomial time.

However, we present algorithms that find the optimal solution when at least one of the RNA structures is a secondary structure, and good solutions when both RNA structures are tertiary structures. We therefore do not assume that the input RNA structures are secondary structures, and we do not use any tree representation.

4.5.1 Edit Alignment

Since aligning crossing base pairs is difficult, we add one more condition in defining a structural alignment (R₁, R₂) of R₁ and R₂.

(4) If (r₁[i], r₁[j]) and (r₁[k], r₁[l]) are base pairs in R₁ and (r₂[i], r₂[j]) and (r₂[k], r₂[l]) are base pairs in R₂, then (r₁[i], r₁[j]) and (r₁[k], r₁[l]) are noncrossing in R₁ and (r₂[i], r₂[j]) and (r₂[k], r₂[l]) are noncrossing in R₂.
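Condition (4) can be checked mechanically. The sketch below assumes the usual definition of crossing, which is not restated in this section: two base pairs (i, j) and (k, l) with i < j and k < l cross when i < k < j < l. The function names `crosses` and `is_noncrossing` are illustrative and do not come from the text.

```python
from itertools import combinations


def crosses(pair_a, pair_b):
    """Return True if base pairs (i, j) and (k, l) cross, i.e. i < k < j < l.

    Assumption: this is the standard notion of crossing base pairs.
    """
    # Normalise each pair so that the 5' end comes first, then order the
    # two pairs by their 5' end so that i <= k.
    (i, j), (k, l) = sorted([tuple(sorted(pair_a)), tuple(sorted(pair_b))])
    return i < k < j < l


def is_noncrossing(pairs):
    """Return True if no two base pairs in `pairs` cross each other."""
    return not any(crosses(a, b) for a, b in combinations(pairs, 2))
```

Under this definition, nested pairs such as (1, 8) and (3, 5) and side-by-side pairs such as (1, 3) and (5, 8) are noncrossing, whereas (1, 5) and (3, 8) cross.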

Therefore, even though the input RNA structures may have crossing base pairs, the aligned base pairs in them are noncrossing. We present an algorithm that computes the optimal alignment of two RNA structures based on this new alignment definition. We will show that our algorithm can be used for aligning tertiary structures in practical applications, though the alignment may not be the optimal one according to the original definition.

In extending the techniques of Gotoh [152] for handling gap initiation costs from sequence alignment to structure alignment, the main difficulty is that the deletion of a base pair may create two gaps simultaneously. We deal with this problem by treating the deletion of a base pair as two separate deletions of its two bases, each with half of the base-pair deletion cost. We use a bottom-up dynamic programming algorithm to find the optimal alignment between R₁ and R₂. That is, we consider the smaller substructures in R₁ and R₂ first and eventually consider the whole structures of R₁ and R₂.
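For orientation, the sequence-alignment technique being extended can be sketched as follows. This is a minimal Gotoh-style sketch for plain strings with no base pairs. The function name `gotoh_cost`, the unit insertion/deletion and mismatch costs, and the parameter `gap_open` (playing the role of a gap initiation cost) are assumptions made for illustration. The table names A, D, and I and the boundary values are chosen to resemble the notation used later in this section, but the code is not the structure-alignment algorithm itself.

```python
def gotoh_cost(s, t, gap_open, indel=1.0, mismatch=1.0):
    """Minimum cost of aligning strings s and t with a gap initiation cost."""
    inf = float("inf")
    m, n = len(s), len(t)
    # A: best cost; D: best cost with s[i-1] aligned to a gap;
    # I: best cost with t[j-1] aligned to a gap.
    A = [[inf] * (n + 1) for _ in range(m + 1)]
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    I = [[inf] * (n + 1) for _ in range(m + 1)]
    A[0][0] = 0.0
    D[0][0] = I[0][0] = gap_open  # opening cost is pre-paid on the boundary
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + indel
        A[i][0] = D[i][0]
        I[i][0] = D[i][0] + gap_open
    for j in range(1, n + 1):
        I[0][j] = I[0][j - 1] + indel
        A[0][j] = I[0][j]
        D[0][j] = I[0][j] + gap_open
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Either extend an open gap or open a new gap after any alignment.
            D[i][j] = min(D[i - 1][j], A[i - 1][j] + gap_open) + indel
            I[i][j] = min(I[i][j - 1], A[i][j - 1] + gap_open) + indel
            sub = 0.0 if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = min(D[i][j], I[i][j], A[i - 1][j - 1] + sub)
    return A[m][n]
```

With `gap_open = 2.0`, aligning "ACGT" with "AT" costs 4.0: one gap of length two pays the opening cost once, which is cheaper than two separate gaps.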

Property of optimal alignments. Consider two RNA structures R₁ and R₂. Let γ_g be the gap cost. We use Γ( ) to define γ(i, j) for 0 ≤ i ≤ |R₁| and 0 ≤ j ≤ |R₂|.

72 Data Mining in Bioinformatics

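$$
\begin{aligned}
\gamma(i, 0) &= \Gamma(r_1[i] \to \lambda) && \text{if } i = p_{r_1}(i) \\
\gamma(0, i) &= \Gamma(\lambda \to r_2[i]) && \text{if } i = p_{r_2}(i) \\
\gamma(i, j) &= \Gamma(r_1[i] \to r_2[j]) && \text{if } i = p_{r_1}(i) \text{ and } j = p_{r_2}(j) \\
\gamma(i, 0) = \gamma(j, 0) &= \Gamma((r_1[i], r_1[j]) \to \lambda)/2 && \text{if } i = p_{r_1}(j) < j \\
\gamma(0, i) = \gamma(0, j) &= \Gamma(\lambda \to (r_2[i], r_2[j]))/2 && \text{if } i = p_{r_2}(j) < j \\
\gamma(i, j) &= \Gamma((r_1[i_1], r_1[i]) \to (r_2[j_1], r_2[j])) && \text{if } i_1 = p_{r_1}(i) < i \text{ and } j_1 = p_{r_2}(j) < j
\end{aligned}
$$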

From this definition, if r₁[i] is a single base, then γ(i, 0) is the cost of deleting this base, and if r₁[i] is a base of a base pair, then γ(i, 0) is half of the cost of deleting the base pair. Therefore we distribute the deletion cost of a base pair evenly to its two bases. The meaning of γ(0, i) is similar. When i > 0 and j > 0, γ(i, j) is the cost of aligning base pairs (r₁[i₁], r₁[i]) and (r₂[j₁], r₂[j]).
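The case analysis for γ can be written as a small function. The sketch below makes several assumptions that are not in the text: sequences and partner arrays are 1-indexed (index 0 is a placeholder), `p[i] == i` marks an unpaired base, and Γ is supplied as two hypothetical callables, `base_cost(a, b)` and `pair_cost(x, y)`, in which `None` stands for λ.

```python
def make_gamma(s1, p1, s2, p2, base_cost, pair_cost):
    """Build gamma(i, j) from 1-indexed sequences and partner arrays.

    Assumptions: p[i] == i marks an unpaired base, otherwise p[i] is the
    partner of i; base_cost and pair_cost play the role of Gamma.
    """
    def pair_at(s, p, i):
        lo, hi = sorted((i, p[i]))
        return (s[lo], s[hi])

    def gamma(i, j):
        if i == 0 and j == 0:
            raise ValueError("gamma(0, 0) is not defined")
        if j == 0:  # deleting r1[i]
            if p1[i] == i:
                return base_cost(s1[i], None)
            # Half of the base-pair deletion cost for each of its two bases.
            return pair_cost(pair_at(s1, p1, i), None) / 2
        if i == 0:  # inserting r2[j]
            if p2[j] == j:
                return base_cost(None, s2[j])
            return pair_cost(None, pair_at(s2, p2, j)) / 2
        if p1[i] == i and p2[j] == j:  # two unpaired bases
            return base_cost(s1[i], s2[j])
        if p1[i] < i and p2[j] < j:  # two base pairs whose 3' ends are i and j
            return pair_cost(pair_at(s1, p1, i), pair_at(s2, p2, j))
        raise ValueError("gamma(i, j) is not defined for this combination")

    return gamma
```

For example, with `s1 = "_GAC"` and `p1 = [0, 3, 2, 1]` (G at position 1 paired with C at position 3), `gamma(1, 0)` and `gamma(3, 0)` each return half of the base-pair deletion cost, while `gamma(2, 0)` returns the cost of deleting the unpaired A.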

We now consider the optimal alignment between R₁[i₁, i₂] and R₂[j₁, j₂].

We use A(i₁, i₂; j₁, j₂) to represent the optimal alignment cost between R₁[i₁, i₂] and R₂[j₁, j₂]. We use D(i₁, i₂; j₁, j₂) to represent the optimal alignment cost such that r₁[i₂] is aligned to −. If i₁ ≤ p_r₁(i₂) < i₂, then by the definition of alignment, in the optimal alignment of D(i₁, i₂; j₁, j₂), r₁[p_r₁(i₂)] has to be aligned to −. We use I(i₁, i₂; j₁, j₂) to represent the optimal alignment cost such that r₂[j₂] is aligned to −. If j₁ ≤ p_r₂(j₂) < j₂, then in the optimal alignment of I(i₁, i₂; j₁, j₂), r₂[p_r₂(j₂)] has to be aligned to −.

In computing A(i₁, i₂; j₁, j₂), D(i₁, i₂; j₁, j₂), and I(i₁, i₂; j₁, j₂), for any i₁ ≤ i ≤ i₂, if p_r₁(i) < i₁ or i₂ < p_r₁(i), then r₁[i] will be forced to be aligned to −; for any j₁ ≤ j ≤ j₂, if p_r₂(j) < j₁ or j₂ < p_r₂(j), then r₂[j] will be forced to be aligned to −. It will be clear from Lemmas 4.5.3, 4.5.4, and 4.5.5 that this proposition is used to deal with two situations: aligning one base pair among crossing base pairs and deleting a base pair.
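The forced positions are easy to enumerate. The following minimal sketch assumes a 1-indexed partner array `p` in which `p[i] == i` marks an unpaired base; the name `forced_gap_positions` is illustrative.

```python
def forced_gap_positions(p, lo, hi):
    """Positions i in [lo, hi] whose partner p[i] lies outside [lo, hi].

    Assumption: p is a 1-indexed partner array with p[i] == i for unpaired
    bases. By the rule above, these positions must be aligned to a gap.
    """
    return [i for i in range(lo, hi + 1) if p[i] < lo or p[i] > hi]
```

For a structure with base pairs (1, 6) and (2, 5), the interval [2, 6] forces position 6 to a gap because its partner, position 1, lies outside the interval, whereas the interval [2, 5] forces nothing.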

We can now consider how to compute the optimal alignment between R₁[i₁, i₂] and R₂[j₁, j₂]. The first two lemmas are trivial, so we omit their proofs.

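Lemma 4.5.1.

$$
A(\emptyset; \emptyset) = 0, \qquad D(\emptyset; \emptyset) = \gamma_g, \qquad I(\emptyset; \emptyset) = \gamma_g
$$

Lemma 4.5.2. For i₁ ≤ i ≤ i₂ and j₁ ≤ j ≤ j₂,

$$
\begin{aligned}
D(i_1, i; \emptyset) &= D(i_1, i-1; \emptyset) + \gamma(i, 0) & I(\emptyset; j_1, j) &= I(\emptyset; j_1, j-1) + \gamma(0, j) \\
A(i_1, i; \emptyset) &= D(i_1, i; \emptyset) & A(\emptyset; j_1, j) &= I(\emptyset; j_1, j) \\
I(i_1, i; \emptyset) &= D(i_1, i; \emptyset) + \gamma_g & D(\emptyset; j_1, j) &= I(\emptyset; j_1, j) + \gamma_g
\end{aligned}
$$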
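The boundary values of Lemmas 4.5.1 and 4.5.2 can be filled in with a single pass. The sketch below handles the case in which the R₂ interval is empty; the function name and the list-based layout are assumptions for illustration. The symmetric case (empty R₁ interval) follows by exchanging the roles of D and I.

```python
def empty_side_boundaries(deletion_costs, gap_cost):
    """Boundary tables for aligning R1[i1, i] with an empty R2 interval.

    deletion_costs[k] is assumed to hold gamma(i1 + k, 0). Index 0 of each
    returned list is the empty/empty case of Lemma 4.5.1; index k + 1
    corresponds to i = i1 + k in Lemma 4.5.2.
    """
    A, D, I = [0.0], [gap_cost], [gap_cost]
    for cost in deletion_costs:
        D.append(D[-1] + cost)      # D(i1, i; 0) = D(i1, i-1; 0) + gamma(i, 0)
        A.append(D[-1])             # A(i1, i; 0) = D(i1, i; 0)
        I.append(D[-1] + gap_cost)  # I(i1, i; 0) = D(i1, i; 0) + gamma_g
    return A, D, I
```

Calling `empty_side_boundaries([1.0, 1.5, 1.5], 2.0)` gives D = [2.0, 3.0, 4.5, 6.0], so the gap cost is charged once for the whole run of deletions.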

Lemma 4.5.3. For i₁ ≤ i ≤ i₂ and j₁ ≤ j ≤ j₂,

Proof. Similar to Lemma 4.5.3.

Lemma 4.5.5. For i₁ ≤ i ≤ i₂ and j₁ ≤ j ≤ j₂, minimum of the three cases.


For case 2, since i₁ ≤ p_r₁(i) < i and j₁ ≤ p_r₂(j) < j, both (r₁[p_r₁(i)], r₁[i]) and (r₂[p_r₂(j)], r₂[j]) are base pairs. In the optimal alignment, (r₁[p_r₁(i)], r₁[i]) may be aligned to (−, −), (r₂[p_r₂(j)], r₂[j]) may be aligned to (−, −), or (r₁[p_r₁(i)], r₁[i]) may be aligned to (r₂[p_r₂(j)], r₂[j]).

If (r₁[p_r₁(i)], r₁[i]) is aligned to (−, −), then A(i₁, i; j₁, j) = D(i₁, i; j₁, j). If (r₂[p_r₂(j)], r₂[j]) is aligned to (−, −), then A(i₁, i; j₁, j) = I(i₁, i; j₁, j).

If (r₁[p_r₁(i)], r₁[i]) is aligned to (r₂[p_r₂(j)], r₂[j]), then the optimal alignment between R₁[i₁, i] and R₂[j₁, j] is divided into three parts: (1) the optimal alignment between R₁[i₁, p_r₁(i)−1] and R₂[j₁, p_r₂(j)−1], (2) the optimal alignment between R₁[p_r₁(i)+1, i−1] and R₂[p_r₂(j)+1, j−1], and (3) the alignment of (r₁[p_r₁(i)], r₁[i]) to (r₂[p_r₂(j)], r₂[j]). This is true since any base pair across (r₁[p_r₁(i)], r₁[i]) or (r₂[p_r₂(j)], r₂[j]) should be aligned to −, and the cost of such an alignment has already been included in part 1 and part 2. Hence we have

$$
A(i_1, i; j_1, j) = A(i_1, p_{r_1}(i)-1; j_1, p_{r_2}(j)-1) + A(p_{r_1}(i)+1, i-1; p_{r_2}(j)+1, j-1) + \gamma(i, j).
$$

In case 3, we consider all the other possibilities in which we cannot align r₁[i] to r₂[j]. We examine several subcases involving base pairs.

- Subcase 1: p_r₁(i) > i. This means that r₁[p_r₁(i)] is outside the interval [i₁, i] and we have to align r₁[i] to −.

- Subcase 2: p_r₂(j) > j. This is similar to subcase 1. Together with subcase 1, this implies that when p_r₁(i) > i and p_r₂(j) > j, even if r₁[i] = r₂[j], we cannot align them to each other.

- Subcase 3: p_r₁(i) < i₁. This is similar to subcase 1. Together with subcase 1, we know that if a base pair is across an aligned base pair, then it has to be aligned to −.

- Subcase 4: p_r₂(j) < j₁. This is similar to subcase 3.
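The case analysis can be summarised in code. The sketch below computes one value A(i₁, i; j₁, j) from the corresponding D and I values and from previously computed A values. Cases 2 and 3 follow the discussion above. The treatment of case 1 (both r₁[i] and r₂[j] unpaired, with the third option being to align them) is an assumption based on the usual form of such recurrences, since only cases 2 and 3 are spelled out here. The names `best_alignment_cell`, `A`, and `gamma`, the 1-indexed partner arrays, and the callable-based lookup are likewise illustrative.

```python
def best_alignment_cell(i, j, i1, j1, p1, p2, A, D_ij, I_ij, gamma):
    """Value of A(i1, i; j1, j), given D and I for the same cell.

    A(a, b, c, d) is assumed to return the already computed A(a, b; c, d),
    including empty intervals (b == a - 1 or d == c - 1).
    p1 and p2 are 1-indexed partner arrays with p[i] == i for unpaired bases.
    """
    # Aligning r1[i] or r2[j] to a gap is always allowed (this is all of case 3).
    best = min(D_ij, I_ij)
    if p1[i] == i and p2[j] == j:
        # Case 1 (assumed form): align the two unpaired bases.
        best = min(best, A(i1, i - 1, j1, j - 1) + gamma(i, j))
    elif i1 <= p1[i] < i and j1 <= p2[j] < j:
        # Case 2: align the two base pairs whose 3' ends are i and j.
        outside = A(i1, p1[i] - 1, j1, p2[j] - 1)
        inside = A(p1[i] + 1, i - 1, p2[j] + 1, j - 1)
        best = min(best, outside + inside + gamma(i, j))
    return best
```

In every subcase of case 3 neither branch applies, so the function returns min(D, I), which matches the statement that r₁[i] and r₂[j] cannot be aligned to each other there.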

Basic algorithm. From Lemmas 4.5.1 to 4.5.5, we can compute A(R₁, R₂) = A(1, |R₁|; 1, |R₂|) using a bottom-up approach. Moreover, it is clear that we do not need to compute all A(i₁, i₂; j₁, j₂). From Lemma 4.5.5, we need to compute only the A(i₁, i₂; j₁, j₂) such that (r₁[i₁−1], r₁[i₂+1]) is a base pair in R₁ and (r₂[j₁−1], r₂[j₂+1]) is a base pair in R₂.

Given R₁ and R₂, we can first compute sorted base-pair lists L₁ for R₁ and L₂ for R₂. This sorted order is in fact a bottom-up order since, for two base pairs s and t in R₁, if s is before or inside t, then s is before t in the sorted list L₁. For each pair of base pairs L₁[i] = (i₁, i₂) and L₂[j] = (j₁, j₂), we use Lemmas 4.5.1 to 4.5.5 to compute A(i₁+1, i₂−1; j₁+1, j₂−1). We use the procedure in Figure 4.9 to compute A(R₁[i₁, i₂], R₂[j₁, j₂]). Figure 4.10 shows the algorithm.
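A sorted base-pair list of this kind can be built directly from a partner array. The sketch below assumes a 1-indexed partner array `p` with `p[i] == i` for unpaired bases and sorts by the 3′ end of each pair; the function name is illustrative.

```python
def base_pairs_by_3prime_end(p):
    """Base pairs (i, j) with i < j, listed in increasing order of 3' end j.

    Assumption: p is a 1-indexed partner array (p[0] is a placeholder) with
    p[i] == i for unpaired bases.
    """
    # Scanning j upward and emitting a pair when its 3' end is reached
    # yields the list already sorted by 3' end.
    return [(p[j], j) for j in range(1, len(p)) if p[j] < j]
```

For base pairs (1, 6), (2, 5), and (7, 9), the result is [(2, 5), (1, 6), (7, 9)]: the inner pair (2, 5) precedes the enclosing pair (1, 6), and (1, 6) precedes (7, 9), which lies after it.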

To compute A(R₁[i₁, i₂], R₂[j₁, j₂])

  compute A(0, 0), D(0, 0), and I(0, 0) as in Lemma 4.5.1;
  for i := i₁ to i₂
    compute A(i, 0), D(i, 0), and I(i, 0) as in Lemma 4.5.2;
  for j := j₁ to j₂
    compute A(0, j), D(0, j), and I(0, j) as in Lemma 4.5.2;
  for i := i₁ to i₂
    for j := j₁ to j₂
      compute A(i, j), D(i, j), and I(i, j) as in Lemma 4.5.3, Lemma 4.5.4, and Lemma 4.5.5.

Fig. 4.9. Procedure for computing A(R₁[i₁, i₂], R₂[j₁, j₂]).

Input: R₁[1..m] and R₂[1..n]

  compute a sorted (by 3′ end) base pair list L₁ for R₁;
  compute a sorted (by 3′ end) base pair list L₂ for R₂;
  for i := 1 to |L₁|
    for j := 1 to |L₂|
      let L₁[i] = (r₁[i₁], r₁[i₂]);
      let L₂[j] = (r₂[j₁], r₂[j₂]);
      compute A(R₁[i₁+1, i₂−1], R₂[j₁+1, j₂−1]);
  compute A(R₁[1, m], R₂[1, n]);
  trace back to find the optimal alignment between R₁ and R₂.

Fig. 4.10. Algorithm for computing A(R₁, R₂).

Let R₁ and R₂ be the two given RNA structures, and let P₁ and P₂ be the number of base pairs in R₁ and R₂, respectively. The time to compute A(i₁, i₂; j₁, j₂) is O((i₂−i₁)(j₂−j₁)), which is bounded by O(|R₁| × |R₂|). The time complexity of the algorithm in the worst case is O(P₁P₂|R₁||R₂|). We can improve our algorithm so that the worst-case running time is O(S₁S₂|R₁||R₂|), where S₁ and S₂ are the number of stems, i.e., stacked pairs of maximal length, in R₁ and R₂, respectively. The space complexity of the algorithm is O(|R₁||R₂|).
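To see how the number of stems compares with the number of base pairs, the stems can be counted from the base-pair list. The sketch below assumes that two base pairs (i, j) and (i+1, j−1) are stacked and that a stem is a maximal run of stacked pairs; the function name `count_stems` is illustrative.

```python
def count_stems(pairs):
    """Count maximal runs of stacked base pairs (i, j), (i+1, j-1), ...

    Assumption: pairs are (i, j) tuples with i < j, and (i, j) is stacked
    on (i-1, j+1) when both are present.
    """
    pair_set = set(pairs)
    # A pair opens a new stem exactly when no pair is stacked outside it.
    return sum(1 for (i, j) in pair_set if (i - 1, j + 1) not in pair_set)
```

A structure with base pairs (1, 10), (2, 9), (3, 8), (12, 20), and (13, 19) has five base pairs but only two stems, so the factor S₁S₂ in the improved bound can be considerably smaller than P₁P₂.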

Notice that when one of the input RNA structures is a secondary structure, this algorithm computes the optimal solution of the problem. Also, since the number of tertiary interactions is relatively small compared with the number of secondary interactions, we can use this algorithm to compute the alignment between two RNA tertiary structures. Essentially the algorithm tries to find the best sets of noncrossing base pairs to align and delete tertiary interactions. Although this is not an optimal solution, in practice it would produce a reasonable result by aligning most of the base pairs in the two RNA tertiary structures.

4.5.2 Extended Edit Alignment

From the algorithm for computing the edit alignment of two RNA structures, it is easy to develop an algorithm for computing the extended edit alignment between the RNA structures. To begin with, we make the following modifications. A base in a base pair can be deleted, inserted, or aligned with another base. In these situations, the bond between the two bases in the base pair is broken, and therefore there is a base-pair bond-breaking cost. The simplest way is to evenly distribute this cost to the two bases of the base pair.

Let Γ_g be the cost of a base-pair bond-breaking operation. Suppose that r₁[i] is a base in R₁ and r₂[j] is a base in R₂. Then the cost of deleting r₁[i] is Γ(r₁[i]→λ) if r₁[i] is an unpaired base in R₁ and Γ(r₁[i]→λ) + Γ_g/2 if r₁[i] is a base in a base pair in R₁; the cost of inserting r₂[j] is Γ(λ→r₂[j]) if r₂[j] is an unpaired base in R₂ and Γ(λ→r₂[j]) + Γ_g/2 if r₂[j] is a base in a base pair in R₂; the cost of aligning r₁[i] to r₂[j] is Γ(r₁[i]→r₂[j]) if both r₁[i] and r₂[j] are unpaired bases, Γ(r₁[i]→r₂[j]) + Γ_g/2 if exactly one of r₁[i] and r₂[j] is an unpaired base, and Γ(r₁[i]→r₂[j]) + Γ_g if both r₁[i] and r₂[j] are bases in base pairs.
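These base-level costs can be expressed compactly. The sketch below assumes a hypothetical callable `base_cost(a, b)` playing the role of Γ on single bases (with `None` for λ) and a number `bond_break` for Γ_g; the function names and the boolean `paired` flags are illustrative.

```python
def extended_costs(base_cost, bond_break):
    """Return (delete, insert, align) cost functions for base-level operations.

    Assumptions: base_cost(a, b) stands for Gamma on single bases, with None
    for lambda; bond_break stands for Gamma_g; the `paired` flags say whether
    a base belongs to a base pair.
    """
    def delete(a, paired):
        return base_cost(a, None) + (bond_break / 2 if paired else 0.0)

    def insert(b, paired):
        return base_cost(None, b) + (bond_break / 2 if paired else 0.0)

    def align(a, a_paired, b, b_paired):
        # Each paired base contributes half of the bond-breaking cost.
        broken = int(a_paired) + int(b_paired)
        return base_cost(a, b) + broken * bond_break / 2

    return delete, insert, align
```

With a unit base cost and `bond_break = 1.0`, deleting a paired base costs 1.5, and aligning two mismatched bases that both belong to base pairs costs 2.0.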

After performing these changes, Lemmas 4.5.1 to 4.5.4 remain the same as before, and Lemma 4.5.5 needs to be changed so that r₁[i] can be aligned with r₂[j] regardless of whether or not they are unpaired bases.

From the document Advanced Information and Knowledge Processing (pages 77-82)