RNA Structure Comparison and Alignment Based on Edit Operations

Kaizhong Zhang

4.2 RNA Structure Comparison and Alignment Models

structure comparison [205, 254] and alignment models based on the extended edit operations are then considered.

An RNA structure is represented by R(P), where R is a sequence of nucleotides with r[i] representing the ith nucleotide, and P ⊂ {1, 2, ..., |R|}² is a set of pairs in which each element (i, j), i < j, represents a base pair (r[i], r[j]) in R. We use R[i, j] to represent the subsequence of nucleotides from r[i] to r[j]. We assume that base pairs in R(P) do not share participating bases. Formally, for any (i1, j1) and (i2, j2) in P, j1 ≠ i2, i1 ≠ j2, and i1 = i2 if and only if j1 = j2.

Let s = r[k] be an unpaired base and p = (r[i], r[j]) be a base pair in R(P). We define the relation between s and p as follows. We say s is before p if k < i. We say s is inside p if i < k < j. We say s is after p if j < k.

Let s = (r[i], r[j]) and t = (r[k], r[l]) be two base pairs in R(P). We define the relation between s and t as follows. We say s is before t if j < k. We say s is inside t if k < i and j < l. We say s and t are crossing if i < k < j < l or k < i < l < j.

For an RNA structure R(P), if no two of its base pairs are crossing, then we say R(P) is a secondary structure. Otherwise, we say R(P) is a tertiary structure.
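The definitions above can be made concrete with a small sketch. It assumes a structure is given as a sequence length n and a list of 1-based position pairs (i, j); the function names check_pairs, are_crossing, and classify are illustrative and do not come from the text.

```python
from itertools import combinations


def check_pairs(n, pairs):
    """Raise ValueError unless `pairs` is a valid P for a sequence of length n.

    Positions are 1-based, as in the text. Each pair must satisfy i < j, and
    no position may take part in more than one pair.
    """
    used = set()
    for i, j in pairs:
        if not 1 <= i < j <= n:
            raise ValueError(f"invalid pair {(i, j)}")
        if i in used or j in used:
            raise ValueError(f"position shared by two pairs: {(i, j)}")
        used.update((i, j))


def are_crossing(s, t):
    """Two base pairs cross if i < k < j < l or k < i < l < j."""
    (i, j), (k, l) = s, t
    return i < k < j < l or k < i < l < j


def classify(n, pairs):
    """Return "secondary" if no two base pairs cross, else "tertiary"."""
    check_pairs(n, pairs)
    crossing = any(are_crossing(s, t) for s, t in combinations(pairs, 2))
    return "tertiary" if crossing else "secondary"
```

For instance, the nested pairs (1, 8) and (2, 7) give a secondary structure, while (1, 5) and (3, 8) cross and give a tertiary structure.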

For an RNA structure R(P), we define pr( ) as follows.

$$
pr(i) =
\begin{cases}
i & \text{if } r[i] \text{ is an unpaired base,} \\
j & \text{if } (r[i], r[j]) \text{ or } (r[j], r[i]) \text{ is a base pair in } P.
\end{cases}
$$

By this definition, pr(i) ≠ i if and only if r[i] is a base in a base pair of R(P), and pr(i) = i if and only if r[i] is an unpaired base of R(P). If pr(i) ≠ i, then pr(i) is the base paired with base i. When there is no confusion, we use R instead of R(P) to represent an RNA structure, assuming that there is an associated function pr( ).
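As an illustration, the function pr( ) can be tabulated from the pair set P. The sketch below assumes 1-based positions and a dictionary as the lookup table; the name build_pr is hypothetical.

```python
def build_pr(n, pairs):
    """Return a dict mapping each position 1..n to its pr value.

    Unpaired positions map to themselves; the two positions of a base pair
    map to each other.
    """
    pr = {i: i for i in range(1, n + 1)}
    for i, j in pairs:
        pr[i] = j
        pr[j] = i
    return pr


# A five-base structure with a single base pair (1, 4).
pr = build_pr(5, [(1, 4)])
unpaired = [i for i in pr if pr[i] == i]
```

Here positions 1 and 4 point at each other, and the list unpaired collects the positions with pr(i) = i.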

4.2.1 RNA Structure Comparison and Alignment Based on Edit Operations

Following the tradition in sequence comparison [296, 373], we define three edit operations, substitute, delete, and insert, on RNA structures. For a given RNA structure R, each operation can be applied to either a base pair or an unpaired base. To substitute a base pair is to replace one base pair with another. This means that at the sequence level, two bases may be changed at the same time. To delete a base pair is to delete the two bases of the base pair. At the sequence level, this means to delete two bases at the same time.

To insert a base pair is to insert a new base pair. At the sequence level, this means to insert two bases at the same time. To substitute (relabel) an unpaired base is to replace it with another unpaired base. To delete an unpaired base is to delete the base from the sequence. To insert an unpaired base is to insert a new base into the sequence as an unpaired base. Note that there is no substitute operation that can change a base pair to an unpaired base or vice versa.

62 Data Mining in Bioinformatics

Figure 4.1 shows edit operations on RNA structures.

Fig. 4.1. RNA structure edit operations. Base-pair substitution is shown at the left and base-pair deletion is shown at the right.

In this model, a base pair can be matched only with a base pair. In general, this is a reasonable model since in RNA structures when one base of a base pair changes, its partner usually also changes to conserve that pairing relationship.

We represent an edit operation as a → b, where a and b are λ, the null label, labels of base pairs from {A, C, G, U} × {A, C, G, U}, or unpaired bases from {A, C, G, U}. We call a → b a substitute operation if a ≠ λ and b ≠ λ, a delete operation if b = λ, and an insert operation if a = λ. Let Γ be a cost function that assigns to each edit operation a → b a nonnegative real number Γ(a → b). We constrain Γ to be a distance metric. That is, (1) Γ(a → b) ≥ 0, Γ(a → a) = 0, (2) Γ(a → b) = Γ(b → a), and (3) Γ(a → c) ≤ Γ(a → b) + Γ(b → c). We extend Γ to a sequence of edit operations S = s1, s2, ..., sn by letting

$$
\Gamma(S) = \sum_{i=1}^{n} \Gamma(s_i).
$$
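A minimal sketch of one possible cost function follows. The numeric costs are assumptions chosen for illustration, not values from the text: λ is encoded as None, an unpaired base as a one-character string, and a base pair as a two-character string. The metric check only considers operations the model allows, since a base pair can never be changed into an unpaired base.

```python
from itertools import product

BASES = "ACGU"
LAMBDA = None  # the null label
UNPAIRED = list(BASES)
PAIRS = [x + y for x, y in product(BASES, repeat=2)]
LABELS = [LAMBDA] + UNPAIRED + PAIRS


def allowed(a, b):
    """An operation a -> b is allowed unless it mixes a pair with a single base."""
    return a is LAMBDA or b is LAMBDA or len(a) == len(b)


def gamma(a, b):
    """Hypothetical cost: indel 1.0 per base, substitution 1.0 (base) or 1.5 (pair)."""
    if not allowed(a, b):
        raise ValueError("a base pair cannot be matched with an unpaired base")
    if a == b:
        return 0.0
    if a is LAMBDA or b is LAMBDA:
        label = b if a is LAMBDA else a
        return float(len(label))
    return 1.0 if len(a) == 1 else 1.5


def is_metric(cost, labels):
    """Check conditions (1)-(3) over all allowed operations on `labels`."""
    for a, b in product(labels, repeat=2):
        if allowed(a, b) and (cost(a, b) < 0 or cost(a, b) != cost(b, a)):
            return False
    if any(cost(a, a) != 0 for a in labels):
        return False
    for a, b, c in product(labels, repeat=3):
        if allowed(a, b) and allowed(b, c) and allowed(a, c):
            if cost(a, c) > cost(a, b) + cost(b, c):
                return False
    return True


def sequence_cost(ops):
    """Gamma(S): the sum of the costs of the operations in S."""
    return sum(gamma(a, b) for a, b in ops)
```

With these assumed costs, the sequence consisting of the substitution A → G, the deletion of the base pair GC, and the insertion of an unpaired U costs 1.0 + 2.0 + 1.0 = 4.0.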

The edit distance between two RNA structures R1 and R2 is defined by considering the minimum-cost edit operation sequence that transforms R1 to R2. Formally, the edit distance between R1 and R2 is defined as

$$
D(R_1, R_2) = \min_{S} \{\, \Gamma(S) \mid S \text{ is an edit operation sequence taking } R_1 \text{ to } R_2 \,\}.
$$

In the computation of the edit distance, the goal is to find the minimum-cost edit sequence that can change one structure to the other. A similarity (maximization) version can also be considered, where the goal is to find the maximum-scoring edit sequence that can change one structure to the other. We will refer to the edit distance, D(R1, R2), as the RNA structure comparison model based on edit operations.

RNA structure distance/similarity can also be represented by an alignment of two RNA structures. In the alignment representation, the gap initiation cost can be considered. Formally, given two RNA structures R1 and R2, a structural alignment of R1 and R2 is represented by (R1′, R2′) satisfying the following conditions.

(1) R1′ is R1 with some new '-'s inserted and R2′ is R2 with some new '-'s inserted such that |R1′| = |R2′|.

(2) If r1′[i] is an unpaired base in R1, then either r2′[i] is an unpaired base in R2 or r2′[i] = '-'. If r2′[i] is an unpaired base in R2, then either r1′[i] is an unpaired base in R1 or r1′[i] = '-'.

(3) If (r1′[i], r1′[j]) is a base pair in R1, then either (r2′[i], r2′[j]) is a base pair in R2 or r2′[i] = r2′[j] = '-'. If (r2′[i], r2′[j]) is a base pair in R2, then either (r1′[i], r1′[j]) is a base pair in R1 or r1′[i] = r1′[j] = '-'.
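The three conditions can be checked mechanically. The sketch below assumes each aligned structure is given as a string over A, C, G, U, and '-' together with its base pairs expressed in 1-based alignment columns, and that pairs only refer to non-gap columns; the function name is_structural_alignment is illustrative.

```python
GAP = "-"


def is_structural_alignment(row1, pairs1, row2, pairs2):
    """Check conditions (1)-(3) for two aligned rows and their base pairs."""
    # Condition (1): both rows have the same length after gap insertion.
    if len(row1) != len(row2):
        return False
    p1, p2 = set(pairs1), set(pairs2)

    def one_direction(row_a, p_a, row_b, p_b):
        paired_a = {k for pair in p_a for k in pair}
        paired_b = {k for pair in p_b for k in pair}
        # Condition (2): an unpaired base may face an unpaired base or a
        # gap, but never a base that belongs to a base pair.
        for i, base in enumerate(row_a, start=1):
            unpaired_in_a = base != GAP and i not in paired_a
            if unpaired_in_a and i in paired_b:
                return False
        # Condition (3): a base pair faces the same base pair or two gaps.
        for i, j in p_a:
            both_gaps = row_b[i - 1] == GAP and row_b[j - 1] == GAP
            if (i, j) not in p_b and not both_gaps:
                return False
        return True

    return one_direction(row1, p1, row2, p2) and one_direction(row2, p2, row1, p1)
```

For example, the rows GCAAGC with pairs (1, 6), (2, 5) and G-AU-C with pair (1, 6) pass the check: the pair (2, 5) of the first structure faces two gaps, which corresponds to a base-pair deletion.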

From this definition, it is clear that an alignment preserves the order of unpaired bases and the topological relationship between base pairs. Since the alignment specifies how base pairs are aligned and preserves the relationship between the base pairs, it is in fact a structural alignment. Figure 4.2 gives a simple illustration of this alignment.

Fig. 4.2. RNA structure alignment with edit operations. The figure labels examples of base-pair match, base-pair deletion, base-pair substitution, base match, base substitution, base insertion, and a gap.

A gap in an alignment (R1′, R2′) is a consecutive subsequence of '-'s in either R1′ or R2′ with maximal length. More formally, [i ... j] is a gap in (R1′, R2′) if either r1′[k] = '-' for i ≤ k ≤ j, r1′[i − 1] ≠ '-', and r1′[j + 1] ≠ '-', or r2′[k] = '-' for i ≤ k ≤ j, r2′[i − 1] ≠ '-', and r2′[j + 1] ≠ '-'. For each gap in an alignment, in addition to the insertion/deletion costs, we will assign a constant, gap_cost, as the gap initiation cost. This means that longer gaps are preferred since for a longer gap the additional cost distributed to each base is relatively small. This kind of affine gap penalty has long been used in sequence alignment [152]. In biological alignment, whenever possible, longer gaps are preferred since it is difficult to delete the first element, but after that, continuing to delete subsequent elements is much easier.
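Counting gaps amounts to counting maximal runs of '-' in each aligned row. A short sketch, assuming rows are plain strings and using the illustrative names count_gaps and gap_initiation_cost:

```python
from itertools import groupby

GAP = "-"


def count_gaps(row):
    """Number of maximal runs of consecutive gap symbols in one aligned row."""
    return sum(1 for symbol, _ in groupby(row) if symbol == GAP)


def gap_initiation_cost(row1, row2, gap_cost):
    """Total gap initiation cost: gap_cost times the gaps in both rows."""
    return gap_cost * (count_gaps(row1) + count_gaps(row2))
```

The row A--CG-U contains two gaps even though it contains three gap symbols, so it is charged the initiation cost twice; a row such as A---CGU with the same number of gap symbols would be charged only once.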

Given an alignment (R1′, R2′), we define an unpaired base match set SM, an unpaired base deletion set SD, an unpaired base insertion set SI, a base-pair match set PM, a base-pair deletion set PD, and a base-pair insertion set PI as follows.

SM = {i | r1′[i] and r2′[i] are unpaired bases in R1 and R2}.

SD = {i | r1′[i] is an unpaired base in R1 and r2′[i] = '-'}.

SI = {i | r2′[i] is an unpaired base in R2 and r1′[i] = '-'}.

PM = {(i, j) | (r1′[i], r1′[j]) and (r2′[i], r2′[j]) are base pairs in R1 and R2}.

PD = {(i, j) | (r1′[i], r1′[j]) is a base pair in R1 and r2′[i] = r2′[j] = '-'}.

PI = {(i, j) | (r2′[i], r2′[j]) is a base pair in R2 and r1′[i] = r1′[j] = '-'}.

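The cost of an alignment (R1′, R2′) is defined as follows, where #gap is the number of gaps in (R1′, R2′).

$$
\begin{aligned}
cost((R_1', R_2')) = {} & gap\_cost \times \#gap \\
& + \sum_{i \in SM} \Gamma(r_1'[i] \to r_2'[i]) + \sum_{i \in SD} \Gamma(r_1'[i] \to \lambda) + \sum_{i \in SI} \Gamma(\lambda \to r_2'[i]) \\
& + \sum_{(i,j) \in PM} \Gamma((r_1'[i], r_1'[j]) \to (r_2'[i], r_2'[j])) + \sum_{(i,j) \in PD} \Gamma((r_1'[i], r_1'[j]) \to \lambda) \\
& + \sum_{(i,j) \in PI} \Gamma(\lambda \to (r_2'[i], r_2'[j]))
\end{aligned}
$$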
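The cost of a given alignment can be evaluated term by term. The sketch below is one way to do this under stated assumptions: rows are strings over A, C, G, U, and '-', base pairs are given in 1-based alignment columns, the input is already a valid structural alignment, and gamma is a hypothetical cost function (indel 1.0 per base, substitution 1.0 for a base and 1.5 for a base pair), not one taken from the text.

```python
from itertools import groupby

GAP = "-"


def count_gaps(row):
    """Number of maximal runs of gap symbols in one aligned row."""
    return sum(1 for symbol, _ in groupby(row) if symbol == GAP)


def gamma(a, b):
    """Hypothetical cost function; None plays the role of the null label."""
    if a == b:
        return 0.0
    if a is None or b is None:
        label = b if a is None else a
        return float(len(label))  # indel: 1.0 per base
    return 1.0 if len(a) == 1 else 1.5  # substitution


def alignment_cost(row1, pairs1, row2, pairs2, gap_cost, cost=gamma):
    """Evaluate the alignment cost: gap term plus the six sums over SM..PI."""
    p1, p2 = set(pairs1), set(pairs2)
    paired = {k for pair in p1 | p2 for k in pair}
    total = gap_cost * (count_gaps(row1) + count_gaps(row2))
    # Unpaired columns: SM (match), SD (deletion), SI (insertion).
    for i, (a, b) in enumerate(zip(row1, row2), start=1):
        if i in paired:
            continue
        if a != GAP and b != GAP:
            total += cost(a, b)
        elif a != GAP:
            total += cost(a, None)
        elif b != GAP:
            total += cost(None, b)
    # Base pairs: PM (match), PD (deletion), PI (insertion).
    for i, j in p1 | p2:
        label1 = row1[i - 1] + row1[j - 1]
        label2 = row2[i - 1] + row2[j - 1]
        if (i, j) in p1 and (i, j) in p2:
            total += cost(label1, label2)
        elif (i, j) in p1:
            total += cost(label1, None)
        else:
            total += cost(None, label2)
    return total
```

As a usage example, aligning GCAAGC with pairs (1, 6), (2, 5) against G-AU-C with pair (1, 6) gives two gaps, one base substitution (A → U), and one base-pair deletion (CG → λ). With gap_cost = 0.5 and the assumed gamma, the cost is 0.5 × 2 + 1.0 + 2.0 = 4.0.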

Given two RNA structures R1 and R2, the edit alignment between them is defined as

$$
A(R_1, R_2) = \min_{(R_1', R_2')} \{\, cost((R_1', R_2')) \,\}.
$$

We will refer to the edit alignment, A(R1, R2), as the RNA structure alignment model based on edit operations. When gap_cost = 0, it is easy to see that D(R1, R2) = A(R1, R2) [259, 448].

4.2.2 RNA Structure Comparison and Alignment Based on