Algorithms for RNA Secondary Structure Comparison

Kaizhong Zhang

4.4 Algorithms for RNA Secondary Structure Comparison

In this section, we consider the problem of computing the edit distance and the extended edit distance between two RNA secondary structures. Because an RNA secondary structure is treelike (Figure 4.6), a number of algorithms compare RNA structures by comparing trees [244, 363, 392, 447, 449]. The same approach works for computing the edit distance and the extended edit distance between two RNA secondary structures, provided that suitable tree (sometimes forest) representations are used.

68 Data Mining in Bioinformatics

Fig. 4.6. RNA secondary structure.

4.4.1 Edit Distance

Recall that a secondary structure is represented by a set S of noncrossing base pairs that form bonds. For (i, j) ∈ S, a position h is accessible from (i, j) if i < h < j and there is no pair (k, l) ∈ S such that i < k < h < l < j. Define (i, j) as the parent of (k, l) ∈ S if k and l are accessible from (i, j), and as the parent of an unpaired base h if h is accessible from (i, j). Note that each base pair (i, j) ∈ S and each unpaired base h has at most one parent, implying a tree (sometimes forest) on the elements of a secondary structure. The definitions of “child,” “sibling,” and “leaf” follow naturally. The order imposed by the 5′ to 3′ direction of an RNA molecule makes the tree an ordered tree (Figure 4.7). In this representation, internal nodes represent base pairs and leaves represent unpaired bases.
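The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own code: the names `Node` and `build_forest`, the 1-based position convention, and the choice of labeling a base-pair node with its two bases are all assumptions made for the example. The input pairs are assumed to be noncrossing with i < j.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # one base (leaf) or two bases (base pair)
    children: list = field(default_factory=list)


def build_forest(seq, pairs):
    """Return the ordered forest for `seq` and noncrossing pairs (1-based, i < j).

    Base pairs become internal nodes, unpaired bases become leaves, and
    children are kept in 5' to 3' order.
    """
    partner = {i: j for i, j in pairs}

    def build(lo, hi):
        nodes, pos = [], lo
        while pos <= hi:
            if pos in partner:                      # opening base of a pair
                j = partner[pos]
                label = seq[pos - 1] + seq[j - 1]
                nodes.append(Node(label, build(pos + 1, j - 1)))
                pos = j + 1                         # skip past the closing base
            else:                                   # unpaired base
                nodes.append(Node(seq[pos - 1]))
                pos += 1
        return nodes

    return build(1, len(seq))


forest = build_forest("GGAAACUC", [(1, 8), (2, 6)])
root = forest[0]
print(root.label, [c.label for c in root.children])
```

If every base is enclosed by one outermost pair, the result is a single tree; otherwise the top level contains several roots, which is the forest case mentioned in the text.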

Following [392, 449], let us consider tree edit operations. Relabeling a node means changing the label of the node. Deleting a node n means making the children of n become the children of the parent of n and then removing n. Inserting is the complement of deleting. Examining each of the edit operations defined on RNA secondary structures, we can see that they are exactly the same as the tree edit operations defined on the tree representation. Conversely, the edit operations defined on this tree representation are meaningful operations on RNA secondary structures. Theoretically there could be operations that do not result in a valid secondary structure (e.g., inserting an unpaired base as an internal node), but we can show that a minimum-cost sequence of tree edit operations that transforms one tree into another never uses operations of this kind. Therefore we can use tree edit algorithms on this tree representation to compare two RNA secondary structures.
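To make the deletion and insertion operations concrete, here is a small sketch on an ordered tree stored as nested objects. The class `Node` and the helpers `delete_child` and `insert_child` are hypothetical names introduced only for this illustration.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def delete_child(parent, k):
    """Delete the k-th child of `parent`; its children take its place in order."""
    removed = parent.children[k]
    parent.children[k:k + 1] = removed.children
    return removed


def insert_child(parent, start, stop, label):
    """Complement of deletion: a new node adopts parent.children[start:stop]."""
    node = Node(label, parent.children[start:stop])
    parent.children[start:stop] = [node]
    return node


root = Node("GC", [Node("A"), Node("GC", [Node("A"), Node("A")]), Node("U")])
delete_child(root, 1)                  # remove the inner base pair
print([c.label for c in root.children])
insert_child(root, 1, 3, "GC")         # put a base pair back around two bases
print([c.label for c in root.children])
```

Deleting the inner base-pair node leaves its unpaired bases in place as children of the enclosing pair, which matches the effect of deleting a base pair from the secondary structure.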

Fig. 4.7. Tree representation of the RNA structure in Figure 4.6.

The ordered tree edit distance algorithm [449] has a time complexity of $O(|T_1||T_2| \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$ and a space complexity of $O(|T_1||T_2|)$, where $|T_i|$ is the size of the tree $T_i$. The depth is really the collapsed depth, where nodes with degree one are ignored when counting the depth.
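For readers who want to see the shape of such an algorithm, the following is a minimal sketch of an ordered tree edit distance computed by dynamic programming over postorder-numbered nodes and key roots, the approach commonly known as the Zhang-Shasha algorithm. It is assumed here, not confirmed by the text, that this is the style of algorithm cited as [449]. Unit costs, the `(label, children)` tuple encoding, and all function names are assumptions of the example.

```python
def index_tree(root):
    """Postorder labels and, per node, the index of its leftmost leaf descendant."""
    labels, left = [], []

    def visit(node):
        leftmost = None
        for child in node[1]:
            child_left = visit(child)
            if leftmost is None:
                leftmost = child_left
        idx = len(labels)
        labels.append(node[0])
        left.append(idx if leftmost is None else leftmost)
        return left[idx]

    visit(root)
    return labels, left


def keyroots(left):
    """For each distinct leftmost leaf, the highest node that has it."""
    last = {}
    for i, l in enumerate(left):
        last[l] = i
    return sorted(last.values())


def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between ordered trees given as (label, children)."""
    lab1, left1 = index_tree(t1)
    lab2, left2 = index_tree(t2)
    td = [[0] * len(lab2) for _ in lab1]          # subtree-to-subtree distances
    for i in keyroots(left1):
        for j in keyroots(left2):
            li, lj = left1[i], left2[j]
            rows, cols = i - li + 2, j - lj + 2
            fd = [[0] * cols for _ in range(rows)]  # forest distances
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + 1
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + 1
            for x in range(1, rows):
                for y in range(1, cols):
                    a, b = li + x - 1, lj + y - 1
                    best = min(fd[x - 1][y] + 1, fd[x][y - 1] + 1)
                    if left1[a] == li and left2[b] == lj:
                        # Both prefixes are whole trees: match a with b.
                        relabel = 0 if lab1[a] == lab2[b] else 1
                        fd[x][y] = min(best, fd[x - 1][y - 1] + relabel)
                        td[a][b] = fd[x][y]
                    else:
                        # Reuse the already computed subtree distance.
                        p, q = left1[a] - li, left2[b] - lj
                        fd[x][y] = min(best, fd[p][q] + td[a][b])
    return td[-1][-1]


hairpin_a = ("GC", [("A", []), ("A", []), ("A", [])])
hairpin_b = ("GC", [("A", []), ("U", [])])
print(tree_edit_distance(hairpin_a, hairpin_b))  # one deletion plus one relabeling
```

The two nested loops over key roots are where the min(depth, leaves) factors in the time bound come from: the number of forest-distance tables a node takes part in is bounded by its collapsed depth.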

Using the tree representation for RNA secondary structures, the size of the tree, denoted by $R_T$, is the total number of unpaired bases plus the total number of base pairs. Because each base pair covers two bases but contributes only one node, $R_T$ is smaller than the length of the corresponding primary structure whenever the structure contains at least one base pair. The collapsed depth here, denoted by $dp$, is really the maximum number of loops on a path from the root to a leaf. Here the loops are bulge loops, internal loops, multibranched loops, and hairpin loops. Taking the loops into account, the resulting algorithm for computing the edit distance between two RNA secondary structures has a time complexity of $O(R_{T_1} R_{T_2}\, dp_1\, dp_2)$.
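The two quantities in this bound can be computed directly, as the sketch below shows. It rests on one reading of the text: the collapsed depth is taken to be the largest number of nodes with more than one child on any root-to-leaf path, so that stacked base pairs (nodes of degree one) are not counted. The function names and the `(label, children)` tuple encoding are assumptions of the example.

```python
def tree_size(n_bases, pairs):
    """Unpaired bases plus base pairs: each pair merges two bases into one node."""
    return n_bases - len(pairs)


def plain_depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    _, children = node
    return 1 + max((plain_depth(c) for c in children), default=0)


def collapsed_depth(node):
    """Like plain_depth, but only nodes with more than one child are counted."""
    _, children = node
    here = 1 if len(children) > 1 else 0
    return here + max((collapsed_depth(c) for c in children), default=0)


# A stem of three stacked G-C pairs closing a hairpin loop of four bases.
leaves = [("A", []) for _ in range(4)]
stem = ("GC", [("GC", [("GC", leaves)])])

print(tree_size(10, [(1, 10), (2, 9), (3, 8)]))
print(plain_depth(stem), collapsed_depth(stem))
```

In this example the plain depth grows with the length of the stem, while the collapsed depth stays at one because the structure contains a single loop.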

4.4.2 Extended Edit Distance

Since there is no tree edit operation corresponding to base-pair bond breaking, we cannot directly use tree edit distance algorithms here. However, with an extended tree representation and a small modification of the tree edit distance algorithms, an algorithm for computing the extended edit distance between RNA secondary structures can easily be developed. This extended tree representation is shown in Figure 4.8. In this representation, each internal node represents the bond between the two bases of a base pair, the leftmost and the rightmost leaf children of an internal node represent the two bases of that base pair, and all the other leaves represent unpaired bases.
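The extended representation described above can be built with a small change to the usual construction, as in the following sketch. The placeholder label `"bond"`, the `(label, children)` tuple encoding, and the function names are assumptions of the example, and the pairs are assumed to be noncrossing with 1-based positions i < j.

```python
def build_extended(seq, pairs):
    """Extended forest: a bond node per base pair, every base as a leaf."""
    partner = dict(pairs)

    def build(lo, hi):
        nodes, pos = [], lo
        while pos <= hi:
            if pos in partner:
                j = partner[pos]
                left_base = (seq[pos - 1], [])      # leftmost leaf of the bond
                right_base = (seq[j - 1], [])       # rightmost leaf of the bond
                inner = build(pos + 1, j - 1)
                nodes.append(("bond", [left_base] + inner + [right_base]))
                pos = j + 1
            else:
                nodes.append((seq[pos - 1], []))    # unpaired base
                pos += 1
        return nodes

    return build(1, len(seq))


def size(forest):
    """Total number of nodes in a forest of (label, children) tuples."""
    return sum(1 + size(children) for _, children in forest)


seq, pairs = "GGAAACUC", [(1, 8), (2, 6)]
extended = build_extended(seq, pairs)
print(size(extended), len(seq) + len(pairs))
```

Every base is now a leaf and every bond adds one internal node, so the node count equals the sequence length plus the number of base pairs.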

Fig. 4.8. Extended tree representation of the RNA structure in Figure 4.6.

With this extended tree representation, unpaired base substitution, insertion, and deletion correspond to tree leaf substitution, insertion, and deletion, and a base-pair bond breaking corresponds to an internal node insertion or deletion. The only difficulty lies in the base-pair substitution operation, since a base pair is now represented by three nodes: an internal node and its leftmost and rightmost leaves. This means that if we want to use the tree edit distance algorithms, some modifications are necessary.

In fact, we need only a very small modification. When applying the tree edit distance algorithms to this extended tree representation, whenever we match an internal node of one tree T1 to an internal node of the other tree T2, we have to make sure that, simultaneously, the leftmost leaf of the internal node in T1 is matched with the leftmost leaf of the internal node in T2 and the rightmost leaf of the internal node in T1 is matched with the rightmost leaf of the internal node in T2.
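The constraint can be stated as a check on a candidate node mapping, as in the sketch below. This only verifies the condition on a given mapping; it is not the modified dynamic program itself. The node identifiers, the dictionary encoding of children, and the function name `respects_pair_constraint` are hypothetical.

```python
def respects_pair_constraint(mapping, children1, children2):
    """Check the matching rule for internal (bond) nodes.

    mapping:   dict from node ids of the first tree to node ids of the second.
    children1: dict from each internal node id of the first tree to its ordered
               child ids; children2 likewise for the second tree.
    """
    for u, v in mapping.items():
        if u in children1 and v in children2:        # internal matched to internal
            if mapping.get(children1[u][0]) != children2[v][0]:
                return False                         # leftmost leaves not matched
            if mapping.get(children1[u][-1]) != children2[v][-1]:
                return False                         # rightmost leaves not matched
    return True


children1 = {"b1": ["g1", "a1", "c1"]}               # bond over G ... C with one A
children2 = {"b2": ["g2", "u2", "c2"]}               # bond over G ... C with one U

good = {"b1": "b2", "g1": "g2", "a1": "u2", "c1": "c2"}
bad = {"b1": "b2", "g1": "g2", "a1": "c2"}
print(respects_pair_constraint(good, children1, children2))
print(respects_pair_constraint(bad, children1, children2))
```

In an actual implementation the same condition would be enforced inside the recurrence, at the point where two internal nodes are matched, rather than checked afterward.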

Using the extended tree representation, the size of the tree is larger than the length of the corresponding primary structure, because every base is a leaf and every base pair adds one internal node. The collapsed depth here is the same as the depth of the tree. Therefore, in practice, the modified algorithm using the extended tree representation in Figure 4.8 runs more slowly than the algorithm using the tree representation in Figure 4.7. Since the tree size is actually larger than the length of the corresponding primary structure, an actual implementation may avoid using the explicit tree representation [205, 423].