Antipole Tree Data Structure for Clustering in Metric Spaces

Cinzia Di Pietro, Alfredo Ferro, Giuseppe Pigola, Alfredo Pulvirenti, Michele Purrello, Marco Ragusa,

3.3 Antipole Tree Data Structure for Clustering in Metric Spaces

Let X be a finite set of objects (for example, biosequences) and let dist : X × X → R be a distance function such that the following four properties hold:

1. dist(x, y) ≥ 0 ∀x, y ∈ X (positiveness)
2. dist(x, y) = dist(y, x) ∀x, y ∈ X (symmetry)
3. dist(x, x) = 0 ∀x ∈ X (reflexivity)
4. dist(x, y) ≤ dist(x, z) + dist(z, y) ∀x, y, z ∈ X (triangularity)
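As a rough illustration, the four properties can be checked by brute force on a small finite set. The sketch below is not part of the original method; the helper names `is_metric` and `hamming` are hypothetical, and the Hamming distance is only an illustrative stand-in for a biosequence distance.

```python
from itertools import product

def is_metric(points, dist, tol=1e-12):
    """Brute-force check of the four properties on a finite set."""
    for x, y in product(points, repeat=2):
        if dist(x, y) < 0:                          # positiveness
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:      # symmetry
            return False
    if any(dist(x, x) != 0 for x in points):        # reflexivity
        return False
    for x, y, z in product(points, repeat=3):       # triangularity
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            return False
    return True

def hamming(a, b):
    """Hamming distance between equal-length strings (illustrative choice)."""
    return sum(ca != cb for ca, cb in zip(a, b))

seqs = ["ACGT", "ACGA", "TCGA", "TTTT"]
ok = is_metric(seqs, hamming)
# Squared difference breaks the triangle inequality: 9 > 1 + 4.
bad = is_metric([0, 1, 3], lambda a, b: (a - b) ** 2)
```

The check costs cubic time in the number of points, so it is only meant for sanity tests on small samples.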

Clustering X with a bounded diameter σ is the problem of partitioning X into few nonempty subsets (i.e., the clusters) of diameter less than σ. A centroid, or 1-median, of a cluster Clu is the element C of Clu that minimizes the sum

∑_{y ∈ Clu} dist(C, y).

The radius of a cluster Clu is the distance between the centroid C and the farthest object from C in that cluster. Assume we fix a cluster diameter σ such that sequences whose pairwise distance is greater than σ are considered to be significantly different by some application-dependent criterion. The antipole clustering of bounded diameter σ [64] is performed by a top-down procedure that starts from a given finite set of points S (biosequences in our case). A splitting procedure (Figure 3.1a) assigns each point of the subset being split to the closer endpoint of a “pseudo-diameter” segment called the antipole.¹
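The definitions of centroid (1-median), radius, and diameter can be made concrete with a small exact computation. This is a minimal sketch with hypothetical function names, using the absolute difference on integers as an illustrative distance; it computes the exact 1-median, not the approximate one used later in the chapter.

```python
def one_median(cluster, dist):
    """Element of the cluster minimizing the sum of distances to all members."""
    return min(cluster, key=lambda c: sum(dist(c, y) for y in cluster))

def radius(cluster, dist):
    """Distance from the centroid to the farthest member of the cluster."""
    c = one_median(cluster, dist)
    return max(dist(c, y) for y in cluster)

def diameter(cluster, dist):
    """Largest pairwise distance inside the cluster."""
    return max(dist(x, y) for x in cluster for y in cluster)

d = lambda a, b: abs(a - b)
clu = [1, 2, 3, 10]
# Distance sums are 12, 10, 10, 24, so the first minimizer, 2, is returned.
centroid = one_median(clu, d)
```

For this toy cluster the radius is 8 and the diameter is 9, so it would qualify as a cluster only for σ greater than 9.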

An antipole pair of elements is generated as the final winner of a set of randomized tournaments (Figure 3.2).

The initial tournament is formed by randomly partitioning the set S into t-tuples (subsets of cardinality t), locating the 1-median of each subset, and then discarding it (Figure 3.2a). The winning sets (all points except the 1-median) go to the next stage of the tournament. If any of the tournaments has cardinality that is not a multiple of t, then one of the games will have at most 2t−1 players.
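The tournament described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' implementation: the function names are hypothetical, the stopping threshold t² + 1 is taken from the caption of Figure 3.2, the final pair is found by exhaustive search on the small surviving set, and the input is assumed to hold at least two points with t ≥ 2.

```python
import random
from itertools import combinations

def one_median(group, dist):
    return min(group, key=lambda c: sum(dist(c, y) for y in group))

def local_winners(group, dist):
    """Winners of one game: every player except the 1-median of the group."""
    winners = list(group)
    winners.remove(one_median(group, dist))
    return winners

def farthest_pair(points, dist):
    """Exact farthest pair; only applied to the small final set."""
    return max(combinations(points, 2), key=lambda p: dist(*p))

def approx_antipole(points, dist, t=3, rng=None):
    rng = rng or random.Random(0)        # fixed seed: an assumption for repeatability
    s = list(points)
    threshold = t * t + 1
    while len(s) > threshold:
        rng.shuffle(s)                   # random partition into games
        winners = []
        while len(s) >= 2 * t:
            group, s = s[:t], s[t:]
            winners.extend(local_winners(group, dist))
        # Last game of the round: between t and 2t-1 players remain.
        winners.extend(local_winners(s, dist))
        s = winners
    return farthest_pair(s, dist)

pair = approx_antipole(range(100), lambda a, b: abs(a - b))
```

On distinct points of a line, an extreme point is never the 1-median of a game with three or more players, so both extremes survive every round and the sketch returns 0 and 99 here. In a general metric space the returned pair is only an approximation of the farthest pair.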

If the final winner pair has a distance (pairwise alignment) that is lower than σ, then splitting is not performed and the subset is one of the clusters (Figure 3.1b). Otherwise the cluster is split and the algorithm proceeds recursively on each of the two generated subsets. At the end of the procedure, the centroid of each cluster [63] is computed by an algorithm similar to the one used to find the antipole pair (Figure 3.2b). In this case the winner of each game is the object that minimizes the sum of the distances from the other t−1 elements.
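The centroid tournament differs from the antipole tournament only in who wins each game: here the 1-median of the game advances and the other players are dropped. The following sketch is an assumption-laden illustration (hypothetical names, threshold t² + 1 as in the caption of Figure 3.2, t ≥ 2, nonempty input), not the published code.

```python
import random

def one_median(group, dist):
    return min(group, key=lambda c: sum(dist(c, y) for y in group))

def approx_one_median(points, dist, t=3, rng=None):
    rng = rng or random.Random(0)        # fixed seed: an assumption for repeatability
    s = list(points)
    threshold = t * t + 1
    while len(s) > threshold:
        rng.shuffle(s)
        winners = []
        while len(s) >= 2 * t:
            group, s = s[:t], s[t:]
            winners.append(one_median(group, dist))   # one winner per game
        winners.append(one_median(s, dist))           # last game: t..2t-1 players
        s = winners
    return one_median(s, dist)           # exact 1-median of the small final set

d = lambda a, b: abs(a - b)
small = approx_one_median([1, 2, 3, 10], d)   # at most t*t + 1 points: exact answer
large = approx_one_median(range(101), d)
```

Each round shrinks the set by roughly a factor of t, so only small groups are ever compared exhaustively.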

This procedure gives rise to a data structure called antipole tree.

¹ The pseudo-diameter in such a case is a pair of biosequences, the endpoints, that are different enough.

BUILD TREE(S, σ)
    Q ← APPROX ANTIPOLE(S, σ);
    if Q = ∅ then    // splitting condition fails
        T.Leaf ← TRUE;
        T.C ← MAKE CLUSTER(S);
        return T;
    end if;
    {A, B} = Q;    // A, B are the antipole sequences
    T.A ← A;
    T.B ← B;
    SA ← {O ∈ S | dist(O, A) < dist(O, B)};
    SB ← {O ∈ S | dist(O, B) ≤ dist(O, A)};
    T.left ← BUILD TREE(SA, σ);
    T.right ← BUILD TREE(SB, σ);
    return T;
end BUILD TREE.

(a)

MAKE CLUSTER(S)
    C.Centroid ← APPROX 1 MEDIAN(S);
    C.Radius ← max_{x ∈ S} dist(x, C.Centroid);
    C.CList ← S \ {C.Centroid};
    return C;
end MAKE CLUSTER.

(b)

Fig. 3.1. (a) Antipole algorithm. (b) MakeCluster algorithm.
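The two procedures of Figure 3.1 translate fairly directly into Python. The sketch below makes several simplifying assumptions that are not in the source: the `Node` class and all function names are hypothetical, the exact farthest pair and the exact 1-median stand in for APPROX ANTIPOLE and APPROX 1 MEDIAN, the input set is nonempty, and σ is assumed to be strictly positive so that each split produces two nonempty subsets.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Any, Optional

@dataclass
class Node:
    leaf: bool = False
    centroid: Any = None
    radius: float = 0.0
    members: list = field(default_factory=list)  # CList: cluster minus centroid
    a: Any = None                                # antipole endpoint A
    b: Any = None                                # antipole endpoint B
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def find_antipole(s, dist, sigma):
    """Stand-in for APPROX ANTIPOLE: exact farthest pair, None if closer than sigma."""
    if len(s) < 2:
        return None
    a, b = max(combinations(s, 2), key=lambda p: dist(*p))
    return (a, b) if dist(a, b) >= sigma else None

def make_cluster(s, dist):
    c = min(s, key=lambda x: sum(dist(x, y) for y in s))  # exact 1-median
    rest = list(s)
    rest.remove(c)
    return Node(leaf=True, centroid=c,
                radius=max(dist(x, c) for x in s), members=rest)

def build_tree(s, dist, sigma):
    q = find_antipole(s, dist, sigma)
    if q is None:                       # splitting condition fails
        return make_cluster(s, dist)
    a, b = q
    sa = [o for o in s if dist(o, a) < dist(o, b)]
    sb = [o for o in s if dist(o, b) <= dist(o, a)]
    return Node(a=a, b=b,
                left=build_tree(sa, dist, sigma),
                right=build_tree(sb, dist, sigma))

def clusters(t):
    """Collect the leaf clusters, centroid first."""
    if t.leaf:
        return [[t.centroid] + t.members]
    return clusters(t.left) + clusters(t.right)

tree = build_tree([0, 1, 2, 10, 11, 12, 50], lambda x, y: abs(x - y), sigma=5)
found = clusters(tree)
```

With σ = 5 the toy input splits into the clusters {0, 1, 2}, {10, 11, 12}, and {50}, each with diameter below σ.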

3.4 AntiClustAl: Multiple Sequence Alignment via Antipoles

In this section we show that replacing the phylogenetic tree with the antipole tree gives the ClustalW approach a substantial speed improvement, with alignment quality that is as good or better. (The quality of our approach derives from the fact that multiple alignment of a set of sequences works better when the diameter of the set of sequences is small.) Our basic algorithm is:

1. Build the antipole tree as described in section 3.3.

2. Align the sequences progressively from the leaves up, inspired by ClustalW.

Starting at the leaves, the second step aligns all the sequences of the corresponding cluster using the profile alignment technique. Figures 3.3, 3.4, and 3.5 contain the pseudocode of the multiple sequence alignment via antipole. The recursive function AntiClustal visits the antipole tree from the leaves up (following ClustalW’s strategy of visiting the phylogenetic tree from the leaves up). It aligns all the sequences stored in the leaves (the clusters) by calling the function AlignCluster. Next, two aligned clusters are merged by the function MergeAlignment. The three mentioned procedures make use of the functions AlignSequences, AlignProfileVsSequence, and OptimalAlignment,

AntiClustAl: Multiple Sequence Alignment by Antipole Clustering 49

The approximate antipole selection algorithm

LOCAL WINNER(T)
[...]
    // (for the remaining elements of S)
    end while
    return FIND ANTIPOLE(S);
end APPROX ANTIPOLE

[...]
    // (for the remaining elements of S)
    end while;
    return 1-MEDIAN(S);
end APPROX 1 MEDIAN

Fig. 3.2. Pseudocode of the approximate algorithms: (a) Approximate antipole search. (b) 1-median computation. The variable threshold is usually taken to be t² + 1. Indeed, this is the lowest value for which it is possible to partition the set S into subsets of size between t and 2t−1. In the AntiClustAl implementation, the subset size t is set to three, which guarantees good performance. However, it can be shown experimentally that the optimal choice of t is equal to the dimension of the underlying metric space plus one.

which, respectively, align two profiles, a profile versus a sequence and two sequences according to the Miller and Myers algorithm [292]. Finally the function GetProfile returns the profile of a multiple sequence alignment.
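To give a feel for what a pairwise alignment routine such as OptimalAlignment returns, here is a small global alignment sketch. It deliberately swaps in the textbook Needleman-Wunsch dynamic program with quadratic memory, not the Miller and Myers algorithm cited above, and the scoring values (match 1, mismatch −1, gap −1) and the function name are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of two strings; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ra, rb = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return "".join(reversed(ra)), "".join(reversed(rb)), score[n][m]

aligned = needleman_wunsch("ACGT", "AGT")
```

The two returned strings have equal length and reduce to the inputs when the gap symbol is removed, which is the property the profile-based steps rely on.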

Figure 3.6 shows an example of the proposed method.

AntiClustal(Tree T)
    if (!isLeaf(T))    /* if T is not a leaf */
        AntiClustal(T.left);
        AntiClustal(T.right);
        T ← MergeAlignment(T.left, T.right);
        T.leaf ← TRUE;
        return (T);
    else
        AlignCluster(T);
        return (T);
    end if;
end AntiClustal

Fig. 3.3. Multiple sequence alignment via the antipole tree.
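The control flow of Figure 3.3 is a post-order traversal: both subtrees are aligned before their results are merged. The sketch below shows only that traversal. The dictionary-based tree layout is hypothetical, and the `pad` helper, which right-pads rows with gaps, is a toy placeholder for AlignCluster and MergeAlignment, not a real alignment.

```python
def anticlustal(tree, align_cluster, merge_alignment):
    """Post-order visit: align each leaf cluster, then merge the children."""
    if "cluster" in tree:                               # leaf node
        return align_cluster(tree["cluster"])
    left = anticlustal(tree["left"], align_cluster, merge_alignment)
    right = anticlustal(tree["right"], align_cluster, merge_alignment)
    return merge_alignment(left, right)

def pad(rows):
    """Toy placeholder: right-pad rows with '-' to a common width."""
    width = max(len(r) for r in rows)
    return [r.ljust(width, "-") for r in rows]

tree = {"left": {"cluster": ["AC", "ACG"]}, "right": {"cluster": ["T"]}}
result = anticlustal(tree, pad, lambda a, b: pad(a + b))
```

Passing the two alignment steps as parameters keeps the traversal independent of how clusters are actually aligned, so real profile-based routines could be plugged in without changing it.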

MergeAlignment(Tree A, Tree B)
    P₁ ← GetProfile(A);
    P₂ ← GetProfile(B);
    C ← AlignSequences(A, P₁, B, P₂);
    return C;
end MergeAlignment

Fig. 3.4. How to align two clusters.
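The chapter does not spell out how a profile is represented. One common choice, assumed here purely for illustration, is a list with one entry per alignment column giving the relative frequency of each symbol (gaps included). The function name `get_profile` mirrors GetProfile but the body is a hypothetical sketch.

```python
from collections import Counter

def get_profile(alignment):
    """Column-wise relative symbol frequencies of an alignment (equal-length rows)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    n = len(alignment)
    return [{sym: cnt / n for sym, cnt in Counter(col).items()}
            for col in zip(*alignment)]

profile = get_profile(["AC-", "ACG", "TCG"])
```

A column in which all sequences agree yields a single symbol with frequency 1.0, while variable columns spread the weight over several symbols.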

AlignCluster(Tree A)
    if (|A.cluster| = 1)
        return A;
    else
        C ← OptimalAlignment(A₀, A₁);
        if (|A| ≥ 3)
            for each Ai ∈ A.cluster do
                P ← GetProfile(C);
                C ← AlignProfileVsSequence(Ai, P, C);
            end for
        end if
        return C;
    end if;
end AlignCluster

Fig. 3.5. How to align the sequences in a cluster.


Fig. 3.6. Example of multiple sequence alignment via the antipole tree using alpha-globin sequences. A portion of the aligned sequences is shown at the bottom. The symbols under the aligned sequences show the matches in concordance with the given definition.

From the document Advanced Information and Knowledge Processing (Pages 54-58)