

1.3 Phylogenomics

1.3.2 How does phylogenomics work?

Generally, two possibilities have been explored for inferring phylogenies using phylogenomics [Delsuc et al. 2005] (Figure 6-ch.1):

Figure 6-ch.1. Methods of phylogenomics inference. From [Delsuc et al. 2005].

Sequence-based methods: As in any phylogenetic reconstruction, a crucial first step must be carefully performed here as well, in order to ensure that a “vertical transmission” of the characters is respected: the homology assessment. This is not an easy task because genes of interest for inferring evolutionary history are often poorly sampled, so deciding between orthology and paralogy is not always possible (it can be impossible to differentiate between independent gene losses and unsampled genes in some species, for example). Similarly, horizontally (or laterally) transferred genes (HGT) can be difficult to pinpoint with a

alignments is usually datasets of at least a thousand ESTs, finding a sufficient number of genes where orthology can be deduced is nevertheless doable. On the other hand, precisely because phylogenomics deals with huge amounts of data, it is fair to assume that even if a little undetected paralogy or HGT remains after careful checks, it is unlikely to dominate the genuine phylogenetic signal [Lake and Rivera 2004].

The method of choice for creating a set of orthologous genes is the inference of phylogenetic trees for every single-gene alignment, which requires the individual genes to be aligned and unambiguous positions to be selected. This is much more precise than a simpler and faster selection of species/sequences based solely on BLAST results [Altschul et al. 1990], which is known to be unreliable. Indeed, the BLAST algorithm does not take evolutionary information into account, so genes that appear most similar based on BLAST hits are often not each other's closest relatives in terms of phylogeny, leading to false-positive insertions of species [Koski and Golding 2001].

After this first and important step, two options are conceivable: the supermatrix or the supertree approach [Delsuc et al. 2005]. The supermatrix approach corresponds to concatenating all selected single genes into one super-alignment and submitting it to classical phylogenetic reconstruction methods (the most reliable at present being the probabilistic methods, Maximum Likelihood [Felsenstein 1981] and Bayesian [Huelsenbeck et al. 2001]). The strategy that is generally applied is to consider each concatenated sequence as one “gene” and to ignore the evolutionary specificities of each individual gene [Philippe et al. 2004; Philippe et al. 2005; Rodriguez-Ezpeleta et al. 2005; Rokas et al. 2005; Wiens 2005; Delsuc et al. 2006; Patron et al. 2007; Rodriguez-Ezpeleta et al. 2007a]. It presumes that the different discordant histories, if any, contained in each gene will be averaged away by the combined analysis of numerous characters. Another strategy, tested to a lesser extent, is to allow a different set of parameters for each gene in order to describe different tempos and modes of evolution more adequately [Bapteste et al. 2002; Philippe et al. 2004; Patron et al. 2007], but results were generally not significantly different from the “cruder” approach above, which calls its real utility into question.
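The concatenation step can be illustrated with a minimal Python sketch. It assumes each single-gene alignment is held as a dictionary mapping a species name to its aligned sequence; the function name `concatenate`, the toy genes and the `?` padding symbol are illustrative choices, not taken from any of the cited studies.

```python
def concatenate(alignments, missing="?"):
    """Concatenate single-gene alignments into one supermatrix.

    alignments: list of dicts mapping species name -> aligned sequence
    (all sequences within one dict must share the same length).
    A species absent from a gene is padded with the `missing` symbol.
    """
    # Union of all species seen in at least one gene.
    species = sorted({name for aln in alignments for name in aln})
    parts = {name: [] for name in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must share a length")
        width = lengths.pop()
        for name in species:
            # Use the real sequence if sampled, otherwise a block of missing data.
            parts[name].append(aln.get(name, missing * width))
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Toy example (hypothetical data): sp2 is not sampled for gene_b.
gene_a = {"sp1": "MKV", "sp2": "MRV", "sp3": "MKI"}
gene_b = {"sp1": "LLGA", "sp3": "LIGA"}
matrix = concatenate([gene_a, gene_b])
for name, seq in matrix.items():
    print(name, seq)
```

In this sketch every species ends up with a sequence of the same total length, which is what classical reconstruction programs expect; the per-gene boundaries are simply lost, mirroring the "one gene" strategy described above.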

Because every single gene that makes up the concatenation is in principle subject to its own selection of species, depending on sequence availability, missing entries are the rule, and they are generally not equally distributed (some species have many missing positions, others have nearly none). Potentially, they could drastically lower the resolution of a tree or induce artifacts due to model violations. Missing data occur even when complete genomes are available because genes can be independently lost, duplicated, or horizontally transferred. This feature of phylogenomic alignments could be a serious drawback, making this discipline a nice approach in theory but practically impossible to perform. Fortunately, empirical and simulation studies have shown that the percentage of missing data can actually be high, up to 90%, and yet the overall signal remains [Wiens 2003; Driskell et al. 2004; Philippe et al. 2004]. This is especially true in a phylogenomic context, as the number of sites present in a large concatenated alignment remains high even for species with a lot of missing data. Furthermore, it seems that adding even incomplete taxa is beneficial and improves phylogenetic accuracy by breaking long branches [Wiens 2005]. However, this concern has in my opinion not been investigated thoroughly enough, and precise issues such as the influence of the distribution of missing data (evolutionarily close or not to species with no missing data) still need to be specifically discussed. Otherwise there is a risk that the supposedly weak influence of missing data is taken at face value, so that many highly incomplete species would not be treated with the necessary caution.
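To make the uneven distribution of missing data concrete, the following Python sketch computes the fraction of missing positions per species in a concatenated alignment. The function name `missing_fraction`, the toy sequences and the 0.5 flagging threshold are hypothetical; counting gap characters (`-`) together with `?` as missing is also an assumption, since gaps may represent genuine indels.

```python
def missing_fraction(supermatrix, missing_symbols="?-"):
    """Return, per species, the fraction of alignment positions that are missing."""
    fractions = {}
    for name, seq in supermatrix.items():
        n_missing = sum(seq.count(symbol) for symbol in missing_symbols)
        fractions[name] = n_missing / len(seq) if seq else 0.0
    return fractions


# Toy supermatrix (hypothetical data).
toy = {"sp1": "MKVLLGAQ", "sp2": "MRV?????", "sp3": "MK-LIGAQ"}
report = missing_fraction(toy)

# Hypothetical threshold used only to flag species that deserve a closer look.
flagged = sorted(name for name, frac in report.items() if frac > 0.5)
print(report)
print(flagged)
```

A per-species report of this kind does not say where the missing data fall in the tree, which is the open issue raised above; it only identifies the species for which extra caution may be warranted.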

The second sequence-based approach is the supertree, which differs from the supermatrix in that it combines the trees generated individually from the single genes, and not the single genes themselves. In practice this method has barely been employed [Philip et al. 2005; Fitzpatrick et al. 2006], and comparative efficiency studies of supermatrix and supertree, especially in a phylogenomic context, are needed to shed light on the benefits and disadvantages of both. Until then, phylogenomics will likely be based almost entirely on the supermatrix approach, owing to the much greater experience accumulated with it.
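The text does not name a particular supertree algorithm. As one possible illustration of how trees, rather than sequences, can be combined, the Python sketch below applies matrix representation coding: each clade of each source tree becomes a binary character (1 inside the clade, 0 outside, `?` when the species is absent from that tree). The resulting matrix would then be passed to a separate tree-search program, which is not shown. Representing rooted trees as nested tuples and the names `leaves`, `internal_clades` and `mrp_matrix` are assumptions made for this example.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def internal_clades(tree, is_root=True):
    """List the leaf sets of all internal nodes except the root."""
    if isinstance(tree, str):
        return []
    found = [] if is_root else [leaves(tree)]
    for child in tree:
        found.extend(internal_clades(child, is_root=False))
    return found


def mrp_matrix(trees):
    """Encode every clade of every source tree as one binary character."""
    all_taxa = sorted(set().union(*(leaves(tree) for tree in trees)))
    matrix = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        sampled = leaves(tree)
        for clade in internal_clades(tree):
            for taxon in all_taxa:
                if taxon not in sampled:
                    matrix[taxon] += "?"  # species absent from this gene tree
                elif taxon in clade:
                    matrix[taxon] += "1"
                else:
                    matrix[taxon] += "0"
    return matrix


# Two toy gene trees (hypothetical) with partially overlapping species sets.
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "B"), "E")
codes = mrp_matrix([tree_1, tree_2])
for taxon, row in codes.items():
    print(taxon, row)
```

The design point this illustrates is that a supertree method only needs the source trees to overlap in their species, not to share identical sampling.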

Whole-genome feature methods: These relatively new methods do not directly rely on multiple-sequence alignment and generally cannot be applied to incomplete genomic sampling. They hold great promise for the near future as very valuable, independent and complementary ways of testing phylogenetic trees, once complete genomes are available for a larger diversity of organisms and improvements have been made in their implementation. Precisely because they are based on entire genomes, one can assume that this kind of data truly reflects organismal evolution, or at least approximates it better than single- or even multiple-gene phylogenies. Moreover, the events under investigation here, such as alterations of the gene order or gene content in a species, are supposed to be extremely rare [Rokas and Holland 2000] and thus not sensitive to homoplasy.

Looking at the gene order or gene content amounts to considering each chromosome as a linear (or circular, for example in the case of mitochondrial or chloroplast genomes) ordering of genes [Moret and Warnow 2005], from which evolutionary relationships are putatively inferred. These methods use the number of shared orthologous genes between genomes as a similarity measure [Korbel et al. 2002]. In its most simplistic application the

divided by their total number of genes, so that evolutionary distances are interpreted in terms of events such as the acquisition and loss of genes [Snel et al. 1999]. More sophisticated approaches have also been developed, such as parsimony algorithms [Fitz-Gibbon and House 1999; House and Fitz-Gibbon 2002] or statistical frameworks [Gu and Zhang 2004; Larget et al. 2005a; Larget et al. 2005b], but they have generally shown a weak capacity to resolve phylogenies because of their inability to handle saturation correctly.
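A gene-content distance of the simple kind described above can be sketched in a few lines of Python. The exact normalisation is cut off in the text, so the sketch assumes, purely for illustration, that the number of shared genes is divided by the size of the union of both gene sets; this is not necessarily the normalisation used by Snel et al. The function name `gene_content_distance` and the toy genomes are hypothetical, and orthology assignment is taken as already done.

```python
from itertools import combinations


def gene_content_distance(genes_a, genes_b):
    """Distance = 1 - (shared genes / genes present in either genome).

    ASSUMPTION: the shared/union normalisation is an illustrative choice.
    Each genome is a set of orthologous-group identifiers.
    """
    shared = len(genes_a & genes_b)
    total = len(genes_a | genes_b)
    if total == 0:
        raise ValueError("both genomes are empty")
    return 1.0 - shared / total


# Toy genomes (hypothetical orthologous-group identifiers).
genomes = {
    "sp1": {"g1", "g2", "g3", "g4"},
    "sp2": {"g1", "g2", "g3", "g5"},
    "sp3": {"g1", "g6", "g7", "g8"},
}

# Pairwise distance table, usable as input to a distance-based tree method.
distances = {
    (a, b): gene_content_distance(genomes[a], genomes[b])
    for a, b in combinations(sorted(genomes), 2)
}
for pair, value in distances.items():
    print(pair, round(value, 3))
```

Because the distance is bounded by 1, very divergent genomes all end up near the maximum, which gives an intuition for the saturation problem mentioned above.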

Another method is the distribution-of-sequence-strings approach, which transforms into distances the observed frequencies of short oligonucleotides or oligopeptides in comparison to a theoretical, completely random usage of these strings [Deschavanne et al. 1999; Edwards et al. 2002; Pride et al. 2003; Qi et al. 2004]. This method has not been extensively tested, and so far no model of evolution is available to explain the observations. It also seems to be particularly prone to saturation of the signal, and thus of limited use for deep evolutionary questions [Pride et al. 2003]. Nevertheless, it is worth investigating further, notably because it is one of the few methods free of the orthology prerequisite.
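The general idea of a string-frequency distance can be sketched as follows. This is a simplified illustration, not the procedure of any of the cited papers: it assumes that "completely random usage" means every k-mer is equally likely, and it uses a Euclidean distance between the observed/expected profiles. The names `kmer_profile` and `profile_distance` and the default k = 2 are hypothetical choices.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Observed/expected frequency ratio for every possible k-mer.

    ASSUMPTION: the expected frequency is uniform (1 / number of k-mers).
    K-mers containing characters outside `alphabet` are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mer")
    expected = 1 / len(kmers)
    return [(counts[kmer] / total) / expected for kmer in kmers]


def profile_distance(seq_a, seq_b, k=2):
    """Euclidean distance between two k-mer profiles (illustrative choice)."""
    profile_a = kmer_profile(seq_a, k)
    profile_b = kmer_profile(seq_b, k)
    return sqrt(sum((x - y) ** 2 for x, y in zip(profile_a, profile_b)))


# Toy sequences (hypothetical); no alignment or orthology assignment is needed.
seq_1 = "ACGTACGTTTGACA"
seq_2 = "ACGTTCGATTGGCA"
print(profile_distance(seq_1, seq_2))
```

The sketch needs only raw sequence, which reflects the main attraction noted above: no alignment and no orthology assessment are required.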