• Aucun résultat trouvé

Assessing the effects of compositional heterogeneity on phylogenomic analyses

N/A
N/A
Protected

Academic year: 2021

Partager "Assessing the effects of compositional heterogeneity on phylogenomic analyses"

Copied!
17
0
0

Texte intégral

(1)

Assessing the effects of compositional

heterogeneity on phylogenomic analyses

Denis Baurain

1,2,3

, Robert G. Beiko

2

, and Mark A. Ragan

2

1

Université de Liège / FNRS, Belgium

2

University of Queensland / ARC Centre in Bioinformatics, Australia

3

Université de Montréal

(2)

Background

144 bacterial genomes 422,971 proteins 22,432 alignments 220,240 proteins (52.1%) 22,432 unrooted trees 95,194 bipartitions (PP ≥ 0.95) 1 supertree

clustering & alignment Bayesian inference matrix representation with parsimony

inheritance pattern largely vertical with ‘highways’ of gene sharing

A C D B 998 concord. A B D C 1 discord. A B C D 1 discord. A B C D 2 LGT events

identification of exchange partners

A B C D 998/2 supertree topology c/d c/d detection of LGT events

(3)

tests for

systematic biases

clustering strategy (e.g. cluster size)

alignment quality (e.g. sequence length variation)

phylogenetic inference (e.g. alignment size)

conflicting signals

in the data

history

composition (e.g. GC bias)

rate (e.g. LBA)

other signals

« Any signal having experienced

convergence in nonsister

lineages

will affect recovery of the historical signal. »

LGT or methodological issues?

A B C D LGT event? A B C D signal convergence? A C D B 998 concord. A B D C 1 discord.

(4)

Phylogenetic assumptions

R

k

=

!

"

j

#

Aj j$ A

%

"

C

#

AC

"

G

#

AG

"

T

#

AT

"

A

#

AC

!

"

j

#

Cj j$C

%

"

G

#

CG

"

T

#

CT

"

A

#

AG

"

C

#

CG

!

"

j

#

Gj j$G

%

"

T

#

GT

"

A

#

AT

"

C

#

CT

"

G

#

GT

!

"

j

#

Tj j$T

%

&

'

(

(

(

(

(

(

(

(

(

)

*

+

+

+

+

+

+

+

+

+

A

B

C

D

R

1

R

2

R

5

R

3

R

4

R

6

R

7

adapted from Lars S. Jermiin, University of Sydney; Lio and Goldman (1998) Genome Res 8:1233-1244

sites evolve

independently

and

identically using a Markov process

given by R (e.g. GTR model)

for practical reasons, we assume

that the sites have evolved under

reversible

,

homogeneous

and

stationary

conditions

note – new models allow to safely

violate some of these assumptions

πA, …, πT – nucleotide frequencies αkij – conditional rates of change

αkij = αkji – reversibility

R1 = R2 = … = R6 = R7 – homogeneity πkj = f0j – stationarity

(5)

matched-pair tests of homogeneity

Bowker’s (1948) test for

symmetry

Stuart’s (1955) test for

marginal homogeneity

‘traditional’ homogeneity tests

compositional χ

2

Statistical tests for stationarity

Tavaré (1986) Lect Math Life Sci 17:57-86; Jermiin et al. (2004) Syst Biol 53:638-643

40 9 11 9 11 Σ 7 5 0 1 1 T 10 1 7 0 2 G 12 2 1 8 1 C 11 1 3 0 7 A Σ T G C A

S

1

ATGGTACAATGCGGCATGTACTCGCGATATCGACGATACG

S

2

ATCGAACGATGTGGCGTACACTCACGTTACCGACACGACG

(6)

Correlations among all three tests

Bowker’s χ2 Stuart’s χ2 compositional χ2 0.23% 27.53% 2.47% [AA, n = 2815041, p = 0.05] failure statistics Bowker’s χ2 Stuart’s χ2 compositional χ2 B o w ke r’s χ 2 S tu a rt’s χ 2 co m p o sit io n a l χ 2 0.752 0.712 0.946 – – – 0.752 0.712 0.946 Pearson’s correlation

(7)

Ranking of the worst players

… … … … … … 16.93 … 11.68 … 41.03 19.35 27.10 24.43 21.61 24.88 fail. (%) … … … … … Prochlorococcus SS120 Gloeobacter violaceus 774 131 29 Prochlorococcus SS120 Synechococcus WH8102 1301 152 12 … … … … … Pseudomonas PA01 Wigglesworthia brevipalpis 390 160 6 Prochlorococcus MED4 Synechocystis PCC6803 868 168 5 Prochlorococcus MED4 Gloeobacter violaceus 749 203 4 Prochlorococcus MED4 Thermosynechococcus BP-1 835 204 3 Prochlorococcus MED4 Prochlorococcus MIT9313 1245 269 2 Prochlorococcus MED4 Synechococcus WH8102 1274 317 1 organism b organism a pairs (#) fail. (#) rank

(8)

1882 genes 2274 genes

1716 genes 2525 genes very low light

low light

high light upper 25 m

both Prochlorococcus and Synechococcus are part of

the

picophytoplankton

(tiny organisms: cell Ø < 1 µm)

they account for 1/3 of Earth’s primary biomass production

all known members of this group are

96% similar at the

rRNA level

(but have quite different gene contents)

Synechococcus found 25 ya / Prochlorococcus found 15 ya

we had 4 genomes in our 144-species dataset

Who are the picocyanobacteria?

Prochlorococcus MIT9313 Synechococcus WH8102 Prochlorococcus SS120 Prochlorococcus MED4 713/136 597/198 1012/7 50.7% GC 36.4% GC 30.8% GC 59.4% GC

(9)

photosynthetic apparatus

Hess (2004) Curr Opin Biotech 15:191-198

Is Prochlorococcus monophyletic?

phylogenetic profiling

16S rRNA R oc ap e t al . (2 00 3) N at ur e 4 24 :1 04 2-10 47

(10)

7 rooted trees…

Synechococcus WH8102 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus SS120 112 Prochlorococcus SS120 Prochlorococcus MED4 Synechococcus WH8102 Prochlorococcus MIT9313 75 106 Prochlorococcus MIT9313 Synechococcus WH8102 Prochlorococcus SS120 Prochlorococcus MED4 supertree & RY-recoded 16S rRNA 16S rRNA N-BLASTP

alignments of [size ≥ 6] for which the monophyly of the four picocyanobacteria is highly supported (n = 819) only fully resolved topologies are considered (n = 310) blue stars (

*

) denote really discordant topologies

Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus SS120 Synechococcus WH8102 2 Synechococcus WH8102 Prochlorococcus MED4 Prochlorococcus MIT9313 Prochlorococcus SS120 6 Prochlorococcus MED4 Synechococcus WH8102 Prochlorococcus MIT9313 Prochlorococcus SS120 2 Prochlorococcus MIT9313 Prochlorococcus SS120 Prochlorococcus MED4 Synechococcus WH8102 7

*

*

(11)

… fold to 2 unrooted topologies

Prochlorococcus MIT9313 Prochlorococcus SS120 Prochlorococcus MED4 Synechococcus WH8102 804 concordant Prochlorococcus MIT9313 Prochlorococcus MED4 Synechococcus WH8102 Prochlorococcus SS120 23 discordant Prochlorococcus MIT9313 Prochlorococcus MED4 Synechococcus WH8102 Prochlorococcus SS120 0 discordant ? ? ? ? 296 unresolved same dataset + all alignments of [4 ≤ size ≤ 5] that include the four picocyanobacteria (n = 304) all topologies are considered (n = 1123)

having binned all alignments leading to one given topology, we look for features of the compositional signal (compositional bias) that would be characteristic of that topology

(12)

Signals that work together

only bins of [size ≥ 6] are considered bars denote standard error

(13)

FG-recoded sequences

GC bias at the protein level?

raw sequences

(14)

GC bias at the protein level? (2)

FG-recoded sequences

(15)

The case for a closer look

‘supergene’ made of 1,485 genes found in at least two (out of four) organisms (469,682 positions; 15% missing)

MIT9313 WH8102 SS120 MED4 50.7% GC 36.4% GC 30.8% GC 59.4% GC GC-content rankings stars (

*

) denotes

(16)

Conclusions

the compositional heterogeneity definitely

has an

impact

on phylogenomic analyses

here, the compositional signal likely exaggerates the

historical signal (i.e. non-monophyly of Prochlorococcus)

in other circumstances, it could be the opposite

while the compositional heterogeneity is obvious at

both the DNA and protein levels, the propagation of

the

GC bias

is not limited to

FYMINK / GARP

codons

in their ‘holy war’ against systematic biases,

phylogenomic analyses would certainly benefit from

newer evolutionary models

that can deal with the

violation of the stationarity assumption

(17)

Acknowledgments

data, help and advice

Robert Charlebois (NeuroGadgets Inc.)

Nick Hamilton (UQ)

Tim Harlow (UQ / ACB)

Lars S. Jermiin (USyd)

funding

FNRS / ULg (Belgium)

ACB (Australia)

Références

Documents relatifs

Il semble que le Prochlorococcus ait besoin d'autres microbes pour certaines enzymes, la catalase et la catalase peroxydase, qui sont nécessaires pour se protéger contre les

Selection for diverse genes encoding cell surface proteins together with horizontal gene transfer of genes into genomic islands leads to a large Prochlorococcus population with

The purpose of this review was, therefore, to summarize the various roles of platelets in the context of ECMO in adult patients, with a special focus on the potential usefulness

Neither this play by Dumaniant nor Picard’s Médiocre et Rampant nor the other translations that appeared in the Teatro Moderno Applaudito during the municipal

Nous avons également noté que les phrases dans lesquelles nous retrouvons une mutation, un gène, un miARN et/ou une maladie sont forcément des phrases qui relatent un

Seuls les résultats obtenus pour la méthode en circuit ouvert 1 (avec renouvellement de milieu, trois bactéries et une microalgue) sont présentés dans la partie résultat.

The interpretation of TOF-SIMS data obtained by COSIMA will be a complex task due to the intimate mixture of minerals and organic matter expected in cometary

Growth rates and concentrations of Picophagus flagellatus feeding on Prochlorococcus (Pro), Synechococcus (Syn) and mixed cultures of Prochlorococcus and Synechococcus (Pro +