Automated tools for the generation and interpretation of single gene trees at a broad taxonomic scale

(1)

Automated tools for the generation

and interpretation of single gene trees

at a broad taxonomic scale

Denis Baurain

1

, Mick Van Vlierberghe

1

, Arnaud Di Franco

2

, Herve Philippe

2

1

_{InBioS, University of Liege, Belgium;}

2

_{SETE, UMR CNRS 5321, Moulis, Fance}

mvanvlierberghe@doct.ulg.ac.be

Abstract

I

DENTIFYING orthology relationships among sequences is fundamental in

phyloge-nomics; indeed, those are essential to understand evolution, diversity of life and ancestry among organisms. To build alignments of orthologous sequences, phy-logenomic pipelines often start with a step of all-vs-all similarity search followed by a clustering with an algorithm such as OrthoFinder [Emms and Kelly (2015)

Genome Biol 16:157]. For it to be as accurate as possible, proteomes of good quality

are needed but their availability is limited to a small subset of the living beings. Therefore, large-scale taxonomic phylogenomic analyses imply the enrichment of preexisting orthologous groups with transcriptomic or genomic data and the need for robust tools for identifying orthologues from heterogeneous sequence data. To this end, we have developed a novel tool, ”Forty-Two”, along the lines of HaMStR [Ebersberger et al. (2009) BMC Evol Biol 9:157], whose aim is to add (and op-tionally align) sequences to thousands of preexisting multiple sequence alignments (MSA) while controlling for orthology relationships and potentially contaminating sequences. ”Forty-Two” uses advanced heuristics based on a multiple Best Recipro-cal Hit (multi-BRH) strategy against reference proteomes to distinguish orthologous and paralogous sequences among homologues. It is fully functional and has already been used in two high-profile phylogenomic manuscripts (under review) dealing with the animal tree of life. Here, we present the principles and algorithms underlying ”Forty-Two” as well as the results of an extensive test suite of its features, in order to support its release to the public.

Workplan

Benchmark test for recovery of depleted orthologuous groups (MSA)

1. Dataset

(a) 57 organisms (quality proteomes) in 8 taxonomic groups (Figure 1) (b) Define orthologous groups (MSAs)

(c) Classify orthogroups (Table 1) 2. ‘Forty-Two‘

(a) For each depleted orthogroup:

(b) Try to add back the removed sequences (c) Compute statistics (TPs, FPs, TNs, FNs)

(d) Two runs of removal: groups and & supergroups

Max copy mean 1.1 1.2 2.0

Distribution

Universal 17 33 36

Widespread 76 130 59

Sparse 32 60 37

Table 1: Number of orthogroups (MSAs) according to classification criteria. A

taxo-nomic criterion (few intra-group representatives (”Sparse”), almost all intra-group rep-resentatives (”Widespread”) and all the reprep-resentatives (”Universal”) ) and a maxi-mum copy mean number criterion (1.1, 1.2 and 2.0)

Fragilariopsis cylindrus_186039@33153 Pseudo-nitzschia multiseries_37319@34320 Phaeodactylum tricornutum_2850@33576 Thalassiosira pseudonana_35128@31700 Aureococcus anophagefferens_44056@26106 Ectocarpus siliculosus_2880@37522 Phytophthora infestans_4787@38609 Pythium ultimum_65071@39286 Albugo candida_65357@34212 Saprolegnia parasitica_695850@38348 Aurantiochytrium limacinum_717989@37585 Schizochytrium aggregatum_876976@34729 Aplanochytrium kerguelense_87111@36723 Blastocystis sp_1529846@29162 Cryptosporidium parvum_353152@23915 Toxoplasma gondii_5811@30110 Vitrella brassicaformis_1169540@33770 Perkinsus marinus_423536@27132 Symbiodinium minutum_1202447@23577 Stylonychia lemnae_755200@31328 Tetrahymena thermophila_5911@31730 Bathycoccus prasinos_41875@36467 Ostreococcus lucimarinus_242159@35209 Micromonas pusilla_564608@35705 Chlamydomonas reinhardtii_3055@34821 Coccomyxa subellipsoidea_574566@37150 Arabidopsis thaliana_3702@35807 Selaginella moellendorffii_88036@35747 Klebsormidium flaccidum_3175@40407 Chondrus crispus_2769@29845 Porphyridium purpureum_35688@33795 Cyanidioschyzon merolae_45157@30581 Galdieria sulphuraria_130081@34958 Leishmania infantum_435258@25517 Trypanosoma brucei_679716@26063 Bodo saltans_75058@25984 Agaricus bisporus_5341@34985 Ustilago maydis_5270@36548 Puccinia graminis_418459@32969 Komagataella phaffii_460519@33158 Saccharomyces cerevisiae_559292@29992 Aspergillus niger_5061@36142 Mucor circinelloides_747725@37119 Rhizophagus irregularis_588596@36309 Allomyces macrogynus_28583@37843 Batrachochytrium dendrobatidis_684364@36756 Rozella allomycis_281847@29189 Fonticula alba_691883@32881 Homo sapiens_9606@38802 Lottia spp_72691@39609 Nematostella spp_45350@38117 Trichoplax adhaerens_10228@37961 Monosiga brevicollis_81824@34673 Capsaspora owczarzaki_595528@39546 Dictyostelium discoideum_44689@38177 Polysphondylium pallidum_13642@36717 Acanthamoeba castellanii_5755@36348 98 85 100 100 99 100 100 100 100 100 100 100 100 100 100 100 100 100 100 99 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 75 100 100 100 99 100 97 100 99 100 100 100

0.0828 smat-CAT-LGF.arb: , 57 species, 32 700 characters Amoebozoa (3)

Fonticula + Fungi (12) Ichthyosporea + Metazoa (6) Kinetoplastids (3) Rhodophyta (4) Viridiplantae (8) Alveolata (7) Stramenopiles (14)

Figure 1: RAxML tree built using our 57 selected species (model PROTCATLGF)

show-ing the phylogenetic groups used for the test.

q_seq01 q_seq02 q_seq03 r_org05 best_hit01.05 best_hit02.05 best_hit03 r_org06 best_hit01.05 best_hit02.05 best_hit03 r_org07 best_hit01.05 best_hit02.05 best_hit03 r_org08 No best hit No best hit No best hi

Collecting Queries In MSA (Multiple Sequence Alignment)

Set Of Homologous Sequences

h_10.01.1 ... BLASTP (bh VS r) h_10.02.1 h_10.03.1 h_10.04.1 h_10.05.1 r_org05 r_org06 r_org07 bh_10.01.1.05 bh_10.01.1.06 bh_10.01.1.07 bh_10.02.1.05 bh_10.02.1.06 bh_10.02.1.07 bh_10.03.1.05 bh_10.03.1.06 bh_10.03.1.07 bh_10.04.1.05 bh_10.04.1.06 bh_10.04.1.07 bh_10.05.1.05 bh_10.05.1.06 bh_10.05.1.07

(B) - Set Of Homologues's Best Hits

h_10.01.1 h_10.02.1 h_10.03.1 h_10.04.1 h_10.05.1 bh_10.01.1.05 bh_10.01.1.06 bh_10.01.1.07 bh_10.02.1.05 bh_10.02.1.06 bh_10.02.1.07 bh_10.03.1.05 bh_10.03.1.06 bh_10.03.1.07 bh_10.04.1.05 bh_10.04.1.06 bh_10.04.1.07 bh_10.05.1.05 bh_10.05.1.06 bh_10.05.1.07

2

_{Identifying Best Hits For Queries...}

1

... Selecting Ref_orgs

& Collecting Queries' Best Hits

active ref_orgs - r_org05 - r_org06 - r_org07

3

4

Identifying Homologues For Queries In Bank_orgs

O = best hit in collection queries best hits X = best hit not in collection of queries best hits

r_org05 r_org06 r_org07 bh.05.02 bh.05.03 bh.06.01 bh.06.02 ... BLASTP (bh VS r) bh.05.01 O O O O X X O O X O

Checking BRH (Best Reciprocal Hit)

WARNING WARNING Collection of MSAs MSA 1 > ... > ... > ... MSA 3 > ... > ... > ... MSA 4 > ... > ... > ... MSA 6 > ... > ... > ... MSA 7 > ... > ... > ... MSA 2 > ... > ... > ... MSA 5 > ... > ... > ... MSA 8 > ... > ... > ... MSA 9 > ... > ... > ... bank_orgs - b_org10 - b_org11 - b_org12 - b_org13 query_orgs - q_org01 - q_org02 - q_org03 - q_org04 - q_org05

queries for MSA - q_seq01 - q_seq02 - q_seq03 - q_seq04 MSA 1 > org01 > org02 > org03 > org04 > ... > orgX > orgY

FOR EACH MSA

Probe MSA for queries

ref_orgs - r_org05 - r_org06 - r_org07 - r_org08 - r_org09

no best hit = discard ref_org

q_seq01 q_seq02 q_seq03 r_org05 r_org06 r_org07 r_org09 bh.05.01 bh.05.02 bh.05.03 bh.06.03 bh.07.03 bh.07.01 bh.06.01 bh.06.02 bh.07.02 XX XX XX r_org08 bh.08.01 bh.08.02 bh.08.03 BLASTP (q VS r) avg_BS avg05 avg06 avg07 XX avg08 q_seq04 bh.05.04 bh.06.04 bh.07.04 XX bh.08.04

q_seq01 q_seq02 q_seq03 b_org10 TBLASTN (q VS b) b_org11 b_org12 b_org13 h_13.03.1 h_13.03.2 h_13.03.3 h_13.03.4 h_11.01.1 h_11.01.2 h_11.01.3 h_11.01.4 h_10.03.1 h_10.03.2 h_10.03.3 h_10.03.4 h_10.02.1 h_10.02.2 h_10.02.3 h_10.02.4 h_10.01.1 h_10.01.2 h_10.01.3 h_10.01.4 h_11.02.1 h_11.02.2 h_11.02.3 h_11.02.4 h_11.03.1 h_11.03.2 h_11.03.3 h_11.03.4 h_12.01.1 h_12.01.2 h_12.01.3 h_12.01.4 h_12.02.1 h_12.02.2 h_12.02.3 h_12.02.4 h_12.03.1 h_12.03.2 h_12.03.3 h_12.03.4 h_13.01.1 h_13.01.2 h_13.01.3 h_13.01.4 h_13.02.1 h_13.02.2 h_13.02.3 h_13.02.4 q_seq04 h_13.04.1 h_13.04.2 h_13.04.3 h_13.04.4 h_10.04.1 h_10.04.2 h_10.04.3 h_10.04.4 h_11.04.1 h_11.04.2 h_11.04.3 h_11.04.4 h_12.04.1 h_12.04.2 h_12.04.3 h_12.04.4

Set Of Queries' Best Hits

bh.05.03 bh.06.03 bh.07.03 bh.05.01 bh.07.01 bh.06.01 bh.05.01 bh.07.01 bh.06.01 bh.05.02 bh.06.02 bh.07.02 bh.05.02 bh.06.02 bh.07.02

(A) - Set Of Queries' Best Hits

bh.05.04 bh.06.04 bh.07.04 bh.08.04 bh.08.03 bh.08.01 bh.08.02 h_11.01.1 h_11.01.2 h_11.01.3 h_11.01.4 h_13.03.1 h_13.03.2 h_13.03.3 h_13.03.4 h_11.03.1 h_11.03.2 h_11.03.3 h_11.03.4 h_10.03.1 h_10.03.2 h_10.03.3 h_10.03.4 h_10.02.1 h_10.02.2 h_10.02.3 h_10.02.4 h_10.01.1 h_10.01.2 h_10.01.3 h_10.01.4 h_12.02.1 h_12.02.2 h_12.02.3 h_12.02.4 h_12.03.1 h_12.03.2 h_12.03.3 h_12.03.4 h_11.02.1 h_11.02.2 h_11.02.3 h_11.02.4 h_13.01.1 h_13.01.2 h_13.01.3 h_13.01.4 h_13.02.1 h_13.02.2 h_13.02.3 h_13.02.4 h_11.04.1 h_11.04.2 h_11.04.3 h_11.04.4 h_12.04.1 h_12.04.2 h_12.04.3 h_12.04.4 h_13.04.1 h_13.04.2 h_13.04.3 h_13.04.4 h_12.01.1 h_12.01.2 h_12.01.3 h_12.01.4 h_10.04.1 h_10.04.2 h_10.04.3 h_10.04.4

avg05 > avg06 > avg07 > avg08 applying ref_org_mul (0,75) ref_orgs - r_org05 - r_org06 - r_org07 - r_org08 - r_org09 bank_orgs - b_org10 - b_org11 - b_org12 - b_org13 ref_org_mul (0,75) h_10.01.1 = retained

Identifying Orthologues Among Homologues

if bh_10.01.1.05 & bh_10.01.1.06 & bh_10.01.1.07 in (A) in (A) in (A) h_10.02.1 = discarded if bh_10.02.1.05 or bh_10.02.1.06 or bh_10.02.1.07 not in (A) not in (A) not in (A) GO TO NEXT MSA

bank_orgs = transcriptomes from which sequences will added to MSAs = conﬁg parameters

ref_orgs = proteomes used to validate a sequence for addition into an MSA ref_org_mul = multiplier used for selecting a subset of ref_orgs

Figure 2: Diagram showing heuristics behind ‘forty-two‘.

● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision ref_mul ● ● ● ● ● ● ● ● ● ● 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sparse ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Widespread ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 recall precision Universal

Figure 3: PR-curves for MSAs with few intra-group representatives (”Sparse”), almost

all intra-group representatives (”Widespread”) and all the representatives (”Univer-sal”, 57 species). ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00

Alv Amo Fon−Fun Ich−Met Kin Rho Str Vir Serie AUC Groups ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00

Alv−Str Amo−Kin Fon−Fun−Ich−Met Rho−Vir Serie AUC Supergroups ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00

universal widespread sparse Taxonomic distribution AUC copies 11. 12 20 Groups ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.00 0.25 0.50 0.75 1.00

universal widespread sparse Taxonomic distribution AUC copies 11 12 20 Supergroups (B) (A) . . . . .

Figure 4: Distribution of computed areas under curve for each orthogroup for each

each of the two runs (”groups” and ”supergroups”) relative to (A) ”groups” and ”super-groups” and to (B) taxonomic distribution within orthogroups and number copies per MSA.

Perspectives

Further testing is planned; indeed it will be interesting to try to enrich public or-thogroups with biased taxonomic sampling and also to compare performances ver-sus HaMStR [BMC Evolutionary Biology 2009 9:157] which is also able to search for othologs in ESTs.