Analysis of evolutionary constraints on gene arrangements in animal genomes


Thesis

Reference

Analysis of evolutionary constraints on gene arrangements in animal genomes

LI, Jia


LI, Jia. Analysis of evolutionary constraints on gene arrangements in animal genomes . Thèse de doctorat : Univ. Genève, 2013, no. Sc. 4588

URN : urn:nbn:ch:unige-304282

DOI : 10.13097/archive-ouverte/unige:30428

Available at:

http://archive-ouverte.unige.ch/unige:30428

Disclaimer: layout of this document may differ from the published version.


UNIVERSITY OF GENEVA

FACULTY OF MEDICINE, Department of Genetic Medicine and Development: Professor Evgeny M. Zdobnov

FACULTY OF SCIENCE, Department of Computer Science: Professor Ron D. Appel

Analysis of evolutionary constraints on gene arrangements in animal genomes

THESIS

presented to the Faculty of Science of the University of Geneva to obtain the degree of Doctor of Science, specialization in bioinformatics

by

Jia LI

of Nanjing (China)

Thesis No. 4588, Geneva, 2013


PhD Thesis Title:

Analysis of evolutionary constraints on gene arrangements in animal genomes

PhD Candidate:

Jia Li

Supervisor:

Prof. Evgeny M. Zdobnov

Period of Study:

February 2010 – June 2013

Institution of Study:

University of Geneva

Research Group:

Computational Evolutionary Genomics Group


Acknowledgements

To my thesis supervisor, Professor Evgeny Zdobnov, for his guidance, support and encouragement throughout my thesis.

To Dr. Robert Waterhouse for all his advice and help over the past three years.

To the SIB PhD Training Network for giving me the opportunity to take part in various interesting courses, training sessions and conferences.

To Dr. Patricia Palagi, coordinator, always warm-hearted and ready to help.

To all the members of the CEGG group, past and present, who always supported me and helped me as much as they could. They contributed greatly to the pleasure of these three years.

Finally, many thanks to all my family and friends for everything they have done for me.


Abstract

The availability of an increasing number of animal genomes, mostly from vertebrate and insect species, provides an opportunity to employ comparative genomics approaches to study the evolution of gene arrangements. Genome architecture evolution through genome rearrangement events such as fission, fusion, inversion, translocation, and transposition leads to continual divergence from the ancestral architecture. In order to detect conserved genomic regions across distant species, a large-scale ortholog-anchored multiple-species synteny delineation workflow was developed in this research. The database that stores the synteny information generated by the workflow is also presented in this thesis, together with the web interface that allows public queries of this information. This thesis also elaborates on the evolutionary constraints on gene arrangements revealed by various analyses of the arthropod and vertebrate synteny block data. Furthermore, a detailed case study of a remarkably stable TipE gene cluster in insects and examples of genome rearrangement studies for the tobacco hawkmoth are presented to demonstrate the use of synteny information in evolutionary studies. By applying computational comparative approaches to animal genomes, this research has provided a promising tool and a comprehensive resource for the study of conserved genomic regions, and has helped to develop a better understanding of the forces shaping these genome architectures.


Résumé

The availability of a growing number of animal genomes, mostly from vertebrates and insects, allows us to use comparative genomics to study the evolution of gene arrangements. Changes in genome architecture caused by genome rearrangement events such as fission, fusion, inversion, translocation and transposition bring about constant transformations that result in a progressive divergence from the ancestral genome architecture. In order to better identify genomic regions shared by distant species, a large-scale synteny identification system based on orthologous genes in multiple species was developed in this research. In addition, the synteny information produced by this method is stored in a database and made accessible to the public through a web interface. This thesis also highlights the influence of evolution on gene arrangements, notably through its numerous analyses of synteny blocks in arthropods and vertebrates. The usefulness of synteny in evolutionary studies is further supported by a case study of a gene cluster, TipE, that is remarkably stable in insects, as well as by examples of genome rearrangement studies in the tobacco hawkmoth. The computational comparative approach to animal genomes at the heart of this research offers a promising tool and a comprehensive resource for the study of conserved genomic regions, and undoubtedly contributes to a better understanding of the forces shaping the architecture of these genomes.


Acknowledgements

Firstly, I would like to thank my supervisor, Professor Evgeny Zdobnov, for his encouragement, support and guidance throughout my PhD. I would also like to thank Dr. Robert Waterhouse for all his advice and help during the past three years.

I would like to thank the SIB PhD network for offering me opportunities to participate in various interesting courses, training sessions and conferences. My special thanks go to the coordinator, Dr. Patricia Palagi, who is always so warm-hearted and ready to help.

My appreciation also goes to all the members of the CEGG group, former and current, who always supported and helped me when they could, and who added a lot of fun to my PhD life.

Lastly I would like to thank all my family and my friends for everything they have done for me.


List of publications

Waterhouse, R.M., Tegenfeldt, F., Li, J., Zdobnov, E.M., and Kriventseva, E.V. (2013). OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 41, D358–365.

Li, J., Waterhouse, R.M., and Zdobnov, E.M. (2011). A remarkably stable TipE gene cluster: evolution of insect Para sodium channel auxiliary subunits. BMC Evol. Biol. 11, 337.

Waterhouse, R.M., Zdobnov, E.M., Tegenfeldt, F., Li, J., and Kriventseva, E.V. (2011). OrthoDB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res. 39, D283–288.


Contents

Abstract
Résumé
Acknowledgements
List of publications
List of figures
List of tables
Acronyms

1. Introduction
1.1. Synteny – conserved gene arrangements
1.2. Importance of synteny research
1.2.1. Evolutionary studies
1.2.2. Functional genomics studies
1.2.3. Genome annotation and assembly
1.3. State-of-the-art in synteny research
1.3.1. Synteny identification
1.3.2. Resources of synteny information
1.3.3. Random breakage model vs. fragile breakage model
1.3.4. Possible evolutionary constraints on genome rearrangements
1.3.5. Conserved gene clusters in various lineages
1.4. Insect genomes – excellent materials for synteny research

2. Objectives

3. Methods
3.1. Acquisition of orthology and genomic position information
3.1.1. Orthology information from OrthoDB
3.1.2. Genomic position information from genome resources
3.2. Assessment of genome assembly quality
3.3. Synteny block identification workflow
3.3.1. Species phylogenetic tree
3.3.2. Pairwise synteny identification
3.3.3. Species set generation
3.3.4. Species set distance calculation
3.3.5. N-wise projection
3.3.6. Synteny to orthologous group assignment
3.3.7. Synteny block length calculation
3.4. Development of OrthoBlock database and web interface
3.5. Analysis of OrthoBlock synteny data
3.5.1. Study of synteny block length distribution
3.5.2. Correlation test between percentage of orthologous groups / orthologs with synteny and average species set distance
3.5.3. Comparison of evolutionary rates of orthologous groups with and without synteny
3.5.4. Comparison of copy number divergences of orthologous groups with and without synteny
3.5.5. Analysis of Gene Ontology term enrichment
3.6. Genome rearrangement studies for tobacco hawkmoth genome
3.6.1. Estimation of minimal synteny breaks between two moths
3.6.2. Syntenic degree comparison among Lepidoptera, Diptera, and Aculeata

4. Results
4.1. Evolution of insect TipE gene cluster
4.2. Genome assembly quality affects synteny identification
4.3. OrthoDB: delineating syntenic orthologs
4.4. OrthoBlock: a searchable synteny block resource
4.4.1. OrthoBlock database
4.4.2. OrthoBlock web interface
4.5. Evolutionary constraints revealed by synteny data analysis
4.5.1. Synteny block length distribution suggests non-uniform distribution of synteny break points
4.5.2. Correlation between gene arrangement distance and protein sequence distance
4.5.3. Orthologous groups inside synteny blocks evolve slower
4.5.4. Less copy number divergence for orthologous groups with synteny
4.5.5. Morphogenesis and development GO terms are enriched in syntenic orthologs
4.6. Genome rearrangement studies for tobacco hawkmoth genome
4.6.1. More rearrangements happened between two moths than previous estimation
4.6.2. Slow genome shuffling in Lepidoptera

5. Discussion
5.1. Challenges in synteny identification
5.1.1. Genome quality
5.1.2. Growing number of genomes
5.2. Features of the synteny identification workflow in this research
5.3. OrthoBlock – a comprehensive resource of hierarchical synteny data
5.4. Evolutionary constraints on gene arrangements
5.5. Outlook

6. Conclusions

Bibliography

Appendix


List of figures

Figure 1.1 Genome rearrangement events of linear chromosomes
Figure 1.2 Proportion of synteny-related publications for each year from 1964 to 2012
Figure 1.3 Diversity of known species
Figure 1.4 Phylogenetic tree of insects and vertebrates
Figure 3.1 Phylogenetic tree of arthropod species
Figure 3.2 Phylogenetic tree of vertebrate species
Figure 3.3 Synteny block identification workflow
Figure 3.4 Pairwise synteny identification
Figure 3.5 Number of synteny blocks identified for different number of allowed intervening orthologs
Figure 3.6 An example of species set generation
Figure 3.7 N-wise projection
Figure 3.8 Map five-way synteny blocks to orthologous groups
Figure 4.1 Coverage of orthologs in OrthoDB for arthropod genomes
Figure 4.2 Coverage of orthologs in OrthoDB for vertebrate genomes
Figure 4.3 Number of scaffolds for arthropod genomes
Figure 4.4 Number of scaffolds for vertebrate genomes
Figure 4.5 N50gene and N50ortholog for arthropod genomes
Figure 4.6 N50gene and N50ortholog for vertebrate genomes
Figure 4.7 An example of syntenic ortholog results in OrthoDB
Figure 4.8 OrthoBlock database schema
Figure 4.9 Proportions of orthologs / orthologous groups with synteny
Figure 4.10 Proportions of orthologs with synteny at each node for arthropods
Figure 4.11 Proportions of orthologs with synteny at each node for vertebrates
Figure 4.12 OrthoBlock front page
Figure 4.13 An example of OrthoBlock result page for query by gene
Figure 4.14 An example of OrthoBlock node list page for query by public gene ID
Figure 4.15 An example of OrthoBlock gene list page for query by keyword
Figure 4.16 An example of OrthoBlock result page for query by orthologous group
Figure 4.17 Distribution of synteny block length for arthropod
Figure 4.18 Distribution of synteny block length for vertebrate
Figure 4.19 Correlation between percentage of orthologous group with synteny and average species set distance
Figure 4.20 Correlation between percentage of ortholog with synteny and average species set distance
Figure 4.21 Evolutionary rate comparison between orthologous groups with and without synteny
Figure 4.22 Evolutionary rate comparison between orthologous groups with and without synteny for each arthropod phylogenetic node
Figure 4.23 Evolutionary rate comparison between orthologous groups with and without synteny for each vertebrate phylogenetic node
Figure 4.24 Copy number divergence comparison between orthologous groups with and without synteny
Figure 4.25 Copy number divergence comparison between orthologous groups with and without synteny for each arthropod phylogenetic node
Figure 4.26 Copy number divergence comparison between non-single-copy orthologous groups with and without synteny for each arthropod phylogenetic node
Figure 4.27 Copy number divergence comparison between orthologous groups with and without synteny for each vertebrate phylogenetic node
Figure 4.28 Proportions of orthologs with synteny in different comparisons
Figure 4.29 Comparison of proportion of orthologs with synteny among three insect lineages


List of tables

Table 1.1 Synteny identification software programs
Table 1.2 Resources of multiple-species synteny information
Table 3.1 Orthology databases
Table 3.2 List of arthropod genomes studied in this thesis
Table 3.3 List of vertebrate genomes studied in this thesis
Table 3.4 Number of five-species sets for each phylogenetic node in this study
Table 4.1 Genome coverage of species with either N50gene or N50ortholog equal to 1
Table 4.2 Enriched GO terms for orthologs with five-species synteny


Acronyms

List of species codes (ordered phylogenetically):

Arthropod

CODE   Species name
ISCAP  Ixodes scapularis
TURTI  Tetranychus urticae
SMARI  Strigamia maritima
DPULE  Daphnia pulex
ZNEVA  Zootermopsis nevadensis
PHUMA  Pediculus humanus
RPROL  Rhodnius prolixus
APISU  Acyrthosiphon pisum
NVITR  Nasonia vitripennis
MROTU  Megachile rotundata
BIMPA  Bombus impatiens
BTERR  Bombus terrestris
AFLOR  Apis florea
AMELL  Apis mellifera
HSALT  Harpegnathos saltator
LHUMI  Linepithema humile
CFLOR  Camponotus floridanus
PBARB  Pogonomyrmex barbatus
SINVI  Solenopsis invicta
AECHI  Acromyrmex echinatior
ACEPH  Atta cephalotes
TCAST  Tribolium castaneum
MMOLD  Mengenilla moldrzyki
BMORI  Bombyx mori
MSEXT  Manduca sexta
DPLEX  Danaus plexippus
HMELP  Heliconius melpomene
ADARL  Anopheles darlingi
AGAMB  Anopheles gambiae
ASTEP  Anopheles stephensi
AAEGY  Aedes aegypti
CQUIN  Culex quinquefasciatus
MDEST  Mayetiola destructor
DGRIM  Drosophila grimshawi
DMOJA  Drosophila mojavensis
DVIRI  Drosophila virilis
DWILL  Drosophila willistoni
DPERS  Drosophila persimilis
DPSEU  Drosophila pseudoobscura
DANAN  Drosophila ananassae
DEREC  Drosophila erecta
DYAKU  Drosophila yakuba
DMELA  Drosophila melanogaster
DSECH  Drosophila sechellia
DSIMU  Drosophila simulans

Vertebrate

CODE   Species name
DRERI  Danio rerio
GMORH  Gadus morhua
OLATI  Oryzias latipes
ONILO  Oreochromis niloticus
GACUL  Gasterosteus aculeatus
TNIGR  Tetraodon nigroviridis
TRUBR  Takifugu rubripes
LCHAL  Latimeria chalumnae
XTROP  Xenopus tropicalis
ACARO  Anolis carolinensis
TGUTT  Taeniopygia guttata
MGALL  Meleagris gallopavo
GGALL  Gallus gallus
OANAT  Ornithorhynchus anatinus
MDOME  Monodelphis domestica
MEUGE  Macropus eugenii
SHARR  Sarcophilus harrisii
DNOVE  Dasypus novemcinctus
CHOFF  Choloepus hoffmanni
ETELF  Echinops telfairi
PCAPE  Procavia capensis
LAFRI  Loxodonta africana
SARAN  Sorex araneus
EEURO  Erinaceus europaeus
PVAMP  Pteropus vampyrus
MLUCI  Myotis lucifugus
ECABA  Equus caballus
FCATU  Felis catus
CFAMI  Canis familiaris
AMELA  Ailuropoda melanoleuca
VPACO  Vicugna pacos
SSCRO  Sus scrofa
BTAUR  Bos taurus
TTRUN  Tursiops truncatus
OPRIN  Ochotona princeps
OCUNI  Oryctolagus cuniculus
STRID  Spermophilus tridecemlineatus
CPORC  Cavia porcellus
DORDI  Dipodomys ordii
RNORV  Rattus norvegicus
MMUSC  Mus musculus
TBELA  Tupaia belangeri
OGARN  Otolemur garnettii
MMURI  Microcebus murinus
TSYRI  Tarsius syrichta
CJACC  Callithrix jacchus
MMULA  Macaca mulatta
NLEUC  Nomascus leucogenys
PABEL  Pongo abelii
GGORI  Gorilla gorilla
PTROG  Pan troglodytes
HSAPI  Homo sapiens


ABC: ATP-binding cassette
ATP: adenosine-5’-triphosphate
BAC: bacterial artificial chromosome
BKCa: big-conductance calcium-activated potassium channel
CV: coefficient of variation
DAVID: the Database for Annotation, Visualization and Integrated Discovery
DNA: deoxyribonucleic acid
EASE: Expression Analysis Systematic Explorer
EST: expressed sequence tag
GB: gigabyte
GMC: glucose-methanol-choline
GFF: General Feature Format
GTF: General Transfer Format
GO: Gene Ontology
HOM-C: homeotic gene complexes
Hox: Homeobox
ID: identifier
Irx: Iroquois
NaV: voltage-gated sodium channel
NCBI: National Center for Biotechnology Information
Nim: Nimrod
Npp: Natriuretic peptide
Osi: Osiris
RNA: ribonucleic acid
Scpp: Secretory calcium-binding phosphoprotein
TAIR: The Arabidopsis Information Resource
TipE: Temperature-induced paralytic E
Vtg: Vitellogenin
Wnt: Wingless


Chapter 1

Introduction


Genetic variation, the substrate of evolution, arises from two major types of sources: small-scale variations, which affect only one or a few nucleotides and include point mutations and short insertions and deletions; and large-scale variations, which include chromosomal segment duplications and deletions, inversions, translocations, and horizontal gene transfers. It is widely accepted that small-scale variations, which may alter the function of a gene or a functional element and thus affect the fitness of the organism, are usually under natural selection (Nielsen, 2005; Soskine and Tawfik, 2010). However, the mechanisms by which genome architecture has evolved are still under debate (Koonin, 2009; Al-Shahrour et al., 2010). With the availability of a growing number of genomes and increasing computational power, we can now apply computational comparative genomics approaches to study the evolutionary constraints on gene arrangements.

In this introductory chapter, the key aspects of gene arrangement research are described in the first three sections: the concept of synteny, the importance of synteny research, and the state of the art in synteny research. The fourth section gives a brief introduction to insect genomes, which are excellent materials for synteny research. This chapter provides the background on synteny and gene arrangements that is necessary to appreciate the research presented in this thesis.


1.1. Synteny – conserved gene arrangements

In the 1970s, in classical genetics, the word “synteny” started to be used to describe the physical co-localization of two or more genes on the same chromosome in one species (Creagan et al., 1973; McMorris et al., 1973; Renwick and Bolling, 1971). Then in the 1980s, as researchers became interested not only in the homology between genes in different species but also in the homology of genomic segments, the concept of “conserved synteny” arose (Djalali et al., 1987; Skow et al., 1987; Todd et al., 1985), which was defined as “homology segments composed of two or more pairs of homologous genes located on the same chromosome, regardless of gene order” in Nadeau's review in 1989 (Nadeau, 1989).

More recently, however, the term “synteny” has often been used in comparative genomics to describe the conservation of gene order in the genomes of different species with common evolutionary ancestry, and “shared synteny” or “conserved synteny” can also be used to refer to the same concept (d’Alençon et al., 2010; Bu et al., 2011; Hellsten et al., 2010). In this work, the recent usage of the term “synteny” is adopted to denote genomic regions with conserved gene arrangements among different genomes.

Synteny blocks, which are observed when different modern genomes are compared, are the remnants left after genome rearrangement events during evolution. For unichromosomal genomes, inversions are the most common genome rearrangement events; for multichromosomal genomes, the most common rearrangements include inversions, translocations, fissions, and fusions (Pevzner and Tesler, 2003a). Figure 1.1 portrays the rearrangement events of linear chromosomes: fission – one chromosome splits into two; fusion – two chromosomes merge into one; inversion (or reversal) – a segment of a chromosome is reversed; translocation – chromosomal segments are exchanged between nonhomologous chromosomes; and transposition – a chromosomal segment is transferred to another position on the same chromosome (Alekseyev, 2008). These genome rearrangement events have shaped the genome architectures of the various species since their divergence from their ancestors.
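The five rearrangement events can be made concrete with a small sketch. It assumes, purely for illustration, that a chromosome is a list of signed integers (the number identifies a gene, the sign its orientation); the function names and the tail-exchange form of translocation are choices made here, not definitions taken from the thesis.

```python
# Illustrative model only: a chromosome is a list of signed gene identifiers.

def fission(chrom, i):
    """Split one chromosome into two at position i."""
    return chrom[:i], chrom[i:]

def fusion(a, b):
    """Merge two chromosomes into one."""
    return a + b

def inversion(chrom, i, j):
    """Reverse the segment chrom[i:j] and flip the orientation of its genes."""
    return chrom[:i] + [-g for g in reversed(chrom[i:j])] + chrom[j:]

def translocation(a, b, i, j):
    """Exchange the tails of two nonhomologous chromosomes (one simple form)."""
    return a[:i] + b[j:], b[:j] + a[i:]

def transposition(chrom, i, j, k):
    """Move the segment chrom[i:j] to index k of the remaining chromosome."""
    segment = chrom[i:j]
    rest = chrom[:i] + chrom[j:]
    return rest[:k] + segment + rest[k:]

c1 = [1, 2, 3, 4, 5]
c2 = [6, 7, 8]
print(inversion(c1, 1, 4))
print(translocation(c1, c2, 2, 1))
```

Every operation preserves the set of genes and changes only their order, orientation, or chromosome assignment, which is why runs of genes left untouched by all events remain detectable as synteny blocks.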


Figure 1.1 Genome rearrangement events of linear chromosomes. The events include fission, fusion, inversion, translocation, and transposition. Bars to the left of the arrows represent chromosomes before the events; bars to the right of the arrows represent chromosomes after the events. Colors highlight the affected chromosomal regions.


1.2. Importance of synteny research

Nowadays, research on synteny and genome rearrangement not only improves our understanding of the evolution of genome architecture, but is also important for other evolutionary studies, functional genomics studies, genome annotation, and genome assembly. In this section, synteny-based evolutionary studies and functional genomics studies, as well as the uses of synteny information in genome assembly and annotation, are reviewed.


1.2.1. Evolutionary studies

Phylogeny reconstruction

The common way to reconstruct a phylogenetic tree is based on sequence distance, using either nucleotide or protein sequences. However, methods that use synteny data in phylogenetic reconstruction have recently been developed and applied (Lin et al., 2012; Ye et al., 2007). The advantages of synteny-based methods include capturing features of the whole genome, which avoids the gene tree versus species tree problem, and removing the need to align multiple sequences (Moret and Warnow, 2005).

Orthology identification

It is a very important yet difficult task to distinguish paralogs from orthologs in orthology delineation, especially for the genomes with high rates of gene duplication and loss. In order to accurately determine the orthology, various methods were proposed to identify orthologous genes using gene order data in addition to sequence data. Such methods have been implemented successfully in prokaryotic genomes (Lemoine et al., 2007), fungal genomes (Kellis et al., 2004; Wapinski et al., 2007), and mammalian genomes (Jun et al., 2009; Zheng et al., 2005).

Gene family evolution

Information on conserved synteny is useful for studying the evolution of gene families. Especially for gene families whose members are clustered together in some genomes, comparing the syntenic regions containing the family members among different lineages can facilitate the inference of the evolutionary history of such gene families (Hui et al., 2012; Iida et al., 2007; Li et al., 2011; Shah et al., 2012).


1.2.2. Functional genomics studies

Functional annotation

Complementary to homology information, synteny information can be used to predict gene functions and functional interactions of proteins. For example, the prediction of physical protein interactions based on conserved gene clusters was proposed because, in a study of nine bacterial and archaeal genomes, it was experimentally confirmed that the proteins encoded by conserved neighboring gene pairs are likely to physically interact with each other (Dandekar et al., 1998). It has also been shown that genes can be co-regulated and co-transcribed through the creation and maintenance of operons that contain non-homologous genes (von Mering et al., 2003). Many function prediction methods that make use of synteny information have been developed (Huynen et al., 2000; Overbeek et al., 1999; Yelton et al., 2011), and the improvement of such methods may help to complete the functional knowledge base.

Regulatory element studies

It was proposed that the cis-regulatory elements, which regulate the expression of the genes nearby, may contribute to the conservation of surrounding genome architecture during evolution (Hufton et al., 2009; Kikuta et al., 2007; Mackenzie et al., 2004; Mongin et al., 2009). Therefore, it is reasonable to identify the cis-regulatory elements or investigate their functions based on conserved synteny information. For example, conserved synteny blocks were used to delineate gene regulatory boundaries in a study of the human genome (Ahituv et al., 2005), and an algorithm was developed to predict the association between regulatory regions and their target genes using synteny (Mongin et al., 2011).


1.2.3. Genome annotation and assembly

The conservation of genome architecture during evolution can be used to improve the annotation of the genomes. By accurately assigning orthology relationship for gene families, wrongly annotated gene models can be corrected based on the well-annotated ortholog in other species (Li et al., 2011), and gene models which were not annotated initially but are expected to exist in the genomic region because of the presence in other genomes can be discovered (Li et al., 2011; OhÉigeartaigh et al., 2011). Moreover, several pipelines were developed to annotate microbial or yeast genomes with the consideration of synteny (Proux-Wéra et al., 2012; Vallenet et al., 2006).

Beyond correcting wrongly annotated gene models, synteny information can also be used to identify probable genome assembly errors by reference to closely related genomes (Bhutkar et al., 2006). To improve the quality of genome assemblies, especially for genomes with low sequence coverage, new genome assembly algorithms that make use of synteny information have recently been developed (Gnerre et al., 2009; Zhao et al., 2009).


1.3. State-of-the-art in synteny research

In recent years, the number of publications about synteny has increased rapidly. Figure 1.2 shows the yearly proportion of synteny-related publications: the total number of publications and the number of publications whose title or abstract contains the terms “synteny” or “gene order” were retrieved with the PubMed “Results by year” timeline tool (Canese, 2012). With the increasing number of newly sequenced genomes, more and more researchers are becoming interested in synteny.

In this section, state-of-the-art approaches for synteny identification and currently publicly available resources of synteny information are reviewed, and the debates between two theories of genome rearrangement as well as the possible evolutionary constraints are introduced.

Figure 1.2 Proportion of synteny-related publications for each year from 1964 to 2012 (data source: PubMed).


1.3.1. Synteny identification

The growing number of newly sequenced genomes encourages interest in employing comparative genomics approaches to study the evolution of genome architecture. One important issue in such studies is to identify evolutionarily conserved synteny blocks across various genomes. A number of computational tools based on different genome alignment strategies have been developed for synteny identification in the last few years (Table 1.1).

The DNA sequence-based approaches, such as GRIMM-Synteny (Pevzner and Tesler, 2003a) and Enredo (Paten et al., 2008), bypass the issues of gene annotation and orthology delineation, but may miss homology between genes that evolve so fast that the DNA similarity becomes obscure while protein-level similarity remains clear. Such approaches are therefore suitable only for very closely related species. For distant species, gene-based synteny identification methods (Calabrese et al., 2003; Despalins et al., 2011; Haas et al., 2004; Pham and Pevzner, 2010; Rödelsperger and Dieterich, 2008, 2010; Wang et al., 2012), which use genes as anchors to align genomes, generally based on BLASTP similarity or conserved distances, are more popular. However, the homology assignment may not be accurate enough, especially when dealing with species separated by larger evolutionary distances, for example the insects. Therefore, ortholog-based approaches, which take advantage of orthology identification prior to genome alignment, can perform better for comparisons of distant species.

There are already several publicly available ortholog-based synteny identification software programs, including TEAM (Luc et al., 2003), OrthoCluster (Zeng et al., 2008), and others (Muffato et al., 2010; Sinha and Meller, 2007; Soderlund et al., 2011; Vandepoele et al., 2002).
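To make the idea of ortholog-anchored synteny detection concrete, the following is a minimal sketch of a pairwise comparison. It is not the algorithm of any of the tools cited above, nor the workflow of this thesis. It assumes that each genome is given as a mapping from scaffold name to an ordered list of orthologous group identifiers, that each group has at most one member per genome, and that gene orientation is ignored; `max_gap` and `min_size` are hypothetical parameters introduced for illustration.

```python
def positions(genome):
    """Map each orthologous group ID to its (scaffold, index) position."""
    return {og: (scaffold, i)
            for scaffold, genes in genome.items()
            for i, og in enumerate(genes)}

def pairwise_synteny(genome_a, genome_b, max_gap=0, min_size=2):
    """Chain anchors that stay close together in both genomes.

    Two consecutive anchors in genome A are linked when at most max_gap
    genes lie between them in A and, in B, they are on the same scaffold
    with at most max_gap genes between them.
    """
    pos_b = positions(genome_b)
    blocks = []
    for genes in genome_a.values():
        # Anchors are groups that have an ortholog in genome B.
        anchors = [(i, og) for i, og in enumerate(genes) if og in pos_b]
        current = []
        for i, og in anchors:
            if current:
                prev_i, prev_og = current[-1]
                scaf_b, j = pos_b[og]
                prev_scaf_b, prev_j = pos_b[prev_og]
                close_in_a = i - prev_i - 1 <= max_gap
                close_in_b = (scaf_b == prev_scaf_b
                              and abs(j - prev_j) - 1 <= max_gap)
                if not (close_in_a and close_in_b):
                    if len(current) >= min_size:
                        blocks.append([g for _, g in current])
                    current = []
            current.append((i, og))
        if len(current) >= min_size:
            blocks.append([g for _, g in current])
    return blocks

genome_a = {"A1": ["og1", "og2", "og3", "og4", "og5"]}
genome_b = {"B1": ["og2", "og1", "og9", "og3"], "B2": ["og5", "og4"]}
print(pairwise_synteny(genome_a, genome_b))
print(pairwise_synteny(genome_a, genome_b, max_gap=2))
```

Because anchors are orthologous groups rather than raw sequence hits, the comparison does not depend on DNA-level similarity, which is the property that makes this family of methods attractive for distant species.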

Comparing synteny blocks generated for different hierarchical levels along a phylogenetic tree can facilitate studies of genome architecture evolution and therefore help to better understand the evolutionary constraints on gene arrangements. Although several software programs are able to generate multiple-species synteny data, only a handful of them take the phylogeny of the species into consideration (Despalins et al., 2011; Muffato et al., 2010; Rödelsperger and Dieterich, 2010).


Table 1.1 Synteny identification software programs

Another issue that should not be overlooked when performing large-scale multiple-species synteny studies is the quality of the genomes. The precision of genome sequencing, the contiguity of genome assemblies, and the accuracy of gene model annotations can influence the result of synteny identification dramatically. It is highly possible that a synteny block conserved in multiple species, which could otherwise be identified, is missed because a poorly assembled genome is included in the study.

Although the state-of-the-art sequencing technologies together with genome assembly and annotation algorithms provide us more high-quality genomes, there are still quite a lot interesting genomes with relatively low quality. However, there is no synteny identification method considering the genome quality currently available in the field (Table1.1).

Moreover, even with perfect genome assembly, growing number of evolutionary synteny breaks can be expected while more genomes are included into synteny analysis.

Therefore, a robust high-throughput strategy to identify multiple-species synteny blocks along the phylogeny and taking the genome quality and evolutionary synteny breaks into consideration is of interest in comparative genomics studies.

1.3.2. Resources of synteny information

Various genome browsers now offer functions to visualize synteny for well-annotated genomes, including Ensembl syntenyview (Clamp et al., 2003), NCBI's Map Viewer (Wheeler et al., 2007), as well as the TAIR synteny viewer (Lamesch et al., 2010) and the Wormbase synteny viewer (Harris et al., 2004), which use the Generic Synteny Browser (GBrowse_syn) framework (McKay et al., 2010). These online browsers are based on pairwise synteny information, although they display synteny blocks in a multiple-species manner. In fact, only a few resources provide true genome-wide multiple-genome synteny information (Table1.2), such as Cinteny (Sinha and Meller, 2007), Genomicus (Louis et al., 2013), SyntTax (Oberto, 2013), and OrthoClusterDB (Ng et al., 2009), and they focus mainly on prokaryotic and vertebrate genomes.

Table1.2 Resources of multiple-species synteny information

1.3.3. Random breakage model vs. fragile breakage model

The first theory attempting to explain the mechanism of genome rearrangements during evolution is the random breakage model, which was postulated by Ohno (Ohno, 1973) and later formalized by Nadeau and Taylor, who suggested that chromosome rearrangements occur randomly and that breakpoints are distributed uniformly and independently in the genome (Nadeau and Taylor, 1984). Since then, a number of other studies have supported the random breakage model (Copeland et al., 1993; Lander et al., 2001; Mural et al., 2002; Nadeau and Sankoff, 1998; Schoen, 2000; Waddington et al., 2000).

However, in 2003, Pevzner and Tesler rejected the random breakage model and postulated an alternative fragile breakage model, which suggests the existence of fragile regions with a high propensity for genome rearrangements (hotspots), by showing evidence of breakpoint reuse in evolution based on a comparison of the human and mouse genomes (Pevzner and Tesler, 2003b). They argued that previous studies supported the random breakage model because they ignored a large number of very short synteny blocks, in other words, because they lacked sufficient resolution (Pevzner and Tesler, 2003b). Although there is some debate between the supporters of the two models (Alekseyev and Pevzner, 2007; Peng et al., 2006; Sankoff, 2006; Sankoff and Trinh, 2004), a growing number of recent studies of both vertebrate genomes (Becker and Lenhard, 2007; Gordon et al., 2007; Hinsch and Hannenhalli, 2006; Kemkemer et al., 2009; Mlynarski et al., 2010; Murphy et al., 2005; Ruiz-Herrera et al., 2006) and insect genomes (Bhutkar et al., 2008; Zdobnov and Bork, 2007) further explore and favor the fragile breakage model of chromosome evolution, suggesting a geometric rather than uniform distribution of breakpoints along chromosomes (Zdobnov and Bork, 2007).

1.3.4. Possible evolutionary constraints on genome rearrangements

Detailed studies of syntenic gene families and gene clusters have revealed several possible evolutionary constraints that maintain conserved genome architecture. It has been suggested that in bacterial and archaeal genomes, the proteins encoded by gene pairs with conserved order are likely to interact physically (Dandekar et al., 1998) or that the genes are co-transcribed (von Mering et al., 2003). For eukaryotic genomes, various studies have illustrated that conserved neighboring genes are often coordinately regulated at the transcriptional level (Arnone et al., 2012; Chopra, 2011; Iida et al., 2007; McKimmie et al., 2005). Such correlation can be associated with factors including relative gene orientation, intergenic distance, and functional relationships (Dávila López et al., 2010). In addition, some studies have suggested that cis-regulatory elements in syntenic regions play an important role in the conservation of genomic organization across genomes during evolution (Maeso et al., 2012; Quijano et al., 2008). Moreover, cases of conserved gene clusters nested within a large gene have been reported, suggesting possible structural constraints that preserve the conserved genomic neighborhood (Assis et al., 2008; Iida et al., 2007; Kumar, 2009).

Furthermore, previous large-scale studies of prokaryotic and fungal genomes have shown that genes located inside conserved gene clusters evolve more slowly than those outside (Dandekar et al., 1998; Hachiya and Sakakibara, 2009; Lemoine et al., 2007), and studies of primate genomes have revealed the co-localization of synteny breakpoints and copy number variants (Carbone et al., 2006; Kemkemer et al., 2009; Roberto et al., 2007), implying the existence of evolutionary constraints on conserved syntenic regions.

1.3.5. Conserved gene clusters in various lineages

Since the insect homeotic gene complexes (HOM-C) and the vertebrate Homeobox (Hox) clusters, which have a conserved organization on the chromosome in many animals, were found about thirty years ago (Akam, 1989; Beeman, 1987; Duboule and Dollé, 1989; Gaunt, 1988; Graham et al., 1989), a number of conserved synteny blocks and clustered gene families have been investigated. In bacterial and archaeal genomes, conserved gene clusters and pairs have been identified, such as ribosomal proteins involved in ribosomal particle formation, ATP synthase subunits, ATP-binding cassette (ABC) transporter subunits, and RNA polymerase subunits (Dandekar et al., 1998; Eym et al., 1996). Studies of fungal genomes have reported examples including the ribosomal protein pairs L21-A/S9-A and S16/L13, iron transport proteins, mitochondrial heat-shock proteins, and the DNA repair protein pair Rad16 and Rad7 (Dávila López et al., 2010; Reed et al., 1998). Besides the clusters of the Hox gene family, other gene clusters are also conserved in both vertebrates and insects: the Histone gene clusters (Braastad et al., 2004; Marzluff et al., 2002; Nagel and Grossbach, 2000), the Iroquois (Irx) homeobox gene clusters (Irimia et al., 2008; Kerner et al., 2009), the Wingless (Wnt) gene clusters (Bolognesi et al., 2008; Nusse, 2001), and the transcription factor Sox gene clusters (Mazzuchelli et al., 2011; McKimmie et al., 2005). For vertebrates, additional clusters such as the Secretory calcium-binding phosphoprotein (Scpp) gene cluster (Kawasaki and Weiss, 2003), the Natriuretic peptide (Npp) gene cluster (Houweling et al., 2005), and the Vitellogenin (Vtg) gene cluster (Babin, 2008; Finn et al., 2009) have been identified. Moreover, studies have also reported gene families with conserved gene order in insect genomes, including the GMC oxidoreductase gene family (Iida et al., 2007), the Osiris (Osi) gene family (Dorer et al., 2003; Shah et al., 2012; Zdobnov et al., 2002), the Nimrod (Nim) gene superfamily (Somogyi et al., 2010), the Yellow gene family (Ferguson et al., 2011), and the Temperature-induced paralytic E (TipE) gene family (Derst et al., 2006; Li et al., 2011).

1.4. Insect genomes – excellent materials for synteny research

Insects, which diverged millions of years ago, are the largest and most diverse group of animals on Earth (Figure1.3) (Wilson, 1993). Because they evolve considerably faster than vertebrates (Figure1.4) (Wyder et al., 2007), comparing insect genome architectures provides an exceptional opportunity to examine gene arrangements over large evolutionary distances. In particular, comparative genomics studies of insect genomes should help to identify the genome rearrangements that are constrained by selection. Moreover, many insects are vectors of pathogens that cause human diseases. For example, some mosquitoes of the genus Anopheles are vectors of malaria (Neafsey et al., 2013).

Therefore, a better understanding of the evolution of insect genome architectures, which facilitates functional genomics studies, may highlight possible strategies for insect control, including insecticide development (Dorer et al., 2003; Li et al., 2011). A number of annotated insect genomes are now publicly available, and they cover quite diverse insect lineages, including Hemipterodea, Hymenoptera, Coleoptera, Lepidoptera, and Diptera, among others. Thus, the synteny study in this work focuses mainly on these insect genomes, as well as on some other arthropod genomes. The vertebrate genomes are also studied using the same methodology.

Figure1.3 Diversity of known species (Wilson, 1993).

Figure1.4 Phylogenetic tree of insects and vertebrates (adapted from Wyder et al., 2007).

Branch lengths are proportional to the rate of amino acid substitutions estimated by the maximum-likelihood approach. The branches of the insect lineage (red) are longer than those of the vertebrate lineage, indicating that insects evolve faster than vertebrates.

Chapter 2

Objectives

As described in the introductory chapter, the comparative study of gene arrangements across multiple species can facilitate a variety of biological research, not only toward a better understanding of the evolution of genome architecture, but also in other evolutionary and functional genomics studies, as well as in genome projects. With the increasing amount of publicly available genomic data and the accessibility of high-performance computing resources, it is now possible to apply comparative genomics approaches to large-scale synteny studies.

This research aims to explore the evolutionary constraints on gene arrangements in animal genomes, for which a comprehensive synteny resource is fundamental. Therefore, the objectives of this thesis are:

to develop a high-throughput multiple-species synteny delineation strategy that identifies synteny blocks along the species phylogeny, takes genome quality and evolutionary synteny breaks into consideration, and is suitable for animal genome evolution studies

to create a database to store the resulting synteny data and a web interface allowing public access to these data

to analyze the synteny data in order to better understand the evolutionary constraints on animal genome rearrangements

In this thesis, the orthology-based synteny delineation strategy is described in the synteny block identification workflow section of the Methods chapter, which addresses in detail the major steps of the workflow that offer advantages in multiple-species synteny identification. In the Results chapter, the section on the evolution of the insect TipE gene cluster presents a detailed examination of this remarkably stable cluster, a pilot study that makes use of the results generated by the preliminary workflow. The Results chapter also describes the OrthoBlock database, which stores the synteny block information generated by the identification workflow, together with its web interface, and presents the analysis of the synteny block data generated by the multiple-species synteny block identification workflow developed in this study. As an example of using this synteny delineation method to study the evolution of genomic architecture, the results of genome rearrangement studies for the tobacco hawkmoth genome are presented. Following the discussion section, the main results and findings of this thesis are summarized.

Chapter 3 Methods

Various bioinformatics approaches were applied in this thesis for data acquisition and integration, data manipulation, data management and display, and computational data analysis. High-performance computing was also used extensively. This chapter lists and describes the methodologies and the bioinformatics programs and tools applied in this thesis. First, the acquisition of the two major input data types, orthology information and genomic position information, is introduced. After a description of the method used for genome assembly quality assessment, the workflow for synteny block identification is addressed in detail. The programming languages used in the workflow development include Perl (www.perl.org), Python (www.python.org), Ruby (www.ruby-lang.org) and Bash (www.bash.org). Then, the development of the MySQL-based OrthoBlock database and the PHP (php.net)-based web interface is described. The last section of this chapter elaborates the detailed methods for the analysis of OrthoBlock synteny data. The R language (www.r-project.org) is mainly used for statistical analysis in this thesis.

3.1. Acquisition of orthology and genomic position information

Synteny delineation across multiple species starts with the mapping of orthologous genomic loci, which are then used as anchors to build synteny blocks. Orthologous relations among protein-coding genes provide reliable anchors suitable for synteny block identification over large evolutionary distances. Thus, the principal input materials for synteny delineation are orthologs and their genomic positions in each of the species to be examined.

3.1.1. Orthology information from OrthoDB

In order to generate synteny blocks based on pre-identified orthologs, one or several publicly available orthology databases with accurate and comprehensive orthology data need to be selected for orthology information retrieval. Although the orthology databases differ substantially in their orthology delineation methodology, most of them provide generally correct orthology predictions (Boeckmann et al., 2011; Trachana et al., 2011); thus, the main criterion for selecting the orthology data resource in this work is the coverage of arthropod species. As shown in Table3.1, OrthoDB (www.orthodb.org) (Waterhouse et al., 2013) covers the largest number of arthropod genomes, includes all arthropod species contained in other orthology databases (Altenhoff et al., 2011; Mi et al., 2013; Ostlund et al., 2010; Penel et al., 2009; Powell et al., 2012; Vilella et al., 2009), and has good vertebrate species coverage. In addition, it provides comprehensive orthology data with orthologs delineated at each node of the species phylogeny. Therefore, the orthology information required for synteny identification in this study is retrieved from OrthoDB.

Orthology data from OrthoDB were used throughout the development of the synteny identification methods, and the results presented in this thesis are based on arthropod and vertebrate data from OrthoDB version 6 (2012) (Waterhouse et al., 2013). OrthoDB6 provides orthology data for protein-coding genes from 45 arthropods (Figure3.1) covering various lineages including Arachnida, Myriapoda, Crustacea, Isoptera, Hemipterodea, Hymenoptera, Coleoptera, Strepsiptera, Lepidoptera, and Diptera; and 52 vertebrates (Figure3.2) covering lineages including ray-finned fishes, amphibians, reptiles, birds, marsupials, hoofed mammals, carnivorans, rodents, and primates. In total, 559,707 arthropod genes (see Table3.2 for details of each species) are classified in orthologous groups at 44 nodes (Figure3.1) along the arthropod phylogeny, and 898,744 vertebrate genes (see Table3.3 for details of each species) are classified in orthologous groups at 51 nodes (Figure3.2) along the vertebrate phylogeny. These data therefore provide a comprehensive set of genomic anchors that can be used for synteny delineation across two large and diverse animal lineages.

Table3.1 Orthology databases

Figure3.1 Phylogenetic tree of arthropod species. Node IDs are assigned to each phylogenetic node. Nodes with at least five species are labeled red.

Figure3.2 Phylogenetic tree of vertebrate species. Node IDs are assigned to each phylogenetic node. Nodes with at least five species are labeled red.

Table3.2 List of arthropod genomes studied in this thesis

Table3.3 List of vertebrate genomes studied in this thesis

3.1.2. Genomic position information from genome resources

Genomic position data for protein-coding genes in all the genomes studied were retrieved from the corresponding GFF (General Feature Format) or GTF (General Transfer Format) files from various sources, according to the genebuild versions used in OrthoDB (Table3.2 and Table3.3). The source databases include AphidBase (Legeai et al., 2010), BeetleBase (Kim et al., 2010), FlyBase (Marygold et al., 2013), Hymenoptera Genome Database (Munoz-Torres et al., 2010), SilkDB (Duan et al., 2010), VectorBase (Megy et al., 2011), wFleaBase (Colbourne et al., 2005), and several genome consortia for arthropods; and Ensembl (Flicek et al., 2013) (Release 67, May 2012) for vertebrates.

3.2. Assessment of genome assembly quality

As genome assembly quality can influence synteny block identification through the introduction of many artificial breakages in highly fragmented genome assemblies, it is important to examine the relative qualities of each assembly. This was assessed in this study using the principles of “N50” statistics. In genomics, a genome's N50 statistic is defined as the contig length N for which the collection of all contigs that are longer than or equal to N contain at least half of the total nucleotides in the genome. The principle of a genome's N50 statistic was adapted for this study to count genes and orthologs rather than nucleotides, since orthologous protein-coding genes were used as anchors in the synteny delineation procedure.

Therefore, N50gene is defined as contig length N for which at least half of all genes in one genome are in the collection of all contigs of length N or longer. Similarly, N50ortholog is defined as contig length N for which at least half of all orthologs in one genome are in the collection of all contigs of length N or longer. To determine N50gene/ortholog of a genome, the following algorithm was used (presented in pseudocode):

T = total number of genes/orthologs in the genome
array_contig = array of sorted contigs (largest first)
S = 0
for each contig in array_contig {
    L = number of genes/orthologs on the contig
    S += L
    if ( S >= T * 0.5 ) {
        N50gene/ortholog = L
        break
    }
}
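
The pseudocode above can be sketched in Python as follows. This is a minimal illustration rather than the script used in the thesis: the function name `n50_by_count` and the example counts are hypothetical, and the contigs are assumed to be ranked by their number of genes or orthologs, largest first, since the sort key is not stated explicitly.

```python
from typing import Sequence


def n50_by_count(genes_per_contig: Sequence[int]) -> int:
    """Gene-count analogue of N50 for one assembly (hypothetical helper).

    genes_per_contig holds the number of genes (or orthologs) on each contig.
    Assumption: contigs are ranked by that count, largest first.
    """
    total = sum(genes_per_contig)
    if total == 0:
        raise ValueError("assembly has no annotated genes")
    running = 0
    for count in sorted(genes_per_contig, reverse=True):
        running += count
        # Stop at the first contig where the cumulative count reaches half.
        if running >= total * 0.5:
            return count
    raise AssertionError("unreachable: the running sum always reaches total")


example = [40, 25, 15, 10, 6, 4]  # hypothetical gene counts per contig
print(n50_by_count(example))
```

Returning at the first contig that reaches the halfway point matters: without that early exit, the loop would keep overwriting the result with the counts of smaller contigs.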

3.3. Synteny block identification workflow

The synteny block identification strategy, which was developed on the basis of an earlier method (Zdobnov and Bork, 2007; Zdobnov et al., 2002), follows four major steps through an SQL-based workflow: 1) pairwise synteny identification; 2) species set generation; 3) N-wise projection; 4) synteny to orthologous group assignment. Figure3.3 gives an overview of the workflow: for each node of the species tree with at least five species, pairwise synteny blocks are generated for each pair of species, and a list of all possible sets of five species is generated; next, ortholog-based five-way synteny blocks are generated from the pairwise synteny blocks for each five-species set by N-wise projection, and the resulting synteny blocks are then assigned to orthologous groups. In addition, species set distances and synteny block lengths are calculated and stored for species set selection and synteny block selection in the OrthoBlock web interface.

Figure3.3 Synteny block identification workflow. Arrows show the flow of control; processing steps are represented as rectangular boxes, and the most time-consuming steps are labeled red; input/output data are represented as parallelograms; initial input data are labeled blue, and output data stored in the OrthoBlock database are labeled green.

3.3.1. Species phylogenetic tree

The arthropod and vertebrate species trees (Figure3.1 and Figure3.2) used in the synteny block identification workflow, which were reconstructed using the maximum-likelihood approach based on protein sequence distances of concatenated single-copy orthologs (Kriventseva et al., 2008), were retrieved from OrthoDB. A unique node identifier was assigned to each phylogenetic node on the species trees.

3.3.2. Pairwise synteny identification

As the gene position data for arthropod genomes were retrieved from various resources, there were some differences among the file formats. To obtain uniform data for pairwise synteny comparison, several scripts were written for transforming original GFF/GTF files into new GFF/GTF files with matching formats. The gene position data were then imported into a MySQL table.

To identify pairwise synteny blocks for a given pair of species, the MySQL tables with gene position data and OrthoDB orthologous relationship data are queried. First, based on OrthoDB data, all orthologous genes (including multi-copy orthologs) with at least one ortholog in the other species are selected as anchors for synteny determination, and these genes are ranked according to their genomic positions in each of the two species. Next, two pairs of anchors (e.g. the A-A' orthologous pair and the B-B' orthologous pair, where genes A and B are from species S and genes A' and B' are from species S', Figure3.4) are grouped into an anchor-tetrad if the genes are neighbors in both species (A and B are neighbors in species S, and A' and B' are neighbors in species S'). The number of intervening orthologs, which is calculated from the orthologous gene ranking, is used to determine whether two anchors are neighbors: two anchors are considered neighbors if the number of orthologs between them is no greater than a cut-off value Nint. If Nint is set to zero, only the most stringent synteny blocks are obtained. In order to generate more and longer synteny blocks, some minor shuffling within blocks can be allowed by increasing Nint. After experimenting with different numbers of allowed intervening orthologs (Figure3.5), the default for pairwise synteny comparisons was set to Nint=2. Then, the anchor-tetrads are extended into synteny blocks containing as many anchor-tetrads as possible by clustering all anchor-tetrads that share a common anchor pair (e.g. anchor-tetrad A-A':B-B' and anchor-tetrad B-B':C-C' share the common anchor pair B-B', Figure3.4). If no other anchor-tetrad shares a common anchor pair with it, the anchor-tetrad itself is identified as a synteny block, i.e. the minimal synteny block consists of a single anchor-tetrad.
In the last step, since tandem duplications are not considered synteny breakages in this study, paralogs present in the same synteny block are grouped into a single anchor for the subsequent N-wise synteny identification step, which greatly reduces the space and running time required.
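
The anchor-tetrad clustering described above can be illustrated with a small in-memory sketch. It is not the SQL-based implementation of the workflow: the names `are_neighbors` and `pairwise_blocks`, the union-find clustering, and the toy gene names are all assumptions made for illustration, and the final grouping of tandem paralogs into a single anchor is omitted.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Position = Tuple[str, int]  # (scaffold, rank among the anchors on that scaffold)
Pair = Tuple[str, str]      # (gene in species S, orthologous gene in species S')


def are_neighbors(p: Position, q: Position, nint: int) -> bool:
    """Two distinct anchors on one scaffold with at most nint anchors between."""
    return p[0] == q[0] and 1 <= abs(p[1] - q[1]) <= nint + 1


def pairwise_blocks(pos_s: Dict[str, Position], pos_t: Dict[str, Position],
                    pairs: List[Pair], nint: int = 2) -> List[Set[Pair]]:
    """Cluster orthologous pairs into blocks through shared anchor-tetrads."""
    parent = {pair: pair for pair in pairs}

    def find(x: Pair) -> Pair:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Two pairs form an anchor-tetrad if their genes are neighbors in both
    # species; tetrads sharing an anchor pair end up in the same cluster.
    for p, q in combinations(pairs, 2):
        if (are_neighbors(pos_s[p[0]], pos_s[q[0]], nint)
                and are_neighbors(pos_t[p[1]], pos_t[q[1]], nint)):
            parent[find(p)] = find(q)

    clusters: Dict[Pair, Set[Pair]] = {}
    for pair in pairs:
        clusters.setdefault(find(pair), set()).add(pair)
    # A block needs at least one anchor-tetrad, i.e. two orthologous pairs.
    return [block for block in clusters.values() if len(block) >= 2]


# Hypothetical toy data: d' lies on another scaffold, so d-d' joins no block.
pos_s = {"a": ("chr1", 0), "b": ("chr1", 1), "c": ("chr1", 2), "d": ("chr1", 3)}
pos_t = {"a'": ("chrX", 0), "b'": ("chrX", 1), "c'": ("chrX", 2), "d'": ("chrY", 0)}
pairs = [("a", "a'"), ("b", "b'"), ("c", "c'"), ("d", "d'")]
print(pairwise_blocks(pos_s, pos_t, pairs, nint=0))
```

The all-against-all comparison of pairs is quadratic and only suitable for a toy example; a genome-scale version would compare each anchor only with the anchors within Nint + 1 ranks of it.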

Figure3.4 Pairwise synteny identification. In this example, for species S and species S', the orthologous gene pairs A-A' (red) and B-B' (blue) are separated by intervening orthologs (gray). Since the genes A/A' and B/B' are neighbors in both species (the number of intervening orthologs between them is no greater than the threshold), the orthologous pairs A-A' and B-B' are grouped into an anchor-tetrad. If there is another anchor-tetrad containing the orthologous pairs B-B' and C-C', the anchor-tetrads are merged into one synteny block based on the common anchor pair (B-B' in this case).

Figure3.5 Number of synteny blocks identified for different numbers of allowed intervening orthologs. This experiment was performed for five-way synteny block identification across Drosophila melanogaster, Anopheles gambiae, Tribolium castaneum, Apis mellifera, and Pediculus humanus. Most of the blocks identified for the different values of Nint contain two orthologous groups: only one block with three groups is identified when Nint is zero; one block is identified with three, four, and five groups, respectively, when Nint is set to 1, 2, and 3 (data not shown). To obtain more synteny blocks while maintaining stringency, Nint=2 is set as the default.

3.3.3. Species set generation

The number of synteny blocks identified across multiple species decreases rapidly as the number of species increases. This decrease is more severe if the added genome is fragmented. No synteny block is identified at the arthropod root node when all 45 arthropod genomes are required, and only four blocks are identified across seven distant genomes of relatively good quality (Drosophila melanogaster, Anopheles gambiae, Tribolium castaneum, Apis mellifera, Pediculus humanus, Daphnia pulex, and Ixodes scapularis). To reduce the influence of artificial synteny breakages introduced by genome assembly gaps, as well as the impact of exceptional breakages that occurred in certain genomes during evolution, N-wise synteny identification was performed on five-species sets: all possible combinations of five species that represent both clades of a phylogenetic node were generated for each node with at least five species along the species tree (22 nodes for the arthropod lineage and 24 nodes for the vertebrate lineage). In this way, more and longer synteny blocks can be identified for a phylogenetic node. Figure3.6 shows an example of five-species set generation: for a phylogenetic node with two clades containing three and four species, respectively, 21 five-species sets can be produced. The numbers of species sets generated for each node are listed in Table3.4 for both arthropods and vertebrates.

Figure3.6 An example of species set generation. For the phylogenetic node (labeled as yellow star), all possible combinations of five species representing both branches (blue and red, respectively) are listed.
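
The species set generation can be sketched with the Python standard library. The function name `five_species_sets` and the species labels are hypothetical; the sketch assumes that a node is described simply by the species lists of its two clades.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def five_species_sets(clade_a: Iterable[str], clade_b: Iterable[str],
                      size: int = 5) -> List[FrozenSet[str]]:
    """All species sets of the given size that represent both clades of a node."""
    a, b = set(clade_a), set(clade_b)
    result = []
    for combo in combinations(sorted(a | b), size):
        chosen = frozenset(combo)
        # Keep only sets with at least one species from each clade.
        if chosen & a and chosen & b:
            result.append(chosen)
    return result


# The example of Figure3.6: clades with three and four species.
sets = five_species_sets(["s1", "s2", "s3"], ["s4", "s5", "s6", "s7"])
print(len(sets))  # 21
```

With clades of three and four species, neither clade can fill a five-species set alone, so all C(7,5) = 21 combinations are kept, in agreement with the example in the text.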

Table3.4 Number of five-species sets for each phylogenetic node in this study

3.3.4. Species set distance calculation

The species set distance is defined as the sum of the amino acid sequence distances between all possible pairs of species in the five-species set. The sequence distance between two species, which is often used to estimate divergence time, was taken from the distance matrix of all species pairs, which was converted from the Newick format species phylogenetic trees based on branch lengths using the Newick Utilities (Junier and Zdobnov, 2010). Species and species set distances in this thesis are therefore always given in units of amino acid substitutions per site.
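
As a small illustration of this definition, the sketch below sums the pairwise distances of a species set. The function name `species_set_distance` and the distance values are hypothetical, and the pairwise distance matrix is assumed to have been computed beforehand from the species tree.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def species_set_distance(species: Iterable[str],
                         pairwise: Dict[FrozenSet[str], float]) -> float:
    """Sum of the pairwise sequence distances over all species pairs in the set."""
    return sum(pairwise[frozenset(pair)] for pair in combinations(species, 2))


# Hypothetical distances (amino acid substitutions per site); three species
# are used here for brevity, the workflow applies this to five-species sets.
pairwise = {
    frozenset({"A", "B"}): 0.5,
    frozenset({"A", "C"}): 1.0,
    frozenset({"B", "C"}): 1.5,
}
print(species_set_distance(["A", "B", "C"], pairwise))  # 3.0
```

Keying the matrix by unordered pairs avoids storing each distance twice. A five-species set contributes ten pairwise terms to the sum.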

3.3.5. N-wise projection

For each five-species set, the N-wise projection algorithm was applied to the previously identified pairwise synteny blocks of all possible pairs of species among the five, in order to generate five-way synteny blocks that are conserved in all five genomes.

Figure3.7 shows how the projection algorithm works: for two species pairs A-B and B-C sharing common species B, if a pairwise synteny block from species pair A-B shares common anchors in species B with a pairwise synteny block from species pair B-C, these anchors in species B together with their orthologous anchors in species A and C define a three-way synteny block of species set A-B-C. Three-way synteny blocks are extended to four-way, then five-way blocks by the same algorithm.
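
One projection step can be sketched as follows. This is a simplified illustration, not the SQL implementation of the workflow: a block is represented as a list of columns, each mapping a species to one anchor gene (paralogs are assumed to be already grouped into single anchors), and the function name `project`, the gene names, and the requirement of at least two shared anchors are assumptions.

```python
from typing import Dict, List, Optional

Column = Dict[str, str]  # species -> anchor gene of one orthologous anchor set
Block = List[Column]


def project(block_x: Block, block_y: Block, shared: str,
            min_anchors: int = 2) -> Optional[Block]:
    """Combine two blocks through their common anchors in the shared species."""
    by_shared = {col[shared]: col for col in block_y}
    merged = []
    for col in block_x:
        partner = by_shared.get(col[shared])
        if partner is not None:
            # The same anchor in the shared species links the two columns.
            merged.append({**col, **partner})
    # Assumption: a block needs at least two anchors per species.
    return merged if len(merged) >= min_anchors else None


# Hypothetical pairwise blocks for species pairs A-B and B-C.
ab = [{"A": "a1", "B": "b1"}, {"A": "a2", "B": "b2"}, {"A": "a3", "B": "b3"}]
bc = [{"B": "b2", "C": "c2"}, {"B": "b3", "C": "c3"}, {"B": "b4", "C": "c4"}]
abc = project(ab, bc, shared="B")
print(abc)
```

Here the two blocks share the anchors b2 and b3 in species B, so the three-way block keeps two columns. Applying `project` again to the result and a block involving a fourth species extends it to a four-way block, and once more to a five-way block.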

For each node of the arthropod phylogeny, N-wise projection was performed on all possible five-species sets to generate five-way synteny blocks. For vertebrates, because of the higher similarity of the genomes and the larger number of possible species sets for certain nodes, generating five-way synteny blocks for all possible species sets is more expensive in time and space. Therefore, for the vertebrate nodes with more than 10,000 possible species sets (Nodes 51, 52, 53, 57, 58, 61, and 66), five-way synteny blocks were produced only for the top 10% of species sets with the largest species set distances; and for the root node of the vertebrate lineage (Node 44), only the top 10% of the species sets containing two fishes and three other vertebrates were considered (Table3.4).
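
The selection of the most distant species sets can be sketched as below. The function name `top_percent_by_distance` and the toy distances are hypothetical, and rounding the number of retained sets up to the next integer is an assumption, since the rounding rule is not specified.

```python
from typing import Dict, FrozenSet, List


def top_percent_by_distance(distances: Dict[FrozenSet[str], float],
                            percent: int = 10) -> List[FrozenSet[str]]:
    """Species sets with the largest distances, keeping the given percentage."""
    # Assumption: round up, and always keep at least one set.
    keep = max(1, (len(distances) * percent + 99) // 100)
    ranked = sorted(distances, key=distances.get, reverse=True)
    return ranked[:keep]


# Hypothetical distances for 20 species sets (labels only, for illustration).
toy = {frozenset({f"set{i}"}): float(i) for i in range(20)}
selected = top_percent_by_distance(toy)
print(len(selected))  # 2
```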

Figure3.7 N-wise projection. The figure shows an example of N-wise projection for three species (A, B, and C). The colored dots represent orthologous anchors. Dots with the same color represent orthologs in the same orthologous group. The pairwise synteny block between species A and B is labeled green, and the pairwise synteny block between species B and C is labeled pink. Because the A-B pairwise synteny block and the B-C pairwise synteny block share common anchors in species B (the red and the yellow dots), the red orthologous anchors and the yellow ones in the three species form a three-way synteny block.

3.3.6. Synteny to orthologous group assignment

To map ortholog-based five-way synteny blocks generated by N-wise projection to OrthoDB orthologous groups, the gene identifiers of the ortholog anchors were substituted with their corresponding orthologous group identifiers and then merged if they were in the same synteny block (Figure3.8). Synteny blocks with only one orthologous group were discarded.

For each synteny block, the orthologous groups in the block were ordered based on their coordinates, which were calculated by averaging the mid-point coordinates of the group member anchors present in the block.

Figure3.8 Mapping five-way synteny blocks to orthologous groups. An example of a synteny block for five species (S1 to S5) is shown: the ortholog anchors in the same orthologous group are labeled with the same color; the gene identifiers (e.g. A1-1, B2, and C4-1) of the ortholog anchors are substituted with their corresponding orthologous group identifiers (e.g. OG-A) and then merged if they are in the same synteny block; the output of the mapping is a synteny block containing three orthologous groups: OG-A, OG-B, and OG-C.
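
The assignment and ordering step can be sketched as follows. The function name `block_to_groups`, the coordinates, and the gene-to-group mapping are hypothetical; the gene and group identifiers follow the example of Figure3.8, and the ordering averages the anchor mid-points of each group as described above.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Anchor = Tuple[str, float]  # (gene identifier, mid-point coordinate)


def block_to_groups(anchors: List[Anchor],
                    gene_to_group: Dict[str, str]) -> Optional[List[str]]:
    """Replace anchors by their orthologous groups and order the groups."""
    midpoints = defaultdict(list)
    for gene, mid in anchors:
        midpoints[gene_to_group[gene]].append(mid)
    # Blocks with a single orthologous group are discarded.
    if len(midpoints) < 2:
        return None
    mean = {group: sum(vals) / len(vals) for group, vals in midpoints.items()}
    return sorted(mean, key=mean.get)


# Hypothetical anchors of one block with made-up mid-point coordinates.
anchors = [("C4-1", 900.0), ("A1-1", 100.0), ("A1-2", 200.0), ("B2", 500.0)]
gene_to_group = {"A1-1": "OG-A", "A1-2": "OG-A", "B2": "OG-B", "C4-1": "OG-C"}
groups = block_to_groups(anchors, gene_to_group)
print(groups)  # ['OG-A', 'OG-B', 'OG-C']
```

The number of groups in the returned list is the quantity stored as the synteny block length.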

3.3.7. Synteny block length calculation

For each synteny block, which contains at least two orthologous groups, the number of groups inside the block was counted and stored as the length of the synteny block.
