Computational identification and analysis of conserved non-coding vertebrate elements


Thesis


DOUSSE, Aline

Abstract

Conserved non-coding sequences are elements that have been selectively preserved in the genomes of vertebrate species over millions of years of evolution. This work presents a new strategy for identifying these elements using a global alignment method. The conserved non-coding elements are stored in a database and are accessible through a web interface. These genomic elements were then analyzed with computational methods. A fraction of them have well-defined boundaries. Some interact with proteins involved in the three-dimensional architecture of the genome, offering new leads as to the function of these elements.

DOUSSE, Aline. Computational identification and analysis of conserved non-coding vertebrate elements. Doctoral thesis: Univ. Genève, 2016, no. Sc. 4907

URN : urn:nbn:ch:unige-822943

DOI : 10.13097/archive-ouverte/unige:82294

Available at:

http://archive-ouverte.unige.ch/unige:82294


UNIVERSITÉ DE GENÈVE

Department of Genetic Medicine and Development, FACULTY OF MEDICINE, Professor Evgeny M. Zdobnov

Department of Computer Science, FACULTY OF SCIENCES, Professor Ron D. Appel

Computational Identification and Analysis of Vertebrate Conserved Non-Coding Elements

THESIS

presented to the Faculty of Sciences of the Université de Genève to obtain the degree of Docteur ès sciences, specialization in bioinformatics

by

Aline Dousse

of

Arconciel (Fribourg)

Thesis No. 4907

Geneva, 2016


Dousse A, Junier T, Zdobnov EM. CEGA-a catalog of conserved elements from genomic alignments. Nucleic Acids Res. 2015; gkv1163. doi:10.1093/nar/gkv1163


ACKNOWLEDGEMENTS

First I would like to warmly thank my thesis director, Prof. Evgeny Zdobnov for giving me the opportunity to join his group and work on such interesting projects. I am very grateful for his early and continuing faith in my capacities to evolve as a complete bioinformatician.

I would like to thank the members of my thesis committee, Prof. Ron David Appel, Prof. Ioannis Xenarios and Prof. Philipp Bucher who invested the necessary time to read and evaluate the present work.

For all the courses, workshops and happy encounters, I would like to thank the Swiss Institute of Bioinformatics and the PhD Training Network in particular.

During my time at CEGG, I greatly appreciated the help, discussions, support and fun times with all the present and past colleagues of the group. So in a few words, thanks to Thomas Junier who first introduced me to this project and helped me start, to Charles Vejnar for his critical thinking and political or tennistic discussions, to Jia Li for her feminine presence and positive spirit, to Adrian Cesar Razquin for his friendship and long-lasting support, to Ismael Padioleau for his comments and support, to Tom Petty for all the passionate discussions and ideas as well as for organizing a lot of cool social events, to Robert Waterhouse for sharing his passion for science and for the good aperos, to Fredrik Tegenfeldt for the “kinder-cafés”, the personal and technical support, to Isabelle Cosandier for the great administrative work and all our conversations, to Panos Ioannidis for his positive spirit and his support, to Felipe Simao for being always pleasant and helpful and sharing his extensive knowledge of the weirdness of the world, to Weihua Chen for his scientific mind and comments, to Francisco Brito for his optimism and confidence, to Evgenia Kriventseva for her encouragements, to Alexis Loetscher for his fresh mind and to Mathieu Cosandier for his help on this project. The time I spent in CEGG will stay in my memory both for scientific and personal aspects.

Geneva failed to conquer my heart, but I have to recognize I still enjoyed my time there and was lucky to meet great and inspiring people, among which I would like to mention Ximena Bonilla, Ivana Gasic and Julien Bryois. Since they are all pursuing their scientific career, I wish them a fulfilling future with science, friendship, adventures, love and happiness.

On a side note, I would like to point out the positive effects of running and yoga, helping me remain calm and focused. For this, I thank all my running partners, the ones who participated or cheered for me during the Escalade or the Geneva Marathon. A special mention should go to Michelle’s Yoga Lab studio, a piece of heaven within Geneva, where I practiced my balance and found some inner peace.

My close friends Joëlle Panchaud, Laura Stucki, Christophe Bovigny and not so geographically close Julien Steiner have been very helpful, providing me with support, good dinners, drinks and endless discussions. Special thanks to Laura who proofread this manuscript even though its topic is far from her usual interests. I am very grateful to my fiancé Lionel Bussard who coped with my ups and downs, helped me remain focused and strong in all situations and always tried to push me a little bit further while remaining very loving and supportive.

I would also like to share my affectionate thoughts for my grandparents Madeleine, Henri and Michel who were always supportive and proud of me. I wish they were still here today so that I could share this happy time with them. I would like to warmly thank my parents Marie-Josée and André, my sister Danielle and my brother Nicolas who believed in me more than I believed in myself and were always there to support me and provided a relaxing and happy environment. In that regard, I want to especially acknowledge my niece Alice, who was born 3 days before I actually started my PhD. I closely witnessed her growing up, and sometimes dreamed that my projects would in some way bloom as much as her. But seriously, what could possibly compete with learning to walk, talk, write and count as well as being so smart and cute at the same time?


TABLE OF CONTENTS

1 ABSTRACT
2 RÉSUMÉ
3 INTRODUCTION
3.1 From Genetics to Genomics
3.2 Comparative Genomics and Evolution
3.2.1 Homology, Orthology and Paralogy
3.2.2 Phylogenetic reconstruction, Phylogenomics
3.2.3 Synteny
3.2.4 Multiple Genomic Alignments
3.3 The Non-Protein-Coding Genome
3.3.1 Gems in the Junk: Conserved Non-Coding Sequences
3.3.2 CNCs Definitions and Identification Methods
3.3.3 CNCs Origins and Evolutionary History
3.3.4 CNCs in Insects and Plants
3.4 Biological Function of the Non-Coding Genome
3.4.1 Genome architecture
3.4.2 CNCs Biological Functions
3.4.3 CNCs Sequences and Locations
3.4.4 Diseases and Traits associated to CNCs
3.4.5 Are CNCs a Collection of Enhancers?
3.5 Aims of the Present Study
4 RESULTS
4.1 Defining a Catalogue of Vertebrate CNCs
4.2 Biological, Functional and Evolutionary Characteristics of CNCs
5 DISCUSSION
5.1 Sensitive, Comprehensive and Scalable Identification of CNCs
5.1.1 Synteny Block Identification
5.1.2 Multiple Sequence Alignments
5.1.3 Conservation Computation
5.1.4 HMM Profile Searches
5.1.5 CEGA Database Perspectives
5.2 Large-scale Computational Characterization of CNCs
5.2.1 CEGA CNCs overall Characteristics
5.2.2 Boundaries
5.2.3 Sequence Uniqueness and AT-enrichment
5.2.4 Genomic Distribution
5.2.5 Distance between CNCs and from CNCs to Genes
5.2.6 Conservation and Binding Sites
5.2.7 Enrichment of Specific Protein Binding
5.2.8 Classification Methods
5.2.9 Functional Clustering
5.2.10 Evolutionary History between Teleost and Amniote Lineages
5.2.11 Vertebrate CNC Origins
5.2.12 Perspectives on CNC Characteristics
5.3 Conclusion
6 REFERENCES
7 APPENDIX
7.1 CEGA supplementary material
7.2 CNCs characteristics, supplementary material


1 ABSTRACT

Comparative genomics is an instrumental strategy in increasing the understanding of the genome, in particular enabling the identification of functional regions. For instance, the first comparisons of human and mouse genomes revealed that approximately 5% of the genome was under functional constraint. Subtracting the known coding fraction of 1.2% revealed the existence of an uncharacterized functional repertoire, larger than the protein-coding one: the Conserved Non-Coding sequences (CNCs). Previous studies on restricted sets of CNCs showed that subsets of these elements bear gene regulatory properties. However, this functional role does not fully explain the extent of conservation and the length of CNCs.

In order to gain new insights into the functional properties of these elements, the first objective of this study was to delineate comprehensive and unbiased sets of CNCs along the vertebrate phylogeny. To this end, a computational pipeline was devised, based on synteny block delineation, multiple global-local alignments and phylogenetic modelling. Its results are stored in a relational database called CEGA (Conserved Elements from Genomic Alignments), which is publicly available through a dynamic web interface.

The CEGA database provided the fundamental sets for large-scale investigations of CNCs. Some previously described findings were recapitulated with our more comprehensive set, such as the positional clustering of CNCs and their vicinity to genes involved in embryonic development. Considering the conservation of distances between elements and genes, CNCs seem to remain at a conserved distance from one another, while they are more independent from the closest gene or even from the genes with shared synteny. This suggests a cooperative role of CNCs rather than a direct association with nearby genes. CNCs have defined starts and ends, but the sequence uniqueness also prevails at the boundaries, where no specific motif was identified. Also, the binding of transcription factors fails to fully explain the overall conserved status of CNCs. Investigations of the biological features of CNCs showed an enrichment in proteins involved in higher-order chromatin structures, implying an involvement of CNCs in genomic architecture. No distinct subset of CNCs could be delineated on the basis of similar biological features, which probably suggests that they share common features as a group. Regarding the evolutionary history of CNCs, all of the studied vertebrate elements were consistent with a faster rate of evolution in the teleost lineage rather than a slower rate of evolution in amniotes. Finally, the vertebrate CNCs could not be detected by sequence homology in invertebrate organisms.

Altogether we present a strategy for consistent identification of CNCs and elaborate on the characteristics of vertebrate CNCs. Our observations highlight that CNCs might function cooperatively and that at least a subset of them is involved in higher-order chromatin structures. Our study thereby also demonstrates the relevance of comprehensive sets of CNCs in large-scale computational studies.


2 RÉSUMÉ

Comparative genomics is a tool of choice for identifying the functional elements of the genome. The first comparisons between the human and mouse genomes revealed that 5% of the human genome sequence is subject to functional constraint. This brought to light a new functional repertoire, more than twice the size of the protein-coding fraction. These elements, whose characterization is far from complete, are grouped under the term Conserved Non-Coding elements, or CNCs. Previous studies showed the role of a limited number of CNCs in activating gene expression. However, this function does not fully explain the extent of CNC conservation.

To better understand the functional properties and the diversity of CNCs, the first objective of this study was to identify these elements exhaustively and objectively in vertebrates. To this end, a computational pipeline was designed around three main steps: the delineation of synteny blocks using orthologous proteins as markers, the multiple alignment of these genomic sequences with a global-local multiple alignment algorithm, and finally the use of a phylogenetic model to identify conserved elements. This strategy led to a relational database called CEGA (Conserved Elements from Genomic Alignments), publicly available through a dynamic web interface.

The content of this database then made it possible to characterize CNCs further. Some results of previous studies, such as the formation of positional clusters and the enrichment of genes involved in embryonic development in the vicinity of CNCs, were verified with our extended set of CNCs. However, the distance between CNCs appears to be conserved, whereas the surrounding genes appear to be independent. This suggests a cooperative role for these elements rather than a direct link with neighboring genes. The uniqueness of CNC sequences also holds at their boundaries, where no motif could be identified. Moreover, transcription factor binding does not fully explain the level of conservation of CNCs.

Analyses of the biological characteristics of these elements did, however, show an enrichment of interactions with several proteins important for the three-dimensional structure of the genome, implicating CNCs in the establishment of these structures. When studying the diversity of CNCs, no subset of CNCs could be clearly identified, suggesting that these elements also share most of their functional characteristics. Regarding their evolutionary history, no subgroup shows a slower rate of evolution in amniotes compared with more basal vertebrate species. This confirms the hypothesis of a faster evolution of CNCs in teleost fish following an additional duplication of their genome. Finally, vertebrate CNCs were not identified by sequence homology in invertebrate organisms.

In summary, this thesis presents a strategy for the complete and unbiased identification of CNCs and provides new leads for understanding their functional properties. In particular, some analyses support the hypothesis of a cooperative and structural role for CNCs. The usefulness of studying comprehensive sets of CNCs at large scale is thereby also demonstrated.


3 INTRODUCTION

3.1 From Genetics to Genomics

In the second half of the 19th century, Gregor Mendel founded modern genetics by studying trait inheritance at an observable phenotypic level in pea plants. He described discrete units of inheritance and rules of heredity at a time when nothing was known about the physical carrier of the inherited material. In the 1910s, these original ideas progressed into the concept of genes representing the determiners, originally present in gametes, by which characteristics of an organism are defined [1]. About half a century after Mendel, Thomas Morgan established a model where genes are linearly arranged, by studying the segregation of mutations in Drosophila melanogaster [2]. In the 1940s, studies from Beadle and Tatum [3] gave birth to the idea of “one gene, one enzyme” by observing that mutations in genes caused defects in metabolic steps. Understanding of the physical basis of heredity came in the same decade, when Avery, MacLeod and McCarty showed that DNA was the molecule carrying the information for bacterial transformation. How DNA serves as the cornerstone of heredity was clarified a decade later through the resolution of its three-dimensional structure by Francis Crick and James Watson [4]. The double-stranded architecture of the DNA molecule was the key to explaining how it can be copied and transmitted, with occasional errors or mutations in one of the copies. Indeed, during DNA replication, each strand is used as a template by the replication machinery to synthesize a new complementary strand, thereby producing two copies. The replication machinery is however not foolproof and sometimes introduces an erroneous complementary nucleotide in the strand.

The discovery of DNA structure also opened the way to new techniques from sequencing to molecular cloning [5,6]. Further technological improvements and automation allowed the sequencing of the human genome by 2001, opening a new era of genomics [7,8], where the whole collection of genes as well as surrounding sequences are now investigated altogether instead of using a one-by-one approach. This initial project paved the way for the sequencing of other species: with improved techniques, the mouse and other vertebrate genomes rapidly followed the publication of the human one [9–11].

3.2 Comparative Genomics and Evolution

The sequencing of the human genome revealed its complexity as well as how little we actually knew about it. Using algorithms to delineate the genes in the genome sequence, only 1.2% of the whole sequence was shown to be encoding proteins [7,8]. It seemed clear that in order to understand the complexity of the genome, other strategies were needed apart from solely decrypting its sequence; this is when the relevance of comparative genomics became obvious [12]. In comparative genomics, large-scale genomic sequences of different species are compared in order to pinpoint the similarities and differences of genomic features and to unravel the evolutionary relationship between organisms [13].

Analyzing sequences in the light of evolution was shown to empower the identification of functional genomic elements already at a time when the full sequences of a few viruses were the only genomes available [14]. Back then, the similarity of viral sequences revealed the shared ancestry of two families of viruses and similarity in gene order. Since then, comparative genomics has greatly increased the understanding of evolutionary processes and provided informed hints about the potential functions of genomic elements. Such inferences are based on one key principle of evolution, first stated by Charles Darwin, that all life forms have evolved from a single common ancestor (the Last Universal Common Ancestor or LUCA) [15]. Therefore, in genomic terms, the genomes of two species sharing a common ancestor both derive from the same ancestral genome. Thus, when examining sequences of extant species, a traceable similarity in sequences across multiple or distant organisms is likely to be evidence that these sequences are evolutionarily constrained. In other words, these regions accumulated fewer mutations than if they were neutrally evolving, likely because the sequences encode critical biological functions that have to be retained [12,16].


However, this does not imply that sequences with no established similarity to others do not bear biologically important roles. Some features are present only in some species or some lineages. For those, sequence conservation will not be identified outside the lineage presenting the characteristics of interest. In some cases, such features could nevertheless be approached by studying generally conserved sequences that display a specific accelerated evolution (increased number of non-neutral nucleotide substitutions) in the species or lineage of interest. With such strategies, comparative genomics is instrumental for identifying functional elements and approaching their significance.

3.2.1 Homology, Orthology and Paralogy

The first step in conducting comparative genomics studies is to identify sequences that derive from the same ancestral region. Such genes, originating from a common ancestor, are called homologous [17]. Homology is further divided into two classes: if the two genes share ancestry because of a speciation event, as shown in Figure 3-1, they are called orthologous. If they are the result of a duplication event, they are termed paralogous [18]. This means that orthologs are found in different species, and originate from a single ancestral gene. Thus, functional features of orthologous genes or sequences are in principle retained from the ancestral species to the extant species [17].

Paralogy, however, does not necessarily result in functionally analogous genes. The duplicated gene is released from the original constraint and allowed to diverge from its copy, which can result in functional divergence as well.

Figure 3-1: Schematic representation of orthology due to a speciation event and paralogy following gene duplication. Simplified scheme from [17].


3.2.2 Phylogenetic reconstruction, Phylogenomics

Complementary and instrumental to comparative genomics, the study of phylogenetic relationships between species is critical for evolutionary analyses. It is closely linked to the concept of a LUCA that gave rise to all extinct and extant species. The relationships between all the resulting organisms can be described using phylogenetic trees, similar to the one Charles Darwin drew in “The Origin of Species” [15], illustrated in Figure 3-2.

Figure 3-2: The tree of life by Charles Darwin, showing how species are distinct from one another and how they diverged from a common ancestor. Image from [15].

With the advances of sequencing, genetics and genomics, homologous nucleic or protein sequences have replaced the shared morphological features as core information to construct phylogenetic trees [19]. Mathematical models based on these sequences have been developed to reconstruct the evolutionary relationship between species.

Identified homologous sequences are used to build a matrix. This step precedes the implementation of one of the three established models further described below [19,20]:

1. Distance methods convert the sequence matrix into an evolutionary distance matrix, in which the distance is the number of changes that occurred between two sequences. This distance matrix is subsequently used to infer the phylogenetic tree, typically using Neighbor-Joining algorithms [21]. This is a very fast and efficient method for sequences presenting little divergence. A weakness of this method is that the observed differences do not exactly reflect the evolutionary distance: the occurrence of multiple substitutions at one site might result in the underestimation of the sequence distance [22].

2. The maximum parsimony criterion minimizes the total number of changes needed to explain the observed data [23]. All possible trees (or a sample of them using heuristics) are scored according to the minimum number of changes or mutations required to produce the observed data. The principle of this method is that the simplest hypothesis capable of explaining the data should be selected. The first drawback of this method is the lack of knowledge of the mutational pathway along the tree. Secondly, “long branch attraction” poses a risk, where similar nucleotides observed as a result of convergent evolution are mistaken for direct inheritance. This results in distantly related organisms being incorrectly inferred as closely related [24].

3. Likelihood methods rely on a probability function giving the likelihood that the observed data result from a given tree. This function requires the relative probabilities of individual events, such as mutation or back-mutation from one nucleotide to another or to a very similar one. All mutational paths leading to the observed data are thus considered in order to find the optimal one [25].
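The weakness noted for distance methods, the underestimation caused by multiple substitutions at one site, can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) correction, which is one standard way to address this problem and is not a method named in this chapter; the function names are hypothetical and chosen for illustration only.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed fraction of differing sites between two aligned sequences.

    Columns containing a gap ('-') in either sequence are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """JC69 estimate of substitutions per site from the observed fraction p.

    Assumption: all substitutions are equally likely (the JC69 model).
    """
    if p >= 0.75:
        raise ValueError("sequences are saturated under JC69 (p >= 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(aligned):
    """Pairwise corrected distances for a dict {name: aligned_sequence}."""
    names = sorted(aligned)
    return {
        (x, y): jukes_cantor(p_distance(aligned[x], aligned[y]))
        for x in names
        for y in names
    }
```

The corrected distance is always at least as large as the observed one, and the gap between the two widens as divergence increases, which is why uncorrected distances are most reliable for closely related sequences. A matrix of this kind is the input a Neighbor-Joining implementation would consume.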

3.2.3 Synteny

Synteny also relates to the similarity of sequences. Originally, it depicted the physical co-localization of genomic loci within an organism [26]. By extension, it also describes conserved gene order between species [27]. In the same way that orthologous genes originate from an ancestral one, their general organization, or synteny, is inherited by the descendant species. The relevance of synteny resides in the understanding of genome rearrangement history. More recently, the concept of “genomic regulatory blocks” has been proposed [28], where genes remain collinear with their regulatory regions through evolution. Furthermore, the delineation of such collinear blocks is an essential step before further comparative investigation, especially genome alignments.

3.2.4 Multiple Genomic Alignments

In comparative genomics, multiple genome alignment is a fundamental approach. While very early sequence comparisons could be done by hand, algorithms were developed to align long sequences more efficiently. Like comparative genomics itself, multiple sequence alignments are based on the assumption that the sequences to be aligned derive from a common ancestor through a series of mutational processes, and they aim to assemble alignments reflecting this biological relationship between sequences [29,30]. This also implies the insertion of gaps for nucleotides in one sequence that have no homologue in the other, due to insertions, deletions or incomplete data [31].

Computation of multiple genomic alignments poses a multitude of problems that require the use of heuristics to be resolved. There are two main strategies for sequence alignment: the local and the global approach. A global alignment seeks corresponding orthologous nucleotides in the other sequence over the whole length of the sequence. The Needleman-Wunsch algorithm typically implements this idea, by tracing an optimal path of similarity between the two sequences from beginning to end [32]. A local alignment strategy, however, looks for orthology of fragments of the sequence and aligns them, while the rest of the sequence remains unaligned [31,33]. The Smith-Waterman algorithm implements this method by optimizing the similarity over segments of all possible lengths from both sequences [34].
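The difference between the two strategies can be made concrete with a minimal sketch of the dynamic programming recurrences. It computes scores only (no traceback) and assumes a simple scoring scheme of +1 for a match, -1 for a mismatch and a linear gap penalty of -1; these values and the function names are illustrative assumptions, not parameters taken from the cited implementations.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    # First row: aligning the empty prefix of `a` to prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # The global score is read from the last cell: both sequences fully used.
    return prev[-1]


def sw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman) with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at zero lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    # The local score is the maximum over all cells, not the last one.
    return best
```

For two sequences that share only a short common core, such as `TTTTACGTTTTT` and `GGGACGTGGG`, the local score rewards the shared `ACGT` segment, whereas the global score is penalized by the unrelated flanks. Both functions fill a table of size proportional to the product of the sequence lengths, which is the cost that makes them impractical for multi-megabase sequences.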

The advantage of a local alignment is its focus on stretches that are homologous but not necessarily collinear, which enables it to handle alignments of regions with genomic rearrangements. Global alignments, however, provide a higher sensitivity by aiming to find homologous nucleotides over the entire sequence. A combination of both strategies also exists and is sometimes called “Glocal” [35]. In that case, local homologous collinear stretches are identified along the sequence and the sequence between them is aligned using a global strategy [36]. In practice, most methods for multiple sequence alignment rely on accurate pairwise alignments first. The challenge at this step is the length of the sequences, in the multi-megabase range [31]. To align such long segments, the classical Smith-Waterman or Needleman-Wunsch algorithms would be too inefficient and time-consuming. Instead, seeded alignment is the method of choice. Various flavors of this strategy are used in different alignment programs: some look for nearby but nonconsecutive exact matches (Blastz [37]), inexact consecutive matches (CHAOS [38], used in LAGAN [36]), maximal exact matches with relaxed third codon position (AVID [39]), or unique maximal matches from a suffix tree (MUMmer [40]) [31]. Seed alignments can then be further extended using local alignments (based on the Smith-Waterman algorithm) or global alignments.
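The general idea of seeding can be sketched as follows. This is a deliberately simplified illustration that uses exact, consecutive k-mer matches and ungapped extension; it is not the seeding scheme of Blastz, CHAOS, AVID or MUMmer, and the function names and the default value of `k` are assumptions made for the example.

```python
from collections import defaultdict


def kmer_seeds(ref, query, k=8):
    """Return (ref_pos, query_pos) pairs where an exact k-mer is shared."""
    # Index every k-mer of the reference by its start positions.
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    # Look up every k-mer of the query in the index.
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            seeds.append((i, j))
    return seeds


def extend_seed(ref, query, i, j, k):
    """Extend an exact seed without gaps in both directions.

    Returns (ref_start, query_start, length) of the maximal exact match
    that contains the seed.
    """
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and ref[start_i - 1] == query[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + k, j + k
    while end_i < len(ref) and end_j < len(query) and ref[end_i] == query[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i
```

Building the index takes time proportional to the reference length, and each query k-mer is resolved by a dictionary lookup, so the quadratic table of the classical algorithms is only needed for the comparatively short regions around or between seeds.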

So far we have described the basics used in pairwise alignments; however, the goal is to obtain multiple sequence alignments. In order to handle multiple sequences, a progressive alignment strategy is usually used, where two sequences, or intermediate multiple sequence alignments, are sequentially aligned according to a guide tree. The generated multiple alignment can be further improved by iterative refinements [31]. The guide tree can either be given by the user or inferred from the pairwise alignments. In more detail, the sequences of the two most closely related species are aligned first, the next closest species is added at the next level, and so on, until all sequences have been aligned. Other strategies are also used: for example, MultiPipMaker [41] and Vista [42] are human-centric, in that they align all species to human. Such methods can be useful when trying to infer annotation from a reference species, but are less accurate than the previously detailed progressive alignments [43]. The seed sequences between pairs of species can also be used to identify the species in which some segments are more similar, independently of the phylogeny, allowing for a regionally improved alignment. DIALIGN [44], for example, implements such a strategy.
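The order in which a progressive aligner visits a guide tree can be sketched as a post-order traversal. The example below assumes a binary guide tree written as nested tuples with species names as leaves; this representation, the function name and the species used in the usage note are hypothetical and serve only to illustrate the principle.

```python
def merge_order(tree):
    """List the alignment steps implied by a binary guide tree.

    `tree` is either a leaf (a string) or a pair (left, right) of subtrees.
    Each step is a pair (left_leaves, right_leaves): the two groups of
    sequences, already aligned internally, that are aligned to each other.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        # Children are processed first, so the closest pairs are aligned
        # before the groups that contain them.
        steps.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return steps
```

With the tree `((("human", "mouse"), "chicken"), "zebrafish")`, the first step aligns human to mouse, the second aligns that pair to chicken, and the last adds zebrafish to the three-species alignment, which matches the closest-first order described above.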

A refinement step is often necessary in order to correct suboptimal alignments of multiple sequences [31]. The positions of gaps can be fine-tuned. Alternatively, in an iterative procedure, one sequence is removed from the alignment, the alignment is re-computed without it, and the final alignment is corrected by comparing the resulting alignments. This method is implemented in MLAGAN, for example [36].

A recent initiative has focused on benchmarking different multiple genome aligners [29]. Its outcome highlighted differences in specificity and sensitivity among the available implementations and algorithms, which also depend on the evolutionary distance between species. All in all, it seems that no single multiple genome aligner is valid for all situations, and that there is still room for improvement in that domain, especially for the handling of translocated or duplicated genomic regions [29].

3.3 The Non-Protein-Coding Genome

From the 1970s, the genome was thought to be composed of a known useful part, the genes, and the rest in between, which bore no biological function and was even termed “junk DNA” [45]. Sequencing of the human genome further revealed that the protein-coding fraction was only 1.2% [7]. So is 99% of our genome, fashioned by millions of years of evolution, really junk? Attention turned to this “junk”, in an attempt to figure out its value. New hypotheses emerged: rather than the number of genes or the size of the genome, the recipe explaining the complexity of organisms could reside in the non-coding part of the genome. John Mattick showed that the proportion of the non-coding genome actually correlates with developmental complexity [46]. Again, the human genomic sequence could not by itself reveal much about the functionality of the non-coding part of the genome; therefore comparative studies came into play.

3.3.1 Gems in the Junk: Conserved Non-Coding Sequences

The revelation of the genomic fraction not coding for proteins was a first surprise. Early comparative studies measuring the conserved fraction of the human genome brought to light a second curiosity: an estimated 5% of the entire genome was shown to be conserved since the divergence of mouse, dog and human [9,47]. When subtracting the 1.2% coding part from that estimate, the remaining 3.8% of the genome represents a repertoire of sequences that must bear a biological function to explain their level of conservation, but this role has yet to be understood. In the present manuscript, this repertoire of genomic elements will be referred to as Conserved Non-Coding sequences and shortened to CNCs.


3.3.2 CNCs Definitions and Identification Methods

A dozen years before the whole-genome sequencing of human was completed, CNCs had already been reported. In studies of the globin gene locus, the alignment of upstream sequences from different species revealed conserved phylogenetic footprints. These sequences were shown to be able to regulate transcription of the nearby globin genes [48,49]. A few years later, Laurent Duret identified similar elements in vertebrate sequence alignments. He defined highly conserved regions as sequences with more than 70% identity over at least 100 nucleotides [50]. Observing no known RNA structure or protein-coding region, he clearly pinpointed the mystery of these sequences: they are conserved over long evolutionary distances because of their functional properties, but they are much longer than the previously identified regulatory elements (up to 75 nucleotides long). While vertebrate whole-genome sequences were not yet available, pairwise comparisons of increasingly long sequences between human and mouse also delineated CNCs at diverse genomic loci [51–57]. At the same time, these reports pleaded for the sequencing of mouse and other vertebrate genomes, in addition to human, so that such studies could be performed at a global scale. At a longer evolutionary distance, studies comparing mouse and fugu sequences underlined the relevance of fish-mammal comparisons. They also presented a method to functionally test CNCs by assessing their regulatory activity via a reporter gene in transgenic animals [58–60].
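An identity threshold of this kind (more than 70% identity over at least 100 nucleotides) can be illustrated with a sliding-window scan over a pairwise alignment. The Python code below is a minimal sketch and not the procedure used in [50]: the function name `conserved_windows`, the treatment of gapped columns as mismatches and the merging of overlapping windows are assumptions made for this example.

```python
def conserved_windows(seq_a, seq_b, window=100, min_identity=0.70):
    """Return merged (start, end) column intervals, end exclusive, covered by
    at least one window whose identity is strictly above `min_identity`.

    seq_a and seq_b are two rows of a pairwise alignment (same length,
    '-' for gaps). Hypothetical helper, written for illustration only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # 1 where both rows carry the same nucleotide; gapped columns never match
    matches = [int(a == b and a != "-")
               for a, b in zip(seq_a.upper(), seq_b.upper())]
    regions = []
    if len(matches) < window:
        return regions
    count = sum(matches[:window])  # matches in the first window
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide by one column: add the entering one, drop the leaving one
            count += matches[start + window - 1] - matches[start - 1]
        if count / window > min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Calling `conserved_windows(human_row, mouse_row)` on two aligned rows returns the merged intervals; the running count keeps the scan linear in the alignment length.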

In 2000, an investigation of the interleukin locus using comparative genomics approaches identified several CNCs, conserved from human to fugu [61]. In this study, the role of a 401-nucleotide-long CNC with 80% identity across mouse, human, rabbit, cow and dog was further examined in vitro.

The Multi-species Conserved Sequences, or MCSs, as their name makes clear, were identified by multiple-species comparison rather than by pairwise study. They resulted from the analysis of 1.8 Mb of human chromosome 7 and its orthologous sequences in 12 vertebrate species [62]. Instead of the usual identity threshold, two independent probabilistic methods were used. With this strategy, 70% of the resulting conserved elements were found to be non-coding, with an average length of 58 bp. Going further in the number of species used, sequences from eight other mammals contributed to the establishment of high-resolution profiles of CNCs [63]. Furthermore, the use of multiple-species comparison allowed the identification of a substantial number of MCSs that could not have been identified by pairwise studies alone, also underlining the varying contribution of each species to the delineation of MCSs. Most (98%) of the non-coding MCSs did not correspond to known regulatory regions, and a similar proportion of MCSs was found more than 1 kb away from transcription start sites (further away than the expected range of location for promoters), raising more questions about the function and the regulatory mechanism of these elements.

Using a more stringent cutoff for the delineation of conserved elements, UltraConserved Elements (or UCEs) were defined as segments of a human-mouse-rat alignment with 100% identity over at least 200 bp. 481 UCEs passed this threshold, 23% of them overlapping coding regions and 53% with no apparent transcription [64]. Even with very stringent cutoffs, these UCEs appeared to remain functionally and evolutionarily heterogeneous: intronic elements seem to be associated with developmental genes, but the intergenic ones often lie in gene deserts, away from any known gene. Also, while some UCEs could be traced to fish, where only a “core” region was conserved, others could only be traced to chicken. This study also underlined the location of UCEs close to genes involved in early development. This finding was reinforced by other studies: 3’583 elements with at least 95% identity over more than 50 bp between human and mouse, and with locally similar sequences in fugu, appeared to be clustered together and located in loci associated with transcription factor genes and other key regulators of development [65]. Another study, in which CNCs of at least 100 bp were found through local sequence similarity searches between human and fugu, reported similar findings [66]. In vivo assays performed on 25 of these sequences showed regulatory activity for 23 of them [66].
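The UCE criterion (100% identity over at least 200 bp in a multiple alignment) amounts to finding long runs of fully identical alignment columns. The following Python sketch illustrates this under simplifying assumptions: the alignment is given as a list of equal-length strings, gapped columns break a run, and the name `ultraconserved_runs` is hypothetical. It is not the pipeline used in [64].

```python
def ultraconserved_runs(alignment, min_length=200):
    """Return (start, end) column intervals, end exclusive, in which every
    row of the alignment carries the same nucleotide.

    `alignment` is a list of equal-length strings ('-' for gaps), e.g. the
    human, mouse and rat rows. Illustrative helper, not a published tool.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same aligned length")
    rows = [row.upper() for row in alignment]
    runs, run_start = [], None
    # Iterate one column past the end so that a trailing run is closed
    for col in range(length + 1):
        identical = (
            col < length
            and rows[0][col] != "-"
            and all(row[col] == rows[0][col] for row in rows)
        )
        if identical and run_start is None:
            run_start = col              # a new run of identical columns begins
        elif not identical and run_start is not None:
            if col - run_start >= min_length:
                runs.append((run_start, col))
            run_start = None             # the run is broken
    return runs
```

Lowering `min_length` or relaxing the `all(...)` test to a tolerated number of mismatches would give less stringent definitions, such as the 95% identity over 50 bp used in [65].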

The opportunity to use multiple genomes and their known evolutionary history to identify conserved elements resulted in the development of new tools for this purpose. Adam Siepel et al. developed a phylogenetic hidden Markov model (HMM), whose parameters are estimated from the data itself, to identify stretches of nucleotides with a higher probability of being in the conserved state of the HMM [67]. Using this tool, they identified conserved elements in vertebrates but also in insects, worms and yeasts. Vertebrate highly conserved elements were found to be significantly enriched in gene deserts, implying a potential distal cis-regulatory function.
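The idea of labelling stretches of an alignment as "conserved" or "neutral" with an HMM can be sketched with a deliberately simplified two-state model decoded by the Viterbi algorithm. This is not the phylogenetic HMM of [67]: here each column is reduced to a 0/1 flag (identical in all species or not), no phylogeny is used, and the emission and transition probabilities are fixed, assumed values rather than parameters estimated from the data.

```python
import math

def viterbi_conserved(columns, p_match=(0.5, 0.95), p_stay=0.95):
    """Label each alignment column 0 (neutral) or 1 (conserved).

    columns: sequence of 0/1 flags, 1 meaning the column is identical
             in all species.
    p_match: assumed probability of an identical column in the neutral
             and in the conserved state, respectively.
    p_stay:  assumed probability of keeping the same state between
             consecutive columns.
    """
    if not columns:
        return []
    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)

    def emit(state, obs):
        p = p_match[state] if obs else 1.0 - p_match[state]
        return math.log(p)

    # score[s]: best log-probability of any state path ending in state s
    score = [math.log(0.5) + emit(s, columns[0]) for s in (0, 1)]
    backpointers = []
    for obs in columns[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            pointers.append(s if stay >= switch else 1 - s)
            new_score.append(max(stay, switch) + emit(s, obs))
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the best final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these assumed parameters, a long run of identical columns is labelled conserved, while isolated identical columns inside a variable region stay neutral because entering and leaving the conserved state is penalized.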

3.3.3 CNCs Origins and Evolutionary History

While the biological function requiring such an extent of sequence conservation remains largely unknown, investigating the evolutionary history and the origin of those elements could potentially provide some hints. First, it has been confirmed that CNCs result from purifying selection and not from residing in regions with a lower mutation rate, also termed mutation cold spots [68]. A significantly lower number of single nucleotide polymorphisms (SNPs) is found in CNCs than in non-conserved regions in the human population [69,70]. While this could result from either hypothesis, CNCs being under purifying selection or lying in mutation cold spots, it already highlights that the process shaping CNCs is still active in the human genome. To discriminate between the two assumptions, allele frequencies were investigated, since differences in mutation rate do not affect the frequency distribution [71,72]. Using the chimp genome as the ancestral state to define the “new” or derived alleles in human, the derived allele frequency was compared between CNCs and non-conserved regions. Within three different human populations, a significant shift towards rare derived alleles was observed in CNCs, revealing the evolutionarily conserved nature of CNCs in human.
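The derived allele frequency comparison can be illustrated as follows. The Python sketch below assumes biallelic SNPs described by an ancestral allele (for instance the chimp allele) and a dictionary of allele counts in a human sample; the function names, the input layout and the 10% cutoff for "rare" are assumptions of this example, not details taken from [71,72].

```python
def derived_allele_frequencies(sites):
    """Return the derived allele frequency (DAF) of each usable site.

    sites: iterable of (ancestral_allele, counts) pairs, where `counts`
           maps each of the two alleles of a SNP to its count in the sample.
    Sites whose ancestral allele is not among the two observed alleles
    are skipped, since the derived allele cannot be identified.
    """
    dafs = []
    for ancestral, counts in sites:
        total = sum(counts.values())
        if len(counts) != 2 or ancestral not in counts or total == 0:
            continue
        derived = total - counts[ancestral]
        dafs.append(derived / total)
    return dafs


def rare_fraction(dafs, threshold=0.1):
    """Fraction of sites whose derived allele is rare (DAF below threshold)."""
    if not dafs:
        return 0.0
    return sum(1 for f in dafs if f < threshold) / len(dafs)
```

Under purifying selection, `rare_fraction` is expected to be higher for SNPs inside CNCs than for SNPs in non-conserved regions, whereas a mere difference in mutation rate would change the number of SNPs but not this fraction.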

The model for gene evolution proposed by Ohno [73] gives a basis for understanding the evolution of CNCs. He proposed that after gene duplication, owing to the relaxation of selective constraints, either one of the duplicates can evolve a new function or the ancestral function can be divided between the duplicates. Accordingly, some duplicated CNCs are able to drive tissue-specific expression that often overlaps [74,75], and some others show distinct expression profiles [76], implying that slight sequence changes can result in functional changes.

CNCs appear to be mostly unique in terms of sequence, so they most likely originated independently from existing genomic elements that diverged and acquired a new function, resulting in their exaptation. Some CNCs have been traced to former exons [77], mobile elements [78,79] or repeats [80]. The general origin of vertebrate CNCs could not easily be traced further back to non-vertebrate species [64,66,81]. Fewer than a hundred conserved elements were identified in the basal vertebrate sea lamprey [82], and only 56 could be traced to a chordate organism, the lancelet or amphioxus [83]. In general, these elements were identified close to regulatory genes shaping body plans [76,84]. Interestingly, a sample of these chordate CNCs was also able to show gene regulatory activity in zebrafish. Looking further away in the phylogeny, only a few elements were identified with minimal similarity between the fruit fly and human [85].

Vertebrate CNCs appear to originate or become fixed in specific lineages. A burst in the number of CNCs was observed in the tetrapod lineage [86], where the constraint on CNCs seems to have increased. It was however also shown that CNCs in teleost fishes evolved more rapidly. CNCs were lost or diverged, due to the release of the evolutionary pressure on regions of their genome, following the additional whole-genome duplication in teleosts [87]. Using local alignments, it was demonstrated that approximately 40% of CNCs present in eutherian species were present before the divergence of the bony fish from the cartilaginous fish. Another 12% originated in the bony vertebrates, 18% in tetrapods, 16% in amniotes and 10% in therians [88]. The evolutionary rate of CNCs varies between vertebrate clades or species, and is likely to be an indicator of their maintained or diverged functionality [89–94].

3.3.4 CNCs in Insects and Plants

Interestingly, CNCs are also found in plants [95,96], yeasts and insects [67]. With the exception of a few elements [85,96], vertebrate CNCs do not display sequence similarity with these. However, as with vertebrate CNCs, they are found close to developmental genes [95,97]. In insects, CNCs are also associated with the maintenance of large syntenic regions [28,98]. Among the genes associated with CNCs in human, 40 have orthologs that are similarly associated with independent CNCs in worms and flies [81]. However, comparisons between vertebrate and insect CNCs revealed minor differences: the latter were less abundant, less frequently associated with transcription factor genes and shorter [99,100].


3.4 Biological Function of the Non-Coding Genome

CNCs represent a small part of the whole non-coding genome. Considering CNCs as a functional part, what is the purpose of the remaining 95% of the genome? While the long-known protein-coding genes represent the easily visible output of the DNA code, the large non-coding genome holds critical functions and is still the subject of thorough investigations. First, it does not seem favorable for organisms to have to replicate such a large amount of genomic sequence every time a cell divides, especially if only 2-5% is of functional interest. It is however still a controversial subject with two prevailing hypotheses: either the non-coding genome is largely non-functional and contains accumulated, residual gene sequences that are no longer needed or selfishly copied DNA [101], or most of the non-coding genome is functional.

Figure 3-3: Percentage of the different non-coding components of the genome. Adapted with numbers from table in [101]


This question was incidentally the subject of interest of a large consortium, the ENCyclopedia Of DNA Elements (ENCODE) [102], which performed large-scale analyses to identify functional elements of the genome.

The precise expression of protein-coding genes requires a fine-tuned regulation at the transcriptional, post-transcriptional, translational and post-translational levels. These regulatory functions are largely undertaken by the non-coding genome. Figure 3-3 summarizes the share of the different non-coding elements in the human genome.

Promoters are one group of non-coding regulatory elements, which specifically initiate transcription of a particular gene [103]. Upstream or downstream untranslated sequences (5’ and 3’ UTRs) also regulate the expression of a gene, by influencing the transcriptional efficiency or the stability of the mRNA, or by causing translational inhibition due to their complex structure or the binding of microRNA (miRNA) [101,104]. DNA can be methylated on cytosine residues, which enables the methylated sites to interact with protein complexes that in turn suppress transcription [105]. Introns, the parts of transcripts that are removed from the mature mRNA and not translated into amino acid sequence, are also able to regulate gene expression. Besides allowing alternative transcripts to be generated from the same pre-mRNA, introns occasionally encompass alternative promoters, miRNA or long non-coding RNA (lncRNA) genes, transcription enhancers, silencers or terminators [101].

Enhancers are non-coding genomic elements on which the binding of transcription factors activates transcription of a gene [106]. Specific proteins were shown to interact with enhancers, namely p300, the global transcription coactivator protein, Mediator complex subunits and RNA polymerase II. Active enhancer sequences are also sensitive to DNase I and are associated with specific histone modifications. The histone H3 in contact with enhancers bears modifications such as the acetylation of lysine 27 (H3K27ac) or the mono- or di-methylation of lysine 4 (H3K4me1 or H3K4me2) [107,108]. Enhancers can be located near or far from the gene they regulate, which requires some looping of the genomic regions, and can be present in multiple copies in the genome, probably in order to increase the stability of transcriptional regulation [109]. Recently, eRNAs, long non-coding RNAs originating from enhancer sequences, were identified and shown to be involved in the formation and stabilization of the enhancer-promoter loop [110].

Chromatin insulators are structural elements that isolate functional domains from each other, either by blocking the enhancer effect or by blocking the spreading of heterochromatin. These regions typically encompass CTCF binding sites, which recruit the cohesin complex and in turn stabilize chromatin interactions [101]. Matrix-attachment regions (MARs) also regulate gene expression through chromatin structural changes. These AT-rich sequences are attached to the nuclear matrix and are dynamically rearranged during differentiation, resulting in changes in gene expression [111].

While all the regulatory elements described above account for a small amount of the non-coding genome, another portion of the “junk” is populated by regions transcribed as non-coding RNA. In the early days of the ENCODE consortium, when the study focused on 1% of the human genome [112], 93% of the genome was found to be transcribed at varying levels throughout a pool of tissues. The pool of all RNAs not encoding proteins or annotated functional transcripts (such as ribosomal RNA, small nuclear or nucleolar RNA, and transfer RNA) was labeled “the dark matter” [113]. This “dark matter” is composed of many different classes of non-coding RNA, mainly described based on their sequence length. The short non-coding RNAs comprise the miRNA, siRNA and piRNA, which are involved in genetic regulation mostly by repressing the translation of specific target genes [114]. The long non-coding RNA (lncRNA) class is the most numerous and the most enigmatic class of non-coding RNA, comprising transcribed sequences of over 200 nt [115]. It is enigmatic because the function of most of its members remains unknown. These transcripts display features similar to those of mRNA, being subject to splicing and carrying a polyadenylation signal. Some described functions of lncRNAs relate to the recruitment of chromatin-remodeling proteins to specific loci, probably affecting the regulation of gene expression [116,117]. Others showed translational regulatory properties [118].

Remnant sequences of genes that have lost their functional properties are also found in the non-coding genome. Usually termed pseudogenes, these elements were originally considered to be inactive. Reports of pseudogenes being transcribed and competing for the binding of miRNA to their target transcripts revealed their regulatory potential [101].

Finally, repeated sequences represent the majority of the human genome. Their functional role remains broadly a mystery except for telomeres and centromeres. Telomeres are the extremities of the chromosomes that ensure their stability and their full replication [119]. Centromeres are structures that join the chromatids and direct chromosome segregation during cell division [120]. Apart from those known repetitive regions, 45% of the human genome is covered by mobile elements: retroelements or DNA transposons that are also largely transcribed and thus potentially participate in regulatory functions [121].

We are still far from appreciating the full complexity of the non-coding genome and its functions. The ENCODE consortium bluntly claimed that 80% of the human genome is functional based on their armada of biochemical assays, contrasting with the 5% being under constraint. This claim was criticized for its liberal use of the term “functional”: being transcribed or associated with chromatin marks does not necessarily imply a real biological role [122].

3.4.1 Genome Architecture

While the focus has mainly been on the sequence of the genome itself, there is more to it than a string of nucleotides. The DNA is actually packaged as a complex with proteins, mainly histones, into chromatin [123]. Chromatin can be more or less condensed, depending largely on its transcriptional activity and the stage of the cell cycle. Chromatin structure is therefore dynamic, and its structural changes depend on the packing proteins, the histones. These are prone to post-translational modifications at their amino acid tails. Various modifications can occur at distinct amino acid positions, which in turn regulate the degree of chromatin compaction or accessibility. For example, active promoters are associated with the di- or tri-methylation of lysine 4 of histone H3 (H3K4me2 and H3K4me3); enhancers are enriched in H3K4me1, H3K4me2, H3K27ac and the binding of the histone acetyltransferase protein p300; and repressed genes are located within regions bearing H3K9me2 and/or H3K9me3 or H3K27me3 modifications [124].

Figure 3-4: The different scales of genome architecture from [125], showing the TADs, the local DNA looping as well as chromosome territories and positioning.

Another level of complexity determines the positioning of chromatin within the nucleus and its three-dimensional structure, as shown in Figure 3-4. Indeed, the genome is not just randomly wound. At the first level, the chromatin needs to be locally organized. Local DNA contacts involve the formation of loops that bring specific enhancer or silencer elements into the proximity of the genes they regulate, or that arrange genes in clusters [126]. CTCF, a transcription factor with repressive and insulator activity, is one protein involved in the formation of DNA loops [127]. At a larger scale, the position of chromatin inside the nucleus is also important and not fortuitous. The periphery of the nucleus is a generally repressive environment. Regions interacting with the nuclear lamina, the lamina-associated domains (also termed Matrix Attachment Regions, MARs), can be mapped, and such domains show a low level of transcriptional activity [128]. These interactions change during cell differentiation and result in the loss of lamina association for some genomic regions. Such repositioning of genes subsequently activates their expression in a cell-specific manner [129].

Additionally, regions encompassing actively transcribed genes are generally grouped together in what is called the A compartment, as opposed to clustered gene-poor regions residing in the B compartment [130]. With further improvements in the techniques, the increased resolution allowed the identification of smaller domains within the A and B compartments. These Topologically Associated Domains (TADs) are defined by marked long-range associations between loci. Unlike the A and B compartments, which are cell-type specific, TADs appear to be rather stable structures with defined boundaries [131].

3.4.2 CNCs Biological Functions

Going back to the conserved part of the non-coding genome, the CNCs, we previously described their ability to regulate gene expression [59,61,66] and their clustering in the proximity of developmental genes [65–67]. The assumption that CNCs function as enhancers was assessed in vivo for a set of elements by Pennacchio et al. [132]. They tested the ability of elements that were either mammalian UCEs, deeply (human–fugu) conserved CNCs, or both, to drive the expression of a reporter gene in mouse embryos. Of the 167 elements tested, 75 showed reproducible activity, with the majority (50 elements) having a reported effect on specific anatomical structures only. Interestingly, one third of the elements showed gene regulatory properties in the forebrain. Using the sequences from this subset of elements, six motifs were identified as significantly over-represented, revealing the possibility of identifying regulatory sequences in CNCs with “extreme” levels of conservation, as well as of predicting, from specific sequence motifs, the anatomical structure in which they are active [60].

The idea of assessing the regulatory properties of CNCs was further expanded to larger sets of elements, giving birth to the VISTA Enhancer Browser [133]. This database makes the results of in vivo assessments of CNCs available. Currently, 2’262 elements are indexed, 1’195 of them with a reported enhancer activity. This raises a legitimate question: what is the other half up to? These elements present similar levels of conservation but were not able to switch the reporter gene on. There could be several explanations for this. First, the presented assays are limited to a specific time point, embryonic day E11.5. Elements could just as well be specifically active at another developmental stage and thus remain silent at the tested time point. Secondly, while there are well-established visual methods to assess the enhancing function of genomic sequences, the assessment of silencing activities is far more arduous. Another study systematically investigated a set of CNCs showing no cis-regulatory property, and 20% of them turned out to bear such a regulatory blocking property [134]. Finally, these sequences could also hold properties other than enhancing the transcription of genes. For example, some elements were previously shown to be involved in post-transcriptional regulation, acting on splicing [135] or RNA editing [136], while another hypothesis is that CNCs are structurally important [68,137].

3.4.3 CNCs Sequences and Locations

By analyzing CNC sequences, investigators attempted to “break the code” of CNCs, trying to understand the reason for their extraordinary level of conservation as well as their functional characteristics. It appeared that CNCs do not share a common vocabulary or similar patterns, or at least none that could be identified. Only a handful of sequences identified as forebrain enhancers appeared to share a common motif [132].

In a different study, 10% of CNCs were shown to contain matrix-attachment region motifs [138]. This observation supports the idea of CNCs being involved in the structural regulation of the chromatin. It could also be related to more general reports highlighting the over-representation of A and T nucleotides in CNCs, especially at CNC boundaries [139,140], since matrix-attachment regions are AT-rich. On the other hand, AT-richness has also been associated with poor nucleosome occupancy [141], which would make CNCs accessible to transcription factors or other interactors. One study identified the enrichment of some motifs at the boundaries of ultraconserved elements [140]. One of the observed motifs contains the core recognition sequence for homeodomain-containing proteins, transcription factors involved in development.

Apart from a general AT enrichment, CNCs appear to be mostly unique in terms of sequence. Furthermore, regions of segmental duplications and copy number variations are significantly depleted in CNCs [142]. This was reported to be the case especially in healthy cells but not in cancer cells [143].

A feature that could be related to both a structural and a regulatory role is the genomic location of CNCs. They seem to form clusters and lie close to developmental genes, but are also spread across large gene deserts [144]. Their organization and order within a genomic region also seem to be conserved [66], and the distance from one CNC to another also appears to be under constraint [145]. Such observations could imply that CNC functionality requires coordination. Investigated in the light of chromatin conformation capture experiments, CNCs were even shown to interact with one another [146]. Another study, examining the retention of CNCs after genomic duplication events, showed that one copy of the duplicated genomic region is more likely to retain all or most of the CNCs while the other one loses all of them [147].

Investigations of the distance between ultraconserved elements showed that it generally follows a power-law-like distribution, underlining an over-representation of elements lying at long distances from one another [148]. The conclusion of this study points towards possible long-range interactions of CNCs facilitated by 3D chromatin folding.
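To make the notion of a power-law-like distance distribution concrete, the Python sketch below computes the distances between consecutive elements on a chromosome and estimates a power-law exponent with the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). The function names and the choice of `x_min` are assumptions of this example, and nothing here reproduces the analysis of [148].

```python
import math

def inter_element_distances(positions):
    """Distances between consecutive elements, given their start
    coordinates on one chromosome (zero distances are dropped)."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:]) if b > a]


def power_law_exponent(distances, x_min):
    """Maximum-likelihood exponent of a continuous power law fitted to
    the distances that are at least `x_min` (an assumed lower cutoff)."""
    tail = [x for x in distances if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one distance above x_min")
    return 1.0 + len(tail) / log_sum
```

A small exponent corresponds to a heavy tail, that is, to many pairs of neighbouring elements separated by long distances. A proper analysis would also test the fit against alternatives such as an exponential distribution.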

Altogether, these observations further suggest that CNCs are retained and lie within specific genomic locations because they are interacting with each other.

3.4.4 Diseases and Traits Associated with CNCs

Related to their function, we could wonder what happens when a CNC sequence gets disrupted: does it result in a visible phenotype, a pathology? Actually, 88% of trait- or disease-associated single nucleotide polymorphisms (SNPs) were shown to reside in intronic or intergenic loci [149]. This sharpened the interest in CNCs, uncharacterized functional elements in the non-coding genome. It appears that polymorphisms associated with human diseases are enriched 1.37-fold in mammalian CNCs [12,150].

Once again, clinicians did not wait for population genomics studies to identify disease-causing loci in CNCs. The first reports associated β-thalassemia with translocation breakpoints or deletions in a CNC in the locus control region of the β-globin gene [151,152].


The role of a CNC in a congenital physical anomaly, preaxial polydactyly, was also established. Interestingly, pathogenic nucleotide substitutions in a CNC are able to disrupt the regulation of the sonic hedgehog gene (SHH), located 1 Mb away, in turn causing the hand deformity [153,154]. The nucleotide positions causing the anomaly are conserved from fugu to human. In the forebrain, impaired SHH regulation, associated with a CNC upstream of this key developmental regulator gene, is responsible for a cerebral malformation, holoprosencephaly [155]. Autism was associated with another genetic deregulation in the forebrain, linked to a polymorphism in a distinct CNC [156].

Hirschsprung disease, a pathology presenting a developmental abnormality of the intestinal tract, has also been associated with a CNC polymorphism. The non-coding element is located in the first intron of the RET developmental gene; pathogenic mutations in the coding part of this gene were already known in relation to the disease [157]. Polymorphisms in this CNC were shown to disrupt the binding of the transcription factor SOX10, impairing the transcription of the RET gene and resulting in variable severity of the phenotype [158,159].

Pierre Robin sequence, another congenital condition presenting facial anomalies, is in some cases associated with CNC polymorphisms or local genomic rearrangements encompassing them [160]. In this case, the associated CNCs are located in a gene desert, more than 1 Mb away from SOX9, a gene encoding a transcription factor involved in embryonic development. SOX9 knockout or direct disruption is lethal, with developmental defects recapitulating the observed phenotype of Pierre Robin sequence together with additional skeletal and sexual defects. Translocations or polymorphisms in CNC loci in the gene desert upstream of SOX9 are associated either with Pierre Robin sequence or with sexual developmental anomalies [161]. These CNCs were further shown to bear enhancer chromatin marks. Altogether, these findings would imply a fine-tuned regulation of SOX9 by the combined enhancer activity of CNCs located in the upstream gene desert. Such a hypothetical mechanism of action would support the idea of additive effects and modularity of enhancers, in the form of CNCs, each responsible for a tissue-specific expression pattern [162].


3.4.5 Are CNCs a Collection of Enhancers?

Various phenotypes, related to diseases as discussed previously as well as non-pathogenic ones, were linked to CNC polymorphisms or disruptions causing a relative change in gene expression [153–156,158–160,163–166]. It was hence a surprise when four highly conserved elements were knocked out in transgenic mice, one at a time, giving birth to fertile offspring with an apparently normal phenotype. Furthermore, even though the individual elements were chosen because of their enhancer properties, no significant change in the expression of the nearby genes was observed [167]. In another study, two gene deserts (1.5 Mb and 0.845 Mb long) encompassing 1’243 CNCs (with more than 70% identity over at least 100 bp) were also deleted in mice, resulting in animals with a life expectancy, overall phenotype and gene expression similar to those of normal animals [168]. Both of these reports have shaken the “conserved-functional hypothesis”.

Could these elements be disposable after all? Of course, it is impossible to check for all possible faint phenotypic changes in the knockout mice, and laboratory transgenic animals give a distorted view of natural conditions. Nonetheless, these arguments fail to completely invalidate the two independent studies.

Interestingly, in one of the first studies of gene expression related to a CNC knockout, the expression of the surrounding genes was compared between normal cells and a cell line derived from transgenic mice in which this element was knocked out. The conclusion of that study is that the tested element is involved in gene activation through the modulation of chromatin structure rather than acting as a classical enhancer: the conserved element does not directly increase the level of nearby gene expression in cells, but actually seems to stimulate more cells to express these genes [61].

On the other hand, the redundant role of regulatory elements (although not ones with a conserved status) has already been demonstrated: two enhancers of a T-cell receptor locus have to be deleted simultaneously in order to produce an observable phenotype [169]. A similar mechanism could therefore be hypothesized for CNCs: regulatory elements with redundancy. However, the deletion of a long stretch of multiple CNCs would not really fit with this proposition if we expect CNCs to coordinately modulate the expression of nearby genes [168].


The lack of a detectable phenotype in CNC knockout mice adds to other arguments challenging the common view of CNCs as enhancers or regulatory elements. These include the undetectable regulatory activity in vivo or in vitro for half of the tested conserved elements [133], the generally heterogeneous characteristics of CNCs [170] and the fact that the degree of conservation of CNC sequences most likely cannot be due solely to the superimposition of transcription factor binding sites (TFBS) on such long stretches [68]. Indeed, the binding of transcription factors does not require perfect sites, as exemplified by the position weight matrices of transcription factor binding motifs. Moreover, the nucleotides lying between binding sites would not be under functional constraint.
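The tolerance of transcription factor binding to imperfect sites can be illustrated with a position weight matrix built from a handful of binding sites. The Python sketch below is a minimal example under stated assumptions: the function names, the pseudocount of 1, the uniform background of 0.25 and the toy sites used in the usage note are all invented for illustration and do not come from a motif database.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length sites.

    Each position is a dict mapping A, C, G and T to
    log2(observed frequency / background frequency).
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, candidate):
    """Sum of the log-odds of each nucleotide of `candidate` under the PWM."""
    return sum(col[base] for col, base in zip(pwm, candidate.upper()))
```

For a matrix built from the toy sites `["TATAAA", "TATAAT", "TATATA", "TACAAA"]`, a candidate that deviates from the most frequent nucleotide at one position still receives a positive score, whereas an unrelated sequence scores negatively. Binding models of this kind therefore do not by themselves require every position of a site to be perfectly conserved.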

An interesting study focused on the interacting partners of CNCs, using 481 ultraconserved elements as baits and analyzing the proteins binding to them [171]. The UCE interactome they presented contains many chromatin remodelers with nucleosome-shifting abilities, as well as transcription factors related to tissue development, but is depleted in heterochromatin-promoting proteins. Comparing the superimposition of motifs in ultraconserved elements with that in enhancers, they further showed that densely packed, or even overlapping, TFBS could explain the conserved status of some ultraconserved elements. The AT-richness of CNC sequences is only partially responsible for the observed superimposition of motifs. Another explanation advanced for the fact that not all conserved positions are direct binding sites is that some transcription factors require a specific genomic context, as shown in another study [172]. Other investigations, looking at allele frequencies of variants from human population genomic data, also underlined that the positions in UCEs corresponding to TFBS were under stronger selective pressure [173].

Nonetheless, transcription factor binding sites and regulatory elements were shown to be generally evolvable and not highly conserved. Transcriptional regulation was compared in five different vertebrate species and revealed large interspecies differences. Species-specific binding of transcription factors appeared to be more frequent than shared binding: neither the sequence motif nor the binding strength was found to correlate with sequence conservation, altogether suggesting a high turnover in regulatory elements and sequences [174]. When enhancers were tested individually using synthetic mutated sequences, only some of the positions affecting their activity were shown to be conserved or to overlap known TFBS, suggesting that highly conserved sites might be
