Long-range and temporal aspects of the genetics of gene expression

Thesis

Reference

Long-range and temporal aspects of the genetics of gene expression

BRYOIS, Julien

Abstract

We aimed to better understand the genetics of gene expression over large distances and over time. In a first project, we performed a cross-sectional eQTL study in LCLs. The statistical power provided by the large sample size allowed us to detect a large number of genes affected by eQTLs (in cis and in trans). Importantly, we provided biological explanations for several trans-eQTLs by showing that they first affected a gene in cis, which led to an effect on gene expression in trans. In a second project, we explored time-related changes in the genetic effects on gene expression. Our results suggest a model where ageing is associated with a loss of genetic control and with the downregulation of genes involved in protein synthesis and oxidative phosphorylation. The downregulation of these genes then appears to lead to the upregulation of genes involved in the lysosome, the spliceosome and the cytoskeleton.

BRYOIS, Julien. Long-range and temporal aspects of the genetics of gene expression. Thèse de doctorat : Univ. Genève, 2015, no. Sc. 4797

URN : urn:nbn:ch:unige-744739

DOI : 10.13097/archive-ouverte/unige:74473

Available at:

http://archive-ouverte.unige.ch/unige:74473

Disclaimer: layout of this document may differ from the published version.

FACULTÉ DE MÉDECINE, Département de Médecine génétique et Développement: Prof. Emmanouil T. Dermitzakis

FACULTÉ DES SCIENCES, Département d’Informatique: Dr. Frédérique Lisacek

Long-range and Temporal Aspects of the Genetics of Gene Expression

THESIS

presented to the Faculté des Sciences of the Université de Genève to obtain the degree of Docteur ès sciences, mention bioinformatique

By Julien Bryois

Of Moudon (Vaud)

Thesis No. 4797

Geneva, 2015

Bryois J, Buil A, Evans DE, Kemp JP, Montgomery SB, Conrad DF, Ho KM, Ring S, Hurles M, Deloukas P, Davey Smith G, Dermitzakis ET. Cis and Trans Effects of Human Genomic Variants on Gene Expression. PLoS Genet. 2014 Jul 10;10(7):e1004461. doi:10.1371/journal.pgen.100446

UNIVERSITÉ DE GENÈVE

Doctorat ès sciences, Mention Bioinformatique

Thesis of Mr. Julien BRYOIS

entitled:

"Long-range and Temporal Aspects of the Genetics of Gene Expression"

The Faculté des sciences, on the recommendation of Mr. M. DERMITZAKIS, full professor and thesis director (Faculté de médecine, Département de médecine génétique et développement), Ms. F. LISACEK, doctor and thesis co-director (Institut Suisse de Bioinformatique), Mr. B. CHOPARD, full professor (Département d'informatique), Mr. S. ANTONARAKIS, full professor (Faculté de médecine, Département de médecine génétique et développement), and Mr. M. GEORGES, doctor (Groupe interdisciplinaire de génoprotéomique appliquée, Université de Liège, Liège, Belgium), authorizes the printing of the present thesis, without expressing an opinion on the propositions stated therein.

Geneva, 13 July 2015

Thesis - 4797 -

The Dean's Office

N.B. - The thesis must carry the preceding declaration and meet the conditions listed in the "Informations relatives aux thèses de doctorat à l'Université de Genève".

Acknowledgement

First of all, I would like to thank my supervisor Professor Manolis Dermitzakis for giving me the opportunity to work on several exciting projects.

I would also like to thank him for his scientific advice and his critical thinking.

In addition, I am very grateful for the opportunity to participate in several courses, workshops and international conferences, such as the Leena Peltonen School of Human Genomics, the American Society of Human Genetics and the GM2 conference.

I also would like to thank Professor Michel Georges, Professor Stylianos Antonarakis, Professor Bastien Chopard and Dr. Frédérique Lisacek for accepting to be part of my thesis committee and for their fair judgment of my work.

I am also very grateful to the NCCR Frontiers in Genetics PhD program for accepting me as one of its students, allowing me to follow classes taught by top geneticists all over Switzerland, and for funding my salary during the first two years of my PhD thesis. The NCCR program also allowed me to perform rotations in different laboratories, which gave me first-hand experience in three very interesting laboratories. I would like to thank Professor Bart Deplancke and Professor Pedro Herrera for accepting me in their laboratories for a rotation and treating me like one of their lab members.

I would also like to thank all my colleagues over the years for their help, scientific advice and interesting discussions. In particular, I would like to thank Alfonso Buil for his technical help in completing this thesis and Andrew Brown for proofreading it. I would also like to thank Halit Ongen, Nikos Panoussis, Olivier Delaneau, Cédric Howald, Ana Viñuela, Marco Garrieri, Maria Gutierrez-Arcelus, Tuuli Lappalainen, Helena Kilpinen, Ismael Padioleau, Alexandra Nica, Alisa Yurovsky, Thomas Giger, Stephen Montgomery, Cibele Masotti, Diogo Meier, Pedro Ferreira, Nicolas Damond, Ximena Bonilla, Aline Dousse, Alexandra Planchon, Deborah Bielser, Luciana Romano and Ancilla Stefani.

I would also like to thank all the generous donors who agreed to donate blood samples for the studies presented in the thesis. Without their contribution, none of the results of this thesis could have been obtained.

Finally, I would like to thank my family for their help and support during these last four years. I would like to specifically thank my mother Carla, my father Christian, my grandparents Otto and Wijntje, my brothers Gaël and Maxime, my sister Larissa and my girlfriend Sophie.

Table of Contents

 

1 ABSTRACT

2 RÉSUMÉ

3 INTRODUCTION

3.1 A brief history of genetics

3.1.1 Monogenic traits

3.1.2 Polygenic traits

3.2 Technological advancement for the investigation of the genetics of polygenic traits

3.3 Genome-wide association studies

3.3.1 GWAS, missing heritability and missing biology?

3.4 From SNPs to common diseases and complex traits

3.4.1 Expression QTLs

3.5 Gene-environment interactions

3.6 Thesis aim

4 RESULTS

4.1 Cis and trans effect of human genomic variants on gene expression

4.2 Age related changes in the genetics of gene expression

4.3 Software and computational use

4.3.1 Computational resources used

5 DISCUSSION

5.1 What is missing in the genetics of gene expression?

5.1.1 Integrating genomic variants effects over multiple molecular phenotypes

5.1.2 Integrating molecular phenotypes over different environments

5.2 Limitations of the results presented in this thesis

5.3 Conclusion

6 REFERENCE

7 APPENDIX

7.1 Normalization of Illumina Infinium HumanMethylation450 array

7.2 Role of cis-eQTLs as expression modifiers affecting penetrance

1 ABSTRACT

Many human traits are the product of both genetics and environment. A large number of studies have recently explored the genetic basis of a number of traits and found that thousands of genetic variants were associated with hundreds of phenotypes. However, most of these variants were found to be located in non-coding parts of the genome, suggesting that they affect traits by modulating gene expression levels. In order to discover the role of these variants at the molecular level and to detect functional SNPs in the human genome, many studies have explored the effect of genetic variation on gene expression; these are known as expression quantitative trait loci (eQTL) studies. In this thesis, I first aim to better understand the genetics of gene expression in cis and in trans by performing a large cross-sectional eQTL study in lymphoblastoid cell lines (LCLs). Cross-sectional studies give us a snapshot of gene expression at a particular time. To better understand the genetics of gene expression and its relationship to ageing, I considered in a second step longitudinal gene expression data from a twin cohort.

In a first project, we performed a cross-sectional eQTL study in LCLs. The statistical power provided by the large sample size allowed us to detect a large number of genes affected by at least one cis-eQTL. In addition, we detected genes with several independent cis-eQTLs, which significantly increased the variance explained for these genes. We also discovered that copy number variants (CNVs) were more likely to affect gene expression than SNPs and estimated the tissue specificity of previously detected cis-eQTLs in an unbiased manner. We also detected trans-eQTLs, both by testing all known SNPs and by concentrating on subsets of functional SNPs, such as non-synonymous variants, trait-associated variants or cis-eQTLs. We provided biological explanations for several trans-eQTLs by showing that they first affected a gene in cis, which led to an effect on gene expression in trans. Finally, we showed that cis-eQTLs have a small effect on many genes in trans. Overall, we show that eQTL studies with a large sample size can be used to find meaningful biological relationships between genes and to better understand the genetics of gene expression.

In a second project, we explored time-related changes in the genetic effects on gene expression. We found that gene expression was moderately stable, mainly because non-genetic effects were weakly correlated over time. Though global genetic control of gene expression was extremely stable over time, we detected many SNPs that had a different effect on gene expression over time, reflecting gene-age interactions. These SNPs were enriched in enhancer regions, suggesting that enhancers are more likely to interact with age than promoters. Finally, we propose a model where ageing is associated with a loss of genetic control and with the downregulation of genes involved in protein production and oxidative phosphorylation. The downregulation of these genes then appears to lead to the upregulation of genes involved in the lysosome, the spliceosome and the cytoskeleton.

2 RÉSUMÉ

Most human traits are influenced by genetic and environmental factors. A large number of studies have recently explored the genetic basis of many traits and discovered that thousands of genetic variants are associated with hundreds of phenotypes. Most of the variants discovered were located in the non-coding part of the genome, suggesting that they affect the various phenotypes through an effect on gene expression levels. A number of studies then explored the role of genetic variants in gene expression in order to detect the molecular effect of these variants; these are known as expression quantitative trait loci (eQTL) studies. In this thesis, I first sought to improve our knowledge of the role of genetics in gene expression close to the gene (cis) and over long distances (trans) by performing a large-scale cross-sectional eQTL study in lymphoblastoid cell lines (LCLs). A cross-sectional study consists of analyzing gene expression at a given point in time in the population. In a second step, I analyzed longitudinal gene expression data from a twin cohort in order to better understand the effect of ageing on the genetic control of gene expression.

In the first project, we performed a cross-sectional eQTL study in LCLs. The statistical power resulting from the large number of individuals tested allowed us to detect a large number of genes affected by at least one cis-eQTL. In addition, we detected some genes affected by several independent cis-eQTLs, which allowed us to substantially increase the proportion of variance explained for these genes. We also discovered that copy number variants (CNVs) were more likely to affect the expression of a gene than single nucleotide polymorphisms (SNPs), and we estimated in an unbiased manner the specificity of eQTLs detected in other tissues. We also detected trans-eQTLs, either by testing every variant or by testing only functional variants, such as non-synonymous variants, variants previously associated with certain phenotypes, and cis-eQTLs. We provided biological explanations for several trans-eQTLs by showing that they first affect a gene in cis, which leads to an effect on the expression of genes in trans. In conclusion, we showed that large-scale eQTL studies can be used to find regulatory relationships between genes and help to better understand the genetics of gene expression.

In a second project, we explored ageing-related changes in the genetics of gene expression. We found that gene expression is only moderately stable over time, mainly because the non-genetic factors affecting gene expression are weakly correlated over time. Although the global genetic control of gene expression is very stable over time, we were able to detect many SNPs that affect gene expression differently over time. These SNPs were found mainly in enhancer regions of the genome, suggesting that enhancers are more likely than promoters to have a different effect over time. Finally, we propose a model in which ageing leads to a loss of genetic control and to a decrease in the expression of genes involved in protein production and oxidative phosphorylation. The decreased expression of these genes then appears to lead to an increase in the expression of genes involved in lysosomes, spliceosomes and the cytoskeleton.

3 INTRODUCTION

3.1 A brief history of genetics

Basic concepts of heredity were probably already known thousands of years ago, as humans began domesticating plants[1,2] and animals[3]. Our ancestors probably observed that offspring were more similar to their parents than to unrelated individuals, which allowed them to select crops with better yield and to tame animals. This process of domestication was a key step that allowed hunter-gatherer societies to become agricultural societies, a new societal system that has allowed modern civilization to thrive.

Until the beginning of the 20th century, the prevalent concept of heredity was that offspring inherited acquired traits in a blend coming from the traits of the mother and the traits of the father. This idea was proposed more than 2000 years ago by two Greek philosophers, Hippocrates and Aristotle, and then later famously proposed by Jean-Baptiste Lamarck in his theory of evolution[4]. Modern genetics began with the rediscovery in 1900 of Gregor Mendel's laws of inheritance. Mendel discovered that the first generation (F1) of a cross between peas of a pure breed (homozygote) had a uniform phenotype corresponding to the dominant phenotype of the F0 generation, arguing against a mix of the traits of the two parents. He also discovered that each gamete contained only one heredity factor by showing that crossing peas from the F1 generation resulted in an F2 generation with the characteristics of the F0 generation in a 3:1 ratio. Finally, he proposed that traits are inherited independently, which is now known to be true only when the underlying genes are not located in close proximity.
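The uniform F1 generation and the 3:1 ratio in the F2 generation can be reproduced by enumerating the equally likely gamete combinations of a single-gene cross. The sketch below is only an illustration: the allele labels `A` (dominant) and `a` (recessive) and the function names are assumptions introduced here, not notation taken from Mendel's work.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count the equally likely offspring genotypes of a one-gene cross.

    Each parent is a two-letter genotype such as "AA", "Aa" or "aa";
    each gamete carries exactly one of the two alleles.
    """
    offspring = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort the alleles so that "aA" and "Aa" count as the same genotype.
        offspring["".join(sorted(allele1 + allele2))] += 1
    return offspring


def phenotype_counts(genotypes: Counter) -> Counter:
    """Collapse genotypes into phenotypes, assuming 'A' is fully dominant."""
    counts = Counter()
    for genotype, n in genotypes.items():
        counts["dominant" if "A" in genotype else "recessive"] += n
    return counts


f1 = cross("AA", "aa")  # cross between the two pure-bred F0 lines
f2 = cross("Aa", "Aa")  # cross between two F1 individuals

print("F1 genotypes:", dict(f1))
print("F2 genotypes:", dict(f2))
print("F2 phenotypes:", dict(phenotype_counts(f2)))
```

All F1 offspring come out heterozygous and therefore show the dominant phenotype, while the F2 phenotype counts fall in the 3:1 ratio described above.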

The physical support of heredity was not known until 1944, when an experiment by Avery, MacLeod and McCarty showed that DNA was the support of genetic information[5]. This result contradicted the popular opinion at the time that proteins transmitted genetic information. In 1953, the discovery of the structure of DNA by James Watson and Francis Crick implied a mechanism for the transmission of genetic information[6]. Indeed, a double helix of complementary strands indicated that the single heredity factor per gamete discovered by Mendel could be a single strand of DNA.

3.1.1 Monogenic traits

All the traits that Mendel examined in peas were monogenic (due to a single gene), and were either dominant or recessive. Many human traits and diseases were also found to follow Mendelian laws of inheritance, indicating that a single DNA region was responsible. For example, earwax type (wet or dry) is a monogenic trait that is due to a single polymorphism in the ABCC11 gene[7,8], with the wet type of earwax being dominant over the dry type.

Several diseases in humans are also known to be monogenic. Some are autosomal dominant, such as Huntington’s disease or neurofibromatosis, while others are autosomal recessive, such as cystic fibrosis, Tay-Sachs, beta thalassemia or sickle cell anemia.

Monogenic diseases are usually rare because of purifying selection, although some monogenic diseases, like sickle cell anemia, are more prevalent because they are under balancing selection. Indeed, individuals heterozygous for the sickle cell anemia allele are protected from malaria due to a change in the shape of their red blood cells upon infection by the Plasmodium falciparum parasite, which leads to selective phagocytosis of infected cells and prevents replication of the parasite[9]. Another monogenic disease, cystic fibrosis, is highly prevalent in European populations (1 in 25 is a carrier). This high prevalence has been hypothesized to be the result of balancing selection, as heterozygotes could have been protected from tuberculosis during the 17th century[10].

Because alleles located in close proximity are usually inherited together, a phenomenon that gives rise to linkage disequilibrium (LD), genetic markers (variable DNA regions in the population with a known chromosomal location) were used to identify the chromosomal region associated with monogenic diseases by looking at the co-segregation of traits and genetic markers in families affected by these diseases. These linkage analysis studies were very successful in establishing the molecular cause of a number of disorders[11]. For example, a linkage analysis performed in 1983 discovered that the gene responsible for Huntington’s disease was located on the short arm of chromosome 4[12], which later led to the discovery that the disease is due to a large number of trinucleotide repeats in the Huntingtin gene[13].

3.1.2 Polygenic traits  

Although several human phenotypes are monogenic, the vast majority of common diseases and traits follow a more complex pattern of inheritance. These traits, known as polygenic, include height, body mass index and skin color. In addition to being polygenic, the traits listed are also influenced by many environmental factors, such as food intake or sun exposure. This complicated inheritance, together with complicated environmental influences, earns these traits the qualification of complex. Several diseases, such as schizophrenia[14], diabetes, heart diseases, several types of cancer, or hypertension[15], have been considered to be complex for many years, as they were more prevalent in certain families but were not inherited according to Mendel’s laws. Although many diseases and traits were suspected to be complex, formal proof of the polygenic nature of a trait was not easy to obtain, as the familial environment and individual environments could in theory lead to the variability observed between and within families.

In order to tackle this problem and quantify the different sources of variation of a trait, geneticists performed twin studies. The main assumption of twin studies is that if a trait is more phenotypically correlated among monozygotic twin pairs than among dizygotic twin pairs, then the difference is due to genetics. This is because monozygotic twin pairs share 100% of their DNA, while dizygotic twin pairs share on average only 50% of their DNA. Another important assumption of twin studies is that both types of twins come from the same familial environment. These assumptions allow the partitioning of the variance of the phenotype into components due to genetics, to a common environment and to a unique environment. The proportion of the total variance of a trait that is due to genetics is called the heritability of the trait and can be reported using two different parameters. The first parameter (H²), called broad-sense heritability, captures all genetic effects in the population, including non-additive effects such as dominance. The second parameter (h²), called narrow-sense heritability, only captures additive effects and is usually the parameter of interest since additive effects make the greatest contribution to variance and because the phenotypic covariance among relatives is only due to additive effects (with the exception of full siblings)[16,17]. Narrow-sense heritability (h²) is usually obtained in twin studies under the assumption that the difference in phenotypic correlation between monozygotic twin pairs and dizygotic twin pairs is only due to additive effects[18]. Twin studies have shown that many common diseases and common traits are genetically controlled and have quantified the role of genetics in their total variance. For example, twin studies found that height was 70-90% heritable[16,19], that lifespan was 25% heritable[20], that BMI was 40-70% heritable[21,22] and that Alzheimer’s disease was 74% heritable[23]. One important point is that heritability estimates are only valid for a specific population in a specific environment. For example, the variance in the blond hair phenotype in a Chinese population is purely environmental (h² = 0), while in a British population the variance in the blond hair phenotype is partly due to genetics. Overall, positive heritability estimates coupled with non-Mendelian patterns of inheritance showed that many traits and common diseases were indeed polygenic.
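The partitioning of variance from twin correlations can be illustrated with Falconer's classical formulas, which are one common way of turning the assumptions above into numbers; the text does not state which estimator the cited studies used. In this minimal sketch the function name and the twin correlations are hypothetical, and the variance is assumed to split into an additive genetic part (h²), a common-environment part (c²) and a unique-environment part (e²).

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Estimate variance components from twin-pair correlations.

    Assumptions (Falconer's formulas): monozygotic pairs share all additive
    genetic effects, dizygotic pairs share half of them, and both types of
    twins share their common environment to the same extent, so that
        r_mz = h2 + c2   and   r_dz = h2 / 2 + c2.
    """
    h2 = 2.0 * (r_mz - r_dz)  # narrow-sense heritability
    c2 = 2.0 * r_dz - r_mz    # common (shared) environment
    e2 = 1.0 - r_mz           # unique environment, including measurement error
    return {"h2": h2, "c2": c2, "e2": e2}


# Hypothetical correlations, chosen only for illustration.
components = falconer_ace(r_mz=0.8, r_dz=0.5)
for name, value in components.items():
    print(f"{name} = {value:.2f}")
```

With these made-up correlations, about 60% of the variance is attributed to additive genetics and 20% each to the common and unique environments. The three components always sum to 1 by construction.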

Although twin studies proved that many traits were genetically controlled, they did not provide information about the genes affecting the traits. Identifying these genes is a key goal in order to understand the biology of a trait and to develop potential therapies. As linkage analysis studies were successful in the discovery of the genetics of monogenic diseases, many attempts were made between 1990 and 2000 to use them in order to understand the genetic basis of polygenic traits[24]. However, mapping of complex traits proved to be more difficult. Indeed, a review found that only 34% of the published studies on complex traits reported genome-wide significant results, according to criteria proposed by Lander and Kruglyak[24,25]. In addition, many results of whole-genome scans performed on the same disease by different groups were inconsistent. This is not surprising, as the main hypothesis of linkage analysis studies is that a trait is affected by only a few highly penetrant alleles, which is now known to be false for the vast majority of common diseases and complex traits. In order to circumvent the difficulty of whole-genome linkage analysis for polygenic diseases, many groups performed candidate gene association studies. In such studies, a gene is chosen based on its biological function, and genetic markers in its vicinity are tested for a difference in frequency between individuals affected by a disease and healthy individuals. Candidate gene studies were found to produce many false positives, as most results did not replicate, probably owing to improper multiple testing corrections[26].

In conclusion, the polygenic nature of complex traits was difficult to ascertain without a high number of polymorphic genetic markers that could capture most of the genetic variation in the population, without very large sample sizes and without access to the human genome sequence.

3.2 Technological advancement for the investigation of the genetics of polygenic traits

A large number of technical breakthroughs happened at the beginning of the 21st century that now allow us to interrogate the genetics of complex traits. First, a major effort by the public and private sectors led to the sequencing of the human genome in 2001[27,28]. This showed us that the human genome is composed of approximately 3 billion base pairs and contains 20’000-25’000 genes[29], a number similar to the number of genes of the worm C. elegans, a much simpler organism[30]. In addition, the protein-coding fraction of the genome was found to be approximately 1.2% of the total genomic sequence, raising questions about the function of the rest of the human genome. Furthermore, the average human gene length was found to be very similar to that of Drosophila genes. One key difference, however, was that human genes were found to be composed of more exons, suggesting that alternative splicing could expand the protein-coding repertoire in humans.

Despite being a major advance, the sequencing of the human genome was only a first step in understanding the genetics of complex traits, as it did not provide information on the variability of the genome in the population. A first comparison between the public and private human genome sequences discovered between 1.4 and 2.1 million single-base differences, called single nucleotide polymorphisms (SNPs), with approximately 1% of them located in protein-coding sequences[27,28]. Similar to the genetic markers used to map the genetic basis of monogenic diseases, but with the advantage of being more frequent in the genome, SNPs can be used to map the genetics of complex traits, as they can either directly affect a trait or be correlated with an unknown causal variant. Therefore, the characterization of a large number of SNPs in the population was a major requirement in order to map the genetics of complex traits.

In order to obtain a set of common SNPs that would capture most of the genetic variability in human populations, the HapMap consortium started an ambitious genotyping project in four different populations. The aim of the HapMap consortium was to genotype SNPs every 5000 bases with a minor allele frequency >5%, characterize their frequency in the population and measure the correlation between them[31]. The HapMap results showed that recombination hotspots are common in the human genome and that the pattern of correlation between genomic markers follows a block-like structure of linkage disequilibrium. In addition, it was observed that the haplotype diversity was low, indicating that information about only a subset of SNPs was sufficient to infer the genotype of the remaining SNPs in the genome[32]. In the first phase of the HapMap project, 1 million SNPs were genotyped and characterized. This number was increased to 3.1 million SNPs in the second phase of the project[33].
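The correlation between two SNPs is commonly summarized by the squared correlation coefficient r² computed on phased haplotypes; whether this exact statistic is the one intended in the paragraph above is an assumption. The sketch below uses made-up haplotypes and a hypothetical function name, with alleles coded as 0 (reference) and 1 (alternate).

```python
def ld_r2(haplotypes):
    """Squared correlation (r^2) between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp_a, allele_at_snp_b) pairs,
    one pair per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                     # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when a SNP is monomorphic")
    return d * d / denominator


# Made-up data: the two SNPs always travel together (perfect LD) ...
perfect_ld = [(1, 1)] * 4 + [(0, 0)] * 6
# ... versus all four haplotypes being equally frequent (no LD).
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(f"perfect LD: r2 = {ld_r2(perfect_ld):.2f}")
print(f"no LD:      r2 = {ld_r2(no_ld):.2f}")
```

When r² is close to 1, genotyping one SNP is enough to infer the other, which is the idea behind using a subset of SNPs to capture the remaining ones.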

Human genomes not only differ at many single base pairs but also carry structural variation. One major type of structural variation is copy-number variation. Copy-number variants (CNVs) are segments of DNA ranging in size from kilobases (kb) to megabases (Mb) that can be duplicated or deleted and may account for 13% of the human genome[34]. CNVs were detected and characterized in the HapMap population using comparative hybridization arrays, which resulted in the discovery of a total of approximately 5000 polymorphic CNVs[35]. The characterization of CNVs could be important in order to better understand the genetics of complex traits because most of the bases that differ between two genomes are located in CNVs[35]. Furthermore, several rare CNVs are known to dramatically affect disease risks. For example, a rare CNV at the 16p11.2 locus is associated with schizophrenia with a large odds ratio (8 to 26) if the locus is duplicated[36] and with autism and obesity if the locus is deleted[37,38].

The 1000 Genomes Project was launched in 2008 in order to expand the HapMap results to a comprehensive list of all SNPs with minor allele frequency >1% and of all common structural variants. The pilot phase of the project assessed genomic variation using a combination of several approaches: low-coverage high-throughput sequencing of 167 individuals from four different populations, deep sequencing of two trios from two populations and exon-targeted high-throughput sequencing of 697 individuals from seven populations[39]. Results from the pilot phase of the 1000 Genomes Project extended the HapMap results to approximately 15 million SNPs, 1 million short insertions and deletions and 20’000 structural variants. Interestingly, it was observed that each genome contained approximately 250 to 300 putative loss-of-function variants that could potentially affect human health. In addition, the sequencing of the two trios made it possible to estimate that the average rate of de novo mutations is 10⁻⁸ per base pair per generation, which corresponds to approximately 60 de novo mutations per generation. In a second step, the 1000 Genomes Project derived a haplotype map of 38 million SNPs, 1.4 million indels and 14’000 CNVs from 1092 individuals originating from 14 populations[40]. The increased characterization of SNPs, coupled with the development of imputation techniques, now allows us to genotype only a fraction of all common SNPs and from this subset infer the genotype of several million SNPs, increasing the likelihood that a causal variant can be directly tested for association with a complex trait[41].
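The figure of roughly 60 de novo mutations per generation can be checked with a short calculation. The sketch below assumes that the quoted rate is expressed per base pair per generation, reuses the approximate genome size of 3 billion base pairs mentioned earlier, and counts both parental copies of the genome; the constant and function names are illustrative.

```python
# Assumption: the rate is per base pair per generation.
MUTATION_RATE_PER_BP = 1e-8
# Approximate size of one (haploid) copy of the human genome, in base pairs.
HAPLOID_GENOME_SIZE_BP = 3e9
# A child inherits one copy from each parent.
GENOME_COPIES = 2


def expected_de_novo_mutations(rate_per_bp, haploid_size_bp, copies=GENOME_COPIES):
    """Expected number of new mutations carried by one child."""
    return rate_per_bp * haploid_size_bp * copies


expected = expected_de_novo_mutations(MUTATION_RATE_PER_BP, HAPLOID_GENOME_SIZE_BP)
print(f"Expected de novo mutations per generation: {expected:.0f}")
```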

Overall, the sequencing of the human genome, the characterization of its variation in human populations, the development of dense genotyping chips and the development of statistical tools to impute SNPs that are not directly genotyped are now allowing us to perform large-scale genome-wide association studies between a large number of genomic variants and complex traits.

3.3 Genome-wide association studies

The main assumption of genome-wide association studies (GWAS) is the common variant/common disease hypothesis[42,43], which assumes that much of the genetic risk of common disease is due to a moderate number of variants with relatively high frequencies in the population (figure 3.3.1). The basis for this assumption lies in the observation that the vast majority of genetic differences in the population are common. Indeed, it has been estimated that the genomes of two individuals are approximately 99.9% identical and that 90% of the population-level variation is due to common variants[31]. The principle of GWAS is relatively straightforward: a population of individuals affected by a disease and a population of unaffected individuals are genotyped. Additional SNPs are then imputed from the genotyped SNPs using the HapMap or 1000 Genomes reference panel in order to obtain the genotypes of several million SNPs in the two populations. The frequency of each SNP is then compared between the affected and healthy populations. Any SNP with sufficient evidence that its frequency differs between the two populations (p-value below the genome-wide significance threshold of 5 × 10⁻⁸) is called significant and becomes a strong candidate for association with the disease or the complex trait. The stringent genome-wide significance threshold (5 × 10⁻⁸) ensures that the proportion of false positives remains low and is the result of a Bonferroni correction using the estimated number of independent loci in the genome of a European population (1 million)[44,45]. Beyond this, the gold standard for unequivocal association between a genetic variant and a complex trait is the replication of the detected association in an independent cohort.
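The threshold and the frequency comparison described above can be sketched in a few lines. The example assumes a nominal significance level of 0.05 (not stated in the text) divided by 1 million independent loci, and it uses a simple chi-square test on allele counts as a stand-in for the association test; real GWAS typically use regression models with covariates. The allele counts and function names are hypothetical.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def allelic_chi2_test(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns the chi-square statistic and its p-value.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Assumed nominal alpha of 0.05 and 1 million independent loci.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical allele counts: 2000 cases (4000 alleles), 3000 controls (6000 alleles).
chi2, p_value = allelic_chi2_test(case_alt=1400, case_ref=2600,
                                  control_alt=1800, control_ref=4200)

print(f"genome-wide threshold: {threshold:.1e}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
print("genome-wide significant:", p_value < threshold)
```

In this made-up example the alternate allele frequency is 35% in cases and 30% in controls. The resulting p-value is very small by conventional standards yet still does not pass the genome-wide threshold, which illustrates how stringent the correction is.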

The first major GWAS was published in 2007 by the Wellcome Trust Case Control Consortium[46]. The consortium looked for associations between SNPs and 7 major diseases in populations of ~2000 affected individuals per disease and 3000 shared controls. They found 1 association for bipolar disorder, 1 for coronary artery disease, 9 for Crohn’s disease, 3 for rheumatoid arthritis, 7 for type 1 diabetes and 3 for type 2 diabetes. The success of this early GWAS showed that the genetic basis of complex traits could finally be ascertained and led to a wave of GWAS for many complex traits and common diseases. Over the years, increasing sample sizes have led to the detection of an increasing number of loci associated with complex traits. For example, a recent GWAS in more than 250’000 individuals found that 697 variants, located at 423 loci, were associated with height[47]. A GWAS of body mass index (BMI) in more than 335’000 individuals identified 97 associated loci[48]. Although the BMI study was better powered than the height GWAS, far fewer loci were detected, indicating that the genetic architecture of BMI is different from that of height. Another GWAS in 220’000 individuals detected 49 loci associated with body fat distribution (waist-hip ratio adjusted for BMI)[49]. Common diseases were also investigated using a large number of individuals, which resulted in the discovery of a large number of loci affecting disease susceptibility. For example, 163 loci were found to be associated with inflammatory bowel diseases[50] and 108 loci with schizophrenia[51].


Figure 3.3.1 Relationship between allele frequency and penetrance. Highly penetrant alleles are usually selected against and can only subsist in the low allele frequency spectrum. On the other hand, common variants with only marginal effects on disease susceptibility are less likely to be selected against, especially if their effect on health is only apparent after reproductive age. Figure adapted from McCarthy et al. [52].

Most genome-wide association studies have focused on the role of SNPs in the genetics of complex traits, as SNPs are easy to measure, can be imputed with high accuracy and capture most of the common genetic variance in the population. However, structural variants such as CNVs could also play an important role in complex traits because they affect large parts of the genome[35]. A first GWAS focusing on CNVs, performed by the Wellcome Trust Case Control Consortium, used a sample of 19,000 individuals to investigate the role of 3,432 common CNVs (estimated to account for ~50% of all common CNVs larger than 500 base pairs) in eight diseases. Although the CNV study had a larger sample size than the first SNP GWAS, it was less successful. Indeed, associations were found for only 3 diseases, each involving a single CNV, and all had previously been identified using SNPs[53]. This led the authors to conclude that CNV-based GWAS are hampered by technical difficulties in properly measuring CNVs and that the effort is not worthwhile, as most CNVs are well tagged by SNPs[53].

Overall, genome-wide association studies were very successful in mapping the genetics of complex traits and found more than 14,000 SNP-trait associations for more than 600 traits[54].


3.3.1 GWAS, missing heritability and missing biology?

 

Although genome-wide association studies were very successful in mapping the genetics of complex traits, the analysis of the trait-associated SNPs posed two main problems. First, an analysis of GWAS results showed that 88% of trait-associated SNPs were located in non-coding parts of the genome, with 45% of the SNPs located in introns and 43% in intergenic locations[55]. This suggests that most trait-associated SNPs do not affect the trait by modifying the structure of a protein but instead by modifying the expression level of a nearby gene. Identifying the affected gene, a key step in understanding the biology of the examined traits, can be relatively difficult, as simply assigning each associated SNP to its closest gene is likely to result in false positives, especially for SNPs located in gene-rich regions[56].

The second issue that arose in the initial analysis of GWAS results was the problem of missing heritability[57]. For example, the first 40 loci associated with height captured only 5% of the phenotypic variance, despite a sample size of tens of thousands of individuals. In addition, the first 18 loci associated with type 2 diabetes explained only 6% of its heritability. Several explanations for the missing heritability were proposed: many more common variants of small effect size, the cumulative effect of many rare variants (MAF<1%) with moderate penetrance, structural variants poorly linked to tag SNPs, gene-gene interactions, or inadequate accounting for the shared environment, which could have inflated the heritability estimates[57,58].

The missing heritability problem has now been partially solved by showing that all common SNPs taken together explain 45-60% of the variance in height, indicating that a large part of the missing heritability lies in weak effects of common SNPs. Furthermore, this result is likely to underestimate the true variance explained by common SNPs, as more heritability might be hidden by the imperfect LD between the tested SNPs and the causal SNPs[47,59]. A similar approach was used in a GWAS of BMI and showed that all SNPs taken together explained ~22% of the variance in BMI, which represents approximately 30-55% of its heritability[60]. Finally, it was recently estimated that the genomic inflation (an excess of low p-values compared to what is expected by chance) often observed in GWAS is likely due to polygenic effects and not to biases such as population stratification[61].

Overall, the success of GWAS was mitigated by the difficulty in interpreting the biology of complex traits from trait-associated variants and by the low amount of heritability explained by the discovered variants. Although evidence now suggests that a large part of the missing heritability is due to common variants of small effect size, identifying such variants and understanding their biological effects is likely to be difficult using traditional genome-wide association studies. One avenue is the use of molecular phenotypes, as the detection of variants affecting molecular phenotypes could help to prioritize functional variants in traditional genome-wide association studies and to assign biological effects to trait-associated SNPs.


3.4 From SNPs to common diseases and complex traits

All organism-level phenotypes are the result of intermediate phenotypes, as each functional variant first impacts either the regulation of a gene or the structure of its protein. The sum of different cellular phenotypes then leads to an organ or tissue level phenotype that will ultimately lead to a whole-organism phenotype and potentially to diseases (Figure 3.4.1).

Figure 3.4.1 Whole-organism phenotype as a sum of molecular phenotypes affected by DNA variants. Adapted from Dermitzakis[62].

Technological developments, such as microarrays and high-throughput sequencing, now allow the measurement of a large number of different cellular phenotypes in a genome-wide manner, which can help to bridge the gap between genotype and whole-organism phenotypes. For example, mapping of SNPs affecting chromatin accessibility found more than 8,000 chromatin locations significantly affected by a nearby SNP[63]. Interestingly, these SNPs were strongly enriched in transcription factor binding sites, suggesting that their effect on chromatin accessibility results from differential binding of transcription factors. In addition, 16% of the SNPs affecting chromatin accessibility also affected gene expression, suggesting a molecular mechanism linking the mutations to a change in chromatin and to an effect on gene expression. Other studies found that hundreds of CpG sites in the human genome were methylated differently depending on the genotype of nearby SNPs[64-67], which could also be the result of impaired transcription factor binding sites. Finally, one study found hundreds of SNPs affecting the decay rate of mRNA, showing that SNPs can affect not only the epigenetic landscape but also the degradation rate of mRNA[68].

The mapping of diverse molecular phenotypes is a crucial step in order to detect functional elements in the genome, to understand their primary effects and to understand the biology of complex traits.


3.4.1 Expression QTLs

 

One major intermediate phenotype linking genetic variants to whole-organism phenotypes is gene expression. It was observed that 88% of SNPs associated with whole-organism phenotypes are located in non-coding parts of the genome[55]. As genes are the functional units of an organism, trait-associated SNPs that are not located in protein-coding sequence are likely to act on the whole-organism phenotype through changes in gene expression levels. This hypothesis was confirmed by the discovery that trait-associated variants were more likely to affect gene expression than matched SNPs[69]. SNPs affecting gene expression are called expression quantitative trait loci (eQTLs) and are usually divided into two categories: cis-eQTLs, SNPs that directly affect gene expression and are located in close proximity to the gene, and trans-eQTLs, which act indirectly on gene expression, usually from a large distance (Figure 3.4.1.2).

 

Figure 3.4.1.2 Detection of cis and trans-eQTLs in a population of individuals. The density plot shows that individuals with the AA genotype are less frequent in the population but have a higher expression level than individuals with the AG or GG genotype. Adapted from Dermitzakis[70].
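The cis/trans distinction and the association between genotype and expression can be illustrated with a minimal sketch. The 1 Mb window around the transcription start site is an assumption chosen for illustration, the function names are hypothetical, and the additive model (expression regressed on the number of copies of one allele) is a common simplification rather than the exact procedure of any study cited here.

```python
def is_cis(snp_chrom: str, snp_pos: int,
           gene_chrom: str, tss_pos: int,
           window: int = 1_000_000) -> bool:
    """True if the SNP lies within `window` bp of the gene's TSS (assumed cis definition)."""
    return snp_chrom == gene_chrom and abs(snp_pos - tss_pos) <= window

def eqtl_effect(dosages, expression):
    """Slope and r^2 of a simple linear regression of expression on allele dosage (0, 1, 2)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared

# Hypothetical data: dosage counts copies of the A allele (GG=0, AG=1, AA=2).
dosages = [0, 1, 2, 0, 1, 2]
expression = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
slope, r_squared = eqtl_effect(dosages, expression)
```

A positive slope here corresponds to the situation in the figure, where each additional A allele is associated with higher expression.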

The first eQTL study was performed in a yeast cross in 2003 and found hundreds of cis-eQTLs, as well as a few trans-eQTLs[71]. Shortly after, it was shown that gene expression was highly variable in humans and more similar between individuals from the same family than between unrelated individuals, implying that gene expression is heritable[72]. The first eQTL study in humans was published in 2004 and found that a large number of genes were affected by eQTLs[73]. One important discovery of this first human eQTL study was that, compared to whole-organism phenotypes, genetic mapping of gene expression resulted in a large number of significant hits, likely owing to the more direct effect of genetics on gene expression than on whole-organism phenotypes.


Since then, a large number of eQTL studies have been performed in order to detect the genetic basis of gene expression[74-86]. The first study to investigate the relative impact of CNVs and SNPs on gene expression was published in 2007 and discovered that CNVs played a significant role in the regulation of gene expression, with little overlap with the effect of regulatory SNPs[85].

Expression QTL studies have been performed in many different tissues, an important step towards fully understanding the genetics of gene expression, as some eQTLs only have an effect in some tissues[74-85]. The first study investigating the sharing of cis-eQTLs across tissues found that most eQTLs were tissue-specific[74]. In retrospect, this low sharing estimate was probably due largely to low statistical power, as the result has been contradicted by larger subsequent studies[78,83]. Interestingly, tissue-shared eQTLs were found to be located close to transcription start sites, while tissue-specific eQTLs were located further away, suggesting that the effects of promoters are more likely to be shared across tissues while those of enhancers are more likely to be tissue-specific[74,87]. Another study estimated the sharing of whole-genome genetic effects on gene expression between lymphoblastoid cell lines and whole blood using samples from monozygotic twin pairs and found that the whole-genome genetic control of gene expression was largely independent between the two tissues[88]. Although this result needs to be replicated with a larger sample size, it suggests that even if the main cis-eQTLs are often shared, the remaining additive genetic effects on gene expression could be largely independent between tissues. A more detailed picture of the genetics of gene expression across tissues will emerge soon, as the GTEx consortium is currently mapping eQTLs in a large number of tissues in over 900 individuals[89].

The inter-population variability of the genetics of gene expression was also investigated in two studies[76,84]. As in the studies of tissue-specific effects, it was observed that eQTLs shared across many populations are usually located closer to the transcription start site than eQTLs found in only one or a few populations, suggesting that enhancers play a larger role in population-specific effects on gene expression. However, this could also result from the lower effect sizes of eQTLs located further away from the transcription start site, as eQTLs with low effect sizes are less likely to replicate than eQTLs with large effect sizes[84].

Most eQTL studies have measured gene expression using microarrays[74,75,78,83,85,90]. However, in 2010, two eQTL studies used RNA-seq for the first time to quantify gene expression[77,80]. This new technology proved to have several advantages over microarrays. First, RNA-seq allows the quantification of all genes expressed in a tissue, instead of the subset of selected genes covered by a microarray. Secondly, RNA-seq better quantifies highly expressed genes, whose signal is saturated on microarrays. Furthermore, RNA-seq allows the measurement of allele-specific expression, which can be used to detect the effect of rare variants on gene expression[77,86]. RNA-seq also provides information on alternative splicing events. Finally, gene quantification using RNA-seq was shown to result in a larger number of discovered eQTLs than quantification using microarrays[77].


Gene expression was shown to be highly heritable, with 40-90% of genes showing significant heritability across different tissues[91]. Interestingly, it was observed that the heritability of the cis region only captures 12-36% of the total heritability of gene expression[78,92,93], indicating that trans effects play a major role in the genetics of gene expression. However, the detection of trans-eQTLs proved to be more difficult than that of cis-eQTLs[78,81,94], owing to smaller effect sizes and the need for more stringent multiple testing corrections. In addition, replication of the few detected trans-eQTLs was often not attempted due to a lack of large replication cohorts. In order to reduce the number of statistical tests required in the trans analysis of gene expression, a few studies have focused on a reduced set of variants, such as SNPs located in expressed genes[95] or SNPs associated with complex traits and common diseases[90], which led to the identification of a few hundred replicated trans-eQTLs.
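The difference in multiple testing burden between cis, genome-wide trans and restricted trans analyses can be made concrete with a rough count of tests. All numbers below are hypothetical round figures chosen for illustration and are not taken from any of the cited studies; Bonferroni correction is used only because it is the simplest correction to reason about.

```python
def n_tests(n_genes: int, snps_tested_per_gene: int) -> int:
    """Total number of SNP-gene association tests."""
    return n_genes * snps_tested_per_gene

def bonferroni_alpha(total_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / total_tests

# Hypothetical round numbers (assumptions, not values from the cited studies).
N_GENES = 15_000
CIS_SNPS_PER_GENE = 5_000       # SNPs inside the cis window of one gene
GENOME_WIDE_SNPS = 5_000_000    # every SNP tested against every gene
FUNCTIONAL_SNPS = 10_000        # restricted set, e.g. trait-associated SNPs

cis_alpha = bonferroni_alpha(n_tests(N_GENES, CIS_SNPS_PER_GENE))
trans_alpha = bonferroni_alpha(n_tests(N_GENES, GENOME_WIDE_SNPS))
restricted_trans_alpha = bonferroni_alpha(n_tests(N_GENES, FUNCTIONAL_SNPS))
```

Under these assumptions the genome-wide trans analysis involves a thousand times more tests than the cis analysis, whereas restricting the trans analysis to a functional subset brings the threshold back to the same order of magnitude as in cis.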

Overall, eQTL studies were very successful and mapped cis-eQTLs in many different tissues, as well as in diverse populations. However, the detected cis-eQTLs cannot fully explain the genetic variance of gene expression, indicating that more loci affecting gene expression remain to be discovered. In addition, evaluating the sharing and specificity of genetic effects across tissues, the changes of genetic effects over time and their interaction with environmental factors will be key to understanding complex traits.

3.5 Gene-environment interactions

Genome-wide association studies discovered thousands of loci associated with complex traits, while eQTL studies assigned a putative functional role to many of them. Despite these successes, the etiology of complex diseases is still far from understood. Gene-environment interactions are likely to play an important role in the etiology of common diseases.

Indeed, epidemiological studies observed that the prevalence of many diseases is very low in young individuals and steadily increases as individuals age[96-100]. For example, the prevalence of diabetes is less than 0.5% in individuals below 25 years old, but around 15% in those over 80 years old[101]. As there is no reason to expect that young individuals carry fewer risk alleles for common diseases than older individuals (if we omit the role of de novo mutations), it seems likely that the relevance of risk alleles increases over time due to interaction with the environment. In addition, the effect of risk alleles may only be seen in specific environments. For example, it is hypothesized that the recent obesity epidemic is due to recent environmental changes that allow the effects of underlying risk alleles to be expressed in the population[102].

A few studies in mice found that the penetrance of risk alleles was dependent on a specific environment. For example, the penetrance of a mutation located in a gene associated with Crohn’s disease was shown to increase after infection by a specific virus, leading to an intestinal phenotype similar to Crohn’s disease[103]. Furthermore, it was shown that the penetrance of heterozygous deletions of two transcription factors implicated in congenital scoliosis was increased by hypoxia during gestation[104]. These two studies showed that gene-environment interactions could play an important role in disease etiology.

Investigating gene-environment interactions is relatively difficult in humans, as one cannot control human environments. One avenue is the use of allele-specific expression (ASE) in large cohorts of monozygotic (MZ) twins. Indeed, if the difference in ASE between MZ twin pairs that are heterozygous at a specific SNP is larger than for MZ twin pairs that are homozygous at the same SNP, this indicates that one of the two alleles of the heterozygous SNP interacts with its environment[105]. Another avenue is to test gene-environment interactions in vitro. For example, one can add different drugs or toxins to the cell culture medium and look for genotype-specific responses in gene expression[106-109]. Finally, one can look at the cumulative effect of all environments and its interaction with genetics in the form of gene-age interactions. This strategy recently led to the discovery of hundreds of eQTLs significantly interacting with age in cross-sectional studies[110-112], indicating that genetic effects on gene expression can change over time.
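A gene-age interaction of the kind described above is commonly formulated as a linear model with a genotype-by-age product term: expression = b0 + b1·genotype + b2·age + b3·(genotype × age), where a non-zero b3 means that the eQTL effect changes with age. The sketch below fits such a model by ordinary least squares using only the standard library. The model form, the function names and the data are assumptions for illustration and do not reproduce the methods of the cited studies.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations and Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def fit_gene_age_interaction(dosages, ages, expression):
    """Return [b0, b1 (genotype), b2 (age), b3 (genotype x age)]."""
    X = [[1.0, g, a, g * a] for g, a in zip(dosages, ages)]
    return ols(X, expression)

# Hypothetical noise-free data in which the genotype effect weakens with age.
dosages = [g for g in (0, 1, 2) for _ in (40, 50, 60, 70)]
ages = [a for _ in (0, 1, 2) for a in (40, 50, 60, 70)]
expression = [1.0 + 0.5 * g + 0.01 * a - 0.005 * g * a
              for g, a in zip(dosages, ages)]
b0, b_genotype, b_age, b_interaction = fit_gene_age_interaction(
    dosages, ages, expression)
```

In a real analysis the interaction coefficient would be tested for significance against noisy data and corrected for multiple testing; here the negative value of `b_interaction` simply encodes an eQTL whose effect shrinks as age increases.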

3.6 Thesis aim

The aim of this thesis was to increase our understanding of the genetics of gene expression in cis, in trans and over time. To accomplish the first two goals, I had the opportunity to study cross-sectional data obtained from lymphoblastoid cell lines in 869 children of the ALSPAC cohort (section 4.1)[83]. In contrast to most eQTL studies, copy number variants were genotyped in the ALSPAC cohort, allowing us to compare the relative impact of SNPs and CNVs on the genetics of gene expression. The large sample size of the cohort increased the statistical power to detect multiple independent cis-eQTLs per gene and investigate their role in the heritability of gene expression, to assess tissue sharing of cis-eQTLs in an unbiased manner and to detect trans-eQTLs. Most studies previously investigating trans-eQTLs did not replicate their findings or provide biological explanations for them. I tried to address both issues by replicating trans-eQTLs in two different cohorts and by exploring the role of functional SNPs in the regulation of gene expression in trans.

In a second project, we aimed to explore the role of genetics in gene expression over time. To accomplish this goal, we obtained longitudinal whole-blood transcriptomes using RNA-seq in a twin cohort (section 4.2). The transcriptomic data were obtained at two time points separated by approximately two years on average. As this study is the first longitudinal study investigating the genetics of gene expression over time, we aimed to answer many different questions. First, we aimed to see whether the heritability of gene expression was stable over time and to detect putative genes with a change in heritability. Secondly, we wanted to examine the stability of gene expression over time and identify sources of variability by obtaining correlations of the genetic, common environment and unique environment components of gene expression over time. Third, we wanted to investigate the sharing of cis-eQTLs over time. Fourth, we wanted to explore gene-age interactions by detecting SNPs with a different effect on gene expression over time. Fifth, we aimed to find differentially expressed genes over time and examine the relationship between genetics and differential expression. Finally, we aimed to explore differences in the heritability of splicing over time, detect differentially spliced genes, detect alternative splicing QTLs and investigate their sharing over time.
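The genetic, common environment and unique environment components mentioned above correspond to the classical decomposition used in twin studies. As a rough illustration only, Falconer's formulas estimate these components from the trait correlations in monozygotic (MZ) and dizygotic (DZ) twin pairs. This is an assumption about how such components can be approximated; the analyses in this thesis may rely on different, model-based variance component methods, and the correlations below are invented.

```python
def falconer_ace(r_mz: float, r_dz: float):
    """Falconer's estimates of additive genetic (A), common environment (C)
    and unique environment (E) variance proportions from twin correlations."""
    a2 = 2 * (r_mz - r_dz)   # heritability
    c2 = 2 * r_dz - r_mz     # shared (common) environment
    e2 = 1 - r_mz            # unique environment, including measurement error
    return a2, c2, e2

# Hypothetical correlations of one gene's expression level within twin pairs.
a2, c2, e2 = falconer_ace(r_mz=0.6, r_dz=0.4)
```

By construction the three proportions sum to one, so comparing them between two time points gives a first impression of whether the genetic share of expression variance is stable.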


4 RESULTS

4.1 Cis and trans effect of human genomic variants on gene expression

The role of genetics in the regulation of gene expression in cis is relatively well established and, cumulatively, the majority of genes in the human genome were found to be affected by at least one cis-eQTL in at least one tissue[74,76-78,81,85,94,113]. On the other hand, trans effects on gene expression remain more elusive. For example, a study with more than 1000 samples in humans failed to find any significant trans-eQTLs[94]. Other studies found hundreds of trans-eQTLs that were not tested for replication or that did not replicate well[78,81]. For example, trans-eQTLs detected in the MUTHER study replicated at a rate of 0-13%[78]. Overall, it was observed that the statistical power to detect trans-eQTLs was lower than for cis-eQTLs because of lower effect sizes coupled with an increased burden of multiple testing corrections[113,114].

However, heritability analysis showed that most of the heritability of gene expression lies in trans. For example, it was estimated that only 37% of the heritability of gene expression is in cis in blood[93] and 12-35% in LCLs[78,92]. This indicates that, cumulatively, trans effects play a larger role in gene expression but that their individual effects are usually smaller. Therefore, in order to better understand the genetics of gene expression in trans, I used the large sample size of the ALSPAC cohort (869 individuals) to detect trans-eQTLs and replicate them in different cohorts[83].

In order to decrease the burden of a large number of statistical tests, I also performed trans-eQTL analyses focusing on functional SNPs, such as cis-eQTLs, SNPs associated with complex traits and diseases, and non-synonymous SNPs. I discover tens of trans-eQTLs that replicate relatively well (34-55%) in the MUTHER and GEUVADIS datasets[76,78]. I also provide biological explanations for trans-eQTLs by showing that they are enriched in cis-eQTLs. More importantly, I discover causal relationships between genes regulated in cis and in trans by the same variants, showing that genetic perturbations allow us to reconstruct meaningful regulatory relationships between genes. Finally, I explore the relative role of CNVs and SNPs in gene expression, assess the tissue specificity of cis-eQTLs and show that many cis-eQTLs affect the expression of numerous genes in trans. This result is consistent with a large role for trans effects in the genetics of gene expression and with a large number of variants, each of small effect, influencing gene expression.
