Cis and Trans Effects of Human Genomic Variants on Gene Expression

5.1 What is missing in the genetics of gene expression?

 

Many eQTL studies performed over the last 10 years have identified a large number of genes affected by cis-eQTLs in many different tissues[74,76,77,83,90,94,95]. However, as for whole-organism phenotypes, the eQTLs discovered so far cannot fully explain the genetic variance of gene expression, indicating that many genomic variants remain to be discovered. Indeed, studies found that cis-eQTLs explained on average 25-38% of the heritability of gene expression[78,130]. As the proportion of the heritability of gene expression located in cis was estimated to range from 12% to 37%, these results suggest that a large part of the cis genetics of gene expression has been found and that most genes are affected by only a few independent variants in cis[78,92,93]. A precise quantification of the role of common SNPs (MAF >5%) located in cis found that on average 60-79% of the cis-heritability was accounted for by all common SNPs[78]. This suggests that rare variants, poorly imputed SNPs, poorly tagged structural variants and non-additive effects explain 21-40% of the heritability of gene expression in cis. Therefore, in order to fully explain the genetic variance of gene expression in cis, it will be necessary to obtain whole-genome sequences in a large number of samples. This will make it possible to test the effect of rare variants, to detect structural variants and to measure their effect on gene expression. In addition, whole-genome sequences will make it possible to investigate epistasis between independent cis-eQTLs in an unbiased manner. Indeed, without whole-genome sequences, a statistical interaction between two SNPs located in close proximity could be the result of haplotypic effects and not real epistasis (i.e. an unknown causal variant is located on a specific haplotype that is tagged by the combination of the two statistically interacting SNPs)[131].

The major part of the genetics of gene expression is located in trans. Detecting genomic variants that affect gene expression in trans will require much larger sample sizes than detecting cis-eQTLs, both because of their low effect sizes and because of the stringent multiple testing corrections needed to limit the number of false positives[78,91].
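The scale of this multiple testing problem can be illustrated with a short sketch. It compares Bonferroni thresholds for a cis scan and a trans scan and converts each into an approximate sample size. All input numbers (gene count, SNP count, SNPs per cis window, variance explained) are hypothetical placeholders rather than values from the cited studies, and the sample size formula is a standard normal approximation for a single-SNP linear regression, not the method used in this thesis.

```python
from statistics import NormalDist

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def approx_sample_size(var_explained, p_threshold, power=0.8):
    """Approximate N needed to detect a SNP explaining `var_explained` of the
    expression variance at a two-sided threshold `p_threshold`.
    Normal approximation: N ~ (z_alpha + z_power)^2 * (1 - r2) / r2."""
    z = NormalDist()
    z_alpha = -z.inv_cdf(p_threshold / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) ** 2 * (1 - var_explained) / var_explained

# Hypothetical study dimensions (assumptions for illustration only).
n_genes = 20_000
n_snps = 1_000_000
cis_snps_per_gene = 1_000

n_cis_tests = n_genes * cis_snps_per_gene
n_trans_tests = n_genes * n_snps - n_cis_tests

cis_thr = bonferroni_threshold(n_cis_tests)
trans_thr = bonferroni_threshold(n_trans_tests)

# Hypothetical effect sizes: a cis-eQTL explaining 5% of the variance,
# a trans-eQTL explaining 1%.
print("cis threshold:  ", cis_thr, "N ~", round(approx_sample_size(0.05, cis_thr)))
print("trans threshold:", trans_thr, "N ~", round(approx_sample_size(0.01, trans_thr)))
```

Under these assumptions, both the smaller effect and the stricter threshold push the required sample size for the trans scan well above that of the cis scan. Bonferroni correction is conservative, so the absolute numbers should be read as rough orders of magnitude.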

Although trans-eQTLs are individually expected to have only a small effect on the expression of a single gene, they can play an important role in complex traits and diseases because they can affect a large number of genes. For example, the KLF14 locus, which is associated with type 2 diabetes and HDL cholesterol level, was found to be associated with the expression of a large number of genes in trans, providing a molecular mechanism for the effect of the locus on disease risk[132]. In addition, we observed (section 4.1) that cis-eQTLs affected a large number of genes in trans but that their individual effects were too small to be detected with our sample size[83].

It will also be important to further explore the role of structural variants, which are often neglected in genome-wide association studies because they are technically difficult to measure accurately[35,53]. Before the study described in section 4.1, only one study had investigated the effect of CNVs on gene expression, and it did so in a much smaller sample with fewer CNVs[85]. As I observed that CNVs were more likely to affect gene expression than SNPs[83], it is likely that common structural variants play a role in the variability of many traits and should be accounted for if we aim to explain all heritability.

Epistasis has been proposed to explain part of the missing heritability observed in genome-wide association studies[57,133]. As organism-level phenotypes are the result of molecular phenotypes, part of the missing heritability of gene expression could also be due to epistasis. For example, a cis-eQTL affecting the binding of a transcription factor could have a different effect on gene expression depending on whether the transcription factor is highly or lowly expressed. Epistasis was shown to be relatively widespread in model organisms[134,135] but its role in humans remains controversial[17,18,136]. One line of evidence suggests that epistasis could play a significant role in the genetics of gene expression, as it was observed that a significant fraction of the variance in allele-specific expression was due to interactions between cis and trans effects[137]. In addition, recent studies have reported epistatic interactions affecting gene expression in humans[138,139]. Although the results of Hemani et al.[138] were disputed because many of the reported interactions could be explained by haplotypic effects[131], it seems likely that some of their interactions, as well as the interactions detected by Brown et al.[139], are real. I explored the effect of eQTL-eQTL interactions on gene expression in the ALSPAC cohort (869 individuals) but could not detect any significant interactions after multiple testing correction and stringent filtering to exclude potential haplotypic effects. It is probable that the detection of epistasis will require much larger sample sizes in humans than in model organisms because of the lower allele frequencies in human populations[134].
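As an illustration of what a statistical interaction between two eQTLs means, the sketch below simulates expression values for the four combinations of carrier status at two loci and tests the interaction contrast with a z-test. It is a deliberately simplified example: genotypes are reduced to binary carrier status, the effect sizes and sample sizes are hypothetical, and the functions `simulate` and `interaction_z_test` are illustrative names, not the analysis performed in the ALSPAC cohort. It also does not address haplotypic effects, which require separate filtering.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def simulate(n_per_cell, effect_a, effect_b, effect_ab, seed):
    """Simulate expression for each (carrier_a, carrier_b) combination.
    effect_ab is the epistatic term: the extra shift seen only in double carriers."""
    rng = random.Random(seed)
    cells = {}
    for a in (0, 1):
        for b in (0, 1):
            mu = effect_a * a + effect_b * b + effect_ab * a * b
            cells[(a, b)] = [rng.gauss(mu, 1.0) for _ in range(n_per_cell)]
    return cells

def interaction_z_test(cells):
    """Test whether the effect of locus A differs between carriers and
    non-carriers at locus B (difference of differences of cell means)."""
    contrast = (mean(cells[(1, 1)]) - mean(cells[(1, 0)])) \
             - (mean(cells[(0, 1)]) - mean(cells[(0, 0)]))
    # Cells are independent, so the variances of the four means add up.
    se = sqrt(sum(variance(v) / len(v) for v in cells.values()))
    z = contrast / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return contrast, z, p

# Hypothetical effect sizes, in units of the residual standard deviation.
with_epistasis = simulate(2000, effect_a=0.5, effect_b=0.3, effect_ab=0.4, seed=1)
additive_only = simulate(2000, effect_a=0.5, effect_b=0.3, effect_ab=0.0, seed=2)

print("with epistasis:", interaction_z_test(with_epistasis))
print("additive only: ", interaction_z_test(additive_only))
```

With equal cell sizes, the standard error of the contrast is twice that of a single cell mean, and in real data the double-carrier cell is usually the smallest. This is one way to see why low allele frequencies make epistasis harder to detect.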

In order to fully understand the genetics of gene expression, it will be necessary to perform eQTL studies in different tissues, as some genetic variants might have an effect only in a specific tissue or have different effects in different tissues[65,66,74,75,78]. Almost all studies so far have focused on the sharing of the main cis-eQTL per gene across tissues. However, as discussed above, the main cis-eQTLs explain only a small fraction of the genetic variance of gene expression on average. Therefore, in order to get a complete picture of the shared genetics between tissues at the whole-genome level, it will be necessary to perform large-scale eQTL studies in twin cohorts, as the relatedness between individuals makes it possible to obtain genetic correlations that capture the correlation of the whole-genome additive genetic effects between two traits (in this case, between the same trait measured in different tissues)[88]. In addition, the observation that eQTLs located at greater distance from the transcription start site are more tissue-specific[113] would need to be confirmed with larger sample sizes, as the lower effect sizes of eQTLs located further away are likely to reduce their replication rate across tissues.
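To make the idea of a twin-based genetic correlation concrete, the sketch below uses the classical Falconer approximation, in which heritability is estimated as twice the difference between monozygotic (MZ) and dizygotic (DZ) twin correlations. The same logic applied to cross-twin cross-tissue correlations gives the genetic covariance. This is a simplified illustration that assumes standardized traits and a purely additive model. It is not necessarily the method used in [88], and all input correlations are hypothetical.

```python
from math import sqrt

def falconer_h2(r_mz, r_dz):
    """Falconer's estimate of narrow-sense heritability from twin correlations."""
    return 2.0 * (r_mz - r_dz)

def genetic_correlation(r_mz_1, r_dz_1, r_mz_2, r_dz_2, cross_mz, cross_dz):
    """Genetic correlation between expression in tissue 1 and tissue 2.
    cross_mz / cross_dz: correlation between tissue 1 expression in one twin
    and tissue 2 expression in the co-twin, for MZ and DZ pairs."""
    h2_1 = falconer_h2(r_mz_1, r_dz_1)
    h2_2 = falconer_h2(r_mz_2, r_dz_2)
    if h2_1 <= 0 or h2_2 <= 0:
        raise ValueError("both heritability estimates must be positive")
    cov_g = 2.0 * (cross_mz - cross_dz)  # additive genetic covariance
    return cov_g / sqrt(h2_1 * h2_2)

# Hypothetical twin correlations for one gene measured in two tissues.
r_g = genetic_correlation(
    r_mz_1=0.6, r_dz_1=0.4,   # tissue 1
    r_mz_2=0.5, r_dz_2=0.3,   # tissue 2
    cross_mz=0.3, cross_dz=0.2,
)
print(round(r_g, 3))
```

A genetic correlation close to 1 would indicate that largely the same additive genetic effects act in both tissues, whether or not the individual variants have been mapped.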

Overall, the remainder of the genetics of gene expression will be uncovered by large-scale eQTL studies performed using DNA sequencing, in different tissues and ideally in twin cohorts. Such studies will not only be informative about the genetics of complex traits but could also improve our understanding of cellular pathways and give insights into cellular biology (section 4.1).

5.1.1 Integrating genomic variant effects over multiple molecular phenotypes

 

In order to precisely understand the chain of events that leads from a SNP to a disease, it is necessary to integrate the effects of SNPs over many molecular phenotypes[70]. Ideally, we would like to find that one SNP prevents the binding of a transcription factor, which leads to changes in gene expression. This change in gene expression would then propagate to a change in protein production, which could have an effect at a systemic level and lead to an increased disease risk.

The first studies integrating the effect of multiple molecular phenotypes at the population level are starting to appear and are providing unprecedented insight into the molecular effects of common variants and into the crosstalk of different regulatory layers. Most integrative studies published so far have focused on the integration of genetics, DNA methylation and gene expression[64-67,140]. One study from our laboratory recently investigated the complex relationships between these three layers of information and observed that some SNPs first affected DNA methylation, which resulted in changes in gene expression, while other SNPs first affected gene expression, which resulted in changes in DNA methylation levels (section 7.1)[65]. Other integrative studies have explored the relationships between genetics, gene expression and protein levels and found that most protein QTLs are also eQTLs[141,142]. However, the converse is not always true, indicating that buffering mechanisms could attenuate the effect of some cis-eQTLs.

The use of diverse molecular phenotypes could increase the statistical power to detect trans effects on gene expression. For example, one allele could reduce the expression of a transcription factor in cis. A lower level of this transcription factor could result in reduced binding to some regulatory element in trans, which could be associated with an increased methylation level. Although we might not have the power to detect a significant effect of that allele on gene expression in trans, we might indirectly find that the allele has an effect on gene expression in trans by showing that changes in the methylation level of the regulatory element are associated with a change in gene expression. This theoretical framework is supported by a study that identified almost 2000 methylation sites affected in trans by at least one SNP in only 1748 individuals[143]. The SNP-methylation site pairs were often located on different chromosomes (85%) and had more than 90% replication in independent cohorts. Although no trans-eQTL study has tested genome-wide SNPs in such a large sample size, it seems unlikely that we would detect as many genes affected in trans by an eQTL because of their less direct effect.
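The reasoning above can be illustrated with a small simulation of a linear chain SNP → transcription factor expression → methylation → target gene expression. The sketch assumes standardized variables, the same hypothetical correlation at every step, a continuous stand-in for the genotype and positive signs throughout (the direction of each effect is ignored). It shows that each adjacent link is much stronger than the end-to-end SNP-target association, which shrinks roughly as the product of the step correlations.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_chain(n, r_step, seed):
    """Each layer is r_step * previous layer + independent noise, scaled so
    that every layer keeps unit variance."""
    rng = random.Random(seed)
    noise_sd = sqrt(1 - r_step ** 2)
    layers = {"snp": [], "tf": [], "meth": [], "target": []}
    for _ in range(n):
        g = rng.gauss(0, 1)                        # standardized genotype stand-in
        t = r_step * g + noise_sd * rng.gauss(0, 1)  # cis effect on the TF
        m = r_step * t + noise_sd * rng.gauss(0, 1)  # TF level -> methylation
        e = r_step * m + noise_sd * rng.gauss(0, 1)  # methylation -> target gene
        for key, value in zip(("snp", "tf", "meth", "target"), (g, t, m, e)):
            layers[key].append(value)
    return layers

# Hypothetical step correlation of 0.5 at each link.
layers = simulate_chain(n=20_000, r_step=0.5, seed=7)
print("snp-tf:     ", round(pearson(layers["snp"], layers["tf"]), 3))
print("snp-meth:   ", round(pearson(layers["snp"], layers["meth"]), 3))
print("meth-target:", round(pearson(layers["meth"], layers["target"]), 3))
print("snp-target: ", round(pearson(layers["snp"], layers["target"]), 3))
```

Because the sample size needed to detect an association grows roughly with the inverse of the squared correlation, testing the intermediate links separately can succeed where a direct SNP-target test is underpowered.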

Overall, the integration of multiple molecular phenotypes will improve our understanding of the crosstalk between regulatory layers and of how risk alleles lead to increased disease risk, and will probably increase the statistical power to detect trans-eQTLs.

5.1.2 Integrating molecular phenotypes over different environments  

Although the integration of different molecular phenotypes will greatly improve our understanding of the impact of genetics on complex traits, it will also be important to assess how these effects change in different environments and over time. Indeed, some variants might have a deleterious effect only in a specific cell type, at a specific age or after exposure to a specific environment. Efforts are underway to accomplish this difficult task and some studies are already discovering that some eQTLs are only active in specific environments. For example, genotype-specific effects on gene expression and protein levels were detected in immune cells upon stimulation by a pro-inflammatory cytokine and by an endotoxin[107,144,145]. In addition, different drugs were found to elicit genotype-specific responses in gene expression[108,109]. Such studies will need to be extended to test the genome-specific response to a large number of different environments, the most essential being risk factors of common diseases, age, diet, the metabolome and the microbiome composition.

Longitudinal studies could play an important role in understanding how the genetics of gene expression changes over time (section 4.2) and why some individuals become sick while others stay healthy.

Particularly interesting longitudinal studies could focus on healthy individuals with a high genetic risk score for a specific disease. The collection of several molecular phenotypes at multiple time points would then help to better understand the etiology of that specific disease. In addition, longitudinal data at the individual level could play a significant role in disease prevention. For example, it was shown that longitudinal data obtained in one individual predicted the onset of type 2 diabetes, which could be prevented by lifestyle modifications[146].

Overall, understanding how the genome responds to diverse environments will probably be key to understanding the etiology of many complex diseases and to proposing targeted disease prevention measures.

5.2 Limitations of the results presented in this thesis  

In this section, I address the limitations of the results presented in this thesis. First, I discuss the limitations of the cross-sectional eQTL study performed in lymphoblastoid cell lines (section 4.1)[83].

The first limitation is that we used lymphoblastoid cell lines (LCLs), which are obtained by transforming B lymphocytes with the Epstein-Barr virus. Therefore, some of the eQTLs detected in LCLs might not be observed in primary tissues. This is unlikely to be an important limitation in cis, as most cis-eQTLs detected in LCLs were found to replicate well in other tissues[78]. In addition, most cis-eQTLs detected in other tissues were found to replicate in LCLs (section 4.1)[83]. However, the extent of tissue-specific effects of eQTLs remains unexplored in humans and the relevance of the detected trans-eQTLs in other tissues remains unknown.

A second limitation is that the LCLs were derived from blood samples obtained from individuals who were 9 years old. Although the homogeneity of the data likely increased the statistical power to detect eQTLs, some of the detected eQTLs might be specific to children. Such child-specific eQTLs could explain in part why we could not replicate all cis-eQTLs (71-81% replication) and all trans-eQTLs (34-55% replication) in LCLs derived from adult donors[76,78,83].

Another limitation is the relatively “small” sample size of the study (869 individuals) for the detection of trans-eQTLs. Although the study is large compared to many published eQTL studies[74,76,77,80,95], the statistical power to detect trans-eQTLs is still clearly limited. Indeed, I showed that