Cis and Trans Effects of Human Genomic Variants on Gene Expression

4.3 Software and computational use

All analyses presented in this thesis were performed at the Vital-IT (http://www.vital-it.ch) center for high-performance computing of the Swiss Institute of Bioinformatics (SIB) using the following software:

- Bash (v4.1.2) was used to handle and format files. It was also used to parallelize computational jobs on Vital-IT.

- Perl (v5.18.2) was used to format files, filter files and to obtain simple statistics from result files (such as the mean, the sum or the median of numerical values).

- R (v2.13.0 and v3.0.2) was used to normalize data, filter data, plot results and perform many types of statistical analysis (described in figure 4.3.1 to figure 4.3.14). The types of statistical analysis performed include simple statistics (using the functions mean(), var() and median()), linear regression (using the lm() function), linear mixed models (using the lmer() function), principal component analysis (using the prcomp() function), standard normalization of variables (using the rnorm() function), correction for multiple testing (using the qvalue() function or the p.adjust() function), Fisher's exact tests (using the fisher.test() function), analysis of variance (using the anova() function), the causal inference test (using the cit library) and Bayesian networks (using the bnlearn library).

Bayesian networks are directed acyclic graphs that describe the relationships among variables. Each network is associated with a probability distribution describing the probability of observing the expression values as a function of the parameters of the network. Because of the Markov property of Bayesian networks (each variable is conditionally independent of its non-descendants given its parent variables), the probability distribution of the network can be factorized into local probability distributions that each depend only on the variable’s parent(s). In our analysis, we used an “expert knowledge” approach: we fixed the configuration of the candidate networks and, for each one, estimated by maximum likelihood the parameters describing the local probability distributions using the bnlearn R package. The likelihoods of the different networks were then compared in order to detect the most likely networks.

We used the following R libraries: aroma.light, GenABEL, MASS, qvalue, RColorBrewer, Matrix, lme4, bnlearn, cit, sna, peer, edgeR, DESeq, DESeq2.

- Popgenomix (v1.0) was used to detect cis- and trans-eQTLs in the first study (section 4.1)[83]. Popgenomix computes the Spearman rank correlation between two vectors representing genotype and gene expression and can perform permutations in order to derive null distributions. This software was previously used in several eQTL studies[74,75,77].

- MatrixeQTL (v1.0)[117] is a cis-eQTL mapper that I tested for accuracy. I showed that the p-values obtained for cis-eQTLs with MatrixeQTL were extremely similar to the p-values obtained on the same dataset using Popgenomix. I also developed a Bash pipeline with Tuuli Lappalainen in order to parallelize MatrixeQTL and to perform permutations. This pipeline was used in the GEUVADIS project to map cis-eQTLs[76].

- FastQTL (v1.0) is an eQTL mapper that I used for the detection of cis-eQTLs and alternative splicing QTLs (asQTLs) in the second study (section 4.2).

- SOLAR (v7.2.5) is a software package for many types of statistical genetics analysis. It was used to estimate heritability and genetic and environmental correlations, and to investigate the interaction between genotype and time in the second study (section 4.2)[118].

- VCFtools (v0.1.12b) was used to handle and filter VCF files[119].

- Bedtools (v2.17.0) was used to find overlapping genomic regions between different BED files[120].

- LiftOver (v1.0) was used to convert genomic coordinates from the human genome assembly hg18 to hg19.

- Altrans (v1.0) was used to estimate relative splicing events[121].

- Eigenstrat (v1.0) was used to detect potential population stratification[122].

- PEER (v1.0) was used to correct gene expression data for the effect of unknown covariates[123].
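The rank-correlation test with permutations described for Popgenomix in the list above can be illustrated with a minimal sketch. It is written in Python with the standard library only (the thesis itself relied on the tools listed above), it is not Popgenomix's implementation, and the function names, the toy genotype and expression values, and the number of permutations are all assumptions made for illustration.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = average_rank
        start = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_pvalue(genotypes, expression, n_permutations=1000, seed=0):
    """Empirical two-sided p-value from shuffling the expression vector."""
    rng = random.Random(seed)
    observed = abs(spearman(genotypes, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any genotype-expression link
        if abs(spearman(genotypes, shuffled)) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of 0.
    return (hits + 1) / (n_permutations + 1)


# Invented toy data: genotype coded as 0/1/2 copies of the alternative allele.
genotypes = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
expression = [1.1, 0.9, 2.0, 2.2, 1.8, 3.1, 2.9, 1.0, 2.1, 3.3]
rho = spearman(genotypes, expression)
p_value = permutation_pvalue(genotypes, expression)
print(f"rho = {rho:.3f}, permutation p-value = {p_value:.4f}")
```

Shuffling the expression vector keeps both marginal distributions intact while removing the association, which is why the shuffled correlations can serve as a null distribution.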

I also used data from several databases and published datasets such as ENCODE[124], HAPMAP[31], 1000 Genomes[40], the GWAS catalog (http://www.genome.gov/gwastudies/)[125] and MUTHER[78]. The UCSC genome browser was used for data visualization and data extraction[126].

Finally, DAVID[127] was used to perform gene set enrichment analysis using KEGG pathways[128] and other gene ontology term databases. I used DAVID because it is the most widely used software for this type of analysis. However, the results would likely be improved by using another tool, as the gene ontology term databases queried by DAVID are not regularly updated.
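To make the idea of gene set enrichment concrete, the following minimal sketch computes a generic one-sided hypergeometric over-representation p-value in Python. It is not DAVID's implementation; the function name and all gene counts are hypothetical and chosen only for illustration.

```python
from math import comb


def enrichment_pvalue(n_background, n_in_pathway, n_selected, n_overlap):
    """Probability of seeing at least n_overlap pathway genes among
    n_selected genes drawn without replacement from n_background genes,
    of which n_in_pathway belong to the pathway."""
    max_overlap = min(n_in_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, a 100-gene pathway,
# 200 genes of interest, 10 of which fall in the pathway
# (one would be expected by chance).
p_value = enrichment_pvalue(20_000, 100, 200, 10)
print(f"over-representation p-value = {p_value:.3g}")
```

In practice one such p-value is computed per pathway, so the results would then need a correction for multiple testing.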

The workflows leading to the results presented in this thesis are shown in Figure 4.3.1 and Figure 4.3.2. The processing of the genotype data is shown in red, the processing of the phenotype data is shown in yellow and the integration of genotypic and phenotypic data is shown in orange. The validation performed in section 4.1 is shown in light blue (Figure 4.3.1). Analyses based solely on gene expression/splicing data are shown in purple/green (Figure 4.3.2). The workflow for processing the genotypic and gene expression data in order to perform eQTL analysis is standard in the field. However, analyses that go beyond a simple cis-eQTL analysis are more dataset-specific. For example, heritability cannot be estimated without family data, and differential expression analysis cannot be performed without at least two conditions. Overall, the workflow presented in Figure 4.3.1 could be applied to any large-scale cross-sectional dataset, while the workflow presented in Figure 4.3.2 could be applied to any dataset with family structure and with gene expression measured using RNA-seq in two conditions.

Figure 4.3.1: Workflow of the analyses performed in section 4.1 with arrows indicating the directionality of the workflow. Analyses related to the processing of expression/genotype data are shown in yellow/red. Analyses related to the integration of genotype and gene expression data are shown in orange. The light blue color represents external datasets used for the replication of the detected trans-eQTLs.

Figure 4.3.2: Workflow of the analyses performed in section 4.2 with arrows indicating the directionality of the workflow. Analyses related to the processing of expression/genotype data are shown in yellow/red. Analyses related to splicing are shown in green and analyses related to gene expression are shown in purple. Analyses related to the integration of genotype and splicing/gene expression data are shown in orange.

[Figure 4.3.1, workflow diagram. Box labels: ALSPAC individuals; Raw genotype; QC & Imputation; Processed genotype; Expression image data; Normalization; Quantified expression; eQTL analysis; cis-eQTLs; Independent cis-eQTLs; Heritability explained; Tissue specificity; Trans-eQTLs; Replication (MUTHER cohort, GEUVADIS cohort); trans effects of cis-eQTLs; Causal inference. Color legend: Validation, Integration, Phenotype, Genotype.]

[Figure 4.3.2, workflow diagram. Box labels: Longitudinal Data in Twins; Raw genotype; QC & Imputation; Processed genotype; Expression Image data; QC & Mapping & Normalization; Quantified expression; Quantified splicing events; eQTLs; Sharing over time; Differential effect of SNPs over time; Differential expression; Causal inference; Downregulated/Upregulated genes change in a genome dependent/independent manner; Heritability; Downregulated genes lose heritability; Heritability of the change; Correlation of gene expression over time; Correlation of the components of gene expression; Differential splicing over time; Alternative splicing QTLs. Color legend: Integration, Genotype, Phenotype, Expression, Splicing.]

4.3.1 Computational resources used

The work presented in this thesis is the result of a total of 51.65 years of CPU usage (in 2012, 2013 and 2014), which would cost approximately CHF 158’346 in a commercial setting (Figure 4.3.1.1A). At the end of 2014, storage of projects and backups took a total of 7.2 TB of disk space, which would cost CHF 3’585 in a commercial setting (Figure 4.3.1.1B). Although CPU usage was highest in 2012, it led to only 1 TB of stored data, while 2013 and 2014 each led to an increase in storage of 3 TB. This suggests an increase in computational efficiency over time, probably because of increased experience and the use of more optimized and already established scripts.

These data show that the work performed during this thesis could not have been carried out without a compute cluster such as Vital-IT.
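The totals quoted above can be turned into per-unit figures with a few lines of arithmetic. This is a minimal Python sketch: the four totals come from the text, whereas the 365-day year and the three-year span are assumptions, and the resulting rates are back-calculated averages, not the reference prices actually used for the commercial estimate.

```python
# Totals quoted in the text.
CPU_YEARS = 51.65
CPU_COST_CHF = 158_346
STORAGE_TB = 7.2
STORAGE_COST_CHF = 3_585

# Assumptions made for this sketch (not stated in the source).
HOURS_PER_YEAR = 365 * 24   # non-leap year
CALENDAR_YEARS = 3          # 2012, 2013 and 2014

cpu_hours = CPU_YEARS * HOURS_PER_YEAR
cost_per_cpu_hour = CPU_COST_CHF / cpu_hours      # implied CHF per CPU-hour
cost_per_tb = STORAGE_COST_CHF / STORAGE_TB       # implied CHF per TB
average_cores = CPU_YEARS / CALENDAR_YEARS        # cores busy around the clock

print(f"CPU-hours used:            {cpu_hours:,.0f}")
print(f"Implied cost per CPU-hour: {cost_per_cpu_hour:.2f} CHF")
print(f"Implied cost per TB:       {cost_per_tb:.2f} CHF")
print(f"Average cores kept busy:   {average_cores:.1f}")
```

Under these assumptions the usage corresponds to roughly 17 cores running continuously for three years, or more than five decades on a single core, which is the sense in which a cluster was indispensable.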

Figure 4.3.1.1: (A) CPU usage (in years) and reference commercial cost (in thousands of CHF) associated with the amount of CPU usage. (B) Total storage (in terabytes) and reference commercial cost (in CHF) associated with the storage of the data.

[Figure 4.3.1.1, bar plots. Panel A: x-axis Year (2012, 2013, 2014); y-axes CPU (years) and Reference cost (thousand CHF). Panel B: x-axis Year (2012, 2013, 2014); y-axes Storage (TB) and Reference cost (CHF).]

5 Discussion

The field of genetics has made tremendous progress in the last 150 years. We went from believing in the inheritance of acquired traits to the identification of thousands of genomic variants affecting hundreds of diseases, complex traits, as well as thousands of molecular phenotypes[76,78,90,125]. Although we have made impressive progress, our understanding of the genetics of complex traits is still incomplete. In this thesis, I aimed to improve our knowledge of the genetics of gene expression by performing a large cross-sectional eQTL study in LCLs (section 4.1)[83]. This study discovered a large number of variants affecting gene expression in cis and in trans, provided a biological explanation for several trans effects and gave insight into the genetic architecture of gene expression. Although the discovery of regulatory variants is very important, their effect on gene expression is not fixed and varies across tissues and over time[74,75,78,88,110,112,129]. In order to better understand the temporal aspects of the genetics of gene expression, I analyzed longitudinal transcriptomic data obtained from blood samples in a twin cohort (section 4.2). We observed that gene expression was moderately correlated over time, primarily because non-genetic effects were weakly correlated between the two time points. We found that the global effects of genetics on gene expression were extremely stable over time. However, some regulatory variants, preferentially located in enhancers, can have a different effect on gene expression over time. The main finding of the longitudinal study is that ageing seems to be associated with a loss of genetic control of genes involved in protein production and oxidative phosphorylation, and with their downregulation over time. The downregulation of these genes then appears to drive the upregulation of genes involved in autophagy, the spliceosome and the actin cytoskeleton.

In this discussion, I would like to address what I think is still missing in our understanding of the genetics of gene expression in order to have a comprehensive understanding of the human genome and its relationship with the environment. In addition, I would like to discuss some limitations of the results presented in this thesis and ways to address them.
