Quantitative omics analysis

Frequently, RNA-seq data are used for differential expression (DE) analysis, in which the expression level of genes is compared between different conditions or samples. To perform DE analysis, the mapped reads listed in BAM files have to be converted into quantitative estimates per gene, or even per transcript. Whereas probabilistic RNA-seq aligners such as Kallisto directly estimate transcript abundance, and some aligners (such as STAR) have built-in counting functionality, in most other cases tools such as HTSeq-count are required to obtain count tables per gene (Figure 12).^158,159 HTSeq-count uses BAM or SAM files as input together with a Gene Transfer Format (GTF) file, which contains gene and exon annotation and can be obtained from, for example, Ensembl, but can also be created in-house.¹⁶⁰ HTSeq-count thus determines the number of reads overlapping the position of a specific gene (region) or transcript.

As different samples are sequenced to different depths, the counts have to be rescaled prior to statistical analysis to allow comparison between samples, a step called normalisation (Figure 12). Note that normalisation is the generic name for an omics strategy to remove putative non-biological, technical variation between samples, such as library size in sequencing experiments. Library sizes are often calculated as the total number of mapped reads. To obtain normalised gene counts, the raw gene counts of each sample are essentially divided by a scaling factor, namely the library size of that sample divided by the average library size over all samples. However, more elaborate methods, such as trimmed mean of M values (TMM), are often better suited for data normalisation.¹⁶¹ Statistical methods for DE analysis therefore typically implement their own normalisation, and their models include the estimated normalisation factors rather than taking pre-normalised count data as input.
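The simple library-size scaling described above can be illustrated with a minimal sketch. The counts are made-up numbers, the function names are hypothetical, and the library size is approximated here by the column sum of the count table (an assumption; in practice the total number of mapped reads is often used). It is not a substitute for TMM or for the normalisation built into DE packages.

```python
# Made-up count table: one row per gene, one column per sample.
raw_counts = {
    "geneA": [100, 200, 50],
    "geneB": [30, 60, 15],
    "geneC": [870, 1740, 435],
}


def library_sizes(counts):
    """Total count per sample (column sums of the count table)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def scale_by_library_size(counts):
    """Divide each count by (library size / mean library size) of its sample."""
    sizes = library_sizes(counts)
    mean_size = sum(sizes) / len(sizes)
    factors = [size / mean_size for size in sizes]
    return {
        gene: [count / factor for count, factor in zip(row, factors)]
        for gene, row in counts.items()
    }


normalised = scale_by_library_size(raw_counts)
```

In this toy table the three samples differ only in sequencing depth, so after scaling each gene has the same value in every sample.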

For statistical inference, sequencing count data were originally modelled with a Poisson distribution, but the latter only captures variance due to technical “sampling” effects, largely ignoring the excess variance caused by biological variability. Hence, new methods using a negative binomial distribution were developed, such as DESeq and edgeR (Figure 12).^162,163 These methods have the additional advantage that they use moderation to estimate the parameters of this distribution. That is, to estimate the variance for a locus they take into account that the variance typically depends on the mean expression, a trend which can be estimated empirically by borrowing information across all genes. Another method, called Limma, was originally created for DE analysis of micro-array data, which roughly follow a normal distribution. Limma-voom was then developed to allow application of normal-based statistics to sequencing count data by logarithmic transformation of the counts, in addition to modelling a mean-variance trend for moderated variance estimation.¹⁶⁴ Afterwards, parametric or non-parametric statistics can be used to test each gene for DE between conditions.
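The difference between the two count models can be made concrete. Under a Poisson model the variance equals the mean, whereas the negative binomial model is commonly parameterised as variance = μ + φμ², with φ the dispersion. The sketch below computes a naive per-gene method-of-moments estimate of φ from made-up replicate counts. The function name is hypothetical, and this is deliberately not the moderated estimator used by DESeq or edgeR, which borrow information across genes.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Naive method-of-moments estimate of phi in var = mu + phi * mu**2.

    Returns 0.0 when the sample variance does not exceed the mean, i.e.
    when a Poisson model already accounts for the observed spread.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # unbiased sample variance
    return max(0.0, (var - mu) / mu ** 2)


replicates = [80, 120, 95, 150, 55]  # made-up counts for one gene
phi = moment_dispersion(replicates)
```

With so few replicates such a per-gene estimate is very noisy, which is the motivation for the moderation step described above.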

Besides DE, differential methylation is also an interesting topic to study. As described in section 3.5, sequencing of DNA methylation is based on either bisulphite conversion or enrichment. In the former, methylation percentages can be calculated and compared. For bisulphite sequencing, the methylation level of a CpG can be determined by calculating the ratio of Cs over the sum of Cs and Ts, the latter corresponding to an unmethylated C.¹⁶⁵ For the Infinium HumanMethylation BeadChips, β-values are similarly defined as the ratio of the methylated probe intensity to the sum of the methylated and unmethylated probe intensities. M-values, on the other hand, are the binary logarithm of the ratio of methylated to unmethylated probe intensity.¹⁶⁶ After optional normalisation of M-values or similar values for bisulphite sequencing, methods such as Limma can be applied for detection of differential methylation.¹²⁹ As neighbouring probes are often correlated, methods that combine methylation levels of neighbouring CpGs were also developed, such as BiSeq.¹⁶⁷

For enrichment-based sequencing, be it to study enrichment of methylated CpGs, histone modifications or similar features, initial peak calling is common, as data cannot simply be summarised per gene or transcript as is done in RNA-seq data analysis. MACS is a peak calling algorithm that searches for positions in the genome enriched for mapped reads, taking into account several biases such as mapping bias and CNV.¹⁶⁸ By creating GTF files based on these identified peaks, strategies from RNA-seq DE analysis, such as HTSeq-count-based data summary followed by edgeR or DESeq, can also be used to study differential enrichment.¹²⁹
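The methylation measures defined above translate directly into a few lines of code. The sketch below follows the definitions in the text. The function names are hypothetical, and the `offset` parameter is an assumption added for illustration: a small constant is often added in practice to stabilise values at low intensities, and it defaults to zero here to match the definitions given above.

```python
import math


def bisulphite_methylation(c_reads, t_reads):
    """Methylation level at one cytosine: C / (C + T)."""
    total = c_reads + t_reads
    if total == 0:
        raise ValueError("no coverage at this position")
    return c_reads / total


def beta_value(meth, unmeth, offset=0.0):
    """Beta-value: methylated intensity over total intensity."""
    return meth / (meth + unmeth + offset)


def m_value(meth, unmeth, offset=0.0):
    """M-value: log2 ratio of methylated to unmethylated intensity."""
    return math.log2((meth + offset) / (unmeth + offset))


def beta_to_m(beta):
    """Convert a beta-value (without offset) to the matching M-value."""
    return math.log2(beta / (1.0 - beta))
```

For example, 3000 methylated and 1000 unmethylated intensity units give a β-value of 0.75 and an M-value of log2(3), which is also what `beta_to_m(0.75)` returns.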

Figure 12 Overview of the steps in a differential expression study.

After alignment and quality control, counts per gene or transcript are determined. Data is then normalised and differential expression is calculated. Adapted from ¹⁵³


5  Methods to detect allele-specific expression

The development of sequencing technologies, especially RNA sequencing, made it possible to study specific alleles at single-base resolution. By analysing polymorphic sites in the genome, the distinct effect of each allele on the transcriptome can be evaluated. In contrast to mouse experiments, where high levels of heterozygosity are achieved in single individuals by crossing different strains, the basic rationale of most human ASE studies is to identify allelic imbalance in RNA-seq data at heterozygous sites, which are identified by prior DNA genotyping. As ASE results in higher expression of one allele, the allelic ratio of RNA-seq data at a heterozygous SNP will be skewed towards higher counts of that allele.⁵⁰ Many methods based on this concept have already been developed and are primarily used for detection of cis-eQTLs.^169–171 However, most methods require DNA data for the identification of heterozygous loci, which increases the cost of the experiments. As two types of data are necessary, mostly small-scale studies have been performed so far, and statistical ASE detection methodologies focus on individual samples. Only a few methods allow ASE detection in larger populations or allow meta-analysis across individuals.^172,173 In early methods, allelic counts of the reference and variant allele in heterozygous individuals were compared with a binomial test.^174–176 However, at polymorphic sites, reads containing the reference allele will (in general) map more easily to the genome than reads containing the variant allele, and RNA-seq data may thus entail a mapping bias.
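A minimal sketch of such a binomial test on allelic counts at a single heterozygous site is given below. The function names are hypothetical and the two-sided p-value is computed by summing the probabilities of all outcomes that are at most as likely as the observed one. The `p_ref` parameter, the expected reference-allele fraction under the null hypothesis, is an assumption for illustration: 0.5 corresponds to balanced expression, and a slightly higher value could be used to reflect reference mapping bias.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials, via log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(p) + (n - k) * log(1.0 - p))


def allelic_imbalance_p_value(ref_count, alt_count, p_ref=0.5):
    """Two-sided exact binomial test for allelic imbalance at one site."""
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this site")
    pmf = [binomial_pmf(k, n, p_ref) for k in range(n + 1)]
    # Small relative tolerance so that ties are not lost to rounding.
    threshold = pmf[ref_count] * (1.0 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))
```

A balanced site such as 5 reference versus 5 variant reads gives a p-value of 1, whereas 90 versus 10 reads gives a very small p-value. Because the binomial model ignores overdispersion, such p-values tend to be too optimistic for real RNA-seq data.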

Furthermore, other technical artefacts, such as nucleotide composition, also increase technical variance in RNA-seq data. To address the overdispersion present in RNA-seq data, more elaborate statistical methods were developed. Allelic read counts are often modelled with a beta-binomial distribution, after which a binomial test or likelihood ratio test is used to test for allele-specific expression.^173,176–180 The beta-binomial distribution and modified binomial statistic reduce the impact of technical biases, yet do not entirely eliminate mapping bias. The latter particularly prevents correct cis-eQTL detection, as both mapping bias and cis-eQTLs result in higher expression of a specific allele. Many different techniques try to overcome the bias, such as remapping of the data or mapping to a polymorphism-aware reference genome.^173,181
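The effect of modelling overdispersion can be illustrated with a small beta-binomial sketch. All names are hypothetical. The distribution is parameterised by the expected reference fraction `pi` and an overdispersion `rho` between 0 and 1, so that the variance of the reference count is nπ(1−π)[1+(n−1)ρ]. Here `rho` is simply supplied by the user, which is an assumption: published methods estimate it from the data and may use a likelihood ratio test rather than the tail-probability test shown.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def shape_from_mean_rho(pi, rho):
    """Beta shape parameters for mean pi and overdispersion rho."""
    scale = (1.0 - rho) / rho
    return pi * scale, (1.0 - pi) * scale


def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability of k reference reads out of n."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def betabinom_p_value(ref_count, alt_count, pi=0.5, rho=0.05):
    """Two-sided test: sum of outcomes at most as likely as the observed."""
    n = ref_count + alt_count
    a, b = shape_from_mean_rho(pi, rho)
    pmf = [betabinom_pmf(k, n, a, b) for k in range(n + 1)]
    threshold = pmf[ref_count] * (1.0 + 1e-7)
    return min(1.0, sum(q for q in pmf if q <= threshold))
```

With 70 reference and 30 variant reads, a plain binomial test would call strong imbalance, whereas this sketch with `rho=0.05` returns a much larger p-value, because part of the skew is attributed to overdispersion.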

Imprinting is an extreme case of ASE, where one allele is (almost) completely silenced and only a single allele is expressed. However, partial imprinting is sometimes present, leaving the imprinted allele slightly expressed. In mice, imprinted genes are usually identified using genome and transcriptome data from reciprocally crossed subjects (Figure 13).^182–185 In humans, trio data to infer parental genotypes are generally used. By validating the parental origin of the single expressed allele at heterozygous sites, many human imprinted genes were identified.^186–188 As trio data are not always available (especially when studying tissue-specific imprinting), other methods that combine genotyping and expression data were developed. Without parental data, imprinting is, however, difficult to distinguish from other ASE effects.

Baran et al. classified the degree of allelic imbalance at heterozygous sites and only those with the highest degree were deemed imprinted.¹⁸⁹ Here, again, genotyping data are necessary to only retain heterozygous sites. Another study employed a Bayesian model with allele frequencies derived from dbSNP to detect imprinting in mRNA-seq data from multiple samples.¹⁹⁰

As imprinting is epigenetically regulated, studying epigenetic marks is an alternative approach for the detection of imprinted genes.¹⁹¹ Monoallelic expression is regulated by allele-specific methylation, and detection of parent-of-origin-specific differential methylation between alleles in, for example, bisulphite sequencing data is thus representative of imprinting. This approach revealed known as well as novel candidate imprinted genes.^192–194 Chromatin modifications also have a key role in normal imprinting regulation, and various chromatin marks have already been analysed to identify imprinting genome-wide.^191,195

Figure 13 Detection of imprinting in mice using reciprocal crosses. Two inbred parental strains, strain 1 and 2, produce genetically identical, but phenotypically different offspring. As the parental genotypes are known, the distinct offspring’s phenotypes show that coat colour is maternally expressed.¹⁷⁹

In 2014, a method to study genome-wide monoallelic methylation in MBD-seq data was developed at the BIOBIX lab.¹⁹⁶ By screening for deviation from biallelic methylation (indicated by heterozygous samples) in large datasets instead of single samples, monoallelically methylated sites (apparently homozygous) were detected.

The method is based on a population genetics theorem, the Hardy-Weinberg Equilibrium (HWE). The theory states that in a randomly mating population genotype and allele frequencies remain constant, so that the expected genotype frequencies can be estimated from the allele frequencies. For monoallelically methylated sites, fewer heterozygous samples will occur than expected under HWE, as only one allele is methylated and hence only apparently homozygous samples will be observed in MBD-seq data. Firstly, the genotypes were estimated from the MBD-seq data in an iterative way, improving the allele and genotype frequency estimates in every iteration.
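The Hardy-Weinberg expectation can be sketched in a few lines. For allele frequencies p and q = 1 − p, the expected genotype frequencies are p², 2pq and q², so the expected number of heterozygotes among n samples is 2pqn. The sketch below takes genotype counts as given and uses a one-sided binomial test for a heterozygote deficit. Both choices are assumptions for illustration, as are all function names: the published method estimates genotypes iteratively from MBD-seq data, and its exact test statistic is not reproduced here.

```python
from math import exp, lgamma, log


def allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Reference allele frequency p estimated from genotype counts."""
    n = n_hom_ref + n_het + n_hom_alt
    return (2 * n_hom_ref + n_het) / (2 * n)


def expected_heterozygotes(p, n_samples):
    """Expected heterozygote count under HWE: 2 * p * q * n."""
    return 2.0 * p * (1.0 - p) * n_samples


def het_deficit_p_value(n_het, n_samples, p):
    """One-sided binomial P(X <= n_het) with success probability 2pq."""
    het_prob = 2.0 * p * (1.0 - p)
    if het_prob == 0.0:
        return 1.0  # monomorphic site: no heterozygotes expected
    total = 0.0
    for k in range(n_het + 1):
        log_choose = (lgamma(n_samples + 1) - lgamma(k + 1)
                      - lgamma(n_samples - k + 1))
        total += exp(log_choose + k * log(het_prob)
                     + (n_samples - k) * log(1.0 - het_prob))
    return min(1.0, total)


# Made-up locus: 60 and 40 apparent homozygotes, no heterozygotes.
p = allele_frequency(60, 0, 40)
expected = expected_heterozygotes(p, 100)
p_value = het_deficit_p_value(0, 100, p)
```

In this made-up example 48 heterozygotes are expected but none are observed, so the p-value is extremely small. A locus with 36, 48 and 16 samples per genotype matches the expectation and yields a non-significant result.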

Afterwards, the allele frequencies were used to estimate the number of heterozygotes expected under HWE. The observed number of heterozygotes could then be compared to the expected number, and when significantly fewer heterozygous samples were observed, the locus was called monoallelically methylated. Contrary to previous methylation studies, no expensive bisulphite sequencing was necessary. Yet studying methylation differences only offers a limited view of imprinting, as methylation is not the sole epigenetic regulator. Furthermore, a difficult dataset with mixed tissues was studied, and only monoallelic methylation, representing imprinting, was taken into account. To date, no genome-wide methods exist that solely use RNA-seq data and can distinguish between multiple types of ASE. Moreover, most methods focus on deviation from biallelic expression in individual samples or small sets of samples, which may result from technical artefacts, and ignore how population-level information can improve detection. Furthermore, no large-scale experiments have studied ASE aberrations in diseased tissue, as most studies focus on single loci instead of on genome-wide deregulation of ASE.¹⁹⁷ A comprehensive overview of imprinted genes as well as other ASE genes in distinct tissues, and of their deregulation in disease, is therefore still lacking.


DETECTION OF ALLELE-SPECIFIC EXPRESSION

Adapted from:

Tine Goovaerts, Sandra Steyaert, Chari Vandenbussche, Jeroen Galle, Oliver Thas, Wim Van Criekinge, Tim De Meyer.

A comprehensive overview of genomic imprinting in breast and its deregulation in cancer. Nature Communications | (2018) 9:4120 | DOI: 10.1038/s41467-018-06566-7

*These authors contributed equally

6  A comprehensive overview of imprinting in breast tissue and its deregulation in breast cancer
