Normalization and differential analysis of RNA-seq data

(1)

Normalization and differential analysis of RNA-seq data

Nathalie Villa-Vialaneix INRA, Toulouse, MIAT

(Mathématiques et Informatique Appliquées de Toulouse)

[email protected] http://www.nathalievilla.org

SPS Summer School 2016

From gene expression to genomic network

(2)

Outline

1 Normalization

2 Differential expression analysis

Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data

(3)

A typical transcriptomic experiment

(4)

A typical transcriptomic experiment

(5)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads)

skewed data

overdispersion (variancemean)

(6)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads) skewed data

(7)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads) skewed data

black line is

“variance = mean”

(8)

Steps in RNAseq data analysis

(9)

Part I: Normalization

(10)

Purpose of normalization

identify and correct technical biases (due to sequencing process) to make counts comparable

types of normalization: within sample normalization and between sample normalization

(11)

Source of variation in RNA-seq experiments

1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)

2 at the middle layer:technical variations(library effect)

3 at the bottom layer:technical variations(lane and cell flow effects)

lane effect<cell flow effect<library effectbiological effect

(12)

Source of variation in RNA-seq experiments

1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)

2 at the middle layer:technical variations(library effect)

3 at the bottom layer:technical variations(lane and cell flow effects)

(13)

Within sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts for gene B are twice larger than counts for gene Abecause:

Purpose of within sample comparison: enabling comparisons of genes from a same sample

Sources of variability: gene length, sequence composition (GC content)

These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.

(14)

Within sample normalization

gene A 752 615 1203

gene B 1507 1225 2455

gene B is expressed with a number of transcripts twice larger than gene A

gene A gene B

(15)

Within sample normalization

gene A 752 615 1203

gene B 1507 1225 2455

both genes are expressed with the same number of transcripts but gene B is twice longer than gene A

gene A gene B

(16)

Within sample normalization

(17)

Between sample normalization

gene A 752 615 1203

gene B 1507 1225 2455

counts in sample 3 are much larger than counts in sample 2because:

Purpose of between sample comparison: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differencesmust be corrected for a differential analysisand for data interpretation.

(18)

Between sample normalization

gene A 752 615 1203

gene B 1507 1225 2455

gene A is more expressed in sample 3 than in sample 2

gene A in sample 2 gene A in sample 3

(19)

Between sample normalization

gene A 752 615 1203

gene B 1507 1225 2455

gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes)

gene A in sample 2 gene A in sample 3

(20)

Between sample normalization

(21)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10⁵

Definition

IfK_gjis the raw count for genegin samplej then, the normalized counts is defined as:

Ke_gj = ^K^gj s_j

in whichs_j=C_j⁻¹ is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(22)

Principles for sequencing depth normalization

Basics

In practice: Raw counts correspond to different sequencing depths

control treated

Gene 1 5 1 0 0 4 0 0

Gene 2 0 2 1 2 1 0 0

Gene 3 92 161 76 70 140 88 70

: : :

Gene G 15 25 9 5 20 14 17

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10⁵

Definition

Ke_gj = ^K^gj s_j

(23)

Principles for sequencing depth normalization

Basics

In practice: A correction multiplicative factor is calculated for every sample

control treated

Gene 1 5 1 0 0 4 0 0

Gene 2 0 2 1 2 1 0 0

Gene 3 92 161 76 70 140 88 70

: : :

Gene G 15 25 9 5 20 14 17

1.1 1.6 0.6 0.7 1.4 0.7 0.8

C_j

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10⁵

Definition

Ke_gj = ^K^gj s_j

(24)

Principles for sequencing depth normalization

Basics

In practice: Every counts is multiplied by the correction factor corresponding to its sample

Gene 3 92 161 76 70 140 88 70

1.1 1.6 0.6 0.7 1.4 0.7 0.8

x

Gene 3 101.2 257.6 45.6 49 196 61.6 56

Cj

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10⁵

Definition

Ke_gj = ^K^gj s_j

(25)

Principles for sequencing depth normalization

Basics

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 10⁵

Definition

Ke_gj = ^K^gj s_j

(26)

Principles for sequencing depth normalization

Definition

Ke_gj = ^K^gj s_j

in whichs_j=C_j⁻¹ is the scaling factor for samplej.

Three types of methods: distribution adjustment

(27)

Principles for sequencing depth normalization

Definition

Ke_gj = ^K^gj s_j

in whichs_j=C_j⁻¹ is the scaling factor for samplej.

Three types of methods:

(28)

Distribution adjustment

Total read count adjustment[Mortazavi et al., 2008]

s_j= ^D^j

1 N

PN l=1D_l in whichN is the number of samples andD_j=P

gK_gj.

raw counts normalized counts

0 5 10 15

0 250 500 750 1000 0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

edgeR:

cpm (... ,

normalized . lib . sizes = TRUE )

(29)

Distribution adjustment

Total read count adjustment[Mortazavi et al., 2008]

(Upper) Quartile normalization[Bullard et al., 2010]

s_j=

Q_j^(p)

1P_N

l=1Q_l^(p)

in whichQ_j^(p)is a given quantile (generally 3rd quartile) of the count distribution in samplej.

0 5 10 15 20

0 250 500 750 1000 0 250 500 750 1000

0 5 10 15 20

0 250 500 750 1000 0 250 500 750 1000

edgeR:

calcNormFactors (... , method = " upperquartile " , p = 0.75)

(30)

Method using gene lengths (intra & inter sample normalization)

RPKM: Reads Per Kilobase per Million mapped Reads

Assumptions: read counts are proportional to expression level, transcript length and sequencing depth

s_j = ^D^j^L^g 10³×10⁶ in whichL_g is gene length (bp).

edgeR:

rpkm (... , gene .length = ...)

(31)

Relative Log Expression (RLE)

[Anders and Huber, 2010],edgeR-DESeq-DESeq2

Method:

1 compute apseudo-reference sample: geometric mean across samples

R_g =





 YN

j=1

K_gj







1/N

(geometric mean is less sensitive to extreme values than standard mean)

0 5 10 15

0 250 500 750 1000

2 center samples compared to the reference

3 calculate normalization factor: median of centered counts over the genes

(32)

Relative Log Expression (RLE)

Method:

1 compute apseudo-reference sample

2 center samples compared to the reference K˜_gj= ^K^gj

R_g with R_g =







N

Y

j=1

K_gj







1/N

1 2 3 4

0 250 500 750 1000

count / geo. mean ^Samples

Sample 1 Sample 2

(33)

Relative Log Expression (RLE)

Method:

˜

s_j=^median_g n K˜_gjo

factors multiply to 1: s_j = ˜s_j exp₁

N

PN

l=1log(˜s_l)

1 2 3 4

0 250 500 750 1000

count / geo. mean ^Samples

Sample 1 Sample 2

with

K˜_gj = ^K^gj R_g and

R_g =







N

Y

j=1

K_gj







1/N

(34)

Relative Log Expression (RLE)

Method:

0 5 10 15

0 250 500 750 1000

## with edgeR

calcNormFactors (... , method =" RLE ")

## with DESeq

estimateSizeFactors (...)

(35)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

w_g(j,r)M_g(j,r) P

g:not trimmed

w_g(j,r) withw_g(j,r) =

Dj−Kgj

DjKgj + ^D_D^r^−K^gr

rKgr

.

(36)

Trimmed Mean of M-values (TMM)

Assumptions behind the method

⇒remove extreme data for fold-changed (M) and average intensity (A) M_g(j,r) =log₂ K_gj

D_j

!

−log₂ K_gr D_r

!

A_g(j,r) = ¹ 2

"

log₂ K_gj D_j

!

+log₂ K_gr D_r

!#

select as a reference sample, the sampler with the upper quartile closest to the average upper quartile

−1 0 1 2

M(j,r)

None

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

TMM(j,r) = P

g:not trimmed

w_g(j,r)M_g(j,r) P

g:not trimmed

Dj−Kgj

rKgr

.

(37)

Trimmed Mean of M-values (TMM)

Assumptions behind the method

D_j

!

−log₂ K_gr D_r

!

A_g(j,r) = ¹ 2

"

log₂ K_gj D_j

!

+log₂ K_gr D_r

!#

Trim 30% on M-values

−3

−2

−1 0 1 2

−20 −15 −10 −5

M(j,r)

None 30% of Mg

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

TMM(j,r) = P

g:not trimmed

w_g(j,r)M_g(j,r) P

g:not trimmed

Dj−Kgj

rKgr

.

(38)

Trimmed Mean of M-values (TMM)

Assumptions behind the method

D_j

!

−log₂ K_gr D_r

!

A_g(j,r) = ¹ 2

"

log₂ K_gj D_j

!

+log₂ K_gr D_r

!#

Trim 5% on A-values −1

0 1 2

M(j,r)

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

TMM(j,r) = P

g:not trimmed

w_g(j,r)M_g(j,r) P

g:not trimmed

Dj−Kgj

rKgr

.

(39)

Trimmed Mean of M-values (TMM)

Assumptions behind the method

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

TMM(j,r) = P

g:not trimmed

w_g(j,r)M_g(j,r) P

g:not trimmed

Dj−Kgj

rKgr

.

(40)

Trimmed Mean of M-values (TMM)

Assumptions behind the method

most genes are not differentially expressed Correction factors:

˜

s_j=2^TMM(j,r) factors multiply to 1: s_j= ˜s_j exp₁

N

PN

l=1log(˜s_l) calcNormFactors (... , method =" TMM ")

(41)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Purpose of the comparison:

finding the “best” method for all cases is not a realistic purpose

find an approach which isrobust enoughto provide relevant results in all cases

Method: comparison based on several criteria to select a method which is valid for multiple objectives

(42)

Comparison of the different approaches

Effect on count distribution:

(43)

Comparison of the different approaches

Effect on differential analysis(DESeq v. 1.6):

Inflated FPR for all methods except for TMM and DESeq (RLE).

(44)

Comparison of the different approaches

Conclusion: Differences appear based on data characteristics

TMM and DESeq (RLE) are performant in a differential analysis context.

(45)

Practical session

import and understand data;

run different types of normalization;

compare the results...

(46)

Part II: Differential expression analysis

(47)

Different steps in hypothesis testing

1 formulate anhypothesis H₀:

H₀: the average count for genegin the control samples is the same that the average count in the treated samples

which is tested against an alternative H₁: the average count for gene gin the control samples is different from the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H₀

4 deduce the probability that the observations occur under H₀: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H₀is unlikely: we say that “H₀ is rejected”. We have that:α=PH0(H₀is rejected).

(48)

Different steps in hypothesis testing

(49)

Different steps in hypothesis testing

(50)

Different steps in hypothesis testing

(51)

Different steps in hypothesis testing

5 conclude: if the p-value is low (usually belowα=5% as a convention), H₀is unlikely: we say that “H₀ is rejected”.

We have that:α=PH0(H₀is rejected).

(52)

Summary of the possible decisions

(53)

Types of errors in tests

Reality

H₀ is true H₀ is false

Decision

Do not reject H₀ Correct decision Type II error ,(True Negative) /(False Negative) Reject H₀ Type I error Correct decision

/(False Positive) ,(True Positive)

P(Type I error) =α(risk) P(Type II error) =1−β(β: power)

(54)

Why performing a large number of tests might be a problem?

Framework: Suppose you are performingGtests at levelα. P(at least one FP if H₀is always true) =1−(1−α)^G Ex: forα=5% andG=20,

P(at least one FP if H₀is always true)'64%!!!

Probability to have at least one false positive versus the number of tests performed when H₀is true for allGtests

0.25 0.50 0.75 1.00

0 25 50 75 100

Number of tests

Probability

For more than 75 tests and if H₀ is always true, the probability to have at least one false positive is very close to 100%!

(55)

Why performing a large number of tests might be a problem?

Framework: Suppose you are performingGtests at levelα. P(at least one FP if H₀is always true) =1−(1−α)^G Ex: forα=5% andG=20,

P(at least one FP if H₀is always true)'64%!!!

Probability to have at least one false positive versus the number of tests performed when H₀is true for allGtests

0.25 0.50 0.75 1.00

Probability

For more than 75 tests and if H₀ is always true, the probability to have at least one false positive is very close to 100%!

(56)

Notation for multiple tests

Number of decisions forGindependent tests:

True null False null Total hypotheses hypotheses

Rejected U V R

Not rejected G₀−U G₁−V G−R

Total G₀ G₁ G

Instead of the riskα, control:

familywise error rate (FWER): FWER=P(U >0)(i.e., probability to have at least one false positive decision)

false discovery rate (FDR): FDR=E(Q)with Q=











U/R ifR >0 0 otherwise

(57)

Notation for multiple tests

Number of decisions forGindependent tests:

True null False null Total hypotheses hypotheses

Rejected U V R

Not rejected G₀−U G₁−V G−R

Total G₀ G₁ G

Instead of the riskα, control:

familywise error rate (FWER): FWER=P(U >0)(i.e., probability to have at least one false positive decision)

false discovery rate (FDR): FDR=E(Q)with Q =











U/R ifR >0 0 otherwise

(58)

Adjusted p-values

Settings: p-valuesp₁, . . . ,p_G (e.g., corresponding toGtests onG different genes)

Adjusted p-values

adjusted p-values are˜p₁, . . . ,˜p_G such that

Rejecting tests such that p˜_g < α ⇐⇒ P(U >0)≤α or E(Q)≤α

Calculating p-values

1 order the p-valuesp₍₁₎≤p₍₂₎≤. . .≤p_(G)

2 calculate˜p_(g)=a_gp_(g)

I withBonferronimethod:ag=G(FWER)

I withBenjamini & Hochbergmethod:ag=G/g(FDR)

3 if adjusted p-valuesp˜_(g)are larger than 1, correct˜p_(g)←min{˜p_(g),1}

(59)

Adjusted p-values

Rejecting tests such that p˜_g < α ⇐⇒ P(U >0)≤α or E(Q)≤α Calculating p-values

(60)

Adjusted p-values

(61)

Adjusted p-values

3 if adjusted p-values˜p_(g)are larger than 1, correct˜p_(g)←min{˜p_(g),1}

(62)

Fisher’s exact test for contingency tables

After normalization, one may build a contingency table like this one:

treated control Total

geneg n_gA n_gB n_g

other genes N_A−n_gA N_B−n_gB N−n_g

Total N_A N_B N

Question: is the number of reads of genegin the treated sample significatively different than in the control sample?

Method

Direct calculation of the probability to obtain such a contingency table (or a

“more extreme” contingency table) with:

independency between the two columns of the contingency tables; the same marginals (“Total”).

(63)

Fisher’s exact test for contingency tables

After normalization, one may build a contingency table like this one:

treated control Total

geneg n_gA n_gB n_g

other genes N_A−n_gA N_B−n_gB N−n_g

Total N_A N_B N

Question: is the number of reads of genegin the treated sample significatively different than in the control sample?

Method

Direct calculation of the probability to obtain such a contingency table (or a

“more extreme” contingency table) with:

independency between the two columns of the contingency tables;

the same marginals (“Total”).

(64)

Example of results obtained with the Fisher test

Genes declared significantly differentially expressed are in pink:

−6

−3 0 3 6

0 5 10 15

mean log2 expression log2 fold change

Main remark: more

conservative for genes with a low expression

Limitation of Fisher test

Highly expressed genes have a very large variance! As Fisher test does not estimate variance, ittends to detect false positives among highly expressed genes⇒do not use it!

(65)

Example of results obtained with the Fisher test

Genes declared significantly differentially expressed are in pink:

−6

−3 0 3 6

0 5 10 15

mean log2 expression log2 fold change

Main remark: more

conservative for genes with a low expression

Limitation of Fisher test

Highly expressed genes have a very large variance! As Fisher test does not estimate variance, ittends to detect false positives among highly expressed genes⇒do not use it!

(66)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for geneg,K_g1¹ , ...,K_gn¹

1(condition 1) andK_g1² , ...,K_gn²

2

(condition 2)

choose an appropriate distributionto model count data (discrete data, overdispersion)

K_gj^k ∼NB(s_j^kλ_gk, φ_g) in which:

I s_j^kis library size of samplejin conditionk

I λgk is the proportion of counts for genegin conditionk

I φgis the dispersion of geneg(supposed to be identical for all samples)

estimate its parametersfor both conditions

λg1 λg2 φg

concludeby calculating p-value

⇒Test H0:{λ_g1 =λ_g2}

(67)

Basic principles of tests for count data: 2 conditions and replicates

2

(condition 2)

I φgis the dispersion of geneg(supposed to be identical for all samples) estimate its parametersfor both conditions

λg1 λg2 φg

(68)

Basic principles of tests for count data: 2 conditions and replicates

2

(condition 2)

λg1 λg2 φg

(69)

Basic principles of tests for count data: 2 conditions and replicates

2

(condition 2)

λg1 λg2 φg

concludeby calculating p-value⇒Test H0:{λ_g1 =λ_g2}

(70)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalizationis performed to get equal size librairies⇒s

K_g1¹ +. . .+K_gn¹₁∼NB(sλ_g1, φ_g/n₁)(and similarly for the second condition)

1 λ_g1 andλ_g2 are estimated (mean of the distributions)

2 φ_g is estimated independently ofλ_g1andλ_g2, using different approaches to account for small sample size

3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters)

(71)

First method: Exact Negative Binomial test

(72)

First method: Exact Negative Binomial test

(73)

First method: Exact Negative Binomial test

(74)

Estimating the dispersion parameter φ

_g

Some methods:

DESeq,DESeq2: φ_g is a smooth function ofλ_g =λ_g1 =λ_g2

dge <- estimateDispersion ( dge )

5e−01 5e+00 5e+01 5e+02

0.010.050.505.00

mean of normalized counts

dispersion

gene−est fitted final

edgeR: estimate a common dispersion parameter for all genes and use it as a prior in a Bayesian approach to estimate a gene specific dispersion parameter

dge <- estimateCommonDisp ( dge ) dge <- estimateTagwiseDisp ( dge )

(75)

Perform the test

Some methods:

DESeq,DESeq2: exact (DESeq) or approximate (Wald and LR in DESeq2) tests

res <- nbinomWaldTest ( dge ) results ( res )

res <- nbinomLR ( dge ) results ( res )

edgeR: exact tests

res <- exactTest ( dge ) topTags ( res )

(comparison between methods in [Zhang et al., 2014])

(76)

More complex experiments: GLM

Framework:

K_gj ∼NB(µ_gj, φ_g) with log(µ_gj) =log(s_j) +log(λ_gj) in which:

s_jis the library size for samplej;

log(λ_gj)is estimated (for instance) by aGeneralized Linear Model (GLM):

log(λ_gj) =λ₀+x^>_j β_g in whichx_iis a vector of covariates.

GLM allows to decompose the effects on the mean of different factors

their interactions

(77)

More complex experiments: GLM

Framework:

their interactions

(78)

More complex experiments: GLM

Framework:

(79)

More complex experiments: GLM in practice

edgeR

dge <- estimateDisp (dge , design ) fit <- glmFit (dge , design )

res <- glmRT (fit , ...) topTags ( res )

DESeq,DESeq2

dge <- newCountDataSet ( counts , design ) dge <- estimateSizeFactors ( dge )

dge <- estimateDispersions ( dge )

fit <- fitNbinomGLMs (dge , count ~ ...) fit0 <- fitNbinomGLMs (dge , count ~ 1) res <- nbinomGLMTest (fit , fit0 )

p. adjust (res , method = " BH ")