Biostatistical analyses of RNA-seq data

Nathalie Vialaneix (MIAT, INRA)

in collaboration with Ignacio Gonzàles and Annick Moisan

nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu

Toulouse, 4-5 November 2020

Outline

1 Exploratory analysis

Introduction

Experimental design

Data exploration and quality assessment

2 Normalization

Raw data filtering

Interpreting read counts

3 Differential Expression analysis

Hypothesis testing and correction for multiple tests

Differential expression analysis for RNAseq data

Interpreting and improving the analysis

A typical transcriptomic experiment

A typical RNA-seq experiment

1 collect samples

2 generate libraries of cDNA fragments

3 assign material to one lane in one flow cell

Steps in RNAseq data analysis

Part I: Experimental design

Confounded effects: a simple example

Basic experiment: find differences between control and treated plants.

A bad experimental design: grow all control group plants in one field and all treated group plants in another field (Field 1: control plants only; Field 2: treated plants only). Differences due to the field cannot be distinguished from differences due to the treatment ⇒ confounded effects.

A good experimental design: grow half of the control group plants (chosen at random) and half of the treated group plants in one field, and the rest in the other field. Differences due to the field and differences due to the treatment can then be separated.

In summary, what is a good experimental design?

Experimental designs are usually not as simple as this example: they can include multiple experimental factors (day of experiment, flow cell, ...) and multiple covariates (sex, parents, ...).

⇒ The experimental design must be carefully thought out before starting the experiment, and confounded effects must be searched for in a systematic manner.
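A systematic search for confounded effects can be sketched as a cross-tabulation of the design factors. The sketch below is illustrative only: the design tables and the column names `treatment` and `field` are hypothetical, and in a real experiment plants would be assigned to fields at random.

```r
# Hypothetical design tables: 8 plants, 2 treatments, 2 fields.
# Column names (treatment, field) are illustrative assumptions.
bad_design <- data.frame(
  treatment = rep(c("control", "treated"), each = 4),
  field     = rep(c("field1", "field2"), each = 4)
)
good_design <- data.frame(
  treatment = rep(c("control", "treated"), each = 4),
  field     = rep(c("field1", "field2"), times = 4)
)

# Two factors are fully confounded when every level of the first factor
# occurs together with a single level of the second one.
is_confounded <- function(design, a, b) {
  tab <- table(design[[a]], design[[b]])
  all(rowSums(tab > 0) == 1)
}

bad_is_confounded  <- is_confounded(bad_design, "treatment", "field")
good_is_confounded <- is_confounded(good_design, "treatment", "field")
```

Applying the same check to every pair of factors and covariates in the design table is one way to make the search systematic.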

Effect & Variation

2 conditions, 2 genes whose expression distribution is:

first gene: different median levels between the two groups but large variance: differences may not be significant

second gene: different median levels between the two groups but very small variance: differences may be significant

Source of variation in RNA-seq experiments

1 at the top layer: biological variations (i.e., individual differences due to, e.g., environmental or genetic factors)

2 at the middle layer: technical variations (library preparation effect)

3 at the bottom layer: technical variations (lane and flow cell effects)

lane effect < flow cell effect < library preparation effect ≪ biological effect

Advised policy for biological/technical replicates

biological replicates: different biological samples, processed separately (advised: ≥3 per condition)

technical replicates: same biological material but independent replications of the technical steps (from library preparation; not mandatory in most cases)

6 biological replicates are better than 3 biological replicates × 2 technical replicates! See [Liu et al., 2014] for further discussions.


In summary: a simple two-condition experimental design for RNA-seq

Typical RNA-seq design: independent samples are sequenced in different lanes (the lane effect cannot be estimated but within-sample comparison is preserved).

[Figure: RNA extraction, library preparation, then one sample per lane (Lane 1, Lane 2)]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate:

              Lane 1   Lane 2   Lane 3   Lane 4
Flow cell 1   C11      C21      T11      T21
Flow cell 2   C12      C22      T12      T22
Flow cell 3   C13      C23      T13      T23
Flow cell 4   C14      C24      T14      T24

Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane.

[Figure: RNA extraction, library preparation and barcoding (barcoded adapters), pooling, then Lane 1, Lane 2]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, split across 4 flow cells:

Flow cell 1
  Lane 1: C111 C211 T111 T211
  Lane 2: C121 C221 T121 T221
  Lane 3: C131 C231 T131 T231
  Lane 4: C141 C241 T141 T241

Flow cell 2
  Lane 1: C112 C212 T112 T212
  Lane 2: C122 C222 T122 T222
  Lane 3: C132 C232 T132 T232
  Lane 4: C142 C242 T142 T242

Flow cell 3
  Lane 1: C113 C213 T113 T213
  Lane 2: C123 C223 T123 T223
  Lane 3: C133 C233 T133 T233
  Lane 4: C143 C243 T143 T243

Flow cell 4
  Lane 1: C114 C214 T114 T214
  Lane 2: C124 C224 T124 T224
  Lane 3: C134 C234 T134 T234
  Lane 4: C144 C244 T144 T244


Part II: Exploratory analysis

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads)

skewed data

overdispersion (variance ≫ mean)

[Figure: variance against mean; the black line is “variance = mean”]
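Overdispersion can be checked by comparing the per-gene variance to the per-gene mean. The sketch below uses simulated negative binomial counts as a stand-in for real data; the simulation settings (`n_genes`, `n_samples`, `size = 2`) are arbitrary assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated counts (assumption: negative binomial, 500 genes, 12 samples).
n_genes   <- 500
n_samples <- 12
mu <- runif(n_genes, min = 50, max = 5000)   # one expected count per gene
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 2),
                 nrow = n_genes, ncol = n_samples)

# Per-gene mean and variance across samples.
gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

# Proportion of genes lying above the "variance = mean" line.
prop_overdispersed <- mean(gene_var > gene_mean)
```

Plotting `log2(gene_var)` against `log2(gene_mean)` and adding the line `abline(0, 1)` gives the kind of display described above.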

Dataset used in the examples

Dataset provided by courtesy of the transcriptomic platform of IPS2

Three files:

D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples; gene names have been shuffled);

D1-genesLength.txt contains information about gene lengths;

D1-targets.txt contains information about the samples and the experimental design.

These text files are loaded with:

# raw counts: genes in rows (names taken from the first column), samples in columns
raw_counts <- read.table("D1-counts.txt", header = TRUE, row.names = 1)
raw_counts <- as.matrix(raw_counts)
# description of the samples and of the experimental design
design <- read.table("D1-targets.txt", header = TRUE, stringsAsFactors = FALSE)
# gene lengths, read as a plain vector
gene_lengths <- scan("D1-genesLength.txt")

Count distribution

The count distribution (i.e., the number of times a given count is obtained in the data) can be visualized with histograms (boxplots or violin plots can also be used).

This distribution is highly skewed and is better visualized after a log₂ transformation.

Library size

The library size is the sum of all counts in a given sample.
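As a minimal sketch of these two quantities, the code below computes log₂-transformed counts and library sizes on a small made-up matrix. The matrix `counts` is hypothetical, and the pseudo-count of 1 added before the log is an assumption made to handle zero counts.

```r
# Hypothetical toy count matrix: 4 genes (rows) x 2 samples (columns).
counts <- matrix(c(0, 10, 200, 5,
                   0, 40, 1500, 12),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("sample1", "sample2")))

# log2 transformation, with a pseudo-count so that zero counts stay defined.
log_counts <- log2(counts + 1)

# Library size: sum of all counts in each sample.
lib_size <- colSums(counts)

# Histogram bins for one sample; drop plot = FALSE to draw the histogram.
h <- hist(log_counts[, "sample1"], plot = FALSE)
```

With the real data, `boxplot(log_counts)` gives the parallel boxplots mentioned for comparing samples.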

Count distribution between samples

The count distributions of different samples can be compared with parallel boxplots or violin plots.

It is expected that, within a given condition (group), the count distributions are similar. The same is often also expected between conditions.

Check reproducibility between samples

MA plots can be used to visualize reproducibility between samples of an experiment (and thus check whether normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values):

M-values: log of the ratio between the counts of two samples:

$M_g = \log_2(K_{gj}) - \log_2(K_{gj'})$

A-values: average log counts of the two samples:

$A_g = \frac{\log_2(K_{gj}) + \log_2(K_{gj'})}{2}$

where $K_{gj}$ stands for the count for gene $g$ in sample $j$.
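The M- and A-values can be computed directly from two count vectors. The helper `ma_values` below is a hypothetical name, and dropping genes with a zero count in either sample is an assumption made so that the logarithms stay finite.

```r
# M- and A-values between two samples (vectors of counts over the same genes).
ma_values <- function(k_j, k_jprime) {
  keep <- k_j > 0 & k_jprime > 0          # assumption: drop zero counts
  log_j      <- log2(k_j[keep])
  log_jprime <- log2(k_jprime[keep])
  data.frame(M = log_j - log_jprime,        # log-fold change
             A = (log_j + log_jprime) / 2)  # log-average
}

ma <- ma_values(k_j = c(8, 4, 0, 16), k_jprime = c(2, 4, 5, 64))
# plot(ma$A, ma$M); abline(h = 0) would draw the MA plot.
```

For reproducible samples, the points are expected to scatter around the horizontal line M = 0.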

Check similarity between samples

Similarities between samples can be visualized with a HAC and a heatmap:

perform hierarchical agglomerative clustering (HAC) using the Euclidean distance between samples:

$\delta(j, j') = \sqrt{\sum_g \left( \log_2(K_{gj}) - \log_2(K_{gj'}) \right)^2}$

visualize the strength of the similarity with a heatmap.
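A minimal sketch of this step with base R, on simulated counts. The data are made up, the pseudo-count of 1 is an assumption to avoid log₂(0), and the linkage (the default of `hclust`) is also an assumption since the text does not specify one.

```r
set.seed(2)

# Simulated counts: 50 genes x 6 samples (illustrative data only).
counts <- matrix(rpois(6 * 50, lambda = 100), nrow = 50,
                 dimnames = list(NULL, paste0("sample", 1:6)))
log_counts <- log2(counts + 1)

# Euclidean distance between samples (dist works on rows, hence t()).
sample_dist <- dist(t(log_counts), method = "euclidean")

# Hierarchical clustering of the samples.
tree <- hclust(sample_dist)

# Same distance computed by hand for samples 1 and 2.
manual_d12 <- sqrt(sum((log_counts[, 1] - log_counts[, 2])^2))
# plot(tree) and heatmap(as.matrix(sample_dist)) would display the results.
```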

Search for the main structures in the data: PCA

PCA (on log₂ counts) can be used to project data into a low-dimensional space and search for unexpected experimental effects in the data.

(MDS is equivalent to PCA when used with the standard Euclidean distance)

Remark: In DESeq, the function plotPCA performs PCA on the top genes with the highest variance.
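The sketch below runs a PCA with base R (`prcomp`) on simulated log₂ counts, restricted to the most variable genes, and compares it with classical MDS. The data, the number of retained genes (50) and the pseudo-count are assumptions for illustration; this is not the code of `plotPCA`.

```r
set.seed(3)

# Simulated counts: 200 genes x 6 samples (illustrative data only).
n_genes <- 200
counts <- matrix(rnbinom(n_genes * 6, mu = 100, size = 5), nrow = n_genes,
                 dimnames = list(paste0("gene", 1:n_genes),
                                 paste0("sample", 1:6)))
log_counts <- log2(counts + 1)

# Keep the 50 genes with the highest variance (arbitrary choice).
gene_var <- apply(log_counts, 1, var)
top <- order(gene_var, decreasing = TRUE)[1:50]

# PCA with samples as observations (rows) and genes as variables.
pca <- prcomp(t(log_counts[top, ]), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
scores <- pca$x[, 1:2]

# Classical MDS on Euclidean distances: same coordinates up to axis signs.
mds <- cmdscale(dist(t(log_counts[top, ])), k = 2)
```

Plotting `scores` with one color per condition is a quick way to spot unexpected groupings such as batch or flow cell effects.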

Raw data filtering

Filtering consists in removing genes with low expression. Different strategies can be used:

[Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold;

[Bottomly et al., 2011]: filter out genes with zero count in an experimental condition;

[Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition.

CPM stands for Counts Per Million (i.e., the raw count divided by the library size and multiplied by 10⁶); this strategy thus takes differences in library sizes into account.

More sophisticated filtering

To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed.
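The three strategies can be sketched on a toy matrix as follows. The counts, the condition labels and both thresholds are made up for illustration and would need to be tuned on real data; the rules are implemented as they are worded above.

```r
# Hypothetical toy data: 4 genes x 4 samples, 2 samples per condition.
counts <- matrix(
  c(  0,   0,   0,   0,    # geneA: never observed
      3,   1,   0,   0,    # geneB: zero count in one condition
    150, 210, 180, 160,    # geneC
     12,  20,  15,   9),   # geneD
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:4)))
condition <- c("control", "control", "treated", "treated")

# Strategy 1: keep genes whose total read count reaches a threshold.
keep_total <- rowSums(counts) >= 10

# Strategy 2: keep genes with a non-zero count in every condition.
keep_nonzero <- apply(counts, 1,
                      function(x) all(tapply(x, condition, sum) > 0))

# Strategy 3: CPM-based rule; filter out genes for which the number of
# samples with a low CPM exceeds the smallest group size.
lib_size <- colSums(counts)
cpm_values <- t(t(counts) / lib_size) * 1e6
cpm_threshold <- 10000               # arbitrary, suited to this toy example
n_min <- min(table(condition))
keep_cpm <- rowSums(cpm_values < cpm_threshold) <= n_min

filtered <- counts[keep_cpm, , drop = FALSE]
```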

Part III: Normalization

Purpose of normalization

identify and correct technical biases (due to the sequencing process) to make counts comparable

types of normalization: within-sample normalization and between-sample normalization

Within sample normalization

Example (read counts):

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts for gene B are twice as large as counts for gene A because:

gene B is expressed with twice as many transcripts as gene A, or

both genes are expressed with the same number of transcripts but gene B is twice as long as gene A.

Purpose of within-sample normalization: enabling comparisons of genes from the same sample.

Sources of variability: gene length, sequence composition (GC content).

These differences need not be corrected for a differential analysis and are not really relevant for data interpretation.

Between sample normalization

Example (read counts):

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts in sample 3 are much larger than counts in sample 2 because:

gene A is more expressed in sample 3 than in sample 2, or

gene A is expressed similarly in the two samples but the sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes).

Purpose of between-sample normalization: enabling comparisons of a gene in different samples.

Sources of variability: library size, ...

These differences must be corrected for a differential analysis and for data interpretation.

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

In practice: raw counts correspond to different sequencing depths.

Raw counts (control and treated samples):

Gene 1        5      1      0      0      4      0      0
Gene 2        0      2      1      2      1      0      0
Gene 3       92    161     76     70    140     88     70
  :
Gene G       15     25      9      5     20     14     17

In practice: a multiplicative correction factor $C_j$ is computed for every sample.

C_j         1.1    1.6    0.6    0.7    1.4    0.7    0.8

In practice: every count is multiplied by the correction factor corresponding to its sample.

Gene 3       92    161     76     70    140     88     70
  × C_j     1.1    1.6    0.6    0.7    1.4    0.7    0.8
= Gene 3  101.2  257.6   45.6     49    196   61.6     56

Normalized counts (control and treated samples):

Gene 1      5.5    1.6      0      0    5.6      0      0
Gene 2        0    3.2    0.6    1.4    1.4      0      0
Gene 3    101.2  257.6   45.6     49    196   61.6     56
  :
Gene G     16.5     40    5.4    5.5     28    9.8   13.6
Lib. size  13.1   13.0   13.2   13.1   13.2   13.0   13.1   (× 10⁵)

Consequences: library sizes for normalized counts are roughly equal.

Definition

If $K_{gj}$ is the raw count for gene $g$ in sample $j$ then the normalized count is defined as:

$\widetilde{K}_{gj} = \frac{K_{gj}}{s_j \times D_j} \times 10^6$

in which $D_j = \sum_g K_{gj}$ is the library size of sample $j$, $s_j$ is the correction factor of the library size for sample $j$, and thus $C_j = \frac{10^6}{s_j D_j}$.

Three types of methods:

distribution adjustment

method taking length into account

the “effective library size” concept
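The definition of normalized counts translates into a few lines of R. The helper name `normalize_counts` and the toy matrix are hypothetical; the Gene 3 values and the factors $C_j$ are the ones shown in the example table.

```r
# Normalized counts: K_gj / (s_j * D_j) * 1e6, i.e. K_gj * C_j.
# `s` holds the correction factors s_j (all equal to 1 by default).
normalize_counts <- function(counts, s = rep(1, ncol(counts))) {
  lib_size <- colSums(counts)         # D_j
  c_j <- 1e6 / (s * lib_size)         # C_j
  sweep(counts, 2, c_j, `*`)          # multiply each column j by C_j
}

# Hypothetical toy matrix: 3 genes x 2 samples.
toy <- matrix(c(10, 30, 60,
                 5,  5, 90), nrow = 3)
toy_norm <- normalize_counts(toy)

# Worked example from the table: Gene 3 multiplied by the factors C_j.
gene3 <- c(92, 161, 76, 70, 140, 88, 70)
c_j   <- c(1.1, 1.6, 0.6, 0.7, 1.4, 0.7, 0.8)
gene3_norm <- gene3 * c_j
```

With all $s_j = 1$, every column of the normalized matrix sums to 10⁶, which is the sense in which library sizes become equal.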

Distribution adjustment

Total read count adjustment [Mortazavi et al., 2008]

$s_j = 1$ and thus $\widetilde{K}_{gj} = \frac{K_{gj}}{D_j} \times 10^6$ (Counts Per Million).

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2; panels: raw counts, normalized counts]

edgeR:

cpm(..., normalized.lib.sizes = TRUE)

(Upper) Quartile normalization [Bullard et al., 2010]

$s_j = \frac{Q_j^{(p)}}{\frac{1}{N} \sum_{l=1}^{N} Q_l^{(p)}}$

in which $N$ is the number of samples and $Q_j^{(p)}$ is a given quantile (generally the 3rd quartile) of the count distribution in sample $j$.

edgeR:

calcNormFactors(..., method = "upperquartile", p = 0.75)
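The upper-quartile factor can be sketched in base R by following the formula as written above. The helper name `upper_quartile_factors` and the toy matrix are hypothetical. This is not the edgeR implementation: `calcNormFactors` may differ in details (for instance in how library sizes enter and how the factors are rescaled), so its output need not match this sketch.

```r
# s_j = Q_j^(p) / mean over samples of Q_l^(p), as in the formula above.
upper_quartile_factors <- function(counts, p = 0.75) {
  q <- apply(counts, 2, quantile, probs = p)   # Q_j^(p) for each sample
  q / mean(q)
}

# Hypothetical toy matrix: sample 2 has counts twice as large as sample 1.
toy <- cbind(s1 = 1:5, s2 = 2 * (1:5))
s <- upper_quartile_factors(toy)
```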

Method using gene lengths (intra & inter sample normalization)

RPKM: Reads Per Kilobase per Million mapped Reads

Assumptions: read counts are proportional to expression level, transcript length and sequencing depth

$s_j = \frac{D_j L_g}{10^3 \times 10^6}$, in which $L_g$ is the gene length (bp), so that $\mathrm{RPKM}_{gj} = K_{gj} / s_j$.

edgeR:

rpkm(..., gene.length = ...)

Unbiased estimation of the number of reads, but it affects variability [Oshlack and Wakefield, 2009].
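A hand-written version of the RPKM computation helps to see how gene length and library size enter. The function name `rpkm_manual` and the toy values are hypothetical; on real data the edgeR function mentioned above would normally be used instead.

```r
# RPKM: counts per million mapped reads, per kilobase of gene length.
rpkm_manual <- function(counts, gene_length) {
  lib_size <- colSums(counts)                       # D_j
  per_million <- t(t(counts) / lib_size) * 1e6      # K_gj / D_j * 1e6
  per_million / (gene_length / 1e3)                 # row g divided by L_g in kb
}

# Hypothetical toy data: 2 genes x 2 samples, each library has 1e6 reads.
toy <- cbind(s1 = c(1000, 999000), s2 = c(2000, 998000))
rownames(toy) <- c("gene1", "gene2")
gene_length <- c(2000, 1000)                        # in bp

toy_rpkm <- rpkm_manual(toy, gene_length)
```

Here gene1 (2 kb long, 1000 reads out of 10⁶) gets an RPKM of 500 in the first sample.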

Relative Log Expression (RLE)

[Anders and Huber, 2010], edgeR - DESeq - DESeq2

Method:

1 compute a pseudo-reference sample: geometric mean across samples

$R_g = \left( \prod_{j=1}^{N} K_{gj} \right)^{1/N}$

(geometric mean is less sensitive to extreme values than standard mean)

2 center samples compared to the reference:

$\tilde{K}_{gj} = \frac{K_{gj}}{R_g}$

[Figure: count / geo. mean against rank(mean) gene expression for Sample 1 and Sample 2]

3 compute normalization factor: median of centered counts over the genes

$\tilde{s}_j = \operatorname{median}_g \left\{ \tilde{K}_{gj} \right\}$

factors multiply to 1:

$s_j = \frac{\tilde{s}_j}{\exp\left( \frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l) \right)}$

## with edgeR

calcNormFactors(..., method = "RLE")

## with DESeq

estimateSizeFactors(...)
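The three steps of the method can be sketched in base R as follows. The helper name `rle_size_factors` is hypothetical, and excluding genes with at least one zero count (for which the geometric mean is zero) is an assumption of this sketch; the packaged functions listed above should be preferred on real data.

```r
rle_size_factors <- function(counts) {
  # Assumption: genes with a zero count in any sample are left out.
  keep <- rowSums(counts == 0) == 0
  log_counts <- log(counts[keep, , drop = FALSE])

  # Step 1: pseudo-reference sample, geometric mean across samples (log scale).
  log_ref <- rowMeans(log_counts)

  # Step 2: center samples compared to the reference, K_gj / R_g.
  ratios <- exp(log_counts - log_ref)

  # Step 3: median of centered counts over the genes, for each sample.
  s_tilde <- apply(ratios, 2, median)

  # Rescale so that the factors multiply to 1.
  s_tilde / exp(mean(log(s_tilde)))
}

# Hypothetical toy data: sample s2 is sequenced twice as deeply as s1.
toy <- cbind(s1 = c(10, 20, 30, 0), s2 = c(20, 40, 60, 5))
s <- rle_size_factors(toy)
```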

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010], edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

⇒ remove extreme data for fold-change (M) and average intensity (A):

$M_g(j,r) = \log_2\left( \frac{K_{gj}}{D_j} \right) - \log_2\left( \frac{K_{gr}}{D_r} \right)$

$A_g(j,r) = \frac{1}{2} \left[ \log_2\left( \frac{K_{gj}}{D_j} \right) + \log_2\left( \frac{K_{gr}}{D_r} \right) \right]$

select as the reference sample the sample $r$ whose upper quartile is closest to the average upper quartile

Trim 30% on M-values

Trim 5% on A-values

[Figure: M(j,r) against A(j,r), M- vs A-values; trimmed values: none, 30% of M_g, 5% of A_g]

On remaining data, compute the weighted mean of M-values:

$\mathrm{TMM}(j,r) = \frac{\sum_{g:\ \text{not trimmed}} w_g(j,r)\, M_g(j,r)}{\sum_{g:\ \text{not trimmed}} w_g(j,r)}$

with

$w_g(j,r) = \frac{D_j - K_{gj}}{D_j K_{gj}} + \frac{D_r - K_{gr}}{D_r K_{gr}}$.
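The trimming and weighted mean can be sketched in base R as below. Several choices are assumptions of this sketch rather than statements about edgeR: the helper name `tmm_log2`, trimming the given fraction in each tail using ranks, dropping genes with a zero count, and using the inverse of the quantity $w_g(j,r)$ written above as the weight (it is treated here as an approximate variance, which appears to be what the edgeR implementation does). Replace `1 / v[keep]` by `v[keep]` to follow the formula literally.

```r
# Weighted trimmed mean of M-values between sample j and reference r
# (log2 scale). k_j and k_r are count vectors over the same genes.
tmm_log2 <- function(k_j, k_r, trim_m = 0.30, trim_a = 0.05) {
  d_j <- sum(k_j)                       # library sizes D_j and D_r
  d_r <- sum(k_r)
  ok <- k_j > 0 & k_r > 0               # assumption: drop zero counts
  k_j <- k_j[ok]
  k_r <- k_r[ok]

  m <- log2(k_j / d_j) - log2(k_r / d_r)
  a <- (log2(k_j / d_j) + log2(k_r / d_r)) / 2
  v <- (d_j - k_j) / (d_j * k_j) + (d_r - k_r) / (d_r * k_r)

  # Rank-based trimming of both tails of M and A.
  n <- length(m)
  lo_m <- floor(n * trim_m) + 1
  hi_m <- n + 1 - lo_m
  lo_a <- floor(n * trim_a) + 1
  hi_a <- n + 1 - lo_a
  keep <- rank(m) >= lo_m & rank(m) <= hi_m &
          rank(a) >= lo_a & rank(a) <= hi_a

  w <- 1 / v[keep]                      # assumption: inverse-variance weights
  sum(w * m[keep]) / sum(w)
}

# Same composition, sample j sequenced twice as deeply: no shift expected.
k_r <- seq(10, 200, by = 10)
tmm_same <- tmm_log2(2 * k_r, k_r)

# One very highly expressed gene in sample j dominates its library size.
k_r2 <- rep(100, 20)
k_j2 <- k_r2
k_j2[1] <- 10000
tmm_shift <- tmm_log2(k_j2, k_r2)
```

In the second example the single highly expressed gene is trimmed away, and the result reflects only the change in library size seen by the other genes; `2^tmm_shift` would then serve as the scaling factor on the original scale.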