Analyses biostatistiques de données RNA-seq
Nathalie Vialaneix (MIAT, INRA)
en collaboration avec Ignacio Gonzàles et Annick Moisan
nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu
Toulouse, 4/5 novembre 2020
Outline
1 Exploratory analysis Introduction
Experimental design
Data exploration and quality assessment
2 Normalization Raw data filtering Interpreting read counts
3 Differential Expression analysis
Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis
2
A typical transcriptomic experiment
A typical transcriptomic experiment
2
Outline
1 Exploratory analysis Introduction
Experimental design
Data exploration and quality assessment
2 Normalization Raw data filtering Interpreting read counts
3 Differential Expression analysis
Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis
A typical RNA-seq experiment
1 collect samples
2 generate libraries of cDNA fragments
3 affect material to one lane in one flow cell
2
Steps in RNAseq data analysis
Part I: Experimental design
2
Confounded effects: a simple example
Basic experiment: find differences between control/treated plants
control group plant treated group plant
In summary, what is a good experimental design?
Experimental design are usually not as simple as this example: they can includemultiple experimental factors(day of experiment, flow cell, . . . ) and multiple covariates(sex, parents, . . . ).
⇒The experimental design must becarefullythoughtbeforestarting the experiment and confounded effects must be searched for in a systematic manner.
Confounded effects: a simple example
Basic experiment: find differences between control/treated plants
control group plant treated group plant
A bad experimental design: grow all control group plants in one field and grow all treated group plants in another field
Field 1 Field 2
because you can not differentiate between differences due to the field and differences due to the treatment⇒confounded effects
In summary, what is a good experimental design?
Experimental design are usually not as simple as this example: they can includemultiple experimental factors(day of experiment, flow cell, . . . ) and multiple covariates(sex, parents, . . . ).
⇒The experimental design must becarefullythoughtbeforestarting the experiment and confounded effects must be searched for in a systematic manner.
2
Confounded effects: a simple example
Basic experiment: find differences between control/treated plants
control group plant treated group plant
A good experimental design: grow half control group plants (chosen at random) and half treated group plants in one field (and the rest in the other field)
Field 1 Field 2
differences due to the field and differences due to the treatment can be
In summary, what is a good experimental design?
Experimental design are usually not as simple as this example: they can includemultiple experimental factors(day of experiment, flow cell, . . . ) and multiple covariates(sex, parents, . . . ).
⇒The experimental design must becarefullythoughtbeforestarting the experiment and confounded effects must be searched for in a systematic manner.
Confounded effects: a simple example
Basic experiment: find differences between control/treated plants
control group plant treated group plant
In summary, what is a good experimental design?
Experimental design are usually not as simple as this example: they can includemultiple experimental factors(day of experiment, flow cell, . . . ) and multiple covariates(sex, parents, . . . ).
⇒The experimental design must becarefullythoughtbeforestarting the experiment and confounded effects must be searched for in a systematic manner.
2
Effect & Variation
2 conditions, 2 genes whose expression distribution is:
first gene: different median levels between the two groups butlarge variance: differences may be non significant
second gene: different median levels between the two groups butvery small variance: differences may be significant
Source of variation in RNA-seq experiments
1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)
2 at the middle layer:technical variations(library preparation effect)
3 at the bottom layer:technical variations(lane and cell flow effects)
lane effect<cell flow effect<library preparation effectbiological effect
2
Source of variation in RNA-seq experiments
1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)
2 at the middle layer:technical variations(library preparation effect)
3 at the bottom layer:technical variations(lane and cell flow effects)
lane effect<cell flow effect<library preparation effectbiological effect
Advised policy for biological/technical replicates
biological replicates: different biological samples, processed separately (advised: ≥3 per condition)
technical replicates: same biological material but independent replications of the technical steps (from library preparation; not mandatory in most cases)
6 biological replicates are better than 3 biological replicates×2 technical replicates! See [Liu et al., 2014] for further discussions.
2
Advised policy for biological/technical replicates
biological replicates: different biological samples, processed separately (advised: ≥3 per condition)
technical replicates: same biological material but independent replications of the technical steps (from library preparation; not mandatory in most cases)
6 biological replicates are better than 3 biological replicates×2 technical
In summary: a simple two-condition experimental design for RNA-seq
Typical RNA-seq design: independent samples are sequenced into different lanes (lane effect cannot be estimated but within sample comparison is preserved)
extractionRNA Library preparation
Lane 1
Lane 2
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate:
Flow cell 1
C11 1
C21 2
T11 3
T21 4
Flow cell 2
C12 1
C22 2
T12 3
T22 4
Flow cell 3
C13 1
C23 2
T13 3
T23 4
Flow cell 4
C14 1
C24 2
T14 3
T24 4
Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane
extractionRNA
Library preparation and barcoding
Pooling
Lane 1
Lane 2
Barcoded adapters Barcode adapters
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, splitted on 4 flow cells:
Flow cell 1
C111 1
C211 T111 T211
C121 2
C221 T121 T221
C131 3
C231 T131 T231
C141 4
C241 T141 T241
Flow cell 2
C112 1
C212 T112 T212
C122 2
C222 T122 T222
C132 3
C232 T132 T232
C142 4
C242 T142 T242
Flow cell 3
C113 1
C213 T113 T213
C123 2
C223 T123 T223
C133 3
C233 T133 T233
C143 4
C243 T143 T243
Flow cell 4
C114 1
C214 T114 T214
C124 2
C224 T124 T224
C134 3
C234 T134 T234
C144 4
C244 T144 T244
2
In summary: a simple two-condition experimental design for RNA-seq
Typical RNA-seq design: independent samples are sequenced into different lanes (lane effect cannot be estimated but within sample comparison is preserved)
extractionRNA Library preparation
Lane 1
Lane 2
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate:
Flow cell 1
C11 1
C21 2
T11 3
T21 4
Flow cell 2
C12 1
C22 2
T12 3
T22 4
Flow cell 3
C13 1
C23 2
T13 3
T23 4
Flow cell 4
C14 1
C24 2
T14 3
T24 4
Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane
extractionRNA
Library preparation and barcoding
Pooling
Lane 1
Lane 2
Barcoded adapters Barcode adapters
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, splitted on 4 flow cells:
Flow cell 1
C111 1
C211 T111 T211
C121 2
C221 T121 T221
C131 3
C231 T131 T231
C141 4
C241 T141 T241
Flow cell 2
C112 1
C212 T112 T212
C122 2
C222 T122 T222
C132 3
C232 T132 T232
C142 4
C242 T142 T242
Flow cell 3
C113 1
C213 T113 T213
C123 2
C223 T123 T223
C133 3
C233 T133 T233
C143 4
C243 T143 T243
Flow cell 4
C114 1
C214 T114 T214
C124 2
C224 T124 T224
C134 3
C234 T134 T234
C144 4
C244 T144 T244
In summary: a simple two-condition experimental design for RNA-seq
Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane
extractionRNA
Library preparation and barcoding
Pooling
Lane 1
Lane 2
Barcoded adapters Barcode adapters
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, splitted on 4 flow cells:
Flow cell 1
C111 1
C211 T111 T211
C121 2
C221 T121 T221
C131 3
C231 T131 T231
C141 4
C241 T141 T241
Flow cell 2
C112 1
C212 T112 T212
C122 2
C222 T122 T222
C132 3
C232 T132 T232
C142 4
C242 T142 T242
Flow cell 3
C113 1
C213 T113 T213
C123 2
C223 T123 T223
C133 3
C233 T133 T233
C143 4
C243 T143 T243
Flow cell 4
C114 1
C214 T114 T214
C124 2
C224 T124 T224
C134 3
C234 T134 T234
C144 4
C244 T144 T244
2
In summary: a simple two-condition experimental design for RNA-seq
Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane
extractionRNA
Library preparation and barcoding
Pooling
Lane 1
Lane 2
Barcoded adapters Barcode adapters
Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, splitted on 4 flow cells:
Flow cell 1
C111 1
C211 T111 T211
C121 2
C221 T121 T221
C131 3
C231 T131 T231
C141 4
C241 T141 T241
Flow cell 2
C112 1
C212 T112 T212
C122 2
C222 T122 T222
C132 3
C232 T132 T232
C142 4
C242 T142 T242
Flow cell 3
C113 1
C213 T113 T213
C123 2
C223 T123 T223
C133 3
C233 T133 T233
C143 4
C243 T143 T243
Flow cell 4
C114 1
C214 T114 T214
C124 2
C224 T124 T224
C134 3
C234 T134 T234
C144 4
C244 T144 T244
Part II: Exploratory analysis
2
Some features of RNAseq data
What must be taken into account?
discrete, non-negative data (total number of aligned reads)
skewed data
overdispersion (variancemean)
Some features of RNAseq data
What must be taken into account?
discrete, non-negative data (total number of aligned reads) skewed data
overdispersion (variancemean)
2
Some features of RNAseq data
What must be taken into account?
discrete, non-negative data (total number of aligned reads) skewed data
overdispersion (variancemean)
black line is
“variance = mean”
Dataset used in the examples
Dataset provided by courtesy of the transcriptomic platform of IPS2
Three files:D1-counts.txtcontains the raw counts of the experiment (13 columns: the first one contains the gene names, the others
correspond to 12 different samples; gene names have been shuffled);
D1-genesLength.txtcontains information about gene lengths;
D1-targets.txtcontains information about the sample and the experimental design.
2
Dataset used in the examples
These text files are loaded with:
raw_counts <- read.table("D1 - counts . txt " , header = TRUE , row.names = 1)
raw_counts <- as.matrix( raw_counts )
design <- read.table("D1 - targets . txt " , header = TRUE , stringsAsFactors = FALSE )
gene_lengths <- scan("D1 - genesLength . txt ")
Count distribution
Thecount distribution(i.e., the number of times a given count is obtained in the data) can be visualized withhistograms(boxplotsorviolin plotscan also be used):
This distribution is highly skewed and it is better visualized using alog2 transformationbefore it is displayed.
library size
Thelibrary sizeis the sum of all counts in a given sample.
2
Count distribution
Thecount distribution(i.e., the number of times a given count is obtained in the data) can be visualized withhistograms(boxplotsorviolin plotscan also be used):
This distribution is highly skewed and it is better visualized using alog2 transformationbefore it is displayed.
library size
Count distribution between samples
Thecount distributionbetween different samples can be compared with parallel boxplotsorviolin plots:
It is expected that, within a given condition (group), the count distributions are similar. The same is often also expected between conditions.
2
Check reproducibility between samples
MA plotscan be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values):
M-values: log of ratio between counts between two samples:
Mg = log2(Kgj)−log2(Kgj0)
A-values: average log counts between two samples:
Ag = log2(Kgj) + log2(Kgj0) 2
whereKgj stands for the counts for genegin samplej.
Check reproducibility between samples
MA plotscan be used to visualize reproducibility between samples of an experiment (and thus check if normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values):
2
Check similarity between samples
Similarities between samplescan be visualized with aHACand a heatmap:
perform hierarchical ascending classification (HAC) using Euclidean distance between samples: δ(j,j0) =P
g
log2(Kgj)−log2(Kgj0)2
visualize the strength of the similarity with heatmap.
Search for the main structures in the data: PCA
PCA(on log2counts) can be used to project data into a small dimensional space and search for unexpected experimental effects in the data.
(MDS is equivalent to PCA when used with the standard Euclidean distance)
Remark: InDESeq, the functionplotPCAperforms PCA on the top genes with the highest variance.
2
Outline
1 Exploratory analysis Introduction
Experimental design
Data exploration and quality assessment
2 Normalization Raw data filtering Interpreting read counts
3 Differential Expression analysis
Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data Interpreting and improving the analysis
Raw data filtering
Filteringconsists in removing genes with low expression. Different strategies can be used:
[Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold;
[Bottomly et al., 2011]: filter out genes with zero count in an experimental condition;
[Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition.
WithCPM: Count Per Million (i.e., raw count divived by library size, this strategy takes into account differences in library sizes).
More sophisticated filtering
To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed.
2
Raw data filtering
Filteringconsists in removing genes with low expression. Different strategies can be used:
[Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold;
[Bottomly et al., 2011]: filter out genes with zero count in an experimental condition;
[Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition.
WithCPM: Count Per Million (i.e., raw count divived by library size, this strategy takes into account differences in library sizes).
More sophisticated filtering
To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed.
Part III: Normalization
2
Purpose of normalization
identify and correct technical biases (due to sequencing process) to make counts comparable
types of normalization: within sample normalization and between sample normalization
Within sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts for gene B are twice larger than counts for gene Abecause:
Purpose of within sample comparison: enabling comparisons of genes from a same sample
Sources of variability: gene length, sequence composition (GC content)
These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.
2
Within sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts for gene B are twice larger than counts for gene Abecause:
gene B is expressed with a number of transcripts twice larger than gene A
gene A gene B
Purpose of within sample comparison: enabling comparisons of genes from a same sample
Sources of variability: gene length, sequence composition (GC content)
These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.
Within sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts for gene B are twice larger than counts for gene Abecause:
both genes are expressed with the same number of transcripts but gene B is twice longer than gene A
gene A gene B
Purpose of within sample comparison: enabling comparisons of genes from a same sample
Sources of variability: gene length, sequence composition (GC content)
These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.
2
Within sample normalization
Purpose of within sample comparison: enabling comparisons of genes from a same sample
Sources of variability: gene length, sequence composition (GC content)
These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.
Between sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts in sample 3 are much larger than counts in sample 2because:
Purpose of between sample comparison: enabling comparisons of a gene in different samples
Sources of variability: library size, ...
These differencesmust be corrected for a differential analysisand for data interpretation.
2
Between sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts in sample 3 are much larger than counts in sample 2because:
gene A is more expressed in sample 3 than in sample 2
gene A in sample 2 gene A in sample 3
Purpose of between sample comparison: enabling comparisons of a gene in different samples
Sources of variability: library size, ...
These differencesmust be corrected for a differential analysisand for data interpretation.
Between sample normalization
Example: (read counts)
sample 1 sample 2 sample 3
gene A 752 615 1203
gene B 1507 1225 2455
counts in sample 3 are much larger than counts in sample 2because:
gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes)
gene A in sample 2 gene A in sample 3
Purpose of between sample comparison: enabling comparisons of a gene in different samples
Sources of variability: library size, ...
These differencesmust be corrected for a differential analysisand for data interpretation.
2
Between sample normalization
Purpose of between sample comparison: enabling comparisons of a gene in different samples
Sources of variability: library size, ...
These differencesmust be corrected for a differential analysisand for data interpretation.
Principles for sequencing depth normalization
Basics
1 choose an appropriate baseline for each sample
2 for a given gene, compare counts relative to the baseline rather than raw counts
Consequences: Library sizes for normalized counts are roughly equal.
control treated
Gene 1 5.5 1.6 0 0 5.6 0 0
Gene 2 0 3.2 0.6 1.4 1.4 0 0
Gene 3 101.2 257.6 45.6 49 196 61.6 56
: : :
: : :
: : :
Gene G 16.5 40 5.4 5.5 28 9.8 13.6
+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj = s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
2
Principles for sequencing depth normalization
Basics
1 choose an appropriate baseline for each sample
2 for a given gene, compare counts relative to the baseline rather than raw counts
In practice: Raw counts correspond to different sequencing depths
control treated
Gene 1 5 1 0 0 4 0 0
Gene 2 0 2 1 2 1 0 0
Gene 3 92 161 76 70 140 88 70
: : :
: : :
: : :
Gene G 15 25 9 5 20 14 17
Consequences: Library sizes for normalized counts are roughly equal.
control treated
Gene 1 5.5 1.6 0 0 5.6 0 0
Gene 2 0 3.2 0.6 1.4 1.4 0 0
Gene 3 101.2 257.6 45.6 49 196 61.6 56
: : :
: : :
: : :
Gene G 16.5 40 5.4 5.5 28 9.8 13.6
+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj = s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
Principles for sequencing depth normalization
Basics
1 choose an appropriate baseline for each sample
2 for a given gene, compare counts relative to the baseline rather than raw counts
In practice: A correction multiplicative factor is computed for every sample
control treated
Gene 1 5 1 0 0 4 0 0
Gene 2 0 2 1 2 1 0 0
Gene 3 92 161 76 70 140 88 70
: : :
: : :
: : :
Gene G 15 25 9 5 20 14 17
1.1 1.6 0.6 0.7 1.4 0.7 0.8
Cj
Consequences: Library sizes for normalized counts are roughly equal.
control treated
Gene 1 5.5 1.6 0 0 5.6 0 0
Gene 2 0 3.2 0.6 1.4 1.4 0 0
Gene 3 101.2 257.6 45.6 49 196 61.6 56
: : :
: : :
: : :
Gene G 16.5 40 5.4 5.5 28 9.8 13.6
+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj = s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
2
Principles for sequencing depth normalization
Basics
1 choose an appropriate baseline for each sample
2 for a given gene, compare counts relative to the baseline rather than raw counts
In practice: Every counts is multiplied by the correction factor corresponding to its sample
Gene 3 92 161 76 70 140 88 70
1.1 1.6 0.6 0.7 1.4 0.7 0.8
x
Gene 3 101.2 257.6 45.6 49 196 61.6 56
Cj
Consequences: Library sizes for normalized counts are roughly equal.
control treated
Gene 1 5.5 1.6 0 0 5.6 0 0
Gene 2 0 3.2 0.6 1.4 1.4 0 0
Gene 3 101.2 257.6 45.6 49 196 61.6 56
: : :
: : :
: : :
Gene G 16.5 40 5.4 5.5 28 9.8 13.6
+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj = s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
Principles for sequencing depth normalization
Basics
1 choose an appropriate baseline for each sample
2 for a given gene, compare counts relative to the baseline rather than raw counts
Consequences: Library sizes for normalized counts are roughly equal.
control treated
Gene 1 5.5 1.6 0 0 5.6 0 0
Gene 2 0 3.2 0.6 1.4 1.4 0 0
Gene 3 101.2 257.6 45.6 49 196 61.6 56
: : :
: : :
: : :
Gene G 16.5 40 5.4 5.5 28 9.8 13.6
+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj = s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
2
Principles for sequencing depth normalization
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj= s106
jDj.
Three types of methods: distribution adjustment
method taking length into account the “effective library size” concept
Principles for sequencing depth normalization
Definition
IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:
Kegj = Kgj
sj×Dj ×106 in which: Dj =P
gKgjis the library size of samplej,sjis the correction factor of the library size for samplej and thusCj= s106
jDj. Three types of methods:
distribution adjustment
method taking length into account the “effective library size” concept
2
Distribution adjustment
Total read count adjustment[Mortazavi et al., 2008]
sj=1 and thus:Kegj= Kgj Dj ×106 (Counts Per Million).
raw counts normalized counts
0 5 10 15
0 250 500 750 1000 0 250 500 750 1000
rank(mean) gene expression log2(count + 1)
Samples Sample 1 Sample 2
edgeR:
cpm (... ,
normalized . lib . sizes = TRUE )
Distribution adjustment
Total read count adjustment[Mortazavi et al., 2008]
(Upper) Quartile normalization[Bullard et al., 2010]
sj=
Qj(p)
1 N
PN
l=1Ql(p)
in whichN is the number of samples andQj(p)is a given quantile (generally 3rd quartile) of the count distribution in samplej.
raw counts normalized counts
0 5 10 15 20
0 250 500 750 1000 0 250 500 750 1000
rank(mean) gene expression log2(count + 1)
Samples Sample 1 Sample 2
raw counts normalized counts
0 5 10 15 20
0 250 500 750 1000 0 250 500 750 1000
rank(mean) gene expression log2(count + 1)
Samples Sample 1 Sample 2
edgeR:
calcNormFactors (... , method = " upperquartile " , p = 0.75)
2
Method using gene lengths (intra & inter sample normalization)
RPKM: Reads Per Kilobase per Million mapped Reads
Assumptions: read counts are proportional to expression level, transcript length and sequencing depth
sj = DjLg 103×106 in whichLg is gene length (bp).
edgeR:
rpkm (... , gene .length = ...)
Unbiaised estimation of number of reads but affect variability [Oshlack and Wakefield, 2009].
Relative Log Expression (RLE)
[Anders and Huber, 2010],edgeR-DESeq-DESeq2
Method:
1 compute apseudo-reference sample: geometric mean across samples
Rg =
YN
j=1
Kgj
1/N
(geometric mean is less sensitive to extreme values than standard mean)
0 5 10 15
0 250 500 750 1000
rank(mean) gene expression log2(count + 1)
Samples Sample 1 Sample 2
2 center samples compared to the reference
3 compute normalization factor: median of centered counts over the genes
2
Relative Log Expression (RLE)
[Anders and Huber, 2010],edgeR-DESeq-DESeq2
Method:
1 compute apseudo-reference sample
2 center samples compared to the reference K˜gj= Kgj
Rg with Rg =
N
Y
j=1
Kgj
1/N
1 2 3 4
0 250 500 750 1000
rank(mean) gene expression
count / geo. mean Samples
Sample 1 Sample 2
3 compute normalization factor: median of centered counts over the genes
Relative Log Expression (RLE)
[Anders and Huber, 2010],edgeR-DESeq-DESeq2
Method:
1 compute apseudo-reference sample
2 center samples compared to the reference
3 compute normalization factor: median of centered counts over the genes
˜
sj=mediang n K˜gjo
factors multiply to 1: sj = ˜sj exp1
N
PN
l=1log(˜sl)
1 2 3 4
0 250 500 750 1000
rank(mean) gene expression
count / geo. mean Samples
Sample 1 Sample 2
with
K˜gj = Kgj Rg and
Rg =
N
Y
j=1
Kgj
1/N
2
Relative Log Expression (RLE)
[Anders and Huber, 2010],edgeR-DESeq-DESeq2
Method:
1 compute apseudo-reference sample
2 center samples compared to the reference
3 compute normalization factor: median of centered counts over the genes
0 5 10 15
0 250 500 750 1000
rank(mean) gene expression log2(count + 1)
Samples Sample 1 Sample 2
## with edgeR
calcNormFactors (... , method =" RLE ")
## with DESeq
estimateSizeFactors (...)
Trimmed Mean of M-values (TMM)
[Robinson and Oshlack, 2010],edgeR
Assumptions behind the method
the total read count strongly depends on a few highly expressed genes
most genes are not differentially expressed
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
On remaining data, compute the weighted mean of M-values:
TMM(j,r) = P
g:not trimmed
wg(j,r)Mg(j,r) P
g:not trimmed
wg(j,r) withwg(j,r) =
Dj−Kgj
DjKgj + DDr−Kgr
rKgr
.
2
Trimmed Mean of M-values (TMM)
[Robinson and Oshlack, 2010],edgeR
Assumptions behind the method
the total read count strongly depends on a few highly expressed genes
most genes are not differentially expressed
⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) = log2 Kgj
Dj
!
−log2 Kgr Dr
!
Ag(j,r) = 1 2
"
log2 Kgj Dj
!
+ log2 Kgr Dr
!#
select as a reference sample, the sampler with the upper quartile closest to the average upper quartile
M- vs A-values −3
−2
−1 0 1 2
−20 −15 −10 −5
M(j,r)
Trimmed values
None
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
On remaining data, compute the weighted mean of M-values:
TMM(j,r) = P
g:not trimmed
wg(j,r)Mg(j,r) P
g:not trimmed
wg(j,r) withwg(j,r) =
Dj−Kgj
DjKgj + DDr−Kgr
rKgr
.
Trimmed Mean of M-values (TMM)
[Robinson and Oshlack, 2010],edgeR
Assumptions behind the method
the total read count strongly depends on a few highly expressed genes
most genes are not differentially expressed
⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) = log2 Kgj
Dj
!
−log2 Kgr Dr
!
Ag(j,r) = 1 2
"
log2 Kgj Dj
!
+ log2 Kgr Dr
!#
Trim 30% on M-values
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
On remaining data, compute the weighted mean of M-values:
TMM(j,r) = P
g:not trimmed
wg(j,r)Mg(j,r) P
g:not trimmed
wg(j,r) withwg(j,r) =
Dj−Kgj
DjKgj + DDr−Kgr
rKgr
.
2
Trimmed Mean of M-values (TMM)
[Robinson and Oshlack, 2010],edgeR
Assumptions behind the method
the total read count strongly depends on a few highly expressed genes
most genes are not differentially expressed
⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) = log2 Kgj
Dj
!
−log2 Kgr Dr
!
Ag(j,r) = 1 2
"
log2 Kgj Dj
!
+ log2 Kgr Dr
!#
Trim 5% on A-values
−3
−2
−1 0 1 2
−20 −15 −10 −5
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
On remaining data, compute the weighted mean of M-values:
TMM(j,r) = P
g:not trimmed
wg(j,r)Mg(j,r) P
g:not trimmed
wg(j,r) withwg(j,r) =
Dj−Kgj
DjKgj + DDr−Kgr
rKgr
.
Trimmed Mean of M-values (TMM)
[Robinson and Oshlack, 2010],edgeR
Assumptions behind the method
the total read count strongly depends on a few highly expressed genes
most genes are not differentially expressed
−3
−2
−1 0 1 2
−20 −15 −10 −5
A(j,r)
M(j,r)
Trimmed values
None 30% of Mg 5% of Ag
On remaining data, compute the weighted mean of M-values:
TMM(j,r) = P
g:not trimmed
wg(j,r)Mg(j,r) P
g:not trimmed
wg(j,r) withwg(j,r) =
Dj−Kgj
DjKgj + DDr−Kgr
rKgr
.
2