• Aucun résultat trouvé

Normalization and differential analysis of RNA-seq data

N/A
N/A
Protected

Academic year: 2022

Partager "Normalization and differential analysis of RNA-seq data"

Copied!
83
0
0

Texte intégral

(1)

Normalization and differential analysis of RNA-seq data

Nathalie Villa-Vialaneix INRA, Toulouse, MIAT

(Mathématiques et Informatique Appliquées de Toulouse)

nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org

SPS Summer School 2016

From gene expression to genomic network

(2)

Outline

1 Normalization

2 Differential expression analysis

Hypothesis testing and correction for multiple tests Differential expression analysis for RNAseq data

(3)

A typical transcriptomic experiment

(4)

A typical transcriptomic experiment

(5)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads)

skewed data

overdispersion (variancemean)

(6)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads) skewed data

overdispersion (variancemean)

(7)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads) skewed data

overdispersion (variancemean)

black line is

“variance = mean”

(8)

Steps in RNAseq data analysis

(9)

Part I: Normalization

(10)

Purpose of normalization

identify and correct technical biases (due to sequencing process) to make counts comparable

types of normalization: within sample normalization and between sample normalization

(11)

Source of variation in RNA-seq experiments

1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)

2 at the middle layer:technical variations(library effect)

3 at the bottom layer:technical variations(lane and cell flow effects)

lane effect<cell flow effect<library effectbiological effect

(12)

Source of variation in RNA-seq experiments

1 at the top layer:biological variations(i.e., individual differences due toe.g., environmental or genetic factors)

2 at the middle layer:technical variations(library effect)

3 at the bottom layer:technical variations(lane and cell flow effects)

(13)

Within sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts for gene B are twice larger than counts for gene Abecause:

Purpose of within sample comparison: enabling comparisons of genes from a same sample

Sources of variability: gene length, sequence composition (GC content)

These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.

(14)

Within sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts for gene B are twice larger than counts for gene Abecause:

gene B is expressed with a number of transcripts twice larger than gene A

gene A gene B

Purpose of within sample comparison: enabling comparisons of genes from a same sample

Sources of variability: gene length, sequence composition (GC content)

These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.

(15)

Within sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts for gene B are twice larger than counts for gene Abecause:

both genes are expressed with the same number of transcripts but gene B is twice longer than gene A

gene A gene B

Purpose of within sample comparison: enabling comparisons of genes from a same sample

Sources of variability: gene length, sequence composition (GC content)

These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.

(16)

Within sample normalization

Purpose of within sample comparison: enabling comparisons of genes from a same sample

Sources of variability: gene length, sequence composition (GC content)

These differencesneed not to be corrected for a differential analysisand are not really relevant for data interpretation.

(17)

Between sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts in sample 3 are much larger than counts in sample 2because:

Purpose of between sample comparison: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differencesmust be corrected for a differential analysisand for data interpretation.

(18)

Between sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts in sample 3 are much larger than counts in sample 2because:

gene A is more expressed in sample 3 than in sample 2

gene A in sample 2 gene A in sample 3

Purpose of between sample comparison: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differencesmust be corrected for a differential analysisand for data interpretation.

(19)

Between sample normalization

Example: (read counts)

sample 1 sample 2 sample 3

gene A 752 615 1203

gene B 1507 1225 2455

counts in sample 3 are much larger than counts in sample 2because:

gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes)

gene A in sample 2 gene A in sample 3

Purpose of between sample comparison: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differencesmust be corrected for a differential analysisand for data interpretation.

(20)

Between sample normalization

Purpose of between sample comparison: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differencesmust be corrected for a differential analysisand for data interpretation.

(21)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

: : :

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(22)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

In practice: Raw counts correspond to different sequencing depths

control treated

Gene 1 5 1 0 0 4 0 0

Gene 2 0 2 1 2 1 0 0

Gene 3 92 161 76 70 140 88 70

: : :

: : :

: : :

Gene G 15 25 9 5 20 14 17

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

: : :

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(23)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

In practice: A correction multiplicative factor is calculated for every sample

control treated

Gene 1 5 1 0 0 4 0 0

Gene 2 0 2 1 2 1 0 0

Gene 3 92 161 76 70 140 88 70

: : :

: : :

: : :

Gene G 15 25 9 5 20 14 17

1.1 1.6 0.6 0.7 1.4 0.7 0.8

Cj 

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

: : :

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(24)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

In practice: Every counts is multiplied by the correction factor corresponding to its sample

Gene 3 92 161 76 70 140 88 70

1.1 1.6 0.6 0.7 1.4 0.7 0.8

x

Gene 3 101.2 257.6 45.6 49 196 61.6 56

Cj 

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

: : :

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(25)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

Consequences: Library sizes for normalized counts are roughly equal.

control treated

Gene 1 5.5 1.6 0 0 5.6 0 0

Gene 2 0 3.2 0.6 1.4 1.4 0 0

Gene 3 101.2 257.6 45.6 49 196 61.6 56

: : :

: : :

: : :

Gene G 16.5 40 5.4 5.5 28 9.8 13.6

+ Lib. size 13.1 13.0 13.2 13.1 13.2 13.0 13.1 x 105

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej. Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(26)

Principles for sequencing depth normalization

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej.

Three types of methods: distribution adjustment

method taking length into account the “effective library size” concept

(27)

Principles for sequencing depth normalization

Definition

IfKgjis the raw count for genegin samplej then, the normalized counts is defined as:

Kegj = Kgj sj

in whichsj=Cj−1 is the scaling factor for samplej.

Three types of methods:

distribution adjustment

method taking length into account the “effective library size” concept

(28)

Distribution adjustment

Total read count adjustment[Mortazavi et al., 2008]

sj= Dj

1 N

PN l=1Dl in whichN is the number of samples andDj=P

gKgj.

raw counts normalized counts

0 5 10 15

0 250 500 750 1000 0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

edgeR:

cpm (... ,

normalized . lib . sizes = TRUE )

(29)

Distribution adjustment

Total read count adjustment[Mortazavi et al., 2008]

(Upper) Quartile normalization[Bullard et al., 2010]

sj=

Qj(p)

1PN

l=1Ql(p)

in whichQj(p)is a given quantile (generally 3rd quartile) of the count distribution in samplej.

raw counts normalized counts

0 5 10 15 20

0 250 500 750 1000 0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

raw counts normalized counts

0 5 10 15 20

0 250 500 750 1000 0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

edgeR:

calcNormFactors (... , method = " upperquartile " , p = 0.75)

(30)

Method using gene lengths (intra & inter sample normalization)

RPKM: Reads Per Kilobase per Million mapped Reads

Assumptions: read counts are proportional to expression level, transcript length and sequencing depth

sj = DjLg 103×106 in whichLg is gene length (bp).

edgeR:

rpkm (... , gene .length = ...)

(31)

Relative Log Expression (RLE)

[Anders and Huber, 2010],edgeR-DESeq-DESeq2

Method:

1 compute apseudo-reference sample: geometric mean across samples

Rg =







 YN

j=1

Kgj









1/N

(geometric mean is less sensitive to extreme values than standard mean)

0 5 10 15

0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

2 center samples compared to the reference

3 calculate normalization factor: median of centered counts over the genes

(32)

Relative Log Expression (RLE)

[Anders and Huber, 2010],edgeR-DESeq-DESeq2

Method:

1 compute apseudo-reference sample

2 center samples compared to the reference K˜gj= Kgj

Rg with Rg =









N

Y

j=1

Kgj









1/N

1 2 3 4

0 250 500 750 1000

count / geo. mean Samples

Sample 1 Sample 2

3 calculate normalization factor: median of centered counts over the genes

(33)

Relative Log Expression (RLE)

[Anders and Huber, 2010],edgeR-DESeq-DESeq2

Method:

1 compute apseudo-reference sample

2 center samples compared to the reference

3 calculate normalization factor: median of centered counts over the genes

˜

sj=mediang n K˜gjo

factors multiply to 1: sj = ˜sj exp1

N

PN

l=1log(˜sl)

1 2 3 4

0 250 500 750 1000

count / geo. mean Samples

Sample 1 Sample 2

with

gj = Kgj Rg and

Rg =









N

Y

j=1

Kgj









1/N

(34)

Relative Log Expression (RLE)

[Anders and Huber, 2010],edgeR-DESeq-DESeq2

Method:

1 compute apseudo-reference sample

2 center samples compared to the reference

3 calculate normalization factor: median of centered counts over the genes

0 5 10 15

0 250 500 750 1000

rank(mean) gene expression log2(count + 1)

Samples Sample 1 Sample 2

## with edgeR

calcNormFactors (... , method =" RLE ")

## with DESeq

estimateSizeFactors (...)

(35)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

wg(j,r)Mg(j,r) P

g:not trimmed

wg(j,r) withwg(j,r) =

Dj−Kgj

DjKgj + DDr−Kgr

rKgr

.

(36)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) =log2 Kgj

Dj

!

−log2 Kgr Dr

!

Ag(j,r) = 1 2

"

log2 Kgj Dj

!

+log2 Kgr Dr

!#

select as a reference sample, the sampler with the upper quartile closest to the average upper quartile

−1 0 1 2

M(j,r)

Trimmed values

None

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

wg(j,r)Mg(j,r) P

g:not trimmed

wg(j,r) withwg(j,r) =

Dj−Kgj

DjKgj + DDr−Kgr

rKgr

.

(37)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) =log2 Kgj

Dj

!

−log2 Kgr Dr

!

Ag(j,r) = 1 2

"

log2 Kgj Dj

!

+log2 Kgr Dr

!#

Trim 30% on M-values

−3

−2

−1 0 1 2

−20 −15 −10 −5

M(j,r)

Trimmed values

None 30% of Mg

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

wg(j,r)Mg(j,r) P

g:not trimmed

wg(j,r) withwg(j,r) =

Dj−Kgj

DjKgj + DDr−Kgr

rKgr

.

(38)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

⇒remove extreme data for fold-changed (M) and average intensity (A) Mg(j,r) =log2 Kgj

Dj

!

−log2 Kgr Dr

!

Ag(j,r) = 1 2

"

log2 Kgj Dj

!

+log2 Kgr Dr

!#

Trim 5% on A-values −1

0 1 2

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

wg(j,r)Mg(j,r) P

g:not trimmed

wg(j,r) withwg(j,r) =

Dj−Kgj

DjKgj + DDr−Kgr

rKgr

.

(39)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

−3

−2

−1 0 1 2

−20 −15 −10 −5

A(j,r)

M(j,r)

Trimmed values

None 30% of Mg 5% of Ag

On remaining data, calculate the weighted mean of M-values:

TMM(j,r) = P

g:not trimmed

wg(j,r)Mg(j,r) P

g:not trimmed

wg(j,r) withwg(j,r) =

Dj−Kgj

DjKgj + DDr−Kgr

rKgr

.

(40)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010],edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed Correction factors:

˜

sj=2TMM(j,r) factors multiply to 1: sj= ˜sj exp1

N

PN

l=1log(˜sl) calcNormFactors (... , method =" TMM ")

(41)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Purpose of the comparison:

finding the “best” method for all cases is not a realistic purpose

find an approach which isrobust enoughto provide relevant results in all cases

Method: comparison based on several criteria to select a method which is valid for multiple objectives

(42)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Effect on count distribution:

(43)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Effect on differential analysis(DESeq v. 1.6):

Inflated FPR for all methods except for TMM and DESeq (RLE).

(44)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Conclusion: Differences appear based on data characteristics

TMM and DESeq (RLE) are performant in a differential analysis context.

(45)

Practical session

import and understand data;

run different types of normalization;

compare the results...

(46)

Part II: Differential expression analysis

(47)

Different steps in hypothesis testing

1 formulate anhypothesis H0:

H0: the average count for genegin the control samples is the same that the average count in the treated samples

which is tested against an alternative H1: the average count for gene gin the control samples is different from the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H0

4 deduce the probability that the observations occur under H0: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H0is unlikely: we say that “H0 is rejected”. We have that:α=PH0(H0is rejected).

(48)

Different steps in hypothesis testing

1 formulate anhypothesis H0:

H0: the average count for genegin the control samples is the same that the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H0

4 deduce the probability that the observations occur under H0: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H0is unlikely: we say that “H0 is rejected”. We have that:α=PH0(H0is rejected).

(49)

Different steps in hypothesis testing

1 formulate anhypothesis H0:

H0: the average count for genegin the control samples is the same that the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H0

4 deduce the probability that the observations occur under H0: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H0is unlikely: we say that “H0 is rejected”. We have that:α=PH0(H0is rejected).

(50)

Different steps in hypothesis testing

1 formulate anhypothesis H0:

H0: the average count for genegin the control samples is the same that the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H0

4 deduce the probability that the observations occur under H0: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H0is unlikely: we say that “H0 is rejected”. We have that:α=PH0(H0is rejected).

(51)

Different steps in hypothesis testing

1 formulate anhypothesis H0:

H0: the average count for genegin the control samples is the same that the average count in the treated samples

2 from observations, calculate atest statistics(e.g., the mean in the two samples)

3 find thetheoretical distribution of the test statistics under H0

4 deduce the probability that the observations occur under H0: this is called thep-value

5 conclude: if the p-value is low (usually belowα=5% as a convention), H0is unlikely: we say that “H0 is rejected”.

We have that:α=PH0(H0is rejected).

(52)

Summary of the possible decisions

(53)

Types of errors in tests

Reality

H0 is true H0 is false

Decision

Do not reject H0 Correct decision Type II error ,(True Negative) /(False Negative) Reject H0 Type I error Correct decision

/(False Positive) ,(True Positive)

P(Type I error) =α(risk) P(Type II error) =1−β(β: power)

(54)

Why performing a large number of tests might be a problem?

Framework: Suppose you are performingGtests at levelα. P(at least one FP if H0is always true) =1−(1−α)G Ex: forα=5% andG=20,

P(at least one FP if H0is always true)'64%!!!

Probability to have at least one false positive versus the number of tests performed when H0is true for allGtests

0.25 0.50 0.75 1.00

0 25 50 75 100

Number of tests

Probability

For more than 75 tests and if H0 is always true, the probability to have at least one false positive is very close to 100%!

(55)

Why performing a large number of tests might be a problem?

Framework: Suppose you are performingGtests at levelα. P(at least one FP if H0is always true) =1−(1−α)G Ex: forα=5% andG=20,

P(at least one FP if H0is always true)'64%!!!

Probability to have at least one false positive versus the number of tests performed when H0is true for allGtests

0.25 0.50 0.75 1.00

Probability

For more than 75 tests and if H0 is always true, the probability to have at least one false positive is very close to 100%!

(56)

Notation for multiple tests

Number of decisions forGindependent tests:

True null False null Total hypotheses hypotheses

Rejected U V R

Not rejected G0−U G1−V G−R

Total G0 G1 G

Instead of the riskα, control:

familywise error rate (FWER): FWER=P(U >0)(i.e., probability to have at least one false positive decision)

false discovery rate (FDR): FDR=E(Q)with Q=









U/R ifR >0 0 otherwise

(57)

Notation for multiple tests

Number of decisions forGindependent tests:

True null False null Total hypotheses hypotheses

Rejected U V R

Not rejected G0−U G1−V G−R

Total G0 G1 G

Instead of the riskα, control:

familywise error rate (FWER): FWER=P(U >0)(i.e., probability to have at least one false positive decision)

false discovery rate (FDR): FDR=E(Q)with Q =









U/R ifR >0 0 otherwise

(58)

Adjusted p-values

Settings: p-valuesp1, . . . ,pG (e.g., corresponding toGtests onG different genes)

Adjusted p-values

adjusted p-values are˜p1, . . . ,˜pG such that

Rejecting tests such that p˜g < α ⇐⇒ P(U >0)≤α or E(Q)≤α

Calculating p-values

1 order the p-valuesp(1)≤p(2)≤. . .≤p(G)

2 calculate˜p(g)=agp(g)

I withBonferronimethod:ag=G(FWER)

I withBenjamini & Hochbergmethod:ag=G/g(FDR)

3 if adjusted p-valuesp˜(g)are larger than 1, correct˜p(g)←min{˜p(g),1}

(59)

Adjusted p-values

Settings: p-valuesp1, . . . ,pG (e.g., corresponding toGtests onG different genes)

Adjusted p-values

adjusted p-values are˜p1, . . . ,˜pG such that

Rejecting tests such that p˜g < α ⇐⇒ P(U >0)≤α or E(Q)≤α Calculating p-values

1 order the p-valuesp(1)≤p(2)≤. . .≤p(G)

2 calculate˜p(g)=agp(g)

I withBonferronimethod:ag=G(FWER)

I withBenjamini & Hochbergmethod:ag=G/g(FDR)

3 if adjusted p-valuesp˜(g)are larger than 1, correct˜p(g)←min{˜p(g),1}

(60)

Adjusted p-values

Settings: p-valuesp1, . . . ,pG (e.g., corresponding toGtests onG different genes)

Adjusted p-values

adjusted p-values are˜p1, . . . ,˜pG such that

Rejecting tests such that p˜g < α ⇐⇒ P(U >0)≤α or E(Q)≤α Calculating p-values

1 order the p-valuesp(1)≤p(2)≤. . .≤p(G)

2 calculate˜p(g)=agp(g)

I withBonferronimethod:ag=G(FWER)

I withBenjamini & Hochbergmethod:ag=G/g(FDR)

3 if adjusted p-valuesp˜(g)are larger than 1, correct˜p(g)←min{˜p(g),1}

(61)

Adjusted p-values

Settings: p-valuesp1, . . . ,pG (e.g., corresponding toGtests onG different genes)

Adjusted p-values

adjusted p-values are˜p1, . . . ,˜pG such that

Rejecting tests such that p˜g < α ⇐⇒ P(U >0)≤α or E(Q)≤α Calculating p-values

1 order the p-valuesp(1)≤p(2)≤. . .≤p(G)

2 calculate˜p(g)=agp(g)

I withBonferronimethod:ag=G(FWER)

I withBenjamini & Hochbergmethod:ag=G/g(FDR)

3 if adjusted p-values˜p(g)are larger than 1, correct˜p(g)←min{˜p(g),1}

(62)

Fisher’s exact test for contingency tables

After normalization, one may build a contingency table like this one:

treated control Total

geneg ngA ngB ng

other genes NA−ngA NB−ngB N−ng

Total NA NB N

Question: is the number of reads of genegin the treated sample significatively different than in the control sample?

Method

Direct calculation of the probability to obtain such a contingency table (or a

“more extreme” contingency table) with:

independency between the two columns of the contingency tables; the same marginals (“Total”).

(63)

Fisher’s exact test for contingency tables

After normalization, one may build a contingency table like this one:

treated control Total

geneg ngA ngB ng

other genes NA−ngA NB−ngB N−ng

Total NA NB N

Question: is the number of reads of genegin the treated sample significatively different than in the control sample?

Method

Direct calculation of the probability to obtain such a contingency table (or a

“more extreme” contingency table) with:

independency between the two columns of the contingency tables;

the same marginals (“Total”).

(64)

Example of results obtained with the Fisher test

Genes declared significantly differentially expressed are in pink:

−6

−3 0 3 6

0 5 10 15

mean log2 expression log2 fold change

Main remark: more

conservative for genes with a low expression

Limitation of Fisher test

Highly expressed genes have a very large variance! As Fisher test does not estimate variance, ittends to detect false positives among highly expressed genes⇒do not use it!

(65)

Example of results obtained with the Fisher test

Genes declared significantly differentially expressed are in pink:

−6

−3 0 3 6

0 5 10 15

mean log2 expression log2 fold change

Main remark: more

conservative for genes with a low expression

Limitation of Fisher test

Highly expressed genes have a very large variance! As Fisher test does not estimate variance, ittends to detect false positives among highly expressed genes⇒do not use it!

(66)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for geneg,Kg11 , ...,Kgn1

1(condition 1) andKg12 , ...,Kgn2

2

(condition 2)

choose an appropriate distributionto model count data (discrete data, overdispersion)

Kgjk ∼NB(sjkλgk, φg) in which:

I sjkis library size of samplejin conditionk

I λgk is the proportion of counts for genegin conditionk

I φgis the dispersion of geneg(supposed to be identical for all samples)

estimate its parametersfor both conditions

λg1 λg2 φg

concludeby calculating p-value

⇒Test H0:{λg1g2}

(67)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for geneg,Kg11 , ...,Kgn1

1(condition 1) andKg12 , ...,Kgn2

2

(condition 2)

choose an appropriate distributionto model count data (discrete data, overdispersion)

Kgjk ∼NB(sjkλgk, φg) in which:

I sjkis library size of samplejin conditionk

I λgk is the proportion of counts for genegin conditionk

I φgis the dispersion of geneg(supposed to be identical for all samples) estimate its parametersfor both conditions

λg1 λg2 φg

concludeby calculating p-value

⇒Test H0:{λg1g2}

(68)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for geneg,Kg11 , ...,Kgn1

1(condition 1) andKg12 , ...,Kgn2

2

(condition 2)

choose an appropriate distributionto model count data (discrete data, overdispersion)

Kgjk ∼NB(sjkλgk, φg) in which:

I sjkis library size of samplejin conditionk

I λgk is the proportion of counts for genegin conditionk

I φgis the dispersion of geneg(supposed to be identical for all samples) estimate its parametersfor both conditions

λg1 λg2 φg

concludeby calculating p-value

⇒Test H0:{λg1g2}

(69)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for geneg,Kg11 , ...,Kgn1

1(condition 1) andKg12 , ...,Kgn2

2

(condition 2)

choose an appropriate distributionto model count data (discrete data, overdispersion)

Kgjk ∼NB(sjkλgk, φg) in which:

I sjkis library size of samplejin conditionk

I λgk is the proportion of counts for genegin conditionk

I φgis the dispersion of geneg(supposed to be identical for all samples) estimate its parametersfor both conditions

λg1 λg2 φg

concludeby calculating p-value⇒Test H0:{λg1g2}

(70)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalizationis performed to get equal size librairies⇒s

Kg11 +. . .+Kgn11∼NB(sλg1, φg/n1)(and similarly for the second condition)

1 λg1 andλg2 are estimated (mean of the distributions)

2 φg is estimated independently ofλg1andλg2, using different approaches to account for small sample size

3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters)

(71)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalizationis performed to get equal size librairies⇒s

Kg11 +. . .+Kgn11∼NB(sλg1, φg/n1)(and similarly for the second condition)

1 λg1 andλg2 are estimated (mean of the distributions)

2 φg is estimated independently ofλg1andλg2, using different approaches to account for small sample size

3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters)

(72)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalizationis performed to get equal size librairies⇒s

Kg11 +. . .+Kgn11∼NB(sλg1, φg/n1)(and similarly for the second condition)

1 λg1 andλg2 are estimated (mean of the distributions)

2 φg is estimated independently ofλg1andλg2, using different approaches to account for small sample size

3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters)

(73)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalizationis performed to get equal size librairies⇒s

Kg11 +. . .+Kgn11∼NB(sλg1, φg/n1)(and similarly for the second condition)

1 λg1 andλg2 are estimated (mean of the distributions)

2 φg is estimated independently ofλg1andλg2, using different approaches to account for small sample size

3 The test is performed similarly as for Fisher test (exact probability calculation according to estimated paramters)

(74)

Estimating the dispersion parameter φ

g

Some methods:

DESeq,DESeq2: φg is a smooth function ofλgg1g2

dge <- estimateDispersion ( dge )

5e−01 5e+00 5e+01 5e+02

0.010.050.505.00

mean of normalized counts

dispersion

gene−est fitted final

edgeR: estimate a common dispersion parameter for all genes and use it as a prior in a Bayesian approach to estimate a gene specific dispersion parameter

dge <- estimateCommonDisp ( dge ) dge <- estimateTagwiseDisp ( dge )

(75)

Perform the test

Some methods:

DESeq,DESeq2: exact (DESeq) or approximate (Wald and LR in DESeq2) tests

res <- nbinomWaldTest ( dge ) results ( res )

res <- nbinomLR ( dge ) results ( res )

edgeR: exact tests

res <- exactTest ( dge ) topTags ( res )

(comparison between methods in [Zhang et al., 2014])

(76)

More complex experiments: GLM

Framework:

Kgj ∼NB(µgj, φg) with log(µgj) =log(sj) +log(λgj) in which:

sjis the library size for samplej;

log(λgj)is estimated (for instance) by aGeneralized Linear Model (GLM):

log(λgj) =λ0+x>j βg in whichxiis a vector of covariates.

GLM allows to decompose the effects on the mean of different factors

their interactions

(77)

More complex experiments: GLM

Framework:

Kgj ∼NB(µgj, φg) with log(µgj) =log(sj) +log(λgj) in which:

sjis the library size for samplej;

log(λgj)is estimated (for instance) by aGeneralized Linear Model (GLM):

log(λgj) =λ0+x>j βg in whichxiis a vector of covariates.

GLM allows to decompose the effects on the mean of different factors

their interactions

(78)

More complex experiments: GLM

Framework:

Kgj ∼NB(µgj, φg) with log(µgj) =log(sj) +log(λgj) in which:

sjis the library size for samplej;

log(λgj)is estimated (for instance) by aGeneralized Linear Model (GLM):

log(λgj) =λ0+x>j βg in whichxiis a vector of covariates.

GLM allows to decompose the effects on the mean of different factors

(79)

More complex experiments: GLM in practice

edgeR

dge <- estimateDisp (dge , design ) fit <- glmFit (dge , design )

res <- glmRT (fit , ...) topTags ( res )

DESeq,DESeq2

dge <- newCountDataSet ( counts , design ) dge <- estimateSizeFactors ( dge )

dge <- estimateDispersions ( dge )

fit <- fitNbinomGLMs (dge , count ~ ...) fit0 <- fitNbinomGLMs (dge , count ~ 1) res <- nbinomGLMTest (fit , fit0 )

p. adjust (res , method = " BH ")

Références

Documents relatifs

The main features of the tool that make it different from what is already available are the possibility of directly processing the results of the differential expression analysis,

This work features 4 important novelties: i) we propose a statistically sound nonparametric estimation of the inherent mean-variance relationship in RNA- seq data; ii) we present

In addition to the intersection of independent per-study analyses (where genes were declared to be differentially expressed if identified in more than half of the studies with

performed survival treatment of the Kras;p53 mouse model of PDAC, contributed to MRI animal imaging and tumor volume analysis, contributed to analysis of PDAC xenografts and Myc

Même si l’al- lure globale est similaire à celle que l’on a obtenu lors des simulations par éléments finis (voir chapitre 3), les champs de gradients sont très bruités, ceci

From the temperature measurements, the long time constants for trapping are found to be due to a high energy barrier for carriers to be trapped. The dependency

r Basic Protocol 4: integration of ChIP-seq and RNA-seq results: comparison be- tween genes associated with the ChIP-seq peaks, differentially expressed genes reported by

In patients with leptospirosis-related acute lung injury who are mechanically ventilated, three variables independently predicted mortality: hemodynamic disorders, serum creatinine