Normalization and differential analysis of RNA-seq data

Nathalie Villa-Vialaneix INRA, Toulouse, MIAT

(Mathématiques et Informatique Appliquées de Toulouse)

[email protected] http://www.nathalievilla.org

SPS Summer School 2016

From gene expression to genomic network

(2)

Outline

1 Normalization

2 Differential expression analysis

Hypothesis testing and correction for multiple tests

Differential expression analysis for RNAseq data

(3)

A typical transcriptomic experiment

(5)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads)

skewed data

overdispersion (variance ≫ mean)

[Figure: the black line is “variance = mean”]

(8)

Steps in RNAseq data analysis

(9)

Part I: Normalization

(10)

Purpose of normalization

identify and correct technical biases (due to sequencing process) to make counts comparable

types of normalization: within sample normalization and between sample normalization

(11)

Sources of variation in RNA-seq experiments

1 at the top layer: biological variations (i.e., individual differences due to, e.g., environmental or genetic factors)

2 at the middle layer: technical variations (library effect)

3 at the bottom layer: technical variations (lane and flow cell effects)

lane effect < flow cell effect < library effect ≪ biological effect

(13)

Within sample normalization

Example: (read counts)

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts for gene B are about twice as large as counts for gene A, because either:

gene B is expressed with twice as many transcripts as gene A, or

both genes are expressed with the same number of transcripts but gene B is twice as long as gene A.

Purpose of within sample normalization: enabling comparisons of genes from the same sample

Sources of variability: gene length, sequence composition (GC content)

These differences do not need to be corrected for a differential analysis and are not really relevant for data interpretation.

(17)

Between sample normalization

Example: (read counts)

          sample 1   sample 2   sample 3
gene A         752        615       1203
gene B        1507       1225       2455

Counts in sample 3 are much larger than counts in sample 2, because either:

gene A is more expressed in sample 3 than in sample 2, or

gene A is expressed similarly in the two samples but sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes).

Purpose of between sample normalization: enabling comparisons of a gene in different samples

Sources of variability: library size, ...

These differences must be corrected for a differential analysis and for data interpretation.

(21)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts

In practice: Raw counts correspond to different sequencing depths (seven samples, control and treated)

Gene 1        5       1       0       0       4       0       0
Gene 2        0       2       1       2       1       0       0
Gene 3       92     161      76      70     140      88      70
  :           :       :       :       :       :       :       :
Gene G       15      25       9       5      20      14      17

In practice: A multiplicative correction factor $C_j$ is calculated for every sample

$C_j$       1.1     1.6     0.6     0.7     1.4     0.7     0.8

In practice: Every count is multiplied by the correction factor corresponding to its sample

Gene 3       92     161      76      70     140      88      70
$C_j$     × 1.1   × 1.6   × 0.6   × 0.7   × 1.4   × 0.7   × 0.8
Gene 3    101.2   257.6    45.6      49     196    61.6      56

Consequences: Library sizes for normalized counts are roughly equal.

Gene 1      5.5     1.6       0       0     5.6       0       0
Gene 2        0     3.2     0.6     1.4     1.4       0       0
Gene 3    101.2   257.6    45.6      49     196    61.6      56
  :           :       :       :       :       :       :       :
Gene G     16.5      40     5.4     3.5      28     9.8    13.6
Lib. size  13.1    13.0    13.2    13.1    13.2    13.0    13.1   (× 10^5)

Definition

If $K_{gj}$ is the raw count for gene $g$ in sample $j$ then the normalized count is defined as:

$$\tilde{K}_{gj} = \frac{K_{gj}}{s_j}$$

in which $s_j = C_j^{-1}$ is the scaling factor for sample $j$.

Three types of methods:

distribution adjustment

method taking length into account

the “effective library size” concept
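The arithmetic of the definition above can be checked on the Gene 3 row of the example. The short R sketch below only reuses the numbers shown on the slide; the variable names (`raw_gene3`, `C`, `s`, `norm_gene3`) are illustrative choices, not part of the original material.

```r
# Raw counts for gene 3 across the seven samples (values from the slide).
raw_gene3 <- c(92, 161, 76, 70, 140, 88, 70)

# Multiplicative correction factors C_j, one per sample (values from the slide).
C <- c(1.1, 1.6, 0.6, 0.7, 1.4, 0.7, 0.8)

# Equivalent scaling factors s_j = 1 / C_j.
s <- 1 / C

# Normalized counts: K_gj / s_j, which is the same as K_gj * C_j.
norm_gene3 <- raw_gene3 / s
print(round(norm_gene3, 1))
```

Dividing by $s_j$ and multiplying by $C_j$ are the same operation, which is why both conventions appear in software documentation.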

(28)

Distribution adjustment

Total read count adjustment [Mortazavi et al., 2008]

$$s_j = \frac{D_j}{\frac{1}{N} \sum_{l=1}^{N} D_l}$$

in which $N$ is the number of samples and $D_j = \sum_g K_{gj}$.

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2; panels: raw counts, normalized counts]

edgeR:

cpm(..., normalized.lib.sizes = TRUE)

(29)

(Upper) Quartile normalization [Bullard et al., 2010]

$$s_j = \frac{Q_j^{(p)}}{\frac{1}{N} \sum_{l=1}^{N} Q_l^{(p)}}$$

in which $Q_j^{(p)}$ is a given quantile (generally the 3rd quartile) of the count distribution in sample $j$.

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2; panels: raw counts, normalized counts]

edgeR:

calcNormFactors(..., method = "upperquartile", p = 0.75)
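As a minimal sketch of the two distribution adjustments, the base R code below applies the formulas for $s_j$ literally to a small made-up count matrix. The matrix values and all variable names are assumptions for illustration; the edgeR functions named on the slides may differ in details (for instance in how zero counts and library sizes are handled), so this is not a reimplementation of them.

```r
# Toy count matrix: genes in rows, samples in columns (made-up values).
counts <- matrix(c(752, 615, 1203,
                   1507, 1225, 2455,
                   10, 8, 20,
                   0, 1, 0),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Total read count adjustment: s_j = D_j / mean(D), with D_j the library size.
D <- colSums(counts)
s_total <- D / mean(D)

# Upper quartile adjustment: s_j = Q_j(p) / mean(Q(p)), here with p = 0.75.
Q <- apply(counts, 2, quantile, probs = 0.75)
s_uq <- Q / mean(Q)

# Normalized counts: divide each column by its scaling factor.
norm_total <- sweep(counts, 2, s_total, "/")
print(colSums(norm_total))  # every library size now equals mean(D)
```

With the total read count adjustment the normalized library sizes are exactly equal; with the upper quartile adjustment they are only approximately equal, because the baseline is a quantile rather than the sum.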

(30)

Method using gene lengths (intra & inter sample normalization)

RPKM: Reads Per Kilobase per Million mapped Reads

Assumptions: read counts are proportional to expression level, transcript length and sequencing depth

$$s_j = \frac{D_j L_g}{10^3 \times 10^6}$$

in which $L_g$ is the gene length (bp).

edgeR:

rpkm(..., gene.length = ...)
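The RPKM formula can be sketched in a few lines of base R. The function name `rpkm_manual`, the toy counts and the gene lengths below are hypothetical; the "others" row stands for all remaining genes lumped together so that the library sizes are round numbers.

```r
# RPKM computed directly from the definition: K_gj * 1e9 / (D_j * L_g).
rpkm_manual <- function(counts, gene_length) {
  D <- colSums(counts)                           # library size per sample
  per_million <- sweep(counts, 2, D / 1e6, "/")  # reads per million mapped reads
  per_million / (gene_length / 1e3)              # per kilobase (divides row g by L_g / 1000)
}

counts <- matrix(c(500, 1000,
                   1000, 2000,
                   998500, 1997000),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "others"),
                                 c("sample1", "sample2")))
gene_length <- c(1000, 2000, 1e6)  # in bp, made-up values

res <- rpkm_manual(counts, gene_length)
print(res[c("geneA", "geneB"), ])
```

In this toy example gene B has twice the counts of gene A but is twice as long, and sample 2 is sequenced twice as deeply as sample 1, so all four RPKM values coincide.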

(31)

Relative Log Expression (RLE)

[Anders and Huber, 2010], edgeR - DESeq - DESeq2

Method:

1 compute a pseudo-reference sample: geometric mean across samples

$$R_g = \left( \prod_{j=1}^{N} K_{gj} \right)^{1/N}$$

(geometric mean is less sensitive to extreme values than standard mean)

2 center samples compared to the reference

$$\tilde{K}_{gj} = \frac{K_{gj}}{R_g}$$

3 calculate normalization factor: median of centered counts over the genes

$$\tilde{s}_j = \operatorname{median}_g \left\{ \tilde{K}_{gj} \right\}$$

factors multiply to 1:

$$s_j = \frac{\tilde{s}_j}{\exp\left( \frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l) \right)}$$

[Figures: log2(count + 1) against rank(mean) gene expression, and count / geo. mean, for Sample 1 and Sample 2]

## with edgeR

calcNormFactors(..., method = "RLE")

## with DESeq

estimateSizeFactors(...)
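The three steps of the RLE method can be sketched in base R as follows. The function name `rle_factors` is hypothetical, and restricting the computation to genes with strictly positive counts in every sample is an assumption made here so that the geometric mean is well defined; the packages cited above handle such details in their own way.

```r
# Sketch of RLE scaling factors, following the three steps of the slide.
rle_factors <- function(counts) {
  # Assumption: keep only genes with positive counts in all samples.
  pos <- apply(counts > 0, 1, all)
  log_counts <- log(counts[pos, , drop = FALSE])
  # Step 1: log of the pseudo-reference R_g (geometric mean across samples).
  log_ref <- rowMeans(log_counts)
  # Steps 2 and 3: median over genes of K_gj / R_g, computed on the log scale.
  s_tilde <- exp(apply(log_counts - log_ref, 2, median))
  # Rescale so that the factors multiply to 1.
  s_tilde / exp(mean(log(s_tilde)))
}

# Toy example: sample 2 has exactly twice the counts of sample 1.
counts <- cbind(sample1 = c(10, 20, 30, 40),
                sample2 = c(20, 40, 60, 80))
s <- rle_factors(counts)
print(s)
```

Working on the log scale avoids overflow in the product that defines the geometric mean when the number of samples is large.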

(35)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010], edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

⇒ remove extreme data for fold change (M) and average intensity (A)

$$M_g(j,r) = \log_2\left(\frac{K_{gj}}{D_j}\right) - \log_2\left(\frac{K_{gr}}{D_r}\right)$$

$$A_g(j,r) = \frac{1}{2}\left[ \log_2\left(\frac{K_{gj}}{D_j}\right) + \log_2\left(\frac{K_{gr}}{D_r}\right) \right]$$

select as the reference sample $r$ the sample with the upper quartile closest to the average upper quartile

Trim 30% on M-values

Trim 5% on A-values

[Figure: $M(j,r)$ against $A(j,r)$; trimmed values: none, 30% of $M_g$, 5% of $A_g$]

On remaining data, calculate the weighted mean of M-values:

$$\mathrm{TMM}(j,r) = \frac{\sum_{g:\,\text{not trimmed}} w_g(j,r)\, M_g(j,r)}{\sum_{g:\,\text{not trimmed}} w_g(j,r)}$$

with

$$w_g(j,r) = \frac{D_j - K_{gj}}{D_j K_{gj}} + \frac{D_r - K_{gr}}{D_r K_{gr}}.$$

Correction factors:

$$\tilde{s}_j = 2^{\mathrm{TMM}(j,r)}$$

factors multiply to 1:

$$s_j = \frac{\tilde{s}_j}{\exp\left( \frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l) \right)}$$

calcNormFactors(..., method = "TMM")
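A rough base R sketch of the TMM computation for one sample $j$ against a reference $r$ is given below. Several points are assumptions and not taken from the slides: the function name `tmm_factor`, the quantile-based trimming applied to each tail, the removal of genes with a zero count, and above all the weighting, which uses the inverse of the variance-like quantity $\frac{D_j - K_{gj}}{D_j K_{gj}} + \frac{D_r - K_{gr}}{D_r K_{gr}}$ (precision weighting) rather than that quantity itself. It should be read as an illustration of the idea, not as edgeR's implementation.

```r
# Sketch of the unscaled TMM factor 2^TMM(j, r) for sample j against reference r.
tmm_factor <- function(k_j, k_r, trim_m = 0.30, trim_a = 0.05) {
  D_j <- sum(k_j)
  D_r <- sum(k_r)
  keep <- k_j > 0 & k_r > 0          # logs require positive counts (assumption)
  k_j <- k_j[keep]
  k_r <- k_r[keep]
  M <- log2(k_j / D_j) - log2(k_r / D_r)          # log fold change
  A <- 0.5 * (log2(k_j / D_j) + log2(k_r / D_r))  # average log intensity
  # Variance-like term; its inverse is used as the weight (assumption).
  v <- (D_j - k_j) / (D_j * k_j) + (D_r - k_r) / (D_r * k_r)
  # Keep values between the trim and (1 - trim) quantiles (each tail trimmed).
  in_range <- function(x, trim) {
    q <- quantile(x, c(trim, 1 - trim))
    x >= q[1] & x <= q[2]
  }
  ok <- in_range(M, trim_m) & in_range(A, trim_a)
  2^(sum(M[ok] / v[ok]) / sum(1 / v[ok]))
}

# Toy check: a sample that is an exact multiple of the reference has M = 0
# for every gene, hence a factor of 1.
k_ref <- c(10, 50, 100, 200, 500, 1000, 5000, 20, 30, 70)
print(tmm_factor(2 * k_ref, k_ref))
```

The final rescaling step (dividing each factor by the geometric mean of all factors so that they multiply to 1) is the same as for RLE and is omitted here.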

(41)

Comparison of the different approaches

[Dillies et al., 2013], (6 simulated datasets)

Purpose of the comparison:

finding the “best” method for all cases is not a realistic purpose

find an approach which is robust enough to provide relevant results in all cases

Method: comparison based on several criteria to select a method which is valid for multiple objectives

Effect on count distribution: (figure)

Effect on differential analysis (DESeq v. 1.6): inflated FPR for all methods except for TMM and DESeq (RLE).

Conclusion: Differences appear based on data characteristics. TMM and DESeq (RLE) perform well in a differential analysis context.

(45)

Practical session

import and understand data;

run different types of normalization;

compare the results...

(46)

Part II: Differential expression analysis

(47)

Different steps in hypothesis testing

1 formulate a hypothesis H0:

H0: the average count for gene $g$ in the control samples is the same as the average count in the treated samples

which is tested against an alternative H1: the average count for gene $g$ in the control samples is different from the average count in the treated samples

2 from observations, calculate a test statistic (e.g., based on the means in the two groups of samples)

3 find the theoretical distribution of the test statistic under H0

4 deduce the probability, under H0, of obtaining observations at least as extreme as those actually observed: this is called the p-value

5 conclude: if the p-value is low (usually below $\alpha = 5\%$ as a convention), the observations are unlikely under H0: we say that “H0 is rejected”.

We have that: $\alpha = P_{H0}(\text{H0 is rejected})$.

(52)

Summary of the possible decisions

(53)

Types of errors in tests

                      Reality
Decision              H0 is true                          H0 is false
Do not reject H0      Correct decision (True Negative)    Type II error (False Negative)
Reject H0             Type I error (False Positive)       Correct decision (True Positive)

P(Type I error) = $\alpha$ (risk)

P(Type II error) = $1 - \beta$ ($\beta$: power)

(54)

Why might performing a large number of tests be a problem?

Framework: Suppose you are performing $G$ independent tests at level $\alpha$.

$$P(\text{at least one FP if H0 is always true}) = 1 - (1 - \alpha)^G$$

Ex: for $\alpha = 5\%$ and $G = 20$, $P(\text{at least one FP if H0 is always true}) \simeq 64\%$!

[Figure: probability to have at least one false positive versus the number of tests performed, when H0 is true for all $G$ tests]

For more than 75 tests and if H0 is always true, the probability to have at least one false positive is very close to 100%!
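The formula $1 - (1 - \alpha)^G$ is easy to evaluate directly. The snippet below is a small illustration in base R; the chosen values of `G` and the variable names are arbitrary.

```r
# Probability of at least one false positive among G independent tests
# performed at level alpha when H0 is true for all of them.
alpha <- 0.05
G <- c(1, 20, 75, 100)
p_at_least_one_fp <- 1 - (1 - alpha)^G
print(round(p_at_least_one_fp, 3))
```

The probability grows quickly with the number of tests, which is what motivates replacing the per-test risk $\alpha$ by a global error criterion.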

(56)

Notation for multiple tests

Number of decisions for $G$ independent tests:

                True null     False null    Total
                hypotheses    hypotheses
Rejected        U             V             R
Not rejected    G0 − U        G1 − V        G − R
Total           G0            G1            G

Instead of the risk $\alpha$, control:

familywise error rate (FWER): $\mathrm{FWER} = P(U > 0)$ (i.e., probability to have at least one false positive decision)

false discovery rate (FDR): $\mathrm{FDR} = E(Q)$ with

$$Q = \begin{cases} U/R & \text{if } R > 0 \\ 0 & \text{otherwise} \end{cases}$$

(58)

Adjusted p-values

Settings: p-values $p_1, \ldots, p_G$ (e.g., corresponding to $G$ tests on $G$ different genes)

Adjusted p-values

adjusted p-values are $\tilde{p}_1, \ldots, \tilde{p}_G$ such that rejecting the tests with $\tilde{p}_g < \alpha$ ensures $P(U > 0) \le \alpha$ (FWER control) or $E(Q) \le \alpha$ (FDR control)

Calculating adjusted p-values

1 order the p-values $p_{(1)} \le p_{(2)} \le \ldots \le p_{(G)}$

2 calculate $\tilde{p}_{(g)} = a_g\, p_{(g)}$

with Bonferroni method: $a_g = G$ (FWER)

with Benjamini & Hochberg method: $a_g = G/g$ (FDR)

3 if adjusted p-values $\tilde{p}_{(g)}$ are larger than 1, correct $\tilde{p}_{(g)} \leftarrow \min\{\tilde{p}_{(g)}, 1\}$
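The recipe above can be sketched in base R and compared with the built-in `p.adjust` function. The p-values are made up. One step in the code is not on the slide and is flagged as an addition: for Benjamini & Hochberg, the adjusted values are made monotone in the ordered p-values (a running minimum taken from the largest p-value downwards), which is what `p.adjust(method = "BH")` does.

```r
# Made-up p-values, deliberately not sorted.
p <- c(0.041, 0.001, 0.212, 0.008, 0.06, 0.039, 0.216, 0.042, 0.205, 0.074)
G <- length(p)

# Step 1: order the p-values.
o <- order(p)
p_sorted <- p[o]

# Steps 2 and 3: multiply by a_g and cap at 1.
bonf_sorted <- pmin(G * p_sorted, 1)                   # a_g = G
bh_sorted <- pmin(G / seq_len(G) * p_sorted, 1)        # a_g = G / g

# Addition to the slide: enforce monotonicity of the BH adjusted p-values.
bh_sorted <- rev(cummin(rev(bh_sorted)))

# Put the adjusted p-values back in the original order.
bonf_adj <- numeric(G)
bonf_adj[o] <- bonf_sorted
bh_adj <- numeric(G)
bh_adj[o] <- bh_sorted

print(cbind(p, bonf_adj, bh_adj))
```

In practice one would simply call `p.adjust(p, method = "bonferroni")` or `p.adjust(p, method = "BH")`; the manual version is only there to make the steps explicit.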

(62)

Fisher’s exact test for contingency tables

After normalization, one may build a contingency table like this one:

               treated          control          Total
gene g         n_gA             n_gB             n_g
other genes    N_A − n_gA       N_B − n_gB       N − n_g
Total          N_A              N_B              N

Question: is the number of reads of gene $g$ in the treated sample significantly different from that in the control sample?

Method

Direct calculation of the probability to obtain such a contingency table (or a “more extreme” contingency table) with:

independence between the two columns of the contingency table;

the same marginals (“Total”).
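As an illustration of the contingency table above, the base R function `fisher.test` can be applied to one gene. All counts below are hypothetical and chosen only to show the layout of the table.

```r
# Hypothetical read counts (made-up values, for illustration only).
n_gA <- 120   # reads of gene g in the treated sample
n_gB <- 60    # reads of gene g in the control sample
N_A <- 1e6    # total reads in the treated sample
N_B <- 1e6    # total reads in the control sample

# matrix() fills column by column: first the treated column, then the control one.
tab <- matrix(c(n_gA, N_A - n_gA,
                n_gB, N_B - n_gB),
              nrow = 2,
              dimnames = list(c("gene g", "other genes"),
                              c("treated", "control")))

res <- fisher.test(tab)
print(res$p.value)
print(res$estimate)  # conditional estimate of the odds ratio
```

The test conditions on the margins of the table, so it uses no replicate information at all.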

(64)

Example of results obtained with the Fisher test

Genes declared significantly differentially expressed are in pink:

[Figure: log2 fold change against mean log2 expression]

Main remark: more conservative for genes with a low expression

Limitation of Fisher test

Highly expressed genes have a very large variance! As the Fisher test does not estimate variance, it tends to detect false positives among highly expressed genes ⇒ do not use it!

(66)

Basic principles of tests for count data: 2 conditions and replicates

Notations: for gene $g$, $K^1_{g1}, \ldots, K^1_{g n_1}$ (condition 1) and $K^2_{g1}, \ldots, K^2_{g n_2}$ (condition 2)

choose an appropriate distribution to model count data (discrete data, overdispersion)

$$K^k_{gj} \sim \mathrm{NB}(s^k_j \lambda_{gk}, \phi_g)$$

in which:

$s^k_j$ is the library size of sample $j$ in condition $k$

$\lambda_{gk}$ is the proportion of counts for gene $g$ in condition $k$

$\phi_g$ is the dispersion of gene $g$ (supposed to be identical for all samples)

estimate its parameters for both conditions: $\lambda_{g1}$, $\lambda_{g2}$, $\phi_g$

conclude by calculating a p-value ⇒ Test H0: $\{\lambda_{g1} = \lambda_{g2}\}$
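To connect the $\mathrm{NB}(\text{mean}, \text{dispersion})$ notation with base R, the sketch below assumes the usual convention in which a negative binomial with mean $\mu$ and dispersion $\phi$ has variance $\mu + \phi \mu^2$, which corresponds to `size = 1 / phi` in `dnbinom`. The numerical values of `mu` and `phi` are hypothetical.

```r
mu <- 50     # mean, i.e. s * lambda in the notation above (hypothetical value)
phi <- 0.2   # dispersion (hypothetical value)

# Probability mass function on a range wide enough to hold almost all the mass.
k <- 0:5000
pmf <- dnbinom(k, mu = mu, size = 1 / phi)

# Mean and variance computed numerically from the pmf.
nb_mean <- sum(k * pmf)
nb_var <- sum((k - nb_mean)^2 * pmf)

print(c(mean = nb_mean, variance = nb_var,
        expected_variance = mu + phi * mu^2))
```

The variance is much larger than the mean, which is the overdispersion that a Poisson model (variance equal to the mean) cannot capture.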

(70)

First method: Exact Negative Binomial test

[Robinson and Smyth, 2008]

Normalization is performed to get libraries of equal size $s$ ⇒

$$K^1_{g1} + \ldots + K^1_{g n_1} \sim \mathrm{NB}(n_1 s \lambda_{g1}, \phi_g / n_1)$$

(and similarly for the second condition)

1 $\lambda_{g1}$ and $\lambda_{g2}$ are estimated (mean of the distributions)

2 $\phi_g$ is estimated independently of $\lambda_{g1}$ and $\lambda_{g2}$, using different approaches to account for small sample size

3 The test is performed similarly to the Fisher test (exact probability calculation according to the estimated parameters)

(74)

Estimating the dispersion parameter $\phi_g$

Some methods:

DESeq, DESeq2: $\phi_g$ is a smooth function of $\lambda_g = \lambda_{g1} = \lambda_{g2}$

    dge <- estimateDispersions(dge)

[Figure: dispersion versus mean of normalized counts (log scales); legend: gene-wise estimates, fitted curve, final estimates]

edgeR: estimate a common dispersion parameter for all genes and use it as a prior in a Bayesian approach to estimate a gene-specific dispersion parameter

    dge <- estimateCommonDisp(dge)
    dge <- estimateTagwiseDisp(dge)


Perform the test

Some methods:

DESeq, DESeq2: exact (DESeq) or approximate (Wald and likelihood ratio in DESeq2) tests

    # Wald test
    res <- nbinomWaldTest(dge)
    results(res)

    # Likelihood ratio test against a reduced (here intercept-only) model
    res <- nbinomLRT(dge, reduced = ~ 1)
    results(res)

edgeR: exact tests

    res <- exactTest(dge)
    topTags(res)

(comparison between methods in [Zhang et al., 2014])


More complex experiments: GLM

Framework:

$K_{gj} \sim \mathrm{NB}(\mu_{gj}, \phi_g)$ with $\log(\mu_{gj}) = \log(s_j) + \log(\lambda_{gj})$ in which:

- $s_j$ is the library size for sample $j$;

- $\log(\lambda_{gj})$ is estimated (for instance) by a Generalized Linear Model (GLM):

$\log(\lambda_{gj}) = \lambda_0 + x_j^\top \beta_g$ in which $x_j$ is a vector of covariates.

A GLM makes it possible to decompose the effects on the mean of:

- different factors

- their interactions
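As an illustration of how the covariate vectors $x_j$ encode factors and their interaction, here is a minimal base R sketch. The factor names (`genotype`, `treatment`), their levels, and the sample layout are hypothetical and not taken from the slides.

```r
# Hypothetical 2 x 2 design with 2 replicates per cell (8 samples)
samples <- data.frame(
  genotype  = factor(rep(c("wt", "mut"), each = 4), levels = c("wt", "mut")),
  treatment = factor(rep(rep(c("ctrl", "stress"), each = 2), times = 2),
                     levels = c("ctrl", "stress"))
)

# Row j of the design matrix is the covariate vector x_j:
# intercept, genotype effect, treatment effect, and their interaction
design <- model.matrix(~ genotype * treatment, data = samples)
colnames(design)
design
```

Each column corresponds to one coefficient of $\beta_g$, so testing a single coefficient (for instance the interaction column) asks whether that specific effect changes the mean expression of gene $g$.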


More complex experiments: GLM in practice

edgeR

    dge <- estimateDisp(dge, design)
    fit <- glmFit(dge, design)
    res <- glmLRT(fit, ...)
    topTags(res)

DESeq, DESeq2 (the calls below use the original DESeq interface)

    dge <- newCountDataSet(counts, design)
    dge <- estimateSizeFactors(dge)
    dge <- estimateDispersions(dge)
    fit <- fitNbinomGLMs(dge, count ~ ...)
    fit0 <- fitNbinomGLMs(dge, count ~ 1)
    res <- nbinomGLMTest(fit, fit0)
    p.adjust(res, method = "BH")

