(1)

Analyses biostatistiques de données RNA-seq

Nathalie Vialaneix (MIAT, INRA)

in collaboration with Ignacio Gonzàles and Annick Moisan

nathalie.vialaneix@inrae.fr http://www.nathalievialaneix.eu

Toulouse, 4-5 November 2020

(2)

Outline

1 Exploratory analysis
   Introduction
   Experimental design
   Data exploration and quality assessment

2 Normalization
   Raw data filtering
   Interpreting read counts

3 Differential Expression analysis
   Hypothesis testing and correction for multiple tests
   Differential expression analysis for RNAseq data
   Interpreting and improving the analysis

(3)

A typical transcriptomic experiment


(6)

A typical RNA-seq experiment

1 collect samples

2 generate libraries of cDNA fragments

3 assign the material to one lane of one flow cell

(7)

Steps in RNAseq data analysis

(8)

Part I: Experimental design


(9)

Confounded effects: a simple example

Basic experiment: find differences between control/treated plants

control group plant treated group plant


(10)

Confounded effects: a simple example

Basic experiment: find differences between control/treated plants

control group plant treated group plant

A bad experimental design: grow all control group plants in one field (Field 1) and all treated group plants in another field (Field 2).

This design is bad because differences due to the field cannot be distinguished from differences due to the treatment ⇒ confounded effects

(11)

Confounded effects: a simple example

Basic experiment: find differences between control/treated plants

control group plant treated group plant

A good experimental design: grow half of the control group plants (chosen at random) and half of the treated group plants in Field 1, and the rest in Field 2.

With this design, differences due to the field and differences due to the treatment can be distinguished.

In summary, what is a good experimental design?

Experimental designs are usually not as simple as this example: they can include multiple experimental factors (day of experiment, flow cell, ...) and multiple covariates (sex, parents, ...).

⇒ The experimental design must be carefully thought out before starting the experiment, and confounded effects must be searched for in a systematic manner.
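The random allocation of plants to fields described above can be sketched in base R. This is only an illustration: the plant identifiers, the group sizes (6 control and 6 treated plants) and the seed are hypothetical and do not come from the course material.

```r
# Hypothetical plants: 6 control and 6 treated (sizes chosen for illustration).
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
plants <- data.frame(
  plant = paste0("P", 1:12),
  treatment = rep(c("control", "treated"), each = 6),
  stringsAsFactors = FALSE
)

# Within a group of n plants, send a random half to each field.
assign_fields <- function(n) {
  sample(rep(c("Field 1", "Field 2"), length.out = n))
}

# Apply the random allocation separately within each treatment group.
plants$field <- ave(plants$plant, plants$treatment,
                    FUN = function(x) assign_fields(length(x)))

# Cross-tabulation: every field contains both treatments,
# so field and treatment are not confounded.
design_table <- table(plants$treatment, plants$field)
print(design_table)
```

With such a balanced table, a field effect and a treatment effect can both be included in the model used for the analysis.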


(13)

Effect & Variation

2 conditions, 2 genes whose expression distribution is:

first gene: different median levels between the two groups but large variance: differences may be non-significant

second gene: different median levels between the two groups but very small variance: differences may be significant

(14)

Source of variation in RNA-seq experiments

1 at the top layer: biological variations (i.e., individual differences due to, e.g., environmental or genetic factors)

2 at the middle layer: technical variations (library preparation effect)

3 at the bottom layer: technical variations (lane and flow cell effects)

lane effect < flow cell effect < library preparation effect ≪ biological effect


(16)

Advised policy for biological/technical replicates

biological replicates: different biological samples, processed separately (advised: ≥ 3 per condition)

technical replicates: same biological material but independent replications of the technical steps (from library preparation onwards; not mandatory in most cases)

6 biological replicates are better than 3 biological replicates × 2 technical replicates! See [Liu et al., 2014] for further discussion.


(18)

In summary: a simple two-condition experimental design for RNA-seq

Typical RNA-seq design: independent samples are sequenced in different lanes (the lane effect cannot be estimated but within-sample comparison is preserved)

[Figure: RNA extraction, then library preparation, then one sample per lane (Lane 1, Lane 2)]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate:

              Lane 1   Lane 2   Lane 3   Lane 4
Flow cell 1   C11      C21      T11      T21
Flow cell 2   C12      C22      T12      T22
Flow cell 3   C13      C23      T13      T23
Flow cell 4   C14      C24      T14      T24

Multiplex RNA-seq design: DNA fragments are barcoded so that multiple samples can be included in the same lane

[Figure: RNA extraction, then library preparation and barcoding (barcoded adapters), then pooling, then Lane 1 and Lane 2]

Example with 2 biological replicates per condition and 4 technical replicates per biological replicate, split across 4 flow cells:

Flow cell 1   Lane 1   C111 C211 T111 T211
Flow cell 1   Lane 2   C121 C221 T121 T221
Flow cell 1   Lane 3   C131 C231 T131 T231
Flow cell 1   Lane 4   C141 C241 T141 T241
Flow cell 2   Lane 1   C112 C212 T112 T212
Flow cell 2   Lane 2   C122 C222 T122 T222
Flow cell 2   Lane 3   C132 C232 T132 T232
Flow cell 2   Lane 4   C142 C242 T142 T242
Flow cell 3   Lane 1   C113 C213 T113 T213
Flow cell 3   Lane 2   C123 C223 T123 T223
Flow cell 3   Lane 3   C133 C233 T133 T233
Flow cell 3   Lane 4   C143 C243 T143 T243
Flow cell 4   Lane 1   C114 C214 T114 T214
Flow cell 4   Lane 2   C124 C224 T124 T224
Flow cell 4   Lane 3   C134 C234 T134 T234
Flow cell 4   Lane 4   C144 C244 T144 T244


(22)

Part II: Exploratory analysis



(25)

Some features of RNAseq data

What must be taken into account?

discrete, non-negative data (total number of aligned reads)

skewed data

overdispersion (variance ≫ mean)

[Figure: the black line is “variance = mean”]
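Overdispersion can be illustrated on simulated counts. The sketch below draws counts from a negative binomial distribution, which is an assumption made here for illustration (the slide does not name a distribution); the number of genes, the number of samples and the dispersion parameter `size` are hypothetical.

```r
set.seed(1)  # arbitrary seed
n_genes <- 1000
n_samples <- 6

# Hypothetical per-gene mean expression levels.
mu <- rexp(n_genes, rate = 1 / 200)

# Negative binomial counts: variance = mu + mu^2 / size, hence larger than mu.
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 2),
  nrow = n_genes
)

gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)

# Mean-variance plot on the log2 scale, with the "variance = mean" line.
plot(log2(gene_mean + 1), log2(gene_var + 1),
     xlab = "log2(mean + 1)", ylab = "log2(variance + 1)", pch = ".")
abline(0, 1)

# Proportion of genes whose empirical variance exceeds their mean.
prop_overdispersed <- mean(gene_var > gene_mean)
print(prop_overdispersed)
```

Most points lie above the line, which is the pattern the slide calls overdispersion.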

(26)

Dataset used in the examples

Dataset provided by courtesy of the transcriptomic platform of IPS2

Three files:

D1-counts.txt contains the raw counts of the experiment (13 columns: the first one contains the gene names, the others correspond to 12 different samples; gene names have been shuffled);

D1-genesLength.txt contains information about gene lengths;

D1-targets.txt contains information about the samples and the experimental design.

(27)

Dataset used in the examples

These text files are loaded with:

# raw counts: the gene names (first column) are used as row names
raw_counts <- read.table("D1-counts.txt", header = TRUE, row.names = 1)

raw_counts <- as.matrix(raw_counts)

# description of the samples and of the experimental design
design <- read.table("D1-targets.txt", header = TRUE, stringsAsFactors = FALSE)

# gene lengths
gene_lengths <- scan("D1-genesLength.txt")

(28)

Count distribution

The count distribution (i.e., the number of times a given count is obtained in the data) can be visualized with histograms (boxplots or violin plots can also be used).

This distribution is highly skewed and is better visualized after a log2 transformation.

library size

The library size is the sum of all counts in a given sample.
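A minimal base R sketch of these two quantities follows. It uses a simulated matrix as a stand-in for the real data (with the course files, `raw_counts` would be used instead); the matrix dimensions and the pseudo-count of 1 added before the log2 transformation are assumptions of this sketch.

```r
set.seed(2)  # arbitrary seed
# Simulated genes x samples matrix of raw counts (stand-in for raw_counts).
counts <- matrix(rnbinom(2000 * 4, mu = 100, size = 0.5), ncol = 4,
                 dimnames = list(NULL, paste0("sample", 1:4)))

# Histogram of the log2-transformed counts of the first sample;
# adding 1 avoids taking the log of zero counts.
hist(log2(counts[, 1] + 1), breaks = 50,
     main = "sample1", xlab = "log2(count + 1)")

# Library size: sum of all counts in each sample.
lib_sizes <- colSums(counts)
print(lib_sizes)

# Parallel boxplots to compare the count distributions between samples.
boxplot(log2(counts + 1), ylab = "log2(count + 1)")
```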


(30)

Count distribution between samples

The count distributions of different samples can be compared with parallel boxplots or violin plots.

It is expected that, within a given condition (group), the count distributions are similar. The same is often also expected between conditions.

(31)

Check reproducibility between samples

MA plots can be used to visualize reproducibility between samples of an experiment (and thus check whether normalization is needed). They plot the log-fold change (M-values) against the log-average (A-values):

M-values: log of the ratio between the counts of two samples:

$$M_g = \log_2(K_{gj}) - \log_2(K_{gj'})$$

A-values: average log counts of the two samples:

$$A_g = \frac{\log_2(K_{gj}) + \log_2(K_{gj'})}{2}$$

where $K_{gj}$ stands for the count for gene $g$ in sample $j$.
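The M- and A-values defined above can be computed directly in base R. The helper name `ma_values` and the pseudo-count (added so that zero counts do not produce infinite values) are assumptions of this sketch, as are the toy counts.

```r
# M- and A-values between two samples, following the definitions above.
# 'pseudo' is a hypothetical pseudo-count added before taking logs.
ma_values <- function(k_j, k_jprime, pseudo = 1) {
  log_j <- log2(k_j + pseudo)
  log_jprime <- log2(k_jprime + pseudo)
  data.frame(M = log_j - log_jprime,
             A = (log_j + log_jprime) / 2)
}

# Toy counts for four genes in two samples.
k_j <- c(7, 99, 0, 31)
k_jprime <- c(3, 99, 3, 127)
ma <- ma_values(k_j, k_jprime)
print(ma)

# MA plot: points scattered around M = 0 suggest good reproducibility.
plot(ma$A, ma$M, xlab = "A", ylab = "M")
abline(h = 0)
```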


(33)

Check similarity between samples

Similarities between samples can be visualized with a HAC and a heatmap:

perform a hierarchical agglomerative clustering (HAC) using the Euclidean distance between samples:

$$\delta(j, j') = \sqrt{\sum_g \left(\log_2(K_{gj}) - \log_2(K_{gj'})\right)^2}$$

visualize the strength of the similarity with a heatmap.
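These two steps can be sketched with the base R functions `dist`, `hclust` and `heatmap`. The simulated counts, the pseudo-count of 1 and the linkage method ("complete", the `hclust` default) are assumptions of this sketch; the slides do not say which linkage is used.

```r
set.seed(3)  # arbitrary seed
# Simulated genes x samples count matrix (stand-in for the real data).
counts <- matrix(rnbinom(500 * 6, mu = 50, size = 1), ncol = 6,
                 dimnames = list(NULL, paste0("s", 1:6)))
log_counts <- log2(counts + 1)

# dist() computes distances between rows, so samples are put in rows with t().
d <- dist(t(log_counts), method = "euclidean")

# Hierarchical clustering of the samples and its dendrogram.
hac <- hclust(d, method = "complete")
plot(hac)

# Heatmap of the sample-to-sample distance matrix.
dist_matrix <- as.matrix(d)
heatmap(dist_matrix, symm = TRUE)
```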

(34)

Search for the main structures in the data: PCA

PCA (on log2 counts) can be used to project the data into a low-dimensional space and search for unexpected experimental effects in the data.

(MDS is equivalent to PCA when used with the standard Euclidean distance)

Remark: In DESeq, the function plotPCA performs PCA on the top genes with the highest variance.
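A PCA on log2 counts restricted to the most variable genes can be sketched with `prcomp` from base R. This is not the `plotPCA` implementation: the simulated data, the pseudo-count and the number of genes kept (`ntop`) are hypothetical choices made for the illustration.

```r
set.seed(4)  # arbitrary seed
# Simulated genes x samples count matrix (stand-in for the real data).
counts <- matrix(rnbinom(1000 * 6, mu = 80, size = 1), ncol = 6,
                 dimnames = list(paste0("gene", 1:1000), paste0("s", 1:6)))
log_counts <- log2(counts + 1)

# Keep the genes with the highest variance (hypothetical cut-off).
ntop <- 500
gene_var <- apply(log_counts, 1, var)
top <- order(gene_var, decreasing = TRUE)[seq_len(ntop)]

# PCA with samples as observations (rows) and genes as variables.
pca <- prcomp(t(log_counts[top, ]), center = TRUE, scale. = FALSE)

# Share of variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Projection of the samples on the first two components.
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

Samples that group by processing day or flow cell rather than by condition on such a plot point to an unexpected experimental effect.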


(36)

Raw data filtering

Filtering consists in removing genes with low expression. Different strategies can be used:

[Sultan et al., 2008]: filter out genes with a total read count smaller than a given threshold;

[Bottomly et al., 2011]: filter out genes with zero count in an experimental condition;

[Robinson and Oshlack, 2010]: filter out genes such that the number of samples with a CPM value (for this gene) smaller than a given threshold is larger than the smallest number of samples in a condition.

CPM stands for Counts Per Million, i.e., the raw count divided by the library size and multiplied by 10^6; this strategy thus takes differences in library sizes into account.

More sophisticated filtering

To account for the fact that lowly expressed genes are almost never found differentially expressed, a more sophisticated filtering can be performed.
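The three simple strategies listed above can be sketched in base R on a toy matrix. The counts, the condition labels and both thresholds are hypothetical; in particular the CPM threshold is unrealistically high because the toy library sizes are tiny.

```r
# Toy count matrix: 4 genes, 3 control and 3 treated samples.
counts <- matrix(c(  0,  0,   0,   1,   0,   0,
                     5,  8,   6,   0,   0,   0,
                   120, 90, 150, 200, 180, 160,
                     3,  2,   4,  30,  25,  40),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))
condition <- rep(c("control", "treated"), each = 3)

# Strategy 1: keep genes whose total read count reaches a threshold.
total_threshold <- 10  # hypothetical value
keep_total <- rowSums(counts) >= total_threshold

# Strategy 2: drop genes with zero count in at least one condition.
zero_in_cond <- sapply(unique(condition), function(cd) {
  rowSums(counts[, condition == cd, drop = FALSE]) == 0
})
keep_nonzero <- !apply(zero_in_cond, 1, any)

# Strategy 3: CPM-based rule as stated above. A gene is dropped when the
# number of samples with a low CPM exceeds the smallest group size.
cpm <- t(t(counts) / colSums(counts)) * 1e6
cpm_threshold <- 1e4  # hypothetical value, adapted to the toy library sizes
n_low <- rowSums(cpm < cpm_threshold)
keep_cpm <- n_low <= min(table(condition))

print(data.frame(keep_total, keep_nonzero, keep_cpm))
```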


(38)

Part III: Normalization


(39)

Purpose of normalization

identify and correct technical biases (due to sequencing process) to make counts comparable

types of normalization: within sample normalization and between sample normalization

(40)

Within sample normalization

Example: (read counts)

          sample 1   sample 2   sample 3
gene A       752        615       1203
gene B      1507       1225       2455

Counts for gene B are twice as large as counts for gene A because:

(41)

Within sample normalization

Counts for gene B may be twice as large as counts for gene A because gene B is expressed with twice as many transcripts as gene A.

[Figure: gene A, gene B]

(42)

Within sample normalization

Counts for gene B may also be twice as large as counts for gene A because both genes are expressed with the same number of transcripts but gene B is twice as long as gene A.

[Figure: gene A, gene B]

(43)

Within sample normalization

Purpose of within-sample normalization: enabling comparisons between genes from the same sample

Sources of variability: gene length, sequence composition (GC content)

These differences do not need to be corrected for a differential analysis and are not really relevant for data interpretation.

(44)

Between sample normalization

Example: (read counts)

          sample 1   sample 2   sample 3
gene A       752        615       1203
gene B      1507       1225       2455

Counts in sample 3 are much larger than counts in sample 2 because:

(45)

Between sample normalization

Counts in sample 3 may be much larger than counts in sample 2 because gene A is more expressed in sample 3 than in sample 2.

[Figure: gene A in sample 2, gene A in sample 3]

(46)

Between sample normalization

Counts in sample 3 may also be much larger than counts in sample 2 because gene A is expressed similarly in the two samples but the sequencing depth is larger in sample 3 than in sample 2 (i.e., differences in library sizes).

[Figure: gene A in sample 2, gene A in sample 3]

(47)

Between sample normalization

Purpose of between-sample normalization: enabling comparisons of a gene between different samples

Sources of variability: library size, ...

These differences must be corrected for a differential analysis and for data interpretation.

(48)

Principles for sequencing depth normalization

Basics

1 choose an appropriate baseline for each sample

2 for a given gene, compare counts relative to the baseline rather than raw counts


(49)

Principles for sequencing depth normalization

In practice: raw counts correspond to different sequencing depths

(samples in columns, control samples followed by treated samples)

Gene 1       5      1      0      0      4      0      0
Gene 2       0      2      1      2      1      0      0
Gene 3      92    161     76     70    140     88     70
...
Gene G      15     25      9      5     20     14     17

(50)

Principles for sequencing depth normalization

In practice: a multiplicative correction factor $C_j$ is computed for every sample

C_j (one value per sample): 1.1   1.6   0.6   0.7   1.4   0.7   0.8

(51)

Principles for sequencing depth normalization

In practice: every count is multiplied by the correction factor corresponding to its sample

Gene 3 (raw)            92     161      76      70     140      88      70
C_j                    1.1     1.6     0.6     0.7     1.4     0.7     0.8
Gene 3 (normalized)  101.2   257.6    45.6      49     196    61.6      56

(52)

Principles for sequencing depth normalization

Consequences: library sizes for normalized counts are roughly equal.

(samples in columns, control samples followed by treated samples)

Gene 1               5.5     1.6       0       0     5.6       0       0
Gene 2                 0     3.2     0.6     1.4     1.4       0       0
Gene 3             101.2   257.6    45.6      49     196    61.6      56
...
Gene G              16.5      40     5.4     5.5      28     9.8    13.6
Lib. size (×10^5)   13.1    13.0    13.2    13.1    13.2    13.0    13.1

(53)

Principles for sequencing depth normalization

Definition

If $K_{gj}$ is the raw count for gene $g$ in sample $j$, then the normalized count is defined as:

$$\tilde{K}_{gj} = \frac{K_{gj}}{s_j \times D_j} \times 10^6$$

in which $D_j = \sum_g K_{gj}$ is the library size of sample $j$ and $s_j$ is the correction factor of the library size for sample $j$; thus $C_j = \frac{10^6}{s_j D_j}$.

Three types of methods:

distribution adjustment

method taking length into account

the “effective library size” concept
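The definition above translates into a few lines of base R. The function name `normalize_counts` and the toy matrix are hypothetical; with the default $s_j = 1$ the result is simply the counts per million.

```r
# Normalized counts K_gj / (s_j * D_j) * 1e6, following the definition above.
# 's' holds the correction factors s_j (one per sample, 1 by default).
normalize_counts <- function(counts, s = rep(1, ncol(counts))) {
  lib_size <- colSums(counts)     # D_j
  c_j <- 1e6 / (s * lib_size)     # C_j, one multiplicative factor per sample
  sweep(counts, 2, c_j, FUN = "*")
}

# Toy counts: 3 genes, 2 samples with library sizes 100 and 200.
counts <- matrix(c(5, 0, 95, 10, 10, 180), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))
norm <- normalize_counts(counts)
print(norm)

# With s_j = 1, every normalized library size equals 1e6.
print(colSums(norm))
```

Passing other correction factors in `s` gives the corresponding normalized counts without changing the rest of the computation.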


(55)

Distribution adjustment

Total read count adjustment [Mortazavi et al., 2008]

$s_j = 1$ and thus:

$$\tilde{K}_{gj} = \frac{K_{gj}}{D_j} \times 10^6$$

(Counts Per Million).

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2; panels: raw counts, normalized counts]

edgeR:

cpm(..., normalized.lib.sizes = TRUE)

(56)

Distribution adjustment

(Upper) Quartile normalization [Bullard et al., 2010]

$$s_j = \frac{Q_j^{(p)}}{\frac{1}{N} \sum_{l=1}^{N} Q_l^{(p)}}$$

in which $N$ is the number of samples and $Q_j^{(p)}$ is a given quantile (generally the 3rd quartile) of the count distribution in sample $j$.

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2; panels: raw counts, normalized counts]

edgeR:

calcNormFactors(..., method = "upperquartile", p = 0.75)
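The upper quartile factor can be computed by hand as a check of the formula. This sketch follows the formula as written on the slide and is not the edgeR implementation; the function name and the toy counts are hypothetical, and the quantile is taken over all genes (an assumption, since implementations may exclude genes with zero counts).

```r
# s_j = Q_j(p) / mean over samples of Q_l(p), as in the formula above.
upper_quartile_factors <- function(counts, p = 0.75) {
  q <- apply(counts, 2, quantile, probs = p)  # Q_j(p) for every sample
  q / mean(q)
}

# Toy counts: the second sample is sequenced twice as deeply as the first.
counts <- cbind(s1 = 1:9, s2 = 2 * (1:9))
s <- upper_quartile_factors(counts)
print(s)
```

By construction the factors average to 1, so they only redistribute the library sizes between samples.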

(57)

Method using gene lengths (intra & inter sample normalization)

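RPKM: Reads Per Kilobase per Million mapped Reads

Assumptions: read counts are proportional to expression level, transcript length and sequencing depth

$$s_j = \frac{D_j L_g}{10^3 \times 10^6}$$

in which $L_g$ is the gene length (bp). Here the factor depends on both the gene and the sample and divides the raw count directly: $\mathrm{RPKM}_{gj} = \frac{K_{gj}}{s_j} = \frac{K_{gj}}{D_j L_g} \times 10^3 \times 10^6$.

edgeR:

rpkm(..., gene.length = ...)

Unbiased estimation of the number of reads, but it affects the variability [Oshlack and Wakefield, 2009].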
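The RPKM computation can be written out in base R to make the two scalings explicit (per million reads, then per kilobase). The function name `rpkm_base`, the toy counts and the gene lengths are hypothetical.

```r
# RPKM: counts scaled by library size (in millions) and gene length (in kb).
rpkm_base <- function(counts, gene_length_bp) {
  lib_size <- colSums(counts)                                   # D_j
  per_million <- sweep(counts, 2, lib_size / 1e6, FUN = "/")
  # Division by a vector of length nrow(counts) recycles along rows,
  # so each gene is divided by its own length in kilobases.
  per_million / (gene_length_bp / 1e3)
}

# Toy data: 2 genes of length 1000 bp and 2000 bp, 2 samples.
counts <- matrix(c(100, 900, 500, 1500), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
gene_length_bp <- c(1000, 2000)
rpkm_values <- rpkm_base(counts, gene_length_bp)
print(rpkm_values)
```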

(58)

Relative Log Expression (RLE)

[Anders and Huber, 2010], edgeR - DESeq - DESeq2

Method:

1 compute a pseudo-reference sample: geometric mean across samples

$$R_g = \left(\prod_{j=1}^{N} K_{gj}\right)^{1/N}$$

(the geometric mean is less sensitive to extreme values than the standard mean)

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2]

2 center samples compared to the reference

3 compute normalization factor: median of centered counts over the genes

(59)

Relative Log Expression (RLE)

2 center samples compared to the reference:

$$\tilde{K}_{gj} = \frac{K_{gj}}{R_g} \quad \text{with} \quad R_g = \left(\prod_{j=1}^{N} K_{gj}\right)^{1/N}$$

[Figure: count / geo. mean against rank(mean) gene expression for Sample 1 and Sample 2]

(60)

Relative Log Expression (RLE)

3 compute normalization factor: median of centered counts over the genes

$$\tilde{s}_j = \operatorname{median}_g \left\{ \tilde{K}_{gj} \right\}$$

The factors are then rescaled so that they multiply to 1:

$$s_j = \frac{\tilde{s}_j}{\exp\left(\frac{1}{N} \sum_{l=1}^{N} \log(\tilde{s}_l)\right)}$$

with $\tilde{K}_{gj} = \frac{K_{gj}}{R_g}$ and $R_g = \left(\prod_{j=1}^{N} K_{gj}\right)^{1/N}$.

[Figure: count / geo. mean against rank(mean) gene expression for Sample 1 and Sample 2]

(61)

Relative Log Expression (RLE)

[Figure: log2(count + 1) against rank(mean) gene expression for Sample 1 and Sample 2]

## with edgeR

calcNormFactors(..., method = "RLE")

## with DESeq

estimateSizeFactors(...)
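The three RLE steps can be reproduced in base R, which helps to see what the packaged functions compute. This is a sketch, not the edgeR or DESeq code: the function name `rle_factors` is hypothetical, genes with a zero count in any sample are skipped (an assumption, needed because their geometric mean is zero), and the median is taken on the log scale.

```r
rle_factors <- function(counts) {
  log_counts <- log(counts)
  # Step 1: log of the pseudo-reference R_g (geometric mean across samples).
  log_ref <- rowMeans(log_counts)
  ok <- is.finite(log_ref)  # drops genes with at least one zero count
  # Steps 2 and 3: center on the reference, then take the median over genes.
  s_tilde <- apply(log_counts[ok, , drop = FALSE], 2,
                   function(x) exp(median(x - log_ref[ok])))
  # Rescale so that the factors multiply to 1.
  s_tilde / exp(mean(log(s_tilde)))
}

# Toy counts: sample s2 is sequenced twice as deeply as s1;
# the last gene has a zero count and is ignored.
counts <- cbind(s1 = c(10, 20, 40, 0), s2 = c(20, 40, 80, 5))
s <- rle_factors(counts)
print(s)
```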

(62)

Trimmed Mean of M-values (TMM)

[Robinson and Oshlack, 2010], edgeR

Assumptions behind the method

the total read count strongly depends on a few highly expressed genes

most genes are not differentially expressed

(63)

Trimmed Mean of M-values (TMM)

⇒ remove extreme data for fold-change (M) and average intensity (A):

$$M_g(j, r) = \log_2\left(\frac{K_{gj}}{D_j}\right) - \log_2\left(\frac{K_{gr}}{D_r}\right)$$

$$A_g(j, r) = \frac{1}{2}\left[\log_2\left(\frac{K_{gj}}{D_j}\right) + \log_2\left(\frac{K_{gr}}{D_r}\right)\right]$$

select as reference sample the sample $r$ with the upper quartile closest to the average upper quartile

[Figure: M- vs A-values, M(j,r) against A(j,r); trimmed values: none]

(64)

Trimmed Mean of M-values (TMM)

Trim 30% on M-values

[Figure: M(j,r) against A(j,r); trimmed values: 30% of Mg]

(65)

Trimmed Mean of M-values (TMM)

Trim 5% on A-values

[Figure: M(j,r) against A(j,r); trimmed values: 30% of Mg, 5% of Ag]

(66)

Trimmed Mean of M-values (TMM)

On remaining data, compute the weighted mean of M-values:

$$\mathrm{TMM}(j, r) = \frac{\sum_{g:\,\text{not trimmed}} w_g(j, r)\, M_g(j, r)}{\sum_{g:\,\text{not trimmed}} w_g(j, r)}$$

with

$$w_g(j, r) = \frac{D_j - K_{gj}}{D_j K_{gj}} + \frac{D_r - K_{gr}}{D_r K_{gr}}.$$
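A simplified version of the TMM computation for one sample against a reference can be sketched in base R. Several choices here are assumptions and not taken from the slides: the function name `tmm_log2`, the removal of genes with a zero count, the application of the trimming fractions at each tail of the M and A distributions, and the use of the inverse of the variance-like expression above as weight (precision weighting). It is a teaching sketch, not the edgeR implementation.

```r
tmm_log2 <- function(k_j, k_r, trim_m = 0.30, trim_a = 0.05) {
  d_j <- sum(k_j)  # library size of sample j
  d_r <- sum(k_r)  # library size of the reference sample r
  ok <- k_j > 0 & k_r > 0  # M and A are undefined for zero counts
  k_j <- k_j[ok]
  k_r <- k_r[ok]
  m <- log2(k_j / d_j) - log2(k_r / d_r)
  a <- (log2(k_j / d_j) + log2(k_r / d_r)) / 2
  # Keep the genes inside the central part of both distributions.
  keep <- m >= quantile(m, trim_m) & m <= quantile(m, 1 - trim_m) &
    a >= quantile(a, trim_a) & a <= quantile(a, 1 - trim_a)
  # Variance-like term from the slide; its inverse is used as weight here.
  v <- (d_j - k_j) / (d_j * k_j) + (d_r - k_r) / (d_r * k_r)
  w <- 1 / v[keep]
  sum(w * m[keep]) / sum(w)
}

# Toy check: sample j has exactly twice the counts of the reference, so the
# proportions are identical, all M-values are 0 and the weighted mean is 0.
k_r <- seq(10, 100, by = 10)
tmm_same <- tmm_log2(2 * k_r, k_r)
print(tmm_same)
```

Since the M-values are log2 ratios, the weighted mean is on the log2 scale.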
