On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments

Academic year: 2022

Supplementary results: On the utility of RNA sample pooling to optimize cost and statistical power in RNA sequencing experiments

Alemu Takele Assefa, Jo Vandesompele, and Olivier Thas

October 28, 2019

Contents

1 Supplementary results 1: theoretical results
1.1 Proofs for the mean and variance of $Y_k$
1.2 Estimation of the relative abundance and log-fold-change
1.3 Power calculation
2 Supplementary results 2: sample pooling results using the Zhang data
3 Supplementary results 3: sample pooling results using the NGP nutlin data
4 Supplementary results 4: additional results
References

1 Supplementary results 1: theoretical results

1.1 Proofs for the mean and variance of $Y_k$

We present here the proof of the expression for $\mathrm{Var}\{Y_k\}$ in equation (3) of the manuscript. We first derive the expressions analytically and then confirm them empirically using Monte-Carlo simulations. As a reminder, $n$ is the number of biological samples, $m$ is the number of pools, $q$ is the number of biological samples per pool ($q = n/m$), $U_j$ is the read count of a given gene in biological sample $j = 1, 2, \ldots, n$, $Y_k$ is the gene expression level in pool $k = 1, 2, \ldots, m$, $W_{jk}$ denotes the mixing weight of biological sample $j$ in pool $k$, and $A_{jk}$ is an indicator equal to 1 if biological sample $j$ is in pool $k$ and 0 otherwise.

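Proof. The data generating model in (1) is conditional on the pool size $q$, which is assumed to be fixed. This implies that $A_{jk}$ is subject to the additional constraint $\sum_{j=1}^{n} A_{jk} = q$, which affects the variance calculation.

Without the constraint, if we let $Q = \sum_{j=1}^{n} A_{jk} \in \{0, 1, \ldots, n\}$, then $Q \sim \mathrm{Binomial}(n, 1/m)$. Similarly, let $Q^{(j)} = \sum_{i \neq j} A_{ik} \in \{0, 1, \ldots, n-1\}$, then $Q^{(j)} \sim \mathrm{Binomial}(n-1, 1/m)$. Because $P(Q = q \mid A_{jk} = 1) = P(Q^{(j)} = q-1)$, it follows that

$$P(A_{jk}=1 \mid Q=q) = \frac{P(Q=q \mid A_{jk}=1)\,P(A_{jk}=1)}{P(Q=q)} = \frac{P(Q^{(j)}=q-1)\,\frac{1}{m}}{P(Q=q)} = \frac{\binom{n-1}{q-1}\left(\frac{1}{m}\right)^{q-1}\left(1-\frac{1}{m}\right)^{n-q}\frac{1}{m}}{\binom{n}{q}\left(\frac{1}{m}\right)^{q}\left(1-\frac{1}{m}\right)^{n-q}} = \frac{q}{n}.$$

Consequently,

$$E\{A_{jk} \mid Q=q\} = P(A_{jk}=1 \mid Q=q) = \frac{q}{n},$$

$$\mathrm{Var}\{A_{jk} \mid Q=q\} = \big(1 - P(A_{jk}=1 \mid Q=q)\big)\,P(A_{jk}=1 \mid Q=q) = \frac{q(n-q)}{n^2}.$$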

The same result can also be obtained by recasting the problem as an $m \times n$ contingency table with fixed row and column totals (the same constraints as here). That is, if the pools ($k = 1, 2, \ldots, m$) are in the rows and the biological samples ($j = 1, 2, \ldots, n$) are in the columns, then all row totals equal $q$, all column totals equal 1, and the $A_{jk}$ are the $(k, j)$th elements of the table. This setting gives exactly the same mean and variance of $A_{jk}$.

In addition, under the assumption that the pooling weights $W_k \sim \mathrm{Dirichlet}(1, 1, \ldots, 1)$ in a given pool $k$, for a biological sample $j$ in that pool (with $A_{jk} = 1$), $E\{W_{jk}\} = 1/q$ and $\mathrm{Var}\{W_{jk}\} = \frac{q-1}{q^2(q+1)}$.
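The moments of $A_{jk}$ and $W_{jk}$ stated above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a pool is formed by drawing $q$ of the $n$ samples without replacement and that weights follow a flat Dirichlet of dimension $q$. The function names and the choices `n_rep` and `seed` are illustrative.

```python
import numpy as np

def theoretical_pool_moments(n, q):
    """Closed-form moments of A_jk (given Q = q) and W_jk as stated in the text."""
    return {
        "E_A": q / n,
        "Var_A": q * (n - q) / n**2,
        "E_W": 1.0 / q,
        "Var_W": (q - 1) / (q**2 * (q + 1)),
    }

def simulate_pool_moments(n, q, n_rep=20000, seed=1):
    """Simulate membership of sample 0 in a pool of fixed size q, and one Dirichlet weight."""
    rng = np.random.default_rng(seed)
    # A_jk: draw q of the n samples without replacement and record whether sample 0 is included.
    a = np.array(
        [0 in rng.choice(n, size=q, replace=False) for _ in range(n_rep)],
        dtype=float,
    )
    # W_jk: first component of a Dirichlet(1, ..., 1) vector of length q.
    w = rng.dirichlet(np.ones(q), size=n_rep)[:, 0]
    return {"E_A": a.mean(), "Var_A": a.var(), "E_W": w.mean(), "Var_W": w.var()}

if __name__ == "__main__":
    print(theoretical_pool_moments(12, 3))
    print(simulate_pool_moments(12, 3))
```

With enough replicates the simulated values should sit close to $q/n$, $q(n-q)/n^2$, $1/q$ and $(q-1)/(q^2(q+1))$.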

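Therefore, based on these moments of $A_{jk}$ and $W_{jk}$,

$$E\{Y_k\} = \sum_{j=1}^{n} E\{A_{jk}\}\,E\{W_{jk}\}\,E\{U_j\} + E\{\varepsilon_k\} = \sum_{j=1}^{n} \frac{q}{n}\cdot\frac{1}{q}\,\mu_j = \frac{1}{n}\sum_{j=1}^{n}\mu_j.$$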

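For the variance, we use the result that for $n$ independent random variables $X_1, X_2, \ldots, X_n$,

$$\mathrm{Var}\Big\{\prod_{i=1}^{n} X_i\Big\} = \prod_{i=1}^{n}\big(\mathrm{Var}\{X_i\} + E\{X_i\}^2\big) - \prod_{i=1}^{n} E\{X_i\}^2,$$

and find

$$\begin{aligned}
\mathrm{Var}\{Y_k\} &= \sum_{j=1}^{n} \mathrm{Var}\{A_{jk}\times W_{jk}\times U_j\} + \mathrm{Var}\{\varepsilon_k\} \\
&= \sum_{j=1}^{n} \Big[\big\{\mathrm{Var}\{A_{jk}\} + E\{A_{jk}\}^2\big\}\big\{\mathrm{Var}\{W_{jk}\} + E\{W_{jk}\}^2\big\}\big\{\mathrm{Var}\{U_j\} + E\{U_j\}^2\big\} - E\{A_{jk}\}^2 E\{W_{jk}\}^2 E\{U_j\}^2\Big] + \sigma^2 \\
&= \sum_{j=1}^{n} \left[\left(\frac{q(n-q)}{n^2} + \frac{q^2}{n^2}\right)\left(\frac{q-1}{q^2(q+1)} + \frac{1}{q^2}\right)\big(\sigma_j^2 + \mu_j^2\big) - \frac{q^2}{n^2}\,\frac{1}{q^2}\,\mu_j^2\right] + \sigma^2 \\
&= \frac{2}{n(q+1)}\sum_{j=1}^{n}\big(\sigma_j^2 + \mu_j^2\big) - \frac{1}{n^2}\sum_{j=1}^{n}\mu_j^2 + \sigma^2.
\end{aligned}$$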

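If $U_j \sim \text{Negative Binomial}(\mu_j, \phi)$, where $\mu_j = \rho L_{0j}$, $\rho$ is the relative abundance, $L_{0j}$ is the library size in biological sample $j$ (virtual library size), and $\phi$ is the over-dispersion parameter, then $\mathrm{Var}\{U_j\} = \sigma_j^2 = \mu_j + \phi\mu_j^2$. Therefore, the mean and variance of $Y_k$ become

$$E\{Y_k\} = \frac{1}{n}\sum_{j=1}^{n}\mu_j, \tag{1}$$

$$\mathrm{Var}\{Y_k\} = \frac{2}{n(q+1)}\sum_{j=1}^{n}\big(\mu_j + (\phi+1)\mu_j^2\big) - \frac{1}{n^2}\sum_{j=1}^{n}\mu_j^2 + \sigma^2. \tag{2}$$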

To verify the mean and variance of $Y_k$ (and hence the coefficient of variation) given by the expressions in (1) and (2), we set up a Monte-Carlo (MC) simulation with 2000 runs. In a given MC run $i$, $i = 1, 2, \ldots, 2000$, we generate $n = 60$ read counts from a negative binomial distribution, $U_j^{(i)} \sim \mathrm{NB}(\mu_j, \phi)$, and subsequently generate $Y_k^{(i)}$ using the data generating model (see equation (1) in the manuscript) for a pool size $q$. In a single MC run $i$, the mean and variance of $Y_k^{(i)}$ are estimated by $\bar{Y}_i = m^{-1}\sum_{k=1}^{m} Y_k^{(i)}$ and $S_i^2 = (m-1)^{-1}\sum_{k=1}^{m}\big(Y_k^{(i)} - \bar{Y}_i\big)^2$, respectively. Afterwards, $E\{Y_k\} \approx 2000^{-1}\sum_{i=1}^{2000}\bar{Y}_i$ and $\mathrm{Var}\{Y_k\} \approx 2000^{-1}\sum_{i=1}^{2000} S_i^2$. Different choices of $\mu_j$, $\phi$, and $q$ were considered. The results in Figure S1 show that the expressions in (1) and (2) agree with their corresponding MC approximations. This confirms that the expressions in (1) and (2) describe the mean and variance of $Y_k$, respectively.
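The simulation described above can be sketched as follows. This is an illustrative reimplementation, not the authors' script, and it rests on several assumptions that are not stated in this supplement: pools are a random partition of the $n$ samples into $m = n/q$ groups, the weights within a pool are Dirichlet(1, ..., 1), the technical noise $\varepsilon_k$ is Gaussian with variance $\sigma^2$ (zero by default), and $q \ge 2$. The function names are hypothetical.

```python
import numpy as np

def theoretical_moments(mu, phi, q, sigma2=0.0):
    """Mean and variance of Y_k according to expressions (1) and (2)."""
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    mean = mu.sum() / n
    var = (2.0 / (n * (q + 1))) * np.sum(mu + (phi + 1.0) * mu**2) \
        - np.sum(mu**2) / n**2 + sigma2
    return mean, var

def mc_moments(mu, phi, q, n_runs=2000, sigma2=0.0, seed=0):
    """Monte-Carlo averages of the per-run sample mean and sample variance of Y_k."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    if n % q != 0:
        raise ValueError("n must be a multiple of the pool size q")
    m = n // q
    # NumPy's negative binomial uses (size, p); size = 1/phi gives Var = mu + phi * mu^2.
    size = 1.0 / phi
    p = size / (size + mu)
    ybar = np.empty(n_runs)
    s2 = np.empty(n_runs)
    for i in range(n_runs):
        u = rng.negative_binomial(size, p)           # read counts U_j
        pools = rng.permutation(n).reshape(m, q)     # random partition into m pools of size q
        w = rng.dirichlet(np.ones(q), size=m)        # mixing weights within each pool
        eps = rng.normal(0.0, np.sqrt(sigma2), size=m)  # assumed Gaussian technical noise
        y = np.sum(w * u[pools], axis=1) + eps       # pooled expression Y_k
        ybar[i] = y.mean()
        s2[i] = y.var(ddof=1)
    return ybar.mean(), s2.mean()

if __name__ == "__main__":
    # Gamma(1, 0.1) is read here as shape 1 and rate 0.1 (scale 10); this is an assumption.
    mu = np.random.default_rng(42).gamma(shape=1.0, scale=10.0, size=60)
    for q in (2, 3, 5, 10):
        print(q, theoretical_moments(mu, 0.5, q), mc_moments(mu, 0.5, q, n_runs=500))
```

Printing the two pairs side by side for several pool sizes gives a table comparable in spirit to the panels of Figure S1.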

[Figure S1: six panels plotting $\mu_{Y_k}$ and $\sigma^2_{Y_k}$ against pool size $q \in \{2, 3, 4, 5, 6, 10, 12, 15, 20, 30\}$, for $\mu_j \sim \Gamma(1, 0.1)$ and $\phi \in \{0.5, 2, 5\}$; legend: MC approximation, Theoretical.]

Figure S1: The Monte-Carlo and the analytical estimates of the mean ($\mu_{Y_k}$) and variance ($\sigma^2_{Y_k}$) of $Y_k$ at different pool sizes ($q$). The solid red line indicates the analytical estimates, whereas the boxplots show the distribution of the sample estimates across Monte-Carlo simulations at each $q$. The Monte-Carlo estimates are the averages across simulations at each $q$ and are indicated by solid black points on each boxplot.

1.2 Estimation of the relative abundance and log-fold-change

The moment estimator of the relative abundance $\rho$ of a particular gene, based on the gene expressions $U_j$ from the $n$ individual biological samples, is given by

$$\hat\rho_{\text{std}} = \frac{\sum_{j=1}^{n} U_j}{L_0}, \tag{3}$$

where $L_0 = \sum_{j=1}^{n} L_{0j}$ is the total virtual library size, and $L_{0j}$ is the virtual library size in biological sample $j$. Similarly, we can derive the moment estimator of $\rho$ based on gene expressions from the pooled samples $Y_k$, $k = 1, 2, \ldots, m$, starting from the sample mean of $Y_k$. That is, $\bar{Y} = \frac{1}{m}\sum_{k=1}^{m} Y_k \Rightarrow E\{\bar{Y}\} = \frac{1}{n}\rho L_0$, and hence

$$\hat\rho_{\text{pool}} = \frac{q}{L_0}\sum_{k=1}^{m} Y_k. \tag{4}$$

Note that $E\{\sum_{k=1}^{m} L_k\} = L_0/q$. Consequently, the estimator in equation (4) can be rewritten as $\hat\rho_{\text{pool}} = \sum_{k=1}^{m} Y_k \,/\, E\{\sum_{k=1}^{m} L_k\}$. We now compare the expectation and variance of the estimators from the standard experiment (3) and the pooled experiment (4), writing $\hat\rho_{\text{std}}$ and $\hat\rho_{\text{pool}}$ for the estimates of the relative abundance from the standard and the pooled experiment, respectively. Since we have used moment estimators of $\rho$ in both settings, it immediately follows that $E\{\hat\rho_{\text{std}}\} = E\{\hat\rho_{\text{pool}}\} = \rho$.

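It can be shown that

$$\mathrm{Var}\{\hat\rho_{\text{std}}\} = \frac{1}{L_0^2}\sum_{j=1}^{n}\sigma_j^2. \tag{5}$$

Using equation (3) of the main manuscript and $m = n/q$, it follows that

$$\begin{aligned}
\mathrm{Var}\{\hat\rho_{\text{pool}}\} &= \frac{q^2}{L_0^2}\sum_{k=1}^{m}\mathrm{Var}\{Y_k\} \\
&= \frac{2q}{q+1}\mathrm{Var}\{\hat\rho_{\text{std}}\} + \frac{2nq - q(q+1)}{n(q+1)L_0^2}\sum_{j=1}^{n}\mu_j^2 + \frac{nq}{L_0^2}\sigma^2.
\end{aligned} \tag{6}$$

The last two terms of (6) are non-negative (since $q + 1 \le 2n$) and nearly 0 (division by the very large number $L_0^2$), and have a negligible contribution to $\mathrm{Var}\{\hat\rho_{\text{pool}}\}$. Consequently, we find

$$\frac{\mathrm{Var}\{\hat\rho_{\text{pool}}\}}{\mathrm{Var}\{\hat\rho_{\text{std}}\}} \ge \frac{2q}{q+1}. \tag{7}$$

The expression in (7) implies that pooling leads to an estimate of the relative abundance that is at least $2q/(q+1)$ times more variable than the estimate we can obtain without pooling.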
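To make the bound in (7) concrete, the variances in (5) and (6) can be evaluated for given $\mu_j$, $\phi$ and $q$. The sketch below is illustrative only: the function names are hypothetical, and the example values (60 samples, library size $20\times10^6$, $\rho = 10^{-6}$, $\phi = 0.5$) are assumptions chosen for the demonstration.

```python
import numpy as np

def var_rho_std(mu, phi, L0):
    """Expression (5): variance of the unpooled estimator under the NB model."""
    mu = np.asarray(mu, dtype=float)
    return np.sum(mu + phi * mu**2) / L0**2

def var_rho_pool(mu, phi, L0, q, sigma2=0.0):
    """First line of (6): (q^2 / L0^2) * m * Var{Y_k}, with Var{Y_k} from expression (2)."""
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    m = n / q
    var_yk = (2.0 / (n * (q + 1))) * np.sum(mu + (phi + 1.0) * mu**2) \
        - np.sum(mu**2) / n**2 + sigma2
    return q**2 / L0**2 * m * var_yk

def var_rho_pool_closed_form(mu, phi, L0, q, sigma2=0.0):
    """Second line of (6), written in terms of Var{rho_hat_std}."""
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    return (2.0 * q / (q + 1)) * var_rho_std(mu, phi, L0) \
        + (2.0 * n * q - q * (q + 1)) / (n * (q + 1) * L0**2) * np.sum(mu**2) \
        + n * q * sigma2 / L0**2

if __name__ == "__main__":
    lib = np.full(60, 20e6)        # assumed equal library sizes
    mu = 1e-6 * lib                # mu_j = rho * L_0j with rho = 1e-6
    L0 = lib.sum()
    for q in (2, 3, 4, 6):
        ratio = var_rho_pool(mu, 0.5, L0, q) / var_rho_std(mu, 0.5, L0)
        print(q, ratio, 2 * q / (q + 1))
```

Printing the ratio next to $2q/(q+1)$ shows how far above the lower bound a given configuration sits; the gap depends on $\sum_j \mu_j^2$ relative to $\sum_j \sigma_j^2$.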

In DGE analysis, one essential statistic is the estimate of the biological effect (effect size). In many parametric methods, the log-fold-change (LFC) is commonly used to quantify the biological effect size. We therefore compare the LFC estimates from the standard and the pooled experiments. For testing DGE between two independent groups, the LFC of a particular gene is defined as $\theta = \log(\rho_2/\rho_1)$, where $\rho_k$ is the relative abundance in group $k \in \{1, 2\}$. The estimate of $\theta$ is $\hat\theta_{\text{std}} = \log(\hat\rho_{2,\text{std}}/\hat\rho_{1,\text{std}})$ for the standard experiment and $\hat\theta_{\text{pool}} = \log(\hat\rho_{2,\text{pool}}/\hat\rho_{1,\text{pool}})$ for the pooled experiment.

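Using a first-order Taylor expansion (the delta method), we can approximate the variances of the LFC estimates from the standard and the pooled experiments as

$$\mathrm{Var}\{\hat\theta_{\text{std}}\} \approx \frac{\mathrm{Var}\{\hat\rho_{2,\text{std}}\}}{\rho_2^2} + \frac{\mathrm{Var}\{\hat\rho_{1,\text{std}}\}}{\rho_1^2}, \tag{8}$$

and

$$\mathrm{Var}\{\hat\theta_{\text{pool}}\} \approx \frac{\mathrm{Var}\{\hat\rho_{2,\text{pool}}\}}{\rho_2^2} + \frac{\mathrm{Var}\{\hat\rho_{1,\text{pool}}\}}{\rho_1^2}. \tag{9}$$

Therefore, it follows that

$$\frac{\mathrm{Var}\{\hat\theta_{\text{pool}}\}}{\mathrm{Var}\{\hat\theta_{\text{std}}\}} \ge \frac{2q}{q+1}. \tag{10}$$

This also indicates that the LFC estimate from pooled experiments is at least $2q/(q+1)$ times more variable than that of the standard experiment. This is an important characteristic that affects the statistical power of a DGE test, as shown in the next section.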

1.3 Power calculation

Assume there is no pooling and we want to test for DGE between two independent groups of biological samples. Let $U_{jk}$ denote the read count in biological sample $j = 1, 2, \ldots, n_k$ of group $k \in \{1, 2\}$. Again we assume that $U_{jk} \sim \text{Negative Binomial}(\mu_{jk}, \phi)$, where $\phi$ is the over-dispersion parameter (assumed to be constant for all samples and all groups), and $\mu_{jk} = E\{U_{jk}\} = \rho_k L_{0jk}$, where $\rho_k$ is the relative abundance in group $k$ and $L_{0jk}$ is the library size of biological sample $j$ in group $k$. Let $A_{jk}$ be the group label of $U_{jk}$, such that $A_{jk} = 0$ if $k = 1$ and $A_{jk} = 1$ if $k = 2$. $n_k$ denotes the number of biological samples in group $k$, with $n = n_1 + n_2$. We want to test the null hypothesis $H_0: \rho_1 = \rho_2$ against the alternative $H_A: \rho_1 \neq \rho_2$ at the $\alpha$ level of significance. In this section, we calculate the statistical power of testing this hypothesis based on the method discussed in Zhu and Lakkis (2014).

We can fit the following negative binomial regression with $L_{0jk}$ as offset,

$$\log\mu_{jk} = \log\{\rho_k L_{0jk}\} = \beta_0 + \beta_1 A_{jk} + \log L_{0jk}, \tag{11}$$

where $\beta_0$ is the intercept and $\beta_1$ is the coefficient of the factor $A$. In this model, the parameter $\beta_1$ represents the LFC between the two groups, that is, $\beta_1 = \log(\rho_2/\rho_1)$. This means $\beta_1$ is equivalent to the LFC parameter introduced earlier as $\theta$. Therefore, we can rewrite the hypothesis of DE as $H_0: \beta_1 = 0$ against the alternative $H_A: \beta_1 \neq 0$.

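If $\hat\beta_1$ is the maximum-likelihood estimator of $\beta_1$ (under $H_A$), then the variance of $\hat\beta_1$ is given by

$$\mathrm{Var}\{\hat\beta_1\} = \frac{1}{n_1}\left[\frac{1}{\bar{L}_0}\left(\frac{1}{\rho_1} + \frac{1}{R\rho_2}\right) + \frac{(1+R)\phi}{R}\right] = \frac{1}{n_1}V_A, \tag{12}$$

where $R = n_2/n_1$ and $\bar{L}_0 = n^{-1}L_0$ (the mean library size across all biological samples). It is also easy to show that $\mathrm{Var}\{\hat\beta_1\}$ equals the variance of the LFC estimate from the standard experiment, shown in (8). Under the null hypothesis,

$$\mathrm{Var}\{\hat\beta_1\} = \frac{1}{n_1}\left[\frac{1}{\bar{L}_0}\left(\frac{1}{\tilde\rho_1} + \frac{1}{R\tilde\rho_2}\right) + \frac{(1+R)\phi}{R}\right] = \frac{1}{n_1}V_0, \tag{13}$$

where $\tilde\rho_1$ and $\tilde\rho_2$ are the true relative abundances under $H_0$, such that $\tilde\rho_1 = \tilde\rho_2 = \rho_1$.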

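Recall that our objective is to determine the power of testing the above hypothesis using the pooled experiment. Therefore, let $\hat\beta_{1,\text{pool}}$ be the estimate of $\beta_1$ using the gene expression data from the pooled samples. $\hat\beta_{1,\text{pool}}$ is the equivalent LFC in the pooled experiment, which was denoted by $\hat\theta_{\text{pool}}$ earlier, i.e. $\hat\beta_{1,\text{pool}} = \hat\theta_{\text{pool}}$. In (10), we established that $\mathrm{Var}\{\hat\theta_{\text{pool}}\} \ge \frac{2q}{q+1}\mathrm{Var}\{\hat\theta_{\text{std}}\}$. Consequently, under the alternative hypothesis $\mathrm{Var}\{\hat\beta_{1,\text{pool}}\} \ge \frac{2q}{q+1}\mathrm{Var}\{\hat\beta_1\}$, and under the null hypothesis $\mathrm{Var}\{\hat\beta_{1,\text{pool}} \mid H_0\} \ge \frac{2q}{q+1}\mathrm{Var}\{\hat\beta_1 \mid H_0\}$.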

Therefore, given the pool size ($q$), the number of RNA samples in groups 1 and 2 ($n_1$ and $n_2$, respectively), the effect size to be detected $\theta$, and the over-dispersion $\phi$, the power of the two-sided likelihood ratio test at significance level $\alpha$ can be calculated as

$$\text{power} \le \Phi\left(\frac{\sqrt{n_1(q+1)}\,|\theta| - Z_{\alpha/2}\sqrt{2qV_0}}{\sqrt{2qV_A}}\right), \tag{14}$$

where $\Phi(\cdot)$ is the cumulative standard normal distribution function, and $Z_{\alpha/2}$ is the $(1-\alpha/2)100\%$ quantile of the standard normal distribution. Note that in pooled experiments, $n_1$ and $n_2$ are the numbers of RNA samples before library preparation.
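The bound in (14), together with $V_A$ and $V_0$ from (12) and (13), can be evaluated with the Python standard library alone. The following is a minimal sketch under stated assumptions: $\theta$ is taken on the natural-log scale, $\tilde\rho_1 = \tilde\rho_2 = \rho_1$ under $H_0$, and the function and argument names are hypothetical rather than taken from the authors' software.

```python
from math import exp, sqrt
from statistics import NormalDist

def variance_terms(rho1, theta, mean_lib, phi, R=1.0):
    """V_0 and V_A from (13) and (12); theta = log(rho2 / rho1) on the natural-log scale."""
    rho2 = rho1 * exp(theta)
    dispersion_term = (1.0 + R) * phi / R
    v_a = (1.0 / mean_lib) * (1.0 / rho1 + 1.0 / (R * rho2)) + dispersion_term
    v_0 = (1.0 / mean_lib) * (1.0 / rho1 + 1.0 / (R * rho1)) + dispersion_term
    return v_0, v_a

def pooled_power_bound(n1, q, theta, rho1, mean_lib, phi, R=1.0, alpha=0.05):
    """Upper bound on power from (14); q = 1 corresponds to the unpooled design."""
    v_0, v_a = variance_terms(rho1, theta, mean_lib, phi, R)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    arg = (sqrt(n1 * (q + 1)) * abs(theta) - z * sqrt(2.0 * q * v_0)) / sqrt(2.0 * q * v_a)
    return NormalDist().cdf(arg)

if __name__ == "__main__":
    # Illustrative setting: 60 RNA samples per group, rho = 1e-6, 20e6 reads per sample.
    for q in (1, 2, 3, 4, 6):
        print(q, pooled_power_bound(n1=60, q=q, theta=0.5, rho1=1e-6, mean_lib=20e6, phi=0.5))
```

Setting `q=1` makes the factor $2q/(q+1)$ equal to 1, so the expression reduces to the unpooled power formula; larger pool sizes give a smaller bound for the same number of RNA samples.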

In Figures S2 to S4, we present the relationship between the power and the total cost of data generation for different experimental designs, including sample pooling. In particular, we compare three cost-saving strategies and a reference scenario (the full budget experiment). These are:

Reference: contains a total of $n$ biological samples from two groups (each with $n/2$ samples) and there is no pooling. The average library size per sample is $20\times10^6$. The total cost is $C_t = C_{SP}\times n + C_{LP}\times n + C_S\times L_0$, where $C_{SP}$, $C_{LP}$ and $C_S$ are the sample preparation cost, the library preparation cost and the sequencing cost per $10^6$ reads, respectively.

Strategy A: pooling experiment with pool size $q$. The $n/2$ RNA samples in each group are pooled into $m = n/(2q)$ pools, with an average library size per pool of $20\times10^6$. Hence, the total cost is $C_t = C_{SP}\times n + C_{LP}\times 2m + C_S\times L_0/q$. This strategy reduces the library preparation and sequencing costs.

Strategy B: reducing the number of biological samples ($n$) without pooling. Instead of the $n$ samples of the reference design, we use $n_s$ samples, with $n_s/2$ per group and an average library size per sample of $20\times10^6$. Hence, the total cost is $C_t = C_{SP}\times n_s + C_{LP}\times n_s + C_S\times (n_s L_0/n)$. This strategy reduces the sample preparation, library preparation and sequencing costs.

Strategy C: reducing the sequencing depth. This is similar to the reference scenario, except that the average library size is reduced to $L$, where $L < 20\times10^6$. Hence, this strategy reduces only the sequencing cost, by a factor $l = 20\times10^6/L$.
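The cost formulas of the four designs can be compared directly. The sketch below is illustrative: the unit costs are made-up placeholders, library sizes are expressed in millions of reads so that $C_S$ is a cost per $10^6$ reads, and strategy B is costed with a total sequencing volume of $n_s$ times the mean library size, on the reading that each retained sample keeps the $20\times10^6$ depth. Function names are hypothetical.

```python
def cost_reference(n, c_sp, c_lp, c_s, mean_lib=20.0):
    """Reference: n samples, n libraries, total sequencing volume L0 = n * mean_lib (millions)."""
    return c_sp * n + c_lp * n + c_s * n * mean_lib

def cost_strategy_a(n, q, c_sp, c_lp, c_s, mean_lib=20.0):
    """Strategy A: n samples pooled into 2m = n / q libraries, sequencing volume L0 / q."""
    return c_sp * n + c_lp * (n / q) + c_s * n * mean_lib / q

def cost_strategy_b(n_s, c_sp, c_lp, c_s, mean_lib=20.0):
    """Strategy B: n_s samples without pooling, each sequenced at the reference depth."""
    return c_sp * n_s + c_lp * n_s + c_s * n_s * mean_lib

def cost_strategy_c(n, reduced_lib, c_sp, c_lp, c_s):
    """Strategy C: n samples, mean library size reduced to reduced_lib (millions)."""
    return c_sp * n + c_lp * n + c_s * n * reduced_lib

if __name__ == "__main__":
    # Placeholder unit costs (arbitrary currency): sample prep, library prep, per 10^6 reads.
    costs = dict(c_sp=10.0, c_lp=50.0, c_s=5.0)
    ref = cost_reference(120, **costs)
    print("A, q=2:", cost_strategy_a(120, 2, **costs) / ref)
    print("B, n_s=60:", cost_strategy_b(60, **costs) / ref)
    print("C, L=10M:", cost_strategy_c(120, 10.0, **costs) / ref)
```

Dividing each total by the reference total gives the relative cost used on the horizontal axis of the power versus cost figures.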

[Figure S2: four panels ($\theta = 0.5$ or $1$; $\phi = 0.5$ or $2$) plotting power against relative cost for the reference design, strategy A ($q = 2, 3, 4, 6$), strategy B ($n_s = 60, 40, 30, 20$) and strategy C ($L$ = 10M, 5M, 1M, 0.5M).]

Figure S2: Zodiac plot representing power (at the 5% significance level) versus the total cost of data generation. The gene expression levels are generated from $\mathrm{NB}(\rho L_j, \phi)$. This particular plot is for a gene with a relative abundance of $\rho = 10^{-7}$ (low-abundance gene) in one of the groups. The reference strategy (denoted by a diamond shape) contains $n = 120$ biological samples (without pooling) with a mean library size of $20\times10^6$ per sample. Each panel represents a different LFC ($\theta$) between the two groups and over-dispersion parameter $\phi$ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design.

[Figure S3: four panels ($\theta = 0.5$ or $1$; $\phi = 0.5$ or $2$) plotting power against relative cost for the reference design, strategy A ($q = 2, 3, 4, 6$), strategy B ($n_s = 60, 40, 30, 20$) and strategy C ($L$ = 10M, 5M, 1M, 0.5M).]

Figure S3: Zodiac plot representing power (at the 5% significance level) versus the total cost of data generation. The gene expression levels are generated from $\mathrm{NB}(\rho L_j, \phi)$. This particular plot is for a gene with a relative abundance of $\rho = 10^{-6}$ (moderate level of expression) in one of the groups. The reference strategy (denoted by a diamond shape) contains $n = 120$ biological samples (without pooling) with a mean library size of $20\times10^6$ per sample. Each panel represents a different LFC ($\theta$) between the two groups and over-dispersion parameter $\phi$ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design.

[Figure S4: four panels ($\theta = 0.5$ or $1$; $\phi = 0.5$ or $2$) plotting power against relative cost for the reference design, strategy A ($q = 2, 3, 4, 6$), strategy B ($n_s = 60, 40, 30, 20$) and strategy C ($L$ = 10M, 5M, 1M, 0.5M).]

Figure S4: Zodiac plot representing power (at the 5% significance level) versus the total cost of data generation. The gene expression levels are generated from $\mathrm{NB}(\rho L_j, \phi)$. This particular plot is for a gene with a relative abundance of $\rho = 10^{-5}$ (high level of expression) in one of the groups. The reference strategy (denoted by a diamond shape) contains $n = 120$ biological samples (without pooling) with a mean library size of $20\times10^6$ per sample. Each panel represents a different LFC ($\theta$) between the two groups and over-dispersion parameter $\phi$ (reflecting biological variability). The relative cost is determined as the total cost of each strategy divided by the total cost of the reference design.

2 Supplementary results 2: sample pooling results using the Zhang data

[Figure S5: panel A, number of genes/1000 by scenario (A0, A1 to A4, B1 to B4, C1 to C4), with symbol size proportional to depth per sample and pool size 1, 2 or 4; panel B, number of libraries per group by scenario, with symbol size proportional to cost.]

Figure S5: Sample level summaries. A) the number of genes with non-zero expressions in at least 3 libraries versus sequencing depth per library (symbol size); B) total cost (symbol size) versus number of libraries for each scenario.

[Figure S6: distributions of the between-sample correlations and of the fraction of zero counts per sample for scenarios A0, A1 to A4, B1 to B4 and C1 to C4 (pool size 1, 2 or 4), each also shown as medians against median library size and pool size ($q$).]

Figure S6: Sample level summaries of the observed data in each scenario in terms of (1) the distribution of the pairwise correlation coefficients between samples within a condition (MYCN status), and (2) the distribution of the fraction of zero counts observed in each sample. These summaries are also plotted as a function of the median library size and pool size in each scenario.

[Figure S7: PCA scatter plots (PC1 versus PC2) for scenarios A0, A1 to A4, B1 to B4 and C1 to C4, coloured by group (0, 1); depending on the scenario, PC1 explains 18% to 38% and PC2 10% to 14% of the variance.]

Figure S7: Two-dimensional visualization of neuroblastoma RNA samples (before and after pooling) using principal component analysis. The groups are defined by the MYCN status (group=1 for MYCN amplified and group=0 for MYCN non-amplified samples). The PCA was applied to the log-CPM transformed read counts.

[Figure S8: standardized LFC by scenario (A0, A1 to A4, B1 to B4, C1 to C4) in two panels, MYCN pathway genes and top 200 DE genes; colour indicates $q$ (1, 2 or 4).]

Figure S8: Standardized log-fold-change (LFC) for MYCN pathway genes and the top 200 DE genes detected in the reference scenario A0 using limma-voom.

[Figure S9: panel A, number of DE genes (5% FDR) by scenario (A0, A1 to A4, B1 to B4, C1 to C4) for edgeR and limma, with pool size 1, 2 or 4; panel B, fraction of overlap with A0 by scenario for edgeR and limma.]

Figure S9: Differential gene expression results for pooling scenarios generated using the Zhang RNA-seq dataset. A) The number of DE genes detected at 5% FDR; B) The fraction of overlap (concordance) defined as the fraction of DE genes detected in a test scenario that are also detected in the reference scenario.

[Figure S10: coefficients of variation and log2-mean expression by scenario (A0, A1 to A4, B1 to B4, C1 to C4); colour indicates $q$ (1, 2 or 4).]

Figure S10: The distribution of the mean and coefficients of variation of the normalized read counts of the set of DE genes uniquely detected by limma in each scenario.

3 Supplementary results 3: sample pooling results using the NGP nutlin data

[Figure S11: panel A, distributions of LFC, log-variance and log2-mean expression for scenarios A0, A, B and C; panel B, LFC bias (mean absolute bias) for scenarios A, B and C.]

Figure S11: Summary of gene level characteristics for pooling scenarios generated using the NGP nutlin dataset. A) The distribution of the log2-mean normalized expression of genes, the log2-variance of normalized expression of genes, and the log-fold-change (LFC) between nutlin-3 and control; B) the estimated bias (mean absolute bias) of the three test scenarios relative to the reference scenario A0.

[Figure S12: panel A, number of DE genes (5% FDR) for scenarios A0, A, B and C with edgeR and limma; panel B, fraction of overlap with A0 for scenarios A, B and C with edgeR and limma.]

Figure S12: Differential gene expression results for pooling scenarios generated using the NGP nutlin RNA-seq dataset. A) The number of DE genes detected at 5% FDR; B) The fraction of overlap (concordance) defined as the fraction of DE genes detected in a test scenario that are also detected in the reference scenario.

[Figure S13: three panels ($\rho$ = 1e−07, 1e−06, 1e−05) plotting power against relative cost for $q = 1$, $2$ and $3$, with separate curves for LFC $\theta \ge 0.5$ and $\theta \ge 1.0$.]

Figure S13: Zodiac plot representing power (at the 5% significance level) versus the total cost of data generation for the NGP cell line data. The gene expression levels are generated from $\mathrm{NB}(\rho L_j, \phi)$, where $\phi$ is the common over-dispersion parameter for the NGP nutlin data (estimated using the edgeR package). The plots are generated for low, medium and high abundance genes with relative abundances of $\rho = 10^{-7}$, $\rho = 10^{-6}$ and $\rho = 10^{-5}$, respectively. One unpooled design ($q = 1$) and two pooled designs ($q = 2$ and $q = 3$) were compared. These designs have an equal number of replicates (3 replicates per group), that is, 3 individual cell lines ($q = 1$), 3 pools of 2 cell lines ($q = 2$) and 3 pools of 3 cell lines ($q = 3$). The mean library size per cell line is $15\times10^6$. The curves are generated for two different minimum LFCs ($\theta$) between the two groups, $\theta \ge 0.5$ and $\theta \ge 1$. The relative cost is determined as the total cost of each strategy divided by the maximum total cost without pooling (9 cell lines per group).

4 Supplementary results 4: additional results

[Figure S14: panels A, B and C showing score differences (axis from 0.0 to 1.2) for pairwise scenario comparisons such as A0 − A1, B1 − B2 and C1 − C2.]

Figure S14: Pairwise comparison of scenarios based on the overall score for different characteristics. A) scenarios with different number of libraries but equal sequencing depth per library (demonstrating a sample size driven effect), B) scenarios with equal number of libraries but different sequencing depth per library (demonstrating a minor effect of sequencing depth), and C) scenarios with equal number of libraries and equal sequencing depth per library but different pool size (demonstrating a large pooling effect).

References

Zhu, Haiyuan, and Hassan Lakkis. 2014. “Sample Size Calculation for Comparing Two Negative Binomial Rates.” Statistics in Medicine 33 (3): 376–87.
