Diagnostic plots for multiple testing - Statistical analysis of RNA-Seq data

For diagnostic of multiple testing results it is instructive to look at the histogram of values. A histogram of p-values is a bar graph of the number or proportion of p-p-values that fall within certain non-overlapping sub-intervals of [0, 1]. This simple graphic assessment can indicate when problems are present in the analysis. The most desirable shape for the p-value histogram is one in which the p-values are most dense near zero and become less dense as the p-values increase (Figure19(a)). This shape does not indicate any problem of the methods operating on p-values and suggests that several genes are differentially expressed, though they may not be statistically significant after adjusting for multiple testing. An U-shaped histogram is also desirable (Figure19(b)). The enrichment of low p-values stems from the differentially expressed genes, while those not differentially expressed are spread uniformly over the range from zero to one (except for the p-values from genes with very low counts, which take discrete values and so give rise to high counts for some bins at the right).

0 1000 2000 3000

0.00 0.25 0.50 0.75 1.00

p−values

0.00 0.25 0.50 0.75 1.00

p−values

Frequency

(b)

Figure 19: Desirable shape histograms of the p-values returned by the test for differential expression.

Observed p-values can be considered as mix of samples from:

• a uniform distribution (from true nulls); and

• from distributions concentrated at 0 (from true alternatives).

0 200 400 600 800

0.00 0.25 0.50 0.75 1.00

Frequency

0 1000 2000 3000

0.00 0.25 0.50 0.75 1.00

0 1000 2000 3000

0.00 0.25 0.50 0.75 1.00

+ =

Distribution under H0 Distribution under H1 Resulting mixture distribution

If the histogram of the p-values does not match a profile as shown above (Figure19), the multiple testing approach is not reliable. A flat p-value histogram (Figure 20(a)) does not indicate any problem of the methods that operate on p-values but suggests the disappointing result that very few, if any, genes are differentially expressed. Depletion of small p-values can indicate the presence of confounding hidden variables – “batch effect” (Figure20(b)). A p-value histogram with one or more humps (Figure 20(c)) can indicate that an inappropriate statistical test was used to compute the p-values, or a strong and extensive correlation structure is present in the data set. A p-value histogram with a choppy appearance (Figure20(d)) indicates that the p-values are discretely distributed.

0 200 400 600 800

0.00 0.25 0.50 0.75 1.00

p−values

Frequency

(a)

0 200 400 600 800

0.00 0.25 0.50 0.75 1.00

p−values

Frequency

(b)

0 1000 2000 3000

0.00 0.25 0.50 0.75 1.00

p−values

Frequency

(c)

0 500 1000 1500

0.00 0.25 0.50 0.75 1.00

p−values

Frequency

(d)

Figure 20: Examples of p-value histograms from multiple testing approach not reliable.

10 Differential expression analysis

10.1 Modeling count data: the Binomial and Poisson distribution

In high-throughput sequencing experiments, the raw data are millions of reads which are typically aligned to locations (regions) in the genome.

190 mb

190.01 mb

190.02 mb

190.03 mb

Reads

Let’s suppose there are as many colors as regionsgin the genome

genome

and reads which are aligned to regiong take the same (light) color.

genomereads

A sequenced read can be thought of as a draw of a colored ball from a bag, where there are as many colors as regions g and the bag represents a large pool of DNA fragments.

This bag containsncolored balls, the total number of sequenced reads. For each regiong, independent trials consist in to draw a ball out of the bag, without looking.

genomereads

ConsiderKg, the number of sequenced reads which can be assigned to a particular regiong.

The probability pg (simply p to simplify notation) for drawing a ball of one color from the bag is given by the proportion of DNA fragments arising from the genomic regiong.

We now derive the probability function ofKg. First we note thatKgcan take the values0,1,2, . . . , nwherenis the total number of sequenced reads. Hence, there aren+ 1possible discrete values forK_g. We are interested in the probability thatK_g takes a particular valuek. If there arek aligned reads to regiong (successes), then there must ben−knot aligned reads to regiong(failures). One way for this to happen is for the first ktrials to be successes, and the remainingn−k to be failures. The probability of this is given by

p×p× · · · ×p

| {z }

kof these

×(1−p)×(1−p)× · · · ×(1−p)

| {z }

n−kof these

=p^k(1−p)^n−k

However, this is only one of the many ways that we can obtain k successes and n−k failures. They could occur in exactly the opposite order: all n−kfailures first, thenk successes. This outcome, which has the same number of successes as the first outcome but with a different arrangement, also has probability p^k(1−p)^n−k. There are many other ways to arrangek successes andn−kfailures among thenpossible positions. Once we position thek successes, the positions of then−kfailures are unavoidable.

The number of ways of arrangingksuccesses inntrials is equal to the number of ways of choosing kobjects from nobjects, which is equal to

We can now determine the total probability of obtaining k successes. Each outcome with k successes and n−k failures has individual probabilityp^k(1−p)^n−k, and there are ⁿ_k

possible arrangements. Hence, the total probability ofksuccesses in nindependent trials is

P(Kg=k) = n

p^k(1−p)^n−k, fork= 0,1,2, . . . , n This is theBinomial distribution with parametersnandp. We writeKg ∼Bin(n, p).

Poisson approximation

As the number of sequenced reads becomes large and the probability p shrinks, the Binomial distribution can be approximated by the Poisson distribution:

• split the pool of DNA fragments into a large number of nintervals, each with a small probabilitypof success and such that the total number of successes Kg∼Bin(n, p)

0 1 2 3 4 5 n – 1 n

p p p p p

Pool of ADN

• the number of successes, occurring in the interval[0, n], occur independently and at a constant average rate λ=np. Then

This is thePoisson distributionwith parametersλ. We writeKg∼Pois(λ).

Example

Suppose that there is a constant probabilityp= 0.0001that an read is aligned at the region g. Then the number of sequencing reads which can be assigned to region g, K_g, in a pool of DNA of 13000 fragments is Bin(13000, 0.0001).

P(Kg=k) 0 1 5 10 15 20 25

n k

p^k(1−p)^n−k 0.000074 0.000708 0.048227 0.123566 0.026498 0.001098 0.000013

λ^k

k!e^−λ 0.000075 0.000711 0.048266 0.123502 0.026519 0.001103 0.000013

As shown in Figure21, these two distributions are very similar.

Figure 21: With a large number of trials and small probabilities, the Binomial is well approximated by a Poisson. The discrete probabilities are jointed by lines for easing the visualization.

Dans le document Statistical analysis of RNA-Seq data (Page 41-46)