Independent filtering - Statistical analysis of RNA-Seq data

By removing the weakly-expressed genes from the input to the FDR procedure, you can find more genes to be significant among those which are kept, and so improve the power of your test.
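As a toy illustration of why this can help, the sketch below applies the Benjamini-Hochberg adjustment to ten made-up p-values, once on all genes and once on a filtered subset. The p-values and the keep vector are assumptions for the example and are not taken from the data set. keep stands in for the outcome of an expression-level filter, which must not depend on the test statistic itself.

```r
# Toy p-values for 10 genes (hypothetical numbers, for illustration only)
pval <- c(0.001, 0.012, 0.02, 0.03, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95)

# Assumed outcome of an expression filter: TRUE = gene is kept
keep <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

# Benjamini-Hochberg adjustment on all genes and on the kept genes only
fdrAll  <- p.adjust(pval, method = "BH")
fdrKept <- p.adjust(pval[keep], method = "BH")

# Number of genes below an FDR of 0.05 in each case
nAll  <- sum(fdrAll < 0.05)   # 1 gene
nKept <- sum(fdrKept < 0.05)  # 4 genes
```

With fewer tests the multiplicity penalty is smaller, so in this toy case three further genes fall below the 5% threshold after filtering.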

Exercise 4.17 Load the HTSFilter package and perform independent filtering on the results of the exact test. Use the HTSFilter() function on the dgeTest data object and create an object dgeTestFilt of the same class as dgeTest containing the data that pass the filter.

Answer:

library(HTSFilter)

dgeTestFilt

An object of class "DGEExact"

$table

                logFC logCPM    PValue
FBgn0000008 -0.054919  3.036 8.240e-01
FBgn0000017 -0.248031  8.440 4.289e-02
FBgn0000018 -0.036550  4.940 7.893e-01
FBgn0000032  0.004668  6.172 9.729e-01
FBgn0000042  0.516376 12.711 6.202e-08
6614 more rows ...

$comparison

[1] "control" "treated"

4.7 Diagnostic plot for multiple testing

To diagnose the results of multiple testing, it is instructive to look at the histogram of p-values.

Exercise 4.18 Use the hist() function to plot a histogram of the (unadjusted) p-values in the dgeTest data object.

Answer: Use breaks = 50 in hist() to generate this plot.

[Figure: Histogram of dgeTest$table$PValue. x-axis: dgeTest$table$PValue (0.0 to 1.0); y-axis: Frequency (0 to 2000)]
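A minimal sketch of this diagnostic on simulated p-values follows. The mixture below is an assumption for illustration; with the tutorial objects, the vector would be dgeTest$table$PValue. Genes without differential expression give p-values spread evenly over [0, 1], while differentially expressed genes pile up near 0.

```r
set.seed(1)

# Simulated p-values (assumption): 8000 null genes, uniform on [0, 1],
# plus 2000 non-null genes whose p-values concentrate near 0
pSim <- c(runif(8000), rbeta(2000, shape1 = 0.5, shape2 = 10))

# Histogram with about 50 bins, as suggested in the answer
h <- hist(pSim, breaks = 50,
          main = "Histogram of simulated p-values",
          xlab = "p-value")
```

A flat right-hand side with a peak near 0 is the expected shape. A strong excess of p-values near 1 or in the middle of the range would suggest that something is wrong with the test or the filtering.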

Exercise 4.19 Plot a histogram from (unadjusted) p-values after independent filtering.

Answer: Use breaks = 50 in hist() to generate this plot.

[Figure: Histogram of dgeTestFilt$table$PValue. x-axis: dgeTestFilt$table$PValue (0.0 to 1.0); y-axis: Frequency (0 to 1500)]

4.8 Inspecting the results

Results tables are generated using the function topTags(), which extracts a table with log2 fold changes, p-values and adjusted p-values.

Exercise 4.20 Use the topTags() function to extract a tabular summary of the differential expression statistics from the test results before and after independent filtering (check the n argument). Create the variables resNoFilt and resFilt from it. Visualize and inspect these variables. Are the genes sorted? If so, by which column?

Answer:

head(resNoFilt)

             logFC logCPM     PValue        FDR
FBgn0039155 -4.378  5.588 4.293e-184 4.937e-180
FBgn0025111  2.943  7.159 2.758e-152 1.586e-148
FBgn0003360 -2.961  8.059 1.939e-151 7.432e-148
FBgn0039827 -4.129  4.281 5.594e-104 1.608e-100
FBgn0026562 -2.447 11.903 2.260e-102  5.198e-99
FBgn0035085 -2.499  5.542  1.241e-96  2.379e-93

head(resFilt)

             logFC logCPM     PValue        FDR
FBgn0039155 -4.378  5.588 4.293e-184 2.841e-180
FBgn0025111  2.943  7.159 2.758e-152 9.128e-149
FBgn0003360 -2.961  8.059 1.939e-151 4.277e-148
FBgn0039827 -4.129  4.281 5.594e-104 9.257e-101
FBgn0026562 -2.447 11.903 2.260e-102  2.992e-99
FBgn0035085 -2.499  5.542  1.241e-96  1.369e-93

Exercise 4.21 Compare the number of genes found at an FDR of 0.05 from the differential analysis before and after independent filtering.

Answer:

## Before independent filtering
[1] 1347

## After independent filtering
[1] 1363

Continue with the independently filtered data in the resFilt data object.

The FDR column in the table resFilt contains the p-values adjusted for multiple testing with the Benjamini-Hochberg procedure (i.e. FDR). This is the information that we will use to decide whether the expression of a given gene differs significantly across conditions (e.g. we can arbitrarily decide that genes with an FDR < 0.01 are differentially expressed).
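To make the adjustment concrete, the sketch below computes Benjamini-Hochberg adjusted p-values by hand for five made-up p-values and compares them with base R's p.adjust(). The input values are hypothetical. Each p-value is multiplied by m / rank, where m is the number of tests, and a running minimum taken from the largest p-value downwards keeps the adjusted values monotone.

```r
# Toy p-values (hypothetical, for illustration only)
pval <- c(0.04, 0.001, 0.03, 0.5, 0.012)
m <- length(pval)

# Order from largest to smallest p-value; ro maps back to the input order
o  <- order(pval, decreasing = TRUE)
ro <- order(o)

# p * m / rank, with rank running m, m-1, ..., 1 along the sorted vector,
# then a running minimum and a cap at 1
bhManual <- pmin(1, cummin(m / (m:1) * pval[o]))[ro]

# Reference result from base R
bhBuiltin <- p.adjust(pval, method = "BH")
```

Both vectors agree, and every adjusted value is at least as large as the raw p-value it came from.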

Exercise 4.22 Consider all genes with an adjusted p-value below 5% (alpha = 0.05) and subset the results to these genes. Sort them by the log₂-fold-change estimate to get the significant genes with the strongest down-regulation.

Answer:

head(sigDownReg)

             logFC logCPM     PValue        FDR
FBgn0085359 -5.153  1.966  2.843e-26  3.764e-24
FBgn0039155 -4.378  5.588 4.293e-184 2.841e-180
FBgn0024288 -4.208  2.161  1.359e-33  2.645e-31
FBgn0039827 -4.129  4.281 5.594e-104 9.257e-101
FBgn0034434 -3.824  3.107  1.732e-52  7.644e-50
FBgn0034736 -3.482  4.060  5.794e-68  3.196e-65

Exercise 4.23 Repeat Exercise 4.22 for the most strongly up-regulated genes.

Answer:

head(sigUpReg)

            logFC logCPM     PValue        FDR
FBgn0033764 3.268  2.612  3.224e-29  4.743e-27
FBgn0035189 2.973  4.427  7.154e-48  2.492e-45
FBgn0025111 2.943  7.159 2.758e-152 9.128e-149
FBgn0037290 2.935  2.523  1.192e-25  1.461e-23
FBgn0038198 2.670  2.587  4.163e-19  3.167e-17
FBgn0000071 2.565  5.034  1.711e-78  1.416e-75
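The subsetting and sorting asked for in Exercises 4.22 and 4.23 can be sketched on a small toy table. The gene names, values and object names below are hypothetical. In the tutorial, the same steps would be applied to the table of results held in resFilt.

```r
# Toy results table (hypothetical values, for illustration only)
toyRes <- data.frame(
  logFC = c(-4.4, 2.9, 0.2, -1.1, 3.3, -0.1),
  FDR   = c(1e-50, 1e-30, 0.60, 0.03, 0.20, 0.90),
  row.names = paste0("gene", 1:6)
)

# Keep the genes with an adjusted p-value below alpha = 0.05
alpha  <- 0.05
toySig <- toyRes[toyRes$FDR < alpha, ]

# Ascending logFC: strongest down-regulation first
toyDown <- toySig[order(toySig$logFC), ]

# Descending logFC: strongest up-regulation first
toyUp <- toySig[order(toySig$logFC, decreasing = TRUE), ]
```

Note that gene5 has the largest fold change in the toy table but is excluded, because its FDR is above the threshold.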

Exercise 4.24 Create persistent storage of results. Save the result tables as a csv (comma-separated values) file using the write.csv() function (alternative formats are possible).

Answer:

write.csv(sigDownReg, file = "sigDownReg.csv")
write.csv(sigUpReg, file = "sigUpReg.csv")
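A quick way to check that a saved table can be recovered is to read it back. The sketch below does a round trip through a temporary file, using the first two rows of the down-regulated table shown earlier. The object names toyTab and toyBack are assumptions for the example.

```r
# Two rows copied from the sigDownReg output above
toyTab <- data.frame(
  logFC = c(-5.153, -4.378),
  FDR   = c(3.764e-24, 2.841e-180),
  row.names = c("FBgn0085359", "FBgn0039155")
)

# Write to a temporary file so nothing is left in the working directory
csvFile <- tempfile(fileext = ".csv")
write.csv(toyTab, file = csvFile)

# write.csv() stores the row names in the first column;
# row.names = 1 restores them as row names when reading back
toyBack <- read.csv(csvFile, row.names = 1)
```

Without row.names = 1, the gene identifiers would come back as an ordinary first column rather than as row names.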

4.9 Interpreting the DE analysis results

MA-plot

Exercise 4.25 Create an MA-plot using the plotSmear() function, showing the genes selected as differentially expressed at a 1% false discovery rate.

Answer:

[Figure: MA-plot. x-axis: Average logCPM (2 to 14); y-axis: logFC (−4 to 2)]
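As a stand-in for plotSmear(), the sketch below draws the same kind of plot with base graphics on simulated values. All numbers are made up, and this is not the edgeR function itself. With edgeR, plotSmear() takes the names of the genes to highlight through its de.tags argument.

```r
set.seed(2)
n <- 500

# Simulated per-gene statistics (assumption, for illustration only)
toyMA <- data.frame(
  logCPM = runif(n, min = 2, max = 14),
  logFC  = rnorm(n, sd = 0.5),
  FDR    = runif(n)
)

# Make the first 20 genes clearly differentially expressed
toyMA$logFC[1:20] <- rep(c(-3, 3), each = 10)
toyMA$FDR[1:20]   <- 1e-5

# Genes selected at a 1% false discovery rate
isDE <- toyMA$FDR < 0.01

# MA-plot: average expression on x, fold change on y, DE genes in red
plot(toyMA$logCPM, toyMA$logFC, pch = 16, cex = 0.5,
     col = ifelse(isDE, "red", "black"),
     xlab = "Average logCPM", ylab = "logFC")
abline(h = 0, col = "blue")
```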

Volcano plot

Exercise 4.26 Create a volcano plot from the res data object. First construct a table containing the log₂ fold change and the negative log₁₀-transformed p-values, then generate the volcano plot using the standard plot() command. Hint: the negative log₁₀-transformed value of x is -log10(x).

Answer: First few rows of the table containing the log₂ fold change and the negative log₁₀-transformed p-values (the values shown match -log10 of the FDR column of resFilt).

   logFC negLogPval
1 -4.378     179.55
2  2.943     148.04
3 -2.961     147.37
4 -4.129     100.03
5 -2.447      98.52
6 -2.499      92.86

[Figure: Volcano plot. x-axis: logFC (−4 to 2); y-axis: negLogPval (0 to 150)]
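The two steps of the exercise, building the table and then plotting it, can be sketched as follows on toy values. The numbers and the names toyStats and volcanoTab are assumptions for the example.

```r
# Toy fold changes and p-values (hypothetical, for illustration only)
toyStats <- data.frame(
  logFC  = c(-4.4, 2.9, -3.0, 0.1, -0.3, 0.5),
  PValue = c(1e-180, 1e-150, 1e-100, 0.8, 0.3, 0.04)
)

# Table with the fold change and the negative log10-transformed p-value
volcanoTab <- data.frame(
  logFC      = toyStats$logFC,
  negLogPval = -log10(toyStats$PValue)
)

# Volcano plot with the standard plot() command
plot(volcanoTab$logFC, volcanoTab$negLogPval, pch = 16,
     xlab = "logFC", ylab = "negLogPval")
```

Genes of interest sit in the upper left and upper right corners, where a large fold change coincides with a small p-value.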

Gene clustering

To draw a CIM (or heatmap) of individual RNA-seq samples, edgeR suggests using moderated log-counts-per-million.

This can be calculated by the cpm() function with positive values for prior.count, for example

y = cpm(dge, prior.count = 1, log = TRUE)

where dge is the normalized DGEList object. This produces a matrix of log₂ counts-per-million (logCPM), with undefined values avoided and the poorly defined log-fold-changes for low counts shrunk towards zero. Larger values for prior.count produce more shrinkage.
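The effect of the prior count can be seen in a simplified base R sketch. The function toyLogCPM() below is hypothetical and is not edgeR's cpm(). It simply adds the prior count to every count before taking the logarithm, whereas edgeR's implementation is more refined (to the best of our understanding it also scales the prior count by library size), so its numbers will differ.

```r
# Simplified moderated log2-CPM (a sketch, NOT edgeR's cpm()):
# add prior.count to each count, divide by the library size
# plus 2 * prior.count, and scale to counts per million
toyLogCPM <- function(counts, prior.count = 1) {
  libSize <- colSums(counts)
  # t(counts) has one row per sample, so libSize is recycled per sample
  t(log2((t(counts) + prior.count) / (libSize + 2 * prior.count) * 1e6))
}

# Toy counts (hypothetical): geneA is 0 in one sample and 4 in the other
toyCounts <- matrix(c(0, 500, 500,
                      4, 498, 498),
                    ncol = 2,
                    dimnames = list(c("geneA", "geneB", "geneC"),
                                    c("s1", "s2")))

y1 <- toyLogCPM(toyCounts, prior.count = 1)
y5 <- toyLogCPM(toyCounts, prior.count = 5)

# Log-fold-change of the low-count gene between the two samples
lfc1 <- y1["geneA", "s2"] - y1["geneA", "s1"]
lfc5 <- y5["geneA", "s2"] - y5["geneA", "s1"]
```

The zero count no longer produces an undefined value, and the fold change for geneA is smaller with prior.count = 5 than with prior.count = 1, which is the shrinkage described above.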

Exercise 4.27 Transform the normalized counts from the dge data object using the cpm() function.

Answer: First few rows of the transformed counts.

            treated2 treated3 untreated3 untreated4
FBgn0000003  -3.2526  -2.3226     -3.253     -3.253
FBgn0000008   3.2056   2.7556      3.195      2.894
FBgn0000015  -3.2526  -3.2526     -2.158     -1.670
FBgn0000017   8.3151   8.3073      8.730      8.366
FBgn0000018   4.9585   4.8757      4.873      5.025
FBgn0000024  -0.2681  -0.7863     -1.113     -1.255

Since the clustering is only relevant for genes that are actually differentially expressed, carry it out only for the subset of genes with the strongest differential expression.

Exercise 4.28 Select those genes that have adjusted p-values below 0.01 and an absolute log₂-fold-change above 1.5 from the transformed data, and use the function cim() to produce a CIM.

Answer: Transformed values for the first ten selected genes.

            treated2 treated3 untreated3 untreated4
FBgn0039155   2.0155    2.335      6.466      6.608
FBgn0025111   7.9476    7.996      5.059      5.006
FBgn0003360   5.9800    5.878      8.802      8.970
FBgn0039827   0.6377    1.516      5.252      5.212
FBgn0026562  10.1798   10.247     12.544     12.769
FBgn0035085   3.8172    3.843      6.279      6.363
FBgn0029167   6.6574    6.462      8.839      8.733
FBgn0000071   5.7325    5.821      3.157      3.283
FBgn0029896   3.3297    3.438      6.007      5.834
FBgn0034897   4.8330    4.643      6.848      6.746
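The selection step can be sketched in base R on a toy matrix. The gene names, values and object names below are hypothetical, and base R's heatmap() is used here only as a stand-in for cim().

```r
# Toy logCPM matrix: 4 genes x 4 samples (hypothetical values)
toyY <- matrix(c(2.0, 2.3, 6.5, 6.6,
                 7.9, 8.0, 5.1, 5.0,
                 4.9, 4.9, 4.9, 5.0,
                 8.3, 8.3, 8.7, 8.4),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4),
                               c("treated2", "treated3",
                                 "untreated3", "untreated4")))

# Toy test results for the same genes (hypothetical values)
toyTest <- data.frame(
  logFC = c(-4.4, 2.9, 0.0, -0.2),
  FDR   = c(1e-50, 1e-30, 0.9, 0.04),
  row.names = rownames(toyY)
)

# Genes with adjusted p-value below 0.01 and |log2 fold change| above 1.5
selected <- rownames(toyTest)[toyTest$FDR < 0.01 & abs(toyTest$logFC) > 1.5]

# Subset the transformed data by gene name; drop = FALSE keeps a matrix
toySel <- toyY[selected, , drop = FALSE]

# Base R heatmap as a stand-in for cim(), with a sequential palette
heatmap(toySel, scale = "none",
        col = colorRampPalette(c("white", "steelblue"))(255))
```

Selecting by gene name rather than by row position keeps the subset correct even though the results table and the logCPM matrix are usually sorted differently.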

Use col = cimColor for sequential colour schemes and symkey = FALSE in the cim() function.

## For sequential colour schemes
library(RColorBrewer)

cimColor = colorRampPalette(brewer.pal(9, "Blues"))(255)

[Figure: CIM (heatmap) of the selected genes (rows) across the samples treated2, treated3, untreated3 and untreated4 (columns). Colour key from −2.32 to 12.77]
