Independent filtering - Statistical analysis of RNA-Seq data (solutions)

By removing the weakly-expressed genes from the input to the FDR procedure, you can find more genes to be significant among those which are kept, and so improve the power of your test.
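The gain in power can be illustrated with a small base-R sketch. The p-values below are made up for the purpose (they do not come from the analysis in this document), and the `weak` flag is a hypothetical stand-in for the filter decision; the only real function used is `p.adjust()`.

```r
# Toy p-values (NOT from the analysis above): 20 expressed genes with small
# p-values, 30 expressed genes without signal, and 950 weakly-expressed genes
# whose p-values are all large.
pval <- c((1:20) * 0.0005,
          seq(0.1, 1, length.out = 30),
          seq(0.2, 1, length.out = 950))
weak <- rep(c(FALSE, TRUE), times = c(50, 950))  # hypothetical filter flag

# Benjamini-Hochberg adjustment on all 1000 genes.
fdrAll <- p.adjust(pval, method = "BH")

# Same adjustment after dropping the weakly-expressed genes first.
fdrKept <- p.adjust(pval[!weak], method = "BH")

nBefore <- sum(fdrAll[!weak] < 0.05)
nAfter  <- sum(fdrKept < 0.05)
cat("significant before filtering:", nBefore, "\n")
cat("significant after filtering: ", nAfter, "\n")
```

In this constructed case the 20 small p-values only become significant once the 950 uninformative tests are removed, because the Benjamini-Hochberg correction scales with the number of tests.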

Exercise 4.17 Load the HTSFilter package and perform independent filtering from results of the exact test. Use the HTSFilter() function on the dgeTest data object and create an object dgeTestFilt of the same class as dgeTest containing the data that pass the filter.

Solution 4.17

library(HTSFilter)

dgeTestFilt = HTSFilter(dgeTest, DGEList = dge, plot = FALSE)$filteredData

4.7 Diagnostic plot for multiple testing

To diagnose multiple testing results, it is instructive to look at the histogram of p-values.
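As a rough guide to what such a histogram looks like, the following sketch simulates p-values rather than using the real data. Under the null hypothesis p-values are uniform on [0, 1], so genes without an effect give a flat background, while genes with a true effect pile up near zero. The mixture proportions and the beta distribution are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated p-values (assumption, not the real data): 9000 genes under the
# null hypothesis (uniform) and 1000 genes with a true effect (skewed to 0).
pNull <- runif(9000)
pAlt  <- rbeta(1000, shape1 = 0.5, shape2 = 10)
pvals <- c(pNull, pAlt)

# 50 equal-width bins on [0, 1].
h <- hist(pvals, breaks = seq(0, 1, length.out = 51),
          main = "Simulated p-values", xlab = "p-value")

# The first bin collects the true effects on top of the flat null background.
cat("first bin:", h$counts[1], " average bin:", mean(h$counts), "\n")
```

A histogram that departs strongly from this shape (for example a peak near 1 or a pronounced hump in the middle) usually deserves a closer look before the adjusted p-values are interpreted.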

Exercise 4.18 Use the hist() function to plot a histogram of the (unadjusted) p-values in the dgeTest data object.

Solution 4.18

Use breaks = 50 in hist() to generate this plot.

hist(dgeTest$table$PValue, breaks = 50)

[Figure: Histogram of dgeTest$table$PValue. x-axis: dgeTest$table$PValue (0.0 to 1.0); y-axis: Frequency (0 to 2000).]

Exercise 4.19 Plot a histogram from (unadjusted) p-values after independent filtering.

Solution 4.19

Use breaks = 50 in hist() to generate this plot.

hist(dgeTestFilt$table$PValue, breaks = 50)

[Figure: Histogram of dgeTestFilt$table$PValue. x-axis: dgeTestFilt$table$PValue (0.0 to 1.0); y-axis: Frequency (0 to 1500).]

4.8 Inspecting the results

Results tables are generated using the function topTags(), which extracts a table with log2 fold changes, p-values and adjusted p-values.

Exercise 4.20 Use the topTags() function to extract a tabular summary of the differential expression statistics from test results before and after independent filtering (check the n argument). Create variables resNoFilt and resFilt from it. Visualize and inspect these variables. Are the genes sorted? If so, by which column?

Solution 4.20

resNoFilt = topTags(dgeTest, n = nrow(dgeTest))$table

resFilt = topTags(dgeTestFilt, n = nrow(dgeTestFilt))$table

head(resNoFilt)

             logFC logCPM     PValue        FDR
FBgn0039155 -4.378  5.588 4.293e-184 4.937e-180
FBgn0025111  2.943  7.159 2.758e-152 1.586e-148
FBgn0003360 -2.961  8.059 1.939e-151 7.432e-148
FBgn0039827 -4.129  4.281 5.594e-104 1.608e-100
FBgn0026562 -2.447 11.903 2.260e-102  5.198e-99
FBgn0035085 -2.499  5.542  1.241e-96  2.379e-93

head(resFilt)

             logFC logCPM     PValue        FDR
FBgn0039155 -4.378  5.588 4.293e-184 2.841e-180
FBgn0025111  2.943  7.159 2.758e-152 9.128e-149
FBgn0003360 -2.961  8.059 1.939e-151 4.277e-148
FBgn0039827 -4.129  4.281 5.594e-104 9.257e-101
FBgn0026562 -2.447 11.903 2.260e-102  2.992e-99
FBgn0035085 -2.499  5.542  1.241e-96  1.369e-93

By default, genes are sorted by p-value ("PValue").

Exercise 4.21 Compare the number of genes found at an FDR of 0.05 from the differential analysis before and after independent filtering.

Continue with the independent filtered data in the resFilt data object.

The FDR column in the table resFilt contains the adjusted p-values for multiple testing with the Benjamini-Hochberg procedure (i.e. FDR). This is the information that we will use to decide whether the expression of a given gene differs significantly across conditions (e.g. we can arbitrarily decide that genes with an FDR < 0.01 are differentially expressed).
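To make the Benjamini-Hochberg adjustment concrete, here is a small worked sketch on five made-up p-values. It computes the adjustment by hand and compares it with base R's `p.adjust()`. The variable names are illustrative only.

```r
# Five made-up p-values, only to make the arithmetic easy to follow.
p <- c(0.001, 0.008, 0.039, 0.041, 0.6)
m <- length(p)

# Manual Benjamini-Hochberg: scale each sorted p-value by m / rank, then
# enforce monotonicity by taking the running minimum from the largest down.
ord <- order(p)
scaled <- p[ord] * m / seq_len(m)
adjSorted <- rev(cummin(rev(scaled)))
adjManual <- pmin(1, adjSorted)[order(ord)]  # back to the original order

adjBuiltin <- p.adjust(p, method = "BH")
print(rbind(p, adjManual, adjBuiltin))
```

Note how the third p-value (0.039) receives the same adjusted value as the fourth (0.05125): its own scaled value, 0.065, is larger than that of its neighbour, so the running minimum takes over.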

Exercise 4.22 Consider all genes with an adjusted p-value below 5% (alpha = 0.05) and subset the results table to these genes. Sort it by the log2-fold-change estimate to get the significant genes with the strongest down-regulation.

Solution 4.22

alpha = 0.05

sigDownReg = resFilt[resFilt$FDR < alpha, ]

sigDownReg = sigDownReg[order(sigDownReg$logFC), ]

head(sigDownReg)

             logFC logCPM     PValue        FDR
FBgn0085359 -5.153  1.966  2.843e-26  3.764e-24
FBgn0039155 -4.378  5.588 4.293e-184 2.841e-180
FBgn0024288 -4.208  2.161  1.359e-33  2.645e-31
FBgn0039827 -4.129  4.281 5.594e-104 9.257e-101
FBgn0034434 -3.824  3.107  1.732e-52  7.644e-50
FBgn0034736 -3.482  4.060  5.794e-68  3.196e-65

Exercise 4.23 Repeat Exercise 4.22 for the most strongly up-regulated genes.

Solution 4.23

sigUpReg = resFilt[resFilt$FDR < alpha, ]

sigUpReg = sigUpReg[order(sigUpReg$logFC, decreasing = TRUE), ]

head(sigUpReg)

            logFC logCPM     PValue        FDR
FBgn0033764 3.268  2.612  3.224e-29  4.743e-27
FBgn0035189 2.973  4.427  7.154e-48  2.492e-45
FBgn0025111 2.943  7.159 2.758e-152 9.128e-149
FBgn0037290 2.935  2.523  1.192e-25  1.461e-23
FBgn0038198 2.670  2.587  4.163e-19  3.167e-17
FBgn0000071 2.565  5.034  1.711e-78  1.416e-75

Exercise 4.24 Create persistent storage of results. Save the result tables as a csv (comma-separated values) file using the write.csv() function (alternative formats are possible).

Solution 4.24

write.csv(sigDownReg, file = "sigDownReg.csv")

write.csv(sigUpReg, file = "sigUpReg.csv")

4.9 Interpreting the DE analysis results

MA-plot

Exercise 4.25 Create an MA-plot using the plotSmear() function, showing the genes selected as differentially expressed with a 1% false discovery rate.

Solution 4.25

de.genes = rownames(resFilt[resFilt$FDR < 0.01, ])

plotSmear(dgeTestFilt, de.tags = de.genes)

[Figure: MA-plot (smear plot). x-axis: Average logCPM (2 to 14); y-axis: logFC (-4 to 2).]

Volcano plot

Exercise 4.26 Create a volcano plot from the resFilt data object. First construct a table containing the log2 fold change and the negative log10-transformed p-values, then generate the volcano plot using the standard plot() command. Hint: the negative log10-transformed value of x is -log10(x).

Solution 4.26

tab = data.frame(logFC = resFilt$logFC, negLogPval = -log10(resFilt$FDR))

## First few rows of the table containing the log2 fold change and the negative
## log10-transformed adjusted p-values (FDR column)

head(tab)

   logFC negLogPval
1 -4.378     179.55
2  2.943     148.04
3 -2.961     147.37
4 -4.129     100.03
5 -2.447      98.52
6 -2.499      92.86

plot(tab, pch = 16, cex = 0.6)

[Figure: Volcano plot. x-axis: logFC (-4 to 2); y-axis: negLogPval (0 to 150).]
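A volcano plot is often easier to read when the genes passing the chosen thresholds are highlighted. The sketch below shows one possible way to do this in base R on a toy table. The data frame `toyRes`, its values and the cut-offs are invented for illustration; with the real data, resFilt would take its place.

```r
# Toy results table (values invented for illustration); in the exercise this
# role is played by resFilt.
toyRes <- data.frame(
  logFC = c(-4.4, 2.9, -0.2, 0.1, 1.8, -1.6),
  FDR   = c(1e-50, 1e-30, 0.7, 0.4, 0.03, 0.002)
)
toyRes$negLogFDR <- -log10(toyRes$FDR)

# Flag genes that pass both thresholds (cut-offs are arbitrary choices).
isSig <- toyRes$FDR < 0.01 & abs(toyRes$logFC) > 1.5

plot(toyRes$logFC, toyRes$negLogFDR, pch = 16,
     col = ifelse(isSig, "red", "grey50"),
     xlab = "logFC", ylab = "-log10(FDR)")
abline(h = -log10(0.01), v = c(-1.5, 1.5), lty = 2)  # threshold guides
```

The dashed lines mark the thresholds, so the highlighted genes are those in the upper-left and upper-right corners of the plot.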

Gene clustering

To draw a CIM (clustered image map, or heatmap) of individual RNA-seq samples, edgeR suggests using moderated log-counts-per-million.

This can be calculated by the cpm() function with positive values for prior.count, for example

y = cpm(dge, prior.count = 1, log = TRUE)

where dge is the normalized DGEList object. This produces a matrix of log₂ counts-per-million (logCPM), with undefined values avoided and the poorly defined log-fold-changes for low counts shrunk towards zero. Larger values for prior.count produce more shrinkage.
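The effect of the prior count can be seen in a simplified base-R sketch. The helper `toyLogCpm()` is hypothetical and is not an edgeR function; the way it scales the prior by library size is an assumption made here for illustration, and it ignores normalization factors. The count matrix is invented.

```r
# Toy count matrix (invented): 3 genes x 2 samples with different depths.
counts <- matrix(c(0, 10, 999990,
                   0, 20, 1999980),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2")))

# Hypothetical helper, not an edgeR function.
toyLogCpm <- function(counts, prior.count = 1) {
  libSize <- colSums(counts)
  # Scale the prior by relative library size so that a zero count maps to the
  # same value in every sample.
  priorScaled <- prior.count * libSize / mean(libSize)
  libAdj <- libSize + 2 * priorScaled
  # t() lets the per-sample vectors recycle along the sample dimension.
  t(log2((t(counts) + priorScaled) / libAdj * 1e6))
}

toyLogCpm(counts, prior.count = 1)
toyLogCpm(counts, prior.count = 5)  # low counts are pulled closer together
```

Because the prior is added before taking the logarithm, zero counts stay finite, and the gap between geneA (zero counts) and geneB (low counts) shrinks as prior.count grows, while the highly expressed geneC is barely affected.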

Exercise 4.27 Transform the normalized counts from the dge data object using the cpm() function.

Solution 4.27

cpm.mat = cpm(dge, prior.count = 1, log = TRUE)

head(cpm.mat)

            treated2 treated3 untreated3 untreated4
FBgn0000003  -3.2526  -2.3226     -3.253     -3.253
FBgn0000008   3.2056   2.7556      3.195      2.894
FBgn0000015  -3.2526  -3.2526     -2.158     -1.670
FBgn0000017   8.3151   8.3073      8.730      8.366
FBgn0000018   4.9585   4.8757      4.873      5.025
FBgn0000024  -0.2681  -0.7863     -1.113     -1.255

Since the clustering is only relevant for genes that are actually differentially expressed, carry it out only for the subset of genes with the strongest differential expression.

Exercise 4.28 Select those genes that have adjusted p-values below 0.01 and absolute log2-fold-change above 1.5 from the transformed data, and use the function cim() to produce a CIM.

Solution 4.28

deg.idx = (abs(resFilt$logFC) > 1.5 & resFilt$FDR < 0.01)

deg = rownames(resFilt)[deg.idx]

cpm.mat = cpm.mat[deg, ]

Transformed values for the first six selected genes.

head(cpm.mat)

            treated2 treated3 untreated3 untreated4
FBgn0039155   2.0155    2.335      6.466      6.608
FBgn0025111   7.9476    7.996      5.059      5.006
FBgn0003360   5.9800    5.878      8.802      8.970
FBgn0039827   0.6377    1.516      5.252      5.212
FBgn0026562  10.1798   10.247     12.544     12.769
FBgn0035085   3.8172    3.843      6.279      6.363

Use col = seqColor for sequential colour schemes in the cim() function.

## For sequential colour schemes

library(RColorBrewer)

seqColor = colorRampPalette(brewer.pal(9, "Blues"))(255)

cim(t(cpm.mat), dendrogram = "column", col = seqColor, symkey = FALSE)

[Figure: CIM of the selected genes (columns, labelled with FBgn identifiers, with a column dendrogram) across the samples treated2, treated3, untreated3 and untreated4 (rows). Color key: -2.32 to 12.77.]

Source: Statistical analysis of RNA-Seq data (solutions), pages 25-30.