First, simple t-tests

In this unit, we will show the difference between using the simple t-test and doing differential expression with the limma hierarchical model. The reference is Smyth 2004, listed in the footnotes.

Here we also show the basic steps for performing a limma analysis. Note that the limma package is very powerful, and has hundreds of pages of documentation which we cannot cover in this course, however we recommend that users wanting to explore further should check out this guide.

We start by loading the spike-in data which was introduced in lecture, which has already been normalized.

# biocLite("SpikeInSubset")
library(SpikeInSubset)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: methods
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: affy
data(rma95)
fac <- factor(rep(1:2,each=3))

We can now perform simple t-tests using the rowttests function in the genefilter package:

library(genefilter)
## 
## Attaching package: 'genefilter'
## The following object is masked from 'package:base':
## 
##     anyNA
rtt <- rowttests(exprs(rma95),fac)

We will define colors depending on whether the p-value is small, the absolute difference in means is large, and whether the feature is a spike-in value.

mask <- with(rtt, abs(dm) < .2 & p.value < .01)
spike <- rownames(rma95) %in% colnames(pData(rma95))
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))

We now plot the results, using the colors defined above. We multiply the dm by -1, because we are interested in the difference from the second group to the first (this is the difference used by lm and the limma package by default). The spike-in genes are in blue, which have mostly small p-value and large difference in means. The red points indicate genes which have small p-values but also small differences in means. We will see how these points change after using limma.

with(rtt, plot(-dm, -log10(p.value), cex=.8, pch=16,
     xlim=c(-1,1), ylim=c(0,5),
     xlab="difference in means",
     col=cols))
abline(h=2,v=c(-.2,.2), lty=2)

plot of chunk unnamed-chunk-4

Note that the red genes have mostly low estimates of standard deviation.

rtt$s <- apply(exprs(rma95), 1, function(row) sqrt(.5 * (var(row[1:3]) + var(row[4:6]))))
with(rtt, plot(s, -log10(p.value), cex=.8, pch=16,
              log="x",xlab="estimate of standard deviation",
              col=cols))

plot of chunk unnamed-chunk-5

limma steps

The following three steps perform the basic limma analysis. We specify coef=2 because we are interested in the difference between groups, not the intercept.

library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
fit <- lmFit(rma95, design=model.matrix(~ fac))
colnames(coef(fit))
## [1] "(Intercept)" "fac2"
fit <- eBayes(fit)
tt <- topTable(fit, coef=2)
tt
##               logFC   AveExpr          t      P.Value    adj.P.Val
## 1708_at  -7.0610613  7.945276 -73.529269 7.816370e-17 9.868948e-13
## 36202_at  0.8525527  9.373033   9.975114 4.935683e-07 3.115897e-03
## 36311_at  0.8318298  8.564315   8.363252 3.017008e-06 1.269758e-02
## 33264_at  0.7118997  4.918953   7.434888 9.666328e-06 2.706595e-02
## 32660_at  0.6554022  8.680132   7.356180 1.071834e-05 2.706595e-02
## 38734_at  0.7467142  6.255772   7.185131 1.345115e-05 2.830571e-02
## 1024_at   0.8426550  9.697281   6.730664 2.503461e-05 4.400123e-02
## 36085_at  0.6449402 12.193130   6.653830 2.787976e-05 4.400123e-02
## 33818_at  0.5321749 12.285643   6.454504 3.699480e-05 5.189960e-02
## 39058_at  0.6090625  7.534532   6.278815 4.767986e-05 5.687699e-02
##                 B
## 1708_at  8.646866
## 36202_at 4.587736
## 36311_at 3.567790
## 33264_at 2.835849
## 32660_at 2.768151
## 38734_at 2.617789
## 1024_at  2.195944
## 36085_at 2.121308
## 33818_at 1.923063
## 39058_at 1.742696

topTable will return the top genes ranked by whichever value you define. You can also ask topTable to return all the values, sorted by "none". Note that a column automatically is included which gives the adjusted p-values for each gene. By default the method of Benjamini-Hochberg is used, by calling the p.adjust function.

# ?topTable
dim(topTable(fit, coef=2, number=Inf, sort.by="none"))
## [1] 12626     6
# ?p.adjust

Here we will compare the previous volcano plot with the limma results. Note that the red points are now all under the line where -log10(p.value) is equal to 2. Also, the blue points which represent real differences have p-values which are even higher than before.

limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=fit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
     col=cols,xlab="difference in means",
     xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)

plot of chunk unnamed-chunk-8

Finally, we will construct a plot which shows how limma shrinks the variance estimates towards a common value, eliminating false positives which might arise from too-low estimates of variance.

Here we pick, for each of 40 bins of different variance estimates, a single gene which falls in that bin. We remove bins which do not have any such genes.

n <- 40
qs <- seq(from=0,to=.2,length=n)
idx <- sapply(seq_len(n),function(i) which(as.integer(cut(rtt$s^2,qs)) == i)[1])
idx <- idx[!is.na(idx)]

Now we will plot a line, from the initial estimate of variance for these genes to the estimate after running limma.

par(mar=c(5,5,2,2))
plot(1,1,xlim=c(0,.21),ylim=c(0,1),type="n",
     xlab="variance estimates",ylab="",yaxt="n")
axis(2,at=c(.1,.9),c("before","after"),las=2)
segments((rtt$s^2)[idx],rep(.1,n),
         fit$s2.post[idx],rep(.9,n))

plot of chunk unnamed-chunk-10

Footnotes

Smyth GK, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments”. Stat Appl Genet Mol Biol. 2004 http://www.ncbi.nlm.nih.gov/pubmed/16646809