First, simple t-tests

In this unit, we will show the difference between using the simple t-test and doing differential expression with the limma hierarchical model. The reference is Smyth 2004, listed in the footnotes.

Here we also show the basic steps for performing a limma analysis. Note that the limma package is very powerful, and has hundreds of pages of documentation which we cannot cover in this course, however we recommend that users wanting to explore further should check out this guide.

The spike-in dataset

We start by loading the normalized spike-in data generated by Affymetrix.

# biocLite("SpikeInSubset")
library(SpikeInSubset)
data(rma95)
fac <- factor(rep(1:2,each=3))

Briefly, there are 16 mRNA species for which fixed concentration samples have been prepared and mixed for quantification using the hgu95 array. This subset is such that for each of the spiked in mRNA species, the first trio and second trio of samples in the ExpressionSet have fixed distinct values. This can be seen by examining the pData:

pData(rma95)
##                  37777_at 684_at 1597_at 38734_at 39058_at 36311_at
## 1521a99hpp_av06      0.00   0.25     0.5        1        2        4
## 1532a99hpp_av04      0.00   0.25     0.5        1        2        4
## 2353a99hpp_av08      0.00   0.25     0.5        1        2        4
## 1521b99hpp_av06      0.25   0.50     1.0        2        4        8
## 1532b99hpp_av04      0.25   0.50     1.0        2        4        8
## 2353b99hpp_av08r     0.25   0.50     1.0        2        4        8
##                  36889_at 1024_at 36202_at 36085_at 40322_at 407_at
## 1521a99hpp_av06         8      16       32       64      128   0.00
## 1532a99hpp_av04         8      16       32       64      128   0.00
## 2353a99hpp_av08         8      16       32       64      128   0.00
## 1521b99hpp_av06        16      32       64      128      256   0.25
## 1532b99hpp_av04        16      32       64      128      256   0.25
## 2353b99hpp_av08r       16      32       64      128      256   0.25
##                  1091_at 1708_at 33818_at 546_at
## 1521a99hpp_av06      512    1024      256     32
## 1532a99hpp_av04      512    1024      256     32
## 2353a99hpp_av08      512    1024      256     32
## 1521b99hpp_av06     1024       0      512     64
## 1532b99hpp_av04     1024       0      512     64
## 2353b99hpp_av08r    1024       0      512     64

To get a feel for the response of the array quantifications to this design, consider the following plots for four of the spiked mRNAs.

par(mfrow=c(2,2))
for (i in 1:4) {
  spg = names(pData(rma95))
  plot(1:6, exprs(rma95)[spg[i+6],], main=spg[i+6], ylab="RMA",
    xlab="nominal", axes=FALSE)
  axis(2)
  axis(1, at=1:6, labels=pData(rma95)[[spg[i+6]]])
}

plot of chunk lkpl

Since the RMA (robust multi-array average) quantifications are on a log2 scale, the observed differences in approximate mean of normalized measures seems reasonable. But considerable variability is present for the measures within each concentration setting.

rowttests

We can now perform simple t-tests using the rowttests function in the genefilter package:

library(genefilter)
rtt <- rowttests(exprs(rma95),fac)

We will define colors depending on whether the p-value is small, the absolute difference in means is large, and whether the feature is a spike-in value.

mask <- with(rtt, abs(dm) < .2 & p.value < .01)
spike <- rownames(rma95) %in% colnames(pData(rma95))
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))

We now plot the results, using the colors defined above. We multiply the dm by -1, because we are interested in the difference from the second group to the first (this is the difference used by lm and the limma package by default). The spike-in genes are in blue, which have mostly small p-value and large difference in means. The red points indicate genes which have small p-values but also small differences in means. We will see how these points change after using limma.

with(rtt, plot(-dm, -log10(p.value), cex=.8, pch=16,
     xlim=c(-1,1), ylim=c(0,5),
     xlab="difference in means",
     col=cols))
abline(h=2,v=c(-.2,.2), lty=2)

plot of chunk unnamed-chunk-4

Note that the red genes have mostly low estimates of standard deviation.

rtt$s <- apply(exprs(rma95), 1, function(row) sqrt(.5 * (var(row[1:3]) + var(row[4:6]))))
with(rtt, plot(s, -log10(p.value), cex=.8, pch=16,
              log="x",xlab="estimate of standard deviation",
              col=cols))

plot of chunk unnamed-chunk-5

limma steps

The following three steps perform the basic limma analysis. We specify coef=2 because we are interested in the difference between groups, not the intercept.

library(limma)
options(digits=3)
fit <- lmFit(rma95, design=model.matrix(~ fac))  # step 1 least squares estimates
colnames(coef(fit))     
## [1] "(Intercept)" "fac2"
fit <- eBayes(fit)           # step 2 moderate the t statistics
tt <- topTable(fit, coef=2)  # step 3 report
tt
##           logFC AveExpr      t  P.Value adj.P.Val    B
## 1708_at  -7.061    7.95 -73.53 7.82e-17  9.87e-13 8.65
## 36202_at  0.853    9.37   9.98 4.94e-07  3.12e-03 4.59
## 36311_at  0.832    8.56   8.36 3.02e-06  1.27e-02 3.57
## 33264_at  0.712    4.92   7.43 9.67e-06  2.71e-02 2.84
## 32660_at  0.655    8.68   7.36 1.07e-05  2.71e-02 2.77
## 38734_at  0.747    6.26   7.19 1.35e-05  2.83e-02 2.62
## 1024_at   0.843    9.70   6.73 2.50e-05  4.40e-02 2.20
## 36085_at  0.645   12.19   6.65 2.79e-05  4.40e-02 2.12
## 33818_at  0.532   12.29   6.45 3.70e-05  5.19e-02 1.92
## 39058_at  0.609    7.53   6.28 4.77e-05  5.69e-02 1.74

topTable will return the top genes ranked by whichever value you define. You can also ask topTable to return all the values, sorted by "none". Note that a column automatically is included which gives the adjusted p-values for each gene. By default the method of Benjamini-Hochberg is used, by calling the p.adjust function.

# ?topTable
dim(topTable(fit, coef=2, number=Inf, sort.by="none"))
## [1] 12626     6
# ?p.adjust

Here we will compare the previous volcano plot with the limma results. Note that the red points are now all under the line where -log10(p.value) is equal to 2. Also, the blue points which represent real differences have p-values which are even higher than before.

limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=fit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
     col=cols,xlab="difference in means",
     xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)

plot of chunk unnamed-chunk-8

Finally, we will construct a plot which shows how limma shrinks the variance estimates towards a common value, eliminating false positives which might arise from too-low estimates of variance.

Here we pick, for each of 40 bins of different variance estimates, a single gene which falls in that bin. We remove bins which do not have any such genes.

n <- 40
qs <- seq(from=0,to=.2,length=n)
idx <- sapply(seq_len(n),function(i) which(as.integer(cut(rtt$s^2,qs)) == i)[1])
idx <- idx[!is.na(idx)]

Now we will plot a line, from the initial estimate of variance for these genes to the estimate after running limma.

par(mar=c(5,5,2,2))
plot(1,1,xlim=c(0,.21),ylim=c(0,1),type="n",
     xlab="variance estimates",ylab="",yaxt="n")
axis(2,at=c(.1,.9),c("before","after"),las=2)
segments((rtt$s^2)[idx],rep(.1,n),
         fit$s2.post[idx],rep(.9,n))

plot of chunk unnamed-chunk-10

The statistical details are provided in the lecture on hierarchical models.

Footnotes

Smyth GK, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments”. Stat Appl Genet Mol Biol. 2004 http://www.ncbi.nlm.nih.gov/pubmed/16646809