Using limma for microarray analysis
First, simple t-tests
In this unit, we will show the difference between using the simple t-test and doing differential expression with the limma hierarchical model. The reference is Smyth 2004, listed in the footnotes.
Here we also show the basic steps for performing a limma
analysis. Note that the limma
package is very powerful, and has hundreds of pages of documentation which we cannot cover in this course, however we recommend that users wanting to explore further should check out this guide.
The spike-in dataset
We start by loading the normalized spike-in data generated by Affymetrix.
# biocLite("SpikeInSubset")
library(SpikeInSubset)
data(rma95)
fac <- factor(rep(1:2,each=3))
Briefly, there are 16 mRNA species for which fixed concentration samples have been prepared and mixed for quantification using the hgu95 array. This subset is such that for each of the spiked in mRNA species, the first trio and second trio of samples in the ExpressionSet have fixed distinct values. This can be seen by examining the pData:
pData(rma95)
## 37777_at 684_at 1597_at 38734_at 39058_at 36311_at
## 1521a99hpp_av06 0.00 0.25 0.5 1 2 4
## 1532a99hpp_av04 0.00 0.25 0.5 1 2 4
## 2353a99hpp_av08 0.00 0.25 0.5 1 2 4
## 1521b99hpp_av06 0.25 0.50 1.0 2 4 8
## 1532b99hpp_av04 0.25 0.50 1.0 2 4 8
## 2353b99hpp_av08r 0.25 0.50 1.0 2 4 8
## 36889_at 1024_at 36202_at 36085_at 40322_at 407_at
## 1521a99hpp_av06 8 16 32 64 128 0.00
## 1532a99hpp_av04 8 16 32 64 128 0.00
## 2353a99hpp_av08 8 16 32 64 128 0.00
## 1521b99hpp_av06 16 32 64 128 256 0.25
## 1532b99hpp_av04 16 32 64 128 256 0.25
## 2353b99hpp_av08r 16 32 64 128 256 0.25
## 1091_at 1708_at 33818_at 546_at
## 1521a99hpp_av06 512 1024 256 32
## 1532a99hpp_av04 512 1024 256 32
## 2353a99hpp_av08 512 1024 256 32
## 1521b99hpp_av06 1024 0 512 64
## 1532b99hpp_av04 1024 0 512 64
## 2353b99hpp_av08r 1024 0 512 64
To get a feel for the response of the array quantifications to this design, consider the following plots for four of the spiked mRNAs.
par(mfrow=c(2,2))
for (i in 1:4) {
spg = names(pData(rma95))
plot(1:6, exprs(rma95)[spg[i+6],], main=spg[i+6], ylab="RMA",
xlab="nominal", axes=FALSE)
axis(2)
axis(1, at=1:6, labels=pData(rma95)[[spg[i+6]]])
}
Since the RMA (robust multi-array average) quantifications are on a log2 scale, the observed differences in approximate mean of normalized measures seems reasonable. But considerable variability is present for the measures within each concentration setting.
rowttests
We can now perform simple t-tests using the rowttests
function in the genefilter
package:
library(genefilter)
rtt <- rowttests(exprs(rma95),fac)
We will define colors depending on whether the p-value is small, the absolute difference in means is large, and whether the feature is a spike-in value.
mask <- with(rtt, abs(dm) < .2 & p.value < .01)
spike <- rownames(rma95) %in% colnames(pData(rma95))
cols <- ifelse(mask,"red",ifelse(spike,"dodgerblue","black"))
We now plot the results, using the colors defined above. We multiply the dm
by -1, because we are interested in the difference from the second group to the first (this is the difference used by lm
and the limma
package by default). The spike-in genes are in blue, which have mostly small p-value and large difference in means. The red points indicate genes which have small p-values but also small differences in means. We will see how these points change after using limma
.
with(rtt, plot(-dm, -log10(p.value), cex=.8, pch=16,
xlim=c(-1,1), ylim=c(0,5),
xlab="difference in means",
col=cols))
abline(h=2,v=c(-.2,.2), lty=2)
Note that the red genes have mostly low estimates of standard deviation.
rtt$s <- apply(exprs(rma95), 1, function(row) sqrt(.5 * (var(row[1:3]) + var(row[4:6]))))
with(rtt, plot(s, -log10(p.value), cex=.8, pch=16,
log="x",xlab="estimate of standard deviation",
col=cols))
limma steps
The following three steps perform the basic limma
analysis. We specify coef=2
because we are interested in the difference between groups, not the intercept.
library(limma)
options(digits=3)
fit <- lmFit(rma95, design=model.matrix(~ fac)) # step 1 least squares estimates
colnames(coef(fit))
## [1] "(Intercept)" "fac2"
fit <- eBayes(fit) # step 2 moderate the t statistics
tt <- topTable(fit, coef=2) # step 3 report
tt
## logFC AveExpr t P.Value adj.P.Val B
## 1708_at -7.061 7.95 -73.53 7.82e-17 9.87e-13 8.65
## 36202_at 0.853 9.37 9.98 4.94e-07 3.12e-03 4.59
## 36311_at 0.832 8.56 8.36 3.02e-06 1.27e-02 3.57
## 33264_at 0.712 4.92 7.43 9.67e-06 2.71e-02 2.84
## 32660_at 0.655 8.68 7.36 1.07e-05 2.71e-02 2.77
## 38734_at 0.747 6.26 7.19 1.35e-05 2.83e-02 2.62
## 1024_at 0.843 9.70 6.73 2.50e-05 4.40e-02 2.20
## 36085_at 0.645 12.19 6.65 2.79e-05 4.40e-02 2.12
## 33818_at 0.532 12.29 6.45 3.70e-05 5.19e-02 1.92
## 39058_at 0.609 7.53 6.28 4.77e-05 5.69e-02 1.74
topTable
will return the top genes ranked by whichever value you define. You can also ask topTable to return all the values, sorted by "none"
. Note that a column automatically is included which gives the adjusted p-values for each gene. By default the method of Benjamini-Hochberg is used, by calling the p.adjust
function.
# ?topTable
dim(topTable(fit, coef=2, number=Inf, sort.by="none"))
## [1] 12626 6
# ?p.adjust
Here we will compare the previous volcano plot with the limma
results. Note that the red points are now all under the line where -log10(p.value)
is equal to 2. Also, the blue points which represent real differences have p-values which are even higher than before.
limmares <- data.frame(dm=coef(fit)[,"fac2"], p.value=fit$p.value[,"fac2"])
with(limmares, plot(dm, -log10(p.value),cex=.8, pch=16,
col=cols,xlab="difference in means",
xlim=c(-1,1), ylim=c(0,5)))
abline(h=2,v=c(-.2,.2), lty=2)
Finally, we will construct a plot which shows how limma
shrinks the variance estimates towards a common value, eliminating false positives which might arise from too-low estimates of variance.
Here we pick, for each of 40 bins of different variance estimates, a single gene which falls in that bin. We remove bins which do not have any such genes.
n <- 40
qs <- seq(from=0,to=.2,length=n)
idx <- sapply(seq_len(n),function(i) which(as.integer(cut(rtt$s^2,qs)) == i)[1])
idx <- idx[!is.na(idx)]
Now we will plot a line, from the initial estimate of variance for these genes to the estimate after running limma
.
par(mar=c(5,5,2,2))
plot(1,1,xlim=c(0,.21),ylim=c(0,1),type="n",
xlab="variance estimates",ylab="",yaxt="n")
axis(2,at=c(.1,.9),c("before","after"),las=2)
segments((rtt$s^2)[idx],rep(.1,n),
fit$s2.post[idx],rep(.9,n))
The statistical details are provided in the lecture on hierarchical models.
Footnotes
Smyth GK, “Linear models and empirical bayes methods for assessing differential expression in microarray experiments”. Stat Appl Genet Mol Biol. 2004 http://www.ncbi.nlm.nih.gov/pubmed/16646809