Inference: t-tests, multiple comparisons
Introduction
In the previous section, we focused on a pair of genes to illustrate two aspects of variation. One of the genes appeared to have high between-mouse variation that was hidden in the act of pooling samples within strain. When strains were compared on the basis of the pooled data, there was an appearance of a significant strain effect for this gene (), but when individual-level data were used to perform the comparison, the strain effect was found to be very weak at best (). The lesson is to recognize that the most scientifically compelling questions concern biological variation, which can only be directly measured with good experimental design. Accurate interpretation of origin and size of biological variation requires appropriate statistical analysis.
In this section we will cover inference in the context of genome-scale experiments. There are several serious conceptual problems:
- there are many tests, often at least one test for each one of tens of thousands of features
- each feature (typically a gene) exhibits its own technical and biological variability
- there may be unmeasured or unreported sources of biological variation (such as time of day)
- many features are inherently interrelated, so the tests are not independent
We will apply some of the concepts we have covered in previous sections including t-tests and multiple comparisons; later we will compute standard deviation estimates from hierarchical models.
We start by loading the pooling experiment data
library(Biobase)
library(maPooling)
data(maPooling)
pd=pData(maPooling)
individuals=which(rowSums(pd)==1)
And extracting the individual mice as well as their strain
individuals=which(rowSums(pd)==1)
individuals=individuals[-grep("tr",names(individuals))]
y=exprs(maPooling)[,individuals]
g=factor(as.numeric(grepl("b",names(individuals))))
T-tests
We can now apply a t-test to each gene using the rowttest
function in the genefilter
package
library(genefilter)
tt=rowttests(y,g)
Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoff. In this particular example we get
NsigAt01 = sum(tt$p.value<0.01)
NsigAt01
## [1] 1578
NsigAt05 = sum(tt$p.value<0.05)
NsigAt05
## [1] 2908
Multiple testing
We described multiple testing in detail in course 3. Here we provide a quick summary.
Do we report all the nominally significant genes identified above? Let’s explore what happens if we split the first group into two, forcing the null hypothesis to be true
set.seed(0)
shuffledIndex <- factor(sample(c(0,1),sum(g==0),replace=TRUE ))
nulltt <- rowttests(y[,g==0],shuffledIndex)
NfalselySigAt01 = sum(nulltt$p.value<0.01)
NfalselySigAt01
## [1] 79
NfalselySigAt05 = sum(nulltt$p.value<0.05)
NfalselySigAt05
## [1] 840
If we use the 0.05 cutoff we will be reporting 840 false positives. We have described several ways to adjust for this including the qvalue
method available in the qvalue package. After this adjustment we acquire
a smaller list of genes.
library(qvalue)
qvals = qvalue(tt$p.value)$qvalue
sum(qvals<0.05)
## [1] 1179
sum(qvals<0.01)
## [1] 538
And now the null case generates no false positives:
library(qvalue)
nullqvals = qvalue(nulltt$p.value)$qvalue
sum(nullqvals<0.05)
## [1] 0
sum(nullqvals<0.01)
## [1] 0
This addresses in a fairly general way the problem of inflating significance claims when performing many hypothesis tests at a fixed nominal level of significance.