Inference: t-tests, multiple comparisons
In this section we will cover inference in the context of genomics experiments. We apply some of the concepts we have covered in previous sections including t-tests and multiple comparisons; later we will compute standard deviation estimates from hierarchical models.
We start by loading the pooling experiment data
library(Biobase) library(maPooling) data(maPooling) pd=pData(maPooling) individuals=which(rowSums(pd)==1)
And extracting the individual mice as well as their strain
individuals=which(rowSums(pd)==1) individuals=individuals[-grep("tr",names(individuals))] y=exprs(maPooling)[,individuals] g=factor(as.numeric(grepl("b",names(individuals))))
We can now apply a t-test to each gene using the
rowttest function in the
## ## Attaching package: 'genefilter'
## The following object is masked from 'package:base': ## ## anyNA
Now which genes do we report as statistically significant? For somewhat arbitrary reasons, in science p-values of 0.01 and 0.05 are used as cutoff. In this particular example we get
##  1578
##  2908
We described multiple testing in detail in course 3. Here we provide a quick summary.
Do we report all these genes? Let’s explore what happens if we split the first group into two, forcing the null hypothesis to be true
set.seed(0) shuffledIndex <- factor(sample(c(0,1),sum(g==0),replace=TRUE )) nulltt <- rowttests(y[,g==0],shuffledIndex) sum(nulltt$p.value<0.01)
##  79
##  840
If we use the 0.05 cutoff we will be reporting 840 false positives. We have described several ways to adjust for this include the
qvalue method available in the
qvalue package. After this adjustment we include a smaller list of genes.
library(qvalue) qvals = qvalue(tt$p.value)$qvalue sum(qvals<0.05)
##  1179
##  538
And now the null case generates fewer false positives:
library(qvalue) nullqvals = qvalue(nulltt$p.value)$qvalue sum(nullqvals<0.05)
##  0
##  0