Adjusting with linear models exercises
Exercises
For the dataset we have been working with, models do not help due to the almost perfect confounding. This is one reason we created the subset dataset:
library(GSE5859Subset)
data(GSE5859Subset)
Here we purposely confounded month and group (sex), but not completely:
sex = sampleInfo$group
month = factor( format(sampleInfo$date,"%m"))
table( sampleInfo$group, month)

Using the functions
rowttests
andqvalue
compare the two groups. Because this is a smaller dataset which decreases our power, we will use the more lenient FDR cutoff of 10%.How many gene have qvalues less than 0.1?

Note that
sampleInfo$group
here presents males and females. Thus, we expect differences to be in on chrY and, for genes that escape inactivation, chrX. We do not expect many autosomal genes to be different between males and females. This gives us an opportunity to evaluate false and true positives with experimental data. For example, we evaluate results using the proportion genes of the list that are on chrX or chrY.For the list calculated above, what proportion of this list is on chrX or chrY?

We can also check how many of the chromosomes X and Y genes we detected as different. How many are on Y?

Now for the autosomal genes (not on chrX and chrY) for which qvalue < 0.1, perform a ttest comparing samples processed in June to those processed in October.
What proportion of these have pvalues <0.05 ?

The above result shows that the great majority of the autosomal genes show differences due to processing data. This provides further evidence that confounding is resulting in false positives. So we are going to try to model the month effect to better estimate the sex effect. We are going to use a linear model:
Which of the following creates the appropriate design matrix?
 A)
X = model.matrix(~sex+ethnicity)
 B)
X = cbind(sex,as.numeric(month))
 C) It canâ€™t be done with one line.
 D)
X = model.matrix(~sex+month)
 A)

Now use the
X
defined above, to fit a regression model usinglm
for each gene. You can obtain pvalues for estimated parameters usingsummary
. Here is an exampleX = model.matrix(~sex+month) i = 234 y = geneExpression[i,] fit = lm(y~X) summary(fit)$coef
How many of the qvalues for the group comparison are now <0.1?
Note the big drop from what we obtained without the correction.

With this new list, what proportion of these are chrX and chrY?
Notice the big improvement.

How many on Y or X?

Now from the linear model above, extract the pvalues related to the coefficient representing the October versus June differences using the same linear model.
How many of the qvalues for the month comparison are now <0.1?
This approach is basically the approach implemented by Combat.