Cross-validation Exercises
Exercises
Load the following dataset:
library(GSE5859Subset)
data(GSE5859Subset)
And define the outcome and predictors. To make the problem more difficult, we will only consider autosomal genes:
y = factor(sampleInfo$group)
X = t(geneExpression)
out = which(geneAnnotation$CHR%in%c("chrX","chrY"))
X = X[,-out]

Use the createFolds function in the caret package to create 10 folds of y, storing the result as idx. Set the seed to 1 with set.seed(1) immediately before creating the folds.
Question: What is the 2nd entry in fold 3?
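A minimal sketch of fold creation, run on a simulated outcome rather than the exercise data so that it stands alone. The simulated y (two balanced groups of 12) is an assumption made for illustration; createFolds with a factor outcome returns, by default, a list of held-out row indices.

```r
library(caret)

# Simulated stand-in for the outcome: 24 samples, two balanced groups
y <- factor(rep(c(0, 1), each = 12))

set.seed(1)                     # fix the seed so the folds are reproducible
idx <- createFolds(y, k = 10)   # list of 10 integer vectors of held-out rows

sapply(idx, length)             # size of each fold
idx[[3]][2]                     # 2nd entry of fold 3
```

Each sample index appears in exactly one fold, so the folds partition the rows of the data.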

We are going to use kNN. We are going to consider a smaller set of predictors by filtering genes using t-tests. Specifically, we will perform a t-test for each gene and select the m genes with the smallest p-values.
Let m be the number of genes kept and k the number of neighbors, and train kNN leaving out the second fold, idx[[2]]. How many mistakes do we make on the test set? Remember that it is indispensable to perform the t-test on the training data only.
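The filtering step and the held-out prediction can be sketched as follows, on simulated data so the block runs on its own. The values m = 8 and k = 5, the simulated X and y, and the base-R fold construction (a stand-in for createFolds) are all assumptions chosen for illustration. The key point is that the p-values are computed from the training rows only.

```r
library(class)   # knn() comes from the recommended 'class' package

set.seed(1)
n <- 24; p <- 200
y <- factor(rep(c("0", "1"), each = n / 2))
X <- matrix(rnorm(n * p), n, p)                   # rows = samples, columns = genes
X[y == "1", 1:10] <- X[y == "1", 1:10] + 2        # first 10 genes carry signal

# Base-R folds, used here as a stand-in for caret's createFolds
idx <- split(sample(n), rep(1:10, length.out = n))

m <- 8; k <- 5                                    # illustrative values only
test  <- idx[[2]]
train <- setdiff(seq_len(n), test)
ytr   <- y[train]

# One t-test per gene, using the training samples only
pvals <- apply(X[train, ], 2, function(g)
  t.test(g[ytr == "0"], g[ytr == "1"])$p.value)
keep <- order(pvals)[seq_len(m)]                  # m genes with smallest p-values

pred <- knn(train = X[train, keep, drop = FALSE],
            test  = X[test,  keep, drop = FALSE],
            cl = ytr, k = k)
mistakes <- sum(pred != y[test])                  # errors on the held-out fold
```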
Now run through all 10 folds. What is our error rate (total number of mistakes divided by total number of predictions)?
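One way to organize the loop over folds is to wrap the per-fold work in a function and apply it to every fold. The sketch below uses simulated data; the helper name fold_errors, the values of m and k, and the base-R folds are assumptions for illustration, not part of the exercise.

```r
library(class)

set.seed(1)
n <- 24; p <- 200
y <- factor(rep(c("0", "1"), each = n / 2))
X <- matrix(rnorm(n * p), n, p)
X[y == "1", 1:10] <- X[y == "1", 1:10] + 2        # simulated signal genes

idx <- split(sample(n), rep(1:10, length.out = n))  # stand-in for createFolds
m <- 8; k <- 5                                      # illustrative values only

# Hypothetical helper: mistakes made on one held-out fold,
# with the t-test filter applied to the training rows only
fold_errors <- function(test, X, y, m, k) {
  train <- setdiff(seq_along(y), test)
  ytr <- y[train]
  pvals <- apply(X[train, , drop = FALSE], 2, function(g)
    t.test(g[ytr == levels(y)[1]], g[ytr == levels(y)[2]])$p.value)
  keep <- order(pvals)[seq_len(m)]
  pred <- knn(X[train, keep, drop = FALSE], X[test, keep, drop = FALSE],
              cl = ytr, k = k)
  sum(pred != y[test])
}

errors <- sapply(idx, fold_errors, X = X, y = y, m = m, k = k)
error_rate <- sum(errors) / length(y)   # every sample is predicted exactly once
```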

Now we are going to select the best values of k and m. Use the expand.grid function to try out the following values:
ms=2^c(1:11)
ks=seq(1,9,2)
params = expand.grid(k=ks,m=ms)
Now use apply or a for-loop to obtain the cross-validated error rate for each of these pairs of parameters. Which pair of parameters minimizes the error rate?
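A sketch of the parameter search on simulated data. To keep it quick, the grid here is smaller than the one in the exercise (the simulated matrix has only 200 genes), and the genes are ranked once per fold and reused for every pair of parameters. The names ranked and cv_error, the reduced grid, and the base-R folds are assumptions made for illustration.

```r
library(class)

set.seed(1)
n <- 24; p <- 200
y <- factor(rep(c("0", "1"), each = n / 2))
X <- matrix(rnorm(n * p), n, p)
X[y == "1", 1:10] <- X[y == "1", 1:10] + 2          # simulated signal genes

idx <- split(sample(n), rep(1:10, length.out = n))  # stand-in for createFolds

# Rank genes by p-value once per fold, using that fold's training rows only
ranked <- lapply(idx, function(test) {
  train <- setdiff(seq_len(n), test)
  ytr <- y[train]
  pvals <- apply(X[train, ], 2, function(g)
    t.test(g[ytr == "0"], g[ytr == "1"])$p.value)
  order(pvals)
})

# Cross-validated error rate for one (k, m) pair
cv_error <- function(k, m) {
  errs <- mapply(function(test, ord) {
    train <- setdiff(seq_len(n), test)
    keep <- ord[seq_len(m)]
    pred <- knn(X[train, keep, drop = FALSE], X[test, keep, drop = FALSE],
                cl = y[train], k = k)
    sum(pred != y[test])
  }, idx, ranked)
  sum(errs) / n
}

ms <- 2^(1:5)                     # reduced grid for this sketch
ks <- seq(1, 9, 2)
params <- expand.grid(k = ks, m = ms)
params$error <- apply(params, 1, function(pr)
  cv_error(k = pr[["k"]], m = pr[["m"]]))
best <- params[which.min(params$error), ]   # first pair attaining the minimum
```

Note that which.min returns only the first minimizer; if several pairs tie, inspecting params sorted by error shows all of them.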

Repeat exercise 4 (the search over k and m), but now perform the t-test filtering before the cross-validation, that is, using all samples at once. Note how this biases the entire result and gives us much lower estimated error rates.
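The source of the bias can be illustrated with pure noise, where no classifier can truly do better than a 0.5 error rate. The sketch below compares filtering once on all samples with filtering inside each fold; the simulated data, the values of m and k, the helper names, and the base-R folds are assumptions for illustration. With filtering done up front, the held-out samples have already influenced which genes were chosen, so the estimate would typically come out optimistically low, though the exact numbers depend on the seed.

```r
library(class)

set.seed(1)
n <- 24; p <- 500
y <- factor(rep(c("0", "1"), each = n / 2))
X <- matrix(rnorm(n * p), n, p)        # pure noise: no gene is related to y

idx <- split(sample(n), rep(1:10, length.out = n))  # stand-in for createFolds
m <- 8; k <- 5                                      # illustrative values only

# Indices of the m genes with the smallest t-test p-values on the given rows
top_genes <- function(rows) {
  yr <- y[rows]
  pvals <- apply(X[rows, ], 2, function(g)
    t.test(g[yr == "0"], g[yr == "1"])$p.value)
  order(pvals)[seq_len(m)]
}

cv_error <- function(filter_inside) {
  # Filtering on all rows lets the test folds leak into gene selection
  keep_all <- if (filter_inside) NULL else top_genes(seq_len(n))
  errs <- sapply(idx, function(test) {
    train <- setdiff(seq_len(n), test)
    keep <- if (filter_inside) top_genes(train) else keep_all
    pred <- knn(X[train, keep, drop = FALSE], X[test, keep, drop = FALSE],
                cl = y[train], k = k)
    sum(pred != y[test])
  })
  sum(errs) / n
}

err_before_cv <- cv_error(filter_inside = FALSE)  # filter first, then cross-validate
err_inside_cv <- cv_error(filter_inside = TRUE)   # filter within each training set
```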

Repeat exercise 3, but now, instead of sampleInfo$group, use
y = factor(as.numeric(format( sampleInfo$date, "%m")=="06"))
What is the minimum error rate now?
We achieve much lower error rates when predicting date than when predicting the group. Because group is confounded with date, it is very possible that these predictors have no information about group and that our lower-than-0.5 error rates are due to the confounding with date. We will learn more about this in the batch effect chapter.