Clustering and Heatmaps Exercise
{pagebreak}
Exercises

Create a random matrix with no correlation in the following way:
set.seed(1) m = 10000 n = 24 x = matrix(rnorm(m*n),m,n) colnames(x)=1:n
Run hierarchical clustering on this data with the
hclust
function with default parameters to cluster the columns. Create a dendrogram.In the dendrogram, which pairs of samples are the furthest away from each other?
 A) 7 and 23
 B) 19 and 14
 C) 1 and 16
 D) 17 and 18

Set the seed at 1,
set.seed(1)
and replicate the creation of this matrix:m = 10000 n = 24 x = matrix(rnorm(m*n),m,n)
then perform hierarchical clustering as in the solution to exercise 1, and find the number of clusters if you use
cuttree
at height 143. This number is a random variable.Based on the Monte Carlo simulation, what is the standard error of this random variable?

Run
kmeans
with 4 centers for the blood RNA data:library(GSE5859Subset) data(GSE5859Subset)
Set the seed to 10,
set.seed(10)
right before runningkmeans
with 5 centers.Explore the relationship of clusters and information in
sampleInfo
. Which of the following best describes what you find? A)
sampleInfo$group
is driving the clusters as the 0s and 1s are in completely different clusters.  B) The year is driving the clusters.
 C) Date is driving the clusters.
 D) The clusters donâ€™t depend on any of the column of
sampleInfo
 A)

Load the data:
library(GSE5859Subset) data(GSE5859Subset)
Pick the 25 genes with the highest across sample variance. This function might help:
install.packages("matrixStats") library(matrixStats) ?rowMads ##we use mads due to a outlier sample
Use
heatmap.2
to make a heatmap showing thesampleInfo$group
with color, the date as labels, the rows labelled with chromosome, and scaling the rows.What do we learn from this heatmap?
 A) The data appears as if it was generated by
rnorm
.  B) Some genes in chr1 are very variable.
 C) A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
 D) A group of chrY genes are higher in October compared to June and appear to drive the clustering. Within those clusters there appears to be clustering by
samplInfo$group
.
 A) The data appears as if it was generated by

Create a large data set of random data that is completely independent of
sampleInfo$group
like this:set.seed(17) m = nrow(geneExpression) n = ncol(geneExpression) x = matrix(rnorm(m*n),m,n) g = factor(sampleInfo$g )
Create two heatmaps with these data. Show the group g either with labels or colors. First, take the 50 genes with smallest pvalues obtained with
rowttests
. Then, take the 50 genes with largest standard deviations.Which of the following statements is true?
 A) There is no relationship between
g
andx
, but with 8,793 tests some will appear significant by chance. Selecting genes with the ttest gives us a deceiving result.  B) These two techniques produced similar heatmaps.
 C) Selecting genes with the ttest is a better technique since it permits us to detect the two groups. It appears to find hidden signals.
 D) The genes with the largest standard deviation add variability to the plot and do not let us find the differences between the two groups.
 A) There is no relationship between