Clustering and Heatmaps Exercise
{pagebreak}
Exercises
- Create a random matrix with no correlation in the following way:

      set.seed(1)
      m = 10000
      n = 24
      x = matrix(rnorm(m*n), m, n)
      colnames(x) = 1:n

  Run hierarchical clustering on this data with the `hclust` function with default parameters to cluster the columns. Create a dendrogram.

  In the dendrogram, which pair of samples is the furthest apart?

  - A) 7 and 23
  - B) 19 and 14
  - C) 1 and 16
  - D) 17 and 18
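The clustering step in this exercise can be sketched as follows. This is a minimal sketch rather than a reference solution: it assumes the Euclidean distance that `dist` uses by default, and the names `d` and `hc` are introduced here only for illustration. Because `dist` computes distances between rows, the matrix is transposed so that the columns (samples) are clustered.

```r
# Recreate the matrix from the exercise
set.seed(1)
m <- 10000
n <- 24
x <- matrix(rnorm(m * n), m, n)
colnames(x) <- 1:n

# dist() works on rows, so transpose to get distances between columns
d <- dist(t(x))

# hclust() with default parameters (complete linkage)
hc <- hclust(d)

# Draw the dendrogram; leaves are labelled with the column names
plot(hc)
```

The height at which two leaves are first joined in the plot is what the dendrogram reports as their distance.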
- Set the seed at 1, `set.seed(1)`, and replicate the creation of this matrix:

      m = 10000
      n = 24
      x = matrix(rnorm(m*n), m, n)

  Then perform hierarchical clustering as in the solution to exercise 1, and find the number of clusters if you use `cutree` at height 143. This number is a random variable.

  Based on the Monte Carlo simulation, what is the standard error of this random variable?
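One way to set up such a Monte Carlo simulation is sketched below. The number of replicates `B` is an assumption made here for illustration, and `n_clusters` is a hypothetical name. The sketch uses `sd`, which divides by `B - 1`; a population-style standard deviation would differ slightly.

```r
set.seed(1)
m <- 10000
n <- 24
B <- 100  # number of Monte Carlo replicates (an assumption for this sketch)

# For each replicate: simulate a matrix, cluster its columns, and count
# how many clusters result from cutting the tree at height 143
n_clusters <- replicate(B, {
  x <- matrix(rnorm(m * n), m, n)
  hc <- hclust(dist(t(x)))
  length(unique(cutree(hc, h = 143)))
})

# Distribution of the cluster counts and their spread across replicates
table(n_clusters)
sd(n_clusters)
```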
- Run `kmeans` with 4 centers for the blood RNA data:

      library(GSE5859Subset)
      data(GSE5859Subset)

  Set the seed to 10, `set.seed(10)`, right before running `kmeans` with 5 centers.

  Explore the relationship of clusters and information in `sampleInfo`. Which of the following best describes what you find?

  - A) `sampleInfo$group` is driving the clusters as the 0s and 1s are in completely different clusters.
  - B) The year is driving the clusters.
  - C) Date is driving the clusters.
  - D) The clusters don’t depend on any of the columns of `sampleInfo`.
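Comparing clusters with sample annotations usually comes down to cross-tabulation. The sketch below uses simulated stand-in data so that it runs without the `GSE5859Subset` package: `expr` and `info` are hypothetical substitutes for `geneExpression` and `sampleInfo`, and their dimensions and dates are made up. Note that `kmeans` clusters rows, so the matrix is transposed to cluster samples.

```r
# Stand-in data (hypothetical): 500 features by 24 samples
set.seed(1)
expr <- matrix(rnorm(500 * 24), 500, 24)
info <- data.frame(
  group = rep(0:1, times = 12),
  date  = as.Date("2005-06-01") + sample(0:150, 24, replace = TRUE)
)

# kmeans() clusters rows, so transpose to cluster the samples
set.seed(10)
km <- kmeans(t(expr), centers = 5)

# Cross-tabulate cluster membership against each sample annotation
table(group = info$group, cluster = km$cluster)
table(month = format(info$date, "%m"), cluster = km$cluster)
table(year = format(info$date, "%Y"), cluster = km$cluster)
```

With the real data, the same tables can be built from `sampleInfo$group` and `sampleInfo$date`; a variable that drives the clustering tends to show up as rows concentrated in few columns.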
- Load the data:

      library(GSE5859Subset)
      data(GSE5859Subset)

  Pick the 25 genes with the highest across-sample variance. This function might help:

      install.packages("matrixStats")
      library(matrixStats)
      ?rowMads ## we use MADs due to an outlier sample

  Use `heatmap.2` to make a heatmap showing `sampleInfo$group` with color, the date as labels, the rows labelled with chromosome, and scaling the rows.

  What do we learn from this heatmap?

  - A) The data appears as if it was generated by `rnorm`.
  - B) Some genes in chr1 are very variable.
  - C) A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
  - D) A group of chrY genes are higher in October compared to June and appear to drive the clustering. Within those clusters there appears to be clustering by `sampleInfo$group`.
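The selection and plotting steps can be sketched with base R alone. This sketch deliberately swaps in `stats::heatmap` for `heatmap.2` and `apply(..., 1, mad)` for `rowMads`, so it runs without extra packages; the arguments of `heatmap.2` are similar but not identical. All data here are simulated stand-ins, and the names `expr`, `group`, `dates`, and `chr` are hypothetical.

```r
# Stand-in data (hypothetical): 200 genes by 24 samples
set.seed(1)
expr  <- matrix(rnorm(200 * 24), 200, 24)
group <- rep(0:1, each = 12)
dates <- as.Date("2005-06-01") + seq(0, by = 5, length.out = 24)
chr   <- sample(paste0("chr", c(1:22, "X", "Y")), 200, replace = TRUE)

# Row-wise median absolute deviation, a robust measure of variability
mads <- apply(expr, 1, mad)

# Indices of the 25 most variable genes
idx <- order(mads, decreasing = TRUE)[1:25]

# One color per sample, chosen by group
cols <- c("steelblue", "firebrick")[group + 1]

# Heatmap with scaled rows, group as color bar, dates and chromosomes as labels
hm <- heatmap(expr[idx, ],
              scale = "row",
              ColSideColors = cols,
              labCol = format(dates, "%Y-%m-%d"),
              labRow = chr[idx])
```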
- Create a large data set of random data that is completely independent of `sampleInfo$group` like this:

      set.seed(17)
      m = nrow(geneExpression)
      n = ncol(geneExpression)
      x = matrix(rnorm(m*n), m, n)
      g = factor(sampleInfo$group)

  Create two heatmaps with these data. Show the group `g` either with labels or colors. First, take the 50 genes with smallest p-values obtained with `rowttests`. Then, take the 50 genes with largest standard deviations.

  Which of the following statements is true?

  - A) There is no relationship between `g` and `x`, but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
  - B) These two techniques produced similar heatmaps.
  - C) Selecting genes with the t-test is a better technique since it permits us to detect the two groups. It appears to find hidden signals.
  - D) The genes with the largest standard deviation add variability to the plot and do not let us find the differences between the two groups.
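The two gene-selection strategies can be compared side by side with a sketch like the one below. It is written under several assumptions: the matrix is smaller than the real one (a hypothetical 1,000 rows) so it runs quickly, the group labels are made up, and base R's `t.test` with `var.equal = TRUE` stands in for `rowttests` from the `genefilter` package. `stats::heatmap` is again used in place of `heatmap.2`.

```r
# Stand-in random data (hypothetical sizes), independent of the group labels
set.seed(17)
m <- 1000
n <- 24
x <- matrix(rnorm(m * n), m, n)
g <- factor(rep(0:1, each = n / 2))

# Two-sample equal-variance t-test per row; keep only the p-value
pvals <- apply(x, 1, function(v) {
  t.test(v[g == "0"], v[g == "1"], var.equal = TRUE)$p.value
})

# Row-wise standard deviations
sds <- apply(x, 1, sd)

# Top 50 rows under each selection rule
idx_t  <- order(pvals)[1:50]
idx_sd <- order(sds, decreasing = TRUE)[1:50]

# One color per sample, chosen by group
cols <- c("steelblue", "firebrick")[as.integer(g)]

# Heatmap of rows selected by smallest p-value
hm_t <- heatmap(x[idx_t, ], ColSideColors = cols,
                labCol = as.character(g), main = "Smallest p-values")

# Heatmap of rows selected by largest standard deviation
hm_sd <- heatmap(x[idx_sd, ], ColSideColors = cols,
                 labCol = as.character(g), main = "Largest SDs")
```

Comparing how the color bar lines up with the column dendrogram in each plot is the point of the exercise.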