Clustering and Heatmaps Exercise

{pagebreak}

Exercises

Create a random matrix with no correlation in the following way:
```
 set.seed(1)
 m = 10000
 n = 24
 x = matrix(rnorm(m*n),m,n)
 colnames(x)=1:n
```
Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram.
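As a minimal sketch of this step, assuming Euclidean distance between columns (the default for `dist`) and the default linkage of `hclust`, the clustering and dendrogram could be produced as follows. The variable names `d` and `hc` are illustrative only.

```r
# Recreate the random matrix from the exercise
set.seed(1)
m <- 10000
n <- 24
x <- matrix(rnorm(m * n), m, n)
colnames(x) <- 1:n

# dist() computes distances between rows, so transpose to compare columns
d <- dist(t(x))

# hclust() with default parameters uses complete linkage
hc <- hclust(d)

# Draw the dendrogram; leaves are labelled with the column names
plot(hc)
```

The height at which two leaves are first joined in the tree is the dendrogram's notion of how far apart they are.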

In the dendrogram, which pair of samples is the furthest away from each other?
- A) 7 and 23
- B) 19 and 14
- C) 1 and 16
- D) 17 and 18

Set the seed at 1 with set.seed(1) and replicate the creation of this matrix:
```
 m = 10000
 n = 24
 x = matrix(rnorm(m*n),m,n)
```
then perform hierarchical clustering as in the solution to exercise 1, and find the number of clusters if you use cutree at height 143. This number is a random variable.
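One possible shape for the Monte Carlo simulation is sketched below. The number of replicates `B` is an assumption made for illustration, since the text does not fix it here, and `num_clusters` is a hypothetical name. The sketch treats the standard deviation of the simulated counts as the estimate of the standard error.

```r
set.seed(1)
m <- 10000
n <- 24
B <- 100  # number of replicates: an assumed value, not taken from the text

# For each replicate: simulate the matrix, cluster its columns,
# cut the tree at height 143, and count the resulting clusters
num_clusters <- replicate(B, {
  x <- matrix(rnorm(m * n), m, n)
  hc <- hclust(dist(t(x)))
  length(unique(cutree(hc, h = 143)))
})

# Distribution of the simulated cluster counts
table(num_clusters)

# Monte Carlo estimate of the standard error
sd(num_clusters)
```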

Based on the Monte Carlo simulation, what is the standard error of this random variable?

Run kmeans on the blood RNA data:
```
 library(GSE5859Subset)
 data(GSE5859Subset)
```
Set the seed to 10 with set.seed(10) right before running kmeans with 5 centers.
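A minimal sketch of this step follows, assuming the `GSE5859Subset` package is installed and that `sampleInfo$date` is stored as a `Date`. Because `kmeans` clusters rows, the expression matrix is transposed so that samples are clustered. Cross-tabulating the cluster labels against columns of `sampleInfo` is one simple way to look for structure; `km` is an illustrative name.

```r
library(GSE5859Subset)
data(GSE5859Subset)  # loads geneExpression, sampleInfo and geneAnnotation

set.seed(10)
# Transpose so that each sample (column) becomes a row to be clustered
km <- kmeans(t(geneExpression), centers = 5)

# Compare cluster membership with the group and with the year-month of the date
table(sampleInfo$group, km$cluster)
table(format(sampleInfo$date, "%Y-%m"), km$cluster)
```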

Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find?
- A) sampleInfo$group is driving the clusters as the 0s and 1s are in completely different clusters.
- B) The year is driving the clusters.
- C) Date is driving the clusters.
- D) The clusters don’t depend on any of the columns of sampleInfo.

Load the data:
```
 library(GSE5859Subset)
 data(GSE5859Subset)
```
Pick the 25 genes with the highest across-sample variance. This function might help:
```
 install.packages("matrixStats")
 library(matrixStats)
 ?rowMads ## we use MADs due to an outlier sample
```
Use heatmap.2 to make a heatmap that shows sampleInfo$group with color, uses the date as column labels, labels the rows with chromosome, and scales the rows.
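The following is a sketch of one way to build such a heatmap, not a definitive solution. It assumes the `gplots` package (which provides `heatmap.2`) is installed, that `geneAnnotation$CHR` holds the chromosome for each row of `geneExpression`, and that `sampleInfo$group` takes two values. The colors and the names `idx` and `group_cols` are arbitrary choices.

```r
library(GSE5859Subset)
library(matrixStats)
library(gplots)  # provides heatmap.2
data(GSE5859Subset)

# Indices of the 25 genes with the largest median absolute deviation
idx <- order(rowMads(geneExpression), decreasing = TRUE)[1:25]

# One color per level of sampleInfo$group
group_cols <- c("steelblue", "orange")[as.numeric(factor(sampleInfo$group))]

heatmap.2(geneExpression[idx, ],
          scale = "row",                          # scale each gene (row)
          trace = "none",                         # hide the trace lines
          ColSideColors = group_cols,             # group shown as a color bar
          labCol = as.character(sampleInfo$date), # dates as column labels
          labRow = geneAnnotation$CHR[idx])       # chromosome as row labels
```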

What do we learn from this heatmap?
- A) The data appears as if it was generated by rnorm.
- B) Some genes in chr1 are very variable.
- C) A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
- D) A group of chrY genes are higher in October compared to June and appear to drive the clustering. Within those clusters there appears to be clustering by sampleInfo$group.

Create a large data set of random data that is completely independent of sampleInfo$group like this:
```
 set.seed(17)
 m = nrow(geneExpression)
 n = ncol(geneExpression)
 x = matrix(rnorm(m*n),m,n)
 g = factor(sampleInfo$group)
```
Create two heatmaps with these data. Show the group g either with labels or colors. First, take the 50 genes with smallest p-values obtained with rowttests. Then, take the 50 genes with largest standard deviations.
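The two selections could be sketched as below, assuming the Bioconductor package `genefilter` (for `rowttests`), `matrixStats` (for `rowSds`) and `gplots` are installed. The group is shown as a color bar above the columns. The names `idx_t`, `idx_sd` and `group_cols` and the colors are illustrative.

```r
library(GSE5859Subset)
library(genefilter)   # provides rowttests
library(matrixStats)  # provides rowSds
library(gplots)       # provides heatmap.2
data(GSE5859Subset)

# Random data with the same dimensions as geneExpression
set.seed(17)
m <- nrow(geneExpression)
n <- ncol(geneExpression)
x <- matrix(rnorm(m * n), m, n)
g <- factor(sampleInfo$group)

group_cols <- c("steelblue", "orange")[as.numeric(g)]

# Heatmap 1: the 50 rows with the smallest t-test p-values
idx_t <- order(rowttests(x, g)$p.value)[1:50]
heatmap.2(x[idx_t, ], scale = "row", trace = "none",
          ColSideColors = group_cols, main = "Smallest p-values")

# Heatmap 2: the 50 rows with the largest standard deviations
idx_sd <- order(rowSds(x), decreasing = TRUE)[1:50]
heatmap.2(x[idx_sd, ], scale = "row", trace = "none",
          ColSideColors = group_cols, main = "Largest SDs")
```

The first selection uses `g` to choose the rows while the second does not, which is worth keeping in mind when comparing the two plots.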

Which of the following statements is true?
- A) There is no relationship between g and x, but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
- B) These two techniques produced similar heatmaps.
- C) Selecting genes with the t-test is a better technique since it permits us to detect the two groups. It appears to find hidden signals.
- D) The genes with the largest standard deviation add variability to the plot and do not let us find the differences between the two groups.