{pagebreak}

## Exercises

1. Create a random matrix with no correlation in the following way:

 set.seed(1)
m = 10000
n = 24
x = matrix(rnorm(m*n),m,n)
colnames(x)=1:n


Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram.

In the dendrogram, which pairs of samples are the furthest away from each other?

• A) 7 and 23
• B) 19 and 14
• C) 1 and 16
• D) 17 and 18
2. Set the seed at 1, set.seed(1) and replicate the creation of this matrix:

 m = 10000
n = 24
x = matrix(rnorm(m*n),m,n)


then perform hierarchical clustering as in the solution to exercise 1, and find the number of clusters if you use cuttree at height 143. This number is a random variable.

Based on the Monte Carlo simulation, what is the standard error of this random variable?

3. Run kmeans with 4 centers for the blood RNA data:

 library(GSE5859Subset)
data(GSE5859Subset)


Set the seed to 10, set.seed(10) right before running kmeans with 5 centers.

Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find?

• A) sampleInfo$group is driving the clusters as the 0s and 1s are in completely different clusters. • B) The year is driving the clusters. • C) Date is driving the clusters. • D) The clusters don’t depend on any of the column of sampleInfo 4. Load the data:  library(GSE5859Subset) data(GSE5859Subset)  Pick the 25 genes with the highest across sample variance. This function might help:  install.packages("matrixStats") library(matrixStats) ?rowMads ##we use mads due to a outlier sample  Use heatmap.2 to make a heatmap showing the sampleInfo$group with color, the date as labels, the rows labelled with chromosome, and scaling the rows.

What do we learn from this heatmap?

• A) The data appears as if it was generated by rnorm.
• B) Some genes in chr1 are very variable.
• C) A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
• D) A group of chrY genes are higher in October compared to June and appear to drive the clustering. Within those clusters there appears to be clustering by samplInfo$group. 5. Create a large data set of random data that is completely independent of sampleInfo$group like this:

 set.seed(17)
m = nrow(geneExpression)
n = ncol(geneExpression)
x = matrix(rnorm(m*n),m,n)
g = factor(sampleInfo\$g )


Create two heatmaps with these data. Show the group g either with labels or colors. First, take the 50 genes with smallest p-values obtained with rowttests. Then, take the 50 genes with largest standard deviations.

Which of the following statements is true?

• A) There is no relationship between g and x, but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
• B) These two techniques produced similar heatmaps.
• C) Selecting genes with the t-test is a better technique since it permits us to detect the two groups. It appears to find hidden signals.
• D) The genes with the largest standard deviation add variability to the plot and do not let us find the differences between the two groups.