  1. Create a random matrix with no correlation in the following way:

     m = 10000
     n = 24
     x = matrix(rnorm(m*n),m,n)

    Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram.

    In the dendrogram, which pair of samples is the furthest apart?

    • A) 7 and 23
    • B) 19 and 14
    • C) 1 and 16
    • D) 17 and 18
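
    A minimal sketch of the setup for this exercise, assuming only base R. Since dist computes distances between rows, the matrix is transposed so that the 24 columns are the objects being clustered. No seed is set here, so the dendrogram will differ from run to run; the column labels are an illustrative choice, not part of the exercise.

    ```r
    m <- 10000
    n <- 24
    x <- matrix(rnorm(m * n), m, n)
    colnames(x) <- 1:n   # label samples by column index so leaves are readable

    # dist() works on rows, so transpose to get distances between columns
    d <- dist(t(x))
    hc <- hclust(d)      # default linkage is "complete"
    plot(hc)             # draw the dendrogram
    ```

    The pair of samples that is furthest apart is the pair that only joins at the top-most merge of the tree.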
  2. Set the seed to 1 with set.seed(1), then replicate the creation of this matrix:

     m = 10000
     n = 24
     x = matrix(rnorm(m*n),m,n)

    then perform hierarchical clustering as in the solution to exercise 1, and find the number of clusters you get when you use cutree at height 143. This number is a random variable.

    Based on the Monte Carlo simulation, what is the standard error of this random variable?
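
    One way to set up such a simulation is sketched below. The number of replicates B is an assumption made here for illustration, since the exercise text does not state it, and sd is used as the estimate of the standard error.

    ```r
    set.seed(1)
    m <- 10000
    n <- 24
    B <- 100   # assumed number of Monte Carlo replicates (not given in the text)

    nc <- replicate(B, {
      x <- matrix(rnorm(m * n), m, n)
      hc <- hclust(dist(t(x)))
      # number of distinct clusters after cutting the tree at height 143
      length(unique(cutree(hc, h = 143)))
    })

    table(nc)  # distribution of the cluster count across replicates
    sd(nc)     # Monte Carlo estimate of its standard error
    ```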

  3. Run kmeans with 5 centers for the blood RNA data:


    Set the seed to 10 with set.seed(10) right before running kmeans with 5 centers.

    Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find?

    • A) sampleInfo$group is driving the clusters as the 0s and 1s are in completely different clusters.
    • B) The year is driving the clusters.
    • C) Date is driving the clusters.
    • D) The clusters don’t depend on any of the columns of sampleInfo.
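
    A rough sketch of how the comparison could be set up. The data below are simulated stand-ins so that the code runs on its own; replace them with the real geneExpression matrix and sampleInfo data frame. The date column, and its being of class Date, are assumptions made for illustration.

    ```r
    # Hypothetical stand-in data: genes in rows, samples in columns
    set.seed(1)
    geneExpression <- matrix(rnorm(500 * 24), 500, 24)
    sampleInfo <- data.frame(
      group = rep(0:1, each = 12),
      date  = as.Date("2005-06-01") + sample(0:150, 24, replace = TRUE)
    )

    set.seed(10)
    # kmeans clusters rows, so transpose to cluster the samples
    km <- kmeans(t(geneExpression), centers = 5)

    # Cross-tabulate cluster membership against each candidate variable
    table(cluster = km$cluster, group = sampleInfo$group)
    table(cluster = km$cluster, month = format(sampleInfo$date, "%m"))
    table(cluster = km$cluster, year  = format(sampleInfo$date, "%y"))
    ```

    A variable that drives the clustering shows up as a table in which each cluster is dominated by a single level of that variable.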
  4. Load the data:


    Pick the 25 genes with the highest across-sample variance. This function might help:

     ?rowMads ## we use MADs due to an outlier sample

    Use heatmap.2 to make a heatmap showing the sampleInfo$group with color, the date as labels, the rows labelled with chromosome, and scaling the rows.

    What do we learn from this heatmap?

    • A) The data appears as if it was generated by rnorm.
    • B) Some genes in chr1 are very variable.
    • C) A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
    • D) A group of chrY genes are higher in October compared to June and appear to drive the clustering. Within those clusters there appears to be clustering by sampleInfo$group.
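
    The gene selection and plotting steps might look roughly like the sketch below. To keep it self-contained it swaps in base R: apply with mad in place of rowMads, and heatmap in place of heatmap.2. The data, the date vector, and the chr vector of chromosome labels are all simulated, hypothetical stand-ins for the real objects.

    ```r
    # Hypothetical stand-in data so the sketch runs on its own
    set.seed(1)
    geneExpression <- matrix(rnorm(500 * 24), 500, 24)
    group <- rep(0:1, each = 12)   # stands in for sampleInfo$group
    date  <- as.Date("2005-06-01") + sample(0:150, 24, replace = TRUE)
    chr   <- sample(paste0("chr", c(1:22, "X", "Y")), 500, replace = TRUE)

    # Robust per-gene spread: median absolute deviation of each row
    mads <- apply(geneExpression, 1, mad)
    idx  <- order(mads, decreasing = TRUE)[1:25]   # 25 most variable genes

    heatmap(geneExpression[idx, ],
            scale = "row",                                     # scale each gene
            ColSideColors = c("orange", "purple")[group + 1],  # color by group
            labCol = format(date, "%b %d"),                    # dates as labels
            labRow = chr[idx])                                 # chromosome labels
    ```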
  5. Create a large data set of random data that is completely independent of sampleInfo$group like this:

     m = nrow(geneExpression)
     n = ncol(geneExpression)
     x = matrix(rnorm(m*n),m,n)
     g = factor(sampleInfo$group)

    Create two heatmaps with these data. Show the group g either with labels or colors. First, take the 50 genes with smallest p-values obtained with rowttests. Then, take the 50 genes with largest standard deviations.

    Which of the following statements is true?

    • A) There is no relationship between g and x, but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
    • B) These two techniques produced similar heatmaps.
    • C) Selecting genes with the t-test is a better technique since it permits us to detect the two groups. It appears to find hidden signals.
    • D) The genes with the largest standard deviation add variability to the plot and do not let us find the differences between the two groups.
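
    The two selection strategies can be compared with a sketch like the following. It is self-contained and uses base R only: a per-row pooled-variance t.test is swapped in for rowttests, and heatmap stands in for any heatmap function. The dimensions and the balanced grouping are assumptions made here; with the real data, m, n, and g come from geneExpression and sampleInfo as shown above.

    ```r
    set.seed(1)
    m <- 8793   # number of tests mentioned in option A
    n <- 24     # assumed number of samples
    x <- matrix(rnorm(m * n), m, n)
    g <- factor(rep(0:1, each = n / 2))   # stand-in for the real group factor

    # Per-gene two-sample t-test p-values
    pvals <- apply(x, 1, function(row) {
      t.test(row[g == 0], row[g == 1], var.equal = TRUE)$p.value
    })
    sds <- apply(x, 1, sd)

    top_t  <- order(pvals)[1:50]                   # 50 smallest p-values
    top_sd <- order(sds, decreasing = TRUE)[1:50]  # 50 largest standard deviations

    cols <- c("orange", "purple")[as.integer(g)]   # one color per group
    heatmap(x[top_t, ],  ColSideColors = cols, main = "Smallest p-values")
    heatmap(x[top_sd, ], ColSideColors = cols, main = "Largest SDs")
    ```

    Because x is generated independently of g, any grouping visible in either heatmap is a product of how the genes were selected.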