RNA-seq gene-level analysis

Introduction

RNA-seq is a valuable experiment for quantifying both the types and the amount of RNA molecules in a sample. We’ve covered the basic idea of the protocol in lectures, but some early references for RNA-seq include Mortazavi (2008) and Marioni (2008).

In this lab, we will focus on comparing the expression levels of genes across different samples, by counting the number of reads which overlap the exons of genes defined by a known annotation. As described in the lecture, this analysis sets aside the task of estimating the different kinds of RNA molecules, and the different isoforms for genes with multiple isoforms. One advantage of looking at these matrices of counts is that we can use statistical distributions to model how the variance of counts will change when the counts are low vs high. We will explore the relationship of the variance of counts to the mean later in this lab.

Counting reads in genes

In this lab we will examine 8 samples from the airway package, which are from the paper by Himes et al: “RNA-seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells”.

This lab will focus on a summarized version of an RNA-seq experiment: a count matrix, which has genes along the rows and samples along the columns. The values in the matrix are the number of reads which could be uniquely aligned to the exons of a given gene for a given sample. We will demonstrate how to build a count matrix for a subset of reads from an experiment, and then use a pre-made count matrix, to avoid having students download the multi-gigabyte BAM files containing the aligned reads. A new pipeline for building count matrices, which skips the alignment step, is to use fast pseudoaligners such as Sailfish, Salmon and kallisto, followed by the tximport package. See the package vignette for more details. Here, we will continue with the read counting pipeline.

First, make variables for the different BAM files and GTF file. Use the sample.table to contruct the BAM file vector, so that the count matrix will be in the same order as the sample.table.

library(airway)

## Loading required package: SummarizedExperiment

## Loading required package: methods

## Loading required package: GenomicRanges

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit

## Loading required package: S4Vectors

## Loading required package: stats4

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

dir <- system.file("extdata", package="airway", mustWork=TRUE)
csv.file <- file.path(dir, "sample_table.csv")
sample.table <- read.csv(csv.file, row.names=1)
bam.files <- file.path(dir, paste0(sample.table$Run, "_subset.bam"))
gtf.file <- file.path(dir, "Homo_sapiens.GRCh37.75_subset.gtf")

Next we create an Rsamtools variable which wraps our BAM files, and create a transcript database from the GTF file. We can ignore the warning about matchCircularity. Finally, we make a GRangesList which contains the exons for each gene.

library(Rsamtools)

## Loading required package: XVector

## Loading required package: Biostrings

bam.list <- BamFileList(bam.files)
library(GenomicFeatures)

## Loading required package: AnnotationDbi

# for Bioc 3.0 use the commented out line
# txdb <- makeTranscriptDbFromGFF(gtf.file, format="gtf")
txdb <- makeTxDbFromGFF(gtf.file, format="gtf")

## Import genomic features from the file as a GRanges object ...

## OK

## Prepare the 'metadata' data frame ...

## OK

## Make the TxDb object ...

## OK

exons.by.gene <- exonsBy(txdb, by="gene")

The following code chunk creates a SummarizedExperiment containing the counts for the reads in each BAM file (columns) for each gene in exons.by.gene (the rows). We add the sample.table as column data. Remember, we know the order is correct, because the bam.list was constructed from a column of sample.table.

library(GenomicAlignments)
se <- summarizeOverlaps(exons.by.gene, bam.list,
                        mode="Union",
                        singleEnd=FALSE,
                        ignore.strand=TRUE,
                        fragments=TRUE)
colData(se) <- DataFrame(sample.table)

A similar function in the Rsubread library can be used to construct a count matrix:

library(Rsubread)
fc <- featureCounts(bam.files, annot.ext=gtf.file,
                    isGTFAnnotationFile=TRUE, 
                    isPaired=TRUE)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.18.0
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 8 BAM files                                      ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                           P /Users/michael/Library/R/3.2/library/airwa ... ||
## ||                                                                            ||
## ||             Output file : ./.Rsubread_featureCounts_pid4785                ||
## ||             Annotations : /Users/michael/Library/R/3.2/library/airway/ ... ||
## ||                                                                            ||
## ||                 Threads : 1                                                ||
## ||                   Level : meta-feature level                               ||
## ||              Paired-end : yes                                              ||
## ||         Strand specific : no                                               ||
## ||      Multimapping reads : not counted                                      ||
## || Multi-overlapping reads : not counted                                      ||
## ||                                                                            ||
## ||          Chimeric reads : counted                                          ||
## ||        Both ends mapped : not required                                     ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file /Users/michael/Library/R/3.2/library/airway/extda ... ||
## ||    Features : 406                                                          ||
## ||    Meta-features : 20                                                      ||
## ||    Chromosomes : 1                                                         ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7142                                                  ||
## ||    Successfully assigned fragments : 6649 (93.1%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    1 read has missing mates.                                               ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7200                                                  ||
## ||    Successfully assigned fragments : 6712 (93.2%)                          ||
## ||    Running time : 0.14 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 8536                                                  ||
## ||    Successfully assigned fragments : 7910 (92.7%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    1 read has missing mates.                                               ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7544                                                  ||
## ||    Successfully assigned fragments : 7044 (93.4%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 8818                                                  ||
## ||    Successfully assigned fragments : 8261 (93.7%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    6 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 11850                                                 ||
## ||    Successfully assigned fragments : 11148 (94.1%)                         ||
## ||    Running time : 0.12 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    6 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 5877                                                  ||
## ||    Successfully assigned fragments : 5415 (92.1%)                          ||
## ||    Running time : 0.12 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/michael/Library/R/3.2/library/airway/extdata/S ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    3 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 10208                                                 ||
## ||    Successfully assigned fragments : 9538 (93.4%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## ||                         Read assignment finished.                          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

names(fc)

## [1] "counts"     "annotation" "targets"    "stat"

unname(fc$counts) # hide the colnames

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
##  [1,]    0    0    4    1    0    0    1    0
##  [2,]    0    0    1    0    0    0    0    0
##  [3,]    0    0    0    0    0    0    0    0
##  [4,]    0    0    0    0    0    0    0    0
##  [5,] 2673 2031 3263 1570 3446 3615 2171 1949
##  [6,]   38   28   66   24   42   41   47   36
##  [7,]   58   55   55   42   53   54   50   43
##  [8,] 1004 1253 1122 1313 1100 1879  745 1534
##  [9,] 1060 1291 1360 1286 1201 1725  917 1940
## [10,]    2    1    2    1    2    7    1    4
## [11,]    2    0    6    7    1    7    1    6
## [12,] 1588 1745 1778 2007 2146 3349 1262 2561
## [13,]    1    0    0    1    0    0    0    0
## [14,]    1    0    0    0    0    0    0    0
## [15,]    0    0    0    0    0    0    0    0
## [16,]    4   50   19  543    1   10   14 1067
## [17,]    0    0    0    1    0    0    0    0
## [18,]    0    0    0    0    0    0    0    0
## [19,]  218  257  234  248  267  461  206  398
## [20,]    0    1    0    0    2    0    0    0

Plot the first column from each function against each other (after matching the rows of the featureCounts matrix to the one returned by summarizeOverlaps.

plot(assay(se)[,1], 
     fc$counts[match(rownames(se),rownames(fc$counts)),1])
abline(0,1)

plot of chunk unnamed-chunk-5

Visualizing sample-sample distances

We now load the full SummarizedExperiment object, counting reads over all the genes.

library(airway)
data(airway)
airway

## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

colData(airway)

## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677

rowRanges(airway)

## GRangesList object of length 64102:
## $$ENSG00000000003 
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames               ranges strand   |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
##    [1]        X [99883667, 99884983]      -   |    667145 ENSE00001459322
##    [2]        X [99885756, 99885863]      -   |    667146 ENSE00000868868
##    [3]        X [99887482, 99887565]      -   |    667147 ENSE00000401072
##    [4]        X [99887538, 99887565]      -   |    667148 ENSE00001849132
##    [5]        X [99888402, 99888536]      -   |    667149 ENSE00003554016
##    ...      ...                  ...    ... ...       ...             ...
##   [13]        X [99890555, 99890743]      -   |    667156 ENSE00003512331
##   [14]        X [99891188, 99891686]      -   |    667158 ENSE00001886883
##   [15]        X [99891605, 99891803]      -   |    667159 ENSE00001855382
##   [16]        X [99891790, 99892101]      -   |    667160 ENSE00001863395
##   [17]        X [99894942, 99894988]      -   |    667161 ENSE00001828996
## 
## ...
## <64101 more elements>
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome

The counts matrix is stored in assay of a SummarizedExperiment.

head(assay(airway))

##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003        679        448        873        408       1138
## ENSG00000000005          0          0          0          0          0
## ENSG00000000419        467        515        621        365        587
## ENSG00000000457        260        211        263        164        245
## ENSG00000000460         60         55         40         35         78
## ENSG00000000938          0          0          2          0          1
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003       1047        770        572
## ENSG00000000005          0          0          0
## ENSG00000000419        799        417        508
## ENSG00000000457        331        233        229
## ENSG00000000460         63         76         60
## ENSG00000000938          0          0          0

This code chunk is not necessary, but helps to make nicer plots below with large axis labels (mypar(1,2) can be substituted with par(mfrow=c(1,2)) below).

# library(devtools)
# install_github("ririzarr/rafalib")
library(rafalib)
mypar()

Note that, on the un-transformed scale, the high count genes have high variance. That is, in the following scatter plot, the points start out in a tight cone and then fan out toward the top right. This is a general property of counts generated from sampling processes, that the variance typically increases with the expected value. We will explore different scaling and transformations options below.

plot(assay(airway)[,1:2], cex=.1)

plot of chunk unnamed-chunk-11

Creating a DESeqDataSet object

We will use the DESeq2 package to normalize the sample for sequencing depth. The DESeqDataSet object is just an extension of the SummarizedExperiment object, with a few changes. The matrix in assay is now accessed with counts and the elements of this matrix are required to be non-negative integers (0,1,2,…).

We need to specify an experimental design here, for later use in differential analysis. The design starts with the tilde symbol ~, which means, model the counts (log2 scale) using the following formula. Following the tilde, the variables are columns of the colData, and the + indicates that for differential expression analysis we want to compare levels of dex while controlling for the cell differences.

library(DESeq2)

## Loading required package: Rcpp

## Loading required package: RcppArmadillo

dds <- DESeqDataSet(airway, design= ~ cell + dex)

We can also make a DESeqDataSet from a count matrix and column data.

dds.fc <- DESeqDataSetFromMatrix(fc$counts, 
                                 colData=sample.table, 
                                 design=~ cell + dex)

Normalization for sequencing depth

The following estimates size factors to account for differences in sequencing depth, and is only necessary to make the log.norm.counts object below.

dds <- estimateSizeFactors(dds)
sizeFactors(dds)

## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 
##  1.0236476  0.8961667  1.1794861  0.6700538  1.1776714  1.3990365 
## SRR1039520 SRR1039521 
##  0.9207787  0.9445141

colSums(counts(dds))

## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 
##   20637971   18809481   25348649   15163415   24448408   30818215 
## SRR1039520 SRR1039521 
##   19126151   21164133

plot(sizeFactors(dds), colSums(counts(dds)))
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))

plot of chunk unnamed-chunk-14

Size factors are calculated by the median ratio of samples to a pseudo-sample (the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios in this histogram.

loggeomeans <- rowMeans(log(counts(dds)))
hist(log(counts(dds)[,1]) - loggeomeans, 
     col="grey", main="", xlab="", breaks=40)

plot of chunk unnamed-chunk-15

The size factor for the first sample:

exp(median((log(counts(dds)[,1]) - loggeomeans)[is.finite(loggeomeans)]))

## [1] 1.023648

sizeFactors(dds)[1]

## SRR1039508 
##   1.023648

Make a matrix of log normalized counts (plus a pseudocount):

log.norm.counts <- log2(counts(dds, normalized=TRUE) + 1)

Another way to make this matrix, and keep the sample and gene information is to use the function normTransform. The same matrix as above is stored in assay(log.norm).

log.norm <- normTransform(dds)

Examine the log counts and the log normalized counts (plus a pseudocount).

rs <- rowSums(counts(dds))
mypar(1,2)
boxplot(log2(counts(dds)[rs > 0,] + 1)) # not normalized
boxplot(log.norm.counts[rs > 0,]) # normalized

plot of chunk unnamed-chunk-19

Make a scatterplot of log normalized counts against each other. Note the fanning out of the points in the lower left corner, for points less than $2^5 = 32$ .

plot(log.norm.counts[,1:2], cex=.1)

plot of chunk unnamed-chunk-20

Stabilizing count variance

Now we will use a more sophisticated transformation, which is similar to the variance stablizing normalization method taught in Week 3 of Course 4: Introduction to Bioconductor. It uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog is very close to log2. For further details, see the section in the DESeq2 paper.

We use the argument blind=FALSE which means that the global dispersion trend should be estimated by considering the experimental design, but the design is not used for applying the transformation itself. See the DESeq2 vignette for more details.

rld <- rlog(dds, blind=FALSE)
plot(assay(rld)[,1:2], cex=.1)

plot of chunk unnamed-chunk-21

Another transformation for stabilizing variance in the DESeq2 package is varianceStabilizingTransformation. These two tranformations are similar, the rlog might perform a bit better when the size factors vary widely, and the varianceStabilizingTransformation is much faster when there are many samples.

vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
plot(assay(vsd)[,1:2], cex=.1)

plot of chunk unnamed-chunk-22

We can examine the standard deviation of rows over the mean for the log plus pseudocount and the rlog. Note that the genes with high variance for the log come from the genes with lowest mean. If these genes were included in a distance calculation, the high variance at the low count range might overwhelm the signal at the higher count range.

library(vsn)
meanSdPlot(log.norm.counts, ranks=FALSE) 

plot of chunk unnamed-chunk-23

For the rlog:

meanSdPlot(assay(rld), ranks=FALSE)

plot of chunk unnamed-chunk-24

For the VST:

meanSdPlot(assay(vsd), ranks=FALSE)

plot of chunk unnamed-chunk-25

The principal components (PCA) plot is a useful diagnostic for examining relationships between samples:

plotPCA(log.norm, intgroup="dex")

plot of chunk unnamed-chunk-26

Using the rlog:

plotPCA(rld, intgroup="dex")

plot of chunk unnamed-chunk-27

Using the VST:

plotPCA(vsd, intgroup="dex")

plot of chunk unnamed-chunk-28

We can make this plot even nicer using custom code from the ggplot2 library:

library(ggplot2)
(data <- plotPCA(rld, intgroup=c("dex","cell"), returnData=TRUE))

##                   PC1        PC2           group   dex    cell       name
## SRR1039508 -17.889006  -4.157942  untrt : N61311 untrt  N61311 SRR1039508
## SRR1039509   8.436826  -1.650903    trt : N61311   trt  N61311 SRR1039509
## SRR1039512 -10.278063  -5.066604 untrt : N052611 untrt N052611 SRR1039512
## SRR1039513  17.642883  -3.910904   trt : N052611   trt N052611 SRR1039513
## SRR1039516 -14.740835  15.990214 untrt : N080611 untrt N080611 SRR1039516
## SRR1039517  10.956483  20.806410   trt : N080611   trt N080611 SRR1039517
## SRR1039520 -12.120208 -11.962714 untrt : N061011 untrt N061011 SRR1039520
## SRR1039521  17.991921 -10.047558   trt : N061011   trt N061011 SRR1039521

(percentVar <- 100*round(attr(data, "percentVar"),2))

## [1] 42 26

makeLab <- function(x,pc) paste0("PC",pc,": ",x,"% variance")
ggplot(data, aes(PC1,PC2,col=dex,shape=cell)) + geom_point() +
  xlab(makeLab(percentVar[1],1)) + ylab(makeLab(percentVar[2],2))

plot of chunk unnamed-chunk-30

In addition, we can plot a hierarchical clustering based on Euclidean distance matrix:

mypar(1,2)
plot(hclust(dist(t(log.norm.counts))), labels=colData(dds)$dex)
plot(hclust(dist(t(assay(rld)))), labels=colData(rld)$dex)

plot of chunk unnamed-chunk-31

Differential gene expression

Modeling raw counts with normalization

We will now perform differential gene expression on the counts, to try to find genes in which the differences in expected counts across samples due to the condition of interest rises above the biological and technical variance we observe.

We will use an overdispersed Poisson distribution – called the negative binomial – to model the raw counts in the count matrix. The model will include the size factors into account to adjust for sequencing depth. The formula will look like:

$K_{ij} \sim \text{NB}(s_{ij} q_{ij}, \alpha_i )$

where $K_{ij}$ is a single raw count in our count table, $s_{ij}$ is a size factor or more generally a normalization factor, $q_{ij}$ is proportional to gene expression (what we want to model with our design variables), and $\alpha_i$ is a dispersion parameter.

Why bother modeling raw counts, rather than dividing out the sequencing depth and working with the normalized counts? In other words, why put the $s_{ij}$ on the right side of the equation above, rather than dividing out on the left side and modeling $K_{ij} / s_{ij}$ . The reason is that, with the raw count, we have knowledge about the link between the expected value and its variance. So we prefer the first equation below to the second equation, because with the first equation, we have some additional information about the variance of the quantity on the left hand side.

$K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} )$ $\frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij})$

When we sample cDNA fragments from a pool in a sequencing library, we can model the count of cDNA fragments which originated from a given gene with a binomial distribution, with a certain probability of picking a fragment for that gene which relates to factors such as the expression of that gene (the abundance of mRNA in the original population of cells), its length and technical factors in the production of the library. When we have many genes, and the rate for each gene is low, while the total number of fragments is high, we know that the Poisson is a good model for the binomial. And for the binomial and the Poisson, there is an explicit link between on observed count and its expected variance.

Below is an example of what happens when we divide or multiply a raw count. Here we show three distributions which all have the expected value of 100, although they have different variances. The first is a raw count with mean 100, the second and third are raw counts with mean 1000 and 10, which were then scaled by 1/10 and 10, respectively.

mypar(3,1)
n <- 10000
brks <- 0:300
hist(rpois(n,100),main="",xlab="",breaks=brks,col="black")
hist(rpois(n,1000)/10,main="",xlab="",breaks=brks,col="black")
hist(rpois(n,10)*10,main="",xlab="",breaks=brks,col="black")

plot of chunk unnamed-chunk-32

So, when we scale a raw count, we break the implicit link between the mean and the variance. This is not necessarily a problem, if we have 100s of samples over which to observe within-group variance, however RNA-seq samples can often have only 3 samples per group, in which case, we can get a benefit of information from using raw counts, and incorporating normalization factors on the right side of the equation above.

Counts across biological replicates and over-dispersion

For the negative binomial, the variance parameter is called disperison, and it links the mean value with the expected variance. The reason we see more dispersion than in a Poisson is mostly due to changes in the proportions of genes across biological replicates – which we would expect due to natural differences in gene expression.

mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,lambda=100),
     main="Poisson / NB, disp=0",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.01),
     main="NB, disp = 0.01",xlab="",breaks=brks,col="black")
hist(rnbinom(n,mu=100,size=1/.1),
     main="NB, disp = 0.1",xlab="",breaks=brks,col="black")

plot of chunk unnamed-chunk-33

The square root of the dispersion is the coefficient of variation – SD/mean – after subtracting the variance we expect due to Poisson sampling.

disp <- 0.5
mu <- 100
v <- mu + disp * mu^2
sqrt(v)/mu

## [1] 0.7141428

sqrt(v - mu)/mu

## [1] 0.7071068

sqrt(disp)

## [1] 0.7071068

A number of methods for assessing differential gene expression from RNA-seq counts use the negative binomial distribution to make probabilistic statements about the differences seen in an experiment. A few such methods are edgeR, DESeq2, and DSS. Other methods, such as limma+voom find other ways to explicitly model the mean of log counts and the observed variance of log counts. A very incomplete list of statistical methods for RNA-seq differential expression is provided in the footnotes.

DESeq2 performs a similar step to limma as discussed in PH525x Course 3, in using the variance of all the genes to improve the variance estimate for each individual gene. In addition, DESeq2 shrinks the unreliable fold changes from genes with low counts, which will be seen in the resulting MA-plot.

Experimental design and running DESeq2

Remember, we had created the DESeqDataSet object earlier using the following line of code (or alternatively using DESeqDataSetFromMatrix)

dds <- DESeqDataSet(airway, design= ~ cell + dex)

First, we setup the design of the experiment, so that differences will be considered across time and protocol variables. We can read and if necessary reset the design using the following code.

design(dds)

## ~cell + dex

design(dds) <- ~ cell + dex

The last variable in the design is used by default for building results tables (although arguments to results can be used to customize the results table), and we make sure the “control” or “untreated” level is the first level, such that log fold changes will be treated over control, and not control over treated.

levels(dds$dex)

## [1] "trt"   "untrt"

dds$dex <- relevel(dds$dex, "untrt")
levels(dds$dex)

## [1] "untrt" "trt"

The following line runs the DESeq2 model. After this step, we can build a results table, which by default will compare the levels in the last variable in the design, so the dex treatment in our case:

dds <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

res <- results(dds)

Examining results tables

head(res)

## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 6 columns
##                    baseMean log2FoldChange      lfcSE       stat
##                   <numeric>      <numeric>  <numeric>  <numeric>
## ENSG00000000003 708.6021697    -0.37424998 0.09873107 -3.7906000
## ENSG00000000005   0.0000000             NA         NA         NA
## ENSG00000000419 520.2979006     0.20215551 0.10929899  1.8495642
## ENSG00000000457 237.1630368     0.03624826 0.13684258  0.2648902
## ENSG00000000460  57.9326331    -0.08523371 0.24654402 -0.3457140
## ENSG00000000938   0.3180984    -0.11555969 0.14630532 -0.7898530
##                       pvalue        padj
##                    <numeric>   <numeric>
## ENSG00000000003 0.0001502838 0.001251416
## ENSG00000000005           NA          NA
## ENSG00000000419 0.0643763851 0.192284345
## ENSG00000000457 0.7910940556 0.910776144
## ENSG00000000460 0.7295576905 0.878646940
## ENSG00000000938 0.4296136373          NA

table(res$padj < 0.1)

## 
## FALSE  TRUE 
## 13145  4883

A summary of the results can be generated:

summary(res)

## 
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 2641, 7.9% 
## LFC < 0 (down)   : 2242, 6.7% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 15441, 46% 
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

For testing at a different threshold, we provide the alpha to results, so that the mean filtering is optimal for our new FDR threshold.

res2 <- results(dds, alpha=0.05)
table(res2$padj < 0.05)

## 
## FALSE  TRUE 
## 12736  4057

Visualizing results

The MA-plot provides a global view of the differential genes, with the log2 fold change on the y-axis over the mean of normalized counts:

plotMA(res, ylim=c(-4,4))

plot of chunk unnamed-chunk-42

We can also test against a different null hypothesis. For example, to test for genes which have fold change more than doubling or less than halving:

res.thr <- results(dds, lfcThreshold=1)
plotMA(res.thr, ylim=c(-4,4))

plot of chunk unnamed-chunk-43

A p-value histogram:

hist(res$pvalue[res$baseMean > 1], 
     col="grey", border="white", xlab="", ylab="", main="")

plot of chunk unnamed-chunk-44

A sorted results table:

resSort <- res[order(res$padj),]
head(resSort)

## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat
##                  <numeric>      <numeric> <numeric> <numeric>
## ENSG00000152583   997.4398       4.316100 0.1724127  25.03354
## ENSG00000165995   495.0929       3.188698 0.1277441  24.96160
## ENSG00000101347 12703.3871       3.618232 0.1499441  24.13054
## ENSG00000120129  3409.0294       2.871326 0.1190334  24.12201
## ENSG00000189221  2341.7673       3.230629 0.1373644  23.51868
## ENSG00000211445 12285.6151       3.552999 0.1589971  22.34631
##                        pvalue          padj
##                     <numeric>     <numeric>
## ENSG00000152583 2.637881e-138 4.755573e-134
## ENSG00000165995 1.597973e-137 1.440413e-133
## ENSG00000101347 1.195378e-128 6.620010e-125
## ENSG00000120129 1.468829e-128 6.620010e-125
## ENSG00000189221 2.627083e-122 9.472209e-119
## ENSG00000211445 1.311440e-110 3.940441e-107

Examine the counts for the top gene, sorting by p-value:

plotCounts(dds, gene=which.min(res$padj), intgroup="dex")

plot of chunk unnamed-chunk-46

A more sophisticated plot of counts:

library(ggplot2)
data <- plotCounts(dds, gene=which.min(res$padj), intgroup=c("dex","cell"), returnData=TRUE)
ggplot(data, aes(x=dex, y=count, col=cell)) +
  geom_point(position=position_jitter(width=.1,height=0)) +
  scale_y_log10()

plot of chunk unnamed-chunk-47

Connecting by lines shows the differences which are actually being tested by results given that our design includes cell + dex

ggplot(data, aes(x=dex, y=count, col=cell, group=cell)) +
  geom_point() + geom_line() + scale_y_log10() 

plot of chunk unnamed-chunk-48

A heatmap of the top genes:

library(pheatmap)
topgenes <- head(rownames(resSort),20)
mat <- assay(rld)[topgenes,]
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[,c("dex","cell")])
pheatmap(mat, annotation_col=df)

plot of chunk unnamed-chunk-49

Getting alternate annotations

We can then check the annotation of these highly significant genes:

library(org.Hs.eg.db)

## Loading required package: DBI

##

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT" 
##  [5] "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [9] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"    
## [17] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"        
## [21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

anno <- select(org.Hs.eg.db, keys=topgenes,
               columns=c("SYMBOL","GENENAME"), 
               keytype="ENSEMBL")

## 'select()' returned 1:1 mapping between keys and columns

anno[match(topgenes, anno$ENSEMBL),]

##            ENSEMBL  SYMBOL
## 1  ENSG00000152583 SPARCL1
## 2  ENSG00000165995  CACNB2
## 3  ENSG00000101347  SAMHD1
## 4  ENSG00000120129   DUSP1
## 5  ENSG00000189221    MAOA
## 6  ENSG00000211445    GPX3
## 7  ENSG00000157214  STEAP2
## 8  ENSG00000162614    NEXN
## 9  ENSG00000125148    MT2A
## 10 ENSG00000154734 ADAMTS1
## 11 ENSG00000139132    FGD4
## 12 ENSG00000162493    PDPN
## 13 ENSG00000162692   VCAM1
## 14 ENSG00000179094    PER1
## 15 ENSG00000134243   SORT1
## 16 ENSG00000163884   KLF15
## 17 ENSG00000178695  KCTD12
## 18 ENSG00000146250  PRSS35
## 19 ENSG00000198624  CCDC69
## 20 ENSG00000148848  ADAM12
##                                                     GENENAME
## 1                                       SPARC-like 1 (hevin)
## 2         calcium channel, voltage-dependent, beta 2 subunit
## 3                                 SAM domain and HD domain 1
## 4                             dual specificity phosphatase 1
## 5                                        monoamine oxidase A
## 6                                   glutathione peroxidase 3
## 7                    STEAP family member 2, metalloreductase
## 8                          nexilin (F actin binding protein)
## 9                                         metallothionein 2A
## 10 ADAM metallopeptidase with thrombospondin type 1 motif, 1
## 11                   FYVE, RhoGEF and PH domain containing 4
## 12                                                podoplanin
## 13                         vascular cell adhesion molecule 1
## 14                                  period circadian clock 1
## 15                                                sortilin 1
## 16                                    Kruppel-like factor 15
## 17    potassium channel tetramerization domain containing 12
## 18                                      protease, serine, 35
## 19                          coiled-coil domain containing 69
## 20                           ADAM metallopeptidase domain 12

# for Bioconductor >= 3.1, easier to use mapIds() function

Looking up different results tables

The contrast argument allows users to specify what results table should be built. See the help and examples in ?results for more details:

results(dds, contrast=c("cell","N61311","N052611"))

## log2 fold change (MAP): cell N61311 vs N052611 
## Wald test p-value: cell N61311 vs N052611 
## DataFrame with 64102 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE       stat    pvalue
##                 <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSG00000000003 708.60217    -0.20680752 0.1368030 -1.5117175 0.1306057
## ENSG00000000005   0.00000             NA        NA         NA        NA
## ENSG00000000419 520.29790    -0.06389491 0.1490723 -0.4286169 0.6682020
## ENSG00000000457 237.16304     0.06428564 0.1827151  0.3518355 0.7249617
## ENSG00000000460  57.93263     0.34319847 0.2840573  1.2082017 0.2269697
## ...                   ...            ...       ...        ...       ...
## LRG_94                  0             NA        NA         NA        NA
## LRG_96                  0             NA        NA         NA        NA
## LRG_97                  0             NA        NA         NA        NA
## LRG_98                  0             NA        NA         NA        NA
## LRG_99                  0             NA        NA         NA        NA
##                      padj
##                 <numeric>
## ENSG00000000003 0.4851668
## ENSG00000000005        NA
## ENSG00000000419 0.9153673
## ENSG00000000457 0.9334645
## ENSG00000000460 0.6280187
## ...                   ...
## LRG_94                 NA
## LRG_96                 NA
## LRG_97                 NA
## LRG_98                 NA
## LRG_99                 NA

Surrogate variable analysis for RNA-seq

If we suppose that we didn’t know about the different cell-lines in the experiment, but noticed some structure in the counts, we could use surrograte variable analysis (SVA) to detect this hidden structure (see PH525x Course 3 for details on the algorithm).

library(sva)

## Loading required package: mgcv

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:Biostrings':
## 
##     collapse

## The following object is masked from 'package:IRanges':
## 
##     collapse

## This is mgcv 1.8-12. For overview type 'help("mgcv-package")'.

## Loading required package: genefilter

## 
## Attaching package: 'genefilter'

## The following object is masked from 'package:base':
## 
##     anyNA

dat <- counts(dds, normalized=TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx,]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv=2)

## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

Do the surrogate variables capture the cell difference?

plot(svseq$sv[,1], svseq$sv[,2], col=dds$cell, pch=16)

plot of chunk unnamed-chunk-53

Using the surrogate variables in a DESeq2 analysis:

dds.sva <- dds
dds.sva$SV1 <- svseq$sv[,1]
dds.sva$SV2 <- svseq$sv[,2]
design(dds.sva) <- ~ SV1 + SV2 + dex
dds.sva <- DESeq(dds.sva)

## using pre-existing size factors

## estimating dispersions

## found already estimated dispersions, replacing these

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

Session info

sessionInfo()

## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] sva_3.18.0                 genefilter_1.52.1         
##  [3] mgcv_1.8-12                nlme_3.1-126              
##  [5] org.Hs.eg.db_3.2.3         RSQLite_1.0.0             
##  [7] DBI_0.3.1                  pheatmap_1.0.8            
##  [9] ggplot2_2.1.0              hexbin_1.27.1             
## [11] vsn_3.38.0                 DESeq2_1.10.1             
## [13] RcppArmadillo_0.6.600.4.0  Rcpp_0.12.4               
## [15] rafalib_1.0.0              GenomicFeatures_1.22.13   
## [17] AnnotationDbi_1.32.3       Rsamtools_1.22.0          
## [19] Biostrings_2.38.4          XVector_0.10.0            
## [21] airway_0.104.0             SummarizedExperiment_1.0.2
## [23] Biobase_2.30.0             GenomicRanges_1.22.4      
## [25] GenomeInfoDb_1.6.3         IRanges_2.4.8             
## [27] S4Vectors_0.8.11           BiocGenerics_0.16.1       
## [29] knitr_1.12.3              
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1          lattice_0.20-33        
##  [3] digest_0.6.9            plyr_1.8.3             
##  [5] futile.options_1.0.0    acepack_1.3-3.3        
##  [7] evaluate_0.8.3          BiocInstaller_1.20.1   
##  [9] zlibbioc_1.16.0         annotate_1.48.0        
## [11] rpart_4.1-10            Matrix_1.2-4           
## [13] preprocessCore_1.32.0   labeling_0.3           
## [15] splines_3.2.4           BiocParallel_1.4.3     
## [17] geneplotter_1.48.0      stringr_1.0.0          
## [19] foreign_0.8-66          RCurl_1.95-4.8         
## [21] biomaRt_2.26.1          munsell_0.4.3          
## [23] rtracklayer_1.30.4      nnet_7.3-12            
## [25] gridExtra_2.2.1         Hmisc_3.17-3           
## [27] XML_3.98-1.4            GenomicAlignments_1.6.3
## [29] bitops_1.0-6            grid_3.2.4             
## [31] xtable_1.8-2            gtable_0.2.0           
## [33] affy_1.48.0             magrittr_1.5           
## [35] formatR_1.3             scales_0.4.0           
## [37] stringi_1.0-1           affyio_1.40.0          
## [39] limma_3.26.9            latticeExtra_0.6-28    
## [41] futile.logger_1.4.1     Formula_1.2-1          
## [43] lambda.r_1.1.7          RColorBrewer_1.1-2     
## [45] tools_3.2.4             survival_2.38-3        
## [47] colorspace_1.2-6        cluster_2.0.3

Footnotes

RNA-seq introductory papers

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B., “Mapping and quantifying mammalian transcriptomes by RNA-seq”, Nat Methods. 2008. http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html

John C. Marioni, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad, “RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays” Genome Res. 2008. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527709/

Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L., “Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation”, Nature Biotechnology, 2010. http://www.nature.com/nbt/journal/v28/n5/full/nbt.1621.html

ReCount

Frazee AC, Langmead B, Leek JT. “ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets”. BMC Bioinformatics 12:449 http://www.ncbi.nlm.nih.gov/pubmed/22087737

The following sections give just a few examples of the many RNA-seq differential expression software packages:

Negative binomial count methods

The following methods are available on Bioconductor:

DESeq2

Michael I Love, Simon Anders, Wolfgang Huber, “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2” Genome Biology 2014. http://genomebiology.com/2014/15/12/550

edgeR

Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth, “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data” Bioinformatics 2010. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818/

DSS

Hao Wu, Chi Wang, Zhijin Wu, “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data” Biostatistics 2013. http://biostatistics.oxfordjournals.org/content/14/2/232

Variance-mean modeling followed by linear model

limma-voom in the limma Bioconductor package. Limma also contains gene-set testing methods (see ROAST for example in the Reference Manual)

Charity W Law, Yunshun Chen, Wei Shi and Gordon K Smyth, “voom: precision weights unlock linear model analysis tools for RNA-seq read counts”, Genome Biology. 2014. http://genomebiology.com/2014/15/2/R29

Resampling-based methods

SAMseq in the samr package on CRAN

Jun Li and Robert Tibshirani, “Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-seq data”, Stat Methods Med Res. 2013. http://smm.sagepub.com/content/22/5/519.short

Incorporating isoform-abundance

Cuffdiff (the latest version is Cuffdiff2) with cummeRbund the accompanying Bioconductor visualization package.

Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L., “Differential analysis of gene regulation at transcript resolution with RNA-seq” Nat Biotechnol. 2013. http://www.ncbi.nlm.nih.gov/pubmed/23222703

BitSeq (Bioconductor)

Peter Glaus, Antti Honkela, and Magnus Rattray, “Identifying differentially expressed transcripts from RNA-seq data with biological variation”, Bioinformatics. 2012. http://bioinformatics.oxfordjournals.org/content/28/13/1721