Overview of performance enhancements

The concept of scalability for software systems and data analyses is complex and precise metrics and criteria are not ready to hand. An interesting overview from the computer science community is that of Weinstock and Goodenough. Scalability of Bioconductor-based analytical workflows can be pursued along various paths, and many useful ideas are presented in work of Mike Lawrence and Martin Morgan, two Bioconductor core members.

This document addresses only the use of parallel computation as directly supported in Bioconductor. The two main approaches are

  • Shared memory parallelism. The R process is forked an arbitrary number of times with full copies of the memory image and computations proceed independently for each image. Selected results are returned to the master process. This is often effective in multicore environments.

  • Distributed parallelism. Independent processors, potentially running different operating systems, can run compatible instances of R. Job control can be carried out by R or by a cluster scheduler.

For tasks that are “embarrassingly parallel”, that do not require communication between processes beyond starting, stopping, and returning results, either of these approaches can be used reasonably simply in R.

Simple illustration

We can demonstrate the shared memory approach with our laptop.

system.time( lapply(1:8, function(x)Sys.sleep(1) ) )
##    user  system elapsed 
##   0.004   0.001   8.020
library(parallel)
detectCores()
## [1] 4
options(mc.cores=4)
system.time( mclapply(1:8, function(x)Sys.sleep(1) ) )
##    user  system elapsed 
##   0.007   0.022   2.031

For this meaningless computation, we achieved linear speedup: we cut the time for serial computation by a factor of four.

Detour: BAM files from an RNA-seq experiment

We will be working with a set of BAM files created in the study of Zarnack et al. 2013. The hypothesis of this study is that a class of proteins, the heterogeneous nuclear ribonucleoproteins C1 and C2, referred to as hnRNP C, are responsible for preventing incorporation of Alu elements into mature RNA transcripts. Such incorporation may lead to protein dysfunction that could cause disease; see the references of the Zarnack paper for further background.

The package that we will work with has 8 BAM files with reads aligned to chr14. Four of the files are reads obtained from from HeLa cells (negative control) and four are obtained from HeLa cells in which hnRNP C has been “knocked down” with siRNA.

library(RNAseqData.HNRNPC.bam.chr14)
dir(system.file("extdata", package="RNAseqData.HNRNPC.bam.chr14"))
##  [1] "ERR127302_chr14.bam"     "ERR127302_chr14.bam.bai"
##  [3] "ERR127303_chr14.bam"     "ERR127303_chr14.bam.bai"
##  [5] "ERR127304_chr14.bam"     "ERR127304_chr14.bam.bai"
##  [7] "ERR127305_chr14.bam"     "ERR127305_chr14.bam.bai"
##  [9] "ERR127306_chr14.bam"     "ERR127306_chr14.bam.bai"
## [11] "ERR127307_chr14.bam"     "ERR127307_chr14.bam.bai"
## [13] "ERR127308_chr14.bam"     "ERR127308_chr14.bam.bai"
## [15] "ERR127309_chr14.bam"     "ERR127309_chr14.bam.bai"
## [17] "tophat2_out"

These are Tabix-indexed BAM files. In the alphabetic ordering, the first four are two pairs of replicates of HeLa cells that have undergone hnRNP C knockdown, and the second four are two pairs of control replicates.

Implicit parallelism through BiocParallel

To foster harmonious development of reliably performant procedures, Bioconductor introduced the BiocParallel package. Users benefit from autonomous (but optionally controllable) pursuit of parallel computing when feasible. Consider the following example: we will count reads into bins defined by exon addresses in the HNRNPC example dataset.
The RNAseqData.HNRNPC.bam.chr14 package includes a vector of absolute pathnames of the BAM files, which we assign to fns.

library(RNAseqData.HNRNPC.bam.chr14)
fns = RNAseqData.HNRNPC.bam.chr14_BAMFILES
length(fns)
## [1] 8

Here we establish the exon bins into which we will count reads.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevels(txdb) = "chr14"
ebg = exonsBy(txdb, by="gene")

Now we use the summarizeOverlaps function from the GenomicAlignments package to count reads into the exon bins. We’ll time the counting from a single file, and then time the counting from all files at once.

library(GenomicAlignments)
# summarizeOverlaps uses bplapply to iterate over files
s1 = system.time(i1 <- summarizeOverlaps( ebg, fns[3] ))
s1
##    user  system elapsed 
##   2.975   0.178   3.178
# show implicit config
BiocParallel::bpparam()
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 2592000; bpprogressbar: FALSE
##   bpRNGseed: 
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK
s2 = system.time(i2 <- summarizeOverlaps( ebg, fns ))
s2
##    user  system elapsed 
##   0.271   0.162  10.480

This is not a thorough way of measuring speedup but it shows reasonable enhancement. In the second computation, we did approximately 8 times as much computation, but the clock time elapsed increased only by a factor of (3.3). The default behavior of BiocParallel on the MacBook Air used to produce this document is to pick up the value of the option mc.cores and use this as the number of workers in a MulticoreParam configuration; if options()$mc.cores is NULL, 2 workers are specified.

When BiocParallel is attached, the summarizeOverlaps function will iterate over the files using bplapply from the BiocParallel package. That function will check the R session for specific parallelization configuration information. If it finds none, it will check for multiple cores and make arrangements to use them if present. The “check” occurs via the function bpparam.

The default situation on a MacBook Air running MacOSX 10.9.5 with an Intel core i7 processor, (two physical cores with two logical cores each, allowing for four concurrent threads) as follows.

options(mc.cores=NULL)
library(BiocParallel)
register(MulticoreParam())
bpparam()
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bptimeout: 2592000; bpprogressbar: FALSE
##   bpRNGseed: 
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

This identifies an object called MulticoreParam which is used to configure the behavior of bplapply and other utilities of the BiocParallel package. There are various configuration object classes that can be used.

‘SnowParam’: distributed memory computing

‘MulticoreParam’: shared memory computing

‘BatchJobsParam’: scheduled cluster computing

‘DoparParam’: foreach computing

‘SerialParam’: non-parallel execution

We need to use register to determine the type of concurrent computation that will be performed.

If process size is large, we may want to leave some cores idle. We can accomplish that by using register. Here we’ll check for any advantage of using all logical cores.

library(BiocParallel)
register(MulticoreParam(workers=1))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##  15.653   1.286  17.070
register(MulticoreParam(workers=2))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.103   0.029   8.818
register(MulticoreParam(workers=3))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.051   0.023   9.543
register(MulticoreParam(workers=4))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
##    user  system elapsed 
##   0.056   0.026   8.774
all.equal(i3,i2)  # check that the results do not change
## [1] TRUE

Note that in this environment, despite increasing the number of CPUs linearly do not appreciably reduce run time after moving to two cores. This due mainly to communication costs that typically increase with the number of CPUs, and this phenomenon will be environment-specific.

In summary, it is very easy to perform embarrassingly parallel tasks with R, and this carries over to genomic data analysis thanks to BiocParallel. There are some strategic considerations concerning control of memory consumption and communication costs, and full mastery of the topic involves attention to profiling and benchmarking methods that can be addressed in an advanced software development course.