Architecture: considerations on high performance computing with Bioconductor
Overview of performance enhancements
The concept of scalability for software systems and data analyses is complex and precise metrics and criteria are not ready to hand. An interesting overview from the computer science community is that of Weinstock and Goodenough. Scalability of Bioconductor-based analytical workflows can be pursued along various paths, and many useful ideas are presented in work of Mike Lawrence and Martin Morgan, two Bioconductor core members.
This document addresses only the use of parallel computation as directly supported in Bioconductor. The two main approaches are
-
Shared memory parallelism. The R process is forked an arbitrary number of times with full copies of the memory image and computations proceed independently for each image. Selected results are returned to the master process. This is often effective in multicore environments.
-
Distributed parallelism. Independent processors, potentially running different operating systems, can run compatible instances of R. Job control can be carried out by R or by a cluster scheduler.
For tasks that are “embarrassingly parallel”, that do not require communication between processes beyond starting, stopping, and returning results, either of these approaches can be used reasonably simply in R.
Simple illustration
We can demonstrate the shared memory approach with our laptop.
system.time( lapply(1:8, function(x)Sys.sleep(1) ) )
## user system elapsed
## 0.004 0.001 8.020
library(parallel)
detectCores()
## [1] 4
options(mc.cores=4)
system.time( mclapply(1:8, function(x)Sys.sleep(1) ) )
## user system elapsed
## 0.007 0.022 2.031
For this meaningless computation, we achieved linear speedup: we cut the time for serial computation by a factor of four.
Detour: BAM files from an RNA-seq experiment
We will be working with a set of BAM files created in the study of Zarnack et al. 2013. The hypothesis of this study is that a class of proteins, the heterogeneous nuclear ribonucleoproteins C1 and C2, referred to as hnRNP C, are responsible for preventing incorporation of Alu elements into mature RNA transcripts. Such incorporation may lead to protein dysfunction that could cause disease; see the references of the Zarnack paper for further background.
The package that we will work with has 8 BAM files with reads aligned to chr14. Four of the files are reads obtained from from HeLa cells (negative control) and four are obtained from HeLa cells in which hnRNP C has been “knocked down” with siRNA.
library(RNAseqData.HNRNPC.bam.chr14)
dir(system.file("extdata", package="RNAseqData.HNRNPC.bam.chr14"))
## [1] "ERR127302_chr14.bam" "ERR127302_chr14.bam.bai"
## [3] "ERR127303_chr14.bam" "ERR127303_chr14.bam.bai"
## [5] "ERR127304_chr14.bam" "ERR127304_chr14.bam.bai"
## [7] "ERR127305_chr14.bam" "ERR127305_chr14.bam.bai"
## [9] "ERR127306_chr14.bam" "ERR127306_chr14.bam.bai"
## [11] "ERR127307_chr14.bam" "ERR127307_chr14.bam.bai"
## [13] "ERR127308_chr14.bam" "ERR127308_chr14.bam.bai"
## [15] "ERR127309_chr14.bam" "ERR127309_chr14.bam.bai"
## [17] "tophat2_out"
These are Tabix-indexed BAM files. In the alphabetic ordering, the first four are two pairs of replicates of HeLa cells that have undergone hnRNP C knockdown, and the second four are two pairs of control replicates.
Implicit parallelism through BiocParallel
To foster harmonious development of reliably performant procedures,
Bioconductor introduced the BiocParallel package. Users
benefit from autonomous (but optionally controllable) pursuit
of parallel computing when feasible. Consider the following
example: we will count reads into bins defined by exon addresses
in the HNRNPC example dataset.
The RNAseqData.HNRNPC.bam.chr14 package includes a vector
of absolute pathnames of the BAM files, which we assign to fns
.
library(RNAseqData.HNRNPC.bam.chr14)
fns = RNAseqData.HNRNPC.bam.chr14_BAMFILES
length(fns)
## [1] 8
Here we establish the exon bins into which we will count reads.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
seqlevels(txdb) = "chr14"
ebg = exonsBy(txdb, by="gene")
Now we use the summarizeOverlaps
function from
the GenomicAlignments package to count reads into the exon bins.
We’ll time the counting from a single file, and then time
the counting from all files at once.
library(GenomicAlignments)
# summarizeOverlaps uses bplapply to iterate over files
s1 = system.time(i1 <- summarizeOverlaps( ebg, fns[3] ))
s1
## user system elapsed
## 2.975 0.178 3.178
# show implicit config
BiocParallel::bpparam()
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 4; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bptimeout: 2592000; bpprogressbar: FALSE
## bpRNGseed:
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
s2 = system.time(i2 <- summarizeOverlaps( ebg, fns ))
s2
## user system elapsed
## 0.271 0.162 10.480
This is not a thorough way of measuring speedup but it
shows reasonable enhancement.
In the second computation, we did approximately 8 times as
much computation, but the clock time elapsed increased only
by a factor of (3.3).
The default behavior of BiocParallel on the MacBook
Air used to produce this document is to pick up the
value of the option mc.cores
and use this as the number
of workers in a MulticoreParam
configuration; if options()$mc.cores
is NULL, 2 workers are specified.
When BiocParallel is attached,
the summarizeOverlaps
function will iterate
over the files using bplapply
from the BiocParallel package.
That function will check the R session
for specific parallelization configuration information.
If it finds none, it will check for multiple cores
and make arrangements to use them if present.
The “check” occurs via the function bpparam
.
The default situation on a MacBook Air running MacOSX 10.9.5 with an Intel core i7 processor, (two physical cores with two logical cores each, allowing for four concurrent threads) as follows.
options(mc.cores=NULL)
library(BiocParallel)
register(MulticoreParam())
bpparam()
## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bptimeout: 2592000; bpprogressbar: FALSE
## bpRNGseed:
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
This identifies an object called MulticoreParam
which is
used to configure the behavior of bplapply
and other utilities
of the BiocParallel package. There are various configuration
object classes that can be used.
‘SnowParam’: distributed memory computing
‘MulticoreParam’: shared memory computing
‘BatchJobsParam’: scheduled cluster computing
‘DoparParam’: foreach computing
‘SerialParam’: non-parallel execution
We need to use register
to determine the type of
concurrent computation that will be performed.
If process size is large, we may want to leave
some cores idle. We can accomplish that by using register
.
Here we’ll check for any advantage of using all logical cores.
library(BiocParallel)
register(MulticoreParam(workers=1))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
## user system elapsed
## 15.653 1.286 17.070
register(MulticoreParam(workers=2))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
## user system elapsed
## 0.103 0.029 8.818
register(MulticoreParam(workers=3))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
## user system elapsed
## 0.051 0.023 9.543
register(MulticoreParam(workers=4))
system.time(i3 <- summarizeOverlaps( ebg, fns ))
## user system elapsed
## 0.056 0.026 8.774
all.equal(i3,i2) # check that the results do not change
## [1] TRUE
Note that in this environment, despite increasing the number of CPUs linearly do not appreciably reduce run time after moving to two cores. This due mainly to communication costs that typically increase with the number of CPUs, and this phenomenon will be environment-specific.
In summary, it is very easy to perform embarrassingly parallel tasks with R, and this carries over to genomic data analysis thanks to BiocParallel. There are some strategic considerations concerning control of memory consumption and communication costs, and full mastery of the topic involves attention to profiling and benchmarking methods that can be addressed in an advanced software development course.