Benchmarking multiple out-of-memory strategies
Introduction
In many large-data situations, it is impractical to load and retain data in R’s working memory. We have looked at HDF5, SQLite, and tabix-indexed text as possible solutions to problems arising from memory constraints. We’ll call these “out-of-memory” (OOM) approaches.
How can we find out which approach will be most effective for a given task? Comparative benchmarking is a useful skill, and we give a rudimentary account of it here.
The harness
It is common to speak of a program that drives other programs as a “harness” (see Wikipedia for related discussion). We have such a program in ph525x:
benchOOM
## function (NR = 5000, NC = 100, times = 5, inseed = 1234, methods = list(.h5RoundTrip,
## .ffRoundTrip, .slRoundTrip, .dtRoundTrip, .bmRoundTrip))
## {
## require(microbenchmark)
## require(parallel)
## options(mc.cores = 2)
## nel = NR * NC
## set.seed(inseed)
## x = array(rnorm(nel), dim = c(NR, NC))
## cbind(NR = NR, NC = NC, times = times, do.call(rbind, lapply(methods,
## function(z) getStats(times, x, rtfun = z))))
## }
## <environment: namespace:ph525x>
This program is going to help us assess performance of various OOM approaches. We consider a very limited problem, that of managing data that could reside in an R matrix. The main parameters are:
- NR and NC: row and column dimensions
- times: number of benchmark replications for averaging
- inseed: a seed for random number generation to ensure reproducibility
- methods: a list of methods
The methods parameter is the most complex. Each element of the list
is assumed to be a function with the matrix to
be managed via OOM as the first argument, some additional
parameters, and a parameter intimes that gives the number
of benchmark replicates.
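The calling convention just described can be sketched with a toy method. Everything here is hypothetical and not part of ph525x: `toyMethod`, its `scale` parameter, and the `elapsed` element are names made up for illustration, and base R's `system.time` stands in for microbenchmark so that the sketch needs no extra packages.

```r
# Hypothetical toy method (not part of ph525x): the matrix comes first,
# then an extra parameter, then intimes giving the number of replicates.
toyMethod <- function(x, scale = 1, intimes = 1) {
  # Run a trivial task intimes times, recording elapsed seconds per run.
  elapsed <- vapply(seq_len(intimes), function(i) {
    system.time(sum(x * scale))[["elapsed"]]
  }, numeric(1))
  list(elapsed = elapsed, times = intimes, method = "toy")
}

# Seeded test matrix, built the same way benchOOM builds x.
set.seed(1234)
x <- array(rnorm(20 * 5), dim = c(20, 5))

# A harness can then apply every method in the list to the same matrix.
methods <- list(toyMethod)
res <- lapply(methods, function(f) f(x, intimes = 3))
str(res[[1]])
```

Because every method takes the matrix first and accepts `intimes`, the harness does not need to know anything else about the storage backend behind each function.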
Our objective is to produce a table that looks like
> b1
    NR  NC times       meth        wr     ingFull      ing1K
1 5000 100     5       hdf5  10.71714   9.4100810 14.2984402
2 5000 100     5         ff  25.34365  63.0977338  4.4320688
3 5000 100     5     sqlite 174.89003 105.1254638 28.4717496
4 5000 100     5 data.table  49.35190   7.9871552 13.9007588
5 5000 100     5  bigmemory  23.39697   0.9660878  0.9950034
where each method listed in meth is asked to perform the same
task a fixed number of times for averaging. The construction of
the table occurs by binding together metadata about the task and
method to the result of getStats. We’ll leave the details
of getStats to independent investigation.
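The binding step can be illustrated without knowing how getStats works. In the sketch below, `toyStats` is a hypothetical stand-in for getStats that simply averages the timings returned by a method, and `makeToyMethod` is a made-up factory for placeholder methods. Only the `cbind`/`do.call(rbind, lapply(...))` pattern is taken from the benchOOM source shown above.

```r
# Hypothetical factory for placeholder methods (not part of ph525x).
makeToyMethod <- function(label) {
  function(x, intimes = 1) {
    # Time a zero-argument function intimes times (elapsed seconds).
    timeReps <- function(fun) {
      vapply(seq_len(intimes),
             function(i) system.time(fun())[["elapsed"]], numeric(1))
    }
    list(wr = timeReps(function() x + 0),
         ingFull = timeReps(function() sum(x)),
         ing1K = timeReps(function() sum(x[1:10, ])),
         times = intimes, method = label)
  }
}

# Hypothetical stand-in for getStats: one row of mean timings per method.
toyStats <- function(times, x, rtfun) {
  out <- rtfun(x, intimes = times)
  data.frame(meth = out$method, wr = mean(out$wr),
             ingFull = mean(out$ingFull), ing1K = mean(out$ing1K))
}

NR <- 50; NC <- 4; times <- 3
set.seed(1234)
x <- array(rnorm(NR * NC), dim = c(NR, NC))
methods <- list(makeToyMethod("toyA"), makeToyMethod("toyB"))

# Same assembly pattern as benchOOM: stack per-method rows, then
# prepend the task metadata (recycled across rows by cbind).
tab <- cbind(NR = NR, NC = NC, times = times,
             do.call(rbind, lapply(methods,
                                   function(z) toyStats(times, x, rtfun = z))))
print(tab)
```

The scalar metadata values are recycled to one copy per method row, which is why NR, NC, and times repeat down the table.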
An example method (OOM benchmarker)
Let’s look at the method for HDF5:
ph525x:::.h5RoundTrip
## function (x, chunkIn = c(1000, 10), inLevel = 0, intimes = 1)
## {
## system("rm -rf ex_hdf5file.h5")
## require(rhdf5)
## h5createFile("ex_hdf5file.h5")
## h5createDataset("ex_hdf5file.h5", "x", c(nrow(x), ncol(x)),
## storage.mode = "double", chunk = chunkIn, level = inLevel)
## mw = microbenchmark(h5write(x, "ex_hdf5file.h5", name = "x"),
## times = intimes)
## mr = microbenchmark(h5read("ex_hdf5file.h5", name = "x"),
## times = intimes)
## msel = microbenchmark(ysel <- h5read("ex_hdf5file.h5", name = "x",
## index = list(4001:5000, 1:100)), times = intimes)
## stopifnot(all.equal(ysel, x[4001:5000, ]))
## list(mwrite = mw, ingFull = mr, ing1K = msel, times = intimes,
## method = "hdf5")
## }
## <environment: namespace:ph525x>
The program has three main phases:
- HDF5-related setup, cleaning out any previous archives and establishing the basic target file
- Benchmarking of data export via h5write
- Benchmarking of ingestion via h5read with various restrictions
The results of microbenchmark are assembled in a list.
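The same three-phase shape can be sketched for a different backend. The function below is a hypothetical illustration, not one of the ph525x methods: `rdsRoundTrip` and its `rows` parameter are invented names, it uses base R's `saveRDS`/`readRDS` on a temporary file, and `system.time` stands in for microbenchmark to avoid a package dependency. As a result its timing vectors are plain numerics, so it should not be assumed to work as a drop-in for getStats. Note also that `readRDS` has no partial-read facility, so the restricted ingestion reads the whole file and then subsets.

```r
# Hypothetical round-trip benchmarker using an RDS file (base R only).
rdsRoundTrip <- function(x, rows = seq_len(min(1000, nrow(x))), intimes = 1) {
  # Phase 1: setup, using a fresh temporary file removed on exit.
  f <- tempfile(fileext = ".rds")
  on.exit(unlink(f))
  # Helper: elapsed seconds for each of intimes runs of fun().
  timeReps <- function(fun) {
    vapply(seq_len(intimes),
           function(i) system.time(fun())[["elapsed"]], numeric(1))
  }
  # Phase 2: benchmark data export.
  mw <- timeReps(function() saveRDS(x, f))
  # Phase 3: benchmark full ingestion and a row-restricted ingestion.
  mr <- timeReps(function() readRDS(f))
  ysel <- NULL
  msel <- timeReps(function() ysel <<- readRDS(f)[rows, , drop = FALSE])
  # Check that the restricted read recovered the right block.
  stopifnot(isTRUE(all.equal(ysel, x[rows, , drop = FALSE])))
  list(mwrite = mw, ingFull = mr, ing1K = msel, times = intimes,
       method = "rds")
}

set.seed(1234)
x <- array(rnorm(20 * 10), dim = c(20, 10))
res <- rdsRoundTrip(x, rows = 11:20, intimes = 2)
str(res)
```

The returned list mirrors the element names used by .h5RoundTrip (mwrite, ingFull, ing1K, times, method), which is what lets a harness treat different backends uniformly.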