When is R Slow, and Why?
Computational speed is a common complaint lodged against R.
Some recent posts on
r-bloggers.com have compared
the speed of R with some other programming languages [
1], and showed the
favorable impact of the new compiler package on run-times [
2]. I and
others have written about using Rcpp to easily write C++ functions to speed-up bottlenecks
in R [
3,
4]. With the new Rcpp attributes framework,
writing fully vectorized C++ functions and incorporating them in R code is now very easy [
5].
On a day-to-day basis, though, R's performance is largely a function of coding style.
R allows novices users to write horribly inefficient code [
6]
that produces the correct answer (eventually). Yet by failing to utilize vectorization
and pre-allocation of data structures, naive R code can be many orders of magnitude slower than
need be.
R-help is littered with the
tears of novices, and there's even a (fantastic)
parody of Dante's Inferno outlining the common "Deadly Sins of R" [
7].
Problem Statement: Appending to Ragged Arrays
I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for,
and that I'm sure others
have encountered. What is the "R way" to vectorize computations on ragged array?
One example of a ragged array is
a list of vectors that have varying and different lengths. Say you need to dynamically grow
many vectors by varying lengths
over the course of a stochastic simulation. Using a simple tool like
lapply
, the entire data
structure will be allocated anew with every assignment. This problem is briefly touched
on in the official
Introduction
to R documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing
may be done implicitly and much more efficiently". But what if you're data *isn't* rectangular? How
might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?
The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum
possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix
assignment, and for each vector track the index of the start of free space.
If we try to write past the end of the matrix, R emits the appropriate error.
This method requires some book-keeping on our part. One nice addition would be
an S4 class with slots for the data matrix and the vector of free-space
indices, as well as methods to dynamically expand the data matrix and validate the object.
As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use
much less memory at the expense of slower access times [
8].
Here, we're using more memory than is strictly needed to achieve much faster access times.
Is pre-allocation and book-keeping worth the trouble?
object.size(matrix(1.5, nrow=1e3,
ncol=1e3))
shows that a data structure of 1,000 vectors, each of length approximately 1,000,
occupies about 8Mb of memory. Let's say I resize this structure 1,000 times. Now I'm looking
at almost a gigabyte of memory allocations. Perhaps you're getting a sense of what a terrible idea
it is to *not* pre-allocate a frequently-resized ragged list?
Three Solutions and Some Tests
Using the above logic, I prototyped a solution as an R function, and then transcribed the result into
a C++ function (boundary checks are important in C++).
The result is three methods: a "naive" list-append method, an R method that uses matrix assignment,
and a final C++ method that modifies the pre-allocated matrix in-place.
In C++/Rcpp, functions can use pass-by-reference semantics [
9],
which can have major speed advantages by
allowing functions to modify their arguments in-place. Full disclosure: pass-by-reference semantics requires
some caution on the user's part. Pass-by-reference is very different from R's "function
programming" semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit
assignment call is required to modify an object [
10].
I added a unit test to ensure identical results between all three methods, and then used the fantastic
rbenchmark package to time each solution.
As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++
pre-allocation methods are close in performance. Only with more iterations and larger
data structures does the C++ method really start to pull ahead. And by that time, the naive R
method takes *forever*.
Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices)
is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say.
lapply()
is
conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this
simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution
using
lapply()
and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting
(large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.
Code
Note: you can find the full code
here and
here.
Setup: two helper functions are used to generate ragged arrays via random
draws. First, draws from the negative binomial distribution determine the length of the each new vector (with a
minimum length of 1,
gen.lengths()
), and draws from the normal distribution fill each
vector with data (
gen.dat()
).
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
## a vector of random draws
pmax(1, rnbinom(particles, ntrials, prob))
}
gen.dat <- function(nelem, rate=1) {
## a list of vectors, vector i has length nelem[i]
## each vector is filled with random draws
lapply(nelem, rexp, rate=rate)
}
Three solutions: a naive
lapply()
method, followed by pre-allocation
in R.
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
## grow dat by appending to list element i
## memory will be reallocated at each call
dat <- mapply( append, dat, new.dat )
## update lengths
dat.lengths <- dat.lengths + new.lengths
return(list(dat=dat, len=dat.lengths))
}
## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
## grow pre-allocated dat by inserting data in the correct place
for (irow in 1:length(new.dat)) {
## insert one vector at a time
## col indices for where to insert new.dat
cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
dat[irow, cols.ii] = new.dat[[irow]]
}
## update lengths
dat.lengths <- dat.lengths + new.lengths
return(list(dat=dat, len=dat.lengths))
}
Next, the solution as a C++ function. This goes in a separate file that I'll call
helpers.cpp
(compiled below).
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
void appendRcpp( List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
// "append" fill oldmat w/
// we will loop through rows, filling retmat in with the vectors in list
// then update retmat_size to index the next free
// newLenths isn't used, added for compatibility
NumericVector fillTmp;
int sizeOld, sizeAdd, sizeNew;
// pull out dimensions of matrix to fill
int nrow = retmat.nrow();
int ncol = retmat.ncol();
// check that dimensions match
if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) {
throw std::range_error("In appendC(): dimension mismatch");
}
for (int ii = 0; ii= ncol) {
throw std::range_error("In appendC(): exceeded max cols");
}
// iterator for row to fill
NumericMatrix::Row retRow = retmat(ii, _);
// fill row of return matrix, starting at first non-zero elem
std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
// update size of retmat
retmatLengths[ii] = sizeNew;
}
}
Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each
solution with identical data will be used for timing.
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
## list of empty vectors, fill with append
dat.list <- lapply(1:particles, function(x) numeric())
## preallocated matrix, fill rows from left to right
dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
## length of each element/row
N.list <- N.r <- N.c <- rep(0, particles)
## repeat process, "appending" as we go
for (ii in 1:nrep) {
N.new <- gen.lengths(particles)
dat.new <- gen.dat(N.new)
## in R, list of vectors
tmp <- appendL(dat.new, N.new, dat.list, N.list)
## unpack, update
dat.list <- tmp$dat
N.list <- tmp$len
## in R, preallocate
tmp <- appendR(dat.new, N.new, dat.r, N.r)
## unpack, update
dat.r <- tmp$dat
N.r <- tmp$len
## as above for C, modify dat.c and N.c in place
appendRcpp(dat.new, N.new, dat.c, N.c)
}
## pull pre-allocated data back into list
dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
## check that everything is
identical(dat.r, dat.c) && identical(N.r, N.c) &&
identical(dat.list, dat.r.list) && identical(N.list, N.r)
}
## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
## object to modify
N.test <- rep(0, particles)
dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
## speed is affected by size,
## so ensure that each run the same elements
set.seed(seed)
for (irep in 1:nrep) {
## generate draws
N.new <- gen.lengths(particles)
dat.new <- gen.dat(N.new)
## bind in using given method
tmp <- append.fun(dat.new, N.new, dat.test, N.test)
if(do.update) {
## skip update for C
dat.test <- tmp$dat
N.test <- tmp$len
}
}
}
Finally, we run it all:
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")
## run unit test, verify both functions return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)
## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
replications=10
)
## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
replications=1e1
)
Benchmark results show that the list-append method is 500 times slower than the improved
R method, and 1,000 times slower than the C++ method (
timings
).
As we move to larger data structures
(
timings.fast
), the advantage of modifying in-place with C++ rather than having to
explicitly assign the results quickly add up.
> timings
test replications elapsed relative user.self sys.self
2 c 10 0.057 1.000 0.056 0.000
3 list 10 52.792 926.175 52.674 0.036
1 r 10 0.128 2.246 0.123 0.003
> timings.fast
test replications elapsed relative user.self sys.self
2 c 10 0.684 1.000 0.683 0.000
1 r 10 24.962 36.494 24.934 0.027
References
[1] How slow is R really?
[2] Speeding
up R computations Pt II: compiling
[3] Efficient loops in R - the complexity versus speed trade-off
[4] Faster (recursive) function calls: Another quick Rcpp case study
[5] Rcpp attributes: A simple example 'making pi'
[6] StackOverflow: Speed up the loop operation in R
[7] The R Inferno
[8] Sparse matrix formats: pros and cons
[9] Advanced R: OO field guide: Reference Classes
[10] Advanced R: Functions: Return Values