Labels

R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

03 July 2014

Efficient Ragged Arrays in R and Rcpp

When is R Slow, and Why?

Computational speed is a common complaint lodged against R. Some recent posts on r-bloggers.com have compared the speed of R with some other programming languages [1], and showed the favorable impact of the new compiler package on run-times [2]. I and others have written about using Rcpp to easily write C++ functions to speed-up bottlenecks in R [3,4]. With the new Rcpp attributes framework, writing fully vectorized C++ functions and incorporating them in R code is now very easy [5].

On a day-to-day basis, though, R's performance is largely a function of coding style. R allows novices users to write horribly inefficient code [6] that produces the correct answer (eventually). Yet by failing to utilize vectorization and pre-allocation of data structures, naive R code can be many orders of magnitude slower than need be. R-help is littered with the tears of novices, and there's even a (fantastic) parody of Dante's Inferno outlining the common "Deadly Sins of R" [7].

Problem Statement: Appending to Ragged Arrays

I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for, and that I'm sure others have encountered. What is the "R way" to vectorize computations on ragged array? One example of a ragged array is a list of vectors that have varying and different lengths. Say you need to dynamically grow many vectors by varying lengths over the course of a stochastic simulation. Using a simple tool like lapply, the entire data structure will be allocated anew with every assignment. This problem is briefly touched on in the official Introduction to R documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing may be done implicitly and much more efficiently". But what if you're data *isn't* rectangular? How might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?

The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix assignment, and for each vector track the index of the start of free space. If we try to write past the end of the matrix, R emits the appropriate error. This method requires some book-keeping on our part. One nice addition would be an S4 class with slots for the data matrix and the vector of free-space indices, as well as methods to dynamically expand the data matrix and validate the object. As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use much less memory at the expense of slower access times [8]. Here, we're using more memory than is strictly needed to achieve much faster access times.

Is pre-allocation and book-keeping worth the trouble? object.size(matrix(1.5, nrow=1e3, ncol=1e3)) shows that a data structure of 1,000 vectors, each of length approximately 1,000, occupies about 8Mb of memory. Let's say I resize this structure 1,000 times. Now I'm looking at almost a gigabyte of memory allocations. Perhaps you're getting a sense of what a terrible idea it is to *not* pre-allocate a frequently-resized ragged list?

Three Solutions and Some Tests

Using the above logic, I prototyped a solution as an R function, and then transcribed the result into a C++ function (boundary checks are important in C++). The result is three methods: a "naive" list-append method, an R method that uses matrix assignment, and a final C++ method that modifies the pre-allocated matrix in-place. In C++/Rcpp, functions can use pass-by-reference semantics [9], which can have major speed advantages by allowing functions to modify their arguments in-place. Full disclosure: pass-by-reference semantics requires some caution on the user's part. Pass-by-reference is very different from R's "function programming" semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit assignment call is required to modify an object [10].

I added a unit test to ensure identical results between all three methods, and then used the fantastic rbenchmark package to time each solution. As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++ pre-allocation methods are close in performance. Only with more iterations and larger data structures does the C++ method really start to pull ahead. And by that time, the naive R method takes *forever*.

Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices) is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say. lapply() is conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution using lapply() and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting (large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.

Code

Note: you can find the full code here and here.

Setup: two helper functions are used to generate ragged arrays via random draws. First, draws from the negative binomial distribution determine the length of the each new vector (with a minimum length of 1, gen.lengths()), and draws from the normal distribution fill each vector with data (gen.dat()).
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
    ## a vector of random draws
    pmax(1, rnbinom(particles, ntrials, prob))
}
gen.dat <- function(nelem, rate=1) {
    ## a list of vectors, vector i has length nelem[i]
    ## each vector is filled with random draws
    lapply(nelem, rexp, rate=rate)
}

Three solutions: a naive lapply() method, followed by pre-allocation in R.
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow dat by appending to list element i
    ## memory will be reallocated at each call
    dat <- mapply( append, dat, new.dat )
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}

## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow pre-allocated dat by inserting data in the correct place
    for (irow in 1:length(new.dat)) {
        ## insert one vector at a time
        ## col indices for where to insert new.dat
        cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
        dat[irow, cols.ii] = new.dat[[irow]]
    }
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}


Next, the solution as a C++ function. This goes in a separate file that I'll call helpers.cpp (compiled below).
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
void appendRcpp(  List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
    // "append" fill oldmat w/  
    // we will loop through rows, filling retmat in with the vectors in list
    // then update retmat_size to index the next free
    // newLenths isn't used, added for compatibility
    NumericVector fillTmp;
    int sizeOld, sizeAdd, sizeNew;
    // pull out dimensions of matrix to fill
    int nrow = retmat.nrow();
    int ncol = retmat.ncol();
    // check that dimensions match
    if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) { 
        throw std::range_error("In appendC(): dimension mismatch");
    }
    for (int ii = 0; ii= ncol) {
            throw std::range_error("In appendC(): exceeded max cols");
        }
        // iterator for row to fill
        NumericMatrix::Row retRow = retmat(ii, _);
        // fill row of return matrix, starting at first non-zero elem
        std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
        // update size of retmat
        retmatLengths[ii] = sizeNew;
    }
}


Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each solution with identical data will be used for timing.
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
    ## list of empty vectors, fill with append
    dat.list <- lapply(1:particles, function(x) numeric())
    ## preallocated matrix, fill rows from left to right
    dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## length of each element/row
    N.list <- N.r <- N.c <- rep(0, particles)
    ## repeat process, "appending" as we go
    for (ii in 1:nrep) {
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## in R, list of vectors
        tmp <- appendL(dat.new, N.new, dat.list, N.list)
        ## unpack, update
        dat.list <- tmp$dat
        N.list <- tmp$len
        ## in R, preallocate
        tmp <- appendR(dat.new, N.new, dat.r, N.r)
        ## unpack, update
        dat.r <- tmp$dat
        N.r <- tmp$len
        ## as above for C, modify dat.c and N.c in place
        appendRcpp(dat.new, N.new, dat.c, N.c)
    }
    ## pull pre-allocated data back into list
    dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
    ## check that everything is  
    identical(dat.r, dat.c) && identical(N.r, N.c) &&
    identical(dat.list, dat.r.list) && identical(N.list, N.r)
}

## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
    ## object to modify
    N.test <- rep(0, particles)
    dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## speed is affected by size, 
    ## so ensure that each run the same elements
    set.seed(seed)
    for (irep in 1:nrep) {
        ## generate draws
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## bind in using given method
        tmp <- append.fun(dat.new, N.new, dat.test, N.test)
        if(do.update) {
            ## skip update for C
            dat.test <- tmp$dat
            N.test <- tmp$len
        }
    }
}


Finally, we run it all:
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")

## run unit test, verify both functions return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)

## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
    replications=10
)

## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    replications=1e1
)


Benchmark results show that the list-append method is 500 times slower than the improved R method, and 1,000 times slower than the C++ method (timings). As we move to larger data structures (timings.fast), the advantage of modifying in-place with C++ rather than having to explicitly assign the results quickly add up.
> timings
  test replications elapsed relative user.self sys.self
2    c           10   0.057    1.000     0.056    0.000
3 list           10  52.792  926.175    52.674    0.036
1    r           10   0.128    2.246     0.123    0.003

> timings.fast
  test replications elapsed relative user.self sys.self
2    c           10   0.684    1.000     0.683    0.000
1    r           10  24.962   36.494    24.934    0.027

References


[1] How slow is R really?
[2] Speeding up R computations Pt II: compiling
[3] Efficient loops in R - the complexity versus speed trade-off
[4] Faster (recursive) function calls: Another quick Rcpp case study
[5] Rcpp attributes: A simple example 'making pi'
[6] StackOverflow: Speed up the loop operation in R
[7] The R Inferno
[8] Sparse matrix formats: pros and cons
[9] Advanced R: OO field guide: Reference Classes
[10] Advanced R: Functions: Return Values

No comments:

Post a Comment