R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

10 May 2011

Problems with plyr -- the memory/complexity trade-off

Two types of R users

My overwhelming impression from UseR 2010 is that, generally speaking, there are 2 types of regular R users -- those who have heard and are made uncomfortable by the idea of the *apply() functions, and those who really get it. In UNM R programming group that I've been leading for about a year now, I've really tried to get people over the hump and into the second group. Once there, folks seem to really appreciate the amazing power of vectorization in R, and begin to enjoy writing code. The conceptual clarity of:

mymatrix = matrix(1:10, nrow=10)
  apply(mymatrix, 1, sum)
  apply(mymatrix, 2, sum)

over the clunky:


may not be immediately apparent. Eventually, though, folks go searching for things like rowMedian() and become frustrated that R doesn't "have all the functions" that they need. Well, R does, once you grok apply().

Hadley's Magic Brainchild

In the last year, I've had some serious ahah! moments with plyr. Lately, I've added the reshape package to the mix to achieve some serious R 1-liner Zen. Multidimensional arrays to long-form dataframes?

myarray = array(0, dim=c(3,4,5,6,10), dimnames=list(a=1:3, b=1:4, c=1:5, d=letters[1:6], e=LETTERS[1:10]))
  melt( adply(myarray, 1:4, I))

Sure. No problem. It's really that easy?!

An unexpeceted bonus? This way of thinking lends itself nicely to thinking of explicit parallelism. If you can describe the problem as a single recipe that's done for each "piece" of a whole, then you're one step away from using all the cores on your machine to solve a problem with foreach. Need to apply some sort of long-running analysis to each element of a list, and want to return the results as a list? Why write a for loop when you can do:

## prep
## warning -- this  may not be fast
## use smaller matrices for smaller machines! 
  nn = 2^20
  rr = 2^10
  mylist = list(a=matrix(rnorm(nn), nrow=rr), b=matrix(rnorm(nn), nrow=rr), c=matrix(rnorm(nn), nrow=rr))

## analysis
  myeigs = foreach( ii = iter(mylist), .combine=c) %do% { print('ping'); return(list(eigen(ii)))}

and then, once it works, change %do% to %dopar%, add the following, and you're off to the races!

  ## use the appropriate # of cores, of course
  myeigs1 = foreach( ii = iter(mylist), .combine=c) %dopar% { print('ping'); return(list(eigen(ii)))}

Compared to something like llply, your dimnames don't automatically propagate to the results, but I think this is still pretty amazing. Complete control and simple parallelism.

Debugging with %dopar% is tricky, of course, because there are separate stacks for each item (i think), and messages (such as calls to print()) don't return to the console as you might expect them to. So, when in doubt, back off to %do%.

What could possibly go wrong?

The only problem with all of this is, when tasks are embarassingly parallel, that data also becomes embarassingly parallel to point where it no longer fits into memory. Thus, I returned today to a long-running bootstrap computation to find R consuming ~7GB RAM, 5+ GB swap, and this message:

Error: cannot allocate vector of size 818.6 Mb
Enter a frame number, or 0 to exit   
4: mk.bootrep(zz, 75)
5: mk.boot.R#165: aaply(tmpreps, .margins = 1, function(x) {
6: laply(.data = pieces, .fun = .fun, ..., .progress = .progress, .drop = .dro
7: list_to_array(res, attr(.data, "split_labels"), .drop)
8: unname(unlist(res))

What's happening is that plyr is trying to do everything at once. As anyone who's used grep can tell you, doing one row at a time, or streaming data is often a much better idea. I got the intended results from above by pre-allocating an array and writing each item of my results list into the array in a loop in seconds, and barely broke 3 GB of RAM usage.

Now, nothing here is really news. The dangers of "Growing Objects" is covered in Circle 2 of Burns Statistics' wonderful R Inferno. Still, plyr strikes me as an interesting case where reducing conceptual complexity can lead to a rather steep increase in computational complexity. And the most interesting thing of all is that it happens quite suddenly above a certain threshold.

Parting thoughts

I wonder if there's any major barrier to a stream=TRUE argument to the plyr functions -- I haven't thought about it too much, but imagine that you'd also need a finalizer function to prepare the return object to be written/streamed into. At what point is it easier to just do by hand with a for loop?

Honestly, I don't know the answer. I don't do too many things that break plyr, but I've learned how important it is to understand when I'm likely exceed its limits. .variables in ddply is another one that I've learned to be careful with. If, after subdividing your input data.frame, you end up with 1e5 or 1e6 pieces, things start to break down pretty fast.

Honestly, I love writing solutions with the *apply() family and the ddply functions. I think it makes code cleaner and more logical. Watching the light of this dawn in other R users' eyes is truly exciting. Yet it pays to remember that, like all things, it has its limits. In this case, one seems to reach the limits suddenly and harshly.