
02 October 2012

Modeling Philosophy

I've found it sometimes difficult to explain my modeling philosophy, especially in the face of "why don't you just do X?" or "isn't that just like Y and Z?" responses. Without going into X, Y, or Z in the slightest, I present the following quote from one of my intellectual heroes. His short book (less than 100 pages), packed with equations that I struggle to make sense of (not unlike going to the opera), is full of little morsels of wisdom. Here's one of my favorites:
It need hardly be repeated that no *detailed* statistical agreement with observation is to be expected from models which are based on admitted simplifications. But the situation is quite analogous to others where complex and interacting statistical systems are under consideration. What is to be looked for is a comparable pattern with the main features of the phenomena; when, as we shall see, the model predicts important features which are found to conform to reality, it becomes one worthy of further study and elaboration.
--M.S. Bartlett, 1960,
Stochastic Population Models in Ecology and Epidemiology

29 February 2012

Custom Amazon EC2 config for Rstudio

Introduction

This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future.

To start, I followed the instructions here to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the bioconductor folks for setting up this image.

Edit -- another look

After playing with this setup more and trying to shrink the root partition (as per
these instructions), I realized there's a tremendous amount of cruft in this AMI. It's over 20 GB, there's a weird (big) /downloads/ folder lying around, and just about every R package ever released lives in /usr/local/lib64/R. I cut it down to ~6 GB by removing these two directories.


If I were to do this again, I would use the official Canonical images (more details here), and manually install what I need. Honestly, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain of the image. Aside from installing a few extra packages (Rstudio server, for example), the rest of the instructions should be similar.

Modifications

I proceeded to lock it down -- add an admin user, prevent root login, etc.
Then I set up apache2 to serve Rstudio server over HTTPS/SSL.
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).

Below is the commandline session that I used to do it, along with annotations and links to some files.
adduser myusername
adduser myusername sudo   ## add to the sudo-capable group ("admin" on older Ubuntu releases)
## add correct key to ~myusername/.ssh/authorized_keys

vi /etc/ssh/sshd_config 
## disable root login: set PermitRootLogin no
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root


Next, I set up dynamic dns using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- it's very self-explanatory.

## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart

## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http

## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
'/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'

## change permissions, move to default ubuntu locations
sudo chmod 444 crocus.crt
sudo chown root.ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates

sudo chmod 640 crocus.key
sudo chown root.ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/

sudo a2enmod rewrite
sudo a2enmod ssl

sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart

You can see my full apache config file here:

Conclusions

Now I access Rstudio on the EC2 instance with a browser via:
https://myhostname.mydomain.org

I found that connecting to the Rstudio server web interface gave noticeable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.

Another annoyance was that Rstudio installs packages into something like ~R/libraries, whereas the commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.
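If anyone else runs into this, comparing .libPaths() from within each front-end should show which library trees they're using; setting R_LIBS_USER in ~/.Renviron ought to point both at the same user library (a quick sketch, untested on this particular image):

## run this in both the Rstudio session and commandline R and compare
.libPaths()                  ## all library locations, in search order
Sys.getenv("R_LIBS_USER")    ## the per-user default library location
## setting R_LIBS_USER in ~/.Renviron (e.g. R_LIBS_USER=~/R/library)
## should make both front-ends install to the same place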

Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.

All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.

28 February 2012

Adventures in R Studio Server: Apache2, Https, Security, and Amazon EC2.

I just put a fresh install of Ubuntu Server (10.04.4 LTS) on one of our machines.  As I was doing some post-install config, I accidentally installed Rstudio Server.  And subsequently fell down an exciting little rabbit-hole of server configuration and "ooooh-lala!" playtime.

A friend sang the wonders of Rstudio Server to me recently, and I filed it under "things to ignore for now".  Just another thing to learn, right?  Turns out, the Rstudio folks do *great* work and write good docs, so I hardly had to learn anything.  I just had to dust off my sysadmin skills and fire up some google.

I'm a little concerned about running web services on public-facing machines.  Even more so, given that R provides fairly low-level access to operating system services.  Still, I was impressed to see system user authentication.

I followed the docs for running apache2 as a proxy server, and learned a little about apache in the process.  Since I made it this far, I figured I'd run it through https/ssl, add some memory limitations, etc.  I'm still not entirely convinced this is secure -- it seems that running it in a virtual machine or chroot jail would be ideal.

On the other hand, I ran across this post on running Rstudio Server inside Amazon EC2 instances. Nighttime EC2 spot prices on "Quadruple Extra Large" instances (68.4 GB of memory, 8 virtual cores with 3.25 EC2 Compute Units each) fell below $1 an hour tonight, which is cheap enough to play with for an hour or two -- take it through some paces and see how well it does with a *very* *large* *job* or two. Instances can now be stopped and saved to EBS (elastic block storage), and so only need to be configured once, which really simplifies matters. In fact, I'm wondering if Rstudio (well, R, really) is my "killer app" for EC2.

Overall, I was really impressed at how fast and easy this was to get up and running. Fun times ahead!

29 October 2011

SQL Koan

It's not that profound, but I sure do like it: a simple, elegant example of a self-join that gives a truth table for the IS NOT DISTINCT FROM operator.

WITH x(v) AS (VALUES (1),(2),(NULL))
  SELECT l.v, r.v, l.v = r.v AS equality,
    l.v IS NOT DISTINCT FROM r.v AS isnotdistinctfrom
  FROM x l, x r;

Thanks to Sam Mason via the PostgreSQL - general mailing list.

28 October 2011

Why can't ads be more fun?

The worst part about advertisements these days, in my not-so-humble opinion, is the degree to which they belittle the user. Well-endowed lady wants to be my Facebook friend? "Click here to confirm request." Riiight. Lonely singles near me? Yeah, sure. OH GOOD LORD, A THREAT HAS BEEN IDENTIFIED! Oh, wait, y'all really are evil! On the infrequent occasions that I do watch TV, the ads are less flagrantly insidious, but they're nonetheless relentlessly patronizing.

The second-worst part of ads, IMNSHO, is the brute-force repetition. The same inane 30 seconds once? I can do that, but five times in an hour? You've got to be kidding me. Nope, no kidding here. Just 30 seconds of the same inanity, again and again.

Consequently, I find unique and intelligent ads rather inspiring. I know, it shouldn't be this way. One might think that unique and inspiring would be more common amongst something so prevalent. Anyway...

Some of the GEICO ads that have played on Hulu lately have managed to avoid the "worst part". They're short and playful, vaguely reminiscent of [adult swim] commercials. "Is the sword mightier than the pen" is my current favorite; I still sometimes chuckle when I see it. They still fail occasionally in the repetition department. If you're a large international corporation, how hard is it to make more than a handful of moderately-entertaining 30-second spots?


Which brings me back to banner ads. I just saw this at the top of ars technica today for the first time, and it really caught my eye. I don't read or write Javascript, but I found myself puzzling through it and then following the link.


At the top of the page, I find this banner, which visually reminds me where I came from, that I've come to the right place, and gives me another little puzzle. Neat. Thank you, Heinz College of Carnegie Mellon.


None of this is revolutionary, but I do hope it is evolutionary. Ads don't have to be boring, and I'm guessing that interesting and intelligent ads are more effective even as they're less painful and insulting. My guess is that a generational change-over in marketing departments and their managers is underway, and will be slow. Still, I look forward to a new generation of more modern advertising approaches that gives me something to think about while my eyeballs are held hostage.

06 October 2011

FFT / Power Spectrum Box-and-Whisker Plot with ggplot2

I have a bunch of time series whose power spectra (FFT via R's spectrum() function) I've been trying to visualize in an intuitive, aesthetically appealing way. At first I just used lattice's bwplot, but the spacing of the X-axis really matters here. The spectra's frequencies aren't the regularly-spaced categories that bwplot expects. If all the series are of the same length, then the coefficients are all estimated at the same frequencies, and one can use plyr's split-apply-combine magic to compute the distribution of coefficients at each frequency.

I spent some time trying to faithfully reproduce the plotting layout of the figures produced by, e.g., spectrum(rnorm(1e3)) with respect to the axes. This was way more annoying than I expected...

To include a generic example, I started looking around for easily generated, interesting time series. The logistic map is so damned easy to compute and generally pretty interesting, capable of producing a range of frequencies. Still, I was rather surprised by the final output. Playing with these parameters produces a range of serious weirdness -- apparent ghosting, interesting asymmetries, etc.

Note that I've just used the logistic map for an interesting, self-contained example. You can use the exact same code to generate your own spectrum() box-and-whisker plot by substituting your matrix of time series for mytimeseries below. Do note here that series are by row, which is not exactly standard. For series by column, just change the margin from 1 to 2 in the call to adply below.

## box-and-whisker plot of FFT power spectrum by frequency/period
## vary these parameters -- r as per logistic map
## see http://en.wikipedia.org/wiki/Logistic_map for details
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) {
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
    }
    return(ret)
}

## modify these for interesting behavior differences
rfrom = 3.4
rto = 3.7
rsteps = 200
seqlen = 1e4

require(plyr)
require(ggplot2)

mytimeseries = aaply(seq(from=rfrom, to=rto, length.out=rsteps), 1,
    function(x) {
        logmap(seqlen, x)
    })

## you can plug in any array here for mytimeseries
## each row is a timeseries
## for series by column, change the margin from 1 to 2, below
logspec = adply( mytimeseries, 1, function(x) {
       ## change spec.pgram parameters as per goals
       ret = spectrum(x, taper=0, demean=T, pad=0, fast=T, plot=F)
       return( data.frame(freq=ret$freq, spec=ret$spec))
})

## boxplot stats at each unique frequency
logspec.bw = ddply(logspec, 'freq', function(x) {
           ret = boxplot.stats(log10(x$spec));
           ## sometimes $out is empty, bind NA to keep dimensions correct
           return(data.frame(t(10^ret$stats), out=c(NA,10^ret$out)))
})

## plot -- outliers are dots
## be careful with spacing of hard returns -- ggplot has odd rules
## as close to default "spectrum" plot as possible
ggplot(logspec.bw, aes(x=1/(freq))) +
    geom_ribbon(aes(ymin=X1, ymax=X5), alpha=0.35, fill='green') +
    geom_ribbon(aes(ymin=X2, ymax=X4), alpha=0.5, fill='blue') +
    geom_line(aes(y=X3)) +
    geom_point(aes(y=out), color='darkred') +
    scale_x_continuous(trans=c('inverse'), name='Period') +
    scale_y_continuous(name='Power', trans='log10')


Following my nose further on the logistic map example, I also used the animation package to make a movie that walks through the power spectrum of the map for a range of r values. Maybe one day I'll set this to music and post it to YouTube! Though I think some strategic speeding up and slowing down in important parameter regions is warranted for a truly epic tour of the logistic map's power spectrum.

require(animation)
saveVideo({
       ani.options(interval = 1/5, ani.type='png', ani.width=1600, ani.height=1200)
        for (ii in c(seq(from=3.2, to=3.5, length.out=200), seq(from=3.5, to=4, length.out=1200))) {
            spectrum(logmap(1e4, ii), taper=0, demean=T, pad=0, fast=T, sub=round(ii, 3))
        }
},  video.name = "logmap2.mp4", other.opts = "-b 600k")

Bat Country

I've spent a lot of time thinking about and using R's spectrum() function and the Fast Fourier Transform (FFT) in the last 5+ years. Lately, they've begun to remind me a little of a Theremin: simple to use, difficult to master.

While prepping a figure for the R Graph Gallery (which is awesome, by the way), I ran across this curious example -- its striking visual appearance was definitely not what I was expecting.


I decided to use the logistic map to generate an ensemble of time series with a range of frequencies. The logistic map goes through characteristic period doubling, and then exhibits broad-spectrum "noise" at the onset of chaos. And it's really easy to compute. So, I can tune my time series to have a range of frequency distributions.

In this example, I'm in the periodic regime (r=3.5, period 4). So why am I getting this Batman motif?

Actually, it's weirdly simple. This is an artifact of the (default) tapering that R's spectrum() function applies to the series before the FFT is computed. In theory, the Fourier Transform assumes an infinite-length sequence, while in practice the FFT assumes the series is circular -- in essence, gluing the beginning and end of the series together to make a loop. If the series is not perfectly periodic, though, this gluing introduces a sharp discontinuity. Tapering is typically applied to reduce or control this discontinuity, so that both ends gently decline to zero. When the series is genuinely periodic, though, the taper has some weird effects: in essence, it transfers power away from the fundamental frequencies in a curious way. Note that the fundamental period is 4, as seen by the peak at frequency 1/4, with a strong harmonic at period 2, frequency 1/2.
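To see what the taper actually does to the ends of a series, stats::spec.taper() can be applied directly (a small illustrative sketch using a stand-in period-4 sinusoid rather than the logistic map):

## illustrative sketch: the default 10% split-cosine-bell taper applied
## to a perfectly periodic series (spec.pgram() does this internally when taper=0.1)
x  <- sin(2 * pi * (1:200) / 4)
xt <- spec.taper(x, p = 0.1)
plot(x, type='l', col='grey', ylab='value')
lines(xt, col='darkred')      ## both ends are smoothly pulled toward zero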

Zero-padding has a similar effect. If your series is zero (or relatively small) at both ends, then you could glue it together without introducing discontinuities, but peaks near the beginning and the end will still be seen as nearby, resulting in spurious peaks. In any case, R pads your series by default (fast=T) unless the length of the series is highly composite (i.e. it can be factored into many small divisors). Below, it turns out that a series of length 1e5 is composite enough that R doesn't pad it. 1e5+1? Totally different results.
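If I read the docs right, spec.pgram() pads up to nextn(length(x)) when fast=TRUE, so you can check ahead of time whether a given length will be padded (a quick sketch):

## how much would fast=TRUE pad each series length?
nextn(1e5)       ## 1e5 = 2^5 * 5^5 is already highly composite: no padding
nextn(1e5 + 1)   ## not composite: the series would be padded up to this length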

If I understand this correctly, both padding and tapering are attempts to reduce or control spectral leakage. We've constructed an example where no spectral leakage is expected from the original time series, which closely fits the assumptions of the FFT without modification.

The moral? Know thy functions (and their default parameter values, and their effects)!

## using R -- 
## logistic map function:
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) { 
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
    }
    return(ret)
}

## compare:
spectrum(logmap(1e5, 3.5), main='R tapers to 0.1 by default')
spectrum(logmap(1e5, 3.5), taper=0, main='No taper')
spectrum(logmap(1e5+1, 3.5), taper=0, main='Non-composite length, fast=T pads by default')
spectrum(logmap(1e5, 3.5), taper=0.5, sub='Heavy tapering approximately recovers "correct" results')

18 June 2011

Efficient loops in R -- the complexity versus speed trade-off

I've written before about the up- and downsides of the plyr package -- I love its simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building an agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections.

My ballpark estimate was 1e6 agents using a weekly timestep for about 50 years. This is a stochastic model, so I'd like to replicate it at least 100 times. So, I'll need at least 50 years * ~50 weeks * 100 replicates = 250,000 steps. I figured I'd be happy if I could get my per-step runtime down to ~1 second -- dividing the 100 runs over 4 cores, this would give a total runtime of ~17.5 hours. Not short, but not a problem.

At first, I was disheartened to see the runtime of my prototype step function stretching into the minutes. What's going on? Well, I'd used plyr in a very inappropriate way -- for a very large loop. I began to investigate, and discovered that writing an aesthetically unappealing while loop gave me a 30+x speed-up.

All of which got me thinking -- how expensive are loops and function calls in R? Next week I'm leading a tutorial in R and C++ using the wonderful Rcpp and inline packages here at Santa Fe Institute's 2011 Complex Systems Summer School. Might this make a nice example?

It does, and in spades. Below are the test functions; you can see that the code complexity increases somewhat from one to the next, but the outcomes are identical, again with a 30+x speedup for each subsequent case. Here, I'm using the native R API. I also tested using Rcpp to import rbinom(), but that ended up taking twice as long as the naive while loop.

So, the moral of the story seems to be that if you can write a long loop in pure C++, it's a really easy win.

Note -- The as< double >(y); in src below doesn't seem to copy-and-paste correctly for some reason. If testfun2 doesn't compile, check to make sure this bit pasted correctly.


The pure R function definitions

## use aaply -- the simplest code
require(plyr)
testfun0 <- function(x, y)  aaply(x, 1, function(x) rbinom(1, x, y))

## rewrite the above as an explicit loop
testfun1 = function(nrep, x, y) {
    ## loop index
    ii<-0;
    ## result vector
    ret<-rep(0, nrep);
    while(ii < nrep) {
        ii<-ii+1;
        ## how many successes for each element of bb?
        ret[ii] <- rbinom(1, x[ii], y)
    }
    return(ret)
}

Rcpp function definitions (with lots of comments)

## define source code as string, pass to inline
src <-  ' 
    // Cast input parameters to correct types
    // clone prevents direct modification of the input object
    IntegerVector tmp(clone(x));
    double rate = as< double >(y); 
    // IntegerVector inherits from STL vector, so we get standard STL methods
    int tmpsize = tmp.size(); 
    // initialize RNG from R set.seed() call
    RNGScope scope; 
    // loop
    for (int ii =0; ii < tmpsize; ii++) {
        // Rf_rbinom is in the R api
        // For more details, See Rcpp-quickref.pdf, "Random Functions"
        // also, Rcpp provides "safe" accessors via, e.g., tmp(ii)
        tmp(ii) = Rf_rbinom(tmp(ii), rate);
    };
    // tmp points to a native R object, so we can return it as-is
    // if we wanted to return, e.g., the double rate, use:
    // return wrap(rate)
    return tmp;
'

require(inline)
## compile the function, inspect the process with verbose=T
testfun2 = cxxfunction(signature(x='integer', y='numeric'), src, plugin='Rcpp', verbose=T)

## timing function
ps <- function(x) print(system.time(x))

The tests

## Input vector
bb <- rbinom(1e5, 20, 0.5)
## test each case for 2 different loop lengths 
for( nrep in c(1e4, 1e5)){
    ## initialize RNG each time with the same seed
    ## plyr
    set.seed(20); ps(cc0<- testfun0(bb[1:nrep], 0.1))
    ## explicit while loop
    set.seed(20); ps(cc1<- testfun1(nrep, bb, 0.1))
    ## Rcpp
    set.seed(20); ps(cc2<- testfun2(bb[1:nrep], 0.1))
    print(all.equal(cc1, cc2))
}


## output
   user  system elapsed 
  3.392   0.020   3.417 
   user  system elapsed 
  0.116   0.000   0.119 
   user  system elapsed 
  0.000   0.000   0.002 
[1] TRUE
   user  system elapsed 
 37.534   0.064  37.601 
   user  system elapsed 
  1.228   0.000   1.230 
   user  system elapsed 
  0.020   0.000   0.021 
[1] TRUE

Postlude

After posting this to the very friendly Rcpp-devel mailing list, I got an in-depth reply from Douglas Bates pointing out that, in this case, the performance of vanilla apply() beats the while loop by a narrow margin. He also gives an interesting example of how to use an STL template and an iterator to achieve the same result. I admit that templates are still near-magic to me, and for now I prefer the clarity of the above. Still, if you're curious, this should whet your appetite for some of the wonders of Rcpp.
## Using Rcpp, inline and STL algorithms and containers
 
 ## use the std::binary_function templated struct
inc <-  '
  struct Rb : std::binary_function< double, double, double > {
     double operator() (double size, double prob) const { return ::Rf_rbinom(size, prob); }
  };
'

## define source code as a character vector, pass to inline
 src <-  ' 
     NumericVector sz(x);
     RNGScope scope;
 
     return NumericVector::import_transform(sz.begin(), sz.end(), std::bind2nd(Rb(), as< double >(y)));
'
 
## compile the function, inspect the process with verbose=TRUE
f5 <- cxxfunction(signature(x='numeric', y='numeric'), src, 'Rcpp', inc, verbose=TRUE)

11 June 2011

The importance of being unoriginal (and befriending google)

In search of bin counts

I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I've never had any reason to go under the hood until now. Lately, I've been using the bin counts for things like Shannon entropy (in the very nice entropy package). I figured that binning and counting data would either be supported via a native, dedicated R package, or be quite simple to code. Not finding the former (base graphics' hist() uses .Call("bincounts"), which appears undocumented and has a boatload of arguments), I naively failed to search for a package and coded up the following.

myhist = function(x, dig=3)  {
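    ## NB: trunc() silently ignores the digits argument (see discussion below)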
    x=trunc(x, digits=dig);
    ## x=round(x, digits=dig);
    aa = bb = seq(0,1,1/10^dig);
    for (ii in 1:length(aa)) {
        aa[ii] = sum(x==aa[ii])
    };
    return(cbind(bin=bb, dens=aa/length(x)))
}


## random variates
test = sort(runif(1e4))
get1 = myhist(test)

Trouble in paradise

Truncate the data to a specified precision, and count how many fall in each bin. Well, first I tried round(x) instead of trunc(x), which sorta makes sense but gives results that I still don't understand. On the other hand, trunc(x) doesn't take a digits argument? WTF? Of course, I could use sprintf() to make a character string of known precision and convert back to numeric, but string-handling is waaaaaay too much computational overhead. Like towing a kid's red wagon with a Land Rover...
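For what it's worth, a simple scaling trick sidesteps the missing digits argument without any string handling (a minimal sketch):

## truncate to a fixed number of decimal places by scaling, truncating,
## and scaling back -- no string handling required
trunc_digits <- function(x, dig = 3) trunc(x * 10^dig) / 10^dig

trunc_digits(0.123456)   ## 0.123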

Dear Google...

An hour of irritation and confusion later, I ask google and, small wonder, the second search result links to the ash package that contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn't return the bin boundaries by default, but it's good enough for a quick-and-dirty empirical probability mass distribution.

To be fair, there's something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn't quite as simple as one first thought. On the other hand, sometimes we just want answers. When that's the case, asking google is a pretty good bet.

## their method
require(ash)
get2 = bin1(test, c(0,1), 1e3+1)$nc
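Worth noting as an aside: base R's hist() will also hand back bin counts without plotting, though its bin definitions differ slightly from bin1() above, so it's not an exact drop-in (a quick sketch):

## base-R alternative: counts and breaks, no plot
h    <- hist(test, breaks=seq(0, 1, by=1/1e3), plot=FALSE)
get3 <- h$counts      ## bin counts
head(h$breaks)        ## bin boundaries come along for free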

10 May 2011

Problems with plyr -- the memory/complexity trade-off

Two types of R users


My overwhelming impression from UseR 2010 is that, generally speaking, there are two types of regular R users -- those who have heard of the *apply() functions and are made uncomfortable by the idea of them, and those who really get it. In the UNM R programming group that I've been leading for about a year now, I've really tried to get people over the hump and into the second group. Once there, folks seem to really appreciate the amazing power of vectorization in R, and begin to enjoy writing code. The conceptual clarity of:

mymatrix = matrix(1:10, nrow=10)
  apply(mymatrix, 1, sum)
  apply(mymatrix, 2, sum)
 

over the clunky:

rowSums(mymatrix)
  colSums(mymatrix)

may not be immediately apparent. Eventually, though, folks go searching for things like rowMedian() and become frustrated that R doesn't "have all the functions" that they need. Well, R does, once you grok apply().
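For instance, the rowMedian() that folks go looking for is just one apply() call away (a trivial illustration):

## the "missing" rowMedian(), via apply()
mymatrix = matrix(1:20, nrow=4)
apply(mymatrix, 1, median)   ## one median per row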

Hadley's Magic Brainchild

In the last year, I've had some serious aha! moments with plyr. Lately, I've added the reshape package to the mix to achieve some serious R 1-liner Zen. Multidimensional arrays to long-form dataframes?

myarray = array(0, dim=c(3,4,5,6,10), dimnames=list(a=1:3, b=1:4, c=1:5, d=letters[1:6], e=LETTERS[1:10]))
  melt( adply(myarray, 1:4, I))

Sure. No problem. It's really that easy?!


An unexpected bonus? This way of thinking lends itself nicely to thinking of explicit parallelism. If you can describe the problem as a single recipe that's done for each "piece" of a whole, then you're one step away from using all the cores on your machine to solve a problem with foreach. Need to apply some sort of long-running analysis to each element of a list, and want to return the results as a list? Why write a for loop when you can do:

## prep
## warning -- this may not be fast
## use smaller matrices for smaller machines!
  require(foreach)
  require(iterators)
  nn = 2^20
  rr = 2^10
  mylist = list(a=matrix(rnorm(nn), nrow=rr), b=matrix(rnorm(nn), nrow=rr), c=matrix(rnorm(nn), nrow=rr))

## analysis
  myeigs = foreach( ii = iter(mylist), .combine=c) %do% { print('ping'); return(list(eigen(ii)))}

and then, once it works, change %do% to %dopar%, add the following, and you're off to the races!

require(multicore)
  require(doMC)
  ## use the appropriate # of cores, of course
  registerDoMC(cores=4)
  myeigs1 = foreach( ii = iter(mylist), .combine=c) %dopar% { print('ping'); return(list(eigen(ii)))}

Compared to something like llply, your dimnames don't automatically propagate to the results, but I think this is still pretty amazing. Complete control and simple parallelism.

Debugging with %dopar% is tricky, of course, because there are separate stacks for each item (I think), and messages (such as calls to print()) don't return to the console as you might expect them to. So, when in doubt, back off to %do%.

What could possibly go wrong?

The only problem with all of this is that, when tasks are embarrassingly parallel, the data also becomes embarrassingly parallel, to the point where it no longer fits into memory. Thus, I returned today to a long-running bootstrap computation to find R consuming ~7 GB RAM, 5+ GB of swap, and this message:

Error: cannot allocate vector of size 818.6 Mb
Enter a frame number, or 0 to exit   
...
4: mk.bootrep(zz, 75)
5: mk.boot.R#165: aaply(tmpreps, .margins = 1, function(x) {
6: laply(.data = pieces, .fun = .fun, ..., .progress = .progress, .drop = .dro
7: list_to_array(res, attr(.data, "split_labels"), .drop)
8: unname(unlist(res))

What's happening is that plyr is trying to do everything at once. As anyone who's used grep can tell you, doing one row at a time -- streaming the data -- is often a much better idea. I got the intended results from the above by pre-allocating an array and writing each item of my results list into the array in a loop: it finished in seconds, and barely broke 3 GB of RAM usage.
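The pattern is simple enough -- pre-allocate, then fill (a minimal sketch with made-up dimensions, standing in for the real bootstrap results):

## pre-allocate the result array, then fill it one slice at a time
## (nrep and the per-rep dimensions are invented for illustration)
nrep <- 75
out  <- array(NA_real_, dim=c(nrep, 10, 10))
for (ii in seq_len(nrep)) {
    ## each iteration writes its result into a pre-allocated slice, so memory
    ## use stays flat instead of growing as results are bound together
    out[ii, , ] <- matrix(rnorm(100), 10, 10)   ## stand-in for the real per-rep computation
}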

Now, nothing here is really news. The dangers of "Growing Objects" are covered in Circle 2 of Burns Statistics' wonderful R Inferno. Still, plyr strikes me as an interesting case where reducing conceptual complexity can lead to a rather steep increase in computational complexity. And the most interesting thing of all is that it happens quite suddenly above a certain threshold.

Parting thoughts

I wonder if there's any major barrier to a stream=TRUE argument to the plyr functions -- I haven't thought about it too much, but I imagine you'd also need a finalizer function to prepare the return object to be written/streamed into. At what point is it easier to just do it by hand with a for loop?

Honestly, I don't know the answer. I don't do too many things that break plyr, but I've learned how important it is to understand when I'm likely to exceed its limits. .variables in ddply is another one that I've learned to be careful with. If, after subdividing your input data.frame, you end up with 1e5 or 1e6 pieces, things start to break down pretty fast.

Honestly, I love writing solutions with the *apply() family and the ddply functions. I think it makes code cleaner and more logical. Watching the light of this dawn in other R users' eyes is truly exciting. Yet it pays to remember that, like all things, it has its limits. In this case, one seems to reach the limits suddenly and harshly.