25 November 2012

Successive Differences of a Randomly-Generated Timeseries

I was wondering about the null distribution of successive differences of random sequences, and decided to do some numerical experiments. I quickly realized that successive differences equates to taking successively higher-order numerical derivatives, which functions as a high-pass filter. So, the null distribution really depends on the spectrum of the original timeseries.

Here, I've only played with random sequences, which are equivalent to white noise. Using the wonderful animation package, I created a movie that shows the timeseries resulting from differencing, along with their associated power spectra. You can see that, by the end, almost all of the power is concentrated in the highest frequencies. The code required to reproduce this video is shown below.

Note -- For optimum viewing, switch the player quality to high.
require(animation)
## large canvas, write output to this directory, 1 second per frame
ani.options(ani.width=1280, ani.height=720, loop=F, title='Successive differences of white noise', outdir='.', interval=1)
require(plyr)
require(xts)
## How many realizations to plot?
N=5
## random numbers
aa = sapply(1:26, function(x) rnorm(1e2));
colnames(aa) = LETTERS;

saveVideo( {
    ## for successive differences, do...
    for (ii in 1:50) {
        ## first make the differences and normalize
        aa1 = apply(aa, 2, function(x) {
            ret=diff(x, differences=ii);ret=ret/max(ret)
        }); 
        ## Turn into timeseries object for easy plotting
        aa1 = xts(aa1, as.Date(1:nrow(aa1)));

        ## next, compute spectrum
        aa2 = alply(aa1, 2, function(x) {
            ## of each column, don't modify original data 
            ret=spectrum(x, taper=0, fast=F, detrend=F, plot=F);
            ## turn into timeseries object
            ret= zoo(ret$spec, ret$freq)});
        ## combine into timeseries matrix
        aa2 = do.call(cbind, aa2 );
        colnames(aa2) = LETTERS;

        ## plot of timeseries differences
        ## manually set limits so plot area is exactly the same between successive figures
        myplot = xyplot(aa1[,1:N], layout=c(N,1),
                        xlab=sprintf('Difference order = %d', ii), 
                        ylab='Normalized Difference',
                        ylim=c(-1.5, 1.5), 
                        scales=list(alternating=F, x=list(rot=90), y=list(relation='same'))); 

        ## plot of spectrum
        myplot1 = xyplot(aa2[,1:N], layout=c(N,1),
                        ylim=c(-0.01, 5), xlim=c(0.1, 0.51),
                        xlab='Frequency', ylab='Spectral Density',
                        type=c('h','l'), 
                        scales=list(y=list(relation='same'), alternating=F));
        ## write them to canvas
        plot(myplot, split=c(1,1,1,2), more=T);
        plot(myplot1, split=c(1,2,1,2), more=F);
        ## provide feedback of process
        print(ii)}

## controls for ffmpeg
}, other.opts = "-b 5000k -bt 5000k")

08 November 2012

Consumerist Dilemas

Consumer society gives us lots of choices, but sometimes provides little opportunity for post-choice customization. We buy something, and we can either return it or keep it, but it is what it is. Want something a little shorter, a little tighter, or a little less stiff? Too bad. Unless you're loaded, and then you can hire someone to do a custom job. But, to my knowledge, even the super-rich no longer order custom automobiles.

For a small subset of expensive, long-lived personal items like cars and mattresses, this can make new purchases particularly stressful. This is one plausible explanation for the copious consumer-choice media devoted to, for example, cars, as well as the effort companies put into brand identity. If they sell you something that you like, then there's a reasonable chance you'll get another one, the "safe choice", when the old one breaks.

I've faced this dilemma recently with shoes and glasses. I typically have one or two pairs of each that I use primarily, daily, typically for at least 2 years. I'm very near-sighted, so I have to get the expensive high-index lenses. My big toes spread out, I assume from years of wearing flip-flops and Chacos, so most every shoe feels narrow and pointy. The decision to get a new pair of glasses or shoes fills me with an existential dread -- what if I'm stuck with costly discomfort for the next year or more?

This month I discovered that keyboards engender a similar dread. I'm at a keyboard for at least 20 hours on an average week, and I imagine it gets up to 60 hours on a long week. Not all of that is typing, but I've had a slowly building case of carpal tunnel / repetitive stress injury, particularly in my right hand, from long coding sessions that involve heavy use of shift, up-arrow, and enter keys. After a month of poking around the internet, checking reviews on Amazon and looking at eBay for used items, I felt totally conflicted. I just wanted to "try something on" before I shelled out cash to saddle myself with something that I ended up hating.

But anything would be better than my stock Dell monstrosity. In this spirit, I dropped by the chaotic indie computer parts/repair store across the street from campus, and found two ergonomic keyboards used and on the cheap:

#1 Microsoft Comfort Curve 2000 v1
#2 Adesso EKB-2100W

The perfect keyboard is neither of them. The Adesso has a nice split design, but it's huge. Enormous. Heavy -- it could be an instrument of murder in the game of Clue. #1 looks nice, and has really easy key-travel. Unfortunately, the left Caps-lock, which is mapped to control and is involved in something like ~5-20% of my (non-prose) keystrokes, is sticky and giving my pinky problems already.

As least they're cheap.

A friend pointed out that one key to preventing RSI is to change up the routine. With this in mind, maybe two keyboards isn't such a bad idea. Did I mention they're cheap? At very least, the existential dread has abated. No more keyboard choices to make in the foreseeable future. Now, if I could just find a low-profile pair of sneakers with a large toebox and a tasteful lack of neon and mesh.

02 October 2012

Modeling Philosopy

I've found it sometimes difficult to explain my modeling philosophy, especially in the face of "why don't you just do X?" or "isn't that just like Y and Z?" responses. Without going into X, Y, or Z in the slightest, I present the following quote from one of my intellectual heros. His short book (less than 100 pages) full of equations that I struggle to make sense of (not unlike going to the opera) is full of little morsels of wisdom. Here's one of my favorites:
It need hardly be repeated that no ~detailed~ statistical agreement with observation is to be expected from models which are based on admitted simplifications. But the situation is quite analogous to others where complex and interacting statistical systems are under consideration. What is to be looked for is a comparable pattern with the main features of the phenomena; when, as we shall see, the model predicts important features which are found to conform to reality, it becomes one worthy of further study and elaboration.
--M.S Bartlett, 1960.
Stochastic Population Models in Ecology and Epidemiology

29 February 2012

Custom Amazon EC2 config for Rstudio

Introduction

This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future.

To start, I followed the instructions here to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the bioconductor folks for setting up this image.

Edit -- another look

After playing with this setup more and trying to shrink the root partition (as per
these instructions), I realized there's a tremendous amount of cruft in this AMI. It's over 20GB, there's a weird (big) /downloads/ folder lying aruond, and just about every R package ever in /use/local/lib64/R. I cut it down to ~6gb by removing these two directories.


If I were to do this again, I would use the official Canonical images (more details here), and manually install what I need. Honesty, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain of the image. Aside from installing a few extra packages (Rstudio server, for example), the rest of the instructions should be similar.

Modifications

I proceeded to lock it down -- add an admin user, prevent root login, etc.
Then I set up apache2 to serve Rstudio server over HTTPS/SSL.
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).

Below is the commandline session that I used to do it, along with annotations and links to some files.
adduser myusername
adduser myusername sudoers
## add correct key to ~myusername/.ssh/authorized_keys

vi /etc/ssh/sshd_config 
## disable root login
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root


Next, I set up dynamic dns using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- it's very self-explanatory.

## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart

## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http

## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
'/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'

## change permissions, move to default ubuntu locations
sudo chmod 444 crocus.crt
sudo chown root.ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates

sudo chmod 640 crocus.key
sudo chown root.ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/

sudo a2enmod rewrite
sudo a2enmod ssl

sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart

You can see my full apache config file here:

Conclusions

Now I access Rstudio on the EC2 instance with a browser via:
https://myhostname.mydomain.org

I found that connecting to the Rstudio server web interface gave noticable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.

Another annoyance was that Rstudio installs packages into something like ~R/libraries, whereas the commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.

Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.

All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.

28 February 2012

Adventures in R Studio Server: Apache2, Https, Security, and Amazon EC2.

I just put a fresh install of Ubuntu Server (10.04.4 LTS) on one of our machines.  As I was doing some post-install config, I accidentally installed Rstudio Server.  And subsequently fell down an exciting little rabbit-hole of server configuration and "ooooh-lala!" playtime.

A friend sung the wonders of Rstudio Server to me recently, and I filed it under "things to ignore for now".  Just another thing to learn, right?  Turns out, the Rstudio folks do *great* work and write good docs, so I hardly had to learn anything.  I just had to dust off my sysadmin skills and fire up some google.

I'm a little concerned about running web services on public-facing machines.  Even more so, given that R provides fairly low-level access to operating system services.  Still, I was impressed to see system user authentication.

I followed the docs for running apache2 as a proxy server, and learned a little about apache in the process.  Since I made it this far, I figured I'd run it through https/ssl, add some memory limitations, etc.  I'm still not entirely convinced this is secure -- it seems that running it in a virtual machine or chroot jail would be ideal.

 On the other hand, I ran across this post on running Rstudio Server inside Amazon EC2 instances.  Nighttime EC2 spot prices on "Quadruple Extra Large" instances (68.4 GB of memory,
8 virtual cores with 3.25 EC2 Compute Units each) fell below $1 an hour tonight, which is cheap enough to play with for an hour or two -- take it through some paces and see how well it does with a *very* *large* *job* or two.  Instances can now be stopped and saved to EBS (elastic block storage), and so only need to be configured once, which really simplifies matters. In fact, I'm wondering if Rstudio (well, R, really) is my "killer app" for EC2. 

Overall, I was really impressed at how fast and easy this was to get up and running. Fun times ahead!

29 October 2011

SQL Koan

It's not that profound, but I sure do like it: a simple, elegant example of a self-join that gives a truth table for the NOT DISTINCT FROM operator.

WITH x(v) AS (VALUES (1),(2),(NULL))
  SELECT l.v, r.v, l.v = r.v AS equality,
    l.v IS NOT DISTINCT FROM r.v AS isnotdistinctfrom
  FROM x l, x r;

Thanks to Sam Mason via the PostgreSQL - general mailing list.

28 October 2011

Why can't ads be more fun?

The worst part about advertisements these days, in my not-so-humble opinion, is the degree to which they belittle the user. Well-endowed lady wants to be my Facebook friend? "Click here to confirm request." Riiight. Lonely singles near me? Yeah, sure. OH GOOD LORD, A THREAT HAS BEEN IDENTIFIED! Oh, wait, y'all really are evil! On the infrequent occasions that I do watch TV, the ads are less flagrantly insidious, but they're nonetheless relentlessly patronizing.

The second-worst part of ads, IMNSHO, is the brute-force repetition. The same inane 30 seconds once? I can do that, but five times in an hour? You've got to be kidding me. Nope, no kidding here. Just 30 seconds of the same inanity, again and again.

Consequently, I find unique and intelligent ads rather inspiring. I know, it shouldn't be this way. One might thing that unique and inspiring would be more common amongst something so prevalent. Anyway...

Some of the GEICO ads that have played on Hulu lately have managed to avoid the "worst part". They're short and playful, vaguely reminiscent of [adult swim] commercials. "Is the sword mightier than the pen" is my current favorite; I still sometimes chuckle when I see it. They still fail occasionally in the repetition department. If you're a large international corporation, how hard is it to make more than a handful of moderately-entertaining 30-second spots?


Which brings me back to banner ads. I just saw this at the top of ars technica today for the first time, and it really caught my eye. I don't read or write Javascript, but I found myself puzzling through it and then following the link.


At the top of the page, I find this banner, which visually reminds me where I came from, that I've come to the right place, and gives me another little puzzle. Neat. Thank you, Heinz College of Carnegie Mellon.


None of this is revolutionary, but I do hope it is evolutionary. Ads don't have to be boring, and I'm guessing that interesting and intelligent ads are more effective even as they're less painful and insulting. My guess is that a generational change-over in marketing departments and their managers is underway, and will be slow. Still, I look forwards to a new generation of more modern advertising approaches that gives me something to think while my eyeballs are held hostage.

06 October 2011

FFT / Power Spectrum Box-and-Whisker Plot with Gggplot2

I have a bunch of time series whose power spectra (FFT via R's spectrum() function) I've been trying to visualize in an intuitive, aesthetically appealing way. At first, I just used lattice's bwplot, but the spacing of the X-axis here really matters. The spectra's frequencies aren't regularly-spaced categories, which is the default of bwplot. If all the series are of the same length, then the coefficients of the are all estimated at the same frequencies, and one can use plyr's split-apply-combine magic to compute the distribution of coefficients at each frequency.

I spent some time trying to faithfully reproduce the plotting layout of the figures produced by, e.g. spectrum(rnorm(1e3)) with respect to the axes. This was way more annoying than i expected...

To include a generic example, I started looking around for easily generated, interesting time series. The logistic map is so damned easy to compute and generally pretty interesting, capable of producing a range of frequencies. Still, I was rather surprised by the final output. Playing with these parameters produces a range of serious weirdness -- apparent ghosting, interesting asymmetries, etc..

Note that I've just used the logistic map for an interesting, self-contained example. You can use the exact same code to generate your own spectrum() box-and-whisker plot by substituting your matrix of time series for mytimeseries below. Do note here that series are by row, which is not exactly standard. For series by column, just change the margin from 1 to 2 in the call to adply below.

## box-and-whisker plot of FFT power spectrum by frequency/period
## vary these parameters -- r as per logistic map
## see http://en.wikipedia.org/wiki/Logistic_map for details
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) {
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
    }
    return(ret)
}

## modify these for interesting behavior differences
rfrom = 3.4
rto = 3.7
rsteps=200
seqlen=1e4

require(plyr)
require(ggplot2)

mytimeseries = aaply(seq(from=rfrom, to=rto, length.out=rsteps), 1,
function(x) {
       logmap(seqlen, x)
})

## you can plug in any array here for mytimeseries
## each row is a timeseries
## for series by column, change the margin from 1 to 2, below
logspec = adply( mytimeseries, 1, function(x) {
       ## change spec.pgram parameters as per goals
       ret = spectrum(x, taper=0, demean=T, pad=0, fast=T, plot=F)
       return( data.frame(freq=ret$freq, spec=ret$spec))
})

## boxplot stats at each unique frequency
logspec.bw = ddply(logspec, 'freq', function(x) {
           ret = boxplot.stats(log10(x$spec));
           ## sometimes $out is empty, bind NA to keep dimensions correct
           return(data.frame(t(10^ret$stats), out=c(NA,10^ret$out)))
})

## plot -- outliers are dots
## be careful with spacing of hard returns -- ggplot has odd rules
## as close to default "spectrum" plot as possible
ggplot(logspec.bw, aes(x=1/(freq)))  +  geom_ribbon(aes(ymin=X1,
ymax=X5), alpha=0.35, fill='green')  +
       geom_ribbon(aes(ymin=X2, ymax=X4), alpha=0.5, fill='blue') +
geom_line(aes(y=X3))  +
       geom_point(aes(y=out), color='darkred') +
       scale_x_continuous( trans=c('inverse'), name='Period')  +
       scale_y_continuous(name='Power', trans='log10')


Following my nose further on the logistic map example, I also used the animate package to make a movie that walks through the power spectrum of the map for a range of r values. Maybe one day I'll set this to music and post it to Youtube! Though I think some strategic speeding up and slowing down in important parameter regions is warranted for a truly epic tour of the logistic map's power spectrum.

require(animation)
saveVideo({
       ani.options(interval = 1/5, ani.type='png', ani.width=1600, ani.height=1200)
        for (ii in c(seq(from=3.2, to=3.5, length.out=200), seq(from=3.5, to=4, length.out=1200))) {
            spectrum(logmap(1e4, ii), taper=0, demean=T, pad=0, fast=T, sub=round(ii, 3))
        }
},  video.name = "logmap2.mp4", other.opts = "-b 600k")

Bat Country

I've spent a lot of time thinking about and using R's spectrum() function and the Fast Fourier Transform (FFT) in the last 5+ years. Lately, they've begun to remind me a little of a Theremin: simple to use, difficult to master.

While prepping a figure for the R Graph Gallery (which is awesome, by the way), I ran across this curious example -- its striking visual appearance was definitely not what I was expecting.


I decided to use the logistic map to generate an ensemble of time series with a range of frequencies. The logistic map goes through characteristic period doubling, and then exhibits broad-spectrum "noise" at the onset of chaos. And it's really easy to compute. So, I can tune my time series to have a range of frequency distributions.

In this example, I'm in the periodic regime (r=3.5, period 4) . So why am I getting this Batman motif?

Actually, it's weirdly simple. This is an artifact of the (default) tapering that R's spectrum() function applies to the series before the FFT is computed. In theory, the Fourier Transform assumes an infinite-length sequence, while in practice the FFT assumes the series is circular -- in essence, gluing the beginning and end of the series together to make a loop. If the series is not perfectly periodic, though, this gluing introduces a sharp discontinuity. Tapering is typically applied to reduce or control this discontinuity, so that both ends gently decline to zero. When the series is genuinely periodic, though, this does some weird effects. In essence, the taper transfers power from the fundamental frequencies in a curious way. Note that the fundamental period is 4, as seen by the peak at frequency 1/4, with a strong harmonic at period 2, frequency 1/2.

Zero-padding has a similar effect. If your series is zero (or relatively small) at both ends, then you could glue it together without introducing discontinuities, but peaks near the beginning and the end will still be seen as near by, resulting in spurious peaks. In any case, R pads your series by default (fast=T) unless the length of the series is highly composite (i.e. it can be factored into many small divisors). Below, it turns out that a series of length 1e5 is composite enough that R doesn't pad it. 1e5? Totally different results.

If I understand this correctly, both padding and tapering are attempts to reduce or control spectral leakage. We've constructed an example where no spectral leakage is expected from the original time series, which closely fits the assumptions of the FFT without modification.

The moral? Know thy functions (and their default parameter values, and their effects)!

## using R -- 
## logistic map function:
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) { 
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
    }
    return(ret)
}

## compare:
spectrum(logmap(1e5, 3.5), main='R tapers to 0.1 by default')
spectrum(logmap(1e5, 3.5), taper=0, main='No taper')
spectrum(logmap(1e5+1, 3.5), taper=0, main='Non-composite length, fast=T pads by default')
spectrum(logmap(1e5, 3.5), taper=0.5, sub='Lots of padding approximately recovers "correct" results')

18 June 2011

Efficient loops in R -- the complexity versus speed trade-off

I've written before about the up- and downsides of the plyr package -- I love it's simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building a agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections.

My ballpark estimate was 1e6 agents using a weekly timestep for about 50 years. This is a stochastic model, so I'd like to replicate it at least 100 times. So, I'll need at least 50*50*100 = 250,000 steps. I figured I'd be happy if I could get my step runtimes down to ~1 second -- dividing the 100 runs over 4 cores, this would give a total runtime of ~17.5 hours. Not short, but not a problem.

At first, I was disheartened to see the runtime of my prototype step function stretching into the minutes. What's going on? Well, I'd used plyr in a very inappropriate way -- for a very large loop. I began to investigate, and discovered that writing an aesthetically unappealing while gave me a 30+x speed-up.

All of which got me thinking -- how expensive are loops and function calls in R? Next week I'm leading a tutorial in R and C++ using the wonderful Rcpp and inline packages here at Santa Fe Institute's 2011 Complex Systems Summer School. Might this make a nice example?

It does, and in spades. Below are the test functions, and you can see that the code complexity increases somewhat from one to the other, but outcomes are identical, again with 30+x speedup for each subsequent case. Here, I'm using the native R API. I also tested using Rcpp to import rbinom(), but that ended up taking twice as long as the naive while loop.

So, the moral of the story seems to be that if you can write a long loop in pure C++, it's a really easy win.

Note -- The as< double >(y); in src below doesn't seem to copy-and-paste correctly for some reason. If testfun2 doesn't compile, check to make sure this bit pasted correctly.


The pure R function definitions

## use aaply -- the simplest code
require(plyr)
testfun0 <- function(x, y)  aaply(x, 1, function(x) rbinom(1, x, y)

## rewrite the above as an explicit loop
testfun1 = function(nrep, x, y) {
    ## loop index
    ii<-0;
    ## result vector
    ret<-rep(0, nrep);
    while(ii < nrep) {
        ii<-ii+1;
        ## how many successes for each element of bb?
        ret[ii] <- rbinom(1, x[ii], y)
    }
    return(ret)
}

Rcpp function definitions (with lots of comments)

## define source code as string, pass to inline
src <-  ' 
    // Cast input parameters to correct types
    // clone prevents direct modification of the input object
    IntegerVector tmp(clone(x));
    double rate = as< double >(y); 
    // IntegerVector inherits from STL vector, so we get standard STL methods
    int tmpsize = tmp.size(); 
    // initialize RNG from R set.seed() call
    RNGScope scope; 
    // loop
    for (int ii =0; ii < tmpsize; ii++) {
        // Rf_rbinom is in the R api
        // For more details, See Rcpp-quickref.pdf, "Random Functions"
        // also, Rcpp provides "safe" accessors via, e.g., tmp(ii)
        tmp(ii) = Rf_rbinom(tmp(ii), rate);
    };
    // tmp points to a native R object, so we can return it as-is
    // if we wanted to return, e.g., the double rate, use:
    // return wrap(rate)
    return tmp;
'

require(inline)
## compile the function, inspect the process with verbose=T
testfun2 = cxxfunction(signature(x='integer', y='numeric'), src, plugin='Rcpp', verbose=T)

## timing function
ps <- function(x) print(system.time(x))

The tests

## Input vector
bb <- rbinom(1e5, 20, 0.5)
## test each case for 2 different loop lengths 
for( nrep in c(1e4, 1e5)){
    ## initialize RNG each time with the same seed
    ## plyr
    set.seed(20); ps(cc0<- testfun0(nrep[1:nrep], bb, 0.1))
    ## explicit while loop
    set.seed(20); ps(cc1<- testfun1(nrep, bb, 0.1))
    ## Rcpp
    set.seed(20); ps(cc2<- testfun2(bb[1:nrep], 0.1))
    print(all.equal(cc1, cc2))
}


## output
   user  system elapsed 
  3.392   0.020   3.417 
   user  system elapsed 
  0.116   0.000   0.119 
   user  system elapsed 
  0.000   0.000   0.002 
[1] TRUE
   user  system elapsed 
 37.534   0.064  37.601 
   user  system elapsed 
  1.228   0.000   1.230 
   user  system elapsed 
  0.020   0.000   0.021 
[1] TRUE

Postlude

After posting this to the very friendly Rcpp-devel mailing list, I got an in-depth reply from Douglas Bates pointing out that, in this case, the performance of vanilla apply() beats the while loop by a narrow margin. He also gives an interesting example of how to use an STL template and an iterator to achieve the same result. I admit that templates are still near-magic to me, and for now I prefer the clarity of the above. Still, if you're curious, this should whet your appetite for some of the wonders of Rcpp.
## Using Rcpp, inline and STL algorithms and containers
 
 ## use the std::binary_function templated struct
inc <-  '
  struct Rb : std::binary_function< double, double, double > {
     double operator() (double size, double prob) const { return ::Rf_rbinom(size, prob); }
  };
'

## define source code as a character vector, pass to inline
 src <-  ' 
     NumericVector sz(x);
     RNGScope scope;
 
     return NumericVector::import_transform(sz.begin(), sz.end(), std::bind2nd(Rb(), as< double >(y)));
'
 
## compile the function, inspect the process with verbose=TRUE
f5 <- cxxfunction(signature(x='numeric', y='numeric'), src, 'Rcpp', inc, verbose=TRUE)