R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

29 October 2011

SQL Koan

It's not that profound, but I sure do like it: a simple, elegant example of a self-join that gives a truth table for the NOT DISTINCT FROM operator.

WITH x(v) AS (VALUES (1),(2),(NULL))
  SELECT l.v, r.v, l.v = r.v AS equality,
    l.v IS NOT DISTINCT FROM r.v AS isnotdistinctfrom
  FROM x l, x r;

Thanks to Sam Mason via the PostgreSQL - general mailing list.

28 October 2011

Why can't ads be more fun?

The worst part about advertisements these days, in my not-so-humble opinion, is the degree to which they belittle the user. Well-endowed lady wants to be my Facebook friend? "Click here to confirm request." Riiight. Lonely singles near me? Yeah, sure. OH GOOD LORD, A THREAT HAS BEEN IDENTIFIED! Oh, wait, y'all really are evil! On the infrequent occasions that I do watch TV, the ads are less flagrantly insidious, but they're nonetheless relentlessly patronizing.

The second-worst part of ads, IMNSHO, is the brute-force repetition. The same inane 30 seconds once? I can do that, but five times in an hour? You've got to be kidding me. Nope, no kidding here. Just 30 seconds of the same inanity, again and again.

Consequently, I find unique and intelligent ads rather inspiring. I know, it shouldn't be this way. One might thing that unique and inspiring would be more common amongst something so prevalent. Anyway...

Some of the GEICO ads that have played on Hulu lately have managed to avoid the "worst part". They're short and playful, vaguely reminiscent of [adult swim] commercials. "Is the sword mightier than the pen" is my current favorite; I still sometimes chuckle when I see it. They still fail occasionally in the repetition department. If you're a large international corporation, how hard is it to make more than a handful of moderately-entertaining 30-second spots?

Which brings me back to banner ads. I just saw this at the top of ars technica today for the first time, and it really caught my eye. I don't read or write Javascript, but I found myself puzzling through it and then following the link.

At the top of the page, I find this banner, which visually reminds me where I came from, that I've come to the right place, and gives me another little puzzle. Neat. Thank you, Heinz College of Carnegie Mellon.

None of this is revolutionary, but I do hope it is evolutionary. Ads don't have to be boring, and I'm guessing that interesting and intelligent ads are more effective even as they're less painful and insulting. My guess is that a generational change-over in marketing departments and their managers is underway, and will be slow. Still, I look forwards to a new generation of more modern advertising approaches that gives me something to think while my eyeballs are held hostage.

06 October 2011

FFT / Power Spectrum Box-and-Whisker Plot with Gggplot2

I have a bunch of time series whose power spectra (FFT via R's spectrum() function) I've been trying to visualize in an intuitive, aesthetically appealing way. At first, I just used lattice's bwplot, but the spacing of the X-axis here really matters. The spectra's frequencies aren't regularly-spaced categories, which is the default of bwplot. If all the series are of the same length, then the coefficients of the are all estimated at the same frequencies, and one can use plyr's split-apply-combine magic to compute the distribution of coefficients at each frequency.

I spent some time trying to faithfully reproduce the plotting layout of the figures produced by, e.g. spectrum(rnorm(1e3)) with respect to the axes. This was way more annoying than i expected...

To include a generic example, I started looking around for easily generated, interesting time series. The logistic map is so damned easy to compute and generally pretty interesting, capable of producing a range of frequencies. Still, I was rather surprised by the final output. Playing with these parameters produces a range of serious weirdness -- apparent ghosting, interesting asymmetries, etc..

Note that I've just used the logistic map for an interesting, self-contained example. You can use the exact same code to generate your own spectrum() box-and-whisker plot by substituting your matrix of time series for mytimeseries below. Do note here that series are by row, which is not exactly standard. For series by column, just change the margin from 1 to 2 in the call to adply below.

## box-and-whisker plot of FFT power spectrum by frequency/period
## vary these parameters -- r as per logistic map
## see for details
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) {
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])

## modify these for interesting behavior differences
rfrom = 3.4
rto = 3.7


mytimeseries = aaply(seq(from=rfrom, to=rto, length.out=rsteps), 1,
function(x) {
       logmap(seqlen, x)

## you can plug in any array here for mytimeseries
## each row is a timeseries
## for series by column, change the margin from 1 to 2, below
logspec = adply( mytimeseries, 1, function(x) {
       ## change spec.pgram parameters as per goals
       ret = spectrum(x, taper=0, demean=T, pad=0, fast=T, plot=F)
       return( data.frame(freq=ret$freq, spec=ret$spec))

## boxplot stats at each unique frequency = ddply(logspec, 'freq', function(x) {
           ret = boxplot.stats(log10(x$spec));
           ## sometimes $out is empty, bind NA to keep dimensions correct
           return(data.frame(t(10^ret$stats), out=c(NA,10^ret$out)))

## plot -- outliers are dots
## be careful with spacing of hard returns -- ggplot has odd rules
## as close to default "spectrum" plot as possible
ggplot(, aes(x=1/(freq)))  +  geom_ribbon(aes(ymin=X1,
ymax=X5), alpha=0.35, fill='green')  +
       geom_ribbon(aes(ymin=X2, ymax=X4), alpha=0.5, fill='blue') +
geom_line(aes(y=X3))  +
       geom_point(aes(y=out), color='darkred') +
       scale_x_continuous( trans=c('inverse'), name='Period')  +
       scale_y_continuous(name='Power', trans='log10')

Following my nose further on the logistic map example, I also used the animate package to make a movie that walks through the power spectrum of the map for a range of r values. Maybe one day I'll set this to music and post it to Youtube! Though I think some strategic speeding up and slowing down in important parameter regions is warranted for a truly epic tour of the logistic map's power spectrum.

       ani.options(interval = 1/5, ani.type='png', ani.width=1600, ani.height=1200)
        for (ii in c(seq(from=3.2, to=3.5, length.out=200), seq(from=3.5, to=4, length.out=1200))) {
            spectrum(logmap(1e4, ii), taper=0, demean=T, pad=0, fast=T, sub=round(ii, 3))
}, = "logmap2.mp4", other.opts = "-b 600k")

Bat Country

I've spent a lot of time thinking about and using R's spectrum() function and the Fast Fourier Transform (FFT) in the last 5+ years. Lately, they've begun to remind me a little of a Theremin: simple to use, difficult to master.

While prepping a figure for the R Graph Gallery (which is awesome, by the way), I ran across this curious example -- its striking visual appearance was definitely not what I was expecting.

I decided to use the logistic map to generate an ensemble of time series with a range of frequencies. The logistic map goes through characteristic period doubling, and then exhibits broad-spectrum "noise" at the onset of chaos. And it's really easy to compute. So, I can tune my time series to have a range of frequency distributions.

In this example, I'm in the periodic regime (r=3.5, period 4) . So why am I getting this Batman motif?

Actually, it's weirdly simple. This is an artifact of the (default) tapering that R's spectrum() function applies to the series before the FFT is computed. In theory, the Fourier Transform assumes an infinite-length sequence, while in practice the FFT assumes the series is circular -- in essence, gluing the beginning and end of the series together to make a loop. If the series is not perfectly periodic, though, this gluing introduces a sharp discontinuity. Tapering is typically applied to reduce or control this discontinuity, so that both ends gently decline to zero. When the series is genuinely periodic, though, this does some weird effects. In essence, the taper transfers power from the fundamental frequencies in a curious way. Note that the fundamental period is 4, as seen by the peak at frequency 1/4, with a strong harmonic at period 2, frequency 1/2.

Zero-padding has a similar effect. If your series is zero (or relatively small) at both ends, then you could glue it together without introducing discontinuities, but peaks near the beginning and the end will still be seen as near by, resulting in spurious peaks. In any case, R pads your series by default (fast=T) unless the length of the series is highly composite (i.e. it can be factored into many small divisors). Below, it turns out that a series of length 1e5 is composite enough that R doesn't pad it. 1e5? Totally different results.

If I understand this correctly, both padding and tapering are attempts to reduce or control spectral leakage. We've constructed an example where no spectral leakage is expected from the original time series, which closely fits the assumptions of the FFT without modification.

The moral? Know thy functions (and their default parameter values, and their effects)!

## using R -- 
## logistic map function:
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) { 
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])

## compare:
spectrum(logmap(1e5, 3.5), main='R tapers to 0.1 by default')
spectrum(logmap(1e5, 3.5), taper=0, main='No taper')
spectrum(logmap(1e5+1, 3.5), taper=0, main='Non-composite length, fast=T pads by default')
spectrum(logmap(1e5, 3.5), taper=0.5, sub='Lots of padding approximately recovers "correct" results')

18 June 2011

Efficient loops in R -- the complexity versus speed trade-off

I've written before about the up- and downsides of the plyr package -- I love it's simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building a agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections.

My ballpark estimate was 1e6 agents using a weekly timestep for about 50 years. This is a stochastic model, so I'd like to replicate it at least 100 times. So, I'll need at least 50*50*100 = 250,000 steps. I figured I'd be happy if I could get my step runtimes down to ~1 second -- dividing the 100 runs over 4 cores, this would give a total runtime of ~17.5 hours. Not short, but not a problem.

At first, I was disheartened to see the runtime of my prototype step function stretching into the minutes. What's going on? Well, I'd used plyr in a very inappropriate way -- for a very large loop. I began to investigate, and discovered that writing an aesthetically unappealing while gave me a 30+x speed-up.

All of which got me thinking -- how expensive are loops and function calls in R? Next week I'm leading a tutorial in R and C++ using the wonderful Rcpp and inline packages here at Santa Fe Institute's 2011 Complex Systems Summer School. Might this make a nice example?

It does, and in spades. Below are the test functions, and you can see that the code complexity increases somewhat from one to the other, but outcomes are identical, again with 30+x speedup for each subsequent case. Here, I'm using the native R API. I also tested using Rcpp to import rbinom(), but that ended up taking twice as long as the naive while loop.

So, the moral of the story seems to be that if you can write a long loop in pure C++, it's a really easy win.

Note -- The as< double >(y); in src below doesn't seem to copy-and-paste correctly for some reason. If testfun2 doesn't compile, check to make sure this bit pasted correctly.

The pure R function definitions

## use aaply -- the simplest code
testfun0 <- function(x, y)  aaply(x, 1, function(x) rbinom(1, x, y)

## rewrite the above as an explicit loop
testfun1 = function(nrep, x, y) {
    ## loop index
    ## result vector
    ret<-rep(0, nrep);
    while(ii < nrep) {
        ## how many successes for each element of bb?
        ret[ii] <- rbinom(1, x[ii], y)

Rcpp function definitions (with lots of comments)

## define source code as string, pass to inline
src <-  ' 
    // Cast input parameters to correct types
    // clone prevents direct modification of the input object
    IntegerVector tmp(clone(x));
    double rate = as< double >(y); 
    // IntegerVector inherits from STL vector, so we get standard STL methods
    int tmpsize = tmp.size(); 
    // initialize RNG from R set.seed() call
    RNGScope scope; 
    // loop
    for (int ii =0; ii < tmpsize; ii++) {
        // Rf_rbinom is in the R api
        // For more details, See Rcpp-quickref.pdf, "Random Functions"
        // also, Rcpp provides "safe" accessors via, e.g., tmp(ii)
        tmp(ii) = Rf_rbinom(tmp(ii), rate);
    // tmp points to a native R object, so we can return it as-is
    // if we wanted to return, e.g., the double rate, use:
    // return wrap(rate)
    return tmp;

## compile the function, inspect the process with verbose=T
testfun2 = cxxfunction(signature(x='integer', y='numeric'), src, plugin='Rcpp', verbose=T)

## timing function
ps <- function(x) print(system.time(x))

The tests

## Input vector
bb <- rbinom(1e5, 20, 0.5)
## test each case for 2 different loop lengths 
for( nrep in c(1e4, 1e5)){
    ## initialize RNG each time with the same seed
    ## plyr
    set.seed(20); ps(cc0<- testfun0(nrep[1:nrep], bb, 0.1))
    ## explicit while loop
    set.seed(20); ps(cc1<- testfun1(nrep, bb, 0.1))
    ## Rcpp
    set.seed(20); ps(cc2<- testfun2(bb[1:nrep], 0.1))
    print(all.equal(cc1, cc2))

## output
   user  system elapsed 
  3.392   0.020   3.417 
   user  system elapsed 
  0.116   0.000   0.119 
   user  system elapsed 
  0.000   0.000   0.002 
[1] TRUE
   user  system elapsed 
 37.534   0.064  37.601 
   user  system elapsed 
  1.228   0.000   1.230 
   user  system elapsed 
  0.020   0.000   0.021 
[1] TRUE


After posting this to the very friendly Rcpp-devel mailing list, I got an in-depth reply from Douglas Bates pointing out that, in this case, the performance of vanilla apply() beats the while loop by a narrow margin. He also gives an interesting example of how to use an STL template and an iterator to achieve the same result. I admit that templates are still near-magic to me, and for now I prefer the clarity of the above. Still, if you're curious, this should whet your appetite for some of the wonders of Rcpp.
## Using Rcpp, inline and STL algorithms and containers
 ## use the std::binary_function templated struct
inc <-  '
  struct Rb : std::binary_function< double, double, double > {
     double operator() (double size, double prob) const { return ::Rf_rbinom(size, prob); }

## define source code as a character vector, pass to inline
 src <-  ' 
     NumericVector sz(x);
     RNGScope scope;
     return NumericVector::import_transform(sz.begin(), sz.end(), std::bind2nd(Rb(), as< double >(y)));
## compile the function, inspect the process with verbose=TRUE
f5 <- cxxfunction(signature(x='numeric', y='numeric'), src, 'Rcpp', inc, verbose=TRUE)

11 June 2011

The importance of being unoriginal (and befriending google)

In search of bin counts

I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I've never had any reason to go under the hood until now. Lately, I've been looking using the bin counts for things like Shannon entropy ( in the very nice entropy package. I figured that binning and counting data would either be supported via a native, dedicated R package, or quite simple to code. Not finding the former ( base graphics hist() uses .Call("bincounts"), which appears undocumented and has a boatload of arguments ), I naively failed to search for a package and coded up the following.

myhist = function(x, dig=3)  {
    x=trunc(x, digits=dig);
    ## x=round(x, digits=dig);
    aa = bb = seq(0,1,1/10^dig);
    for (ii in 1:length(aa)) {
        aa[ii] = sum(x==aa[ii])
    return(cbind(bin=bb, dens=aa/length(x)))

## random variates
test = sort(runif(1e4))
get1 = myhist(test)

Trouble in paradise

Truncate the data to a specified precision, and count how many are in each bin. Well, first I tried round(x) instead of trunc(x), which sorta makes sense but gives results that I still don't understand. On the other hand, trunc(x) doesn't take a digits argument? WTF? Of course, I could use sprintf(x) to make a character of known precision and convert back to numeric, but string-handling is waaaaaay too much computational overhead. Like towing a kid's red wagon with a landrover...

Dear Google...

An hour of irritation and confusion later, I ask google and, small wonder, the second search result links to the ash package that contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn't return the bin boundaries by default, but it's good enough for a quick-and-dirty empirical probability mass distribution.

To be fair, there's something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn't quite as simple as one first thought. On the other hand, sometimes we just want answers. When that's the case, asking google is a pretty good bet.

## their method
get2 = bin1(test, c(0,1), 1e3+1)$nc

10 May 2011

Problems with plyr -- the memory/complexity trade-off

Two types of R users

My overwhelming impression from UseR 2010 is that, generally speaking, there are 2 types of regular R users -- those who have heard and are made uncomfortable by the idea of the *apply() functions, and those who really get it. In UNM R programming group that I've been leading for about a year now, I've really tried to get people over the hump and into the second group. Once there, folks seem to really appreciate the amazing power of vectorization in R, and begin to enjoy writing code. The conceptual clarity of:

mymatrix = matrix(1:10, nrow=10)
  apply(mymatrix, 1, sum)
  apply(mymatrix, 2, sum)

over the clunky:


may not be immediately apparent. Eventually, though, folks go searching for things like rowMedian() and become frustrated that R doesn't "have all the functions" that they need. Well, R does, once you grok apply().

Hadley's Magic Brainchild

In the last year, I've had some serious ahah! moments with plyr. Lately, I've added the reshape package to the mix to achieve some serious R 1-liner Zen. Multidimensional arrays to long-form dataframes?

myarray = array(0, dim=c(3,4,5,6,10), dimnames=list(a=1:3, b=1:4, c=1:5, d=letters[1:6], e=LETTERS[1:10]))
  melt( adply(myarray, 1:4, I))

Sure. No problem. It's really that easy?!

An unexpeceted bonus? This way of thinking lends itself nicely to thinking of explicit parallelism. If you can describe the problem as a single recipe that's done for each "piece" of a whole, then you're one step away from using all the cores on your machine to solve a problem with foreach. Need to apply some sort of long-running analysis to each element of a list, and want to return the results as a list? Why write a for loop when you can do:

## prep
## warning -- this  may not be fast
## use smaller matrices for smaller machines! 
  nn = 2^20
  rr = 2^10
  mylist = list(a=matrix(rnorm(nn), nrow=rr), b=matrix(rnorm(nn), nrow=rr), c=matrix(rnorm(nn), nrow=rr))

## analysis
  myeigs = foreach( ii = iter(mylist), .combine=c) %do% { print('ping'); return(list(eigen(ii)))}

and then, once it works, change %do% to %dopar%, add the following, and you're off to the races!

  ## use the appropriate # of cores, of course
  myeigs1 = foreach( ii = iter(mylist), .combine=c) %dopar% { print('ping'); return(list(eigen(ii)))}

Compared to something like llply, your dimnames don't automatically propagate to the results, but I think this is still pretty amazing. Complete control and simple parallelism.

Debugging with %dopar% is tricky, of course, because there are separate stacks for each item (i think), and messages (such as calls to print()) don't return to the console as you might expect them to. So, when in doubt, back off to %do%.

What could possibly go wrong?

The only problem with all of this is, when tasks are embarassingly parallel, that data also becomes embarassingly parallel to point where it no longer fits into memory. Thus, I returned today to a long-running bootstrap computation to find R consuming ~7GB RAM, 5+ GB swap, and this message:

Error: cannot allocate vector of size 818.6 Mb
Enter a frame number, or 0 to exit   
4: mk.bootrep(zz, 75)
5: mk.boot.R#165: aaply(tmpreps, .margins = 1, function(x) {
6: laply(.data = pieces, .fun = .fun, ..., .progress = .progress, .drop = .dro
7: list_to_array(res, attr(.data, "split_labels"), .drop)
8: unname(unlist(res))

What's happening is that plyr is trying to do everything at once. As anyone who's used grep can tell you, doing one row at a time, or streaming data is often a much better idea. I got the intended results from above by pre-allocating an array and writing each item of my results list into the array in a loop in seconds, and barely broke 3 GB of RAM usage.

Now, nothing here is really news. The dangers of "Growing Objects" is covered in Circle 2 of Burns Statistics' wonderful R Inferno. Still, plyr strikes me as an interesting case where reducing conceptual complexity can lead to a rather steep increase in computational complexity. And the most interesting thing of all is that it happens quite suddenly above a certain threshold.

Parting thoughts

I wonder if there's any major barrier to a stream=TRUE argument to the plyr functions -- I haven't thought about it too much, but imagine that you'd also need a finalizer function to prepare the return object to be written/streamed into. At what point is it easier to just do by hand with a for loop?

Honestly, I don't know the answer. I don't do too many things that break plyr, but I've learned how important it is to understand when I'm likely exceed its limits. .variables in ddply is another one that I've learned to be careful with. If, after subdividing your input data.frame, you end up with 1e5 or 1e6 pieces, things start to break down pretty fast.

Honestly, I love writing solutions with the *apply() family and the ddply functions. I think it makes code cleaner and more logical. Watching the light of this dawn in other R users' eyes is truly exciting. Yet it pays to remember that, like all things, it has its limits. In this case, one seems to reach the limits suddenly and harshly.

29 March 2011

Why PKI matters.

This article landed in my inbox this week in a newsletter from the EFF. I usually don't read them, but the term "meltdown" caught my eye, what with all the nuke new this month. They also managed to work in "too big to fail", and neither reference was hyperbolic. The internet depends on a level of trust, and (surprise) there are people working to co-opt that trust.

A number of my friends think I'm a little weird for using real passwords, for not sharing them, etc. But I read Cryptonomicon and friends; I have at least one box exposed to the real, honest-to-god, jungle-out-there internet; I have a healthy fear of all the things that could go wrong. As part of my lab group's data entry project, I recently registered my first SSL certificate from (free!), and learned quite a bit about PKI in the process.

Still, reading an article like this really drives home both the complexities and importance of the global PKI system. Trust is difficult when it's strung across the globe on fiber optics cables, and enforced by our inability to quickly factor very large numbers. But old-school techniques of impersonation and breaking and entering will always be with us. I may trust, but how do I know that it's actually them? PKI.

How cool is that?

20 March 2011

Looking at the "Curse of Dimensionality" with R, foreach, and lattice

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

It's neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to "get" how the maximum metric worked -- it's simpler than I was trying to make it, though I'm still not sure I can explain it properly.

In general, I'm a big fan of R's lattice package, though I'm never quite happy with the legend/key settings.

The code is below. One thing that I've learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I'm guessing here) a lot of time will be wasted with job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep <= reps) {
                ## 2 points, one per row, randomly chosen
                tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
                ## calc dist
                mydist = as.numeric(dist(tmp, method=method))
                ## fill return dataframe
                ret[ii, ] = list(ss, .dim, .rep, mydist, method)
                ## update rep counter
                .rep = .rep +1
                ## update ret row counter
                ii = ii +1
    } ## rbind happens here

For plotting, I've used:
xyplot(.dist ~ (.dim)|method,, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',

24 February 2011

snow and ssh -- secure inter-machine parallelism with R

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.

... It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

... the SSHcluster method would require minimal invasion on those machines -- just ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!


R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost, the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permissions to run Rscript and read the R libraries. Note that Rscript is far from a secure program to allow untrusted users to run. For the paranoid, place it in a chroot jail...


The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from commandline (shown by $) or the R session in which work is to be done (shown by \>). You will need to replace remotehost with the hostname or ip address of remotehost, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don't have to keep typing your key password again and again. Even better, set up an .ssh/config file for host-specific aliases and configuration.

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> require(snow); require(foreach); require(doSNOW);
> cl = makeCluster(rep('localhost',3), type='SOCK', manual=T)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; 
$  do ssh -f remotehost "/usr/lib64/R/bin/Rscript ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R 
$       MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"; 
$ done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)


This is a funny example of %dopar%; it's minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don't want to do.

Next up:
  • Write an R function that uses calls to system to set the tunnel up and spawn the slave processes, with string substitution for remotehost, etc.
  • Set up a second port and socketConnection to read each slave's stdout to pick up errors. Perhaps call warning() on the output? Trouble-shooting is so much easier when you can see what's happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where it's packages live, right?
  • As written, a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to chose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn't been updated since July 2008, and a google search show up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).

+1 if you'd like to see a doSSH package.
+10 if you'd like to help!

ssh port forwarding

Is it just me, or is the openssh port forwarding syntax *really* confusing?  Every time I have to work through examples afresh.  It's ridiculously powerful -- who can argue with key-based authentication on a user-land vpn, and it's available for all major OSes.  Still, I wish I *really* understood the syntax.  

I suppose practice makes perfect.  I'm working on a small project now.  With any luck, examples to come.

Openssh FAQ
Red Hat Magazine

03 February 2011

Postgreql Hacks -- firing a trigger "for each transaction"

The problem

I spent today learning about triggers in postgresql - there are triggers "for each statement" and "for each row". I went searching for triggers "for each transaction", and found a few other folks with the same question but no compelling answer.

My use case is this - I have 2 tables, a parent (entries) and child (reports), with a one-to-many relationship. I need to add multiple reports for one entry in a single transaction, and then update the associated entry if the transaction succeeds. Here are the table definitions:

create schema transtest;
set searchpath to transtest,public;
    -- it's nice to have a sandbox
    -- remember to reset searchpath when done

create table myents ( id serial primary key, sum integer);
    -- the entries table
create table myrecs ( id serial primary key, entid integer references myents (id), cases integer);
    -- the reports table, many reports per entry
create table myids ( id integer primary key references myents (id));
    -- table used to store trigger state

The solution

It appears that deferred constraint triggers are *almost* "per transaction". But constraint triggers are *always* for each row. Yet Postgresql 9.0 allows conditions on triggers - so all I need is to store some state of whether to fire the trigger or not for each row. First, the function that I'll be using in the condition:

-- function for when-clause (condition) in trigger
-- add id to the "to-do" list if it's not there already
-- return True only if it's been added to the least
create or replace function f_condition(myid integer) returns boolean as $_$
        myret boolean := True;
        newid integer;
        RAISE NOTICE 'Return Value here is %', myret;
        select count(*) into newid from myids where id = myid;
        if newid > 0 then -- id is there, return False 
           myret := False; 
        else    -- add id to queue, return True
            insert into myids (id) values (myid);
        end if;
        return myret;
$_$ language plpgsql;

Next, the trigger function, which will clean up after the above function when it's done:

-- trigger function to rebuild myent.sum for this
create or replace function f_refresh() returns trigger as $_$
        mysum integer;
        myid integer;
        IF (TG_OP = 'DELETE') THEN
            myid := OLD.entid;
            myid := NEW.entid;
        END IF;
        RAISE NOTICE 'Return Value here is %', myid;
        select sum(cases) into mysum from myrecs where entid = myid;
            -- compute entry sum from records and update it.
            -- PL/pgSQL rules on queries are quirky.  Read the docs.
        update myents set sum = mysum where id = myid;
        delete from myids where id = myid;
            -- clean up
$_$ language plpgsql;

Finally, the trigger definitions using the above.

-- only fire trigger once per using when clause (condition)
-- eval is deferred until transaction concludes
create constraint trigger t_cache after insert or update 
    on myrecs initially deferred for each row 
    when ( f_condition(NEW.entid) )  -- comment out for pre pg 9.0
    execute procedure f_refresh();

-- need 2, one for each for NEW (insert and update) and OLD (delete)
create constraint trigger t_cache_del after delete
    on myrecs initially deferred for each row 
    when ( f_condition(OLD.entid) )
    execute procedure f_refresh();

The test

All software needs to be tester, right?
I don't have pg9.0 yet, and it won't hit Ubuntu mainline until Ubuntu Natty, but it looks easy enough to install via a backports repo. So, this isn't fully tested. Comment out the "when" lines in the trigger defines and it will work, albeit running once per row *after* the commit.

In any event, here goes! The sum should be zero until after the commit. If the when clause works correctly, then the trigger should only fire (and emit a notice) once. Likewise, the myids table should be empty.

insert into myents (sum) VALUES ( 0);
    insert into myrecs (entid, cases) VALUES ( 1, 0), (1, 1), (1,2);
    select * from myents;  -- 0
    select * from myids;   -- entid from above
select * from myents;   -- new sum
select * from myids;   -- empty

References -- Postgresql Docs:

02 February 2011

I've been obsessively following Egypt for the past few days. Both the BBC and Al Jazeera have excellent coverage, as well as Google Realtime. After days of seeing BBC's "Have your say" links on the BBC ( and reading tech and business stories to boot ), I sent one in. Here goes: ### Begin Comments The latest comments from both Mubarak and Obama indicate that change is on the horizon. How close the horizon actually is seems to be the biggest question on people's minds. Several days ago, in the space of minutes, some 80 million people vanished from the internet (see This likely happened at direct orders from the current regime - who else wields such power? This is one of the most significant "black marks" against the current regime - a totalitarian attempt to curtail information flow. A resumption of basic information services is an absolutely prerequisite to any transition. If the current regime is serious about stability, they will "turn the lights back on" ASAP. The fact that they haven't done so yet casts serious doubt on their intentions.