Life in Code: work-products and stray thoughts from the Land of Entrapment.
<h1>Playing with OpenCL (2016-11-14)</h1>
I spent last week reading up on modern C++ developments, including some great essays from <a href="https://herbsutter.com/">Herb Sutter</a>. I was particularly struck by his prescient series on Moore's Law: <a href="http://www.gotw.ca/publications/concurrency-ddj.htm">The Free Lunch Is Over</a> and <a href="https://herbsutter.com/welcome-to-the-jungle/">Welcome to the Jungle</a>. The latter essay places all possible computer architectures on a 2D plane of CPU versus memory architecture. The axes are a bit tricky, but the general idea is that a platform at the "origin" is predictable and easy to program for, whereas things get trickier as you move up and/or to the right.<br />
<br />
<p><a href="https://herbsutter.files.wordpress.com/2012/01/image3.png"><img style="display:inline;float:center;margin:0 0 0 10px;" title="image" src="https://herbsutter.files.wordpress.com/2012/01/image_thumb3.png?w=320&h=160" alt="image" width="320" height="160" align="center" /></a><br />
<br />
<p>This figure describes everything from cellphones and game consoles to supercomputers and communications satellites. It also got me wondering: how hard would a simple "hello world" OpenCL program be to get running on my Intel-only laptop? Can I do this in a day, or perhaps just an evening?<br />
<br />
Personally, I don't want to fuss with hardware right now - I just want to see how the GPU/C++ pieces fit together. Conveniently, my laptop contains a low-end integrated GPU (24 execution units) on its Broadwell chip. Running Debian, I was able to easily <a href="https://github.com/helmingstay/test_code/blob/master/opencl/README.md">download the requisite packages</a> and get started. I quickly discovered that consumer Intel GPUs, at present, do *not* <a href="http://arrayfire.com/opencl-on-intel-hd-iris-graphics-on-linux/">support <code>double</code> anything</a>, making this a less-than-ideal test-bed for scientific programming.<br />
<br />
My test case was inspired by a performance issue that I ran into in my work. In a scientific simulation program that I use & help develop, Valgrind revealed that <code>pow(double, double)</code> was taking fully half of the total computational time. Poking around a bit, I see that <code>pow()</code> and <code>log()</code> really are <a href="https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Elementary_functions">quite complex to compute</a>, particularly for doubles (since the total effort is a function of precision). With this in mind, I set up a simple example using both OpenCL and straight C++, and compared timings. Note - I strongly recommend using a sample size of greater than one to draw any conclusions with real-life consequences!<br />
<br />
In this example, the vanilla C++ is clean and easy to read, but it is ~20x slower than the OpenCL version. Worried about the possibility of "unintended optimizations", I tried using a different kernel function. I used <code>float</code> for both examples to keep the total computational complexity the same. The speed gap remained, but the new test revealed different answers. To the best of my understanding, this highlights differences in precision-sensitive operations between the OpenCL and standard C++ library implementations. This is a pretty tricky area - just know that it's something to keep an eye on if you require perfect concordance between platforms.<br />
<br />
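For reference, here is roughly what the plain C++ side boils down to: a <code>pow()</code>-heavy kernel function applied element-wise over a large <code>float</code> vector. This is a minimal stand-in of my own (the kernel function, vector size, and timing harness are all made up for illustration), not the repo's actual <code>straight.cpp</code>:<br />
<pre>
// hypothetical stand-in for the plain C++ baseline; the kernel and sizes are illustrative
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 100000;
    std::vector<float> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<float>(i);

    auto start = std::chrono::steady_clock::now();
    // an arbitrary transcendental-heavy kernel, applied to every element on the CPU
    for (std::size_t i = 0; i < n; ++i)
        v[i] = std::pow(v[i], 0.7f) + std::log(v[i] + 1.0f);
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> secs = stop - start;
    std::printf("first=%f last=%f elapsed=%f s\n", v[0], v[n - 1], secs.count());
    return 0;
}
</pre>
Compiled with the same flags the Makefile uses below (<code>g++ -Wall -std=c++11</code>), a loop like this is the kind of work the ~20x comparison refers to.<br />
<br />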
EDIT: I also added an example using <a href="http://boostorg.github.io/compute">Boost.Compute</a> today, which brings the best of both the C++ and OpenCL worlds. <code>Boost.Compute</code> has straightforward docs, and includes a nice <a href="http://boostorg.github.io/compute/BOOST_COMPUTE_CLOSURE.html">closure macro</a> that allows the direct incorporation of C++ code in kernel functions. The resulting code is *way* less verbose than vanilla OpenCL. The only downsides are the extra dependency and some *very* noisy compiler warnings.<br />
<br />
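To give a flavor of the Boost.Compute approach, here is a minimal sketch of my own (not the code from the repo; the device setup, vector size, and kernel body are illustrative). The <code>BOOST_COMPUTE_CLOSURE</code> macro captures the host-side variable <code>exponent</code> so that it can be used inside the generated OpenCL kernel:<br />
<pre>
// minimal Boost.Compute sketch (assumes Boost.Compute and a working OpenCL runtime)
#include <cstdio>
#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main() {
    // pick a device and set up a context + queue on it
    compute::device gpu = compute::system::default_device();
    compute::context ctx(gpu);
    compute::command_queue queue(ctx, gpu);

    std::vector<float> host(100000);
    for (std::size_t i = 0; i < host.size(); ++i) host[i] = static_cast<float>(i);

    // closure: the captured host variable `exponent` is usable inside the kernel body
    float exponent = 0.7f;
    BOOST_COMPUTE_CLOSURE(float, raise, (float x), (exponent),
    {
        return pow(x, exponent);
    });

    // copy to the device, transform in place, copy back
    compute::vector<float> dev(host.size(), ctx);
    compute::copy(host.begin(), host.end(), dev.begin(), queue);
    compute::transform(dev.begin(), dev.end(), dev.begin(), raise, queue);
    compute::copy(dev.begin(), dev.end(), host.begin(), queue);

    std::printf("first=%f last=%f\n", host.front(), host.back());
    return 0;
}
</pre>
The host/device copies and the transform are one call each, which is a big part of why this reads so much better than hand-rolled OpenCL host code.<br />
<br />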
Here's the output of the full example, which <a href="https://github.com/helmingstay/test_code/tree/master/opencl">can be found in my github test code repo</a> (the Boost version is not shown; its timings are comparable to OpenCL):
<pre>$make; time ./opencl; time ./straight
g++ -Wall -std=c++11 -lOpenCL -o opencl opencl.cpp
g++ -Wall -std=c++11 -o straight straight.cpp
Using platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics 5500 BroadWell U-Processor GT2
result:
200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./opencl 0.20s user 0.09s system 97% cpu 0.295 total
result:
200 201 202 203 204 205 206 207 208 209
99990 99991 99992 99993 99994 99995 99996 99997 99998 99999
./straight 6.03s user 0.00s system 99% cpu 6.032 total
</pre><br />
Hopefully this example helps you get started experimenting with GPU computing. As <a href="https://herbsutter.com/">Herb Sutter</a> points out, we can expect more and greater hardware parallelism in the near future. Discrete GPUs are now commonly used in scientific computing, and Intel now sells a massively multicore add-on card, <a href="https://en.wikipedia.org/wiki/Xeon_Phi">the Xeon Phi processor</a>. Finally, floating-point precision <a href="http://arrayfire.com/explaining-fp64-performance-on-gpus/">remains an interesting question</a> to keep an eye on in this domain.
<h1>Shiny on Webfaction: VPS installation without root (2016-02-15)</h1>
<p>I've been using <a href="https://www.webfaction.com/?aid=88752">Webfaction (plug)</a> as an inexpensive managed VPS. Part of me wants root access, but I'm mostly happy to leave the administrative details to others. Webfaction seems to be a good example of a common VPS plan: user-only access in a rich development environment. Compilers, <code>zsh</code>, and even <code>tmux</code> are available from the shell, making this a very comfortable setup overall.
<p>Most of the time the lack of root doesn't matter, but it sometimes complicates new software installs. I've been looking forward to testing R's webapp package <a href="http://shiny.rstudio.com/">Shiny</a>, but all of the docs assume root access (and some even state that it's required). I set off without knowing if this would work, attempting to see how far I could get. What follows is a (hopefully) reproducible account of a user-land install of R & Shiny via ssh on a Webfaction slice. To the best of my knowledge, this requires only standard development tools, and so should(??) work.
<p>In the following I use [tab] to indicate hitting the tab key for auto-completion. The VPS login username is [user]. [edit] means call your editor of choice (vim, emacs, or, god forbid, nano). This assumes you are using bash (which seems to be the default shell on most VPS plans).
<h4> Prepare the build environment</h4>
<pre>
## ssh to webhost
## make directories, set paths, etc
## source build dir
mkdir ~/src
## software install dir
mkdir ~/local
## personal content dir
CONTENTDIR=~/var
mkdir $CONTENTDIR
## some hosts have /tmp set noexec?
mkdir ~/src/tmp
## Install software here
INSTPREFIX=$HOME/local
## set paths:
##
echo 'export PATH=$PATH:~/local/bin:~/local/shiny-server/bin' >> ~/.bashrc
echo 'export TMPDIR=$HOME/src/tmp' >>~/.bashrc
## check that all is well
[edit] ~/.bashrc
## update env
. ~/.bashrc
</pre>
[Ref: <a href="http://mazamascience.com/WorkingWithData/?p=1185">temp dir and R packages</a>]
<h4>Install R from source: fast and (mostly) easy</h4>
<pre>
cd ~/src
wget http://cran.us.r-project.org/src/base/R-3/R-3.2.3.tar.gz
tar xzf R-3.2.3.tar.gz
cd R-[tab]
./configure --prefix=$INSTPREFIX
## missing library, search and add directory
CPPFLAGS=-I/usr/lib/jvm/java/include make
make install
cd ~
</pre>
<h4>Prep R environment</h4>
<pre>
## The following commands are in R
install.packages(c('shiny', 'rmarkdown'))
</pre>
<pre>
## From the shell:
## on a headless / no-X11 box, need cairo for png
echo "options(bitmapType='cairo')" >> ~/.Rprofile
## check that all is well
[edit] ~/.Rprofile
</pre>
[Ref: <a href="http://stackoverflow.com/questions/13067751/how-to-run-r-scripts-on-servers-without-x11">R png without X11</a>]
<h4>Install cmake (if needed)</h4>
<pre>
## first install cmake - skip if it's already available
which cmake
## nothing? continue
## NOTE - I'm using the source tarball here, not binaries
wget https://cmake.org/files/v3.4/cmake-3.4.3.tar.gz
tar xzf cmake-[tab]
cd cmake-[tab]
./configure --prefix=$INSTPREFIX
gmake
make install
</pre>
<h4>Install Shiny Server</h4>
<pre>
## From shell
cd ~/src
git clone https://github.com/rstudio/shiny-server.git
cd shiny-server
cmake -DCMAKE_INSTALL_PREFIX=$INSTPREFIX
make
## "make install" Complains about no build dir
## I'm not sure what happens here, but this seems to work
PYTHON=`which python`
mkdir build
./bin/npm --python="$PYTHON" rebuild
./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="$PYTHON" rebuild
make install
</pre>
[Ref: <a href="https://github.com/rstudio/shiny-server/wiki/Building-Shiny-Server-from-Source">shiny build docs</a>]
<h4>Configure Shiny Server</h4>
All of the <a href="https://rstudio.github.io/shiny-server/latest/#server-management">Shiny Server docs</a> assume the config file is located in <code>/etc/</code>, which I don't have access to. There's _zero_ documentation on pointing shiny-server at a different config, nor does running <code>shiny-server -h</code> or <code>shiny-server --help</code> provide any indication. Trial and error, plus reading source code on github, finally led to <code>shiny-server path-to-config-file</code>. So, let's make a shiny site!
<pre>
## Nest content in ~/var
mkdir $CONTENTDIR/shiny
cp -rp ~/src/shiny-server/samples $CONTENTDIR/shiny/apps
mkdir $CONTENTDIR/shiny/logs
## copy the packaged settings template to the content dir
cp ~/src/shiny-server/config/default.config $CONTENTDIR/shiny/server.conf
[edit] $CONTENTDIR/shiny/server.conf
##
## server.conf content follows:
run_as [user];
## leave location as-is
## substitute var with $CONTENTDIR if needed
site_dir /home/[user]/var/shiny/apps;
log_dir /home/[user]/var/shiny/logs;
## save file
## back at shell, run shiny, put in background
shiny-server ~/var/shiny/server.conf &
</pre>
[Ref: <a href="http://rstudio.github.io/shiny-server/latest/#r-markdown">Shiny-server docs</a>]
<h4>Testing</h4>
Shiny should give messages about <code>Starting listener on 0.0.0.0:3838</code>. First up, let's use ssh to connect remote port 3838 to a local port. This allows local testing without deployment. As an aside, if you're not using <code>~/.ssh/config</code> on a local machine to manage keys and hostname shortcuts, you should!
<pre>
## on local machine:
ssh -nNT -L 9000:127.0.0.1:3838 [user]@webhost
</pre>
Now, if all went well, you should be able to navigate to the welcome page via browser on local machine:
<br><code>http://127.0.0.1:9000</code>
<p>Once shiny is working, don't forget to take a look at your logs:
<br><code>ls -alh $CONTENTDIR/shiny/logs</code>
<p>I had trouble with the packaged <code>rmd</code> example app (which renders a <code>.Rmd</code> file). Reading logs showed install issues with pandoc, and I had to manually fiddle with the links:
<pre>ln -s $INSTPREFIX/shiny-server/ext/pandoc/static/pandoc $INSTPREFIX/shiny-server/ext/pandoc/</pre>
[Ref: <a href="http://blog.trackets.com/2014/05/17/ssh-tunnel-local-and-remote-port-forwarding-explained-with-examples.html">port forwarding</a>]
<h4>Wrap-up</h4>
For a full production environment, you would want a process monitor to keep <code>shiny-server</code> running, as well as a public-facing server. See your webhost's documentation for process monitors. More details on <a href="https://support.rstudio.com/hc/en-us/articles/213733868-Running-Shiny-Server-with-a-Proxy">shiny-server and apache are here</a> (I haven't tried these proxy methods).
<p>Finally, a more conventional approach using root access on a VPS (such as DigitalOcean) is <a href="http://deanattali.com/2015/05/09/setup-rstudio-shiny-server-digital-ocean/">available here</a>.
<h4>Update - 17 Feb 2016: Deployment Logistics</h4>
After a day of kicking the tires, I'm happy to report Shiny-server is working well on Webfaction in production mode. Two points:
<p><b>Making a webapp.</b> In the Webfaction control panel, I added a custom application. In the following, substitute [appname] for the value entered in the Name field. For <code>App category</code> I selected "Websockets", and then clicked "Save". Copy the port number. Edit the <code>server.conf</code> file from above, replacing the number in <code>listen 3838;</code> with the port number copied from Webfaction. Finally, create a website, add a name (can be the same as [appname] from above), and a domain. It typically takes a few minutes for DNS changes to propagate.
<p>The above steps create a directory named <code>$HOME/webapps/[appname]</code>. I placed the <code>server.conf</code> file here, created app and log directories, and then updated server.conf to reflect the new locations:
<pre>
## Create the following directories
## add these paths to server.conf,
## and don't forget the trailing ;
mkdir $HOME/webapps/[appname]/logs
## shiny app files go here:
mkdir $HOME/webapps/[appname]/app
</pre>
<p>[Ref: <a href="https://docs.webfaction.com/software/custom.html">Webfaction custom application</a>]
<br>[Ref: <a href="https://docs.webfaction.com/user-guide/websites.html#websites-create">Webfaction Applications and Websites</a>]
<p><b>Running the server.</b> Shiny-server will use a PID file, which makes job-spawning a simple shell script + cron job. If shiny-server is already running, it will recognize the PID file and not start another process. I made the following script:
<pre>
#!/bin/sh
## executable shell script named $HOME/bin/my.shiny.sh
## make sure to run: chmod +x $HOME/bin/my.shiny.sh
APPROOT=$HOME/webapps/[appname]
PIDFN=$APPROOT/shiny-server.pid
## using full path
$HOME/local/shiny-server/bin/shiny-server $APPROOT/server.conf --pidfile=$PIDFN >> $APPROOT/logs/server.log 2>&1 &
</pre>
Now run <code>crontab -e</code> and add an entry for the script (above):
<pre>
## try once an hour, on the 10th minute of the hour
10 * * * * /home/[user]/bin/my.shiny.sh
</pre>
Finally, take a look at memory usage. If you exceed memory limits, Webfaction automatically kills everything. And R's memory use grows with more connections (which themselves persist, because <a href="https://en.wikipedia.org/wiki/WebSocket">websockets</a>). Webfaction distributes a <a href="https://wiki.webfaction.com/attachment/wiki/MiscellaneousFiles/memory_usage.py">nice python script</a> that shows per-process and total memory usage.
<p>[Ref: <a href="https://github.com/rstudio/shiny-server/blob/master/config/systemd/shiny-server.service">shiny-server systemd script</a> (shows commandline usage)]
<br>[Ref: <a href="https://docs.webfaction.com/software/general.html#scheduling-tasks-with-cron">Webfaction cron</a>]
<p>I should point out that I like <a href="https://www.webfaction.com/?aid=88752">Webfaction (plug)</a> well enough to pay them money. Their intro plan is $10/month for 1GB RAM + 100GB full SSD, with a 1-month free trial. I like that the webfaction user-base is big enough that lots of my questions are already answered, but small enough that staff actually answer new questions.
<p>I've done my best to document exactly what I did, but I'm sure there are typos. Let me know if you encounter any issues!
<h1>Numerical Simulations and Data Passing: C++, Python, and Protocol Buffers (2014-11-14)</h1>
<h2>Problem statement & Requirements</h2>
I'm working with a complex C++ simulation that requires a large number of user-specified parameters. Both speed and readability are important. I'd like to define all possible parameters in one (and only one) place, and include sensible defaults that can be easily over-ridden. Finally, intelligent type-handling would be nice.
For convenience, I decided to wrap the C++ simulation in python setup/glue code. Python is a logical choice here as the "available everywhere" glue language that has nice standard libraries.
<h2>Available libraries</h2>
There aren't many data-passing options that work with both C++ and python. Libconfig, JSON, XML, and Google Protocol Buffers (PB) appear to be the only reasonable options. Here are my thoughts on the first three:
<ul>
<li><a href="http://www.hyperrealm.com/libconfig/">Libconfig</a>: Nice clean library, good language support. The big downside is that data structures must be defined both in a data file and in code - e.g. data is "moved" from a file into C++ variables. I feel like libconfig is best for a small number of complex variables, like lists and vectors.
<li><a href="http://www.thomaswhitton.com/blog/2013/06/28/json-c-plus-plus-examples/">JSON</a>: no clear standard C++ library, library docs so-so, speed complaints from some?
<li>XML: Massive overkill.
</ul>
That leaves PB, which has <a href="https://developers.google.com/protocol-buffers/docs/proto">nice docs</a> for both C++ and python. All the variables, along with their types and defaults, are defined in a <tt>.proto</tt> file. The <tt>protoc</tt> tool auto-generates python and C++ code from the .proto file; by adding a protoc rule to my Makefile, the C++ classes are regenerated at compile time. This makes for fast and readable C++ code - like using a named dict, but without the speed costs.
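To make the "named dict" comparison concrete, here is roughly what the protoc-generated class looks like from the C++ side. This is a sketch of my own: the two fields shown (<code>number_of_days</code>, <code>version</code>) are ones used in the snippets below, but the full field set lives in the project's .proto file, which isn't reproduced here.
<pre>
// sketch of using the protoc-generated message class (fields shown are examples only)
#include <iostream>
#include "proto/ProtoBufInput.pb.h"

int main() {
    GOOGLE_PROTOBUF_VERIFY_VERSION;
    ProtoBufInput::setupSim setup;
    setup.set_number_of_days(30);           // typed setter generated by protoc
    if (setup.has_number_of_days()) {       // presence check for optional fields
        std::cout << "days: " << setup.number_of_days() << "\n";
    }
    std::cout << "version: " << setup.version() << "\n";   // default value from the .proto
    return 0;
}
</pre>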
<h2> Solution / Workflow </h2>
I'm using python to read user-supplied values into a set of PB messages, and then serializing the messages to files. C++ then reads the messages from those files at runtime. A python script run by <tt>make</tt> synchronizes the locations of files between python and C++.
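For example, the <code>ProtoDataFiles.h</code> header consumed by the C++ code below might look something like this sketch (the path is a placeholder of mine); the real header, along with a matching <code>ProtoDataFiles.py</code> for the python side, is written by a small python script at build time:
<pre>
// ProtoDataFiles.h -- illustrative sketch; generated at build time so that the
// python wrapper and the C++ simulation agree on where serialized messages live
#ifndef PROTO_DATA_FILES_H
#define PROTO_DATA_FILES_H

#define PbFile_setupSim "data/setupSim.pb"   // placeholder path

#endif  // PROTO_DATA_FILES_H
</pre>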
I also want to process commandline options for my python wrapper script. Happily, I can hand a PB message to <a href="https://docs.python.org/3.1/library/optparse.html#optparse-reference-guide">python's parser.parse_args()</a> and have it <a href="http://comments.gmane.org/gmane.comp.lib.protocol-buffers.general/2040">set PB message attributes with setattr()</a>.
The last python step (aside from writing the message to disk) is reading "variable,value" pairs from a .csv file. If a variable has already been set by parse_args, I skip it: the commandline values override .csv file values.
<h2>Summary</h2>
Overall, PB makes a very nice data coupler between an interpreted language like python and a compiled language like C++. Python excels at text processing and is easy to prototype, while C++ is fast and beautiful.
PB has a few side-benefits. On the C++ side, it provides some natural namespace encapsulation to manage variable-explosion. Runtime inspection with gdb is easy enough.
Finally, storing all the options values used to run each simulation in a standard-format file is handy - it allows tests to re-run the simulation with exactly the same inputs.
<h2>Python Snippets</h2>
<pre>
## imports needed by these snippets (not shown in the original post)
import sys
import csv
import subprocess
from optparse import OptionParser
from google.protobuf.text_format import Merge
import ProtoBufInput_pb2   ## generated by protoc from the .proto file
import ProtoDataFiles      ## message file locations, written by make

def main():
    ## initialize protobuf, fill with ParseArgs
    setupSim = ProtoBufInput_pb2.setupSim()
    setupSim = ParseArgs(sys.argv[1:], setupSim)
    prepInput(setupSim)
    RunSim()

def ParseArgs(argv, setupSim):
    parser = OptionParser(usage="wrapper.py [options]\nNote: commandline args over-ride values in files.", version=setupSim.version)
    ## these must be valid protocol buffer fields
    parser.add_option("-t", "--test", dest="testCLI",
                      action='store_true', help="Run test suite")
    parser.add_option("-d", "--days", metavar='N',
                      dest="number_of_days",
                      type='int', help="Number of days to simulate")
    ## parse!
    (setupSim, args) = parser.parse_args(argv, values=setupSim)
    return(setupSim)

def prepInput(setupSim):
    ## options from ParseArgs
    inhandle = open(setupSim.file_options, 'r')
    outhandle = open(ProtoDataFiles.PbFile_setupSim, 'wb')
    reader = csv.reader(inhandle, delimiter=',')
    header = reader.next()
    if not (header == ['variable', 'value']):
        raise Exception('Incorrect header format')
    for row in reader:
        ## skip comments, check for 2 fields per row
        if (row[0][0] == '#'):
            continue
        if not (len(row) == 2):
            raise Exception('Problem with value pair: %s' % row)
        ## pack the message using text representation
        msgText = '%s : %s' % (row[0], row[1])
        if setupSim.HasField(row[0]):
            print("Skipping config file, keeping commandline value: %s, %s" % (row[0], getattr(setupSim, row[0])))
            continue
        setupSim = Merge(msgText, setupSim)
    ## write out to file for C++ to read
    outhandle.write(setupSim.SerializeToString())
    outhandle.close()

def RunSim():
    subprocess.Popen("./sim").communicate()

if __name__ == "__main__":
    main()
</pre>
<h2>C++ Code Snippets</h2>
<pre>
// PbRead.h
#include <iostream>
#include <fstream>
#include <string>
#include <stdexcept>
#include "proto/ProtoBufInput.pb.h"

template <typename Type>
void PbRead(Type &msg, const char *filename){
    std::fstream infile(filename, std::ios::in | std::ios::binary);
    if (!infile) {
        throw std::runtime_error("Setup message file not found");
    } else if (!msg.ParseFromIstream(&infile)) {
        throw std::runtime_error("Parse error in message file");
    }
}

// sim.cpp
#include "PbRead.h"
#include "ProtoDataFiles.h"
// protocol buffers get passed around, are globals
ProtoBufInput::setupSim PbSetupSim;

int main(int argc, char **argv)
{
    GOOGLE_PROTOBUF_VERIFY_VERSION;
    PbSetupSim.set_init(true);
    // #define PbFile_setupSim "filename" in ProtoDataFiles.h, written by make
    PbRead(PbSetupSim, PbFile_setupSim);
    //...
    if (PbSetupSim.test_2()){
        //...
    }
}
</pre>
<h1>Efficient Ragged Arrays in R and Rcpp (2014-07-03)</h1>
<h2>When is R Slow, and Why?</h2>
Computational speed is a common complaint lodged against R.
Some recent posts on <a href="http://r-bloggers.com">r-bloggers.com</a> have compared
the speed of R with some other programming languages [<a href="#ref1">1</a>], and showed the
favorable impact of the new compiler package on run-times [<a href="#ref2">2</a>]. I and
others have written about using Rcpp to easily write C++ functions to speed-up bottlenecks
in R [<a href="#ref3">3</a>,<a href="#ref4">4</a>]. With the new Rcpp attributes framework,
writing fully vectorized C++ functions and incorporating them in R code is now very easy [<a
href="#ref5">5</a>].
<br>
<br>
On a day-to-day basis, though, R's performance is largely a function of coding style.
R allows novice users to write horribly inefficient code [<a href="#ref6">6</a>]
that produces the correct answer (eventually). Yet by failing to utilize vectorization
and pre-allocation of data structures, naive R code can be many orders of magnitude slower than
need be. <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a> is littered with the
tears of novices, and there's even a (fantastic)
parody of Dante's Inferno outlining the common "Deadly Sins of R" [<a href="#ref7">7</a>].
<br>
<h2>Problem Statement: Appending to Ragged Arrays</h2>
I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for,
and that I'm sure others
have encountered. What is the "R way" to vectorize computations on a ragged array?
One example of a ragged array is
a list of vectors that have varying and different lengths. Say you need to dynamically grow
many vectors by varying lengths
over the course of a stochastic simulation. Using a simple tool like <code>lapply</code>, the entire data
structure will be allocated anew with every assignment. This problem is briefly touched
on in the official
<a href="http://cran.r-project.org/doc/manuals/R-intro.html#The-function-tapply_0028_0029-and-ragged-arrays">Introduction
to R</a> documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing
may be done implicitly and much more efficiently". But what if your data *isn't* rectangular? How
might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?
<br>
<br>
The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum
possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix
assignment, and for each vector track the index of the start of free space.
If we try to write past the end of the matrix, R emits the appropriate error.
This method requires some book-keeping on our part. One nice addition would be
an S4 class with slots for the data matrix and the vector of free-space
indices, as well as methods to dynamically expand the data matrix and validate the object.
As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use
much less memory at the expense of slower access times [<a href="#ref8">8</a>].
Here, we're using more memory than is strictly needed to achieve much faster access times.
<br>
<br>
Is pre-allocation and book-keeping worth the trouble? <code>object.size(matrix(1.5, nrow=1e3,
ncol=1e3))</code> shows that a data structure of 1,000 vectors, each of length approximately 1,000,
occupies about 8Mb of memory (1,000 × 1,000 × 8 bytes). Let's say I resize this structure 1,000 times. Now I'm looking
at roughly 8Gb of cumulative memory allocations. Perhaps you're getting a sense of what a terrible idea
it is to *not* pre-allocate a frequently-resized ragged list?
<br>
<h2>Three Solutions and Some Tests</h2>
Using the above logic, I prototyped a solution as an R function, and then transcribed the result into
a C++ function (boundary checks are important in C++).
The result is three methods: a "naive" list-append method, an R method that uses matrix assignment,
and a final C++ method that modifies the pre-allocated matrix in-place.
In C++/Rcpp, functions can use pass-by-reference semantics [<a href="#ref9">9</a>],
which can have major speed advantages by
allowing functions to modify their arguments in-place. Full disclosure: pass-by-reference semantics requires
some caution on the user's part. Pass-by-reference is very different from R's "functional
programming" semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit
assignment call is required to modify an object [<a href="#ref10">10</a>].
<br>
<br>
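As a tiny, self-contained illustration of what pass-by-reference means here (a toy example of my own, separate from the append code below): an exported Rcpp function can write straight through its arguments, and the change shows up in the R object that was passed in, with no copy and no explicit assignment.
<pre>
#include <Rcpp.h>

// [[Rcpp::export]]
void fill_with(Rcpp::NumericVector x, double value) {
    // x wraps the memory of the R vector that was passed in, so writing
    // through x modifies that R object directly (no copy is made,
    // provided the R object is already a double vector)
    for (int i = 0; i < x.size(); ++i) {
        x[i] = value;
    }
}
</pre>
After <code>sourceCpp()</code>-ing this, calling <code>fill_with(v, 1.5)</code> on a double vector <code>v</code> from R overwrites <code>v</code> in place -- convenient and fast, but exactly the kind of side-effect that R's usual copy-on-modify semantics protects you from.
<br>
<br>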
I added a unit test to ensure identical results between all three methods, and then used the fantastic <a
href="http://cran.r-project.org/web/packages/rbenchmark/index.html">rbenchmark</a> package to time each solution.
As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++
pre-allocation methods are close in performance. Only with more iterations and larger
data structures does the C++ method really start to pull ahead. And by that time, the naive R
method takes *forever*.
<br>
<br>
Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices)
is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say.
<code>lapply()</code> is
conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this
simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution
using <code>lapply()</code>
and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting
(large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.
<br>
<h2>Code</h2>
Note: you can find the full code <a href="http://www.x14n.org/code/raggedArrayTest.R">here</a> and
<a href="http://www.x14n.org/code/helpers.cpp">here</a>.
<br><br>
Setup: two helper functions are used to generate ragged arrays via random
draws. First, draws from the negative binomial distribution determine the length of each new vector (with a
minimum length of 1, <code>gen.lengths()</code>), and draws from the exponential distribution fill each
vector with data (<code>gen.dat()</code>).
<pre>
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
    ## a vector of random draws
    pmax(1, rnbinom(particles, ntrials, prob))
}

gen.dat <- function(nelem, rate=1) {
    ## a list of vectors, vector i has length nelem[i]
    ## each vector is filled with random draws
    lapply(nelem, rexp, rate=rate)
}
</pre>
<br>
The first two solutions: a naive <code>lapply()</code>-style append, followed by pre-allocation
in R (the C++ version follows further below).
<pre>
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow dat by appending to list element i
    ## memory will be reallocated at each call
    dat <- mapply( append, dat, new.dat )
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}

## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow pre-allocated dat by inserting data in the correct place
    for (irow in 1:length(new.dat)) {
        ## insert one vector at a time
        ## col indices for where to insert new.dat
        cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
        dat[irow, cols.ii] = new.dat[[irow]]
    }
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}
</pre>
<br>
<br>
Next, the solution as a C++ function. This goes in a separate file that I'll call <code>helpers.cpp</code> (compiled below).
<pre>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void appendRcpp( List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
    // "append" to the pre-allocated matrix retmat in place:
    // we loop through rows, filling retmat in with the vectors in the list,
    // then update retmatLengths to index the next free element of each row.
    // newLengths isn't used; it's included for call compatibility with the R versions.
    NumericVector fillTmp;
    int sizeOld, sizeAdd, sizeNew;
    // pull out dimensions of matrix to fill
    int nrow = retmat.nrow();
    int ncol = retmat.ncol();
    // check that dimensions match
    if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) {
        throw std::range_error("In appendRcpp(): dimension mismatch");
    }
    for (int ii = 0; ii<nrow; ii++){
        // for each vector / row in retmat
        // vector to add to retmat
        fillTmp = fillVecs[ii];
        // compute lengths
        sizeOld = retmatLengths[ii];
        sizeAdd = fillTmp.size();
        sizeNew = sizeOld + sizeAdd;
        // error checking - stop on overfill
        if ( sizeNew >= ncol) {
            throw std::range_error("In appendRcpp(): exceeded max cols");
        }
        // iterator for row to fill
        NumericMatrix::Row retRow = retmat(ii, _);
        // fill row of return matrix, starting at the first free element
        std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
        // update size of retmat
        retmatLengths[ii] = sizeNew;
    }
}
</pre>
<br>
<br>
Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each
solution with identical data will be used for timing.
<pre>
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
    ## list of empty vectors, fill with append
    dat.list <- lapply(1:particles, function(x) numeric())
    ## preallocated matrix, fill rows from left to right
    dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## length of each element/row
    N.list <- N.r <- N.c <- rep(0, particles)
    ## repeat process, "appending" as we go
    for (ii in 1:nrep) {
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## in R, list of vectors
        tmp <- appendL(dat.new, N.new, dat.list, N.list)
        ## unpack, update
        dat.list <- tmp$dat
        N.list <- tmp$len
        ## in R, preallocate
        tmp <- appendR(dat.new, N.new, dat.r, N.r)
        ## unpack, update
        dat.r <- tmp$dat
        N.r <- tmp$len
        ## as above for C, modify dat.c and N.c in place
        appendRcpp(dat.new, N.new, dat.c, N.c)
    }
    ## pull pre-allocated data back into list
    dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
    ## check that everything is identical across the three methods
    identical(dat.r, dat.c) && identical(N.r, N.c) &&
        identical(dat.list, dat.r.list) && identical(N.list, N.r)
}

## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
    ## object to modify
    N.test <- rep(0, particles)
    dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## speed is affected by size,
    ## so ensure that each run adds the same elements
    set.seed(seed)
    for (irep in 1:nrep) {
        ## generate draws
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## bind in using given method
        tmp <- append.fun(dat.new, N.new, dat.test, N.test)
        if(do.update) {
            ## skip update for C
            dat.test <- tmp$dat
            N.test <- tmp$len
        }
    }
}
</pre>
<br>
<br>
Finally, we run it all:
<pre>
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")

## run unit test, verify all three methods return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)

## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
    replications=10
)

## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    replications=1e1
)
</pre>
<br>
<br>
Benchmark results show that the list-append method is roughly 400 times slower than the improved
R method, and over 900 times slower than the C++ method (<code>timings</code>).
As we move to larger data structures
(<code>timings.fast</code>), the advantages of modifying in-place with C++, rather than having to
explicitly assign the results, quickly add up.
<pre>
> timings
test replications elapsed relative user.self sys.self
2 c 10 0.057 1.000 0.056 0.000
3 list 10 52.792 926.175 52.674 0.036
1 r 10 0.128 2.246 0.123 0.003
> timings.fast
test replications elapsed relative user.self sys.self
2 c 10 0.684 1.000 0.683 0.000
1 r 10 24.962 36.494 24.934 0.027
</pre>
<h2>References</h2>
<br><a name="ref1">[1]</a> <a href="http://www.r-bloggers.com/how-slow-is-r-really/">How slow is R really?</a>
<br><a name="ref2">[2]</a> <a href="http://www.r-bloggers.com/speeding-up-r-computations-pt-ii-compiling/">Speeding
up R computations Pt II: compiling</a>
<br><a name="ref3">[3]</a> <a
href="http://helmingstay.blogspot.com/2011/06/efficient-loops-in-r-complexity-versus.html">Efficient loops in R - the complexity versus speed trade-off</a>
<br><a name="ref4">[4]</a> <a href="http://dirk.eddelbuettel.com/blog/2011/09/08/">Faster (recursive) function calls: Another quick Rcpp case study</a>
<br><a name="ref5">[5]</a> <a href="http://dirk.eddelbuettel.com/blog/2012/11/20/">Rcpp attributes: A simple example 'making pi'</a>
<br><a name="ref6">[6]</a> <a href="http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r">StackOverflow: Speed up the loop operation in R</a>
<br><a name="ref7">[7]</a> <a href="http://www.burns-stat.com/documents/books/the-r-inferno/">The R Inferno</a>
<br><a name="ref8">[8]</a> <a href="http://blog.custora.com/2013/05/sparse-matrix-formats-pros-and-cons/">Sparse matrix formats: pros and cons</a>
<br><a name="ref9">[9]</a> <a href="http://adv-r.had.co.nz/OO-essentials.html#rc">Advanced R: OO field guide: Reference Classes</a>
<br><a name="ref10">[10]</a> <a href="http://adv-r.had.co.nz/Functions.html#return-values">Advanced R: Functions: Return Values</a>
<h1>Tools for Online Teaching (2014-05-30)</h1>
<p>
Last semester (Fall 2014), I organized and taught an interdisciplinary,
collaborative class titled <a
href="http://probforsci.blogspot.com/">Probability for Scientists</a>.
Getting 4 separate teachers on the same page was a challenge, but as
scientists we're used to communicating over email, and CC'ing everyone
worked well enough. Throw in a shared dropbox folder, a shared google
calendar, and a weekly planning meeting, and instructor
collaboration went pretty smoothly.
<p>
It was more challenging to organize the class so that we could easily provide students with
up-to-date course information and supplemental material.
We ended up using blogger, which has some key benefits and
disadvantages. It was *really* easy to set up and add permissions for
multiple individuals. This allowed any of us to put up images (for
example, photos of the whiteboard at the end of class) and post
instructions. One downside we heard from students was an apparent
lack of organization. I attempted to organize the blog with
intelligent post labels, along with the "Labels widget" (which shows
all possible labels and the number of posts per label on the right
hand sidebar). I also included the "Posts by Date" sidebar, so all the
content could be accessed chronologically.
I understand their comments, though I'm not convinced that a
single, monolithic page of information is the right direction either.
<p>
One feature of blogger that proved helpful was its "Pages", static
html files or links that appear (by default) at the top of the blog.
These links are always present no matter where in the blog you navigate
to. We added the syllabus and course calendar here. And here's where
things start to get interesting.
<p>
All four instructors need the ability to collaboratively edit and then publish
public-facing material with students (like the syllabus), as well as edit private course material,
like grades. Dropbox makes a natural choice for private material,
whereas a collaborative source repository like GitHub makes a natural
choice for public material. Personally, I'm more familiar/comfortable
with <a href="http://code.google.com/hosting/">Google Code</a>, but I
don't think the choice of services here is material. [Sidenote:
If you have a pro GitHub account, the choice seems easy: use a private GitHub repo.]
<p>
Using a public git repo gave us a few things that I really
liked. First off, our class schedule was a simple HTML table in the
repo that we pointed at with a static link via the Pages widget, above.
When I needed to update an assignment due date, I did a quick
edit-and-commit on my laptop, pushed it to the archive, and it magically appeared
on the blog.
If several instructors are simultaneously working on course material,
this essentially provides an audit trail of who did what. This is kindergarten-level
stuff for software developers. But these are tools that teachers could benefit from
and aren't very familiar with. For example, getting
colleagues to set up git represents a non-trivial challenge. Nonetheless,
there are other benefits of learning git if you're a scientist (that I won't address
here, though see <a href="http://helmingstay.blogspot.com/2013/08/everyday-revision-control.html">here</a> for some thoughts).
<p>
In our class, I also used R for data visualizations.
I let the awesome <a
href="http://yihui.name/knitr/">knitr</a> package build the
appropriate .png files from my R code (along with pdfs for download,
if needed). Again, it sounds simple, but adding the generated figures
to the git archive allowed me to quickly link to them in blog posts,
and then update them later if needed.
<p>
I would have preferred having the blog posts themselves under revision
control (only "Pages" can point to an
external source html without javascript, which I didn't have time for).
But, for the simplicity of setup and the low
(e.g. free) cost of use, I didn't find posts to be much of an issue.
Blogger allows composition in pure html, without all the *junk*, which helps.
But having to re-upload and link a figure every time I find an error?
Definitely a pain.
<p>
For this class, we also set up an <a
href="http://www.e4ward.com/">e4ward.com</a> address that pointed to my
email box. This allowed me to publicly post the class address without
fear of being spammed forevermore, and allowed me to easily identify
all class email. Having a single instructor responsible for email
correspondence worked well enough. As an early assignment, students
were asked to send a question to the class email address. This
is a nice way to get to know folks, and incentivize them to go to
"digital office hours", e.g. ask good questions over email.
We did have some issues with <a
href="http://www.e4ward.com/">e4ward.com</a> towards the end of the class.
This *sucks* - students panic when emails get lost, and it's hard to
sort out where the problem is. Honestly, I don't know the answer here.
<p>
As a sidenote, I pushed the use of Wikipedia (WP) heavily in this
class, and referred to it often myself. This is not possible in all
fields, but WP articles in many of the hard sciences are now both
technically detailed and accessible. Probability and statistics
articles are some of the best examples here, since the "introductory
concept" articles are used by a large number of individuals/fields,
and aren't the subjects of debate (compare this with the WP pages of,
for example, Tibet or Obama). I also discovered WP's "Outlines"
pages for the first time. If you ever use statistics, I strongly
recommend spending some time with WP's <a
href="http://en.wikipedia.org/wiki/Outline_of_statistics">Outline of
Statistics</a>. It's epic.
<p>
One final tool we used quite a bit was <a
href="http://www.limesurvey.org/en/">LimeSurvey</a>. As I was
planning the class, I went looking for an inexpensive yet flexible
survey platform. I was generally disappointed by the offerings. My requirements included
raw data export and more than 10 participants; these features tend
*not* to be free, and survey tools can get pricey. Enter
LimeSurvey (henceforth LS). It's open source, well-documented,
versatile, and simple to use. I was reluctant to invest *too* much
effort in tools I wasn't sure we'd use, but I got LS running in less
than 2 hours. To be fair, our lab has an on-campus, always-on server,
and apache2 is already installed and configured. This would have been
an annoying out-of-pocket expense had I needed to rent a VPS, though
you can now get a lot of VPS for $5/mo. [sidenote: getting our campus
network to play nice with custom DNS was a whole other issue...]
<p>
LS allowed me to easily construct data entry surveys, allowing each
student to enter, for example, their 25 flips of 3 coins, or their
sequence of wins and losses playing the Monty Hall problem. Students
can quickly enter this data from their browser at their convenience.
At its best, visualizations of the class data can give students a sense
of ownership and purpose of in-class exercises.
LS also allowed us to conduct initial background surveys, as well as
anonymous, customized class "satisfaction" surveys mid-course to find
out what was working and what wasn't. We ran into a few administrative
issues with LS, but it's an overall powerful and stable data
collection platform. Students seemed happy with it, and it
provided us with valuable feedback.
<p>
What would I do differently? I failed to set up an email list early
on. It would have been trivial using, e.g. <a
href="https://groups.google.com/">Google Groups</a>. At the time, it
seemed redundant, both for the students and us. Didn't they already
have the blog? In retrospect, it would have proved useful at several
points to communicate "New Blog Post" or "Due Date Changed, check
calendar". I've learned this semester to expect that spoken instructions
are not necessarily heeded, and that receiving
administrative details in writing from multiple sources (blog,
calendar, mailing list) is a Good Idea ™.
<p>
Along this vein, a minor needed improvement is making a
separate table for assignments. I originally combined
assignments with the course calendar in the interest of simplicity, giving
students all the relevant information in one place. In the end, it
was just confusing.
<p>
Overall, the class went very well. We have received positive feedback
from the students, and we now have a detailed digital record of the
course. PDFs of their final project posters are now in the
archive, <a href="https://probability-for-scientists.googlecode.com/git/posters/">where they will live in perpetuity</a>. Personally, I'm not
ready to teach a <a
href="http://en.wikipedia.org/wiki/Massive_open_online_course">MOOC</a>
yet, but I'm sold on digital tools as useful supplements to in-class
material. They allowed me to spend less time doing more. These tools helped
the class "organize itself", and reduced communication overhead between instructors.
To older or less technically-inclined teachers, some of this
might seem difficult or confusing. On one hand, you really only need *one* instructor
to coordinate the tech - it's not very hard for instructors to use the tools once they're set up.
On the other hand, some of the above
pieces are very easy to implement, and support from a department
administrator or technically-inclined teaching assistant (TA) might be
available for more challenging pieces. An installation of LS shared
across a department, for example, should be trivial for any department
that has its own linux server (e.g. math, physics, chem).
<p>
In my mind, a key goal here is that technology not get in the way.
Many of the commercial "online learning products" that I've seen
adopted by universities make simple tasks complicated, while lacking flexibility.
In trying to do everything
for everyone, they often end up doing nothing for anyone (or take a
high level of skill or experience to use effectively). I far
preferred using several discrete tools, each of which does a single
job well (class email, blog to communicate, git repo to hold files).
<p>
Are there any interesting tools worth trying out next time? I wonder
how a class twitter-hashtag would work...
<h1>The Art of the Album (2013-11-26)</h1>
When I'm working, I'll easily listen to 6 or more hours of music a day. Ten years ago, I listened to a *lot* of public radio, which broadened my ear and introduced me to genres ranging from classical jazz to Native American and <a href="http://kanwstore.myshopify.com/">New Mexican music</a>. Internet radio gave me more choice of stations (I'm now a happy KEXP micro-donor). Finally, there was Rdio (or Spotify, take your pick). Residents of the U.S. got to hear about the wonders of these all-you-can-eat music-as-a-utility services for years before they became available here, while complex licensing agreements were signed with our musical overlords. But the future has finally arrived, and for $10/month I now have an internet full of music (including offline access on my phone; $5/mo for wired-computer-only access). <br />
<br />
The magic of a high-quality, easily searched and streamed music archive has transformed the way I listen to music. When I hear a song I like, it now takes me less than 30 seconds to find the album and begin playing on my office computer or phone.<br />
There are a few drawbacks - not every album or artist is available (Joanna Newsom is a particularly galling example), and occasionally I find myself without a reliable cellphone or WiFi signal. But these are minor issues. Overall, the ultimate convenience, the *modern-ness* of it all still blows my mind. To me, this is better than a flying car (of course, I don't even own a normal car). And this convenience has, in the last few years, rekindled my love of the art of the album.<br />
<br />
It seems to me that independent music in general has benefited from digital distribution by allowing artists to more easily break from the more conventional constraints of genre. I see a lot of experimentation here, running all the way up to the Dirty Projectors' avant-garde classical composition style. Growing up in the 90's, I enjoyed Pearl Jam and Nirvana well enough, but much of the "alternative" music that I listened to at the time sounded (and still sounds) rather similar to my ears, e.g. Grunge. The ones that sounded different really stood out, and I still cherish them for it (I'm looking at you, Pixies). Maybe I'm biased now by access to more music and better DJs, but I find the modern American music scene incredibly vibrant and diverse. Every month, I can look forward to new releases from favorite artists, as well as finding something or someone new to open my eyes and make my day.<br />
<br />
What follows is an unordered list of albums that I've recently developed a strong relationship with. These albums cover a wide range of the acoustic/electronic spectrum. I enjoy repetitive, energetic music when I'm working or juggling or cleaning; I love the emotion and classic song-writing of "folk" and "country"; and I love the driving anthems of modern indie. Consequently, I like to think there's "something for everyone" here. And each of these is an *album*, a free-standing work of art worthy of repeated enjoyment in its uncensored, unedited entirety.<br />
<br />
<h1>Obvious:</h1><h3>Macklemore & Ryan Lewis - The Heist (2012)</h3>I'd like to find more music like this: the crossroads between pop and hip hop, independent music that gets radio play, catchy but meaningful. I can think of half a dozen song lyric lines that make great life slogans. This is a great album to blast in the car on a warm spring day.<br />
<br />
<h3>alt-J - An Awesome Wave (2012)</h3>I think of alt-J as the Neutral Milk Hotel of this decade: where the hell did they come from? It's such a beautiful, subtle album that came out of nowhere and bears repetition very well. I beg the gods for more in the future.<br />
<br />
<h3>Daft Punk - Random Access Memories (2013)</h3>I never really got into Daft Punk before this album, and I didn't even like it that much the first few plays. The songs on this album tend towards longish, some of them are slow, and I found myself getting bored. Then I began getting lines stuck in my head, and began dipping back in. In the end, I find this an immensely satisfying sort of pop-EDM-concept-album: a soothing mix of repetitive riffs that aren't too fast or insistent with a backdrop of pop anthem melodies. It strikes me as easy-listening Moby? This album is a little slow for me to "sit down" and listen to, but I find it excellent clean-the-house/driving music.<br />
<br />
<h3>Phosphorescent - Muchacho (2013)</h3>Rainy day + hot coffee. Sunset and a beer. Just got dumped, fired, graduated, engaged? This is such an extraordinarily luscious, eloquent album. It makes me remember that I have emotions. Lots of them.<br />
<br />
<h3>Santigold - Master of My Make-Believe (2012)</h3>I always perk up when I hear Santigold singles on the radio, but I was slow to listen to the albums. I like her self-titled 2008 album, but it never really got under my skin. The second or third listen of Master, though, and I wanted to know more about this artist. After digging around a bit, I feel like I have a better idea of where she's coming from, and where she's going. The comparison to M.I.A. is inevitable, while the album art for Master suggests something more like Outkast. Master has tons of energy and is packed with pop-friendly riffs. But it's complex, and strikes me as walking the "don't define me" tightrope (or slackline, if you will; you can push *back* on a slackline). I enjoy that it doesn't settle down into a niche and stay there.<br />
<br />
<h1>Less Obvious:</h1><h3>Shovels and Rope - O' Be Joyful (2012)</h3>In my mental map of Americana, I file this near Wilco and Drive By Truckers. Sometimes slow and sweet, sometimes fast and rambunctious, but always melodic, this album is full of luscious 2-part harmonies with a low-fi, intimate feel. I'm always sad when it ends; I always want more.<br />
<br />
<h3>First Aid Kit - The Lion's Roar (2012)<br />
</h3>Can I call this indie-Americana? Less of the overt Southern influence of Shovels and Rope, but still full of tight vocal harmonies of country/folk. Apparently they're sisters, and apparently they're young, but this album has a big sound, full of driving melancholy. Playing two or three of their albums back-to-back is particularly satisfying. They seem to be growing as they go, and I'm excited to hear what comes next.<br />
<br />
<h3>John Grant - Queen of Denmark (2010)<br />
</h3>A very good anthem album. I don't often listen or pay attention to lyrics, but Grant has a John Prine-ish storytelling quality, a dark sense of humor and playful irony. Musically, it tends towards simple, with a fast, light quality that reminds me of Paul Simon's Graceland. Thematically, though, it's a dark album. A far-off hint of redemption shines at the end of the tunnel, but just barely. Whistling in the dark.<br />
<br />
<br />
<h3>Sharon Van Etten - Tramp (2012)<br />
</h3>A powerful voice, and a powerful song-writer. This album is mature and intimate, and Van Etten's voice is strong and clear. Tight harmonies and vocal stylings that are luxurious without being excessive. The utterly enrapturing, controlled-liquid quality of her voice reminds me a little of the Cowboy Junkies' Margo Timmins, with a bit of Joni Mitchell. In short, she's good.<br />
<br />
<h3>Matthew Dear - Beams (2012)<br />
</h3>My first introduction to Matthew Dear, this album is driving. Repetitive, almost grinding, the samples remind me of smoothed-out, slowed down industrial, or gears-and-grease voodoo. It reminds me of being in the belly of a very large machine. The tone palette is less pure than, say, Daft Punk, with lots of glitches and grinding noises. It's also harmonic, full of discordant melodies. And I *love* it. There are songs that I would love to hear on a dark dance floor in a small, crowded night-club. It's sexy as hell, with a floating touch of loss and nostalgia.<br />
<br />
<h3>Jagwar Ma - Howlin (2013)<br />
</h3>This is a somewhat confusing album. A mix of upbeat chorus-driven pop tunes and beat-and-sample driven pop-EDM, I find it a little schizophrenic at times. In the space of two songs, it goes from a drivingly upbeat guitar-and-vocals sound akin to Django Django's recent album, to something more akin to Caribou's hypnotic samples, with little in the way of transition. The situation reminds me a little of Hot Chip's recent album In Our Heads (which I still find deeply confusing). But Howlin is infectious throughout, with several singles that belong in the "party mix". <br />
<br />
<br />
<h3>Dirty Projectors - Swing Lo Magellan (2012)<br />
</h3>Beautiful melodies with a glorious sheen of tightly-controlled noise and discord, this approaches classical composition in broad-scale interest and ability to scare off pedestrians at a first listen. There's just enough rhythm and melody, though, to reel a music-lover in until one gains some familiarity with the subtleties. Then the album really starts to open up. To my ear, it's the opposite of a show-stopping dramatic pop album. It's playful and light, and strange, and curious and coy, going from simple to huge and back. It's complex and, sometimes subtly, very satisfying. This is real sit-down-and-listen music, kind of like going to see the symphony.<br />
<br />
<h3>Junip - Self Title (2013)<br />
</h3>It's not unlikely that you've already heard "Your Life Your Call". I'm sure it's in some movie or another, or will be soon. I get shivers every time I hear this song - like the soundtracks of the Breakfast Club and Trainspotting had a mutant child. Jose Gonzales has a number of solo albums (I'm quite fond of his 2005 Veneer - see below), though I never made the connection with Junip myself. His voice is as clear and emotional as ever, but the sound is bigger and more nuanced, a wonderful blend of semi-acoustic and smooth electronic sounds. This is an emotional album - not any *particular* emotion, but all of them, simultaneously, and a lot. Much like Muchacho, listening to it makes me feel decidedly and acutely human.<br />
<br />
<br />
<h3>Yppah - Eighty One (2012)</h3>Driving indie dream-pop, Yppah's sound is reminiscent of Heliosequence with drum machines. Something to get the shoe-gazers moving around!<br />
<br />
<h1>Less new<br />
</h1>but recently discovered or especially noteworthy albums follow.<br />
I'm ready to wrap this post up, so these get just a brief mention, but they're all worth a good listen.<br />
<br />
<h3>Caribou - Swim (2010)</h3>Smooth, fast, steady electronic. A masterful album.<br />
<br />
<h3>Jose Gonzales - Veneer (2005)</h3>Contains a cover of The Knife's song "Heartbeats" that I adore. Close and intimate and lush.<br />
<br />
<h3>Crystal Castles - Self Title (2008)</h3>One of my current favorite albums. I think of it as glitch-rock. It's more synth-y than punk, but has a lot of similar aesthetic sensibilities: loud, abrasive, driving, and inspiring. I particularly like to cue up all 3 Crystal Castles albums and listen to them in one go. Loudly.<br />
<br />
<h3>Gold Panda - Lucky Shiner (2010)</h3>Very smooth, incredibly well-produced electronic music. Deeply satisfying, good work music. <br />
<br />
<h3>Franz Ferdinand - Self Title (2004)</h3>Anthemic indie-pop. I'm familiar with most of these songs, and was amazed that they all came from a single album. Buddy Holly meets Lou Reed?<br />
<br />
<h3>Jolie Holland - Escondida (2004)</h3>Jolie Holland, lead singer of The Be Good Tanyas, delivers a solo album that is intimate and enwrapping. <br />
<br />
<h3>Juana Molina - Tres Cosas (2004)</h3>Quiet and playful yet insistent percussion is the constant backdrop against which Molina's voice plays. And is it ever playful. Her unassuming Spanish is hypnotizing. She has a new release out that I haven't digested yet, but here's another case where I happily queue up 2 or 3 albums in a row and let them blend effortlessly from one to the next.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-79490359294203227482013-08-23T19:37:00.000-07:002013-08-23T20:10:20.287-07:00Faster, better, cheaper: what is the true value of a computer? One thing I've had a lot of time to think about over the last 15 years is what exactly "faster & more powerful" means. After a decade of clockspeed wars, we've moved on to more cores, more RAM, longer battery life, less weight, backlit keyboards, etc. A new computer still costs about the same as it did 15 years ago... is it any better?<br />
<br />
The more time I spend with the machines, the more I think about usability. A machine is only as powerful as the tasks it can accomplish. I have a $200 netbook (w/linux, of course) that excels at being light. It works great for travel but not Netflix. I couldn't write a paper on it, but I can check email, upload photos/files, etc.<br />
<br />
In our computer cluster here, we upgrade as we hit limits. Ran out of RAM? Buy more... it's cheap (for a desktop, at least). Chronic, awful wrist pain? Get an ergonomic keyboard. I find a 2-screen setup a very cost-effective productivity boost, whereas the idea of paying 20% more for a 10% bump in speed strikes me as silly.<br />
<br />
The whole state of affairs reminds me a little of mean and variance. We always hear about the expected values of things, the mean, but rarely is variance ever reported, and variance is often the most important part. Things like weather, lifespan, salary, time-to-completion? The variance may be more important than the mean. Speed/power tells something about the "maximum potential" of a machine, but not how much use one might get out of it.<br />
<br />
I see two different directions --<br />
<br />
First, unexpected developments. Examples include multiple cores and SSD. No one really expected these to change the aesthetics of computers. Nonetheless, the reduced latency of SSD is pleasantly surprising, as is the increased responsiveness under load of a multi-core machine. I don't hit enter and wait. My machine flows more at the speed of thought than it used to, even if I have a browser with 50+ tabs open, music playing, etc.<br />
<br />
Second, human interface. My new phone is just a little too tall, which makes it just a little more difficult to use, since I can't reach across the screen with one hand. Again, I never expected this to matter, but it's a human-machine interface question rather than a pure machine capabilities question. Backlit keyboards, ergonomic keyboards, featherweight laptops, and long battery lives are all about the human-machine interface. Which is more important than speed, since the human is the whole point....<br />
<br />
The aesthetics of interface is how Apple came to rule the world. Their hardware is beautiful, intuitively responsive to human touch. Hell, even their stores are clean, informative, intuitive, and full of fun toys. Personally, I can't stand their walled-garden approach to hardware, and I have the time and energy to coax my machines into greatness through a commandline interface (which remains one of the most powerful human-machine interfaces ever developed), but I *like* Apple hardware. They've dragged the PC industry kicking and screaming into the 21st century of "Humans matter more than machines".<br />
<br />
Which they rather presciently highlighted in their 1984 Super Bowl commercial:<br />
<a href="http://www.youtube.com/watch?v=2zfqw8nhUwA">http://www.youtube.com/watch?v=2zfqw8nhUwA</a><br />
<br />
Finally, a mind-blowing historical view on the subject, including a 1983 Bill Gates pimping Apple hardware, and Steve Jobs describing how machines should help people rather than the other way around:<br />
<a href="http://www.youtube.com/watch?v=xItV5U-V2W4">http://www.youtube.com/watch?v=xItV5U-V2W4</a>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-68668546925920604122013-08-23T19:12:00.000-07:002013-08-23T19:12:02.047-07:00Everyday revision control This post has been a long time coming. Over the past year or so, I've gradually become familiar, even comfortable, with <a href="http://en.wikipedia.org/wiki/Git_%28software%29">git</a>. I've mainly used it for my own work, rather than as a collaborative tool. Most of the folks that I work with don't need to share code on a day-to-day basis, and there's a learning curve that few of my current colleagues seem interested in climbing at this point. This hasn't stopped me from *talking* to my colleagues about git as an important tool in reproducible research (henceforth referred to as RR). <br />
<br />
I find the process of committing files and writing commit messages at the end of the day forces me to tidy up. It also allows me to more easily put a project on hold for weeks or months and then return to it with a clear understanding of what I'd been working on, and what work remained. In short, I use my git commit messages very much like a lab notebook (a countervailing view on git and reproducible research is <a href="http://kbroman.wordpress.com/tag/reproducible-research/">here</a>, an interesting discussion of GNU make and RR <a href="http://kbroman.github.io/minimal_make/">here</a>, and a nice post on RR from knitr author Yihui Xie <a href="http://www.r-bloggers.com/making-reproducible-research-enjoyable/">here</a>).<br />
<br />
<br />
Sidenote: I've hosted several projects at <a href="https://code.google.com/">https://code.google.com/</a>, and used their git archives, particularly for classes (I prefer the interface to github, though the two platforms are similar). I've also increasingly used Dropbox for collaborations, and I've struggled to integrate Dropbox and Git. Placing the same file under the control of two very different synchronization tools strikes me as a Bad Idea (TM), and Dropbox's handling of symlinks isn't very sophisticated. On the other hand, maintaining 2 different file-trees for one project is frustrating. I haven't found a good solution for this yet...<br />
<br />
As far as tools go, most of the time I simply edit files with vim and commit from the commandline. In this sense, git has barely changed my work flow, other than demanding a bit of much-needed attention to organization on my part. Lately, I've started using GUI tools to easily visualize repositories, e.g. to simultaneously see a commit's date, message, files, and diff. Both gitk and giggle have similar functionality -- giggle is prettier, gitk is cleaner. Another interesting development is that Rstudio now includes git tools (as well as latex and knitr support in the native Rstudio editor). This means that a default Rstudio install has all the tools necessary for a collaborator to quickly and easily check out a repository and start playing with it.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-23194968509014775872013-08-09T04:20:00.001-07:002013-08-09T04:25:10.902-07:00Adventures with AndroidAfter months of dealing with an increasingly sluggish and downright buggy Verizon HTC Rhyme, I finally took the leap and got a used Galaxy Nexus. First off, I think it's beautiful. The Rhyme isn't exactly a high-end phone, so the small, unexciting screen isn't particularly surprising. By comparison, the circa 2011 Nexus is a work of art. My first impression of the AMOLED screen is great. Dark blacks and luscious color saturation (though I have found it to be annoyingly shiny -- screens shouldn't be mirrors!). It even has a <a href="http://gizmodo.com/5851288/why-the-barometer-is-androids-new-trump-card">barometer</a>!<br />
<br />
One big motivation for a phone upgrade (aside from the cracked screen and aforementioned lag) was being stuck at Android 2.3.4 (Gingerbread). As a technologist, I don't consider myself an early adopter. I prefer to let others sort out the confusion of initial releases, and pick out the gems that emerge. But Gingerbread is well over 2 years old, and a lot has happened since then. The Galaxy Nexus (I have the Verizon CDMA model, codename <a href="http://wiki.cyanogenmod.org/w/Toro_Info">Toro</a>) is a skin-free, pure Android device. Which means that I am now in control of my phone's Android destiny!<br />
<br />
How to go about this? I hit the web and cobbled together a cursory understanding of the Android/google-phone developer ecosystem as it currently stands. First off, there's <a href="http://forum.xda-developers.com/">xda-developers</a>, a very active community of devs and users. There's an organizational page for information on the Galaxy Nexus <a href="http://forum.xda-developers.com/showpost.php?p=32317030&postcount=2">here</a> that helped me get oriented. <a href="http://www.webupd8.org/2012/08/install-adb-and-fastboot-android-tools.html">This post</a> made installing adb and fastboot a snap on ubuntu 12.04 (precise). There's also some udev magic from google <a href="http://source.android.com/source/initializing.html"> under the "Configuring USB Access" section here</a> that I followed, perhaps blindly (though http://source.android.com is a good primary reference...).<br />
<br />
Next, I <a href="http://clockworkmod.com/rommanager">downloaded ClockworkMod</a> for my device, rebooted my phone into the bootloader, and installed and booted into ClockworkMod:<br />
<br />
<pre>## these commands are from computer connected to phone (via usb cable)
## check that phone is connected. this should return "device"
adb get-state
## reboot the phone into the bootloader
adb reboot-bootloader
## in recovery, the phone should have funny pictures of the android bot opened up... reminds me of Bender.
## the actual file (the last argument) will vary by device
fastboot flash recovery recovery-clockwork-touch-6.0.3.5-toro.img
## boot into ClockworkMod
fastboot boot recovery-clockwork-touch-6.0.3.5-toro.img
</pre><br />
This brought me to the touchscreen interface of ClockworkMod. First, I did a factory reset/clear cache as per others' instructions. Then I flashed the files listed <a href="http://forum.xda-developers.com/showthread.php?t=2395682">here</a> (with the exception of root) via sideloading. There's an option in ClockworkMod that says something like "install from sideload". Selecting this gives instructions -- basically, use adb to send the files, and then ClockworkMod takes care of the rest:<br />
<br />
<pre>## do this on computer after selecting "install from sideload" on phone
## the ROM, "pure_toro-v3-eng-4.3_r1.zip", varies by device
adb sideload pure_toro-v3-eng-4.3_r1.zip
## repeat for all files that need to be flashed
</pre><br />
I rebooted into a shiny new install of Jelly Bean (4.3). It's so much cleaner and more pleasant than my old phone. I was also pleasantly surprised to see that Android Backup service auto-installed all my apps from the Rhyme.<br />
<br />
In the process of researching this, I got a much better idea of what <a href="http://en.wikipedia.org/wiki/CyanogenMod">CyanogenMod</a> does. I'm tempted to try it out now, but I reckon I'll wait for the 4.3 release, whenever that happens.<br />
<br />
<br />
I also found <a href="http://www.htcdev.com/bootloader">http://www.htcdev.com/bootloader</a>, which offers the prospect of unlocking and upgrading the HTC Rhyme, though I haven't found any ROMs that work for the CDMA version... xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-61587104934733115792013-06-13T04:18:00.001-07:002013-06-13T04:18:56.270-07:00Secure webserver on the cheap: free SSL certificatesSetting up an honest, fully-certified secure web server (e.g. https) on the cheap can be tricky, mainly due to certificates. Certificates are only issued to folks who can prove they are who they say they are. This verification generally takes time and energy, and therefore money. But the great folks at <a href="https://www.startssl.com/">https://www.startssl.com/</a> have an automated system that verifies identity and auto-renders associated SSL certificates for free.<br />
<br />
Validating an email is easy enough, but validating a domain is trickier -- it requires a receiving mailserver that startssl can mail a verification code to. Inbound port 25 (mail server) is blocked by <a href="http://mailreg.unm.edu/">my ISP, the University of New Mexico</a> (and honestly, I'd rather not run an inbound mail server).<br />
<br />
I manage my personal domain through <a href="http://freedns.afraid.org/">http://freedns.afraid.org/</a>. They provide full DNS management, as well as some great dynamic DNS tools. They're wonderful. But they don't provide any fine-grained email management, just MX records and the like.<br />
<br />
The perfect companion to afraid.org is <a href="https://www.e4ward.com/">https://www.e4ward.com/</a>. They have mail servers that will conditionally accept mail for specific addresses at a personal domain, and forward that mail to an email account. This lets me route specific addresses @mydomain.com, things like postmaster@mydomain.com, to my personal gmail account. E4ward is <a href="http://newhelp.e4ward.com/">a real class-act</a>. They manually moderate/approve new accounts, so there's a bit of time lag. To add a domain, they also require proof of control via a TXT record (done through afraid.org).<br />
<br />
This whole setup allowed me to prove that I owned my domain to startssl.com without running a mail server or paying for anything other than the domain. The result is my own SSL certificates. I'm running a pylons webapp with apache2 and mod_wsgi. In combination with python's repoze.what, I get secure user authentication over https without any <a href="http://askubuntu.com/questions/49196/how-do-i-create-a-self-signed-ssl-certificate">snakeoil</a>.<br />
<br />
Hat-tip to <a href="http://forum.asb-sakray.net/index.php?showtopic=32242">this writeup</a>, which introduced me to e4ward.com and their mail servers.<br />
<br />
Finally, there are a number of online tools to query domains. <a href="http://www.dnsstuff.com/tools#dnsReport">dnsstuff.com</a> was one of the better ones I found. It takes longer to load, but gives a detailed report of domain configuration, along with suggestions. A nice tool to verify that everything is working as expected. <br />
<br />
<br />
xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com5tag:blogger.com,1999:blog-4695461192314112143.post-21559408497976437312013-06-11T04:27:00.000-07:002013-06-11T04:37:23.609-07:00Learning new fileserver tricks: RAID + LVMI've finally <a href="http://www.ducea.com/2009/03/08/mdadm-cheat-sheet/">gotten comfortable with linux's software raid</a>, aka mdadm. I've been hearing about LVM, and I finally took the plunge and figured out how to get the two to play together. Of course, a benefit of RAID is data security. The big benefit I see from LVM is getting to add/remove disk space without repartitioning. Once RAID was working, stacking LVM on top was easy enough, especially for my use case of a single-big-filesystem. I was able to move all my data onto one RAID array, build a new filesystem on top of a logical volume, move the data to the new filesystem, and then add the final RAID array to the logical volume and resize the filesystem. Thus, I end up with 3 separate RAID arrays glommed together into a single, large filesystem. <br />
<br />
<pre>## Tell LVM about RAID arrays
sudo pvcreate /dev/md2
sudo pvcreate /dev/md3
## Create a volume group from empty RAID arrays
sudo vgcreate VolGroupArray /dev/md2 /dev/md3
## Create a logical volume named "archive", using all available space
sudo lvcreate -l +100%FREE VolGroupArray -n archive
sudo lvdisplay
## and create a filesystem on the new logical volume
sudo mkfs.ext4 /dev/VolGroupArray/archive
## mount the new filesystem
## and move files from the mount-point of /dev/md1 to /dev/VolGroupArray/archive
## then unmount /dev/md1
## Add the last RAID array to the volume group
sudo pvcreate /dev/md1
sudo vgextend VolGroupArray /dev/md1
## Update the logical volume to use all available space
sudo lvresize -l +100%FREE /dev/VolGroupArray/archive
## And resize the filesystem -- rather slow, maybe faster to unmount it first...
sudo resize2fs /dev/VolGroupArray/archive
## Finally, get blkid and update /etc/fstab with UUID and mount options (here, just noatime)
sudo blkid
</pre><br />
I probably should have made backups before I did this, but everything went smoothly...<br />
Also, I discovered <a href="https://github.com/g2p/blocks">this python tool</a> to do conversions in-place. Again, this appears non-destructive, but back-ups never hurt. Also of interest for a file server is Smartmontools to monitor for hardware/disk failures: a nice review <a href="https://help.ubuntu.com/community/Smartmontools"> is here</a>.<br />
<br />
[REFS]<br />
* <a href="http://home.gagme.com/greg/linux/raid-lvm.php">http://home.gagme.com/greg/linux/raid-lvm.php</a><br />
* <a href="https://wiki.archlinux.org/index.php/Software_RAID_and_LVM">https://wiki.archlinux.org/index.php/Software_RAID_and_LVM</a><br />
* <a href="http://webworxshop.com/2009/10/10/online-filesystem-resizing-with-lvm">http://webworxshop.com/2009/10/10/online-filesystem-resizing-with-lvm</a>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-43620929023508890862013-06-07T01:28:00.001-07:002013-06-07T01:37:40.267-07:00Symmetric set differences in RMy .Rprofile contains a collection of convenience functions and function abbreviations. These are either functions I use dozens of times a day and prefer not to type in full:<br />
<pre>## my abbreviation of head()
h <- function(x, n=10) head(x, n)
## and summary()
ss <- summary
</pre>
Or problems that I'd rather figure out once, and only once:
<pre>## example:
## between( 1:10, 5.5, 6.5 )
between <- function(x, low, high, ineq=F) {
## like SQL between, return logical index
if (ineq) {
x >= low & x <= high
} else {
x > low & x < high
}
}
</pre>
One of these "problems" that's been rattling around in my head is the fact that setdiff(x, y) is asymmetric, and has no options to modify this.
With some regularity, I want to know if two sets are equal, and if not, what the differing elements are. setequal(x, y) gives me a boolean answer to the first question. It would *seem* that setdiff(x, y) would identify those elements. However, I find the following result rather counter-intuitive:
<pre>> setdiff(1:5, 1:6)
integer(0)
</pre>I personally dislike having to type both setdiff(x,y) and setdiff(y,x) to identify the differing elements, as well as remember which is the reference set (here, the second argument, which I find counterintuitive). With this in mind, here's a snappy little function that returns the symmetric set difference:
<pre>symdiff <- function( x, y) { setdiff( union(x, y), intersect(x, y))}
> symdiff(1:5, 1:6) == symdiff(1:6, 1:5)
[1] TRUE
</pre><br />
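And the differing element itself pops right out, regardless of argument order:
<pre>> symdiff(1:5, 1:6)
[1] 6
</pre><br />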
Tada! A new function for my .Rprofile!xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com1tag:blogger.com,1999:blog-4695461192314112143.post-92196005097788352162012-11-25T00:44:00.000-07:002012-11-25T01:16:18.805-07:00Successive Differences of a Randomly-Generated TimeseriesI was wondering about the <i>null distribution</i> of successive differences of random sequences, and decided to do some numerical experiments. I quickly realized that taking successive differences equates to taking successively higher-order numerical derivatives, which functions as a high-pass filter. So, the <i>null distribution</i> really depends on the spectrum of the original timeseries.
<br>
<br>
Here, I've only played with random sequences, which are equivalent to white noise. Using the wonderful animation package, I created a movie that shows the timeseries resulting from differencing, along with their associated power spectra. You can see that, by the end, almost all of the power is concentrated in the highest frequencies. The code required to reproduce this video is shown below.
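(An aside of mine: here's a minimal static check of that claim, before the full animation code. The high-order difference of white noise should have nearly all of its power at the highest frequencies.)
<pre>## quick static check: difference white noise 20 times and inspect the spectrum
set.seed(1)
x = rnorm(1024)
d20 = diff(x, differences=20)
par(mfrow=c(1,2))
spectrum(x, main='White noise')
spectrum(d20, main='20th successive difference')
</pre>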
<br><br>
Note -- For optimum viewing, switch the player quality to high.
<iframe width="800" height="600" src="http://www.youtube.com/embed/y9jKJnF0IJI?hd=1" frameborder="0" allowfullscreen></iframe>
<pre>
require(animation)
## large canvas, write output to this directory, 1 second per frame
ani.options(ani.width=1280, ani.height=720, loop=F, title='Successive differences of white noise', outdir='.', interval=1)
require(plyr)
require(xts)
require(lattice)  ## needed for xyplot() below
## How many realizations to plot?
N=5
## random numbers
aa = sapply(1:26, function(x) rnorm(1e2));
colnames(aa) = LETTERS;
saveVideo( {
## for successive differences, do...
for (ii in 1:50) {
## first make the differences and normalize
aa1 = apply(aa, 2, function(x) {
ret=diff(x, differences=ii);ret=ret/max(ret)
});
## Turn into timeseries object for easy plotting
aa1 = xts(aa1, as.Date(1:nrow(aa1)));
## next, compute spectrum
aa2 = alply(aa1, 2, function(x) {
## of each column, don't modify original data
ret=spectrum(x, taper=0, fast=F, detrend=F, plot=F);
## turn into timeseries object
ret= zoo(ret$spec, ret$freq)});
## combine into timeseries matrix
aa2 = do.call(cbind, aa2 );
colnames(aa2) = LETTERS;
## plot of timeseries differences
## manually set limits so plot area is exactly the same between successive figures
myplot = xyplot(aa1[,1:N], layout=c(N,1),
xlab=sprintf('Difference order = %d', ii),
ylab='Normalized Difference',
ylim=c(-1.5, 1.5),
scales=list(alternating=F, x=list(rot=90), y=list(relation='same')));
## plot of spectrum
myplot1 = xyplot(aa2[,1:N], layout=c(N,1),
ylim=c(-0.01, 5), xlim=c(0.1, 0.51),
xlab='Frequency', ylab='Spectral Density',
type=c('h','l'),
scales=list(y=list(relation='same'), alternating=F));
## write them to canvas
plot(myplot, split=c(1,1,1,2), more=T);
plot(myplot1, split=c(1,2,1,2), more=F);
## provide feedback of process
print(ii)}
## controls for ffmpeg
}, other.opts = "-b 5000k -bt 5000k")
</pre>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-27720233664554958962012-11-08T23:36:00.000-07:002012-11-08T23:40:03.621-07:00Consumerist DilemmasConsumer society gives us lots of choices, but sometimes provides little opportunity for post-choice customization. We buy something, and we can either return it or keep it, but it is what it is. Want something a little shorter, a little tighter, or a little less stiff? Too bad. Unless you're loaded, in which case you can hire someone to do a custom job. But, to my knowledge, even the super-rich no longer order custom automobiles.
<br><br>
For a small subset of expensive, long-lived personal items like cars and mattresses, this can make new purchases particularly stressful. This is one plausible explanation for the copious consumer-choice media devoted to, for example, cars, as well as the effort companies put into brand identity. If they sell you something that you like, then there's a reasonable chance you'll get another one, the "safe choice", when the old one breaks.
<br><br>
I've faced this dilemma recently with shoes and glasses. I typically have one or two pairs of each that I use daily, usually for at least 2 years. I'm very near-sighted, so I have to get the expensive high-index lenses. My big toes spread out, I assume from years of wearing flip-flops and Chacos, so most every shoe feels narrow and pointy. The decision to get a new pair of glasses or shoes fills me with an existential dread -- what if I'm stuck with costly discomfort for the next year or more?
<br><br>
This month I discovered that keyboards engender a similar dread. I'm at a keyboard for at least 20 hours on an average week, and I imagine it gets up to 60 hours on a long week. Not all of that is typing, but I've had a slowly building case of carpal tunnel / repetitive stress injury, particularly in my right hand, from long coding sessions that involve heavy use of shift, up-arrow, and enter keys. After a month of poking around the internet, checking reviews on Amazon and looking at eBay for used items, I felt totally conflicted. I just wanted to "try something on" before I shelled out cash to saddle myself with something that I ended up hating.
<br><br>
But anything would be better than my stock Dell monstrosity.
In this spirit, I dropped by the chaotic indie computer parts/repair store across the street from campus, and found two ergonomic keyboards used and on the cheap:
<br><br> #1 Microsoft Comfort Curve 2000 v1
<br> #2 Adesso EKB-2100W
<br><br>
Neither of them is the perfect keyboard. The Adesso has a nice split design, but it's huge. Enormous. Heavy -- it could be an instrument of murder in the game of Clue. #1 looks nice, and has really easy key-travel. Unfortunately, the left Caps-lock, which is mapped to control and is involved in something like ~5-20% of my (non-prose) keystrokes, is sticky and giving my pinky problems already.
<br><br>
At least they're cheap.
<br><br>
A friend pointed out that one key to preventing RSI is to change up the routine. With this in mind, maybe two keyboards isn't such a bad idea. Did I mention they're cheap? At the very least, the existential dread has abated. No more keyboard choices to make in the foreseeable future. Now, if I could just find a low-profile pair of sneakers with a large toebox and a tasteful lack of neon and mesh.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com1tag:blogger.com,1999:blog-4695461192314112143.post-86705676370254766402012-10-02T02:56:00.000-07:002012-11-08T23:37:27.421-07:00Modeling PhilosophyI've found it sometimes difficult to explain my modeling philosophy, especially in the face of "why don't you just do X?" or "isn't that just like Y and Z?" responses. Without going into X, Y, or Z in the slightest, I present the following quote from one of my intellectual heroes. His short book (less than 100 pages) is full of equations that I struggle to make sense of (not unlike going to the opera), and also full of little morsels of wisdom. Here's one of my favorites:
<blockquote>It need hardly be repeated that no <i>detailed</i> statistical agreement with observation is to be expected from models which are based on admitted simplifications. But the situation is quite analogous to others where complex and interacting statistical systems are under consideration. What is to be looked for is a comparable pattern with the main features of the phenomena; when, as we shall see, the model predicts important features which are found to conform to reality, it becomes one worthy of further study and elaboration.</blockquote>
--M.S. Bartlett, 1960.
<br>
Stochastic Population Models in Ecology and Epidemiologyxianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-48831072053132543292012-02-29T00:14:00.016-07:002012-03-08T04:17:46.361-07:00Custom Amazon EC2 config for Rstudio<h3>Introduction</h3>This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future.<br />
<br />
To start, I followed the instructions <a href="http://bioconductor.org/help/bioconductor-cloud-ami/">here</a> to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the bioconductor folks for setting up this image.<br />
<br />
<h4>Edit -- another look </h4>After playing with this setup more and trying to shrink the root partition (as per <br />
<a href="http://www.thatsgeeky.com/2011/11/shrink-ebs-root/">these instructions</a>), I realized there's a tremendous amount of cruft in this AMI. It's over 20GB, there's a weird (big) /downloads/ folder lying around, and just about every R package ever in /usr/local/lib64/R. I cut it down to ~6GB by removing these two directories. <br />
<br />
<br />
If I were to do this again, I would use the <a href="http://uec-images.ubuntu.com/releases/10.04/release/">official Canonical images</a> (more details <a href="https://help.ubuntu.com/community/EC2StartersGuide">here</a>), and manually install what I need. Honestly, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain of the image. Aside from installing a few extra packages (<a href="http://rstudio.org/download/server">Rstudio server, for example</a>), the rest of the instructions should be similar.<br />
<br />
<h3>Modifications</h3>I proceeded to lock it down -- add an admin user, prevent root login, etc.<br />
Then I set up apache2 to serve Rstudio server over HTTPS/SSL.<br />
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).<br />
<br />
Below is the commandline session that I used to do it, along with annotations and links to some files.<br />
<pre>adduser myusername
adduser myusername admin  ## the "admin" group carries sudo rights on Ubuntu 10.04
## add correct key to ~myusername/.ssh/authorized_keys
vi /etc/ssh/sshd_config
## disable root login
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root
</pre><br />
<br />
Next, I set up dynamic dns using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- <a href="http://helmingstay.blogspot.com/p/afraidorg-script-for-amazon-ec2.html">it's very self-explanatory</a>. <br />
<br />
<pre>## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart
## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http
## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
'/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'
## change permissions, move to default ubuntu locations
sudo chmod 444 crocus.crt
sudo chown root.ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates
sudo chmod 640 crocus.key
sudo chown root.ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/
sudo a2enmod rewrite
sudo a2enmod ssl
sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart
</pre><br />
You can see my full apache config file <a href="http://helmingstay.blogspot.com/p/apache2-config-for-ssl.html">here</a>:<br />
<br />
<h3>Conclusions</h3>Now I access Rstudio on the EC2 instance with a browser via:<br />
https://myhostname.mydomain.org<br />
<br />
I found that connecting to the Rstudio server web interface gave noticeable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.<br />
<br />
Another annoyance was that Rstudio installs packages into something like ~/R/libraries, whereas the commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.<br />
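As a quick sanity check (an aside of mine), running the same one-liner in both the Rstudio session and a commandline R session over ssh shows where each front-end looks for, and installs, packages:<br />
<pre>## compare the output from the Rstudio server session and an ssh R session;
## the first element of .libPaths() is where install.packages() puts new libraries
.libPaths()
Sys.getenv("R_LIBS_USER")
</pre><br />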
<br />
Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.<br />
<br />
All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-53267364945981792872012-02-28T03:17:00.004-07:002012-02-29T02:53:34.112-07:00Adventures in R Studio Server: Apache2, Https, Security, and Amazon EC2.I just put a fresh install of Ubuntu Server (10.04.4 LTS) on one of our machines. As I was doing some post-install config, I accidentally installed Rstudio Server. And subsequently fell down an exciting little rabbit-hole of server configuration and "ooooh-lala!" playtime.<br />
<br />
A friend sang the wonders of Rstudio Server to me recently, and I filed it under "things to ignore for now". Just another thing to learn, right? Turns out, the Rstudio folks do *great* work and write good docs, so I hardly had to learn anything. I just had to dust off my sysadmin skills and fire up some google.<br />
<br />
I'm a little concerned about running web services on public-facing machines. Even more so, given that R provides fairly low-level access to operating system services. Still, I was impressed to see system user authentication.<br />
<br />
I followed the docs for <a href="http://rstudio.org/docs/server/running_with_proxy">running apache2 as a proxy server</a>, and learned a little about apache in the process. Since I made it this far, I figured I'd run it <a href="http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/">through https/ssl</a>, add some <a href="http://rstudio.org/docs/server/configuration">memory limitations</a>, etc. I'm still not entirely convinced this is secure -- it seems that running it in a virtual machine or <a href="http://kegel.com/crosstool/current/doc/chroot-login-howto.html">chroot jail</a> would be ideal.<br />
<br />
On the other hand, I ran across <a href="http://toreopsahl.com/2011/10/17/securely-using-r-and-rstudio-on-amazons-ec2/">this post</a> on running Rstudio Server inside Amazon EC2 instances. Nighttime <a href="http://aws.amazon.com/ec2/pricing/">EC2 spot prices </a>on "<a href="http://aws.amazon.com/ec2/instance-types/">Quadruple Extra Large</a>" instances (68.4 GB of memory, <br />
8 virtual cores with 3.25 EC2 Compute Units each) fell below $1 an hour tonight, which is cheap enough to play with for an hour or two -- take it through some paces and see how well it does with a *very* *large* *job* or two. Instances can now be stopped and saved to EBS (elastic block storage), and so only need to be configured once, which really simplifies matters. In fact, I'm wondering if Rstudio (well, R, really) is my "killer app" for EC2. <br />
<br />
Overall, I was really impressed at how fast and easy this was to get up and running. Fun times ahead!xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com1tag:blogger.com,1999:blog-4695461192314112143.post-44049961888709338322011-10-29T00:13:00.002-07:002012-02-29T02:57:01.879-07:00SQL KoanIt's not <b>that</b> profound, but I sure do like it: a simple, elegant example of a self-join that gives a truth table for the <a href="http://wiki.postgresql.org/wiki/Is_distinct_from">NOT DISTINCT FROM</a> operator. <br />
<br />
<pre>WITH x(v) AS (VALUES (1),(2),(NULL))
SELECT l.v, r.v, l.v = r.v AS equality,
l.v IS NOT DISTINCT FROM r.v AS isnotdistinctfrom
FROM x l, x r;
</pre><br />
Thanks to <a href="http://postgresql.1045698.n5.nabble.com/SELF-LEFT-OUTER-JOIN-SELF-JOIN-including-NULL-values-tp2843950p2843982.html">Sam Mason</a> via the PostgreSQL - general mailing list.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-240083545361145112011-10-28T21:42:00.001-07:002012-02-29T02:54:59.597-07:00Why can't ads be more fun?The worst part about advertisements these days, in my not-so-humble opinion, is the degree to which they belittle the user. Well-endowed lady wants to be my Facebook friend? "Click here to confirm request." Riiight. Lonely singles near me? Yeah, sure. <a href="http://www.wired.com/magazine/2011/09/mf_scareware/all/1">OH GOOD LORD, A THREAT HAS BEEN IDENTIFIED!</a> Oh, wait, y'all really <b>are</b> evil! On the infrequent occasions that I do watch TV, the ads are less flagrantly insidious, but they're nonetheless relentlessly patronizing.<br />
<br />
The second-worst part of ads, IMNSHO, is the brute-force repetition. The same inane 30 seconds once? I can do that, but five times in an hour? You've got to be kidding me. Nope, no kidding here. Just 30 seconds of the <b>same</b> inanity, again and again.<br />
<br />
Consequently, I find unique and intelligent ads rather inspiring. I know, it shouldn't be this way. One might think that unique and inspiring would be more common amongst something so prevalent. Anyway...<br />
<br />
Some of the GEICO ads that have played on <a href="http://hulu.com">Hulu</a> lately have managed to avoid the "worst part". They're short and playful, vaguely reminiscent of <a href="http://en.wikipedia.org/wiki/Adult_swim">[adult swim]</a> commercials. "Is the sword mightier than the pen" is my current favorite; I still sometimes chuckle when I see it. They still fail occasionally in the repetition department. If you're a large international corporation, how hard is it to make more than a handful of moderately-entertaining 30-second spots?<br />
<br />
<br />
Which brings me back to banner ads. I just saw this at the top of <a href="http://arstechnica.com">ars technica</a> today for the first time, and it really caught my eye. I don't read or write <a href="http://en.wikipedia.org/wiki/Javascript">Javascript</a>, but I found myself puzzling through it and then following <a href="http://heinz.cmu.edu/c/heinz-mism-code.html?utm_campaign=mism-june2011&utm_medium=banner&utm_source=wired-ars&utm_content=code">the link</a>. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-130qNWKZNiM/TquCFTkJovI/AAAAAAAABC0/UcGtKFRlJSc/s1600/5333546_Ad-2-Banner.jpg" imageanchor="1" style=""><img border="0" height="49" width="400" src="http://4.bp.blogspot.com/-130qNWKZNiM/TquCFTkJovI/AAAAAAAABC0/UcGtKFRlJSc/s400/5333546_Ad-2-Banner.jpg" /></a></div><br />
At the top of the page, I find this banner, which visually reminds me where I came from, that I've come to the right place, and gives me another little puzzle. Neat. Thank you, Heinz College of Carnegie Mellon.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-3a6i63FsgVQ/TquCL4Rq2xI/AAAAAAAABDA/_7YIUIVuja0/s1600/banner_code.jpg" imageanchor="1" style=""><img border="0" height="59" width="400" src="http://3.bp.blogspot.com/-3a6i63FsgVQ/TquCL4Rq2xI/AAAAAAAABDA/_7YIUIVuja0/s400/banner_code.jpg" /></a></div><br />
None of this is revolutionary, but I <b>do</b> hope it is evolutionary. Ads don't <b>have</b> to be boring, and I'm guessing that interesting and intelligent ads are more effective even as they're less painful and insulting. My guess is that a generational change-over in marketing departments and their managers is underway, and will be slow. Still, I look forward to a new generation of more modern advertising approaches that gives me something to <b>think</b> about while my eyeballs are held hostage.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-77183409073473194972011-10-06T20:28:00.000-07:002011-10-06T20:28:03.207-07:00FFT / Power Spectrum Box-and-Whisker Plot with ggplot2I have a bunch of time series whose power spectra (FFT via <b>R</b>'s <b>spectrum()</b> function) I've been trying to visualize in an intuitive, aesthetically appealing way. At first, I just used lattice's <b>bwplot</b>, but the spacing of the X-axis here really matters. The spectra's frequencies aren't regularly-spaced categories, which is the default of <b>bwplot</b>. If all the series are of the same length, then the coefficients of the FFT are all estimated at the same frequencies, and one can use <b>plyr</b>'s split-apply-combine magic to compute the distribution of coefficients at each frequency.<br />
<br />
I spent some time trying to faithfully reproduce the plotting layout of the figures produced by, e.g. <b>spectrum(rnorm(1e3))</b> with respect to the axes. This was way more annoying than I expected...<br />
<br />
To include a generic example, I started looking around for easily generated, interesting time series. The <a href="http://en.wikipedia.org/wiki/Logistic_map">logistic map</a> is <a href="http://helmingstay.blogspot.com/2011/10/bat-country.html">so damned easy to compute</a> and generally pretty interesting, capable of producing a range of frequencies. Still, I was rather surprised by the final output. Playing with these parameters produces a range of serious weirdness -- apparent ghosting, interesting asymmetries, etc. <br />
<br />
Note that I've just used the logistic map for an interesting, self-contained example. You can use the exact same code to generate your own <b>spectrum()</b> box-and-whisker plot by substituting your matrix of time series for <b>mytimeseries</b> below. Do note here that series are by row, which is not exactly standard. For series by column, just change the margin from 1 to 2 in the call to <b>adply</b> below.<br />
<br />
<pre>## box-and-whisker plot of FFT power spectrum by frequency/period
## vary these parameters -- r as per logistic map
## see http://en.wikipedia.org/wiki/Logistic_map for details
logmap = function(n, r, x0=0.5) {
ret = rep(x0,n)
for (ii in 2:n) {
ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
}
return(ret)
}
## modify these for interesting behavior differences
rfrom = 3.4
rto = 3.7
rsteps=200
seqlen=1e4
require(plyr)
require(ggplot2)
mytimeseries = aaply(seq(from=rfrom, to=rto, length.out=rsteps), 1,
function(x) {
logmap(seqlen, x)
})
## you can plug in any array here for mytimeseries
## each row is a timeseries
## for series by column, change the margin from 1 to 2, below
logspec = adply( mytimeseries, 1, function(x) {
## change spec.pgram parameters as per goals
ret = spectrum(x, taper=0, demean=T, pad=0, fast=T, plot=F)
return( data.frame(freq=ret$freq, spec=ret$spec))
})
## boxplot stats at each unique frequency
logspec.bw = ddply(logspec, 'freq', function(x) {
ret = boxplot.stats(log10(x$spec));
## sometimes $out is empty, bind NA to keep dimensions correct
return(data.frame(t(10^ret$stats), out=c(NA,10^ret$out)))
})
## plot -- outliers are dots
## be careful with spacing of hard returns -- ggplot has odd rules
## as close to default "spectrum" plot as possible
ggplot(logspec.bw, aes(x=1/(freq))) + geom_ribbon(aes(ymin=X1,
ymax=X5), alpha=0.35, fill='green') +
geom_ribbon(aes(ymin=X2, ymax=X4), alpha=0.5, fill='blue') +
geom_line(aes(y=X3)) +
geom_point(aes(y=out), color='darkred') +
scale_x_continuous( trans=c('inverse'), name='Period') +
scale_y_continuous(name='Power', trans='log10')
</pre><br />
<br />
Following my nose further on the logistic map example, I also used the <b>animation</b> package to make a movie that walks through the power spectrum of the map for a range of r values. Maybe one day I'll set this to music and post it to Youtube! Though I think some strategic speeding up and slowing down in important parameter regions is warranted for a truly epic tour of the logistic map's power spectrum.<br />
<br />
<pre>require(animation)
saveVideo({
ani.options(interval = 1/5, ani.type='png', ani.width=1600, ani.height=1200)
for (ii in c(seq(from=3.2, to=3.5, length.out=200), seq(from=3.5, to=4, length.out=1200))) {
spectrum(logmap(1e4, ii), taper=0, demean=T, pad=0, fast=T, sub=round(ii, 3))
}
}, video.name = "logmap2.mp4", other.opts = "-b 600k")
</pre>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-45217636032646054912011-10-06T16:49:00.001-07:002011-10-06T16:49:58.702-07:00Bat CountryI've spent a lot of time thinking about and using R's <b>spectrum()</b> function and the <a href="http://en.wikipedia.org/wiki/Fast_Fourier_transform">Fast Fourier Transform</a> (FFT) in the last 5+ years. Lately, they've begun to remind me a little of a <a href="http://en.wikipedia.org/wiki/Theremin#Operating_principles">Theremin</a>: simple to use, difficult to master.<br />
<br />
While prepping a figure for the <a href="http://addictedtor.free.fr/graphiques/">R Graph Gallery</a> (which is awesome, by the way), I ran across this curious example -- its striking visual appearance was definitely <b>not</b> what I was expecting. <br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-tvmt2B3r29s/To4zzA8NmbI/AAAAAAAABCk/GZIIk1aGtCo/s1600/batcountry.png" imageanchor="1" style="margin-left:1em; margin-right:1em"><img border="0" height="400" width="400" src="http://2.bp.blogspot.com/-tvmt2B3r29s/To4zzA8NmbI/AAAAAAAABCk/GZIIk1aGtCo/s400/batcountry.png" /></a></div><br />
I decided to use the <a href="http://en.wikipedia.org/wiki/Logistic_map#Behavior_dependent_on_r">logistic map</a> to generate an ensemble of time series with a range of frequencies. The logistic map goes through characteristic period doubling, and then exhibits broad-spectrum "noise" at the onset of chaos. And it's <b>really</b> easy to compute. So, I can tune my time series to have a range of frequency distributions. <br />
<br />
In this example, I'm in the periodic regime (r=3.5, period 4). So why am I getting this Batman motif?<br />
<br />
Actually, it's weirdly simple. This is an artifact of the (default) tapering that R's <b>spectrum()</b> function applies to the series before the FFT is computed. In theory, the Fourier Transform assumes an infinite-length sequence, while in practice the FFT assumes the series is circular -- in essence, gluing the beginning and end of the series together to make a loop. If the series is not perfectly periodic, though, this gluing introduces a sharp discontinuity. Tapering is typically applied to reduce or control this discontinuity, so that both ends gently decline to zero. When the series is genuinely periodic, though, this has some <b>weird</b> effects. In essence, the taper <a href="http://en.wikipedia.org/wiki/Transfer_function">transfers power</a> from the fundamental frequencies in a curious way. Note that the fundamental period is 4, as seen by the peak at frequency 1/4, with a strong harmonic at period 2, frequency 1/2.<br />
<br />
Zero-padding has a similar effect. If your series is zero (or relatively small) at both ends, then you could glue it together without introducing discontinuities, but peaks near the beginning and the end will still be seen as nearby, resulting in spurious peaks. In any case, <b>R</b> pads your series by default (fast=T) unless the length of the series is highly composite (i.e. it can be factored into many small divisors). Below, it turns out that a series of length <b>1e5</b> is composite enough that <b>R</b> doesn't pad it. <b>1e5+1</b>? Totally different results.<br />
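A quick way to check whether padding will kick in for a given length (an aside of mine, assuming I've read <b>spec.pgram()</b> correctly -- with fast=T it pads up to <b>nextn(length(x))</b>):<br />
<pre>## nextn() returns the smallest "highly composite" length (factors 2, 3, 5) >= n
nextn(1e5)    ## 1e5 = 2^5 * 5^5 is already highly composite, so no padding is added
nextn(1e5+1)  ## not 2-3-5 composite, so fast=T pads up to the next suitable length
</pre><br />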
<br />
If I understand this correctly, both padding and tapering are attempts to reduce or control <a href="http://en.wikipedia.org/wiki/Spectral_leakage">spectral leakage</a>. We've constructed an example where no spectral leakage is expected from the original time series, which closely fits the assumptions of the FFT without modification.<br />
<br />
The moral? Know thy functions (and their default parameter values, and their effects)!<br />
<br />
<pre>## using R --
## logistic map function:
logmap = function(n, r, x0=0.5) {
ret = rep(x0,n)
for (ii in 2:n) {
ret[ii] = r*ret[ii-1]*(1-ret[ii-1])
}
return(ret)
}
## compare:
spectrum(logmap(1e5, 3.5), main='R tapers to 0.1 by default')
spectrum(logmap(1e5, 3.5), taper=0, main='No taper')
spectrum(logmap(1e5+1, 3.5), taper=0, main='Non-composite length, fast=T pads by default')
spectrum(logmap(1e5, 3.5), taper=0.5, sub='Lots of tapering approximately recovers "correct" results')
</pre>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0tag:blogger.com,1999:blog-4695461192314112143.post-28605320053878761732011-06-18T20:49:00.014-07:002012-02-29T02:53:05.517-07:00Efficient loops in R -- the complexity versus speed trade-offI've <a href="http://helmingstay.blogspot.com/2011/05/problems-with-plyr-memorycomplexity.html">written before</a> about the up- and downsides of the <a href="http://cran.r-project.org/web/packages/plyr/index.html">plyr</a> package -- I love its simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building an agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections. <br />
<br />
My ballpark estimate was 1e6 agents using a weekly timestep for about 50 years. This is a stochastic model, so I'd like to replicate it at least 100 times. So, I'll need at least 50*50*100 = 250,000 steps. I figured I'd be happy if I could get my step runtimes down to ~1 second -- dividing the 100 runs over 4 cores, this would give a total runtime of ~17.5 hours. Not short, but not a problem.<br />
<br />
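For the record, the back-of-the-envelope arithmetic (my own aside, assuming ~1 second per step):<br />
<pre>## 50 years * ~50 weeks * 100 replicates, at ~1 second per step
steps <- 50 * 50 * 100
steps             ## 250,000 steps
steps / 3600 / 4  ## just over 17 hours with the 100 runs spread over 4 cores
</pre><br />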
At first, I was disheartened to see the runtime of my prototype step function stretching into the minutes. What's going on? Well, I'd used <b>plyr</b> in a <b>very</b> inappropriate way -- for a very large loop. I began to investigate, and discovered that writing an aesthetically unappealing <b>while</b> loop gave me a 30+x speed-up. <br />
<br />
All of which got me thinking -- how expensive are loops and function calls in R? Next week I'm leading a tutorial in R and C++ using the wonderful <a href="http://cran.r-project.org/web/packages/Rcpp/index.html">Rcpp</a> and <a href="http://cran.r-project.org/web/packages/inline/index.html">inline</a> packages here at <a href="http://tuvalu.santafe.edu/events/workshops/index.php/Complex_Systems_Summer_School_2011">Santa Fe Institute's 2011 Complex Systems Summer School</a>. Might this make a nice example?<br />
<br />
It does, and in spades. Below are the test functions, and you can see that the code complexity increases somewhat from one to the other, but outcomes are identical, again with 30+x speedup for each subsequent case. Here, I'm using the native <a href="http://cran.r-project.org/doc/manuals/R-exts.html#The-R-API">R API</a>. I also tested using Rcpp to import <b>rbinom()</b>, but that ended up taking twice as long as the naive <b>while</b> loop.<br />
<br />
So, the moral of the story seems to be that if you can write a long loop in pure C++, it's a really easy win. <br />
<br />
<b>Note</b> -- The <b>as< double >(y);</b> in <b>src</b> below doesn't seem to copy-and-paste correctly for some reason. If <b>testfun2</b> doesn't compile, check to make sure this bit pasted correctly. <br />
<br />
<br />
<h3>The pure R function definitions</h3><pre>## use aaply -- the simplest code
require(plyr)
testfun0 <- function(x, y) aaply(x, 1, function(x) rbinom(1, x, y))
## rewrite the above as an explicit loop
testfun1 = function(nrep, x, y) {
## loop index
ii<-0;
## result vector
ret<-rep(0, nrep);
while(ii < nrep) {
ii<-ii+1;
## how many successes for each element of bb?
ret[ii] <- rbinom(1, x[ii], y)
}
return(ret)
}
</pre>
<h3>Rcpp function definitions (with lots of comments)</h3><pre>## define source code as string, pass to inline
src <- '
// Cast input parameters to correct types
// clone prevents direct modification of the input object
IntegerVector tmp(clone(x));
double rate = as< double >(y);
// IntegerVector inherits from STL vector, so we get standard STL methods
int tmpsize = tmp.size();
// initialize RNG from R set.seed() call
RNGScope scope;
// loop
for (int ii =0; ii < tmpsize; ii++) {
// Rf_rbinom is in the R api
// For more details, See Rcpp-quickref.pdf, "Random Functions"
// also, Rcpp provides "safe" accessors via, e.g., tmp(ii)
tmp(ii) = Rf_rbinom(tmp(ii), rate);
};
// tmp points to a native R object, so we can return it as-is
// if we wanted to return, e.g., the double rate, use:
// return wrap(rate)
return tmp;
'
require(inline)
## compile the function, inspect the process with verbose=T
testfun2 = cxxfunction(signature(x='integer', y='numeric'), src, plugin='Rcpp', verbose=T)
## timing function
ps <- function(x) print(system.time(x))
</pre>
<h3>The tests</h3><pre>## Input vector
bb <- rbinom(1e5, 20, 0.5)
## test each case for 2 different loop lengths
for( nrep in c(1e4, 1e5)){
## initialize RNG each time with the same seed
## plyr
set.seed(20); ps(cc0<- testfun0(bb[1:nrep], 0.1))
## explicit while loop
set.seed(20); ps(cc1<- testfun1(nrep, bb, 0.1))
## Rcpp
set.seed(20); ps(cc2<- testfun2(bb[1:nrep], 0.1))
print(all.equal(cc1, cc2))
}
## output
user system elapsed
3.392 0.020 3.417
user system elapsed
0.116 0.000 0.119
user system elapsed
0.000 0.000 0.002
[1] TRUE
user system elapsed
37.534 0.064 37.601
user system elapsed
1.228 0.000 1.230
user system elapsed
0.020 0.000 0.021
[1] TRUE
</pre>
<h3>Postlude</h3>After posting this to the very friendly <a href="http://lists.r-forge.r-project.org/pipermail/rcpp-devel/">Rcpp-devel</a> mailing list, I got an in-depth reply from Douglas Bates pointing out that, in this case, the performance of vanilla <b>apply()</b> beats the <b>while</b> loop by a narrow margin. He also gives an interesting example of how to use an STL template and an iterator to achieve the same result. I admit that templates are still near-magic to me, and for now I prefer the clarity of the above. Still, if you're curious, this should whet your appetite for some of the wonders of Rcpp.
<pre>## Using Rcpp, inline and STL algorithms and containers
## use the std::binary_function templated struct
inc <- '
struct Rb : std::binary_function< double, double, double > {
double operator() (double size, double prob) const { return ::Rf_rbinom(size, prob); }
};
'
## define source code as a character vector, pass to inline
src <- '
NumericVector sz(x);
RNGScope scope;
return NumericVector::import_transform(sz.begin(), sz.end(), std::bind2nd(Rb(), as< double >(y)));
'
## compile the function, inspect the process with verbose=TRUE
f5 <- cxxfunction(signature(x='numeric', y='numeric'), src, 'Rcpp', inc, verbose=TRUE)
</pre>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com10tag:blogger.com,1999:blog-4695461192314112143.post-40124924195054899912011-06-11T19:29:00.001-07:002012-02-29T02:54:25.108-07:00The importance of being unoriginal (and befriending google)<h3>In search of bin counts</h3>I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I've never had any reason to go under the hood until now. Lately, I've been looking at using the bin counts for things like Shannon entropy (in the very nice <a href="http://cran.r-project.org/web/packages/entropy">entropy</a> package). I figured that binning and counting data would either be supported via a native, dedicated R package, or quite simple to code. Not finding the former (<code>base graphics hist()</code> uses <code>.Call("bincounts")</code>, which appears undocumented and has a boatload of arguments), I naively failed to search for a package and coded up the following.<br />
<br />
<pre>myhist = function(x, dig=3) {
x=trunc(x, digits=dig);
## x=round(x, digits=dig);
aa = bb = seq(0,1,1/10^dig);
for (ii in 1:length(aa)) {
aa[ii] = sum(x==aa[ii])
};
return(cbind(bin=bb, dens=aa/length(x)))
}
## random variates
test = sort(runif(1e4))
get1 = myhist(test)
</pre><br />
<h3>Trouble in paradise</h3>The plan: truncate the data to a specified precision, then count how many values fall in each bin. Well, first I tried <code>round(x)</code> instead of <code>trunc(x)</code>, which sorta makes sense but gives results that I still don't understand. On the other hand, <code>trunc(x)</code> doesn't take a digits argument? WTF? Of course, I could use <code>sprintf()</code> to make a character string of known precision and convert back to numeric, but string-handling is waaaaaay too much computational overhead. Like towing a kid's red wagon with a Land Rover...<br />
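For what it's worth, there is an arithmetic way to truncate to a fixed number of digits without a round-trip through strings; a minimal sketch, using a scale-then-floor trick (for non-negative values):<br />
<br />
<pre>## truncate to 'dig' decimal places without string handling
dig = 3
x = runif(5)
x.trunc = floor(x * 10^dig) / 10^dig
## compare with round(), which can push a value into the bin above
x.round = round(x, digits=dig)
</pre>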
<br />
<h3>Dear Google...</h3>An hour of irritation and confusion later, I <a href="http://www.google.com/search?q=cran+bin+counts">ask google</a> and, small wonder, the second search result links to the <a href="http://cran.r-project.org/web/packages/ash">ash</a> package that contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn't return the bin boundaries by default, but it's good enough for a quick-and-dirty empirical probability mass distribution.<br />
<br />
To be fair, there's something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn't quite as simple as one first thought. On the other hand, sometimes we just want answers. When that's the case, asking google is a pretty good bet. <br />
<br />
<pre>## their method
require(ash)
get2 = bin1(test, c(0,1), 1e3+1)$nc
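## bin1() doesn't return the bin boundaries, but (assuming I read ?bin1 right)
## they're implied by the range and bin count used above:
edges = seq(0, 1, length.out = (1e3+1) + 1)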
</pre>xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com3tag:blogger.com,1999:blog-4695461192314112143.post-24581173968789085952011-05-10T03:42:00.001-07:002012-02-29T02:51:31.713-07:00Problems with plyr -- the memory/complexity trade-off<h3>Two types of R users</h3><br />
My overwhelming impression from UseR 2010 is that, generally speaking, there are two types of regular R users -- those who have heard of the <b>*apply()</b> functions and are made uncomfortable by the idea, and those who really <em>get</em> it. In the UNM R programming group that I've been leading for about a year now, I've really tried to get people over the hump and into the second group. Once there, folks seem to really appreciate the amazing power of vectorization in R, and begin to enjoy writing code. The conceptual clarity of:<br />
<br />
<pre>mymatrix = matrix(1:10, nrow=2)
apply(mymatrix, 1, sum)
apply(mymatrix, 2, sum)
</pre><br />
over the clunky:<br />
<br />
<pre>rowSums(mymatrix)
colSums(mymatrix)
</pre><br />
may not be immediately apparent. Eventually, though, folks go searching for things like <b>rowMedian()</b> and become frustrated that R doesn't "have all the functions" that they need. Well, R does, once you grok <b>apply()</b>.<br />
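For example, the "missing" <b>rowMedian()</b> is a one-liner, and the same idea covers any summary you care to name:<br />
<br />
<pre>## row-wise medians, no special-purpose function needed
apply(mymatrix, 1, median)
## ...or any other summary
apply(mymatrix, 1, quantile, probs=0.9)
</pre>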
<br />
<h3>Hadley's Magic Brainchild</h3>In the last year, I've had some serious <b>aha!</b> moments with <a href="http://had.co.nz/plyr/">plyr</a>. Lately, I've added the <a href="http://had.co.nz/reshape/">reshape</a> package to the mix to achieve some serious R one-liner Zen. Multidimensional arrays to long-form data frames?<br />
<br />
<pre>myarray = array(0, dim=c(3,4,5,6,10), dimnames=list(a=1:3, b=1:4, c=1:5, d=letters[1:6], e=LETTERS[1:10]))
melt( adply(myarray, 1:4, I))
</pre><br />
Sure. No problem. It's really that easy?!<br />
<p><br />
An unexpected bonus? This way of thinking lends itself nicely to explicit parallelism. If you can describe the problem as a single recipe that's applied to each "piece" of a whole, then you're one step away from using all the cores on your machine to solve it with <a href="http://cran.r-project.org/web/packages/foreach/index.html">foreach</a>. Need to apply some long-running analysis to each element of a list, and want the results back as a list? Why write a <b>for</b> loop when you can do:<br />
<br />
<pre>## prep
## warning -- this may not be fast
## use smaller matrices for smaller machines!
require(foreach)
require(iterators)
nn = 2^20
rr = 2^10
mylist = list(a=matrix(rnorm(nn), nrow=rr), b=matrix(rnorm(nn), nrow=rr), c=matrix(rnorm(nn), nrow=rr))
## analysis
## wrap each result in a list so that .combine=c returns a list of eigen() results
myeigs = foreach( ii = iter(mylist), .combine=c) %do% { print('ping'); list(eigen(ii)) }
</pre><br />
and then, once it works, change <b>%do%</b> to <b>%dopar%</b>, add the following, and you're off to the races!<br />
<br />
<pre>require(multicore)
require(doMC)
## use the appropriate # of cores, of course
registerDoMC(cores=4)
myeigs1 = foreach( ii = iter(mylist), .combine=c) %dopar% { print('ping'); list(eigen(ii)) }
</pre><br />
Compared to something like <b>llply</b>, your <b>dimnames</b> don't automatically propagate to the results, but I think this is still pretty amazing. Complete control <em>and</em> simple parallelism.<br />
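If you want the names back, one simple workaround (assuming, as here, that <b>.combine=c</b> returns the results in input order) is to copy them over by hand:<br />
<br />
<pre>## re-attach the list names that foreach/.combine=c drops
names(myeigs) <- names(mylist)
</pre>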
<br />
Debugging with <b>%dopar%</b> is tricky, of course, because each item is evaluated in its own worker process (I think), and messages (such as calls to <b>print()</b>) don't return to the console as you might expect them to. So, when in doubt, back off to <b>%do%</b>.<br />
<br />
<h3>What could possibly go wrong?</h3>The only problem with all of this is that, when tasks are embarrassingly parallel, the data also becomes embarrassingly parallel, to the point where it no longer fits into memory. Thus, I returned today to a long-running bootstrap computation to find R consuming ~7 GB of RAM, 5+ GB of swap, and this message:<br />
<br />
<pre>Error: cannot allocate vector of size 818.6 Mb
Enter a frame number, or 0 to exit
...
4: mk.bootrep(zz, 75)
5: mk.boot.R#165: aaply(tmpreps, .margins = 1, function(x) {
6: laply(.data = pieces, .fun = .fun, ..., .progress = .progress, .drop = .dro
7: list_to_array(res, attr(.data, "split_labels"), .drop)
8: unname(unlist(res))
</pre><br />
What's happening is that <b>plyr</b> is trying to do everything at once. As anyone who's used <b>grep</b> can tell you, processing one row at a time, or <em>streaming</em> the data, is often a much better idea. I got the intended results by pre-allocating an array and writing each item of my results list into it in a loop; that version finished in seconds and barely broke 3 GB of RAM.<br />
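For the curious, the fix looked roughly like this; the dimensions and the per-replicate computation here are stand-ins, not my actual bootstrap code:<br />
<br />
<pre>## pre-allocate the full result array, then fill it one replicate at a time
nrep = 75; nstat = 1e3
out = array(NA_real_, dim=c(nrep, nstat))
for (ii in seq_len(nrep)) {
    ## stand-in for the real per-replicate work
    out[ii, ] = rnorm(nstat)
}
</pre>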
<br />
Now, nothing here is really news. The dangers of "Growing Objects" are covered in Circle 2 of Burns Statistics' wonderful <a href="http://www.burns-stat.com/pages/Tutor/R_inferno.pdf">R Inferno</a>. Still, <b>plyr</b> strikes me as an interesting case where reducing conceptual complexity can lead to a rather steep increase in computational cost, memory use in particular. And the most interesting thing of all is that it happens quite suddenly above a certain threshold.<br />
<br />
<h3>Parting thoughts</h3>I wonder if there's any major barrier to a <b>stream=TRUE</b> argument for the <b>plyr</b> functions -- I haven't thought about it too much, but I imagine you'd also need a finalizer function to prepare the return object to be written/streamed into. At what point is it easier to just do it by hand with a <b>for</b> loop?<br />
<br />
Honestly, I don't know the answer. I don't do too many things that break <b>plyr</b>, but I've learned how important it is to understand when I'm likely to exceed its limits. <b>.variables</b> in <b>ddply</b> is another one that I've learned to be careful with. If, after subdividing your input data.frame, you end up with 1e5 or 1e6 pieces, things start to break down pretty fast.<br />
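A cheap sanity check before committing to a big <b>ddply()</b> call is to count how many pieces <b>.variables</b> will produce; here <code>df</code>, <code>id</code>, and <code>rep</code> are hypothetical names:<br />
<br />
<pre>## how many pieces will this split create?
length(unique(df$id))
## or, for a split on several grouping variables
nrow(unique(df[, c("id", "rep")]))
</pre>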
<br />
The truth is, I love writing solutions with the <b>*apply()</b> family and the <b>ddply</b> functions. I think it makes code cleaner and more logical. Watching the light of this dawn in other R users' eyes is truly exciting. Yet it pays to remember that, like all things, the approach has its limits. In this case, one seems to reach those limits suddenly and harshly.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com1tag:blogger.com,1999:blog-4695461192314112143.post-24233675029194951372011-03-29T04:16:00.002-07:002012-02-29T02:55:31.345-07:00Why PKI matters.<a href="https://www.eff.org/deeplinks/2011/03/iranian-hackers-obtain-fraudulent-https">This article</a> landed in my inbox this week in a newsletter from the EFF. I usually don't read them, but the term "meltdown" caught my eye, what with all the nuke news this month. They also managed to work in "too big to fail", and neither reference was hyperbolic. The internet depends on a level of trust, and (surprise) there are people working to co-opt that trust.<br />
<br />
A number of my friends think I'm a little weird for using <em>real</em> passwords, for not sharing them, etc. But I read Cryptonomicon and friends; I have at least one box exposed to the real, honest-to-god, jungle-out-there internet; I have a healthy fear of all the things that could go wrong. As part of my <a href="http://math.unm.edu/~hwearing/researchgroup/index.html">lab group's</a> data entry project, I recently registered my first SSL certificate from <a href="http://www.startssl.com/">http://www.startssl.com/</a> (free!), and learned quite a bit about PKI in the process.<br />
<br />
Still, reading an article like this really drives home both the complexity and the importance of the global PKI system. Trust is difficult when it's strung across the globe on fiber-optic cables and enforced by our inability to quickly factor very large numbers. But old-school techniques of impersonation and breaking and entering will always be with us. I may trust google.com, but how do I know that it's actually them? PKI.<br />
<br />
How cool is that?<br />
Very.xianhttp://www.blogger.com/profile/01991290472959345882noreply@blogger.com0