
03 July 2014

Efficient Ragged Arrays in R and Rcpp

When is R Slow, and Why?

Computational speed is a common complaint lodged against R. Some recent posts on r-bloggers.com have compared the speed of R with some other programming languages [1], and showed the favorable impact of the new compiler package on run-times [2]. I and others have written about using Rcpp to easily write C++ functions that speed up bottlenecks in R [3,4]. With the new Rcpp attributes framework, writing fully vectorized C++ functions and incorporating them in R code is now very easy [5].

On a day-to-day basis, though, R's performance is largely a function of coding style. R allows novice users to write horribly inefficient code [6] that produces the correct answer (eventually). Yet by failing to use vectorization and pre-allocation of data structures, naive R code can be many orders of magnitude slower than it needs to be. R-help is littered with the tears of novices, and there's even a (fantastic) parody of Dante's Inferno outlining the common "Deadly Sins of R" [7].
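
To make the sin concrete, here's a toy comparison of my own (not taken from the linked posts): growing a vector with c() reallocates and copies the whole vector at every step, while pre-allocation and vectorization avoid that churn:

```r
## growing with c(): the vector is reallocated at every iteration
grow <- function(n) {
    x <- numeric(0)
    for (ii in 1:n) x <- c(x, ii^2)
    x
}

## pre-allocating: one allocation up front, then in-place assignment
prealloc <- function(n) {
    x <- numeric(n)
    for (ii in 1:n) x[ii] <- ii^2
    x
}

## fully vectorized: no explicit loop at all
vectorized <- function(n) (1:n)^2

## all three return identical results; only the allocations differ
identical(grow(1e3), vectorized(1e3))   ## TRUE
```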

Problem Statement: Appending to Ragged Arrays

I recently stumbled onto an interesting code optimization problem that I *didn't* have a quick solution for, and that I'm sure others have encountered. What is the "R way" to vectorize computations on a ragged array? One example of a ragged array is a list of vectors of varying lengths. Say you need to dynamically grow many vectors by varying lengths over the course of a stochastic simulation. Using a simple tool like lapply, the entire data structure will be allocated anew with every assignment. This problem is briefly touched on in the official Introduction to R documentation, which simply notes that "when the subclass sizes [e.g. vector sizes] are all the same the indexing may be done implicitly and much more efficiently". But what if your data *isn't* rectangular? How might one intelligently vectorize a ragged array to prevent (sloooow) memory allocations at every step?

The obvious answer is to pre-allocate a rectangular matrix (or array) that is larger than the maximum possible vector length, and store each vector as a row (or column?) in the matrix. Now we can use matrix assignment, and for each vector track the index of the start of free space. If we try to write past the end of the matrix, R emits the appropriate error. This method requires some book-keeping on our part. One nice addition would be an S4 class with slots for the data matrix and the vector of free-space indices, as well as methods to dynamically expand the data matrix and validate the object. As an aside, this solution is essentially the inverse of a sparse matrix. Sparse matrices use much less memory at the expense of slower access times [8]. Here, we're using more memory than is strictly needed to achieve much faster access times.
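
The book-keeping can be sketched in a few lines of plain R; this is an illustrative toy (function and slot names are mine), not the exact structure benchmarked below:

```r
## pre-allocated ragged array: a data matrix plus a vector of fill counts
ragged.new <- function(nvec, maxlen) {
    list(dat = matrix(NA_real_, nrow = nvec, ncol = maxlen),
         len = rep(0L, nvec))
}

## "append" vector x to row irow by assigning into the free columns
ragged.append <- function(ra, irow, x) {
    new.len <- ra$len[irow] + length(x)
    if (new.len > ncol(ra$dat)) stop("exceeded pre-allocated columns")
    ra$dat[irow, (ra$len[irow] + 1):new.len] <- x
    ra$len[irow] <- new.len
    ra
}

ra <- ragged.new(nvec = 3, maxlen = 5)
ra <- ragged.append(ra, 1, c(1.5, 2.5))
ra <- ragged.append(ra, 1, 3.5)
ra$len   ## 3 0 0
```

An S4 class would wrap exactly this pair of slots, adding validity checks and a method to grow the matrix when a row fills up.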

Is pre-allocation and book-keeping worth the trouble? object.size(matrix(1.5, nrow=1e3, ncol=1e3)) shows that a data structure of 1,000 vectors, each of length approximately 1,000, occupies about 8 MB of memory. Say I resize this structure 1,000 times: that's roughly 8 GB of cumulative memory allocations. Perhaps you're getting a sense of what a terrible idea it is to *not* pre-allocate a frequently-resized ragged list?
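
The back-of-the-envelope arithmetic is easy to verify (8 bytes per double, ignoring R's small object header):

```r
nvec   <- 1e3   ## number of vectors (matrix rows)
maxlen <- 1e3   ## maximum vector length (matrix columns)
bytes  <- nvec * maxlen * 8
bytes / 2^20    ## about 7.6 MiB, i.e. roughly 8 MB
## multiply by the number of times the structure is reallocated
## to see how quickly naive resizing adds up
```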

Three Solutions and Some Tests

Using the above logic, I prototyped a solution as an R function, and then transcribed the result into a C++ function (boundary checks are important in C++). The result is three methods: a "naive" list-append method, an R method that uses matrix assignment, and a final C++ method that modifies the pre-allocated matrix in-place. In C++/Rcpp, functions can use pass-by-reference semantics [9], which can have major speed advantages by allowing functions to modify their arguments in-place. Full disclosure: pass-by-reference semantics requires some caution on the user's part. Pass-by-reference is very different from R's "functional programming" semantics (pass-by-value, copy-on-modify), where side-effects are minimized and an explicit assignment call is required to modify an object [10].
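
The contrast is easy to demonstrate with a toy R function of my own: under copy-on-modify, a function that alters its argument works on a copy, and the caller sees no change unless the result is assigned back. (The C++ method below, by contrast, fills its matrix argument in place, with no assignment in the caller.)

```r
## R is effectively pass-by-value: modifying an argument does not
## affect the caller's object
clobber <- function(x) {
    x[1] <- -99
    x
}

v <- c(1, 2, 3)
ignored <- clobber(v)
v[1]             ## still 1: the caller's copy is untouched
v <- clobber(v)  ## an explicit assignment is required to "modify" v
v[1]             ## now -99
```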

I added a unit test to ensure identical results between all three methods, and then used the fantastic rbenchmark package to time each solution. As expected, the naive method is laughably slow. By comparison, and perhaps counter-intuitively, the R and C++ pre-allocation methods are close in performance. Only with more iterations and larger data structures does the C++ method really start to pull ahead. And by that time, the naive R method takes *forever*.

Refactoring existing code to use the pre-allocated compound data structure (matrix plus indices) is a more challenging exercise that's "left to the reader", as mathematics textbooks oft say. lapply() is conceptually simple, and is often fast *enough*. Some work is required to transcribe code from this simpler style to use the "anti-sparse" matrix (and indices). There's a temptation to prototype a solution using lapply() and then "fix" things later. But if you're using ragged arrays and doing any heavy lifting (large data structures, many iterations), the timings show that pre-allocation is more than worth the effort.

Code

Note: you can find the full code here and here.

Setup: two helper functions generate ragged arrays via random draws. Draws from the negative binomial distribution determine the length of each new vector (with a minimum length of 1, gen.lengths()), and draws from the exponential distribution fill each vector with data (gen.dat()).
## helper functions
gen.lengths <- function(particles, ntrials=3, prob=0.5) {
    ## a vector of random draws
    pmax(1, rnbinom(particles, ntrials, prob))
}
gen.dat <- function(nelem, rate=1) {
    ## a list of vectors, vector i has length nelem[i]
    ## each vector is filled with random draws
    lapply(nelem, rexp, rate=rate)
}

The first two solutions: a naive list-append method (using mapply() to append element-wise), followed by pre-allocation in R.
## naive method
appendL <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow dat by appending to list element i
    ## memory will be reallocated at each call
    dat <- mapply( append, dat, new.dat )
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}

## dynamically append to preallocated matrix
## maintain a vector of the number of "filled" elements in each row
## emit error if overfilled
## R solution
appendR <- function(new.dat, new.lengths, dat, dat.lengths) {
    ## grow pre-allocated dat by inserting data in the correct place
    for (irow in 1:length(new.dat)) {
        ## insert one vector at a time
        ## col indices for where to insert new.dat
        cols.ii <- (dat.lengths[irow]+1):(dat.lengths[irow]+new.lengths[irow])
        dat[irow, cols.ii] = new.dat[[irow]]
    }
    ## update lengths
    dat.lengths <- dat.lengths + new.lengths
    return(list(dat=dat, len=dat.lengths))
}


Next, the solution as a C++ function. This goes in a separate file that I'll call helpers.cpp (compiled below).
#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
void appendRcpp(  List fillVecs, NumericVector newLengths, NumericMatrix retmat, NumericVector retmatLengths) {
    // "append" each vector in fillVecs to the corresponding row of retmat,
    // filling in from the first free column,
    // then update retmatLengths to index the next free element of each row.
    // newLengths isn't used; it's retained for compatibility with the R methods
    NumericVector fillTmp;
    int sizeOld, sizeAdd, sizeNew;
    // pull out dimensions of matrix to fill
    int nrow = retmat.nrow();
    int ncol = retmat.ncol();
    // check that dimensions match
    if ( nrow != retmatLengths.size() || nrow != fillVecs.size()) { 
        throw std::range_error("In appendC(): dimension mismatch");
    }
    for (int ii = 0; ii < nrow; ii++) {
        // vector to append, and old/new fill counts for this row
        fillTmp = fillVecs[ii];
        sizeOld = retmatLengths[ii];
        sizeAdd = fillTmp.size();
        sizeNew = sizeOld + sizeAdd;
        // check that this row has room for the new elements
        if (sizeNew >= ncol) {
            throw std::range_error("In appendC(): exceeded max cols");
        }
        // iterator for row to fill
        NumericMatrix::Row retRow = retmat(ii, _);
        // fill row of return matrix, starting at first non-zero elem
        std::copy( fillTmp.begin(), fillTmp.end(), retRow.begin() + sizeOld);
        // update size of retmat
        retmatLengths[ii] = sizeNew;
    }
}


Putting the pieces together: a unit test ensures the results of all three methods are identical, and a function that runs each solution with identical data will be used for timing.
## unit test
test.correct.append <- function(nrep, particles=1e3, max.cols=1e3, do.c=F) {
    ## list of empty vectors, fill with append
    dat.list <- lapply(1:particles, function(x) numeric())
    ## preallocated matrix, fill rows from left to right
    dat.r <- dat.c <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## length of each element/row
    N.list <- N.r <- N.c <- rep(0, particles)
    ## repeat process, "appending" as we go
    for (ii in 1:nrep) {
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## in R, list of vectors
        tmp <- appendL(dat.new, N.new, dat.list, N.list)
        ## unpack, update
        dat.list <- tmp$dat
        N.list <- tmp$len
        ## in R, preallocate
        tmp <- appendR(dat.new, N.new, dat.r, N.r)
        ## unpack, update
        dat.r <- tmp$dat
        N.r <- tmp$len
        ## as above for C, modify dat.c and N.c in place
        appendRcpp(dat.new, N.new, dat.c, N.c)
    }
    ## pull pre-allocated data back into list
    dat.r.list <- apply(dat.r, 1, function(x) { x <- na.omit(x); attributes(x) <- NULL; x } )
    ## check that all three methods agree
    identical(dat.r, dat.c) && identical(N.r, N.c) &&
    identical(dat.list, dat.r.list) && identical(N.list, N.r)
}

## timing function, test each method
test.time.append <- function(nrep, particles=1e2, max.cols=1e3, append.fun, do.update=T, seed=2) {
    ## object to modify
    N.test <- rep(0, particles)
    dat.test <- matrix(numeric(), nrow=particles, ncol=max.cols)
    ## speed is affected by size, 
    ## so ensure that each run uses the same elements
    set.seed(seed)
    for (irep in 1:nrep) {
        ## generate draws
        N.new <- gen.lengths(particles)
        dat.new <- gen.dat(N.new)
        ## bind in using given method
        tmp <- append.fun(dat.new, N.new, dat.test, N.test)
        if(do.update) {
            ## skip update for C
            dat.test <- tmp$dat
            N.test <- tmp$len
        }
    }
}


Finally, we run it all:
library(rbenchmark)
## Obviously, Rcpp requires a C++ compiler
library(Rcpp)
## compilation, linking, and loading of the C++ function into R is done behind the scenes
sourceCpp("helpers.cpp")

## run unit test, verify all three methods return identical results
is.identical <- test.correct.append(1e1, max.cols=1e3)
print(is.identical)

## test timings of each solution
test.nreps <- 10
test.ncols <- 1e3
timings = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    list=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendL),
    replications=10
)

## Just compare the two faster methods with larger data structures.
test.nreps <- 1e2
test.ncols <- 1e4
timings.fast = benchmark(
    r=test.time.append(test.nreps, max.cols=test.ncols, append.fun=appendR),
    c=test.time.append(test.nreps, max.cols=test.ncols, do.update=F, append.fun=appendRcpp),
    replications=1e1
)


Benchmark results show that the list-append method is roughly 400 times slower than the improved R method, and over 900 times slower than the C++ method (timings). As we move to larger data structures (timings.fast), the advantage of modifying in-place with C++, rather than having to explicitly assign the results, quickly adds up.
> timings
  test replications elapsed relative user.self sys.self
2    c           10   0.057    1.000     0.056    0.000
3 list           10  52.792  926.175    52.674    0.036
1    r           10   0.128    2.246     0.123    0.003

> timings.fast
  test replications elapsed relative user.self sys.self
2    c           10   0.684    1.000     0.683    0.000
1    r           10  24.962   36.494    24.934    0.027

References


[1] How slow is R really?
[2] Speeding up R computations Pt II: compiling
[3] Efficient loops in R - the complexity versus speed trade-off
[4] Faster (recursive) function calls: Another quick Rcpp case study
[5] Rcpp attributes: A simple example 'making pi'
[6] StackOverflow: Speed up the loop operation in R
[7] The R Inferno
[8] Sparse matrix formats: pros and cons
[9] Advanced R: OO field guide: Reference Classes
[10] Advanced R: Functions: Return Values

30 May 2014

Tools for Online Teaching

Last semester (Fall 2013), I organized and taught an interdisciplinary, collaborative class titled Probability for Scientists. Getting 4 separate teachers on the same page was a challenge, but as scientists we're used to communicating over email, and CC'ing everyone worked well enough. Throw in a shared Dropbox folder, a shared Google calendar, and a weekly planning meeting, and instructor collaboration went pretty smoothly.

It was more challenging to organize the class so that we could easily provide students with up-to-date course information and supplemental material. We ended up using blogger, which has some key benefits and disadvantages. It was *really* easy to set up and add permissions for multiple individuals. This allowed any of us to put up images (for example, photos of the whiteboard at the end of class) and post instructions. One downside we heard from students was an apparent lack of organization. I attempted to organize the blog with intelligent post labels, along with the "Labels widget" (which shows all possible labels and the number of posts per label on the right hand sidebar). I also included the "Posts by Date" sidebar, so all the content could be accessed chronologically. I understand their comments, though I'm not convinced that a single, monolithic page of information is the right direction either.

One feature of blogger that proved helpful was its "Pages", static html files or links that appear (by default) at the top of the blog. These links are always present no matter where in the blog you navigate to. We added the syllabus and course calendar here. And here's where things start to get interesting.

All four instructors needed the ability to collaboratively edit and then publish public-facing material for students (like the syllabus), as well as to edit private course material, like grades. Dropbox makes a natural choice for private material, whereas a collaborative source repository like GitHub makes a natural choice for public material. Personally, I'm more familiar/comfortable with Google Code, but I don't think the choice of service here is material. [Sidenote: if you have a pro GitHub account, the choice seems easy: use a private GitHub repo.]

Using a public git repo gave us a few things that I really liked. First off, our class schedule was a simple HTML table in the repo that we pointed at with a static link via the Pages widget, above. When I needed to update an assignment due date, I did a quick edit-and-commit on my laptop, pushed it to the repository, and the change magically appeared on the blog. If several instructors are simultaneously working on course material, this essentially provides an audit trail of who did what. This is kindergarten-level stuff for software developers. But these are tools that teachers could benefit from and aren't very familiar with. For example, getting colleagues to set up git represents a non-trivial challenge. Nonetheless, there are other benefits of learning git if you're a scientist (that I won't address here, though see here for some thoughts).

In our class, I also used R for data visualizations. I let the awesome knitr package build the appropriate .png files from my R code (along with pdfs for download, if needed). Again, it sounds simple, but adding the generated figures to the git archive allowed me to quickly link to them in blog posts, and then update them later if needed.

I would have preferred having the blog posts themselves under revision control (only "Pages" can point to an external source html without javascript, which I didn't have time for). But, for the simplicity of setup and the low (i.e. free) cost of use, I didn't find posts to be much of an issue. Blogger allows composition in pure html, without all the *junk*, which helps. But having to re-upload and link a figure every time I find an error? Definitely a pain.

For this class, we also set up an e4ward.com address that pointed to my email box. This allowed me to publicly post the class address without fear of being spammed forevermore, and allowed me to easily identify all class email. Having a single instructor responsible for email correspondence worked well enough. As an early assignment, students were asked to send a question to the class email address. This is a nice way to get to know folks, and incentivize them to go to "digital office hours", e.g. ask good questions over email. We did have some issues with e4ward.com towards the end of the class. This *sucks* - students panic when emails get lost, and it's hard to sort out where the problem is. Honestly, I don't know the answer here.

As a sidenote, I pushed the use of Wikipedia (WP) heavily in this class, and referred to it often myself. This is not possible in all fields, but WP articles in many of the hard sciences are now both technically detailed and accessible. Probability and statistics articles are some of the best examples here, since the "introductory concept" articles are used by a large number of individuals/fields, and aren't the subjects of debate (compare this with the WP pages of, for example, Tibet or Obama). I also discovered WP's "Outlines" pages for the first time. If you ever use statistics, I strongly recommend spending some time with WP's Outline of Statistics. It's epic.

One final tool we used quite a bit was LimeSurvey. As I was planning the class, I went looking for an inexpensive yet flexible survey platform. I was generally disappointed by the offerings. My requirements included raw data export and more than 10 participants; these features tend *not* to be free, and survey tools can get pricey. Enter LimeSurvey (henceforth LS). It's open source, well-documented, versatile, and simple to use. I was reluctant to invest *too* much effort in tools I wasn't sure we'd use, but I got LS running in less than 2 hours. To be fair, our lab has an on-campus, always-on server, and apache2 is already installed and configured. This would have been an annoying out-of-pocket expense had I needed to rent a VPS, though you can now get a lot of VPS for $5/mo. [sidenote: getting our campus network to play nice with custom DNS was a whole other issue...]

LS allowed me to easily construct data entry surveys, allowing each student to enter, for example, their 25 flips of 3 coins, or their sequence of wins and losses playing the Monty Hall problem. Students can quickly enter this data from their browser at their convenience. At its best, visualizations of the class data can give students a sense of ownership and purpose of in-class exercises. LS also allowed us to conduct initial background surveys, as well as anonymous, customized class "satisfaction" surveys mid-course to find out what was working and what wasn't. We ran into a few administrative issues with LS, but it's an overall powerful and stable data collection platform. Students seemed happy with it, and it provided us with valuable feedback.

What would I do differently? I failed to set up an email list early on. It would have been trivial using, e.g. Google Groups. At the time, it seemed redundant, both for the students and us. Didn't they already have the blog? In retrospect, it would have proved useful at several points to communicate "New Blog Post" or "Due Date Changed, check calendar". I've learned this semester to expect that spoken instructions are not necessarily heeded, and that receiving administrative details in writing from multiple sources (blog, calendar, mailing list) is a Good Idea ™.

Along this vein, a minor needed improvement is making a separate table for assignments. I originally combined assignments with the course calendar in the interest of simplicity, giving students all the relevant information in one place. In the end, it was just confusing.

Overall, the class went very well. We have received positive feedback from the students, and we now have a detailed digital record of the course. PDFs of their final project posters are now in the archive, where they will live in perpetuity. Personally, I'm not ready to teach a MOOC yet, but I'm sold on digital tools as useful supplements to in-class material. They allowed me to spend less time doing more. These tools helped the class "organize itself", and reduced communication overhead between instructors. To older or less technically-inclined teachers, some of this might seem difficult or confusing. On one hand, you really only need *one* instructor to coordinate the tech - it's not very hard for instructors to use the tools once they're set up. On the other hand, some of the above pieces are very easy to implement, and support from a department administrator or technically-inclined teaching assistant (TA) might be available for more challenging pieces. An installation of LS shared across a department, for example, should be trivial for any department that has its own linux server (e.g. math, physics, chem).

In my mind, a key goal here is that technology not get in the way. Many of the commercial "online learning products" that I've seen adopted by universities make simple tasks complicated while lacking flexibility. In trying to do everything for everyone, they often end up doing nothing for anyone (or take a high level of skill or experience to use effectively). I far preferred using several discrete tools, each of which does a single job well (class email, blog to communicate, git repo to hold files).

Are there any interesting tools worth trying out next time? I wonder how a class twitter-hashtag would work...

26 November 2013

The Art of the Album

When I'm working, I'll easily listen to 6 or more hours of music a day. Ten years ago, I listened to a *lot* of public radio, which broadened my ear and introduced me to genres ranging from classical jazz to Native American and New Mexican music. Internet radio gave me more choice of stations (I'm now a happy KEXP micro-donor). Finally, there was Rdio (or Spotify, take your pick). Residents of the U.S. got to hear about the wonders of these all-you-can-eat music-as-a-utility services years before they became available here. Complex licensing agreements had to be signed with our musical overlords. But the future has finally arrived, and for $10/month I now have an internet full of music (including offline access on my phone; $5/mo for just a wired computer).

The magic of a high-quality, easily searched and streamed music archive has transformed the way I listen to music. When I hear a song I like, it now takes me less than 30 seconds to find the album and begin playing on my office computer or phone.
There are a few drawbacks - not every album or artist is available (Joanna Newsom is a particularly galling example), and occasionally I find myself without a reliable cellphone or WiFi signal. But these are minor issues. Overall, the ultimate convenience, the *modern-ness* of it all still blows my mind. To me, this is better than a flying car (of course, I don't even own a normal car). And this convenience has, in the last few years, rekindled my love of the art of the album.

It seems to me that independent music in general has benefited from digital distribution, which allows artists to more easily break from the conventional constraints of genre. I see a lot of experimentation here, running all the way up to the Dirty Projectors' avant-garde classical composition style. Growing up in the 90's, I enjoyed Pearl Jam and Nirvana well enough, but much of the "alternative" music that I listened to at the time sounded (and still sounds) rather similar to my ears - grunge, in a word. The ones that sounded different really stood out, and I still cherish them for it (I'm looking at you, Pixies). Maybe I'm biased now by access to more music and better DJs, but I find the modern American music scene incredibly vibrant and diverse. Every month, I can look forward to new releases from favorite artists, as well as finding something or someone new to open my eyes and make my day.

What follows is an unordered list of albums that I've recently developed a strong relationship with. These albums cover a wide range of the acoustic/electronic spectrum. I enjoy repetitive, energetic music when I'm working or juggling or cleaning; I love the emotion and classic song-writing of "folk" and "country"; and I love the driving anthems of modern indie. Consequently, I like to think there's "something for everyone" here. And each of these is an *album*, a free-standing work of art worthy of repeated enjoyment in its uncensored, unedited entirety.

Obvious:

Macklemore & Ryan Lewis - The Heist (2012)

I'd like to find more music like this: the crossroads between pop and hip hop, independent music that gets radio play, catchy but meaningful. I can think of half a dozen lyric lines that make great life slogans. This is a great album to blast in the car on a warm spring day.

alt-J - An Awesome Wave (2012)

I think of alt-J as the Neutral Milk Hotel of this decade: where the hell did they come from? It's such a beautiful, subtle album that came out of nowhere and bears repetition very well. I beg the gods for more in the future.

Daft Punk - Random Access Memories (2013)

I never really got into Daft Punk before this album, and I didn't even like it that much the first few plays. The songs on this album tend towards longish, some of them are slow, and I found myself getting bored. Then I began getting lines stuck in my head, and began dipping back in. In the end, I find this an immensely satisfying sort of pop-EDM concept album: a soothing mix of repetitive riffs that aren't too fast or insistent, with a backdrop of pop anthem melodies. It strikes me as easy-listening Moby? This album is a little slow for me to "sit down" and listen to, but I find it excellent clean-the-house/driving music.

Phosphorescent - Muchacho (2013)

Rainy day + hot coffee. Sunset and a beer. Just got dumped, fired, graduated, engaged? This is such an extraordinarily luscious, eloquent album. It makes me remember that I have emotions. Lots of them.

Santigold - Master of My Make-Believe (2012)

I always perk up when I hear Santigold singles on the radio, but I was slow to listen to the albums. I like her self-titled 2008 album, but it never really got under my skin. The second or third listen of Master, though, and I wanted to know more about this artist. After digging around a bit, I feel like I have a better idea of where she's coming from, and where she's going. The comparison to M.I.A. is inevitable, while the album art for Master suggests something more like Outkast. Master has tons of energy and is packed with pop-friendly riffs. But it's complex, and strikes me as walking the "don't define me" tightrope (or slackline, if you will; you can push *back* on a slackline). I enjoy that it doesn't settle down into a niche and stay there.

Less Obvious:

Shovels and Rope - O' Be Joyful (2012)

In my mental map of Americana, I file this near Wilco and Drive By Truckers. Sometimes slow and sweet, sometimes fast and rambunctious, but always melodic, this album is full of luscious 2-part harmonies with a low-fi, intimate feel. I'm always sad when it ends; I always want more.

First Aid Kit - The Lion's Roar (2012)

Can I call this indie-Americana? Less of the overt Southern influence of Shovels and Rope, but still full of tight vocal harmonies of country/folk. Apparently they're sisters, and apparently they're young, but this album has a big sound, full of driving melancholy. Playing two or three of their albums back-to-back is particularly satisfying. They seem to be growing as they go, and I'm excited to hear what comes next.

John Grant - Queen of Denmark (2010)

A very good anthem album. I don't often listen or pay attention to lyrics, but Grant has a John Prine-ish storytelling quality, a dark sense of humor and playful irony. Musically, it tends towards simple, with a fast, light quality that reminds me of Paul Simon's Graceland. Thematically, though, it's a dark album. A far-off hint of redemption shines at the end of the tunnel, but just barely. Whistling in the dark.


Sharon Van Etten - Tramp (2012)

A powerful voice, and a powerful song-writer. This album is mature and intimate, and Van Etten's voice is strong and clear. Tight harmonies and vocal stylings that are luxurious without being excessive. The utterly enrapturing, controlled-liquid quality of her voice reminds me a little of the Cowboy Junkies' Margo Timmins, with a bit of Joni Mitchell. In short, she's good.

Matthew Dear - Beams (2012)

My first introduction to Matthew Dear, this album is driving. Repetitive, almost grinding, the samples remind me of smoothed-out, slowed down industrial, or gears-and-grease voodoo. It reminds me of being in the belly of a very large machine. The tone palette is less pure than, say, Daft Punk, with lots of glitches and grinding noises. It's also harmonic, full of discordant melodies. And I *love* it. There are songs that I would love to hear on a dark dance floor in a small, crowded night-club. It's sexy as hell, with a floating touch of loss and nostalgia.

Jagwar Ma - Howlin (2013)

This is a somewhat confusing album. A mix of upbeat chorus-driven pop tunes and beat-and-sample-driven pop-EDM, I find it a little schizophrenic at times. In the space of two songs, it goes from a drivingly upbeat guitar-and-vocals sound akin to Django Django's recent album, to something more akin to Caribou's hypnotic samples, with little in the way of transition. The situation reminds me a little of Hot Chip's recent album In Our Heads (which I still find deeply confusing). But Howlin is infectious throughout, with several singles that belong in the "party mix".


Dirty Projectors - Swing Lo Magellan (2012)

Beautiful melodies with a glorious sheen of tightly-controlled noise and discord, this approaches classical composition in broad-scale interest and ability to scare off pedestrians at a first listen. There's just enough rhythm and melody, though, to reel a music-lover in until one gains some familiarity with the subtleties. Then the album really starts to open up. To my ear, it's the opposite of a show-stopping dramatic pop album. It's playful and light, and strange, and curious and coy, going from simple to huge and back. It's complex and, sometimes subtly, very satisfying. This is real sit-down-and-listen music, kind of like going to see the symphony.

Junip - Self-Titled (2013)

It's not unlikely that you've already heard "Your Life Your Call". I'm sure it's in some movie or another, or will be soon. I get shivers every time I hear this song - like the soundtracks of the Breakfast Club and Trainspotting had a mutant child. Jose Gonzales has a number of solo albums (I'm quite fond of his 2005 Veneer - see below), though I never made the connection with Junip myself. His voice is as clear and emotional as ever, but the sound is bigger and more nuanced, a wonderful blend of semi-acoustic and smooth electronic sounds. This is an emotional album - not any *particular* emotion, but all of them, simultaneously, and a lot. Much like Muchacho, listening to it makes me feel decidedly and acutely human.


Yppah - Eighty One (2012)

Driving indie dream-pop, Yppah's sound is reminiscent of The Helio Sequence with drum machines. Something to get the shoe-gazers moving around!

Less new

but recently discovered or especially noteworthy albums follow.
I'm ready to wrap this post up, so these get just a brief mention, but they're all worth a good listen.

Caribou - Swim (2010)

Smooth, fast, steady electronic. A masterful album.

José González - Veneer (2005)

Contains a cover of The Knife's song "Heartbeats" that I adore. Close and intimate and lush.

Crystal Castles - Self-Titled (2008)

One of my current favorite albums. I think of it as glitch-rock. It's more synth-y than punk, but has a lot of similar aesthetic sensibilities: loud, abrasive, driving, and inspiring. I particularly like to cue up all 3 Crystal Castles albums and listen to them in one go. Loudly.

Gold Panda - Lucky Shiner (2010)

Very smooth, incredibly-produced electronic music. Deeply satisfying, good work music.

Franz Ferdinand - Self-Titled (2004)

Anthemic indie-pop. I'm familiar with most of these songs, and was amazed that they all came from a single album. Buddy Holly meets Lou Reed?

Jolie Holland - Escondida (2004)

Lead singer of The Be Good Tanyas, Jolie Holland's solo album is intimate and enwrapping.

Juana Molina - Tres Cosas (2004)

Quiet and playful yet insistent percussion is the constant backdrop against which Molina's voice plays. And is it ever playful. Her unassuming Spanish is hypnotizing. She has a new release out that I haven't digested yet, but here's another case where I happily queue up 2 or 3 albums in a row and let them blend effortlessly from one to the next.

23 August 2013

Faster, better, cheaper: what is the true value of a computer?

One thing I've had a lot of time to think about over the last 15 years is what exactly "faster and more powerful" means. After a decade of clockspeed wars, we've moved on to more cores, more RAM, longer battery life, less weight, backlit keyboards, etc. A new computer still costs about the same as it did 15 years ago... is it any better?

The more time I spend with the machines, the more I think about usability. A machine is only as powerful as the tasks it can accomplish. I have a $200 netbook (w/linux, of course) that excels at being light. It works great for travel but not Netflix. I couldn't write a paper on it, but I can check email, upload photos/files, etc.

In our computer cluster here, we upgrade as we hit limits. Ran out of RAM? Buy more... it's cheap (for a desktop, at least). Chronic, awful wrist pain? Get an ergonomic keyboard. I find a 2-screen setup a very cost-effective productivity boost, whereas the idea of paying 20% more for a 10% bump in speed strikes me as silly.

The whole state of affairs reminds me a little of mean and variance. We always hear about the expected values of things, the mean, but rarely is variance ever reported, and variance is often the most important part. Things like weather, lifespan, salary, time-to-completion? The variance may be more important than the mean. Speed/power tells something about the "maximum potential" of a machine, but not how much use one might get out of it.

I see two different directions --

First, unexpected developments. Examples include multiple cores and SSD. No one really expected these to change the aesthetics of computers. Nonetheless, the reduced latency of SSD is pleasantly surprising, as is the increased responsiveness under load of a multi-core machine. I don't hit enter and wait. My machine flows more at the speed of thought than it used to, even if I have a browser with 50+ tabs open, music playing, etc.

Second, human interface. My new phone is just a little too tall, which makes it just a little more difficult to use, since I can't reach across the screen with one hand. Again, I never expected this to matter, but it's a human-machine interface question rather than a pure machine capabilities question. Backlit keyboards, ergonomic keyboards, featherweight laptops, and long battery lives are all about the human-machine interface. Which is more important than speed, since the human is the whole point....

The aesthetics of interface is how Apple came to rule the world. Their hardware is beautiful, intuitively responsive to human touch. Hell, even their stores are clean, informative, intuitive, and full of fun toys. Personally, I can't stand their walled-garden approach to hardware, and I have the time and energy to coax my machines into greatness through a commandline interface (which remains one of the most powerful human-machine interfaces ever developed), but I *like* Apple hardware. They've dragged the PC industry kicking and screaming into the 21st century of "Humans matter more than machines".

Which they rather presciently highlighted in their 1984 Superbowl commercial:
http://www.youtube.com/watch?v=2zfqw8nhUwA

Finally, a mind-blowing historical view on the subject, including a 1983 Bill Gates pimping Apple hardware, and Steve Jobs describing how machines should help people rather than the other way around:
http://www.youtube.com/watch?v=xItV5U-V2W4

Everyday revision control

This post has been a long time coming. Over the past year or so, I've gradually become familiar, even comfortable, with git. I've mainly used it for my own work, rather than as a collaborative tool. Most of the folks that I work with don't need to share code on a day-to-day basis, and there's a learning curve that few of my current colleagues seem interested in climbing at this point. This hasn't stopped me from *talking* to my colleagues about git as an important tool in reproducible research (henceforth referred to as RR).

I find the process of committing files and writing commit messages at the end of the day forces me to tidy up. It also allows me to more easily put a project on hold for weeks or months and to then return to it with a clear understanding of what I'd been working on, and what work remained. In short, I use my git commit messages very much like a lab notebook (a countervailing view on git and reproducible research is here, an interesting discussion of GNU make and RR here, and a nice post on RR from knitr author Yihui Xie here).
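That end-of-day routine is nothing exotic; here's a minimal sketch of it (the file name and commit messages are made up), run in a throwaway repository so the commands are self-contained:

```shell
# End-of-day tidy-up sketch: stage the day's work, commit with a
# lab-notebook style message, then skim recent history.
# (The throwaway repo, file name, and messages are hypothetical.)
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "You"
echo "fit <- lm(y ~ x)" > analysis.R
git add analysis.R
git commit -q -m "Add linear fit; next: check residuals"
git log --oneline -n 5    # the messages read like a lab notebook
```

Months later, `git log` (plus `git diff` against an old commit) is the "notebook" that tells you where you left off.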


Sidenote: I've hosted several projects at https://code.google.com/, and used their git archives, particularly for classes (I prefer the interface to github, though the two platforms are similar). I've also increasingly used Dropbox for collaborations, and I've struggled to integrate Dropbox and Git. Placing the same file under the control of two very different synchronization tools strikes me as a Bad Idea (TM), and Dropbox's handling of symlinks isn't very sophisticated. On the other hand, maintaining 2 different file-trees for one project is frustrating. I haven't found a good solution for this yet...
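One commonly-suggested workaround for the two-sync-tools problem is to keep the working tree outside Dropbox and let Dropbox sync only a bare repository, which git then treats as an ordinary remote; the two tools never touch the same files. A sketch, with mktemp directories standing in for real paths like ~/Dropbox:

```shell
# Hedged sketch: working tree lives outside Dropbox; Dropbox syncs only
# a bare repo used as a remote.  mktemp dirs stand in for real paths.
set -e
dropbox=$(mktemp -d)   # stand-in for ~/Dropbox
work=$(mktemp -d)      # working tree, outside Dropbox
git init -q --bare "$dropbox/project.git"
cd "$work"
git init -q
git config user.email "you@example.com"
git config user.name "You"
echo "x <- 1" > model.R
git add model.R
git commit -q -m "initial commit"
git remote add dropbox "$dropbox/project.git"
git push -q dropbox HEAD   # Dropbox now carries the full history
```

This keeps one file-tree per project, at the cost of collaborators needing to share the Dropbox folder rather than a proper hosting service.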

As far as tools go, most of the time I simply edit files with vim and commit from the commandline. In this sense, git has barely changed my work flow, other than demanding a bit of much-needed attention to organization on my part. Lately, I've started using GUI tools to easily visualize repositories, e.g. to simultaneously see a commit's date, message, files, and diff. Both gitk and giggle have similar functionality -- giggle is prettier, gitk is cleaner. Another interesting development is that Rstudio now includes git tools (as well as native LaTeX and knitr support in the Rstudio editor). This means that a default Rstudio install has all the tools necessary for a collaborator to quickly and easily check out a repository and start playing with it.

09 August 2013

Adventures with Android

After months of dealing with an increasingly sluggish and downright buggy Verizon HTC Rhyme, I finally took the leap and got a used Galaxy Nexus. First off, I think it's beautiful. The Rhyme isn't exactly a high-end phone, so the small, unexciting screen isn't particularly surprising. By comparison, the circa 2011 Nexus is a work of art. My first impression of the AMOLED screen is great: dark blacks and luscious color saturation (though I have found it to be annoyingly shiny -- screens shouldn't be mirrors!). It even has a barometer!

One big motivation for a phone upgrade (aside from the cracked screen and aforementioned lag) was being stuck at Android 2.3.4 (Gingerbread). As a technologist, I don't consider myself an early adopter. I prefer to let others sort out the confusion of initial releases, and pick out the gems that emerge. But Gingerbread is well over 2 years old, and a lot has happened since then. The Galaxy Nexus (I have the Verizon CDMA model, codename Toro) is a skin-free, pure Android device. Which means that I am now in control of my phone's Android destiny!

How to go about this? I hit the web and cobbled together a cursory understanding of the Android/google-phone developer ecosystem as it currently stands. First off, there's xda-developers, a very active community of devs and users. There's an organizational page for information on the Galaxy Nexus here that helped me get oriented. This post made installing adb and fastboot a snap on ubuntu 12.04 (precise). There's also some udev magic from google under the "Configuring USB Access" section here that I followed, perhaps blindly (though http://source.android.com is a good primary reference...).

Next, I downloaded ClockworkMod for my device, rebooted my phone into the bootloader, and installed and booted into ClockworkMod:

## these commands are from computer connected to phone (via usb cable)
## check that phone is connected.  this should return "device"
adb get-state
## reboot the phone into the bootloader
adb reboot-bootloader

## in recovery, the phone should show funny pictures of the Android bot opened up... reminds me of Bender.
## the actual file (the last argument) will vary by device
fastboot flash recovery recovery-clockwork-touch-6.0.3.5-toro.img
## boot into ClockworkMod 
fastboot boot recovery-clockwork-touch-6.0.3.5-toro.img

This brought me to the touchscreen interface of ClockworkMod. First, I did a factory reset/clear cache as per others' instructions. Then I flashed the files listed here (with the exception of root) via sideloading. There's an option in ClockworkMod that says something like "install from sideload". Selecting this gives instructions -- basically, use adb to send the files, and then ClockworkMod takes care of the rest:

## do this on computer after selecting "install from sideload" on phone
## the ROM, "pure_toro-v3-eng-4.3_r1.zip", varies by device
adb sideload pure_toro-v3-eng-4.3_r1.zip
## repeat for all files that need to be flashed

I rebooted into a shiny new install of Jelly Bean (4.3). It's so much cleaner and more pleasant than my old phone. I was also pleasantly surprised to see that Android Backup service auto-installed all my apps from the Rhyme.

In the process of researching this, I got a much better idea of what CyanogenMod does. I'm tempted to try it out now, but I reckon I'll wait for the 4.3 release, whenever that happens.


I also found http://www.htcdev.com/bootloader, which offers the prospect of unlocking and upgrading the HTC Rhyme, though I haven't found any ROMs that work for the CDMA version...

13 June 2013

Secure webserver on the cheap: free SSL certificates

Setting up an honest, fully-certified secure web server (e.g. https) on the cheap can be tricky, mainly due to certificates. Certificates are only issued to folks who can prove they are who they say they are. This verification generally takes time and energy, and therefore money. But the great folks at https://www.startssl.com/ have an automated system that verifies identity and automatically issues the associated SSL certificates for free.

Validating an email is easy enough, but validating a domain is trickier -- it requires a receiving mailserver that startssl can mail a verification code to. Inbound port 25 (mail server) is blocked by my ISP, the University of New Mexico (and honestly, I'd rather not run an inbound mail server).

I manage my personal domain through http://freedns.afraid.org/. They provide full DNS management, as well as some great dynamic DNS tools. They're wonderful. But they don't provide any fine-grained email management, just MX records and the like.

The perfect companion of afraid.org is https://www.e4ward.com/. They have mail servers that will conditionally accept mail for specific addresses at a personal domain, and forward that mail to an email account. This lets me route specific addresses @mydomain.com, things like postmaster@mydomain.com, to my personal gmail account. E4ward is a real class act. They manually moderate/approve new accounts, so there's a bit of time lag. To add a domain, they also require proof of control via a TXT record (done through afraid.org).

This whole setup allowed me to prove that I owned my domain to startssl.com without running a mail server or paying for anything other than the domain. The result is my own SSL certificates. I'm running a pylons webapp with apache2 and mod_wsgi. In combination with python's repoze.what, I get secure user authentication over https without any snakeoil.
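For reference, the local half of getting any certificate is generating a private key and a certificate signing request (CSR) with openssl; the CA signs the CSR. A minimal sketch with a hypothetical domain (the exact submission steps on startssl.com differ, and they can also generate the key for you):

```shell
# Hedged sketch of local key + CSR generation; "mydomain.com" is a
# hypothetical domain.  The CSR is what a certificate authority signs.
set -e
cd "$(mktemp -d)"
openssl genrsa -out mydomain.key 2048
openssl req -new -key mydomain.key -out mydomain.csr \
    -subj "/CN=mydomain.com"
## sanity-check the request before submitting it
openssl req -in mydomain.csr -noout -verify
```

Keep the .key file private; only the .csr leaves your machine.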

Hat-tip to this writeup, which introduced me to e4ward.com and their mail servers.

Finally, there are a number of online tools to query domains. dnsstuff.com was one of the better ones I found. It takes longer to load, but gives a detailed report of domain configuration, along with suggestions. A nice tool to verify that everything is working as expected.


11 June 2013

Learning new fileserver tricks: RAID + LVM

I've finally gotten comfortable with Linux's software RAID, aka mdadm. I've been hearing about LVM, and I finally took the plunge and figured out how to get the two to play together. Of course, a benefit of RAID is data security. The big benefit I see from LVM is getting to add/remove disk space without repartitioning. Once RAID is working, stacking LVM on top was easy enough, especially for my use case of a single big filesystem. I was able to move all my data onto one RAID array, build a new filesystem on top of a logical volume spanning the other two arrays, move the data to the new filesystem, and then add the final RAID array to the logical volume and resize the filesystem. Thus, I end up with 3 separate RAID arrays glommed together into a single, large filesystem.

## Tell LVM about RAID arrays 
sudo pvcreate /dev/md2
sudo pvcreate /dev/md3

## Create a volume group from empty RAID arrays
sudo vgcreate VolGroupArray /dev/md2 /dev/md3

## Create a logical volume named "archive", using all available space 
sudo lvcreate -l +100%FREE VolGroupArray -n archive
sudo lvdisplay 
## and create a filesystem on the new logical volume 
sudo mkfs.ext4 /dev/VolGroupArray/archive

## mount the new filesystem
## and move files from the mount-point of /dev/md1 to /dev/VolGroupArray/archive
## then unmount /dev/md1

## Add the last RAID array to the volume group
sudo pvcreate /dev/md1
sudo vgextend VolGroupArray /dev/md1

## Update the logical volume to use all available space 
sudo lvresize -l +100%FREE /dev/VolGroupArray/archive
## And resize the filesystem -- rather slow, maybe faster to unmount it first...
sudo resize2fs /dev/VolGroupArray/archive

## Finally, get blkid and update /etc/fstab with UUID and mount options (here, just noatime)
sudo blkid

I probably should have made backups before I did this, but everything went smoothly...
Also, I discovered this Python tool to do conversions in-place. Again, this appears non-destructive, but backups never hurt. Also of interest for a file server is smartmontools, which monitors for hardware/disk failures: a nice review is here.

[REFS]
* http://home.gagme.com/greg/linux/raid-lvm.php
* https://wiki.archlinux.org/index.php/Software_RAID_and_LVM
* http://webworxshop.com/2009/10/10/online-filesystem-resizing-with-lvm

07 June 2013

Symmetric set differences in R

My .Rprofile contains a collection of convenience functions and function abbreviations. These are either functions I use dozens of times a day and prefer not to type in full:
## my abbreviation of head()
h <- function(x, n=10) head(x, n)
## and summary()
ss <- summary
Or problems that I'd rather figure out once, and only once:
## example:
## between( 1:10, 5.5, 6.5 )
between <- function(x, low, high, ineq=FALSE) {
    ## like SQL BETWEEN, return a logical index
    ## ineq=TRUE makes the endpoints inclusive (as in SQL);
    ## the default uses strict inequalities
    if (ineq) {
        x >= low & x <= high
    } else {
        x > low & x < high
    }
}
One of these "problems" that's been rattling around in my head is the fact that setdiff(x, y) is asymmetric, and has no options to modify this. With some regularity, I want to know if two sets are equal, and if not, what are the differing elements. setequal(x, y) gives me a boolean answer to the first question. It would *seem* that setdiff(x, y) would identify those elements. However, I find the following result rather counter-intuitive:
> setdiff(1:5, 1:6) 
integer(0)
I personally dislike having to type both setdiff(x,y) and setdiff(y,x) to identify the differing elements, and having to remember which argument is the reference set (here, the second one, which I find counterintuitive). With this in mind, here's a snappy little function that returns the symmetric set difference:
symdiff <- function(x, y) { setdiff(union(x, y), intersect(x, y)) }
> symdiff(1:5, 1:6) == symdiff(1:6, 1:5)
[1] TRUE

Tada! A new function for my .Rprofile!

25 November 2012

Successive Differences of a Randomly-Generated Timeseries

I was wondering about the null distribution of successive differences of random sequences, and decided to do some numerical experiments. I quickly realized that taking successive differences equates to taking successively higher-order numerical derivatives, which acts as a high-pass filter. So, the null distribution really depends on the spectrum of the original timeseries.

Here, I've only played with random sequences, which are equivalent to white noise. Using the wonderful animation package, I created a movie that shows the timeseries resulting from differencing, along with their associated power spectra. You can see that, by the end, almost all of the power is concentrated in the highest frequencies. The code required to reproduce this video is shown below.

Note -- For optimum viewing, switch the player quality to high.
require(animation)
## large canvas, write output to this directory, 1 second per frame
ani.options(ani.width=1280, ani.height=720, loop=F, title='Successive differences of white noise', outdir='.', interval=1)
require(plyr)
require(xts)
## xyplot() comes from lattice, which isn't attached automatically
require(lattice)
## How many realizations to plot?
N=5
## random numbers
aa = sapply(1:26, function(x) rnorm(1e2));
colnames(aa) = LETTERS;

saveVideo( {
    ## for successive differences, do...
    for (ii in 1:50) {
        ## first make the differences and normalize
        aa1 = apply(aa, 2, function(x) {
            ret=diff(x, differences=ii);ret=ret/max(ret)
        }); 
        ## Turn into timeseries object for easy plotting
        aa1 = xts(aa1, as.Date(1:nrow(aa1)));

        ## next, compute spectrum
        aa2 = alply(aa1, 2, function(x) {
            ## of each column, don't modify original data 
            ret=spectrum(x, taper=0, fast=F, detrend=F, plot=F);
            ## turn into timeseries object
            ret= zoo(ret$spec, ret$freq)});
        ## combine into timeseries matrix
        aa2 = do.call(cbind, aa2 );
        colnames(aa2) = LETTERS;

        ## plot of timeseries differences
        ## manually set limits so plot area is exactly the same between successive figures
        myplot = xyplot(aa1[,1:N], layout=c(N,1),
                        xlab=sprintf('Difference order = %d', ii), 
                        ylab='Normalized Difference',
                        ylim=c(-1.5, 1.5), 
                        scales=list(alternating=F, x=list(rot=90), y=list(relation='same'))); 

        ## plot of spectrum
        myplot1 = xyplot(aa2[,1:N], layout=c(N,1),
                        ylim=c(-0.01, 5), xlim=c(0.1, 0.51),
                        xlab='Frequency', ylab='Spectral Density',
                        type=c('h','l'), 
                        scales=list(y=list(relation='same'), alternating=F));
        ## write them to canvas
        plot(myplot, split=c(1,1,1,2), more=T);
        plot(myplot1, split=c(1,2,1,2), more=F);
        ## provide feedback of process
        print(ii)}

## controls for ffmpeg
}, other.opts = "-b 5000k -bt 5000k")