R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

29 October 2011

SQL Koan

It's not that profound, but I sure do like it: a simple, elegant example of a self-join that gives a truth table for the NOT DISTINCT FROM operator.

WITH x(v) AS (VALUES (1),(2),(NULL))
  SELECT l.v, r.v, l.v = r.v AS equality,
    l.v IS NOT DISTINCT FROM r.v AS isnotdistinctfrom
  FROM x l, x r;

Thanks to Sam Mason via the PostgreSQL - general mailing list.

28 October 2011

Why can't ads be more fun?

The worst part about advertisements these days, in my not-so-humble opinion, is the degree to which they belittle the user. Well-endowed lady wants to be my Facebook friend? "Click here to confirm request." Riiight. Lonely singles near me? Yeah, sure. OH GOOD LORD, A THREAT HAS BEEN IDENTIFIED! Oh, wait, y'all really are evil! On the infrequent occasions that I do watch TV, the ads are less flagrantly insidious, but they're nonetheless relentlessly patronizing.

The second-worst part of ads, IMNSHO, is the brute-force repetition. The same inane 30 seconds once? I can do that, but five times in an hour? You've got to be kidding me. Nope, no kidding here. Just 30 seconds of the same inanity, again and again.

Consequently, I find unique and intelligent ads rather inspiring. I know, it shouldn't be this way. One might thing that unique and inspiring would be more common amongst something so prevalent. Anyway...

Some of the GEICO ads that have played on Hulu lately have managed to avoid the "worst part". They're short and playful, vaguely reminiscent of [adult swim] commercials. "Is the sword mightier than the pen" is my current favorite; I still sometimes chuckle when I see it. They still fail occasionally in the repetition department. If you're a large international corporation, how hard is it to make more than a handful of moderately-entertaining 30-second spots?

Which brings me back to banner ads. I just saw this at the top of ars technica today for the first time, and it really caught my eye. I don't read or write Javascript, but I found myself puzzling through it and then following the link.

At the top of the page, I find this banner, which visually reminds me where I came from, that I've come to the right place, and gives me another little puzzle. Neat. Thank you, Heinz College of Carnegie Mellon.

None of this is revolutionary, but I do hope it is evolutionary. Ads don't have to be boring, and I'm guessing that interesting and intelligent ads are more effective even as they're less painful and insulting. My guess is that a generational change-over in marketing departments and their managers is underway, and will be slow. Still, I look forwards to a new generation of more modern advertising approaches that gives me something to think while my eyeballs are held hostage.

06 October 2011

FFT / Power Spectrum Box-and-Whisker Plot with Gggplot2

I have a bunch of time series whose power spectra (FFT via R's spectrum() function) I've been trying to visualize in an intuitive, aesthetically appealing way. At first, I just used lattice's bwplot, but the spacing of the X-axis here really matters. The spectra's frequencies aren't regularly-spaced categories, which is the default of bwplot. If all the series are of the same length, then the coefficients of the are all estimated at the same frequencies, and one can use plyr's split-apply-combine magic to compute the distribution of coefficients at each frequency.

I spent some time trying to faithfully reproduce the plotting layout of the figures produced by, e.g. spectrum(rnorm(1e3)) with respect to the axes. This was way more annoying than i expected...

To include a generic example, I started looking around for easily generated, interesting time series. The logistic map is so damned easy to compute and generally pretty interesting, capable of producing a range of frequencies. Still, I was rather surprised by the final output. Playing with these parameters produces a range of serious weirdness -- apparent ghosting, interesting asymmetries, etc..

Note that I've just used the logistic map for an interesting, self-contained example. You can use the exact same code to generate your own spectrum() box-and-whisker plot by substituting your matrix of time series for mytimeseries below. Do note here that series are by row, which is not exactly standard. For series by column, just change the margin from 1 to 2 in the call to adply below.

## box-and-whisker plot of FFT power spectrum by frequency/period
## vary these parameters -- r as per logistic map
## see for details
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) {
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])

## modify these for interesting behavior differences
rfrom = 3.4
rto = 3.7


mytimeseries = aaply(seq(from=rfrom, to=rto, length.out=rsteps), 1,
function(x) {
       logmap(seqlen, x)

## you can plug in any array here for mytimeseries
## each row is a timeseries
## for series by column, change the margin from 1 to 2, below
logspec = adply( mytimeseries, 1, function(x) {
       ## change spec.pgram parameters as per goals
       ret = spectrum(x, taper=0, demean=T, pad=0, fast=T, plot=F)
       return( data.frame(freq=ret$freq, spec=ret$spec))

## boxplot stats at each unique frequency = ddply(logspec, 'freq', function(x) {
           ret = boxplot.stats(log10(x$spec));
           ## sometimes $out is empty, bind NA to keep dimensions correct
           return(data.frame(t(10^ret$stats), out=c(NA,10^ret$out)))

## plot -- outliers are dots
## be careful with spacing of hard returns -- ggplot has odd rules
## as close to default "spectrum" plot as possible
ggplot(, aes(x=1/(freq)))  +  geom_ribbon(aes(ymin=X1,
ymax=X5), alpha=0.35, fill='green')  +
       geom_ribbon(aes(ymin=X2, ymax=X4), alpha=0.5, fill='blue') +
geom_line(aes(y=X3))  +
       geom_point(aes(y=out), color='darkred') +
       scale_x_continuous( trans=c('inverse'), name='Period')  +
       scale_y_continuous(name='Power', trans='log10')

Following my nose further on the logistic map example, I also used the animate package to make a movie that walks through the power spectrum of the map for a range of r values. Maybe one day I'll set this to music and post it to Youtube! Though I think some strategic speeding up and slowing down in important parameter regions is warranted for a truly epic tour of the logistic map's power spectrum.

       ani.options(interval = 1/5, ani.type='png', ani.width=1600, ani.height=1200)
        for (ii in c(seq(from=3.2, to=3.5, length.out=200), seq(from=3.5, to=4, length.out=1200))) {
            spectrum(logmap(1e4, ii), taper=0, demean=T, pad=0, fast=T, sub=round(ii, 3))
}, = "logmap2.mp4", other.opts = "-b 600k")

Bat Country

I've spent a lot of time thinking about and using R's spectrum() function and the Fast Fourier Transform (FFT) in the last 5+ years. Lately, they've begun to remind me a little of a Theremin: simple to use, difficult to master.

While prepping a figure for the R Graph Gallery (which is awesome, by the way), I ran across this curious example -- its striking visual appearance was definitely not what I was expecting.

I decided to use the logistic map to generate an ensemble of time series with a range of frequencies. The logistic map goes through characteristic period doubling, and then exhibits broad-spectrum "noise" at the onset of chaos. And it's really easy to compute. So, I can tune my time series to have a range of frequency distributions.

In this example, I'm in the periodic regime (r=3.5, period 4) . So why am I getting this Batman motif?

Actually, it's weirdly simple. This is an artifact of the (default) tapering that R's spectrum() function applies to the series before the FFT is computed. In theory, the Fourier Transform assumes an infinite-length sequence, while in practice the FFT assumes the series is circular -- in essence, gluing the beginning and end of the series together to make a loop. If the series is not perfectly periodic, though, this gluing introduces a sharp discontinuity. Tapering is typically applied to reduce or control this discontinuity, so that both ends gently decline to zero. When the series is genuinely periodic, though, this does some weird effects. In essence, the taper transfers power from the fundamental frequencies in a curious way. Note that the fundamental period is 4, as seen by the peak at frequency 1/4, with a strong harmonic at period 2, frequency 1/2.

Zero-padding has a similar effect. If your series is zero (or relatively small) at both ends, then you could glue it together without introducing discontinuities, but peaks near the beginning and the end will still be seen as near by, resulting in spurious peaks. In any case, R pads your series by default (fast=T) unless the length of the series is highly composite (i.e. it can be factored into many small divisors). Below, it turns out that a series of length 1e5 is composite enough that R doesn't pad it. 1e5? Totally different results.

If I understand this correctly, both padding and tapering are attempts to reduce or control spectral leakage. We've constructed an example where no spectral leakage is expected from the original time series, which closely fits the assumptions of the FFT without modification.

The moral? Know thy functions (and their default parameter values, and their effects)!

## using R -- 
## logistic map function:
logmap = function(n, r, x0=0.5) { 
    ret = rep(x0,n)
    for (ii in 2:n) { 
        ret[ii] = r*ret[ii-1]*(1-ret[ii-1])

## compare:
spectrum(logmap(1e5, 3.5), main='R tapers to 0.1 by default')
spectrum(logmap(1e5, 3.5), taper=0, main='No taper')
spectrum(logmap(1e5+1, 3.5), taper=0, main='Non-composite length, fast=T pads by default')
spectrum(logmap(1e5, 3.5), taper=0.5, sub='Lots of padding approximately recovers "correct" results')