Labels

R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

11 June 2011

The importance of being unoriginal (and befriending google)

In search of bin counts

I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I've never had any reason to go under the hood until now. Lately, I've been looking using the bin counts for things like Shannon entropy ( in the very nice entropy package. I figured that binning and counting data would either be supported via a native, dedicated R package, or quite simple to code. Not finding the former ( base graphics hist() uses .Call("bincounts"), which appears undocumented and has a boatload of arguments ), I naively failed to search for a package and coded up the following.

myhist = function(x, dig=3)  {
    x=trunc(x, digits=dig);
    ## x=round(x, digits=dig);
    aa = bb = seq(0,1,1/10^dig);
    for (ii in 1:length(aa)) {
        aa[ii] = sum(x==aa[ii])
    };
    return(cbind(bin=bb, dens=aa/length(x)))
}


## random variates
test = sort(runif(1e4))
get1 = myhist(test)

Trouble in paradise

Truncate the data to a specified precision, and count how many are in each bin. Well, first I tried round(x) instead of trunc(x), which sorta makes sense but gives results that I still don't understand. On the other hand, trunc(x) doesn't take a digits argument? WTF? Of course, I could use sprintf(x) to make a character of known precision and convert back to numeric, but string-handling is waaaaaay too much computational overhead. Like towing a kid's red wagon with a landrover...

Dear Google...

An hour of irritation and confusion later, I ask google and, small wonder, the second search result links to the ash package that contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn't return the bin boundaries by default, but it's good enough for a quick-and-dirty empirical probability mass distribution.

To be fair, there's something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn't quite as simple as one first thought. On the other hand, sometimes we just want answers. When that's the case, asking google is a pretty good bet.

## their method
require(ash)
get2 = bin1(test, c(0,1), 1e3+1)$nc

3 comments:

  1. You should really be more productive on your caffeine buzzes.

    ReplyDelete
  2. Cause these long-winded musings are a waste of your processing power. Imagine how many whales you could have saved or whatever you do.

    ReplyDelete
  3. Is table(cut(x, bb)) doing what you want? Format is different but result are counts of x in bins.

    ReplyDelete