29 March 2011

Why PKI matters.

This article landed in my inbox this week in a newsletter from the EFF. I usually don't read them, but the term "meltdown" caught my eye, what with all the nuke news this month. They also managed to work in "too big to fail", and neither reference was hyperbolic. The internet depends on a level of trust, and (surprise) there are people working to co-opt that trust.

A number of my friends think I'm a little weird for using real passwords, for not sharing them, etc. But I read Cryptonomicon and friends; I have at least one box exposed to the real, honest-to-god, jungle-out-there internet; I have a healthy fear of all the things that could go wrong. As part of my lab group's data entry project, I recently registered my first SSL certificate from http://www.startssl.com/ (free!), and learned quite a bit about PKI in the process.

Still, reading an article like this really drives home both the complexity and the importance of the global PKI system. Trust is difficult when it's strung across the globe on fiber-optic cables, and enforced by our inability to quickly factor very large numbers. But old-school techniques of impersonation and breaking and entering will always be with us. I may trust google.com, but how do I know that it's actually them? PKI.

How cool is that?
Very.

20 March 2011

Looking at the "Curse of Dimensionality" with R, foreach, and lattice

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

[Figures: lattice panels of distance vs. dimension for the Manhattan, Euclidean, and maximum metrics, grouped by standard deviation]
It's neat to see distance scaling linearly with standard deviation, and as the Lth root of dimension for an L-norm metric (e.g. Euclidean = L2 norm, which grows as the square root of dimension). It took me a while to "get" how the maximum metric worked -- it's simpler than I was trying to make it: just the single largest coordinate-wise difference, though I'm still not sure I can explain its slow growth properly.
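As a quick sanity check on that scaling (my sketch, not part of the assignment): each coordinate difference between two N(0, sd^2) points is N(0, 2*sd^2), so the Euclidean distance concentrates around sd*sqrt(2*d) -- square-root growth in dimension, and linear growth in sd:

## two random points in d dimensions, coordinates ~ N(0, s^2)
d = 1e4; s = 0.5
x = rnorm(d, sd=s); y = rnorm(d, sd=s)
sqrt(sum((x - y)^2))  ## observed Euclidean (L2) distance
s * sqrt(2 * d)       ## predicted scale: grows as the L2 root of d
sum(abs(x - y))       ## Manhattan (L1): grows linearly with d
max(abs(x - y))       ## maximum metric: the single largest coordinate
                      ## gap, which grows only very slowly with d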

In general, I'm a big fan of R's lattice package, though I'm never quite happy with the legend/key settings.

The code is below. One thing that I've learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second, and less aesthetically, foreach seems to excel when it has jobs of medium complexity iterated over a medium number of elements (see the sketch below). Too few elements plus a very complex job means one or more cores will finish early and sit idle; too many elements plus a very simple job means too many jobs get spawned, and (I'm guessing here) a lot of time gets wasted on job setup and teardown.
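To illustrate that granularity trade-off, here's a toy sketch (mine, not from the assignment; sqrt is just a stand-in for real work) of handing each worker one medium-sized chunk instead of one trivial job per element:

require(foreach)
require(doMC)
registerDoMC(cores=4)

## too fine-grained: 1e5 trivial jobs, so scheduling overhead dominates
## res = foreach(i=1:1e5, .combine=c) %dopar% sqrt(i)

## medium-grained: one job per chunk, so each core gets real work
chunks = split(1:1e5, cut(1:1e5, 8))
res = foreach(ix=chunks, .combine=c) %dopar% sqrt(ix)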

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ##
    require(foreach)
    require(iterators)  ## for iter()
    require(doMC)
    registerDoMC(cores=cores)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=numeric(reps*length(dims)*length(.methods)), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep <= reps) {
                ## 2 points, one per row, randomly chosen
                tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
                ## calc dist
                mydist = as.numeric(dist(tmp, method=method))
                ## fill return dataframe
                ret[ii, ] = list(ss, .dim, .rep, mydist, method)
                ## update rep counter
                .rep = .rep +1
                ## update ret row counter
                ii = ii +1
            }
          }
        }
        return(ret)
    } ## rbind happens here
    return(finret)
}
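The plotting calls below assume the results are saved in a data frame aa; the original call isn't shown, so this (scaled-down) invocation is my guess:

## a smaller run than the defaults, to keep the runtime reasonable
aa = dimcurse(dims=seq(1, 1000, by=10), reps=100, cores=4)
str(aa)  ## columns: .sd, .dim, .rep, .dist, method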

For plotting, I've used:
xyplot(.dist ~ (.dim)|method, groups=.sd, data = aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data = aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)