Life in Code: Looking at the "Curse of Dimensionality" with R, foreach, and lattice

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

It's neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to "get" how the maximum metric worked -- it's simpler than I was trying to make it, though I'm still not sure I can explain it properly.

In general, I'm a big fan of R's lattice package, though I'm never quite happy with the legend/key settings.

The code is below. One thing that I've learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I'm guessing here) a lot of time will be wasted with job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ##
    require(foreach)
    require(doMC)
    registerDoMC
    registerDoMC(cores=cores)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep <= reps) {
                ## 2 points, one per row, randomly chosen
                tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
                ## calc dist
                mydist = as.numeric(dist(tmp, method=method))
                ## fill return dataframe
                ret[ii, ] = list(ss, .dim, .rep, mydist, method)
                ## update rep counter
                .rep = .rep +1
                ## update ret row counter
                ii = ii +1
            }
          }
        }
        return(ret)
    } ## rbind happens here
    return(finret)
}

For plotting, I've used:

xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)

Life in Code

Pages

Labels

20 March 2011

Looking at the "Curse of Dimensionality" with R, foreach, and lattice

No comments:

Post a Comment

Blog Archive

My Blog List

Followers

Search This Blog

About Me