Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's
Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.
It's neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root for an L-norm metric (e.g. Euclidean = L2 norm). It took me a while to "get" how the
maximum metric worked -- it's simpler than I was trying to make it, though I'm still not sure I can explain it properly.
In general, I'm a big fan of R's
lattice package, though I'm never quite happy with the legend/key settings.
The code is below. One thing that I've learned from this exercise is that I prefer not to have the
foreach call in the outermost loop. For one, error and/or progress reporting from inside a
foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex job means one or more cores will finish early and sit idle. Too many elements + very simple job means too many jobs will be spawned, and (I'm guessing here) a lot of time will be wasted with job setup and teardown.
dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
## examine distance between 2 points in dims dimensions for different metrics
## chosen at random according to rfun (rnorm by default)
##
require(foreach)
require(doMC)
registerDoMC
registerDoMC(cores=cores)
## use multicore/foreach on the outside (each sd)
finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
## pre-allocate, don't append/rbind!
ret = data.frame(.sd=1:(reps*length(dims))*length(methods), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
ii = 1
for (.dim in dims) {
for (method in .methods) {
## reset .rep counter
.rep = 1
while (.rep <= reps) {
## 2 points, one per row, randomly chosen
tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
## calc dist
mydist = as.numeric(dist(tmp, method=method))
## fill return dataframe
ret[ii, ] = list(ss, .dim, .rep, mydist, method)
## update rep counter
.rep = .rep +1
## update ret row counter
ii = ii +1
}
}
}
return(ret)
} ## rbind happens here
return(finret)
}
For plotting, I've used:
xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.',
par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
scales=list(y=list(relation='free'), alternating=1),
auto.key=list(space='bottom', columns=5, points=F, lines=T),
xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)
xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
scales=list(y=list(relation='free'), alternating=1),
auto.key=list(space='bottom', columns=3, points=F, lines=T),
xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)