
24 February 2011

snow and ssh -- secure inter-machine parallelism with R

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:
"
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.

... It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

... the SSHcluster method would require minimal invasion on those machines -- just ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!
"

Requirements:

R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost), the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permissions to run Rscript and read the R libraries. Note that Rscript is far from a secure program to allow untrusted users to run. For the paranoid, place it in a chroot jail...


Step-by-step:

The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from the command line (shown by $) or the R session in which work is to be done (shown by >). You will need to replace remotehost with the hostname or IP address of remotehost, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don't have to keep typing your key password again and again. Even better, set up a .ssh/config file for host-specific aliases and configuration.
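As a minimal sketch of the host-alias idea, the helper below prints a stanza that could be added to ~/.ssh/config. The function name, hostname, user name, and key path are all placeholders invented for illustration; only the Host, HostName, User, and IdentityFile keywords are standard OpenSSH options.

```shell
# Hypothetical helper: print a ~/.ssh/config stanza for one host alias.
# Alias, hostname, user, and key path are placeholders, not from the post.
ssh_config_stanza() {
  alias_name=$1; host_name=$2; user_name=$3; key_file=$4
  printf 'Host %s\n' "$alias_name"
  printf '    HostName %s\n' "$host_name"
  printf '    User %s\n' "$user_name"
  printf '    IdentityFile %s\n' "$key_file"
}

# Review the output, then add it to ~/.ssh/config by hand.
ssh_config_stanza remotehost remote.example.org alice '~/.ssh/id_rsa'

# With an agent running, the key passphrase is typed once per session:
#   eval "$(ssh-agent -s)"
#   ssh-add ~/.ssh/id_rsa
```

With an alias like this in place, the commands that follow can refer to remotehost without repeating the user name or key.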

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> library(snow); library(foreach); library(doSNOW)
> cl <- makeCluster(rep('localhost', 3), type='SOCK', manual=TRUE)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; do
    ssh -f remotehost "/usr/lib64/R/bin/Rscript ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R \
      MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"
  done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)

Conclusions?

This is a funny example of %dopar%; it's minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don't want to do.

Next up:
  • Write an R function that uses calls to system to set the tunnel up and spawn the slave processes, with string substitution for remotehost, etc.
  • Set up a second port and socketConnection to read each slave's stdout to pick up errors. Perhaps call warning() on the output? Trouble-shooting is so much easier when you can see what's happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where its packages live, right?
  • As written, it is a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to choose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn't been updated since July 2008, and a Google search shows up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).
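
The string-substitution item in the list above can be roughed out in shell before committing to an R wrapper. The sketch below is a dry run: it only prints the ssh commands for a list of host:count pairs, so nothing is executed. The function names and the hostA/hostB names are hypothetical; the port and paths are copied from the commands earlier in the post. Reusing port 10187 for every host is an assumption, and the sequential-port scheme is not attempted here.

```shell
# Dry-run sketch: prints the ssh commands rather than running them.
# Function names are hypothetical; port and paths are copied from the post.
PORT=10187
RSCRIPT=/usr/lib64/R/bin/Rscript
SNOWLIB='~/R/x86_64-pc-linux-gnu-library/2.12'   # left for the remote shell to expand

# Reverse tunnel: port $PORT on the remote host leads back to this host.
tunnel_cmd() {
  printf '%s\n' "ssh -f -R ${PORT}:localhost:${PORT} $1 sleep 600"
}

# One snow worker on the remote host, connecting to the forwarded port.
worker_cmd() {
  printf '%s\n' "ssh -f $1 \"${RSCRIPT} ${SNOWLIB}/snow/RSOCKnode.R MASTER=localhost PORT=${PORT} OUT=/dev/null SNOWLIB=${SNOWLIB}\""
}

# Takes host:count pairs; prints one tunnel and <count> workers per host.
plan_cluster() {
  for spec in "$@"; do
    host=${spec%%:*}
    count=${spec##*:}
    tunnel_cmd "$host"
    k=0
    while [ "$k" -lt "$count" ]; do
      worker_cmd "$host"
      k=$((k + 1))
    done
  done
}

plan_cluster hostA:2 hostB:1
```

Once the printed commands look right, they can be run by hand. The total number of workers printed has to match the number of nodes passed to makeCluster in the R session.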

+1 if you'd like to see a doSSH package.
+10 if you'd like to help!

8 comments:

  1. have you tried the doRedis package?
    it is very much what you want.

  2. From the redis package manual:
    "Redis supports a trivially simple and insecure authentication method. This function implements it. [...] You should not use this function. If you need a secure key/value database, it’s best not to use Redis for now."

    So no, using redis for IPC isn't what I'm looking for :)

    Just to clarify, "what I want" is the *thinnest* possible stack to securely deploy %dopar% on multiple machines, with as few privileges as possible. I'm curious to hear folks' thoughts on MPI -- at first blush, it sounded like "just another piece of software to learn and install", but I'm curious now...

  3. Really nice post.

    I have been trying to do this for quite some time now, but I always come back to emacs and tramp, because I am not able to get this working.

    When I follow your steps closely I can establish the connection and even register the backend, but when I try to use it with %dopar% I always get
    Error in serialize(data, node$con) : ignoring SIGPIPE signal

    I suspect this to be something connected with the firewall or the connection being between separate sub-networks, as I have been using snow within one network successfully for a while.
    Strange thing is that I use reverse tunnelling regularly for other purposes than R without problems.

    Any ideas by any chance?

  4. Thanks!

    Dirk's post really helped me step through the trouble-shooting by deconstructing the process into orthogonal steps. The ssh connections seem to be the most fragile part, and in some senses easiest to test.

    I'm curious about your setup. Do these steps work within your network and only fail when you try to cross sub-networks?

    If you're interested in this, I'd appreciate your thoughts and testing. The easiest way to get my attention is twitter -- @prosopis :)

  5. In "do ssh -f remotehost ...", shouldn't "remotehost" depend on i?
    I mean something like

    do ssh -f "remotehost$i" ...

  7. In the example I gave, I was spawning multiple slaves on one machine named "remotehost". You could wrap another bash do loop around this with multiple hostnames, but I prefer to handle that part in R -- then each machine name in the R list of remotehosts can have a distinct number of slaves associated with it, and an R function calls the bash loop once per host.

    Replies
    1. Thank you so much for your post. I don't have any real experience with Unix systems but after some trying I've managed to repeat your code on AWS Ubuntu instances and RStudio server. That was cool :-)
      Could you please share an example of your R code to manage multiple remotehosts? I feel more comfortable with R than with an SSH terminal.
      Thanks a lot. Regards, Tim.
