
24 February 2011

snow and ssh -- secure inter-machine parallelism with R

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.

... It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

... the SSHcluster method would require minimal invasion on those machines -- just ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!


R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost), the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permissions to run Rscript and read the R libraries. Note that Rscript is far from a secure program to allow untrusted users to run. For the paranoid, place it in a chroot jail...


The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from the commandline (shown by $) or the R session in which work is to be done (shown by >). You will need to replace remotehost with the hostname or ip address of remotehost, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don't have to keep typing your key password again and again. Even better, set up an .ssh/config file for host-specific aliases and configuration.
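For what it's worth, here is a rough sketch of such a host alias. The alias name (rhost), the address, the user name and the key path are all placeholders made up for illustration, and the snippet writes to a scratch file rather than your real ~/.ssh/config so nothing gets clobbered.

```shell
# Sketch only: every value below is a made-up placeholder.
# Writes to a scratch file; merge into ~/.ssh/config by hand once it looks right.
cfg=$(mktemp)

cat > "$cfg" <<'EOF'
# alias for the machine called remotehost in this post
Host rhost
    HostName remotehost.example.com
    User myuser
    IdentityFile ~/.ssh/id_rsa
EOF

chmod 600 "$cfg"
echo "wrote $cfg"
# Try it out with:  ssh -F "$cfg" rhost
```

With something like this in place, "rhost" can stand in for remotehost in the commands that follow.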

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> require(snow); require(foreach); require(doSNOW);
> cl = makeCluster(rep('localhost',3), type='SOCK', manual=T)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; do
$   ssh -f remotehost "/usr/lib64/R/bin/Rscript ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R \
$       MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"
$ done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)


This is a funny example of %dopar%; it's minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don't want to do.

Next up:
  • Write an R function that uses calls to system to set the tunnel up and spawn the slave processes, with string substitution for remotehost, etc.
  • Set up a second port and socketConnection to read each slave's stdout to pick up errors. Perhaps call warning() on the output? Trouble-shooting is so much easier when you can see what's happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where its packages live, right?
  • As written, it's a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to choose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn't been updated since July 2008, and a Google search shows up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).
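
To make the first and last items on that list a bit more concrete, here is a minimal dry-run sketch in shell (rather than the R function proposed above). It only prints the tunnel and spawn commands, substituting the host name and a port that counts up from a base value. The function name ssh_cluster_cmds and the host names are invented for illustration, the paths are copied from the example above, and how the R side would listen on several ports at once is left open.

```shell
# Sketch only: prints the commands instead of running them (a dry run).
# RSCRIPT and SNOWLIB are copied from the example above; adjust for your machines.
RSCRIPT=/usr/lib64/R/bin/Rscript
SNOWLIB='~/R/x86_64-pc-linux-gnu-library/2.12'   # quoted so ~ expands on the remote side

# ssh_cluster_cmds BASEPORT NSLAVES HOST [HOST ...]
# Each host gets its own port, counting up from BASEPORT.
ssh_cluster_cmds() {
    port=$1; nslaves=$2; shift 2
    for host in "$@"; do
        # reverse tunnel: this port on the remote host leads back to the same port here
        echo "ssh -f -R $port:localhost:$port $host sleep 600"
        i=1
        while [ "$i" -le "$nslaves" ]; do
            echo "ssh -f $host \"$RSCRIPT $SNOWLIB/snow/RSOCKnode.R MASTER=localhost PORT=$port OUT=/dev/null SNOWLIB=$SNOWLIB\""
            i=$((i + 1))
        done
        port=$((port + 1))
    done
}

ssh_cluster_cmds 10187 2 hostA hostB
```

Piping the output through sh would actually run the commands, once they look right.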

+1 if you'd like to see a doSSH package.
+10 if you'd like to help!

ssh port forwarding

Is it just me, or is the openssh port forwarding syntax *really* confusing?  Every time, I have to work through examples afresh.  It's ridiculously powerful: who can argue with key-based authentication on a user-land vpn that's available for all major OSes?  Still, I wish I *really* understood the syntax.
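
As a memory aid, here is a small sketch that spells out the two common forms, -L port:host:hostport and -R port:host:hostport. The helper explain_forward is something made up for this post, not part of OpenSSH, and it ignores the optional leading bind_address field. The part that trips me up is that host is looked up from the *far* end of the tunnel, not the end that listens.

```shell
# Sketch: spell out what a PORT:HOST:HOSTPORT forwarding spec means.
# explain_forward is a made-up helper, not part of OpenSSH.
explain_forward() {
    flag=$1; spec=$2
    port=${spec%%:*}        # first field: the port that listens
    rest=${spec#*:}
    host=${rest%%:*}        # second field: destination host
    hostport=${rest#*:}     # third field: destination port
    case $flag in
        -L) echo "listen on port $port on the ssh client; the ssh server connects onward to $host:$hostport" ;;
        -R) echo "listen on port $port on the ssh server; the ssh client connects onward to $host:$hostport" ;;
        *)  echo "usage: explain_forward -L|-R port:host:hostport" >&2; return 1 ;;
    esac
}

explain_forward -R 10187:localhost:10187
explain_forward -L 8080:intranet.example.com:80
```

So in the -R case "localhost" means the machine you ran ssh from, and in the -L case the destination is whatever the ssh server can reach under that name.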

I suppose practice makes perfect.  I'm working on a small project now.  With any luck, examples to come.

Openssh FAQ
Red Hat Magazine

03 February 2011

Postgresql Hacks -- firing a trigger "for each transaction"

The problem

I spent today learning about triggers in postgresql - there are triggers "for each statement" and "for each row". I went searching for triggers "for each transaction", and found a few other folks with the same question but no compelling answer.

My use case is this - I have 2 tables, a parent (entries) and child (reports), with a one-to-many relationship. I need to add multiple reports for one entry in a single transaction, and then update the associated entry if the transaction succeeds. Here are the table definitions:

create schema transtest;
set search_path to transtest,public;
    -- it's nice to have a sandbox
    -- remember to reset search_path when done

create table myents ( id serial primary key, sum integer);
    -- the entries table
create table myrecs ( id serial primary key, entid integer references myents (id), cases integer);
    -- the reports table, many reports per entry
create table myids ( id integer primary key references myents (id));
    -- table used to store trigger state

The solution

It appears that deferred constraint triggers are *almost* "per transaction". But constraint triggers are *always* for each row. Yet Postgresql 9.0 allows conditions on triggers - so all I need is to store some state of whether to fire the trigger or not for each row. First, the function that I'll be using in the condition:

-- function for when-clause (condition) in trigger
-- add id to the "to-do" list if it's not there already
-- return True only if it's been added to the list
create or replace function f_condition(myid integer) returns boolean as $_$
    DECLARE
        myret boolean := True;
        newid integer;
    BEGIN
        RAISE NOTICE 'Return Value here is %', myret;
        select count(*) into newid from myids where id = myid;
        if newid > 0 then -- id is there, return False 
           myret := False; 
        else    -- add id to queue, return True
            insert into myids (id) values (myid);
        end if;
        return myret;
    END;
$_$ language plpgsql;

Next, the trigger function, which will clean up after the above function when it's done:

-- trigger function to rebuild myents.sum for this entry
create or replace function f_refresh() returns trigger as $_$
    DECLARE
        mysum integer;
        myid integer;
    BEGIN
        IF (TG_OP = 'DELETE') THEN
            myid := OLD.entid;
        ELSE
            myid := NEW.entid;
        END IF;
        RAISE NOTICE 'Return Value here is %', myid;
        select sum(cases) into mysum from myrecs where entid = myid;
            -- compute entry sum from records and update it.
            -- PL/pgSQL rules on queries are quirky.  Read the docs.
        update myents set sum = mysum where id = myid;
        delete from myids where id = myid;
            -- clean up
        RETURN NULL;  -- the return value of an AFTER trigger is ignored
    END;
$_$ language plpgsql;

Finally, the trigger definitions using the above.

-- only fire trigger once per entry, using the when clause (condition)
-- eval is deferred until transaction concludes
create constraint trigger t_cache after insert or update 
    on myrecs initially deferred for each row 
    when ( f_condition(NEW.entid) )  -- comment out for pre pg 9.0
    execute procedure f_refresh();

-- need 2 triggers: one for NEW (insert and update) and one for OLD (delete)
create constraint trigger t_cache_del after delete
    on myrecs initially deferred for each row 
    when ( f_condition(OLD.entid) )
    execute procedure f_refresh();

The test

All software needs to be tested, right?
I don't have pg9.0 yet, and it won't hit Ubuntu mainline until Ubuntu Natty, but it looks easy enough to install via a backports repo. So, this isn't fully tested. Comment out the "when" lines in the trigger defines and it will work, albeit running once per row *after* the commit.

In any event, here goes! The sum should be zero until after the commit. If the when clause works correctly, then the trigger should only fire (and emit a notice) once. Likewise, the myids table should be empty.

insert into myents (sum) VALUES ( 0);
begin;
    insert into myrecs (entid, cases) VALUES ( 1, 0), (1, 1), (1,2);
    select * from myents;  -- 0
    select * from myids;   -- entid from above
commit;
select * from myents;   -- new sum
select * from myids;   -- empty

References -- Postgresql Docs:

02 February 2011

I've been obsessively following Egypt for the past few days. Both the BBC and Al Jazeera have excellent coverage, as does Google Realtime. After days of seeing "Have your say" links on the BBC (and reading tech and business stories to boot), I sent one in. Here goes:

### Begin Comments

The latest comments from both Mubarak and Obama indicate that change is on the horizon. How close the horizon actually is seems to be the biggest question on people's minds.

Several days ago, in the space of minutes, some 80 million people vanished from the internet. This likely happened at direct orders from the current regime - who else wields such power? This is one of the most significant "black marks" against the current regime - a totalitarian attempt to curtail information flow.

A resumption of basic information services is an absolute prerequisite to any transition. If the current regime is serious about stability, they will "turn the lights back on" ASAP. The fact that they haven't done so yet casts serious doubt on their intentions.