24 February 2011

snow and ssh -- secure inter-machine parallelism with R

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:
"
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.

... It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

... the SSHcluster method would require minimal invasion on those machines -- just ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!
"

Requirements:

R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost), the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permission to run Rscript and read the R libraries. Note that Rscript is far from a secure program to let untrusted users run. For the paranoid, place it in a chroot jail...
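A quick sanity check of the remote side before going further (this assumes Rscript is on the remote PATH; substitute the full path if it isn't, as in the loop below):

$ ssh remotehost "Rscript -e 'library(snow)'"   ## errors here mean R or snow isn't usable remotely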


Step-by-step:

The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from the command line (shown by $) or the R session in which work is to be done (shown by >). You will need to replace remotehost with the hostname or IP address of remotehost, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don't have to keep typing your key password again and again. Even better, set up an .ssh/config file for host-specific aliases and configuration.
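For example, a minimal ~/.ssh/config entry might look like this (hostname, user, and key path here are placeholders):

Host remotehost
    HostName remotehost.example.org
    User youruser
    IdentityFile ~/.ssh/id_rsa

With that in place, a bare "ssh remotehost" does the right thing everywhere below.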

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> require(snow); require(foreach); require(doSNOW);
> cl = makeCluster(rep('localhost',3), type='SOCK', manual=T)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; do
$     ssh -f remotehost "/usr/lib64/R/bin/Rscript \
$         ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R \
$         MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"
$ done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)

Conclusions?

This is a funny example of %dopar%; it's minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don't want to do.
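To put a number on that, time the same loop sequentially and in parallel (assuming the cluster above is still registered); the gap between the two is your communication overhead:

> system.time(foreach(b=iter(b, by='col'), .combine=cbind) %do% (a %*% b))
> system.time(foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b))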

Next up:
  • Write an R function that uses calls to system() to set up the tunnel and spawn the slave processes, with string substitution for remotehost, etc. (see the sketch after this list).
  • Set up a second port and socketConnection to read each slave's stdout and pick up errors. Perhaps call warning() on the output? Troubleshooting is so much easier when you can see what's happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where its packages live, right?
  • As written, it's a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to choose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
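Here's a rough sketch of that first item. The function name, the sleep-based startup ordering, and the hard-coded paths are my own placeholders, not a tested API:

makeSSHcluster <- function(remotehost, nslaves = 3, port = 10187,
                           rscript = "/usr/lib64/R/bin/Rscript",
                           snowlib = "~/R/x86_64-pc-linux-gnu-library/2.12") {
    require(snow)
    ## reverse tunnel: remotehost:port -> thishost:port, alive for 10 minutes
    system(sprintf("ssh -f -R %d:localhost:%d %s sleep 600",
                   port, port, remotehost))
    node <- file.path(snowlib, "snow", "RSOCKnode.R")
    ## 'sleep 2' gives makeCluster() below time to start listening
    ## before the slaves try to connect -- crude, but fine for a sketch
    cmd <- sprintf("ssh -f %s 'sleep 2 && %s %s MASTER=localhost PORT=%d OUT=/dev/null SNOWLIB=%s'",
                   remotehost, rscript, node, port, snowlib)
    for (i in seq_len(nslaves)) system(cmd, wait = FALSE)
    ## blocks until all nslaves slaves have phoned home
    makeCluster(rep("localhost", nslaves), type = "SOCK",
                manual = TRUE, port = port)
}

No error handling, no port collision checks, and the stdout socket from the second item is still missing -- but it's a start.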
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn't been updated since July 2008, and a Google search turns up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).

+1 if you'd like to see a doSSH package.
+10 if you'd like to help!

ssh port forwarding

Is it just me, or is the OpenSSH port-forwarding syntax *really* confusing?  Every time, I have to work through the examples afresh.  It's ridiculously powerful -- who can argue with key-based authentication on a user-land VPN that's available for all major OSes?  Still, I wish I *really* understood the syntax.
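For the record, the two directions side by side; the flag letter names the side that gets the new listening port (hosts and ports here are just examples):

$ ssh -L 8080:localhost:80 remotehost      ## -L: thishost:8080 -> remotehost's localhost:80
$ ssh -R 10187:localhost:10187 remotehost  ## -R: remotehost:10187 -> thishost's localhost:10187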


I suppose practice makes perfect.  I'm working on a small project now.  With any luck, examples to come.

References:
OpenSSH FAQ
Red Hat Magazine

03 February 2011

PostgreSQL Hacks -- firing a trigger "for each transaction"

The problem

I spent today learning about triggers in postgresql - there are triggers "for each statement" and "for each row". I went searching for triggers "for each transaction", and found a few other folks with the same question but no compelling answer.

My use case is this - I have 2 tables, a parent (entries) and child (reports), with a one-to-many relationship. I need to add multiple reports for one entry in a single transaction, and then update the associated entry if the transaction succeeds. Here are the table definitions:

create schema transtest;
set search_path to transtest, public;
    -- it's nice to have a sandbox
    -- remember to reset search_path when done

create table myents ( id serial primary key, sum integer);
    -- the entries table
create table myrecs ( id serial primary key, entid integer references myents (id), cases integer);
    -- the reports table, many reports per entry
create table myids ( id integer primary key references myents (id));
    -- table used to store trigger state

The solution

It appears that deferred constraint triggers are *almost* "per transaction". But constraint triggers are *always* fired for each row. Yet PostgreSQL 9.0 allows conditions on triggers -- so all I need is to store some state that says whether or not to fire the trigger for each row. First, the function that I'll be using in the condition:

-- function for when-clause (condition) in trigger
-- add id to the "to-do" list if it's not there already
-- return True only if it's been added to the list
create or replace function f_condition(myid integer) returns boolean as $_$
    DECLARE
        myret boolean := True;
        newid integer;
    BEGIN
        RAISE NOTICE 'Return Value here is %', myret;
        select count(*) into newid from myids where id = myid;
        if newid > 0 then -- id is there, return False 
           myret := False; 
        else    -- add id to queue, return True
            insert into myids (id) values (myid);
        end if;
        return myret;
    END;
$_$ language plpgsql;

Next, the trigger function, which will clean up after the above function when it's done:

-- trigger function to rebuild myents.sum for this record's entid
create or replace function f_refresh() returns trigger as $_$
    DECLARE
        mysum integer;
        myid integer;
    BEGIN
        IF (TG_OP = 'DELETE') THEN
            myid := OLD.entid;
        ELSE 
            myid := NEW.entid;
        END IF;
        RAISE NOTICE 'Return Value here is %', myid;
        select sum(cases) into mysum from myrecs where entid = myid;
            -- compute entry sum from records and update it.
            -- PL/pgSQL rules on queries are quirky.  Read the docs.
        update myents set sum = mysum where id = myid;
        delete from myids where id = myid;
            -- clean up
    RETURN NULL;
    END;
$_$ language plpgsql;

Finally, the trigger definitions using the above.

-- only fire the trigger once per myents.id, using the when clause (condition)
-- eval is deferred until transaction concludes
create constraint trigger t_cache after insert or update 
    on myrecs initially deferred for each row 
    when ( f_condition(NEW.entid) )  -- comment out for pre pg 9.0
    execute procedure f_refresh();

-- need two triggers: one for NEW (insert and update), one for OLD (delete)
create constraint trigger t_cache_del after delete
    on myrecs initially deferred for each row 
    when ( f_condition(OLD.entid) )
    execute procedure f_refresh();

The test

All software needs to be tested, right?
I don't have pg9.0 yet, and it won't hit Ubuntu mainline until Natty, but it looks easy enough to install via a backports repo. So, this isn't fully tested. Comment out the "when" lines in the trigger definitions and it will work, albeit running once per row *after* the commit.

In any event, here goes! The sum should be zero until after the commit. If the when clause works correctly, then the trigger should only fire (and emit a notice) once. Likewise, the myids table should be empty.

insert into myents (sum) VALUES ( 0);
begin;
    insert into myrecs (entid, cases) VALUES ( 1, 0), (1, 1), (1,2);
    select * from myents;  -- 0
    select * from myids;   -- entid from above
commit;
select * from myents;   -- new sum
select * from myids;   -- empty


02 February 2011

I've been obsessively following Egypt for the past few days. Both the BBC and Al Jazeera have excellent coverage, as well as Google Realtime. After days of seeing "Have your say" links on the BBC (and reading tech and business stories to boot), I sent one in. Here goes:

### Begin Comments

The latest comments from both Mubarak and Obama indicate that change is on the horizon. How close the horizon actually is seems to be the biggest question on people's minds.

Several days ago, in the space of minutes, some 80 million people vanished from the internet (see http://www.renesys.com/blog/). This likely happened at direct orders from the current regime - who else wields such power? This is one of the most significant "black marks" against the current regime - a totalitarian attempt to curtail information flow.

A resumption of basic information services is an absolute prerequisite to any transition. If the current regime is serious about stability, they will "turn the lights back on" ASAP. The fact that they haven't done so yet casts serious doubt on their intentions.

06 June 2010

User Interface Zen

Before Apple made pretty screens on cellphones and album art that you could flip through, there was Synaptics. Have you ever seen someone drawing on a pad bound to a laptop in a coffeeshop? Synaptics. Have you ever done a 2-fingered scroll? Synaptics. I was a perennial hater of laptop touchpads. Imprecise, too much work. A mouse gives wide sweeps, high precision. Then I slowly worked myself up to almost-full-time keyboard usage, a long barrage of shortcuts that I could fire off without ever lifting my fingers, frame after frame of text editor, browser, Google Scholar, Google Maps, and back.

I've always really *liked* keyboards, and now I absolutely depend upon them. Re-enter the 2.2 pound netbook. That's right - the 1 kg laptop. Alright, it has an annoyingly small battery compared to a very slightly larger 2.8 pound laptop, and the weight advantage is more or less obviated by the more frequent need to carry a 0.3 pound charger. Whatever. It still lasts twice as long as my "real" laptop (a Thinkpad T61 that I would *never* put in my lap, whose screen and keyboard I nonetheless adore).

Point is... the touchpad has come a long way. So far, in fact, that touchpads are now officially cool, as far as I can tell. With 2 fingers, I can now convey a *lot* of information, while moving my fingers less than with a mouse. The secret, from my point of view, is Synaptics, and more specifically, synclient. Under Ubuntu, the script below sets thresholds for two-finger motions, adds rules for "two-finger tap as middle click" (X Windows paste), and disables edge scrolling (otherwise known as the devil). I'm using an Asus 1005HAB (N270, $250 @ Best Buy with WinXP, after some deliberation, and some pretty awful "the customer is a retard" corporate customer service).

As an extended sidenote, do yourself a favor and remove the "rescue partition" before it wipes your computer clean with a random, unconfirmed hard-drive repartition when you accidentally boot into it with grub. It's the model with the little bumps on a touchpad that is flush with the case. I'm surprised at how not-annoying the bumps are. After turning up the sensitivity, it's, um, magic? I think "scroll down" and it suddenly happens. The wonders of multitouch. The happiness of the not-a-huge-tome laptop!

Here's my multitouch script:

#!/bin/bash
## synclient settings: two-finger scrolling and taps, no edge scrolling
synclient EmulateTwoFingerMinZ=5
synclient EmulateTwoFingerMinW=10
synclient VertTwoFingerScroll=1
synclient HorizTwoFingerScroll=1
synclient VertScrollDelta=75
synclient HorizScrollDelta=100
synclient JumpyCursorThreshold=100
synclient VertEdgeScroll=0
synclient HorizEdgeScroll=0
synclient TapButton2=2
synclient TapButton3=3

20 April 2010

Little R == r

There's big R, the R that I use to do most of my work, the environment that makes pretty graphics et al. It's like Matlab, only cooler. Or more cool. Or less uncool. You can see my prejudices here.

Today I discovered little R. It's like big R, only little. Holy shit.

Dirk gives a thorough rundown here: http://dirk.eddelbuettel.com/code/littler.html. Suffice it to say, for someone who's been using pipes and #!/usr/bin constructs for years (though not quite yet decades), this is cooler than cool. One might say, super-cool.

It's also a nice intro to R for some of the systems geeks out there. Need a million random numbers uniformly distributed between 0 and 1, specified to 7 decimal points? Need it in a file? Need it fast? r can help:
time r -q -e 'for (i in runif(1e6)) cat(sprintf("%1.7f\n", i))' >> randomnums
Or perhaps you have a million numbers in some file that you would like to plot as a histogram, fast, every day, in an automated fashion, from the command line...
cat randomnums | r -e 'myrandoms <- as.numeric(readLines()); png(filename="myplot.png");
hist(myrandoms);
dev.off()' >/dev/null
No, it won't munge strings with the ease of python, but it can chew through a spreadsheet and spit it out *fast*. And since it's a stream, you can always pipe it to/from python. If you ask me, pretty fucking cool.

07 April 2010

The new face of CouchSurfing

Really? I've gotten a few borderline disrespectful CS requests lately (my profile is here). Sure, young and inexperienced, have mercy on them, etc. etc. Nonetheless, I think it's time for a bit of gentle schooling...

The Request

>USERNAME: ***
>GENDER: Male
>AGE: 23
>LOCATION: United States - New York - ***
>
>ARRIVAL DATE: 4/9/10
>DEPARTURE DATE: 4/13/10
>NUMBER OF PEOPLE: 1
>ARRIVING VIA: Plane
>
>Hi,
>
>My name is *** ***. I'm going to Albuquerque this weekend to attend a seminar at the **** Institute. I know it's short notice, but I was wondering if you would be able to host. It's a lot of last minute planning, I just created this account because my girlfriend suggested I use couchsurfing to try to find a place to stay, so sorry that my profile isn't very extensive.
>
>Thanks!
>
>***

My Reply

First rule of a successful couchsearch - read the profile of the person that you're addressing.

Second rule of a successful couchsearch - tell the person what specifically about them you think would make you a good match. What do you bring to the table? What about them interests you?

Third rule of a successful couchsearch - add relevant information and a picture or 2 to your profile. A couchsearch is asking someone to take time out of a busy life to provide you with a place to stay. It makes sense to reciprocate the effort by spending some time to write out a real, decent profile that tells your prospective host something about who you are.

Good luck!
christian

13 January 2010

Postgres + PL/R = magic information swiss army knife

I just wrote up an extensive example using PL/R to build a logistic map for a range of values of r. The end result is a pretty picture generated within the database and handed back to the webserver. Very cool. For complete details, see http://www.joeconway.com/web/guest/pl/r/-/wiki/Main/Bytea+Graphing+Example
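For flavor, here's a stripped-down sketch of the idea -- a PL/R function that draws the map to a temp file and returns the bytes. The function name and plotting details are mine, not Joe's; see his page for the real, complete version:

-- illustrative sketch: returns a PNG of the logistic map as bytea (requires PL/R)
create or replace function logistic_png() returns bytea as $_$
    fn <- tempfile(fileext = ".png")
    png(fn, width = 600, height = 400)
    r <- seq(2.5, 4, length.out = 500)
    x <- rep(0.5, length(r))
    plot(NULL, xlim = range(r), ylim = c(0, 1), xlab = "r", ylab = "x")
    for (i in 1:300) x <- r * x * (1 - x)   # burn-in: discard transients
    for (i in 1:100) {                      # then plot 100 iterates per r
        x <- r * x * (1 - x)
        points(r, x, pch = ".")
    }
    dev.off()
    readBin(fn, "raw", n = file.info(fn)$size)
$_$ language plr;

The webserver side then just runs "select logistic_png()" and serves the result as image/png.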

27 October 2009

thesis take 7 part 2

human agriculture, which is central to human ecological dominance, in fact relies upon ecosystem disturbance.

as humans stabilize niches against disturbance, characteristic changes in species abundance and guild specialization may be expected. using two model guild strategies for exploiting disturbance (fire and flood), i aim to characterize the effect of disturbance suppression. negotiation of abiotic and intra-guild constraints is the primary mechanism of ecological prosperity considered here.

thesis topic take 7

environmental disturbance is a key ecological determinant of species composition. long lives and lack of motility in woody plants can lead to characteristic relationships to disturbance, either through close life-cycle coupling to dominant abiotic disturbance (e.g. flooding), or through biotic forcing of disturbance (e.g. fire).
these relationships to disturbance are mediated by nutrient and organism cycling, yielding causal, fitness-driven mechanisms of species abundance in woody plants. woody plants provide a fundamental, irreplaceable ecosystem service of energy production. driven by intra-guild competition and abiotic limitations such as water, nitrogen, and phosphorus, many woody plants have adapted to exploit a priori disturbance, or to drive new disturbance.

contrary to contemporary memes, many terrestrial ecosystems have been subject to forceful stabilization as a result of human activity. human suppression of fire and flooding has ecological consequences, especially for woody plants.