
18 June 2011

Efficient loops in R -- the complexity versus speed trade-off

I've written before about the up- and downsides of the plyr package -- I love its simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building an agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections.

My ballpark estimate was 1e6 agents on a weekly timestep for about 50 years. This is a stochastic model, so I'd like to replicate it at least 100 times. So, I'll need at least 50 years * ~50 weeks * 100 replicates = 250,000 steps. I figured I'd be happy if I could get my step runtimes down to ~1 second -- dividing the 100 runs over 4 cores, this would give a total runtime of ~17.5 hours. Not short, but not a problem.
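In round numbers:

## back-of-the-envelope runtime estimate
steps <- 50 * 50 * 100     # ~50 weeks/year * 50 years * 100 replicates = 250,000 steps
secs  <- steps * 1         # target of ~1 second per step
secs / (4 * 3600)          # 100 runs split over 4 cores: ~17.4 hours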

At first, I was disheartened to see the runtime of my prototype step function stretching into the minutes. What's going on? Well, I'd used plyr in a very inappropriate way -- for a very large loop. I began to investigate, and discovered that writing an aesthetically unappealing while loop gave me a 30+x speed-up.

All of which got me thinking -- how expensive are loops and function calls in R? Next week I'm leading a tutorial on R and C++ using the wonderful Rcpp and inline packages here at the Santa Fe Institute's 2011 Complex Systems Summer School. Might this make a nice example?

It does, and in spades. Below are the test functions; you can see that the code complexity increases somewhat from one to the next, but the outcomes are identical, again with a 30+x speedup for each subsequent case. Here, I'm using the native R API (Rf_rbinom) from C++. I also tested using Rcpp to import rbinom() itself (sketched below), but that ended up taking twice as long as the naive while loop.

So, the moral of the story seems to be that if you can write a long loop in pure C++, it's a really easy win.

Note -- The as< double >(y); in src below doesn't seem to copy-and-paste correctly for some reason. If testfun2 doesn't compile, check to make sure this bit pasted correctly.


The pure R function definitions

## use aaply -- the simplest code
require(plyr)
testfun0 <- function(x, y)  aaply(x, 1, function(x) rbinom(1, x, y))

## rewrite the above as an explicit loop
testfun1 = function(nrep, x, y) {
    ## loop index
    ii<-0;
    ## result vector
    ret<-rep(0, nrep);
    while(ii < nrep) {
        ii<-ii+1;
        ## how many successes for each element of bb?
        ret[ii] <- rbinom(1, x[ii], y)
    }
    return(ret)
}

Rcpp function definitions (with lots of comments)

## define source code as string, pass to inline
src <-  ' 
    // Cast input parameters to correct types
    // clone prevents direct modification of the input object
    IntegerVector tmp(clone(x));
    double rate = as< double >(y); 
    // IntegerVector provides STL-style methods, like .size() below
    int tmpsize = tmp.size(); 
    // initialize RNG from R set.seed() call
    RNGScope scope; 
    // loop
    for (int ii =0; ii < tmpsize; ii++) {
        // Rf_rbinom is in the R api
        // For more details, See Rcpp-quickref.pdf, "Random Functions"
        // also, Rcpp provides "safe" accessors via, e.g., tmp(ii)
        tmp(ii) = Rf_rbinom(tmp(ii), rate);
    };
    // tmp points to a native R object, so we can return it as-is
    // if we wanted to return, e.g., the double rate, use:
    // return wrap(rate)
    return tmp;
'

require(inline)
## compile the function, inspect the process with verbose=T
testfun2 = cxxfunction(signature(x='integer', y='numeric'), src, plugin='Rcpp', verbose=T)

## timing function
ps <- function(x) print(system.time(x))
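For the curious, the "import rbinom() via Rcpp" route mentioned above might look something like the sketch below -- testfun3 and src3 are illustrative names, not the exact code I timed:

## sketch: call back into R's rbinom() from C++ -- each call crosses the C++/R boundary
src3 <-  '
    IntegerVector tmp(clone(x));
    double rate = as< double >(y);
    // look up the R-level rbinom() function
    Function rbinom_r("rbinom");
    int tmpsize = tmp.size();
    for (int ii = 0; ii < tmpsize; ii++) {
        int n = tmp(ii);
        // R manages its own RNG state here, so no RNGScope is needed
        tmp(ii) = as< int >( rbinom_r(1, n, rate) );
    }
    return tmp;
'
testfun3 = cxxfunction(signature(x='integer', y='numeric'), src3, plugin='Rcpp', verbose=T)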

The tests

## Input vector
bb <- rbinom(1e5, 20, 0.5)
## test each case for 2 different loop lengths 
for( nrep in c(1e4, 1e5)){
    ## initialize RNG each time with the same seed
    ## plyr
    set.seed(20); ps(cc0 <- testfun0(bb[1:nrep], 0.1))
    ## explicit while loop
    set.seed(20); ps(cc1<- testfun1(nrep, bb, 0.1))
    ## Rcpp
    set.seed(20); ps(cc2<- testfun2(bb[1:nrep], 0.1))
    print(all.equal(cc1, cc2))
}


## output
   user  system elapsed 
  3.392   0.020   3.417 
   user  system elapsed 
  0.116   0.000   0.119 
   user  system elapsed 
  0.000   0.000   0.002 
[1] TRUE
   user  system elapsed 
 37.534   0.064  37.601 
   user  system elapsed 
  1.228   0.000   1.230 
   user  system elapsed 
  0.020   0.000   0.021 
[1] TRUE

Postlude

After posting this to the very friendly Rcpp-devel mailing list, I got an in-depth reply from Douglas Bates pointing out that, in this case, the performance of vanilla apply() beats the while loop by a narrow margin. He also gives an interesting example of how to use an STL template and an iterator to achieve the same result. I admit that templates are still near-magic to me, and for now I prefer the clarity of the above. Still, if you're curious, this should whet your appetite for some of the wonders of Rcpp.
## Using Rcpp, inline and STL algorithms and containers

## use the std::binary_function templated struct
inc <- '
    struct Rb : std::binary_function< double, double, double > {
        double operator() (double size, double prob) const { return ::Rf_rbinom(size, prob); }
    };
'

## define source code as a character vector, pass to inline
src <- '
    NumericVector sz(x);
    RNGScope scope;

    return NumericVector::import_transform(sz.begin(), sz.end(), std::bind2nd(Rb(), as< double >(y)));
'

## compile the function, inspect the process with verbose=TRUE
f5 <- cxxfunction(signature(x='numeric', y='numeric'), src, 'Rcpp', inc, verbose=TRUE)

11 June 2011

The importance of being unoriginal (and befriending google)

In search of bin counts

I look at histograms and density functions of my data in R on a regular basis. I have some idea of the algorithms behind these, but I've never had any reason to go under the hood until now. Lately, I've been using bin counts for things like Shannon entropy (via the very nice entropy package). I figured that binning and counting data would either be supported by a native, dedicated R package, or be quite simple to code. Not finding the former (base graphics' hist() uses .Call("bincounts"), which appears undocumented and has a boatload of arguments), I naively failed to search for a package and coded up the following.

myhist = function(x, dig=3)  {
    ## NB: trunc() silently ignores the digits argument -- see "Trouble in paradise" below
    x=trunc(x, digits=dig);
    ## x=round(x, digits=dig);
    ## grid of candidate bin values
    aa = bb = seq(0,1,1/10^dig);
    ## count exact matches against each grid point
    for (ii in 1:length(aa)) {
        aa[ii] = sum(x==aa[ii])
    };
    return(cbind(bin=bb, dens=aa/length(x)))
}


## random variates
test = sort(runif(1e4))
get1 = myhist(test)

Trouble in paradise

The idea: truncate the data to a specified precision, and count how many values land on each grid point. Well, first I tried round(x, digits=dig) instead of trunc(x), which sorta makes sense but gives results that I still don't understand. On the other hand, trunc(x) doesn't take a digits argument? WTF? Of course, I could use sprintf() to make a character string of known precision and convert back to numeric, but string handling is waaaaaay too much computational overhead. Like towing a kid's red wagon with a Land Rover...

Dear Google...

An hour of irritation and confusion later, I ask google and, small wonder, the second search result links to the ash package, which contains said tool. And it runs somewhere between 100 and 1,000 times faster. It doesn't return the bin boundaries by default, but it's good enough for a quick-and-dirty empirical probability mass function.

To be fair, there's something to be said for cooking up a simple solution to a simple problem, and then realizing that, for one reason or another, the problem isn't quite as simple as one first thought. On the other hand, sometimes we just want answers. When that's the case, asking google is a pretty good bet.

## their method
require(ash)
get2 = bin1(test, c(0,1), 1e3+1)$nc
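A rough timing comparison -- exact numbers will vary by machine:

## hand-rolled loop vs. ash's bin1
system.time( myhist(test) )
system.time( bin1(test, c(0,1), 1e3+1) )   # 100-1,000x faster, as noted above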

10 May 2011

Problems with plyr -- the memory/complexity trade-off

Two types of R users


My overwhelming impression from UseR 2010 is that, generally speaking, there are 2 types of regular R users -- those who have heard of the *apply() functions and are made uncomfortable by them, and those who really get it. In the UNM R programming group that I've been leading for about a year now, I've really tried to get people over the hump and into the second group. Once there, folks seem to really appreciate the amazing power of vectorization in R, and begin to enjoy writing code. The conceptual clarity of:

mymatrix = matrix(1:10, nrow=10)
  apply(mymatrix, 1, sum)
  apply(mymatrix, 2, sum)
 

over the clunky:

rowSums(mymatrix)
  colSums(mymatrix)

may not be immediately apparent. Eventually, though, folks go searching for things like rowMedian() and become frustrated that R doesn't "have all the functions" that they need. Well, R does, once you grok apply().
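For instance, the rowMedian() that folks go looking for is just:

## a "rowMedian" via apply
apply(mymatrix, 1, median)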

Hadley's Magic Brainchild

In the last year, I've had some serious aha! moments with plyr. Lately, I've added the reshape package to the mix to achieve some serious R 1-liner Zen. Multidimensional arrays to long-form dataframes?

require(plyr); require(reshape)
myarray = array(0, dim=c(3,4,5,6,10), dimnames=list(a=1:3, b=1:4, c=1:5, d=letters[1:6], e=LETTERS[1:10]))
melt( adply(myarray, 1:4, I))

Sure. No problem. It's really that easy?!


An unexpected bonus? This way of thinking lends itself nicely to explicit parallelism. If you can describe the problem as a single recipe that's applied to each "piece" of a whole, then you're one step away from using all the cores on your machine to solve a problem with foreach. Need to apply some sort of long-running analysis to each element of a list, and want to return the results as a list? Why write a for loop when you can do:

## prep
## warning -- this may not be fast
## use smaller matrices for smaller machines!
require(foreach)
require(iterators)
nn = 2^20
rr = 2^10
mylist = list(a=matrix(rnorm(nn), nrow=rr), b=matrix(rnorm(nn), nrow=rr), c=matrix(rnorm(nn), nrow=rr))

## analysis
myeigs = foreach( ii = iter(mylist), .combine=c) %do% { print('ping'); return(list(eigen(ii)))}

and then, once it works, change %do% to %dopar%, add the following, and you're off to the races!

require(multicore)
  require(doMC)
  ## use the appropriate # of cores, of course
  registerDoMC(cores=4)
  myeigs1 = foreach( ii = iter(mylist), .combine=c) %dopar% { print('ping'); return(list(eigen(ii)))}

Compared to something like llply, your list names don't automatically propagate to the results, but I think this is still pretty amazing. Complete control and simple parallelism.
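For comparison, the llply version is a one-liner, and it does keep the list names:

## llply equivalent -- names(mylist) carry through to the result
require(plyr)
myeigs2 = llply(mylist, eigen)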

Debugging with %dopar% is tricky, of course, because each piece is evaluated in a separate process (I think), and messages (such as calls to print()) don't return to the console as you might expect them to. So, when in doubt, back off to %do%.

What could possibly go wrong?

The only problem with all of this is that, when tasks are embarrassingly parallel, the data also becomes embarrassingly parallel, to the point where it no longer fits into memory. Thus, I returned today to a long-running bootstrap computation to find R consuming ~7 GB RAM, 5+ GB swap, and this message:

Error: cannot allocate vector of size 818.6 Mb
Enter a frame number, or 0 to exit   
...
4: mk.bootrep(zz, 75)
5: mk.boot.R#165: aaply(tmpreps, .margins = 1, function(x) {
6: laply(.data = pieces, .fun = .fun, ..., .progress = .progress, .drop = .dro
7: list_to_array(res, attr(.data, "split_labels"), .drop)
8: unname(unlist(res))

What's happening is that plyr is trying to do everything at once. As anyone who's used grep can tell you, doing one row at a time -- streaming the data -- is often a much better idea. I got the intended results from the call above by pre-allocating an array and writing each item of my results list into it in a loop; that version finished in seconds and barely broke 3 GB of RAM.
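The fix looked roughly like the following sketch, where reslist and its dimensions stand in for the real bootstrap objects:

## pre-allocate the full array, then fill it one piece at a time
nrep = length(reslist)
out = array(NA_real_, dim=c(nrep, nrow(reslist[[1]]), ncol(reslist[[1]])))
for (ii in seq_along(reslist)) {
    ## write each piece into place -- nothing grows, nothing gets copied wholesale
    out[ii, , ] = reslist[[ii]]
}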

Now, nothing here is really news. The dangers of "Growing Objects" are covered in Circle 2 of Burns Statistics' wonderful R Inferno. Still, plyr strikes me as an interesting case where reducing conceptual complexity can lead to a rather steep increase in computational complexity. And the most interesting thing of all is that it happens quite suddenly above a certain threshold.

Parting thoughts

I wonder if there's any major barrier to a stream=TRUE argument for the plyr functions -- I haven't thought about it too much, but I imagine that you'd also need a finalizer function to prepare the return object to be written/streamed into. At what point is it easier to just do it by hand with a for loop?

Honestly, I don't know the answer. I don't do too many things that break plyr, but I've learned how important it is to understand when I'm likely to exceed its limits. .variables in ddply is another one that I've learned to be careful with. If, after subdividing your input data.frame, you end up with 1e5 or 1e6 pieces, things start to break down pretty fast.
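Something along these lines (a sketch with arbitrary sizes) is where I start to see trouble:

## ~1e5 pieces after splitting on .variables -- this is where things bog down
require(plyr)
df = data.frame(grp=sample(1e5, 1e6, replace=TRUE), val=rnorm(1e6))
system.time( res <- ddply(df, .(grp), summarise, m=mean(val)) )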

Honestly, I love writing solutions with the *apply() family and the ddply functions. I think it makes code cleaner and more logical. Watching the light of this dawn in other R users' eyes is truly exciting. Yet it pays to remember that, like all things, it has its limits. In this case, one seems to reach the limits suddenly and harshly.

29 March 2011

Why PKI matters.

This article landed in my inbox this week in a newsletter from the EFF. I usually don't read them, but the term "meltdown" caught my eye, what with all the nuke news this month. They also managed to work in "too big to fail", and neither reference was hyperbolic. The internet depends on a level of trust, and (surprise) there are people working to co-opt that trust.

A number of my friends think I'm a little weird for using real passwords, for not sharing them, etc. But I read Cryptonomicon and friends; I have at least one box exposed to the real, honest-to-god, jungle-out-there internet; I have a healthy fear of all the things that could go wrong. As part of my lab group's data entry project, I recently registered my first SSL certificate from http://www.startssl.com/ (free!), and learned quite a bit about PKI in the process.

Still, reading an article like this really drives home both the complexity and the importance of the global PKI system. Trust is difficult when it's strung across the globe on fiber optic cables, and enforced by our inability to quickly factor very large numbers. But old-school techniques of impersonation and breaking-and-entering will always be with us. I may trust google.com, but how do I know that it's actually them? PKI.

How cool is that?
Very.

20 March 2011

Looking at the "Curse of Dimensionality" with R, foreach, and lattice

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R.

It's neat to see distance scaling linearly with standard deviation, and with the Lth root of dimension for an L-norm metric (e.g. Euclidean = L2 norm, which grows like the square root of dimension). It took me a while to "get" how the maximum metric worked -- it's simpler than I was trying to make it, though I'm still not sure I can explain it properly.
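A quick sanity check of the L2 case, separate from the full simulation below:

## mean distance between 2 points drawn from N(0, sd^2) in n dimensions
set.seed(1)
for (n in c(10, 100, 1000)) {
    d = mean(replicate(200, as.numeric(dist(matrix(rnorm(2*n), nrow=2)))))
    print(c(dim=n, mean.dist=d))
}
## successive ratios come out near sqrt(10) -- Euclidean distance grows like the square root of dimension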

In general, I'm a big fan of R's lattice package, though I'm never quite happy with the legend/key settings.

The code is below. One thing that I've learned from this exercise is that I prefer not to have the foreach call in the outermost loop. For one, error and/or progress reporting from inside a foreach isn't immediately emitted to the console, and it's annoying/confusing to have a long run with no feedback. Second, and less aesthetically, it seems that foreach excels when it has jobs of medium complexity iterated over a medium number of elements. Too few elements + very complex jobs means one or more cores will finish early and sit idle. Too many elements + very simple jobs means too many jobs will be spawned, and (I'm guessing here) a lot of time will be wasted on job setup and teardown.

dimcurse = function(dims=1:1e3, .sd=c(1e-2, 0.1, .2, .5, 1), reps=1e3, cores=4,
                    rfun=rnorm, .methods=c('manhattan', 'euclidean', 'maximum'), .verbose=F){
    ## examine distance between 2 points in dims dimensions for different metrics
    ## chosen at random according to rfun (rnorm by default)
    ##
    require(foreach)
    require(doMC)
    registerDoMC(cores=cores)
    ## use multicore/foreach on the outside (each sd)
    finret = foreach(ss=iter(.sd), .combine=rbind, .verbose=.verbose) %dopar% {
        ## pre-allocate, don't append/rbind!
        ret = data.frame(.sd=numeric(reps*length(dims)*length(.methods)), .dim=1, .rep=1, .dist=1, method=factor(NA, levels=.methods))
        ii = 1
        for (.dim in dims) {
          for (method in .methods) {
            ## reset .rep counter
            .rep = 1
            while (.rep <= reps) {
                ## 2 points, one per row, randomly chosen
                tmp = matrix(rfun(2*.dim, sd=ss), nrow=2)
                ## calc dist
                mydist = as.numeric(dist(tmp, method=method))
                ## fill return dataframe
                ret[ii, ] = list(ss, .dim, .rep, mydist, method)
                ## update rep counter
                .rep = .rep +1
                ## update ret row counter
                ii = ii +1
            }
          }
        }
        return(ret)
    } ## rbind happens here
    return(finret)
}

For plotting, I've used the following, where aa holds the data frame returned by dimcurse():
xyplot(.dist ~ (.dim)|method, groups=.sd, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(3,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=5, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying sd (legend) using different metrics (columns)',
)

xyplot(.dist ~ (.dim)|as.character(.sd), groups=method, data =aa, pch='.',
    par.settings=list(background=list(col='darkgrey')), layout=c(5,1),
    scales=list(y=list(relation='free'), alternating=1),
    auto.key=list(space='bottom', columns=3, points=F, lines=T),
    xlab='Dimension', ylab='Distance', main='Distance between 2 N-dimensional points \n drawn from a normal distribution with varying SD (columns) using different metrics (legend)',
)

24 February 2011

snow and ssh -- secure inter-machine parallelism with R

I just threw a post up on Revolutions, which got a lot longer than I planned. And got me thinking. And reading (see refs in previous post). And trying. Turns out that it was way easier than I thought!

The problem:

From the blog post:
"
OpenSSH is now available on all platforms. A sensible solution is to have *one* cluster type -- ssh. Let ssh handle inter-computer connection and proper forwarding of ports, and for each machine to connect to a local port that's been forwarded.

... It seems like most of the heavy lifting is done by foreach, and that engineering a simple makeSSHcluster and doSSH to forward ports, start slaves, and open socket connections should be tractable.

... the SSHcluster method would require minimal invasion on those machines -- just ability to execute ssh and Rscript on the remote machines -- not even login privileges are required!
"

Requirements:

R must be installed on both machines, as well as the snow package. On the machine running the primary R session (henceforth referred to as thishost), the foreach and doSNOW packages must be installed, as well as an ssh client. On remotehost, Rscript must be installed, an ssh server must be running, and the user must have permission to run Rscript and read the R libraries. Note that Rscript is far from a secure program to allow untrusted users to run. For the paranoid, place it in a chroot jail...


Step-by-step:

The solution here piggy-backs on existing snow functionality. All commands below are to be run from thishost, either from the command line (shown by $) or from the R session in which work is to be done (shown by >). You will need to replace remotehost with the hostname or IP address of the remote machine, but localhost should be entered verbatim.

Note that the ssh connections are much easier using key-based authentication (which you were using already, right?), and a key agent so you don't have to keep typing your key password again and again. Even better, set up an .ssh/config file for host-specific aliases and configuration.

Now, onto the code!

$ ssh -f -R 10187:localhost:10187 remotehost sleep 600  ## 10 minutes to set up connection

> require(snow); require(foreach); require(doSNOW);
> cl = makeCluster(rep('localhost',3), type='SOCK', manual=T)  ## wait for 3 slaves

## Use ssh to start the slaves on remotehost and connect to the (local) forwarded port

$ for i in {1..3}; 
$  do ssh -f remotehost "/usr/lib64/R/bin/Rscript ~/R/x86_64-pc-linux-gnu-library/2.12/snow/RSOCKnode.R 
$       MASTER=localhost PORT=10187 OUT=/dev/null SNOWLIB=~/R/x86_64-pc-linux-gnu-library/2.12"; 
$ done

## R should have picked up the slaves and returned.  If not, something is wrong.
## Back up and test ssh connectivity, etc.

> registerDoSNOW(cl)

>  a <- matrix(1:4e6, nrow=4)
>  b <- t(a)
>  bb <-   foreach(b=iter(b, by='col'), .combine=cbind) %dopar%
>    (a %*% b)

Conclusions?

This is a funny example of %dopar%; it's minimal and computationally useless, which is nice for testing. Still, it allows you to directly test the communication costs of your setup by varying the size of a. In explicit parallelism, gain is proportional to (computational cost)/(communication cost), so the above example is exactly what you don't want to do.
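For instance, something like this sketch varies the amount of data shipped per task while keeping the number of tasks fixed at 4:

## same (useless) computation, increasing amounts of data to ship to the slaves
for (nc in c(1e4, 1e5, 1e6)) {
    a <- matrix(1:(4*nc), nrow=4)
    b <- t(a)
    print(system.time(
        foreach(b=iter(b, by='col'), .combine=cbind) %dopar% (a %*% b)
    ))
}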

Next up:
  • Write an R function that uses calls to system to set the tunnel up and spawn the slave processes, with string substitution for remotehost, etc. (see the sketch after this list)
  • Set up a second port and socketConnection to read each slave's stdout to pick up errors. Perhaps call warning() on the output? Trouble-shooting is so much easier when you can see what's happening!
  • Clean up the path encodings for SNOWLIB and RSOCKnode.R. R knows where its packages live, right?
  • As written, a little tricky to roll multiple hosts into the cluster (including thishost). It seems possible to chose port numbers sequentially within a range, and connect each new remotehost through a new port, with slaves on thishost connecting to a separate, non-forwarded port.
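A very rough sketch of that first item -- the function names, paths, and defaults below are placeholders:

## wrap the shell commands above in R via system()
makeSSHtunnel <- function(remotehost, port=10187, lifetime=600) {
    ## forward the remote port back to this machine for the slaves to connect through
    system(sprintf("ssh -f -R %d:localhost:%d %s sleep %d", port, port, remotehost, lifetime))
}

startSSHslaves <- function(remotehost, nslaves=3, port=10187,
                           snowlib="~/R/x86_64-pc-linux-gnu-library/2.12") {
    ## spawn the slave processes on remotehost, pointing them at the forwarded port
    for (ii in seq_len(nslaves)) {
        system(sprintf("ssh -f %s \"Rscript %s/snow/RSOCKnode.R MASTER=localhost PORT=%d OUT=/dev/null SNOWLIB=%s\"",
                       remotehost, snowlib, port, snowlib))
    }
}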
Those are low-hanging fruit. It would be nice to cut out the dependency on snow and doSNOW. The doSNOW package contains a single 1-line function. On the other hand, the snow package hasn't been updated since July 2008, and a google search turns up many confused users and few success stories (with the exception of this very helpful thread, which led me down this path in the first place).

+1 if you'd like to see a doSSH package.
+10 if you'd like to help!

ssh port forwarding

Is it just me, or is the openssh port forwarding syntax *really* confusing? Every time, I have to work through the examples afresh. It's ridiculously powerful -- who can argue with key-based authentication on a user-land VPN that's available for all major OSes? Still, I wish I *really* understood the syntax.


I suppose practice makes perfect.  I'm working on a small project now.  With any luck, examples to come.

References:
Openssh FAQ
Red Hat Magazine

03 February 2011

PostgreSQL Hacks -- firing a trigger "for each transaction"

The problem

I spent today learning about triggers in postgresql - there are triggers "for each statement" and "for each row". I went searching for triggers "for each transaction", and found a few other folks with the same question but no compelling answer.

My use case is this - I have 2 tables, a parent (entries) and child (reports), with a one-to-many relationship. I need to add multiple reports for one entry in a single transaction, and then update the associated entry if the transaction succeeds. Here are the table definitions:

create schema transtest;
set search_path to transtest,public;
    -- it's nice to have a sandbox
    -- remember to reset search_path when done

create table myents ( id serial primary key, sum integer);
    -- the entries table
create table myrecs ( id serial primary key, entid integer references myents (id), cases integer);
    -- the reports table, many reports per entry
create table myids ( id integer primary key references myents (id));
    -- table used to store trigger state

The solution

It appears that deferred constraint triggers are *almost* "per transaction". But constraint triggers are *always* fired for each row. Yet PostgreSQL 9.0 allows conditions (WHEN clauses) on triggers - so all I need is to store some state recording whether the trigger still needs to fire for a given id. First, the function that I'll be using in the condition:

-- function for when-clause (condition) in trigger
-- add id to the "to-do" list if it's not there already
-- return True only if it's been added to the list
create or replace function f_condition(myid integer) returns boolean as $_$
    DECLARE
        myret boolean := True;
        newid integer;
    BEGIN
        RAISE NOTICE 'Return Value here is %', myret;
        select count(*) into newid from myids where id = myid;
        if newid > 0 then -- id is there, return False 
           myret := False; 
        else    -- add id to queue, return True
            insert into myids (id) values (myid);
        end if;
        return myret;
    END;
$_$ language plpgsql;

Next, the trigger function, which will clean up after the above function when it's done:

-- trigger function to rebuild myent.sum for this myrecs.id
create or replace function f_refresh() returns trigger as $_$
    DECLARE
        mysum integer;
        myid integer;
    BEGIN
        IF (TG_OP = 'DELETE') THEN
            myid := OLD.entid;
        ELSE 
            myid := NEW.entid;
        END IF;
        RAISE NOTICE 'Return Value here is %', myid;
        select sum(cases) into mysum from myrecs where entid = myid;
            -- compute entry sum from records and update it.
            -- PL/pgSQL rules on queries are quirky.  Read the docs.
        update myents set sum = mysum where id = myid;
        delete from myids where id = myid;
            -- clean up
    RETURN NULL;
    END;
$_$ language plpgsql;

Finally, the trigger definitions using the above.

-- only fire trigger once per myent.id using when clause (condition)
-- eval is deferred until transaction concludes
create constraint trigger t_cache after insert or update 
    on myrecs initially deferred for each row 
    when ( f_condition(NEW.entid) )  -- comment out for pre pg 9.0
    execute procedure f_refresh();

-- need 2, one for each for NEW (insert and update) and OLD (delete)
create constraint trigger t_cache_del after delete
    on myrecs initially deferred for each row 
    when ( f_condition(OLD.entid) )
    execute procedure f_refresh();

The test

All software needs to be tested, right?
I don't have pg9.0 yet, and it won't hit Ubuntu mainline until Ubuntu Natty, but it looks easy enough to install via a backports repo. So, this isn't fully tested. Comment out the "when" lines in the trigger definitions and it will work, albeit running once per row *after* the commit.

In any event, here goes! The sum should be zero until after the commit. If the when clause works correctly, then the trigger should only fire (and emit a notice) once. Likewise, the myids table should be empty.

insert into myents (sum) VALUES ( 0);
begin;
    insert into myrecs (entid, cases) VALUES ( 1, 0), (1, 1), (1,2);
    select * from myents;  -- 0
    select * from myids;   -- entid from above
commit;
select * from myents;   -- new sum
select * from myids;   -- empty

References -- PostgreSQL Docs:

02 February 2011

I've been obsessively following Egypt for the past few days. Both the BBC and Al Jazeera have excellent coverage, as well as Google Realtime. After days of seeing "Have your say" links on the BBC (and reading tech and business stories to boot), I sent one in. Here goes:

### Begin Comments

The latest comments from both Mubarak and Obama indicate that change is on the horizon. How close the horizon actually is seems to be the biggest question on people's minds. Several days ago, in the space of minutes, some 80 million people vanished from the internet (see http://www.renesys.com/blog/). This likely happened at direct orders from the current regime - who else wields such power? This is one of the most significant "black marks" against the current regime - a totalitarian attempt to curtail information flow. A resumption of basic information services is an absolute prerequisite to any transition. If the current regime is serious about stability, they will "turn the lights back on" ASAP. The fact that they haven't done so yet casts serious doubt on their intentions.

06 June 2010

User Interface Zen

Before Apple made pretty screens on cellphones and album art that you could flip through, there was Synaptics. Have you ever seen someone drawing on a pad bound to a laptop in a coffeeshop? Synaptics. Have you ever done a 2-fingered scroll? Synaptics. I was a perennial hater of laptop touchpads. Imprecise, too much work. A mouse gives wide sweeps, high precision. Then I slowly worked myself up to almost-full-time keyboard usage, a long barrage of shortcuts that I could fire off without ever lifting my fingers, frame after frame of text editor, browser, google scholar, google maps, and back.

I've always really *liked* keyboards, and now I absolutely depend upon them. Re-enter the 2.2 pound netbook. That's right - the 1 kg laptop. Alright, it has an annoyingly small battery compared to a very slightly larger 2.8 pound laptop, and the weight savings is more or less obviated by the more frequent need to carry a 0.3 pound charger. Whatever. It still lasts twice as long as my "real" laptop (a Thinkpad T61 that I would *never* put in my lap, whose screen and keyboard I nonetheless adore).

Point is... the touchpad has come a long way. So far, in fact, that touchpads are now officially cool, as far as I can tell. With 2 fingers, I can now give a *lot* of information, while moving my fingers less than with a mouse. The secret, from my point of view, is Synaptics, and more specifically, synclient. Under Ubuntu, the script below sets thresholds for two-finger motions, adds rules for "two finger tap as middle click" (X Windows paste), and disables edge scrolling (otherwise known as the devil). I'm using an Asus 1005HAB (N270, $250 @ Best Buy with WinXP, after some deliberation, and some pretty awful "the customer is a retard" corporate customer service).

As an extended sidenote, do yourself a favor and remove the "rescue partition" before it wipes your computer clean with a random, unconfirmed hard drive repartition when you accidentally boot into it with grub. It's the model with the little bumps on a touchpad that is flush with the case. I'm surprised at how not-annoying the bumps are. After turning up the sensitivity, it's, um, magic? I think "scroll down" and it suddenly happens. The wonders of multitouch. The happiness of the not-a-huge-tome laptop!

Here's my multitouch script:

#!/bin/bash
##
synclient EmulateTwoFingerMinZ=5
synclient EmulateTwoFingerMinW=10
synclient VertTwoFingerScroll=1
synclient HorizTwoFingerScroll=1
synclient VertScrollDelta=75
synclient HorizScrollDelta=100
synclient JumpyCursorThreshold=100
synclient VertEdgeScroll=0
synclient HorizEdgeScroll=0
synclient TapButton2=2
synclient TapButton3=3