
23 August 2013

Everyday revision control

This post has been a long time coming. Over the past year or so, I've gradually become familiar, even comfortable, with git. I've mainly used it for my own work, rather than as a collaborative tool. Most of the folks that I work with don't need to share code on a day-to-day basis, and there's a learning curve that few of my current colleagues seem interested in climbing at this point. This hasn't stopped me from *talking* to my colleagues about git as an important tool in reproducible research (henceforth referred to as RR).

I find that the process of committing files and writing commit messages at the end of the day forces me to tidy up. It also makes it much easier to put a project on hold for weeks or months and then return to it with a clear understanding of what I'd been working on, and what work remained. In short, I use my git commit messages very much like a lab notebook (a countervailing view on git and reproducible research is here, an interesting discussion of GNU make and RR here, and a nice post on RR from knitr author Yihui Xie here).
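The ritual itself is nothing fancy -- something like the following, where the commit message is the day's lab notebook entry:

## end of day: review and stage changes to tracked files
git status
git add -u
## write the day's "lab notebook entry"
git commit
## weeks later: read the notebook to pick up where I left off
git log --stat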


Sidenote: I've hosted several projects at https://code.google.com/, and used their git repositories, particularly for classes (I prefer the interface to github, though the two platforms are similar). I've also increasingly used Dropbox for collaborations, and I've struggled to integrate Dropbox and git. Placing the same file under the control of two very different synchronization tools strikes me as a Bad Idea (TM), and Dropbox's handling of symlinks isn't very sophisticated. On the other hand, maintaining two different file-trees for one project is frustrating. I haven't found a good solution for this yet...

As far as tools go, most of the time I simply edit files with vim and commit from the commandline. In this sense, git has barely changed my workflow, other than demanding a bit of much-needed attention to organization on my part. Lately, I've started using GUI tools to easily visualize repositories, e.g. to simultaneously see a commit's date, message, files, and diff. Both gitk and giggle have similar functionality -- giggle is prettier, gitk is cleaner. Another interesting development is that Rstudio now includes git tools (as well as native LaTeX and knitr support in the Rstudio editor). This means that a default Rstudio install has all the tools necessary for a collaborator to quickly and easily check out a repository and start playing with it.

09 August 2013

Adventures with Android

After months of dealing with an increasingly sluggish and downright buggy Verizon HTC Rhyme, I finally took the leap and got a used Galaxy Nexus. First off, I think it's beautiful. The Rhyme isn't exactly a high-end phone, so its small, unexciting screen isn't particularly surprising. By comparison, the circa 2011 Nexus is a work of art. My first impression of the AMOLED screen is great: dark blacks and luscious color saturation (though I have found it to be annoyingly shiny -- screens shouldn't be mirrors!). It even has a barometer!

One big motivation for a phone upgrade (aside from the cracked screen and aforementioned lag) was being stuck at Android 2.3.4 (Gingerbread). As a technologist, I don't consider myself an early adopter. I prefer to let others sort out the confusion of initial releases, and pick out the gems that emerge. But Gingerbread is well over 2 years old, and a lot has happened since then. The Galaxy Nexus (I have the Verizon CDMA model, codename Toro) is a skin-free, pure Android device. Which means that I am now in control of my phone's Android destiny!

How to go about this? I hit the web and cobbled together a cursory understanding of the Android/Google-phone developer ecosystem as it currently stands. First off, there's xda-developers, a very active community of devs and users. There's an organizational page for information on the Galaxy Nexus here that helped me get oriented. This post made installing adb and fastboot a snap on Ubuntu 12.04 (precise). There's also some udev magic from Google under the "Configuring USB Access" section here that I followed, perhaps blindly (though http://source.android.com is a good primary reference...).
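For the record, the udev magic boils down to a rules file along these lines (a sketch based on the source.android.com instructions; 18d1 is Google's USB vendor ID and 04e8 is Samsung's, and the Galaxy Nexus can show up as either depending on its mode -- substitute your own username):

## /etc/udev/rules.d/51-android.rules
SUBSYSTEM=="usb", ATTR{idVendor}=="18d1", MODE="0600", OWNER="<username>"
SUBSYSTEM=="usb", ATTR{idVendor}=="04e8", MODE="0600", OWNER="<username>"
## then reload the rules
sudo udevadm control --reload-rules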

Next, I downloaded ClockworkMod for my device, rebooted my phone into the bootloader, and installed and booted into ClockworkMod:

## these commands are from computer connected to phone (via usb cable)
## check that phone is connected.  this should return "device"
adb get-state
## reboot the phone into the bootloader
adb reboot-bootloader

## in the bootloader, the phone shows a funny picture of the android bot opened up... reminds me of Bender.
## the actual file (the last argument) will vary by device
fastboot flash recovery recovery-clockwork-touch-6.0.3.5-toro.img
## boot into ClockworkMod 
fastboot boot recovery-clockwork-touch-6.0.3.5-toro.img

This brought me to the touchscreen interface of ClockworkMod. First, I did a factory reset/clear cache as per others' instructions. Then I flashed the files listed here (with the exception of root) via sideloading. There's an option in ClockworkMod that says something like "install from sideload". Selecting this gives instructions -- basically, use adb to send the files, and then ClockworkMod takes care of the rest:

## do this on computer after selecting "install from sideload" on phone
## the ROM, "pure_toro-v3-eng-4.3_r1.zip", varies by device
adb sideload pure_toro-v3-eng-4.3_r1.zip
## repeat for all files that need to be flashed

I rebooted into a shiny new install of Jelly Bean (4.3). It's so much cleaner and more pleasant than my old phone. I was also pleasantly surprised to see that Android Backup service auto-installed all my apps from the Rhyme.

In the process of researching this, I got a much better idea of what CyanogenMod does. I'm tempted to try it out now, but I reckon I'll wait for the 4.3 release, whenever that happens.


I also found http://www.htcdev.com/bootloader, which offers the prospect of unlocking and upgrading the HTC Rhyme, though I haven't found any ROMs that work for the CDMA version...

13 June 2013

Secure webserver on the cheap: free SSL certificates

Setting up an honest, fully-certified secure webserver (e.g. https) on the cheap can be tricky, mainly due to certificates. Certificates are only issued to folks who can prove they are who they say they are. This verification generally takes time and energy, and therefore money. But the great folks at https://www.startssl.com/ have an automated system that verifies identity and automatically issues the associated SSL certificates for free.

Validating an email address is easy enough, but validating a domain is trickier -- it requires a receiving mailserver that StartSSL can mail a verification code to. Inbound port 25 (mail server) is blocked by my ISP, the University of New Mexico (and honestly, I'd rather not run an inbound mail server anyway).

I manage my personal domain through http://freedns.afraid.org/. They provide full DNS management, as well as some great dynamic DNS tools. They're wonderful. But they don't provide any fine-grained email management, just MX records and the like.

The perfect companion to afraid.org is https://www.e4ward.com/. They have mail servers that will conditionally accept mail for specific addresses at a personal domain, and forward that mail to an email account. This lets me route specific addresses @mydomain.com, things like postmaster@mydomain.com, to my personal gmail account. E4ward is a real class act. They manually moderate/approve new accounts, so there's a bit of a time lag. To add a domain, they also require proof of control via a TXT record (done through afraid.org).
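Both pieces are easy to sanity-check from the commandline with dig (substituting your own domain):

## the MX record should point at e4ward's mail servers
dig +short MX mydomain.com
## and the TXT record should contain e4ward's verification string
dig +short TXT mydomain.com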

This whole setup allowed me to prove to startssl.com that I owned my domain, without running a mail server or paying for anything other than the domain itself. The result is my own SSL certificates. I'm running a pylons webapp with apache2 and mod_wsgi. In combination with Python's repoze.what, I get secure user authentication over https without any snakeoil.

Hat-tip to this writeup, which introduced me to e4ward.com and their mail servers.

Finally, there are a number of online tools to query domains. dnsstuff.com was one of the better ones I found. It takes longer to load, but gives a detailed report of domain configuration, along with suggestions. A nice tool to verify that everything is working as expected.


11 June 2013

Learning new fileserver tricks: RAID + LVM

I've finally gotten comfortable with Linux's software RAID, aka mdadm. I've been hearing about LVM for a while, and I finally took the plunge and figured out how to get the two to play together. The big benefit of RAID is, of course, protection against disk failure; the big benefit I see from LVM is adding and removing disk space without repartitioning. Once RAID was working, stacking LVM on top was easy enough, especially for my use case of a single big filesystem. I moved all my data onto one RAID array, built a new filesystem on top of a logical volume spanning the other two arrays, moved the data to the new filesystem, and then added the final RAID array to the logical volume and resized the filesystem. Thus, I end up with 3 separate RAID arrays glommed together into a single, large filesystem.
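Before handing anything to LVM, it's worth confirming that the arrays are healthy (the device names here and below match my setup of three arrays, md1 through md3):

## check RAID array status
cat /proc/mdstat
sudo mdadm --detail /dev/md2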

## Tell LVM about RAID arrays 
sudo pvcreate /dev/md2
sudo pvcreate /dev/md3

## Create a volume group from empty RAID arrays
sudo vgcreate VolGroupArray /dev/md2 /dev/md3

## Create a logical volume named "archive", using all available space 
sudo lvcreate -l +100%FREE VolGroupArray -n archive
sudo lvdisplay 
## and create a filesystem on the new logical volume 
sudo mkfs.ext4 /dev/VolGroupArray/archive

## mount the new filesystem
## and move files from the mount-point of /dev/md1 to /dev/VolGroupArray/archive
## then unmount /dev/md1

## Add the last RAID array to the volume group
sudo pvcreate /dev/md1
sudo vgextend VolGroupArray /dev/md1

## Update the logical volume to use all available space 
sudo lvresize -l +100%FREE /dev/VolGroupArray/archive
## And resize the filesystem -- rather slow, maybe faster to unmount it first...
sudo resize2fs /dev/VolGroupArray/archive

## Finally, get blkid and update /etc/fstab with UUID and mount options (here, just noatime)
sudo blkid
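
The fstab entry ends up looking something like this (the UUID is whatever blkid reports for the logical volume; the mount point is hypothetical):

## /etc/fstab
UUID=<uuid-from-blkid>  /archive  ext4  defaults,noatime  0  2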

I probably should have made backups before I did this, but everything went smoothly...
I also discovered this python tool for doing such conversions in-place. Again, this appears non-destructive, but backups never hurt. Also of interest for a file server is Smartmontools, which monitors for hardware/disk failures; a nice review is here.

[REFS]
* http://home.gagme.com/greg/linux/raid-lvm.php
* https://wiki.archlinux.org/index.php/Software_RAID_and_LVM
* http://webworxshop.com/2009/10/10/online-filesystem-resizing-with-lvm

07 June 2013

Symmetric set differences in R

My .Rprofile contains a collection of convenience functions and function abbreviations. These are either functions I use dozens of times a day and prefer not to type in full:
## my abbreviation of head()
h <- function(x, n=10) head(x, n)
## and summary()
ss <- summary
Or problems that I'd rather figure out once, and only once:
## example:
## between( 1:10, 5.5, 6.5 )
between <- function(x, low, high, ineq=F) {
    ## like SQL BETWEEN: return a logical index
    ## ineq=T uses inclusive (>=, <=) comparisons
    if (ineq) {
        x >= low & x <= high
    } else {
        x > low & x < high
    }
}
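For the record, the example in the comment picks out only the 6:
> between( 1:10, 5.5, 6.5 )
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE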
One of these "problems" that's been rattling around in my head is the fact that setdiff(x, y) is asymmetric, and has no options to modify this. With some regularity, I want to know if two sets are equal, and if not, what are the differing elements. setequal(x, y) gives me a boolean answer to the first question. It would *seem* that setdiff(x, y) would identify those elements. However, I find the following result rather counter-intuitive:
> setdiff(1:5, 1:6) 
integer(0)
I dislike having to type both setdiff(x,y) and setdiff(y,x) to identify the differing elements, as well as having to remember which argument is the reference set (here, the second one, which I find counterintuitive). With this in mind, here's a snappy little function that returns the symmetric set difference:
symdiff <- function(x, y) setdiff(union(x, y), intersect(x, y))
> symdiff(1:5, 1:6) == symdiff(1:6, 1:5)
[1] TRUE
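And the differing element itself:
> symdiff(1:5, 1:6)
[1] 6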

Tada! A new function for my .Rprofile!

25 November 2012

Successive Differences of a Randomly-Generated Timeseries

I was wondering about the null distribution of successive differences of random sequences, and decided to do some numerical experiments. I quickly realized that taking successive differences amounts to taking successively higher-order numerical derivatives, which act as a high-pass filter. So, the null distribution really depends on the spectrum of the original timeseries.
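A quick way to see the high-pass behavior: the first difference y[t] = x[t] - x[t-1] has squared gain |1 - exp(-2*pi*i*f)|^2 = 2 - 2*cos(2*pi*f) at frequency f (in cycles per sample), so constant signals are removed entirely while the Nyquist frequency is amplified fourfold. A quick check in R (gain2 is just an illustrative name):

## squared gain of the first-difference filter at frequency f (cycles/sample)
gain2 <- function(f) 2 - 2*cos(2*pi*f)
gain2(0)     ## 0 -- DC is removed entirely
gain2(0.5)   ## 4 -- the Nyquist frequency is amplified
## each additional difference applies this gain again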

Here, I've only played with random sequences, which are equivalent to white noise. Using the wonderful animation package, I created a movie that shows the timeseries resulting from differencing, along with their associated power spectra. You can see that, by the end, almost all of the power is concentrated in the highest frequencies. The code required to reproduce this video is shown below.

Note -- For optimum viewing, switch the player quality to high.
require(animation)
require(lattice)  ## xyplot lives here
require(plyr)
require(xts)
## large canvas, write output to this directory, 1 second per frame
ani.options(ani.width=1280, ani.height=720, loop=F, title='Successive differences of white noise', outdir='.', interval=1)
## How many realizations to plot?
N=5
## random numbers
aa = sapply(1:26, function(x) rnorm(1e2));
colnames(aa) = LETTERS;

saveVideo( {
    ## for successive differences, do...
    for (ii in 1:50) {
        ## first make the differences and normalize
        aa1 = apply(aa, 2, function(x) {
            ret=diff(x, differences=ii);ret=ret/max(ret)
        }); 
        ## Turn into timeseries object for easy plotting
        ## (as.Date needs an explicit origin for numeric input)
        aa1 = xts(aa1, as.Date(1:nrow(aa1), origin='1970-01-01'));

        ## next, compute spectrum
        aa2 = alply(aa1, 2, function(x) {
            ## of each column, don't modify original data 
            ret=spectrum(x, taper=0, fast=F, detrend=F, plot=F);
            ## turn into timeseries object
            ret= zoo(ret$spec, ret$freq)});
        ## combine into timeseries matrix
        aa2 = do.call(cbind, aa2 );
        colnames(aa2) = LETTERS;

        ## plot of timeseries differences
        ## manually set limits so plot area is exactly the same between successive figures
        myplot = xyplot(aa1[,1:N], layout=c(N,1),
                        xlab=sprintf('Difference order = %d', ii), 
                        ylab='Normalized Difference',
                        ylim=c(-1.5, 1.5), 
                        scales=list(alternating=F, x=list(rot=90), y=list(relation='same'))); 

        ## plot of spectrum
        myplot1 = xyplot(aa2[,1:N], layout=c(N,1),
                        ylim=c(-0.01, 5), xlim=c(0.1, 0.51),
                        xlab='Frequency', ylab='Spectral Density',
                        type=c('h','l'), 
                        scales=list(y=list(relation='same'), alternating=F));
        ## write them to canvas
        plot(myplot, split=c(1,1,1,2), more=T);
        plot(myplot1, split=c(1,2,1,2), more=F);
        ## print progress
        print(ii)}

## controls for ffmpeg
}, other.opts = "-b 5000k -bt 5000k")

08 November 2012

Consumerist Dilemmas

Consumer society gives us lots of choices, but sometimes provides little opportunity for post-choice customization. We buy something, and we can either return it or keep it, but it is what it is. Want something a little shorter, a little tighter, or a little less stiff? Too bad. Unless you're loaded, and then you can hire someone to do a custom job. But, to my knowledge, even the super-rich no longer order custom automobiles.

For a small subset of expensive, long-lived personal items like cars and mattresses, this can make new purchases particularly stressful. This is one plausible explanation for the copious consumer-choice media devoted to, for example, cars, as well as the effort companies put into brand identity. If they sell you something that you like, then there's a reasonable chance you'll get another one, the "safe choice", when the old one breaks.

I've faced this dilemma recently with shoes and glasses. I typically have one or two pairs of each that I use daily, usually for at least 2 years. I'm very near-sighted, so I have to get the expensive high-index lenses. My big toes spread out, I assume from years of wearing flip-flops and Chacos, so most every shoe feels narrow and pointy. The decision to get a new pair of glasses or shoes fills me with an existential dread -- what if I'm stuck with costly discomfort for the next year or more?

This month I discovered that keyboards engender a similar dread. I'm at a keyboard for at least 20 hours in an average week, and I imagine it gets up to 60 hours in a long one. Not all of that is typing, but I've had a slowly building case of carpal tunnel / repetitive stress injury, particularly in my right hand, from long coding sessions that involve heavy use of the shift, up-arrow, and enter keys. After a month of poking around the internet, checking reviews on Amazon and looking on eBay for used items, I felt totally conflicted. I just wanted to "try something on" before I shelled out cash and saddled myself with something that I'd end up hating.

But anything would be better than my stock Dell monstrosity. In this spirit, I dropped by the chaotic indie computer parts/repair store across the street from campus, and found two ergonomic keyboards used and on the cheap:

#1 Microsoft Comfort Curve 2000 v1
#2 Adesso EKB-2100W

Neither of them is the perfect keyboard. The Adesso (#2) has a nice split design, but it's huge. Enormous. Heavy -- it could be an instrument of murder in the game of Clue. The Microsoft (#1) looks nice, and has really easy key-travel. Unfortunately, its left Caps Lock, which I map to Control and which is involved in something like 5-20% of my (non-prose) keystrokes, is sticky and is already giving my pinky problems.

At least they're cheap.

A friend pointed out that one key to preventing RSI is to change up the routine. With this in mind, maybe two keyboards isn't such a bad idea. Did I mention they're cheap? At very least, the existential dread has abated. No more keyboard choices to make in the foreseeable future. Now, if I could just find a low-profile pair of sneakers with a large toebox and a tasteful lack of neon and mesh.

02 October 2012

Modeling Philosophy

I've sometimes found it difficult to explain my modeling philosophy, especially in the face of "why don't you just do X?" or "isn't that just like Y and Z?" responses. Without going into X, Y, or Z in the slightest, I present the following quote from one of my intellectual heroes. His short book (less than 100 pages), packed with equations that I struggle to make sense of (not unlike going to the opera), is full of little morsels of wisdom. Here's one of my favorites:
It need hardly be repeated that no *detailed* statistical agreement with observation is to be expected from models which are based on admitted simplifications. But the situation is quite analogous to others where complex and interacting statistical systems are under consideration. What is to be looked for is a comparable pattern with the main features of the phenomena; when, as we shall see, the model predicts important features which are found to conform to reality, it becomes one worthy of further study and elaboration.
-- M.S. Bartlett, 1960.
Stochastic Population Models in Ecology and Epidemiology

29 February 2012

Custom Amazon EC2 config for Rstudio

Introduction

This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future.

To start, I followed the instructions here to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the bioconductor folks for setting up this image.

Edit -- another look

After playing with this setup more and trying to shrink the root partition (as per these instructions), I realized there's a tremendous amount of cruft in this AMI. It's over 20GB, there's a weird (big) /downloads/ folder lying around, and just about every R package ever in /usr/local/lib64/R. I cut it down to ~6GB by removing these two directories.


If I were to do this again, I would use the official Canonical images (more details here), and manually install what I need. Honestly, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain for the image. Aside from installing a few extra packages (Rstudio server, for example), the rest of the instructions should be similar.

Modifications

I proceeded to lock it down -- add an admin user, prevent root login, etc.
Then I set up apache2 to serve Rstudio server over HTTPS/SSL.
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).

Below is the commandline session that I used to do it, along with annotations and links to some files.
adduser myusername
## add to the admin group ("sudo" on recent Ubuntu releases; older ones use "admin")
adduser myusername sudo
## add correct key to ~myusername/.ssh/authorized_keys

vi /etc/ssh/sshd_config 
## disable root login
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root


Next, I set up dynamic dns using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- it's very self-explanatory.

## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart
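
## the afraid.org update script amounts to hitting their update URL
## periodically; e.g. a crontab entry along these lines (a sketch, not the
## actual script -- the token is your own afraid.org update key):
## */5 * * * * curl -s "https://freedns.afraid.org/dynamic/update.php?<your-token>" > /dev/null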

## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http

## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
'/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'

## change permissions, move to default ubuntu locations
sudo chmod 444 crocus.crt
sudo chown root.ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates

sudo chmod 640 crocus.key
sudo chown root.ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/

sudo a2enmod rewrite
sudo a2enmod ssl

sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart

You can see my full apache config file here:
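
The heart of it is an SSL-enabled VirtualHost that proxies everything to Rstudio server on its default port, 8787. A minimal sketch, using the certificate paths from above (the real config has more to it):

<VirtualHost *:443>
    ServerName crocus.x14n.org
    SSLEngine on
    SSLCertificateFile /usr/share/ca-certificates/crocus.crt
    SSLCertificateKeyFile /etc/ssl/private/crocus.key
    # hand everything off to Rstudio server
    ProxyPass / http://localhost:8787/
    ProxyPassReverse / http://localhost:8787/
</VirtualHost>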

Conclusions

Now I access Rstudio on the EC2 instance with a browser via:
https://myhostname.mydomain.org

I found that connecting to the Rstudio server web interface gave noticeable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.

Another annoyance was that Rstudio installs packages into something like ~/R/libraries, whereas commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.
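In any case, .libPaths() shows where a given session looks for packages -- running it in both Rstudio and commandline R makes the difference obvious:

## run in each session and compare
.libPaths()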

Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.

All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.

28 February 2012

Adventures in R Studio Server: Apache2, Https, Security, and Amazon EC2.

I just put a fresh install of Ubuntu Server (10.04.4 LTS) on one of our machines.  As I was doing some post-install config, I accidentally installed Rstudio Server.  And subsequently fell down an exciting little rabbit-hole of server configuration and "ooooh-lala!" playtime.

A friend sang the wonders of Rstudio Server to me recently, and I filed it under "things to ignore for now".  Just another thing to learn, right?  Turns out, the Rstudio folks do *great* work and write good docs, so I hardly had to learn anything.  I just had to dust off my sysadmin skills and fire up some google.

I'm a little concerned about running web services on public-facing machines.  Even more so, given that R provides fairly low-level access to operating system services.  Still, I was impressed to see system user authentication.

I followed the docs for running apache2 as a proxy server, and learned a little about apache in the process.  Since I made it this far, I figured I'd run it through https/ssl, add some memory limitations, etc.  I'm still not entirely convinced this is secure -- it seems that running it in a virtual machine or chroot jail would be ideal.

On the other hand, I ran across this post on running Rstudio Server inside Amazon EC2 instances.  Nighttime EC2 spot prices on "Quadruple Extra Large" instances (68.4 GB of memory, 8 virtual cores with 3.25 EC2 Compute Units each) fell below $1 an hour tonight, which is cheap enough to play with for an hour or two -- put it through its paces and see how well it does with a *very* *large* *job* or two.  Instances can now be stopped and saved to EBS (elastic block storage), and so only need to be configured once, which really simplifies matters.  In fact, I'm wondering if Rstudio (well, R, really) is my "killer app" for EC2.

Overall, I was really impressed at how fast and easy this was to get up and running. Fun times ahead!