Introduction
This post is a work in progress building on the previous post. It's my attempt to simultaneously learn Amazon's AWS tools and set up R and Rstudio Server on a customized "cloud" instance. I look forward to testing some R jobs that have large memory requirements or are very parallelizable in the future. To start, I followed the instructions here to get a vanilla Ubuntu/Bioconductor EC2 image up and running. This was a nice exercise -- props to the Bioconductor folks for setting up this image.
Edit -- another look
After playing with this setup more and trying to shrink the root partition (as per these instructions), I realized there's a tremendous amount of cruft in this AMI. It's over 20GB, there's a weird (big) /downloads/ folder lying around, and just about every R package ever is in /usr/local/lib64/R. I cut it down to ~6GB by removing these two directories.
If I were to do this again, I would use the official Canonical images (more details here), and manually install what I need. Honestly, this is more my style -- it means fewer unknown hands have touched my root, and there's a clear audit chain of the image. Aside from installing a few extra packages (Rstudio server, for example), the rest of the instructions should be similar.
Modifications
I proceeded to lock it down -- add an admin user, prevent root login, etc. Then I set up apache2 to serve Rstudio server over HTTPS/SSL.
In the AWS console, I edited Security Groups to add a custom Inbound rule for TCP port 443 (https). I then closed off every other port besides 22 (ssh).
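The same inbound rules can also be set from a terminal. The sketch below assumes the modern AWS CLI (the aws command) is installed and configured, which is not what I used above; the security group ID is a hypothetical placeholder. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the AWS CLI is installed and configured.
# SG_ID is a hypothetical placeholder; substitute your own security group ID.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Allow inbound TCP on the given port from any address.
open_port() {
  run aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$1" --cidr 0.0.0.0/0
}

open_port 22    # ssh
open_port 443   # https
```

Setting DRY_RUN=0 runs the commands for real. A rule that already exists (port 22 often does) will likely make the corresponding call fail with a duplicate-rule error, which is harmless here.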
Below is the commandline session that I used to do it, along with annotations and links to some files.
adduser myusername
adduser myusername sudoers
## add correct key to ~myusername/.ssh/authorized_keys
vi /etc/ssh/sshd_config  ## disable root login
/etc/init.d/ssh restart
## now log in as myusername via another terminal to make sure it works, and then log out as root
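The root-login edit was done by hand in vi. As a rough, non-interactive equivalent, the sketch below sets PermitRootLogin to "no" in whatever sshd_config-style file it is given. It assumes GNU sed (as on Ubuntu), and the function name is my own invention for illustration.

```shell
#!/usr/bin/env bash
# Sketch: set "PermitRootLogin no" in an sshd_config-style file.
# disable_root_login is a hypothetical helper name; assumes GNU sed.
disable_root_login() {
  local cfg="$1"
  if grep -Eq '^[#[:space:]]*PermitRootLogin[[:space:]]' "$cfg"; then
    # Replace the existing (possibly commented-out) directive.
    sed -i -E 's/^[#[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin no/' "$cfg"
  else
    # No directive present: append one.
    echo 'PermitRootLogin no' >> "$cfg"
  fi
}

# Try it on a scratch copy rather than the live file.
scratch="$(mktemp)"
printf 'Port 22\nPermitRootLogin yes\n' > "$scratch"
disable_root_login "$scratch"
grep PermitRootLogin "$scratch"
```

To apply it for real, the function would be run as root against /etc/ssh/sshd_config. Running sshd -t afterwards checks the file for errors before the ssh service is restarted.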
Next, I set up dynamic dns using http://afraid.org (I've previously registered my own domain and added it to their system). I use a script file made specifically to work with AWS -- it's very self-explanatory.
## change hostname to match afraid.org entry
sudo vi /etc/hostname
sudo /etc/init.d/hostname restart

## Now it's time to make Rstudio server a little more secure
## from http://rstudio.org/docs/server/running_with_proxy
sudo apt-get install apache2 libapache2-mod-proxy-html libxml2-dev
sudo a2enmod proxy
sudo a2enmod proxy_http

## based on instructions from http://beeznest.wordpress.com/2008/04/25/how-to-configure-https-on-apache-2/
openssl req -new -x509 -days 365 -keyout crocus.key -out crocus.crt -nodes -subj \
  '/O=UNM Biology Wearing Disease Ecology Lab/OU=Christian Gunning, chief technologist/CN=crocus.x14n.org'

## change permissions, move to default ubuntu locations
sudo chmod 444 crocus.crt
sudo chown root.ssl-cert crocus.crt
sudo mv crocus.crt /usr/share/ca-certificates
sudo chmod 640 crocus.key
sudo chown root.ssl-cert crocus.key
sudo mv crocus.key /etc/ssl/private/

sudo a2enmod rewrite
sudo a2enmod ssl
sudo vi /etc/apache2/sites-enabled/000-default
sudo /etc/init.d/apache2 restart
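The session above doesn't show what goes into 000-default. As a rough sketch (not the exact file used here), an SSL virtual host that proxies to Rstudio server might look like the one written out below. It assumes Rstudio server is listening on its default port 8787 on localhost and that the certificate and key sit where the commands above moved them; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of an SSL virtual host that proxies to Rstudio server.
# Assumptions: Rstudio server listens on localhost:8787 (its default port),
# and the cert/key paths match the mv commands in the session above.
# write_vhost is a hypothetical helper name.
write_vhost() {
  local out="$1" server_name="$2"
  cat > "$out" <<EOF
<VirtualHost *:443>
  ServerName ${server_name}

  SSLEngine on
  SSLCertificateFile    /usr/share/ca-certificates/crocus.crt
  SSLCertificateKeyFile /etc/ssl/private/crocus.key

  # Forward all requests to the local Rstudio server
  ProxyPass        / http://localhost:8787/
  ProxyPassReverse / http://localhost:8787/
</VirtualHost>
EOF
}

# Write to a scratch file for review before copying it into place.
vhost="$(mktemp)"
write_vhost "$vhost" crocus.x14n.org
cat "$vhost"
```

This relies on the proxy, proxy_http and ssl modules enabled above. It leaves out any rewrite rules, for example a redirect from plain HTTP to HTTPS.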
You can see my full apache config file here:
Conclusions
Now I access Rstudio on the EC2 instance with a browser via: https://myhostname.mydomain.org
I found that connecting to the Rstudio server web interface gave noticeable lag. Most annoyingly, key-presses were missed, meaning that I kept hitting enter on incorrect commands. Connecting to the commandline via SSH worked much better.
Another annoyance was that Rstudio installs packages into something like ~/R/libraries, whereas the commandline R installs them into ~/R/x86_64-pc-linux-gnu-library/2.14. Is this a general feature of Rstudio? It's a little confusing that this isn't standardized.
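One possible way to get both onto the same library is to set R_LIBS_USER in ~/.Renviron, which R reads at startup. I haven't confirmed that this is the cause of the mismatch above, so treat the sketch below as an assumption; the function name is hypothetical, and the default value mirrors the commandline path mentioned above (R expands %p to the platform and %v to the R version).

```shell
#!/usr/bin/env bash
# Sketch: point R at a single per-user library via R_LIBS_USER in an
# Renviron file. set_user_lib is a hypothetical helper name.
set_user_lib() {
  local renviron="$1"
  local default_lib='~/R/%p-library/%v'   # %p = platform, %v = R version
  local lib="${2:-$default_lib}"
  touch "$renviron"
  # Drop any earlier R_LIBS_USER setting, then append the new one.
  grep -v '^R_LIBS_USER=' "$renviron" > "${renviron}.tmp" || true
  mv "${renviron}.tmp" "$renviron"
  printf 'R_LIBS_USER=%s\n' "$lib" >> "$renviron"
}

# Try it on a scratch file rather than the real ~/.Renviron.
scratch="$(mktemp)"
set_user_lib "$scratch"
cat "$scratch"
```

After pointing it at the real ~/.Renviron and restarting R, calling .libPaths() from both Rstudio and the commandline shows whether they now agree. The library directory has to exist, or R will ignore it.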
Another quirk -- I did all of this on a Spot Price instance. After all of these modifications, I discovered that Spot instances can't be "stopped" (the equivalent of powering down), only terminated (which discards all of the changes). After some looking, I discovered that I could "Create an Image" (EBS AMI) from the running image. This worked well -- I was able to create a new instance from the new AMI that had all of the changes, terminate the original instance, and then stop the new instance.
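I did the imaging step in the web console. A rough command-line version is sketched below. It assumes the modern AWS CLI is installed and configured; the instance ID and image name are hypothetical placeholders. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the AWS CLI is installed and configured.
# INSTANCE_ID and IMAGE_NAME are hypothetical placeholders.
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"
IMAGE_NAME="${IMAGE_NAME:-rstudio-custom}"

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a new EBS-backed AMI from the running instance.
snapshot_instance() {
  run aws ec2 create-image --instance-id "$INSTANCE_ID" --name "$IMAGE_NAME"
}

# Terminate the original instance once the new AMI is available.
terminate_instance() {
  run aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
}

snapshot_instance
```

The order matters: the original Spot instance should only be terminated after the new AMI shows up as available, since terminating discards everything on it.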
All of this sounds awfully complicated. Overall, this is how I've felt about Amazon AWS in general and EC2 in particular for a while. The docs aren't great, the web-based GUI tools are sometimes slow to respond, and the concepts are *new*. But I'm glad I waded in and got my feet wet. I now understand how to power up my customized image on a micro instance for less than $0.10 an hour to configure and re-image it, and how to run that image on an 8 core instance with 50+GB RAM for less than a dollar an hour via Spot Pricing.