Monday, April 6, 2009

How I got rolling in the cloud

I recently jumped at the chance to research re-implementing a project here at work in the Amazon cloud. I've been curious about running EC2 instances for a while, and when Amazon announced Elastic MapReduce, their hosted Hadoop service that removes the need to hand-assemble Hadoop clusters, I really didn't have any excuses left.

There was a bit of FUD involved in actually getting into a real cloud -- creating running instances, talking to various services, etc. -- in addition to trying a new approach to a system that was quickly approaching nonfunctional.

This FUD was complicated by my head cold and the medication I was taking, but despite that fog (yes, I always blame it on the drugs) I was able to muddle through and get something going. My notes (aka a series of pointers to other people's work):

(1) I needed to view some sample code that lived outside my personal S3 store. I tried to build a couple of S3 browsers, and was about to embark on a yak-shaving exercise due to a misconfigured Ant build on my dev box when I decided to try s3curl instead. s3curl and irb loaded with hpricot allowed me to get an XML listing of keys in a bucket, then parse the returned XML and download the source code files I wanted to see: specifically, the AWS Elastic MapReduce Freebase sample code. I'm 100% sure I could have done this via a UI, but really didn't want to get distracted trying to fix a secondary issue.
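For anyone who wants to see the shape of that "list the keys, then pick the files" step, here is a minimal sketch. It is not what I ran (that was s3curl plus hpricot in irb); it swaps in Python's standard-library XML parser, and the bucket name and keys in the sample listing are made up for illustration. The only thing taken from S3 itself is the ListBucketResult layout and its XML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by S3 bucket-listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical listing: the bucket name and keys are invented for this example.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <IsTruncated>false</IsTruncated>
  <Contents><Key>samples/mapper.py</Key><Size>1024</Size></Contents>
  <Contents><Key>samples/reducer.py</Key><Size>2048</Size></Contents>
  <Contents><Key>samples/README.txt</Key><Size>512</Size></Contents>
</ListBucketResult>
"""


def list_keys(listing_xml, suffix=""):
    """Return the keys in a bucket listing, optionally filtered by suffix."""
    root = ET.fromstring(listing_xml)
    keys = [el.text for el in root.findall("s3:Contents/s3:Key", S3_NS)]
    return [key for key in keys if key.endswith(suffix)]


if __name__ == "__main__":
    # Print only the source files we would then go and download.
    for key in list_keys(SAMPLE_LISTING, suffix=".py"):
        print(key)
```

In practice the XML would come from a signed GET on the bucket (which is the part s3curl handled for me), and each key returned would be fetched with a second signed request.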

(2) For browsing and syncing my personal S3 store, I used the S3 Firefox Organizer plugin. It is especially useful when inspecting the output of a map-reduce run.

(3) For configuring AMIs and binding EBS volumes of public instance data, I used ElasticFox, another FF plugin. The tutorial walks you through the details of how to generate a keypair, create an instance from an AMI, and bind to an EBS volume.

(4) The application I'm working on (for work) processes Wikipedia and Freebase, both of which can be painful and time-consuming to get dumps of. Freebase has done the 'right thing' and posted public instances of the Freebase data store as well as a 'cleaned up' version of the Wikipedia data store that is suitable for a Postgres database. Just having these volumes available removes at least 4 hours of setup and maintenance time from our process.

(5) As part of their announcement, Amazon posted a tutorial on how to use Elastic MapReduce using Freebase data. I found this great PDF that walked me through using the CLI to set up several different workflows using different mappers and reducers to find the most popular people in American Football. The mappers and reducers output data to S3 and SimpleDB, which was great for me to see since I didn't have a lot of familiarity with either.
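To make the mapper/reducer idea in note (5) concrete, here is a rough sketch of the pattern as I understand it for a streaming-style job: the mapper emits tab-separated key/value lines, the framework sorts them by key, and the reducer sums the values for each key. This is not the tutorial's code. The input layout (a page id, a tab, then a person's name) and the idea of counting mentions as "popularity" are assumptions made purely for illustration, and the sketch writes nothing to S3 or SimpleDB.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'name<TAB>1' for each input line.

    Assumes a hypothetical input layout: page_id<TAB>person_name.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each name; input must already be sorted by name."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for name, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{name}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process for a quick check."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Made-up records, only to exercise the pipeline.
    sample = ["p1\tPlayer A", "p2\tPlayer B", "p3\tPlayer A"]
    for row in run_local(sample):
        print(row)
```

Running the two halves locally with a plain sort in between is a cheap way to check the logic before paying for a cluster.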

That's it for now. I'm going to write more as I prototype key parts of the system and try to figure out the best way to implement it.