Tuesday, April 14, 2009

Configuring a Hadoop cluster on EC2

I've been ramping up on Amazon Elastic MapReduce, but now I need to process a 32GB file located in an Elastic Block Store volume, and there is no way I know of to get the AMIs that Amazon Elastic MapReduce starts up to mount an arbitrary EBS volume. So now it's time to roll my own Hadoop cluster out on EC2.

I looked around for a while, and found this somewhat out of date tutorial by Tom White that pointed me to a set of ec2 helper scripts in the src/contrib subdir of the Hadoop installation.

Unfortunately, those scripts did not get me 'all the way there', but they were a start. I'm going to try to roll my changes into those ec2 helper scripts before I have to set up another cluster. :)

Setting Up a Multi Node Hadoop Cluster on EC2

Prior to setting up a multi-node Hadoop cluster, I set up a single node standalone installation. I recommend doing this: it let me make sure my code worked, i.e. that my jar file was valid, my Mapper and Reducer were working, etc.

In order to set up a multi-node Hadoop cluster, applying the standard Hadoop cluster setup instructions to the EC2 environment meant I would have to do the following:
  1. find an AMI with hadoop on it
  2. bring up N+1 of those
  3. make one the master and the rest the slaves
  4. change the master config to account for the slaves
  5. change the slave config to point to the master
  6. allow Hadoop component port access between master and slave for namenode and datanode communication
  7. start the system up.
The scripts at {hadoop src location}/src/contrib/ec2/bin use the ec2 API tools to attempt to do all of the above. They fall short in a couple of key areas, and need to be extended. I'm going to detail the steps I took to get a cluster fully operational so that I can extend those scripts in the future.

What the scripts do:
  1. Find AMIs and start up instances: N slave instances and 1 master instance.
  2. Allow you to log into the master as well as push files out to it.
  3. Generate a private/public key on the master, and push the public key out to the slaves to enable password-less ssh.
  4. Push the master hadoop-site.xml out to all slaves.
What they do not do:
  • they do not configure the master conf/slaves file to contain the IPs of all slaves.
  • they do not set up security groups with overridden port values specified in the /etc/rc.local of the AMI I was using. Those values are catted to conf/hadoop-site.xml. To be honest, there is no way they could actually be aware of those values unless the scripts were synchronized to the image, which they weren't.

Both of these mean that true distributed startup doesn't happen. But the failure is 'silent', so unless you are looking at the logs on multiple machines, you don't know that things are failing.

Initial Script Setup Steps
Here are the steps I used to get working with the scripts. Note that the AMI the scripts point to by default has version 0.17 of Hadoop installed.

(1) I configured my EC2_PRIVATE_KEY and EC2_CERT env vars to point to the .pem files I generated for them.
(2) In {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh, I set the following env vars:
  • AWS_ACCOUNT_ID={acct number}
  • AWS_ACCESS_KEY_ID={key id}
  • AWS_SECRET_ACCESS_KEY={secret key}
  • KEY_NAME={name of KeyPair you want to use} NOTE: on the KeyPair, the hadoop-ec2 scripts assume that the generated private key for your keypair resides in the same directory you configured your EC2_PRIVATE_KEY in.
(3) {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} {number of desired nodes} to start up a cluster with the AMI configured at the S3_BUCKET location specified in the conf file.
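Steps (1) and (2) amount to a handful of variable settings. As a sketch, with the same placeholders used above:

```shell
# Placeholders only -- substitute your own credentials and file names.
export EC2_PRIVATE_KEY=~/.ec2/pk-{your key}.pem
export EC2_CERT=~/.ec2/cert-{your cert}.pem

# In {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh:
AWS_ACCOUNT_ID={acct number}
AWS_ACCESS_KEY_ID={key id}
AWS_SECRET_ACCESS_KEY={secret key}
KEY_NAME={name of KeyPair you want to use}
```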

At this point, I thought the cluster was up and running, but when I tried to copy a large file to the cluster, I got this error:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/root/input could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
I googled this and found that it meant my data nodes were failing (but I hadn't seen that!). I checked the masters and slaves files in the master machine's conf directory, and they only contained localhost, which meant that the master knew nothing about the slaves at startup.

I stopped hadoop, changed the conf/slaves config file to include the Amazon internal names of all slaves, and restarted. This time I could see the remote slave data nodes start up. So I tried the copy again, and got the same failure.
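That conf/slaves edit can be scripted. Here's a sketch that pulls the internal hostnames out of ec2-describe-instances output; the sample output below is abridged and made up, and the field positions are my assumption for the 2009-era API tools, so check against your actual output:

```shell
# Abridged, hypothetical ec2-describe-instances output; real output has more fields.
cat > /tmp/instances.txt <<'EOF'
RESERVATION  r-123abc  691234567890  my-cluster
INSTANCE  i-abc123  ami-fa6a8e93  ec2-75-101-1-1.compute-1.amazonaws.com  domU-12-31-39-00-C5-B2.compute-1.internal  running
INSTANCE  i-def456  ami-fa6a8e93  ec2-75-101-1-2.compute-1.amazonaws.com  domU-12-31-39-00-C5-B3.compute-1.internal  running
EOF

# Print the internal DNS name (field 5) of each running instance;
# on the master, redirect this into conf/slaves.
awk '$1 == "INSTANCE" && $6 == "running" {print $5}' /tmp/instances.txt
```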

When I went out to a slave machine, I looked at the datanode log file in the log directory (on this AMI, configured at /mnt/hadoop/logs). I saw that the datanode service was trying to contact the master with no success.

This ended up being because of the security policy of EC2. In EC2, you need to explicitly configure which ports are accessible on each instance via EC2 Security Groups. In summary, the current scripts assumed defaults from hadoop-default.xml, and I had overridden some of those defaults in hadoop-site.xml.


To fix this, I:
  • extended conf/slaves to include the IP addresses of all slave instances.
  • added 50001 and 50002 access to the master security group (meaning that slave nodes could talk to the master on those ports)
  • added 50010 access to the slave security group (same meaning for master to slaves)
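The security group changes can be made with the EC2 API tools. The commands below are a sketch: the group names assume the scripts created a {cluster}-master group and a {cluster} group, and the ec2-authorize flags are from memory, so double-check them against the tools' help:

```shell
# Slaves -> master: open the namenode (50001) and jobtracker (50002) ports
# to the slave group.
ec2-authorize my-cluster-master -P tcp -p 50001 -o my-cluster -u {account id}
ec2-authorize my-cluster-master -P tcp -p 50002 -o my-cluster -u {account id}

# Master -> slaves: open the datanode port (50010) to the master group.
ec2-authorize my-cluster -P tcp -p 50010 -o my-cluster-master -u {account id}
```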
At this point Hadoop was configured with 4 slave nodes and 1 master.

Attaching an EBS to the master and copying EBS data to HDFS
(1) My data source was located in an Elastic Block Store volume. These are attached like so:

ec2-attach-volume {volume ID} -i {instance ID} -d {device location on the instance, e.g. /dev/sdh}

(2) In order to actually access the data, you mount that volume like you would mount a drive:

mount {name of device to mount, e.g. /dev/sdh} {name of dir to mount it on}
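Concretely, with made-up IDs and the device name Amazon's docs conventionally use for EBS attachments:

```shell
# From my local box: attach the EBS volume to the master instance (placeholder IDs).
ec2-attach-volume vol-abc123 -i i-def456 -d /dev/sdh

# On the master: create a mount point, mount the device, and check the file is there.
mkdir -p /ebs-data
mount /dev/sdh /ebs-data
ls -lh /ebs-data
```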

Running the Job
(1) Once I mounted the volume, I needed to log into the master to start the job.

{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} login

(2) Then I needed to push the file to HDFS for processing.

hadoop fs -mkdir input
hadoop fs -copyFromLocal {location of articles.tsv} input

(3) From my local box, I push my jar out to the master:
{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} push {name of my jar file}

(4) On the master box, I start the job:
hadoop jar {name of jar} input output (NOTE: the job will fail immediately if the output directory already exists in HDFS)

Wednesday, April 8, 2009

Developing an Application on Amazon Elastic MapReduce

Follow up from last week:
I've been interested in parallel processing for a while, but didn't have the time/opportunity to get into it until recently, when a group of us were asked to rewrite a data extraction system that was built around a single (relational) database and suffering from severe I/O contention.

The actual mechanics of what we were trying to do with that system lent themselves to parallel processing: specifically, there was a lot of data transformation and aggregation that we needed to do -- in other words, map-reduce.

Right around the time that we decided to do the rewrite, Amazon came out with Elastic MapReduce. Prior to EMR, our options were to either set up a local Hadoop cluster, or assemble one out on EC2. My experience with Hadoop was zero, and my experience with the map/reduce algorithm was limited to working through the canonical Inverted Index example in my head. So I wanted to minimize the learning curve of setting up a cluster while still being able to validate our ideas. I don't know if we are eventually going to end up building our own AMIs for a custom Hadoop cluster, but not having to deal with those details at this point allows us to focus on the ideas instead of the infrastructure. Yay!

In my last post I explained the steps I took to run the basic streaming sample. Next up, I wanted to run my own map reduce job. In order to develop an EMR application utilizing a custom jar, I took the following steps:

(1) Set up a single node hadoop installation on my dev box.
Because Time Is Money on EC2, I wanted to get my code as right as I could before running it in the cloud. The Hadoop site provides a great step by step install sheet for Leopard.

(2) Went through the basic Hadoop Tutorial. This tutorial provides great explanations of each class the developer has to implement, as well as the helper classes. Whenever I got stuck, the answer was usually found in the tutorial.

(3) Installed the IBM MapReduce Tools Plugin for Eclipse. While I haven't quite been able to get it to debug a map/reduce job against my local Hadoop install, I can run my program in 'standalone' mode as a Java Application, as long as Hadoop is started on my system.

(4) After debugging in the IDE, I ran the job as a standard hadoop job:

hadoop jar {name of jar} {input dir} {output dir}

I chose to export the jar from Eclipse (instead of writing a build.xml); as part of this I got to specify the main class and autogenerate the MANIFEST.MF.

The input directory and output directory needed to be specified as HDFS directories. The output directory could not exist.

NOTE on the application driver: I took a cue from the example code (eventually) after running into unzip errors, which a couple of posts said were due to bad configuration. I derived the driver class from Configured and implemented the Tool interface.

(5) After running successfully as a local Hadoop job, I ran the job from the cloud:

(a) I copied the jar up to s3
(b) I copied the input files up to s3
(c) I created a top level bucket for the output directories.

I tried to run my job as it was configured for hadoop, with the input and output specified as s3n://{bucket-name} with no success.

In order to diagnose the problem, I enabled logging, which I should have done immediately. When configuring a job via the AWS Management console, select 'Advanced Options' in the 'configure EC2 instances' section. You need to specify the log destination using s3n://{bucket-name} format.

With logging enabled, I saw the following exception:

java.lang.IllegalArgumentException: Path must be absolute: s3n://java-emr-test

and it made no sense, even with googling. I ran the CloudBurst sample and noticed that it specified a full filename as the input parameter. When I did the same, my run succeeded. I'm not sure why I had to do this, and want to make sure I really need to do it, because it will mean an additional read step of intermediate output directories prior to subsequent map reduce steps.
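In other words, the input needs to be a full path to an object, not just a bucket. Using the bucket name from the error above and the articles.tsv file from earlier (the key path itself is hypothetical):

```shell
BUCKET=java-emr-test
INPUT="s3n://$BUCKET/input/articles.tsv"   # full key path -- this worked
# INPUT="s3n://$BUCKET"                    # bucket only -- 'Path must be absolute'
echo "$INPUT"
```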

Now that I've got this working through the UI, I'm going to look at the entangledstate ruby tool because it appears to allow programmatic configuration of multi step job flows.

Monday, April 6, 2009

How I got rolling in the cloud

I recently jumped at the chance to research re-implementing a project here at work in the Amazon cloud. I've been curious about running EC2 instances for a while, and when AMZN announced Elastic MapReduce, their cloud implementation that removed the need to hand assemble Hadoop clusters, I really didn't have any excuses left.

There was a bit of FUD involved in actually getting into the cloud -- creating running instances, talking to various services, etc. -- in addition to trying a new approach to a system that was quickly approaching nonfunctional.

This FUD was complicated by my head cold and the medication I was taking, but despite that fog (yes, I always blame it on the drugs) I was able to muddle through and get something going. My notes (aka a series of pointers to other peoples work):

(1) I needed to view some sample code at http://elasticmapreduce.s3.amazonaws.com/ , i.e. code that was not in my personal s3 store. I tried to build a couple of S3 browsers, and was about to embark on a yak shaving exercise due to a misconfigured ant build on my dev box when I decided to try s3curl instead. s3curl and irb loaded with hpricot allowed me to get an XML listing of keys in a bucket, then parse the returned XML and download the source code files I wanted to see: specifically, the AWS Elastic MapReduce Freebase sample code. I'm 100% sure I could have done this via a UI, but really didn't want to get distracted trying to fix a secondary issue.
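The listing-and-parsing step went through irb and hpricot, but the core of it can be sketched in shell against a trimmed, made-up ListBucketResult document:

```shell
# A trimmed, hypothetical S3 bucket listing (the real one comes back from s3curl).
cat > /tmp/listing.xml <<'EOF'
<ListBucketResult>
  <Contents><Key>samples/freebase/code/mapper.py</Key></Contents>
  <Contents><Key>samples/freebase/code/reducer.py</Key></Contents>
</ListBucketResult>
EOF

# Pull out the object keys; each key could then be fetched with s3curl.
grep -o '<Key>[^<]*</Key>' /tmp/listing.xml | sed -e 's/<Key>//' -e 's|</Key>||'
```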

(2) For browsing and syncing my personal s3 store: I used the S3 Firefox Organizer plugin. Especially useful when inspecting the output of a map-reduce run.

(3) For configuring AMIs and binding EBS volumes of public instance data, I used ElasticFox, another FF plugin. The tutorial walks you through the details of how to generate a keypair, create an instance from an AMI, and bind to an EBS.

(4) The application I'm working on (for work) processes Wikipedia and Freebase data, both of which can be painful and time consuming to get dumps of. Freebase has done the 'right thing' and posted public instances of the Freebase data store as well as a 'cleaned up' version of the Wikipedia data store that is suitable for a Postgres database. Just having these volumes available removes at least 4 hours of setup and maintenance time from our process.

(5) As part of their announcement, Amazon posted a tutorial on how to use Elastic MapReduce using Freebase data. I found this great PDF that walked me through using the CLI to set up several different workflows using different mappers and reducers to find the most popular people in American Football. The mappers and reducers output data to S3 and SimpleDB, which was great for me to see since I didn't have a lot of familiarity with either.

That's it for now. I'm going to write more as I prototype key parts of the system and try to figure out the best way to implement.