Tuesday, April 14, 2009

Configuring a Hadoop cluster on EC2

I've been ramping up on Amazon Elastic MapReduce, but now I need to process a 32GB file located in an Elastic Block Store, and there is no way I know of to get the AMIs that Amazon Elastic MapReduce starts up to mount an arbitrary EBS. So now it's time to roll my own Hadoop Cluster out on EC2.

I looked around for a while, and found this somewhat out of date tutorial by Tom White that pointed me to a set of ec2 helper scripts in the src/contrib subdir of the Hadoop installation.

Unfortunately, those scripts did not get me 'all the way there', but they were a start. I'm going to try to roll my changes into those ec2 helper scripts before I have to do set up another cluster:)

Setting Up a Multi Node Hadoop Cluster on EC2

Prior to setting up a Multi Node Hadoop Cluster, I set up a single node standalone installation. I recommend doing this because it allowed me to make sure my code worked, i.e. my jar file was valid, my Mapper and Reducer were working, etc.

In order to set up a multi node hadoop cluster, the standard Hadoop Cluster setup instructions applied to the EC2 environment meant that I would have to do the following
  1. find an AMI with hadoop on it
  2. bring up N+1 of those
  3. make one the master and the rest the slaves
  4. change the master config to account for the slaves
  5. change the slave config to point to the master
  6. allow Hadoop component port access between master and slave for namenode and datanode communication
  7. start the system up.
The scripts at {hadoop src location}/src/contrib/ec2/bin use the ec2 API shells to do attempt to do all of the above. They fall short in a couple of key areas, and need to be extended. I'm going to detail the necessary steps I took to get a cluster fully operational so that I can extend those scripts in the future.

What the scripts do:
  1. Find AMIs and starting up instances, N slave instances and 1 master instance.
  2. Allow you to log into the master as well as push files out to it.
  3. Generate a private/public key on the master, and push the public key out to the slaves to enable password-less ssh.
  4. Push the master hadoop-site.xml out to all slaves.
What they do not do:
  • they do not configure the master conf/slaves file to contain the IPs of all slaves.
  • they do not set up security groups with overridden port values specified in the /etc/rc.local of the AMI I was using. Those values are catted to conf/hadoop-site.xml. To be honest, there is no way they could actually be aware of those values unless the scripts were synchronized to the image, which they weren't.

Both of these mean that true distributed startup doesn't happen. But the failure is 'silent', so unless you are looking at the logs on multiple machines, you don't know that things are failing.

Initial Script Setup Steps
Here are the steps I used to get working with the scripts. Note that the AMI the scripts point to by default has version 0.17 of Hadoop installed.

(1) I configured my EC2_PRIVATE_KEY and EC2_CERT env vars to point to the .pem files I generated for them.
(2) In {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh, I set the following env vars:
  • AWS_ACCOUNT_ID={acct number}
  • AWS_ACCESS_KEY_ID={key id}
  • AWS_SECRET_ACCESS_KEY={secret key}
  • KEY_NAME={name of KeyPair you want to use} NOTE: on the KeyPair, the hadoop-ec2 scripts assume that the generated private key for your keypair resides in the same directory you configured your EC2_PRIVATE_KEY in.
(3) {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} {number of desired nodes} to start up a cluster with the AMI configured at the S3_BUCKET location specified in the conf file.

At this point, I thought the cluster was up and running, but when I tried to copy a large file to the cluster, I got this error:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/root/input could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
I googled this and found out that it implied that my data nodes were failing (but I hadn't seen that!). I checked the masters and slaves files in the master machine conf file, and they only contained localhost, which meant that the master knew nothing about the slaves at startup.

I stopped hadoop, changed the conf/slaves config file to include the Amazon internal names of all slaves, and restarted. This time I could see the remote slave data nodes start up. So I tried the copy again, and got the same failure.

When I went out to a slave machine, I looked at the datanode log file in the log directory (on this AMI, configured at /mnt/hadoop/logs. I saw that the datanode service was trying to contact the master with no success.

This ended up being because of the security policy of EC2. In EC2, you need to explicitly configure which ports are accessible on each instance via EC2 Security Groups. In summary, the current scripts assumed defaults from hadoop-default.xml, and I had overridden some of those defaults in hadoop-site.xml.

Summary:

  • extended conf/slaves to include IP addresses of slave instances.
  • added 50001 and 50002 access to the master security group (meaning that slave nodes could talk to the master on those ports)
  • added 50010 access to the slave security group (same meaning for master to slaves)
At this point Hadoop was configured with 4 slave nodes and 1 master.

Attaching an EBS to the master and copying EBS data to HDFS
(1) My data source was located in a Elastic Block Store volume. These are mounted like so:

ec2-attach-volume {volume ID} -i {image ID} -d {device location on image, i.e. /dev/sda}

(2) In order to actually access the data, you mount that volume like you would mount a drive:

mount {name of dir to map to} {name of device to mount, i.e. /dev/sda}

Running the Job
(1) Once I mounted the volume, I needed to log into the master to start the job.

{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} login

(2) Then I need to push the file to HDFS for processing.

hadoop fs -mkdir input
hadoop fs -copyFromLocal {location of articles.tsv} input

(3) From my local box, I push my jar out to the master:
{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} push {name of my jar file}

(4) On the master box, I start the job:
hadoop jar {name of jar} input output (NOTE: job will fail immediately if output exists in hdfs)




4 comments:

  1. Arun, have you tried the Cloudera AMI for EC2? That said, it still doesn't auto-mount EBS volumes for you... this is something I'm interested in doing but don't yet have enough knowledge to get my head around it...

    ReplyDelete
  2. p7a, I'm actually in the middle of writing up a guest post for the Cloudera guys, who contacted me after reading this last post. It was _way_ easier. Stay tuned, I'll update when this post goes out on the Cloudera blog.

    ReplyDelete
  3. When you say cluster here, do you mean, you have more than 1 machine at your end, and all these machines execute one job on EC2, Please correct me if i am wrong. Actually I have just started reading your post, its too long, so taking time to understand time.
    You answers may help me to understand your post better

    ReplyDelete
  4. The best way to do this for starters is to install, configure and test a "local" Hadoop setup for each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master, and the other box will become only a slave.

    ReplyDelete