Tuesday, April 14, 2009

Configuring a Hadoop cluster on EC2

I've been ramping up on Amazon Elastic MapReduce, but now I need to process a 32GB file stored in an Elastic Block Store volume, and I know of no way to get the AMIs that Elastic MapReduce starts up to mount an arbitrary EBS volume. So now it's time to roll my own Hadoop cluster out on EC2.

I looked around for a while, and found a somewhat out-of-date tutorial by Tom White that pointed me to a set of EC2 helper scripts in the src/contrib subdirectory of the Hadoop installation.

Unfortunately, those scripts did not get me 'all the way there', but they were a start. I'm going to try to roll my changes into those EC2 helper scripts before I have to set up another cluster :)

Setting Up a Multi Node Hadoop Cluster on EC2

Prior to setting up a multi-node Hadoop cluster, I set up a single-node standalone installation. I recommend doing this because it let me confirm that my code worked, i.e. that my jar file was valid, my Mapper and Reducer behaved, etc.

Applying the standard Hadoop cluster setup instructions to the EC2 environment meant that I would have to do the following:
  1. find an AMI with hadoop on it
  2. bring up N+1 of those
  3. make one the master and the rest the slaves
  4. change the master config to account for the slaves
  5. change the slave config to point to the master
  6. allow Hadoop component port access between master and slave for namenode and datanode communication
  7. start the system up.
The scripts at {hadoop src location}/src/contrib/ec2/bin use the EC2 API tools to attempt all of the above. They fall short in a couple of key areas and need to be extended. I'm going to detail the steps I took to get a cluster fully operational, so that I can extend those scripts in the future.

What the scripts do:
  1. Find AMIs and start up instances: N slave instances and 1 master instance.
  2. Allow you to log into the master as well as push files out to it.
  3. Generate a private/public key on the master, and push the public key out to the slaves to enable password-less ssh.
  4. Push the master hadoop-site.xml out to all slaves.
What they do not do:
  • they do not configure the master conf/slaves file to contain the IPs of all slaves.
  • they do not set up security groups with overridden port values specified in the /etc/rc.local of the AMI I was using. Those values are catted to conf/hadoop-site.xml. To be honest, there is no way they could actually be aware of those values unless the scripts were synchronized to the image, which they weren't.

Both of these mean that true distributed startup doesn't happen. But the failure is 'silent', so unless you are looking at the logs on multiple machines, you don't know that things are failing.
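Because the failure is silent, it pays to check every slave's datanode log directly instead of trusting the startup output. A minimal sketch of that spot-check, with placeholder hostnames standing in for conf/slaves (the loop prints the ssh commands rather than running them; drop the echo to execute for real, assuming the password-less ssh the scripts set up and this AMI's /mnt/hadoop/logs directory; 'Retrying connect' matches the IPC client's connection-retry log message):

```shell
# Stand-in for {hadoop location}/conf/slaves, with sample internal names.
slaves=$(mktemp)
printf 'ip-10-251-1-1\nip-10-251-2-2\n' > "$slaves"

# Print one log-check command per slave; remove the surrounding quotes'
# echo to actually run the checks over ssh.
checks=$(
  while read -r host; do
    echo "ssh root@$host grep -c 'Retrying connect' /mnt/hadoop/logs/*datanode*.log"
  done < "$slaves"
)
echo "$checks"
```

A nonzero grep count on any slave means that datanode never reached the master.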

Initial Script Setup Steps
Here are the steps I used to get working with the scripts. Note that the AMI the scripts point to by default has version 0.17 of Hadoop installed.

(1) I configured my EC2_PRIVATE_KEY and EC2_CERT env vars to point to the .pem files I generated for them.
(2) In {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh, I set the following env vars:
  • AWS_ACCOUNT_ID={acct number}
  • AWS_ACCESS_KEY_ID={key id}
  • AWS_SECRET_ACCESS_KEY={secret key}
  • KEY_NAME={name of KeyPair you want to use} NOTE: on the KeyPair, the hadoop-ec2 scripts assume that the generated private key for your keypair resides in the same directory you configured your EC2_PRIVATE_KEY in.
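For concreteness, here is the shape of those settings as they end up in hadoop-ec2-env.sh. The variable names come from the script itself; every value below is an obviously fake placeholder you must replace:

```shell
# hadoop-ec2-env.sh settings (placeholder values only)
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=AKIAFAKEEXAMPLEKEY
AWS_SECRET_ACCESS_KEY=abc123FakeExampleSecretKey
KEY_NAME=gsg-keys   # the scripts expect this KeyPair's generated private
                    # key to sit in the same directory as EC2_PRIVATE_KEY
```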
(3) I ran {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} {number of desired nodes} to start up a cluster with the AMI configured at the S3_BUCKET location specified in the conf file.

At this point, I thought the cluster was up and running, but when I tried to copy a large file to the cluster, I got this error:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/root/input could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
I googled this and found that it implied my data nodes were failing (but I hadn't seen that!). I checked the masters and slaves files in the master's conf directory, and they contained only localhost, which meant that the master knew nothing about the slaves at startup.

I stopped hadoop, changed the conf/slaves config file to include the Amazon internal names of all slaves, and restarted. This time I could see the remote slave data nodes start up. So I tried the copy again, and got the same failure.
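That manual edit can be scripted. A minimal sketch, assuming the old EC2 API tools' output format for ec2-describe-instances, where the internal DNS name is the fifth field of each INSTANCE line; the two sample lines below are fake stand-ins for real output, and in practice you would pipe ec2-describe-instances itself through the filter and exclude the master's own line:

```shell
# Stand-in for {hadoop location}/conf/slaves on the master.
SLAVES_FILE=$(mktemp)

# Pull the internal hostname (field 5) out of each INSTANCE line.
awk '$1 == "INSTANCE" { print $5 }' > "$SLAVES_FILE" <<'EOF'
INSTANCE i-00000001 ami-00000000 ec2-75-101-1-1.compute-1.amazonaws.com ip-10-251-1-1.ec2.internal running
INSTANCE i-00000002 ami-00000000 ec2-75-101-2-2.compute-1.amazonaws.com ip-10-251-2-2.ec2.internal running
EOF
cat "$SLAVES_FILE"
```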

When I went out to a slave machine, I looked at the datanode log file in the log directory (on this AMI, /mnt/hadoop/logs). I saw that the datanode service was trying to contact the master with no success.

This ended up being because of the security policy of EC2. In EC2, you need to explicitly open the ports accessible on each instance via EC2 Security Groups. In summary, the scripts set up the security groups assuming the port defaults from hadoop-default.xml, but some of those defaults had been overridden in this AMI's hadoop-site.xml. To fix this, I:

  • extended conf/slaves to include the internal hostnames of the slave instances.
  • added 50001 and 50002 access to the master security group (meaning that slave nodes could talk to the master on those ports)
  • added 50010 access to the slave security group (same meaning for master to slaves)
At this point Hadoop was configured with 4 slave nodes and 1 master.
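The security-group changes above can be sketched as ec2-authorize calls. The group names ({cluster} for slaves, {cluster}-master for the master) follow the hadoop-ec2 scripts' convention, and both the cluster name and account id below are placeholders; the snippet prints the commands rather than running them, so you can review and then pipe them to sh with real values:

```shell
CLUSTER=my-cluster        # placeholder cluster / security group name
ACCOUNT=123456789012      # placeholder AWS account id

# Slaves -> master on the overridden namenode/jobtracker ports (50001,
# 50002); master -> slaves on the datanode port (50010).
rules=$(
  printf '%s\n' \
    "ec2-authorize ${CLUSTER}-master -P tcp -p 50001 -o ${CLUSTER} -u ${ACCOUNT}" \
    "ec2-authorize ${CLUSTER}-master -P tcp -p 50002 -o ${CLUSTER} -u ${ACCOUNT}" \
    "ec2-authorize ${CLUSTER} -P tcp -p 50010 -o ${CLUSTER}-master -u ${ACCOUNT}"
)
echo "$rules"
```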

Attaching an EBS to the master and copying EBS data to HDFS
(1) My data source was located in an Elastic Block Store (EBS) volume. Volumes are attached to an instance like so:

ec2-attach-volume {volume ID} -i {instance ID} -d {device location on the instance, e.g. /dev/sdh}

(2) In order to actually access the data, you mount that volume like you would mount a drive, device first and then mount point:

mount {device you attached, e.g. /dev/sdh} {directory to mount it at}
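Put together as a dry run, the sequence looks like this (all IDs and paths are placeholders; the snippet prints the commands since both need a live EC2 instance, so drop the printf/echo wrapper to execute them for real):

```shell
# Attach the EBS volume, create a mount point, mount the device.
steps=$(
  printf '%s\n' \
    'ec2-attach-volume vol-00000000 -i i-00000001 -d /dev/sdh' \
    'mkdir -p /mnt/ebs-data' \
    'mount /dev/sdh /mnt/ebs-data'
)
echo "$steps"
```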

Running the Job
(1) Once I mounted the volume, I needed to log into the master to start the job.

{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} login

(2) Then I needed to push the file to HDFS for processing.

hadoop fs -mkdir input
hadoop fs -copyFromLocal {location of articles.tsv} input

(3) From my local box, I push my jar out to the master:
{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} push {name of my jar file}

(4) On the master box, I start the job:
hadoop jar {name of jar} input output (NOTE: the job will fail immediately if output already exists in HDFS)


Comments

  1. Arun, have you tried the Cloudera AMI for EC2? That said, it still doesn't auto-mount EBS volumes for you... this is something I'm interested in doing but don't yet have enough knowledge to get my head around it...

  2. p7a, I'm actually in the middle of writing up a guest post for the Cloudera guys, who contacted me after reading this last post. It was _way_ easier. Stay tuned, I'll update when this post goes out on the Cloudera blog.

  3. When you say cluster here, do you mean you have more than one machine on your end, and all these machines execute one job on EC2? Please correct me if I am wrong. I have just started reading your post; it's long, so it is taking me time to understand.
    Your answers may help me to understand your post better.

  4. The best way to do this for starters is to install, configure and test a "local" Hadoop setup for each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master, and the other box will become only a slave.
