Tuesday, April 14, 2009

Configuring a Hadoop cluster on EC2

I've been ramping up on Amazon Elastic MapReduce, but now I need to process a 32GB file located in an Elastic Block Store volume, and I know of no way to get the AMIs that Amazon Elastic MapReduce starts up to mount an arbitrary EBS volume. So now it's time to roll my own Hadoop cluster on EC2.

I looked around for a while and found this somewhat out-of-date tutorial by Tom White, which pointed me to a set of EC2 helper scripts in the src/contrib subdir of the Hadoop installation.

Unfortunately, those scripts did not get me 'all the way there', but they were a start. I'm going to try to roll my changes into those EC2 helper scripts before I have to set up another cluster :)

Setting Up a Multi Node Hadoop Cluster on EC2

Prior to setting up a multi-node Hadoop cluster, I set up a single-node standalone installation. I recommend doing this because it let me make sure my code worked, i.e. that my jar file was valid, my Mapper and Reducer were working, etc.
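In standalone mode Hadoop runs everything in a single JVM against the local filesystem, so the whole job can be smoke-tested on a small sample before going anywhere near EC2. The jar and directory names here are placeholders for my own job:

hadoop jar {name of jar} sample-input sample-output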

Applying the standard Hadoop cluster setup instructions to the EC2 environment meant I would have to do the following:
  1. find an AMI with hadoop on it
  2. bring up N+1 of those
  3. make one the master and the rest the slaves
  4. change the master config to account for the slaves
  5. change the slave config to point to the master
  6. allow Hadoop component port access between master and slave for namenode and datanode communication
  7. start the system up.
The scripts at {hadoop src location}/src/contrib/ec2/bin use the EC2 API command-line tools to attempt all of the above. They fall short in a couple of key areas and need to be extended. I'm going to detail the steps I took to get a cluster fully operational so that I can extend those scripts in the future.

What the scripts do:
  1. Find AMIs and start up instances: N slave instances and 1 master instance.
  2. Allow you to log into the master as well as push files out to it.
  3. Generate a private/public key on the master, and push the public key out to the slaves to enable password-less ssh.
  4. Push the master hadoop-site.xml out to all slaves.
What they do not do:
  • They do not configure the master's conf/slaves file to contain the addresses of all the slaves.
  • They do not set up security groups to allow the overridden port values specified in the /etc/rc.local of the AMI I was using (those values are catted into conf/hadoop-site.xml at boot). To be fair, there is no way the scripts could be aware of those values unless they were kept in sync with the image, which they weren't.

Both of these omissions mean that true distributed startup doesn't happen. Worse, the failure is 'silent', so unless you are looking at the logs on multiple machines, you don't know that things are failing.

Initial Script Setup Steps
Here are the steps I used to get working with the scripts. Note that the AMI the scripts point to by default has version 0.17 of Hadoop installed.

(1) I configured my EC2_PRIVATE_KEY and EC2_CERT env vars to point to the .pem files I generated for them.
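For reference, that setup amounts to something like the following (the file names are placeholders for whatever you generated):

export EC2_PRIVATE_KEY=~/.ec2/pk-{your key}.pem
export EC2_CERT=~/.ec2/cert-{your cert}.pem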
(2) In {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh, I set the following env vars:
  • AWS_ACCOUNT_ID={acct number}
  • AWS_ACCESS_KEY_ID={key id}
  • AWS_SECRET_ACCESS_KEY={secret key}
  • KEY_NAME={name of KeyPair you want to use} NOTE: the hadoop-ec2 scripts assume that the generated private key for your KeyPair resides in the same directory as the key you configured in EC2_PRIVATE_KEY.
(3) I ran {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} {number of desired nodes} to start up a cluster with the AMI configured at the S3_BUCKET location specified in the conf file.

At this point, I thought the cluster was up and running, but when I tried to copy a large file to the cluster, I got this error:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/user/root/input could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
I googled this and found that it implied my data nodes were failing (though I hadn't seen that happen!). I checked the masters and slaves files in the master's conf directory, and they contained only localhost, which meant the master knew nothing about the slaves at startup.

I stopped Hadoop, changed the conf/slaves file to include the Amazon internal hostnames of all the slaves, and restarted. This time I could see the remote slave data nodes start up. So I tried the copy again, and got the same failure.
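For the record, the sequence was roughly the following; the hostnames are made-up examples of the internal ip-*.ec2.internal names Amazon assigns:

bin/stop-all.sh
cat > conf/slaves <<EOF
ip-10-251-27-12.ec2.internal
ip-10-251-42-87.ec2.internal
EOF
bin/start-all.sh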

When I went out to a slave machine, I looked at the datanode log file in the log directory (on this AMI, configured at /mnt/hadoop/logs). I saw that the datanode service was trying to contact the master with no success.

This ended up being due to the security policy of EC2: you need to explicitly configure which ports are accessible on each instance via EC2 Security Groups. In summary, the current scripts assumed the default ports from hadoop-default.xml, and I had overridden some of those defaults in hadoop-site.xml.
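To make that concrete, here is a sketch of the kind of hadoop-site.xml the image's /etc/rc.local produces. The property names are standard Hadoop 0.17 settings, but the exact values baked into the AMI may differ, and the path is a placeholder:

cat > {hadoop install location}/conf/hadoop-site.xml <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{master internal hostname}:50001</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>{master internal hostname}:50002</value>
  </property>
</configuration>
EOF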

Summary:

  • extended conf/slaves to include the addresses of the slave instances.
  • added port 50001 and 50002 access to the master security group (meaning that slave nodes could talk to the master on those ports)
  • added port 50010 access to the slave security group (same meaning, for master to slaves)
At this point Hadoop was configured with 4 slave nodes and 1 master.
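For reference, the authorization can be done with the same EC2 API tools the scripts use. This is a minimal sketch, assuming the scripts created a {cluster name}-master group for the master and a {cluster name} group for the slaves; note that group-to-group authorization in this form opens traffic between the groups wholesale, which covers (and exceeds) the three ports above:

ec2-authorize {cluster name}-master -o {cluster name} -u {AWS account ID}
ec2-authorize {cluster name} -o {cluster name}-master -u {AWS account ID}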

Attaching an EBS to the master and copying EBS data to HDFS
(1) My data source was located in an Elastic Block Store volume. These are attached to an instance like so:

ec2-attach-volume {volume ID} -i {instance ID} -d {device location on the instance, e.g. /dev/sdh}

(2) In order to actually access the data, you mount that device as you would any drive (note the order: device first, then mount point):

mount {device to mount, e.g. /dev/sdh} {name of dir to map to}
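Putting those two steps together with placeholder IDs (the volume, instance, device, and mount point below are all examples):

ec2-attach-volume vol-12345678 -i i-87654321 -d /dev/sdh
mkdir /ebs
mount /dev/sdh /ebs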

Running the Job
(1) Once I mounted the volume, I needed to log into the master to start the job.

{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} login

(2) Then I needed to push the file to HDFS for processing.

hadoop fs -mkdir input
hadoop fs -copyFromLocal {location of articles.tsv} input
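Before kicking off the job, it is worth confirming that the copy landed and that all the data nodes are reporting in; both are stock Hadoop commands:

hadoop fs -ls input
hadoop dfsadmin -report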

(3) From my local box, I push my jar out to the master:
{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} push {name of my jar file}

(4) On the master box, I start the job:
hadoop jar {name of jar} input output (NOTE: the job will fail immediately if the output directory already exists in HDFS)
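On a re-run that means clearing the old output directory first; and once the job finishes, the results can be pulled back out of HDFS. The directory names here are placeholders:

hadoop fs -rmr output
hadoop jar {name of jar} input output
hadoop fs -copyToLocal output {local results dir}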




Comments:

  1. Arun, have you tried the Cloudera AMI for EC2? That said, it still doesn't auto-mount EBS volumes for you... this is something I'm interested in doing but don't yet have enough knowledge to get my head around it...

  2. p7a, I'm actually in the middle of writing up a guest post for the Cloudera guys, who contacted me after reading this last post. It was _way_ easier. Stay tuned, I'll update when this post goes out on the Cloudera blog.

  3. When you say cluster here, do you mean that you have more than one machine on your end, and that all of these machines execute one job on EC2? Please correct me if I am wrong. Actually, I have just started reading your post; it's long, so it is taking me time to understand.
    Your answers may help me understand your post better.

  4. The best way to do this for starters is to install, configure and test a "local" Hadoop setup for each of the two Ubuntu boxes, and in a second step to "merge" these two single-node clusters into one multi-node cluster in which one Ubuntu box will become the designated master, and the other box will become only a slave.
