I've been interested in parallel processing for a while, but didn't have the time/opportunity to get into it until recently, when a group of us were asked to rewrite a data extraction system that was built around a single (relational) database, and suffering from severe i/o contention.  
The actual mechanics of what we were trying to do with that system lent themselves to parallel processing, specifically there was a lot of data transformation and aggregation that we needed to do -- in other words map-reduce. 
Right around the time that we decided to do the rewrite,  Amazon came out with Elastic MapReduce. Prior to EMR, our options were to either set up a local Hadoop cluster, or assemble one out on EC2. My experience with Hadoop was zero, and my experience with the map/reduce algorithm was limited to working through the canonical Inverted Index example in my head.  So I wanted to minimize the learning curve of setting up a cluster while still being able to validate our ideas. I don't know if we are eventually going to end up building our own AMIs for a custom Hadoop cluster, but not having to deal with those details at this point allows us to focus on the ideas instead of the infrastructure. Yay!
In my last post I explained the steps I took to run the basic streaming sample. Next up, I wanted to run my own map reduce job. In order to develop an EMR application utilizing a custom jar, I took the following steps:
(1) Set up a single node hadoop installation on my dev box.
Because Time Is Money on EC2, I wanted to get my code as right as I could before running it in the cloud. The Hadoop site provides a great step by step install sheet for Leopard. 
(2) Went through the basic Hadoop Tutorial. This tutorial provides great explanations of each class the developer has to implement, as well as the helper classes. Whenever I got stuck, the answer was usually found in the tutorial.
(3) Installed the IBM MapReduce Tools Plugin for Eclipse. While I am not quite able to get it debugging a map/reduce job from my local Hadoop install, I can run my program in 'standalone' mode as a Java Application, as long as Hadoop is started on my system. 
(4) After debugging in the IDE, I ran the job as a standard hadoop job: 
hadoop jar {name of jar} {input dir} {output dir}
I chose  to export the jar from eclipse (instead of writing a build.xml), as part of this I got to specify the main class and autogenerate the MANIFEST.MF. 
The input directory and output directory needed to be specified as HDFS directories. The output directory could not exist. 
NOTE on the application driver: I took a clue from example code (eventually) after running into unzip errors which a couple of posts said were due to bad configuration. I derived the driver class from Configuration, extended the Tool interface.
(5) After running successfully as a local Hadoop Job, I ran the job from the cloud, 
(a) I copied the jar up to s3
(b) I copied the input files up to s3
(c) I created a top level bucket for the output directories. 
I tried to run my job as it was configured for hadoop, with the input and output specified as s3n://{bucket-name} with no success. 
In order to diagnose the problem, I enabled logging, which I should have done immediately. When configuring a job via the AWS Management console, select 'Advanced Options' in the 'configure EC2 instances' section. You need to specify the log destination using s3n://{bucket-name} format. 
With logging enabled, I saw the following exception: 
java.lang.IllegalArgumentException: Path must be absolute: s3n://java-emr-test
and it made no sense, even with googling. I ran the cloudburst sample, and noticed that they specified a full filename as the input parameter. When I did the same, my run succeeded. I'm not sure why I had to do this, and want to make sure I really need to do it, because it will mean an additional read step of intermediate output directories prior to subsequent map reduce steps. 
Now that I've got this working through the UI, I'm going to look at the entangledstate ruby tool because it appears to allow programmatic configuration of multi step job flows.
 
 
Hi Arun,
ReplyDeleteYou should be able to use an Amazon S3 bucket as input directly by appending a slash "/". Try s3n://java-emr-test/.
A user on the developer forums had this same issue.
Andrew
Thanks, I'm relatively new to EC2 and am figuring this out as I go.
ReplyDeleteHey Arun,
ReplyDeleteThat's really a useful information. Can you help me regarding running a job in EMR....I almost tried 30 jobs but all failed..It shows the same old error: classnotfound exception: s3://opmtest/wordcount/input/
looks like you're not specifying your jar when running hadoop.jar myjar.jar input output, where the jar = myjar.jar
ReplyDeleteHey there, awesome blog. I really like it.
ReplyDeleteThis is an awesome radiology software program. Check it out!
Radiology Software
Let me know what you think.
Good Video. I will also add http://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html which I also find very very useful resource. I can create and debug my job. It also describle how to terminate job at runtime.
ReplyDeleteThis comment has been removed by a blog administrator.
ReplyDeleteThis comment has been removed by a blog administrator.
ReplyDelete