Waving Not Drowning: Hadoop Streaming with MRJob

Sunday, November 17, 2013

Hadoop Streaming with MRJob

Motivation to use Streaming:

Writing Java map-reduce jobs for simple tasks feels like 95% boilerplate and 5% custom code. Streaming is a much simpler interface into MapReduce, and it lets me tap into Python's rich data processing, statistical analysis, and NLP modules.

Motivation to use mrjob:

While the interface to Hadoop Streaming couldn't be simpler, not all of my jobs are simple 'one and done' map-reduces, and most of them require custom options. MRJob allows you to configure and run a single map and multiple reduces. It also does some of the blocking and tackling, letting me define custom arguments and pass them into specific jobs. Finally, mrjob can run against an on-prem cluster or an Amazon cluster, and we are looking at running Amazon clusters for specific prototype use cases.

mrjob and streaming hurdles

The mrjob documentation is excellent for getting up and running with a simple job. I'm going to assume that you have read enough to know how to subclass MRJob, set up a map and a reduce function, and run it.

I'm going to discuss some of the things that weren't completely obvious to me after I had written my first job, or even my second job. Some of these things definitely made sense after I had read through the documentation, but it took multiple reads, some debug attempts on a live cluster, and some source code inspection.

Hurdle #1: passing arguments

My first job was basically a multi-dimensional grep: I wanted to walk input data that had timestamp information in a tab-delimited field and only process those lines that fell within my specified date range. To do this I needed two range arguments that took date strings, so the mapper could do the range check. I also wanted to be able to apply specified regex patterns to those lines at map time. Because there were several regex patterns, I decided to put them in a file and parse them. So I needed to pass three arguments into my job, and those arguments were required by every mapper that ran in the cluster.
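The per-line check described above can be sketched in plain Python, outside of mrjob. This is only an illustration: the timestamp column index, the date format, the sample lines, and the pattern are all assumptions, and it uses the standard library's datetime.strptime rather than dateutil so it runs without extra installs.

```python
import re
from datetime import datetime

# Assumptions for illustration only: the timestamp sits in the first
# tab-delimited field and uses the same MM/DD/YY format as the
# command-line arguments.
DATE_FORMAT = "%m/%d/%y"
TIMESTAMP_FIELD = 0


def line_matches(line, start, end, patterns):
    """Return True if the line's timestamp is in [start, end] and any pattern matches."""
    fields = line.rstrip("\n").split("\t")
    try:
        stamp = datetime.strptime(fields[TIMESTAMP_FIELD], DATE_FORMAT)
    except (ValueError, IndexError):
        # Skip lines with a missing or unparseable timestamp.
        return False
    if not (start <= stamp <= end):
        return False
    return any(p.search(line) for p in patterns)


start = datetime.strptime("01/01/13", DATE_FORMAT)
end = datetime.strptime("12/01/13", DATE_FORMAT)
patterns = [re.compile(r"/user/\w+")]  # made-up pattern

in_range = line_matches("06/15/13\t/user/alice\t1024", start, end, patterns)
out_of_range = line_matches("12/25/13\t/user/alice\t1024", start, end, patterns)
print(in_range, out_of_range)
```

In the real job this logic would live in the mapper, with the range bounds and patterns set up once per task rather than once per line.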

In order to pass arguments into my job, I had to override the configure_options() method of MRJob and use add_passthrough_option() for the range values, and add_file_option() for the file that held the regexes:

def configure_options(self):
    super(HDFSUsageByPathMatch, self).configure_options()
    self.add_passthrough_option("--startDateRange", type='string', help='...')
    self.add_passthrough_option("--endDateRange", type='string', help='')
    self.add_file_option("--filters")

All options were passed straight through to my job from the command-line:

python job.py --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json

I referenced them in an init function of my job class, which subclassed the MRJob class:

class MyJob(MRJob):
    ...

    def task_init(self):
        self.startDateRange = dateutil.parser.parse(self.options.startDateRange)
        self.endDateRange = dateutil.parser.parse(self.options.endDateRange)
        self.filters = parseJsonOptions(self.options.filters)

This init method was specified in the MyJob.steps() override of the default MRJob method:

def steps(self):
    return [
        self.mr(mapper_init = self.task_init,
        .....
    ]
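The parseJsonOptions helper called in task_init is not shown in this post. A minimal sketch of what such a helper could look like, assuming the filters file is simply a JSON list of regex strings (the file layout and the function name parse_json_options here are my assumptions, not the original code):

```python
import json
import os
import re
import tempfile


def parse_json_options(path):
    """Load a JSON list of regex strings and compile each one (assumed file layout)."""
    with open(path) as handle:
        raw_patterns = json.load(handle)
    return [re.compile(pattern) for pattern in raw_patterns]


# Write a throwaway filters file so the sketch is runnable on its own.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump([r"^/user/", r"\.log$"], tmp)

filters = parse_json_options(tmp.name)
os.remove(tmp.name)
print([f.pattern for f in filters])
```

Compiling the patterns once in the init function means each mapper task pays the parsing cost a single time instead of on every input line.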

Something to note here: in the code I had written during development, I had neglected to really read the documentation, and as a result I had done all validation of my custom args with a standard optparse parser in my main handler. This worked for me in inline mode, which is what I was developing in. It does not work at all when running the job on a cluster, and it took some source code digging to figure out why. Do as I say, not as I do :)

In hadoop mode, the main MRJob script file is passed to the mapper and reducer nodes with the step parameter set to the appropriate element in the steps array. The entry point into the script is the default main, and MRJob has a set of default parameters it needs to pass through to the MRJob subclass. Overriding parameter handling in main effectively breaks MRJob when it tries to spawn mappers and reducers on worker nodes. MRJob handles the args for you: let it do all arg parsing, and pass custom arguments as passthrough or file options.
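To make the failure concrete, here is a sketch of a hand-rolled parser that only knows about the custom flags. The worker-side argument list is an assumption modeled on the --step-num and --mapper switches mrjob adds when it re-invokes the script; the exact arguments on a real cluster may differ.

```python
from optparse import OptionParser


def strict_parse(argv):
    """Parse argv with a parser that only knows the custom flags.

    Returns the parsed options, or None if optparse rejected the arguments.
    """
    parser = OptionParser()
    parser.add_option("--startDateRange", type="string")
    parser.add_option("--endDateRange", type="string")
    try:
        options, _ = parser.parse_args(argv)
    except SystemExit:
        # optparse prints usage and exits on any flag it was not told about.
        return None
    return options


# What I typed on the command line while developing in inline mode.
driver_argv = ["--startDateRange", "01/01/13", "--endDateRange", "12/01/13"]

# Roughly what a worker-side invocation looks like (assumed shape).
worker_argv = ["--step-num=0", "--mapper"] + driver_argv

driver_ok = strict_parse(driver_argv) is not None
worker_ok = strict_parse(worker_argv) is not None
print(driver_ok, worker_ok)
```

The parser accepts the driver arguments and rejects the worker ones, which is why the job looked fine locally and died on the cluster.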

Hurdle #2: passing python modules

This nuance has more to do with streaming than it does with mrjob. But it's worth understanding if you're going to leverage non-standard Python modules in your mapper or reducer code, and those modules have not been installed on all of your datanodes.

I was using the dateutil module because it makes parsing dates from strings super easy. On a single node, getting dateutil up and running is this hard:

easy_install python-dateutil

But when you're running a streaming job on a cluster, that isn't an option. Or, it wasn't an option for me because the ops team didn't give me sudoers permissions on the cluster nodes, and even if they did, I would have had to write the install script to ssh in, do the install, and roll back on error. Arrgh, too hard.

What worked for me was to

Download the source code
Zip it up (it arrived in tar.gz)
Change the extension of the zip file because files that end in .zip are automatically moved to the lib folder of the task's working directory
Access it from within my script by putting it into the load path:

import sys
sys.path.insert(0,'dateutil.mod/dateutil')
import dateutil
...

I'm passing dateutil.mod into the job as a file, via add_file_option() in my job's configure_options(). Using add_file_option() puts dateutil.mod in the Hadoop job's local working directory:

def configure_options(self):
    super(HDFSUsageByPathMatch, self).configure_options()
    ....
    self.add_file_option("--dateutil")

Three things to note from the above code: (1) dateutil.mod is the zip file, (2) I'm referencing a module within the zip file by its path location in that zip file, and (3) because I've renamed the file, it gets placed in the job working directory, which means it is on my path by default.

This is how I pass dateutil.mod into the job:

python job.py ... --dateutil dateutil.mod
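The trick relies on Python being able to import straight from a zip archive on sys.path, whatever the archive's file extension is. A small self-contained sketch of that mechanism, using a made-up package called demo_pkg in place of dateutil and a temporary directory in place of the task's working directory:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Stand-in for the task's working directory.
workdir = tempfile.mkdtemp()

# A zip archive with a non-.zip extension, like dateutil.mod in the post.
archive = os.path.join(workdir, "demo_pkg.mod")
with zipfile.ZipFile(archive, "w") as zf:
    # demo_pkg is a hypothetical one-file package used only for this sketch.
    zf.writestr(
        "demo_pkg/__init__.py",
        "def greet():\n    return 'hello from the archive'\n",
    )

# Putting the archive on sys.path makes its top-level packages importable.
sys.path.insert(0, archive)
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

message = demo_pkg.greet()
print(message)
```

If the package sits inside a subdirectory of the archive, the sys.path entry has to point at that subdirectory within the archive, which is what the path in my own script is doing.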

Hurdle #3 (not quite cleared): chaining reduces vs map-reduces

As mentioned in the doc, it's super easy to chain reduces to do successive filtering and processing. Simply specify your multiple reduces in the steps() override:

def steps(self):
    return [
        self.mr(mapper_init = self.task_init,
                mapper=self.mapper_filter_matches,
                combiner=self.combiner_sum_usage,
                reducer=self.reducer_sum_usage),
        self.mr(reducer_init = self.task_init,
                reducer=self.reducer_filter_keys)
    ]

I haven't found it necessary to run successive mapreduces -- successive reduces work just as well in the use cases I've tried. When chaining reduces to the end of your first mapreduce, you can specify the key value from the first mapreduce as the key value in the next reduce.
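To show what "the key from the first step is the key in the next reduce" means, here is a plain-Python simulation of the two chained steps, with no Hadoop or mrjob involved. The reducer names match the ones in steps() above, but their bodies, the sample data, and the threshold are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(pairs, reducer):
    """Simulate shuffle-and-reduce: group (key, value) pairs by key, then reduce each group."""
    output = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        output.extend(reducer(key, (value for _, value in group)))
    return output


def reducer_sum_usage(path, sizes):
    # First step: total usage per path (assumed body).
    yield path, sum(sizes)


def reducer_filter_keys(path, totals):
    # Second step: keep only paths above a hypothetical threshold.
    total = sum(totals)
    if total >= 100:
        yield path, total


# Made-up mapper output: (path, bytes used).
mapped = [("/a", 60), ("/b", 10), ("/a", 50), ("/b", 5)]

step1 = run_reduce(mapped, reducer_sum_usage)
step2 = run_reduce(step1, reducer_filter_keys)
print(step1)
print(step2)
```

The second reducer sees exactly the keys the first reducer emitted, so no mapper is needed in between unless the data has to be re-keyed.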

What is not easy at this time is saving intermediate output to a permanent location. While doing that is relatively straightforward in 'inline' mode, the approach suggested in the link won't work in hadoop mode, because MRJob invokes the python script with the right --step-num argument based on what it sees in the steps() method.

I did read about the --cleanup option, but from what I understand the intermediate output dir of a complex job is based on a naming convention, not on something I can set. As this is somewhat of an edge case, I can work around it by chaining MRJob runs with Oozie.

Summary

What I've learned about MRJob is that while it does a great job of allowing you to set and pass options, and allows you to construct good workflows (assuming you don't care about intermediate output), it is so easy to use that I fell into the trap of believing that running locally on my machine was equivalent to running on a hadoop cluster.

As I've found out several times above, that is not the case. For me the keys here are (1) let MRJob handle your job specific variables, (2) leverage the steps() method for your more complex flows, and (3) if you need to save intermediate output, chain your jobs using an external scheduler.
