Monday, April 18, 2011

Rolling out splittable lzo on CDH3

Until splittable lzo, compression options in HDFS were limited. Gzip generated unsplittable output -- great for reducing allocated block usage, terrible for mapreduce efficiency. Bz2 generated splittable output, but took far too long to be effectively used in production.

When we wanted to start incorporating compression into our storage procedures, splittable lzo was the only rational option to ensure parallel processing of compressed files.

We had tried to use bz2 compression on files prior to ingestion, but it took much longer: going by the timings below, roughly 7x as long as gzip and 20x as long as lzo on the same file.

For a 1GB text file:
  • gzip -1 took ~25 seconds (actually, this is strange: I was expecting gzip to be slightly faster than lzo)
  • lzo -1 took ~9 seconds; indexing took another 4
  • bzip2 -1 took ~3 minutes

I ran each compression routine at its fastest setting (-1) to provide a relative benchmark; in reality we would run at a slower setting that compresses more.
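
To make the relative cost concrete, the timings above can be turned into ratios. The following is a minimal Java sketch: the class and method names are made up for illustration, and the inputs are just the approximate timings listed above (with ~3 minutes taken as 180 seconds).

```java
import java.util.Locale;

public class CompressionTimings {

    // How many times longer `seconds` is than `baselineSeconds`.
    public static double slowdown(double seconds, double baselineSeconds) {
        return seconds / baselineSeconds;
    }

    public static void main(String[] args) {
        // Approximate wall-clock timings for the 1GB text file, in seconds.
        double gzip = 25.0;             // gzip -1
        double lzo = 9.0;               // lzo -1, compression only
        double lzoIndexed = 9.0 + 4.0;  // lzo -1 plus indexing
        double bzip2 = 180.0;           // bzip2 -1, ~3 minutes

        System.out.println(String.format(Locale.ROOT,
                "bzip2 vs gzip:          %.1fx", slowdown(bzip2, gzip)));
        System.out.println(String.format(Locale.ROOT,
                "bzip2 vs lzo:           %.1fx", slowdown(bzip2, lzo)));
        System.out.println(String.format(Locale.ROOT,
                "bzip2 vs lzo + index:   %.1fx", slowdown(bzip2, lzoIndexed)));
        System.out.println(String.format(Locale.ROOT,
                "gzip vs lzo:            %.1fx", slowdown(gzip, lzo)));
    }
}
```

On these numbers bzip2 comes out at roughly 7x the cost of gzip and 20x the cost of lzo, and lzo stays well ahead of gzip even once indexing time is included.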

Installing The Bits

The java and native source for splittable lzo can be found at If you're using the Cloudera distro, you should use the fork.

The cluster I was installing splittable lzo on was running CentOS and was walled off from the rest of the world. I found it easiest to generate RPMs on a box with the same architecture, then install those RPMs on all nodes in the cluster. I did this using the code, which takes the native and java components and installs them to the right locations. Note that since I was building on a CentOS box, I ran

./ --no-deb

to build RPMs only. There were two RPMs: the standard one and the debug-info one. The naming convention appears to be YYYYmmDDHHMMSS.full.version.git_hash_of_hadoop_lzo_project.arch, which lets you upgrade when either the packaging code or the original hadoop lzo code changes.

The RPMs installed the following java and native bits (note that the packager timestamps the jars):

rpm -ql cloudera-hadoop-lzo-20110414162014.


rpm -ql cloudera-hadoop-lzo-debuginfo-20110414162014.


Hadoop Configuration Changes

After installing the bits via RPMs, there were a couple of changes necessary to get Hadoop to recognize the new codec.

In core-site.xml:


registers the codec in the codec factory.
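
For reference, the hadoop-lzo README describes codec registration with properties along the following lines. Treat this as a sketch based on that README, not necessarily the exact values used on this cluster. In particular, the other codecs listed in io.compression.codecs are an assumption and should match whatever your cluster already registers.

```xml
<!-- core-site.xml: sketch based on the hadoop-lzo README -->
<property>
  <name>io.compression.codecs</name>
  <!-- Existing codecs (assumed here) plus the two LZO codecs -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```
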
In mapred-site.xml:


sets intermediate output to be lzo compressed. After pushing configs out to all nodes in the cluster, I restarted the cluster. The next step was to verify that lzo was installed correctly.
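
For reference, the hadoop-lzo README describes the mapred-site.xml setting for intermediate (map output) compression using the 0.20-era property names below. This is a sketch based on that README, not necessarily the exact fragment used on this cluster.

```xml
<!-- mapred-site.xml: sketch based on the hadoop-lzo README -->
<property>
  <!-- Compress map output before it is shuffled to reducers -->
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```
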


There were some hiccups I ran into during validation -- all pilot error, but I wanted to put them all in one place for next time. My validation steps looked like this:

(1) create an lzo file that was greater than my block size.
(2) upload and index it.
(3) run a mapreduce using the default IdentityMapper
(4) verify that multiple mappers were run from the one lzo file.
(5) verify that the output was the same size and format as the input.

My first mistake: I lzo compressed a set of files. The splittable lzo code only works with a single file. This took me a while to figure out -- mostly due to tired brain. After I had catted the files together into a single file, then lzo'd that file, I was able to upload it to HDFS and index it:

hadoop jar /usr/lib/hadoop/lib/cloudera-hadoop-lzo-20110414162014. com.hadoop.compression.lzo.LzoIndexer /tmp/out.lzo

This created an index file. From this great article on the Cloudera site: "Once the index file has been created, any LZO-based input format can split compressed data by first loading the index, and then nudging the default input splits forward to the next block boundaries."
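
The "nudging" described in that quote can be sketched in a few lines of plain Java. This is a simplified illustration, not the hadoop-lzo implementation: the class name, method names, and block offsets are all made up, and the index is assumed to be nothing more than a sorted array of byte offsets at which compressed blocks begin.

```java
import java.util.ArrayList;
import java.util.List;

public class LzoSplitSketch {

    // Smallest indexed block offset that is >= pos, or -1 if there is none.
    public static long nextBlockOffset(long[] blockOffsets, long pos) {
        for (long offset : blockOffsets) {
            if (offset >= pos) {
                return offset;
            }
        }
        return -1;
    }

    // Cut the file into fixed-size default splits, then move each split's
    // start and end forward to the next block boundary from the index.
    public static List<long[]> alignSplits(long[] blockOffsets, long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);

            // The first split keeps offset 0; later splits are nudged forward.
            long alignedStart = (start == 0) ? 0 : nextBlockOffset(blockOffsets, start);
            if (alignedStart < 0 || alignedStart >= end) {
                continue; // no block begins inside this split, so drop it
            }

            // The end is nudged forward too, or clamped to the end of the file.
            long alignedEnd = nextBlockOffset(blockOffsets, end);
            if (alignedEnd < 0) {
                alignedEnd = fileLength;
            }
            splits.add(new long[] {alignedStart, alignedEnd});
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical index: compressed blocks start at these byte offsets.
        long[] blockOffsets = {0, 40, 90, 130, 180, 230};
        for (long[] split : alignSplits(blockOffsets, 260, 100)) {
            System.out.println(split[0] + " - " + split[1]);
        }
    }
}
```

With these made-up offsets, the default splits 0-100, 100-200 and 200-260 become 0-130, 130-230 and 230-260, so every split starts at the beginning of a compressed block and the splits still cover the whole file without overlap.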

Since I had an uploaded, indexed file at this point, I moved on to steps 3 and 4. Before I could make the IdentityMapper, I needed to get the LZO bits on my Mac so that the IdentityMapper could run.

Detour: Getting the Bits on my Mac

I dev on a Mac, but run the cluster on CentOS (I can already feel the wrath of Ted Dziuba coming down from on high). I found the instructions here adequate for getting my changes to the IdentityMapper code to compile.

Back to Validation

I ran an IdentityMapper on the original source (side note: in 0.20, to run an IdentityMapper, just don't specify a mapper; the default Mapper class implements pass-through mapping). I watched the cluster to make sure that the original file was split across mappers. It wasn't. I was stumped -- I knew this was something simple, but couldn't see what it was.

After a gentle reminder from Cloudera Support (one of many in the last couple of days, actually:), I set my input format class to LzoTextInputFormat, which -- as the same article above mentions in the next sentence -- "splits compressed data by first loading the index, and then nudges the default input splits forward to the next block boundaries. With these nudged splits, each mapper gets an input split that is aligned to block boundaries, meaning it can more or less just wrap its InputStream in an LzopInputStream and be done." When I had used the default TextInputFormat, the mapreduce was working, but the input was being decompressed as a single unsplit file rather than split across mappers.


Once I had observed splitting behavior from my indexed lzo file by confirming multiple map tasks, I made sure that output was recompressed as lzo by setting FileOutputFormat properties:

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

This is different from the instructions in Hadoop: The Definitive Guide, and I found it after some googling around. The instructions in the book (setting properties in the Configuration object) did not work, most likely because the book was written for an earlier version of Hadoop.

Once I had added those lines to my Tool subclass, I was able to get compressed output that matched my compressed input: the exact result I was looking for when validating using the IdentityMapper.
