<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-8840067776782114927</id><updated>2012-01-23T17:19:53.087-08:00</updated><category term='amazon aws'/><category term='rails activeresource'/><category term='Leopard CouchDB'/><category term='Rails'/><category term='GPS training'/><category term='Cycling'/><category term='cloud'/><category term='rails ubuntu centos migration'/><category term='console debugging'/><category term='training running'/><category term='rails ubuntu centos migration mod_rails'/><category term='RRD'/><category term='rails ubuntu centos migration rrd'/><category term='Flot JQuery javascript graphing'/><category term='running'/><category term='Commuting'/><category term='Firebug'/><category term='Breathing'/><category term='Pie chart'/><category term='rails ubuntu centos migration postgresql'/><category term='Ruby'/><category term='family'/><category term='Hadoop EC2 Amazon AWS'/><category term='Leela'/><category term='Net::HTTP'/><category term='Singlepspeed'/><category term='map-reduce'/><category term='JavaScript'/><category term='Swimming'/><category term='Total Immersion'/><category term='data'/><category term='training'/><category term='Kiran'/><category term='Basic Authentication'/><title type='text'>Waving Not Drowning</title><subtitle type='html'>Things that I think are worth remembering...</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default?max-results=100'/><link rel='alternate' type='text/html' 
href='http://arunxjacob.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>65</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1342803052950388845</id><published>2011-11-02T07:05:00.000-07:00</published><updated>2012-01-23T15:55:41.461-08:00</updated><title type='text'>Schema On Read? Not so fast!</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;I just got back from &lt;a href="http://www.hadoopworld.com/"&gt;HadoopWorld&lt;/a&gt;. I have many thoughts on what I saw and heard there, but that is probably a separate post. I've been trying to write something for the last 3 months, and HadoopWorld gave me the clarity I needed to finish it.&lt;br /&gt;&lt;br /&gt;There was this phrase I kept hearing in the halls and meeting rooms....&lt;a href="http://howsoftwareisbuilt.com/2010/01/06/interview-with-amr-awadallah-cloudera-cto/"&gt;"Schema on Read"&lt;/a&gt;. Schema on Read, in contrast to Schema on Write, gives you the freedom to define your schema after you've stored your data. Schema on Read sounds like Freedom. 
And who wouldn't &amp;nbsp;like Freedom from the Tyranny of Schemas?&lt;br /&gt;&lt;br /&gt;If, by the way, that phrase rings a bell, it may be because of this:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;object class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://2.gvt0.com/vi/hEqQMLSXQlY/0.jpg" height="266" width="320"&gt;&lt;param name="movie" value="http://www.youtube.com/v/hEqQMLSXQlY&amp;fs=1&amp;source=uds" /&gt;&lt;param name="bgcolor" value="#FFFFFF" /&gt;&lt;embed width="320" height="266"  src="http://www.youtube.com/v/hEqQMLSXQlY&amp;fs=1&amp;source=uds" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/div&gt;&lt;br /&gt;All &lt;a href="http://knowyourmeme.com/memes/downfall-hitler-reacts#.TrDZbFY9wgw"&gt;Downfall Meme&lt;/a&gt; kidding aside, Schema on Read&amp;nbsp;sure seems nice. Because there is nothing in any of the current NoSQL storage engines that enforces a schema down to the column level, can we just not worry about schemas? What's the point?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Schemas -- good for the consumer, bad for the producer.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The point is that schemas are a guarantee to a consumer of the data. They guarantee that the data follows a specific format, which makes it easier to consume. That's great for the consumer. They get to write processes that can safely assume that the data has structure. &lt;br /&gt;&lt;br /&gt;But...not without a cost, to someone. For the producer of the data, &amp;nbsp;schemas can suck. 
&amp;nbsp;Several reasons:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Because you never get them right the first time, and you're left doing schema versioning and writing upgrade scripts to compensate for your sins of omission -- basically you get nailed for not being omniscient when you designed the schema.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Because they don't do well with variability. If you have data with a high rate of variability, your schema is guaranteed to change every time you encounter values/types/ranges that you didn't expect.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;The various parties pumping freedom from schemas via NoSQL technologies seem to have an additional implicit message -- that &lt;i&gt;even though you don't have to lock down your data to a schema, you still get the benefits of having one -- the data is still usable&lt;/i&gt;. Specifically, if you don't define or partially define the data, you can still get value from it. Because you're storing it. Is that true?&lt;br /&gt;&lt;br /&gt;Sure it is. Sort of. Take a file in HDFS. If the file isn't formatted in a specific manner, can it still be processed? As long as you can split it, you can process it. Will the processing code need to account for an undefined range of textual boundary conditions? Absolutely. That code will be guaranteed to break arbitrarily because the format of the data is arbitrary.&lt;br /&gt;&lt;br /&gt;The same thing can happen with column families. Any code that processes a schema-free column family needs to be prepared to deal with any kind of column that is added into that column family. Again, the processing code needs to be incredibly flexible to deal with potentially unconstrained change. 
Document stores are another example where even though the data is parse-able, your processing code may need to achieve sentience a la &lt;a href="http://en.wikipedia.org/wiki/Skynet_(Terminator)"&gt;Skynet&lt;/a&gt;&amp;nbsp;in order to process it.&lt;br /&gt;&lt;br /&gt;So, yes, you can get value from randomly formatted data, if you can write bulletproof, highly adaptable code. That will eventually take over the world and produce cyborgs with Austrian accents that travel back in time.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://cache.gawker.com/assets/images/io9/2011/04/terminator-3-the-redemption.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="320" src="http://cache.gawker.com/assets/images/io9/2011/04/terminator-3-the-redemption.jpg" width="222" /&gt;&lt;/a&gt;&lt;/div&gt;But what about those of us that process (semi) structured data? Web logs, for example. Or (XML/JSON) data feeds. Things that have some kind of structure, where the meaning and presence -- aka the semantics -- of fields may change but the separators between them don't. Do we really need freedom from the tyranny of something that guarantees structure when we are processing things that have a basic structure?&lt;br /&gt;&lt;br /&gt;Yes. Even though format may be well defined, semantics can be quite variable. Fields may be optional, mileage may vary. &amp;nbsp;Putting some kind of schematic constraint on all data invalidates one of the key bonuses of big data platforms -- we wouldn't be able to clean it because we wouldn't be able to load it if we had to adhere to some kind of well defined format. 
In the big data world, imposing a schema early on would not only suck, it would suck at scale.&lt;br /&gt;&lt;br /&gt;However, the moment we have done some kind of data cleansing, and have data that is ready for consumption, it makes sense to bind that data to a schema. Note: by schema I'm talking about a definition of the data that is machine readable. JSON keys and values work quite well, regardless of the actual storage engine.&lt;br /&gt;&lt;br /&gt;Because the moment you guarantee your data can adhere to a schema, you liberate your data consumers. You give them...freedom! Freedom from....the tyranny of undefined data!&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;But wait, that's not all...&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;What else comes along for free? How about a quality bar by which you can judge all data you ingest? If your data goes through a cleansing process, you could publish how much data didn't make it through the cleansing process. Consumers could opt out of data consumption if too much data was lost.&lt;br /&gt;&lt;br /&gt;And when your data changes (because it will) your downstream customers can opt out because none of the data would pass validation. This &lt;a href="http://martinfowler.com/ieeeSoftware/failFast.pdf"&gt;fast failure&lt;/a&gt; mode is much preferred to the one in which you discover that your financial reports were wrong after a month because of an undetected format change. 
That isn't an urban myth, it actually happened to -- &lt;i&gt;ahem&lt;/i&gt; -- a friend of mine, who was left to contemplate the phrase that '&lt;a href="http://www.childhoodaffirmations.com/stage7/motivation.html"&gt;pain is a very effective motivator&lt;/a&gt;' while scrambling to fix the issue :)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;So what does this all mean?&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;While the costs of Schema on Write are (a) well known and (b) onerous, Schema on Read is not much better the moment you have to maintain processes that consume the data.&lt;br /&gt;&lt;br /&gt;However, by leveraging the flexibility of&amp;nbsp;Hadoop, Cassandra, HBase, Mongo, etc, and loading the data in without a schema, I can then rationalize (clean) the data and apply a schema at that point. This provides freedom to the data producer while they are discovering what the data actually looks like, and freedom to the data consumer because they know what to expect. It also lets me change over time in a &amp;nbsp;controlled manner that my consumers can opt in or out of.&lt;br /&gt;&lt;br /&gt;That's not Schema on Read or Schema on Write, it's more like Eventual Schema. 
And I think it's a rational compromise between the two.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1342803052950388845/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2011/11/schema-on-read-not-so-fast.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1342803052950388845'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1342803052950388845'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2011/11/schema-on-read-not-so-fast.html' title='Schema On Read? Not so fast!'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2853601386614805529</id><published>2011-04-18T20:42:00.000-07:00</published><updated>2011-04-18T20:42:58.456-07:00</updated><title type='text'>Rolling out splittable lzo on CDH3</title><content type='html'>Until splittable lzo, compression options in HDFS were limited. Gzip generated unsplittable output -- great for reducing allocated block usage, terrible for mapreduce efficiency. 
Bz2 generated splittable output, but took far too long to be effectively used in production.&lt;br /&gt;&lt;br /&gt;When we wanted to start incorporating compression into our storage procedures, splittable lzo was the only rational option to ensure parallel processing of compressed files.&lt;br /&gt;&lt;br /&gt;We had tried to use bz2 compression on files prior to ingestion, but it took much longer -- approximately 20x as long as gzip compression on the same file.&lt;br /&gt;&lt;br /&gt;For a 1GB text file,&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;gzip -1&lt;/i&gt;&lt;/b&gt; took ~ 25 seconds (actually, this is strange. I was expecting gzip to be slightly faster than lzo)&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;lzo -1&lt;/i&gt;&lt;/b&gt; took ~ 9 seconds, indexing took another 4.&lt;/li&gt;&lt;li&gt;&lt;b&gt;&lt;i&gt;bzip2 -1&lt;/i&gt;&lt;/b&gt; took ~ 3 minutes.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;I set the max speed of each compression routine to provide a relative benchmark: in reality we would be running at a slower speed that increased compression.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Installing The Bits&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Java and native source for splittable lzo can be found at&amp;nbsp;&lt;a href="https://github.com/kevinweil/hadoop-lzo"&gt;https://github.com/kevinweil/hadoop-lzo&lt;/a&gt;. 
If you're using the Cloudera distro, you should use the &lt;a href="https://github.com/toddlipcon/hadoop-lzo"&gt;https://github.com/toddlipcon/hadoop-lzo&lt;/a&gt; fork.&lt;br /&gt;&lt;br /&gt;The cluster I was installing splittable lzo on was running CentOS and walled off from the rest of the world. I found it easiest to generate RPMs on a box with the same architecture, then install those RPMs on all nodes in the cluster. I did this using the&amp;nbsp;&lt;a href="https://github.com/toddlipcon/hadoop-lzo-packager"&gt;https://github.com/toddlipcon/hadoop-lzo-packager&lt;/a&gt; code, which takes the native and Java components and installs them to the right locations. Note that since I was building on a CentOS box, I ran&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;./run.sh --no-deb&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;to build RPMs only. There were two RPMs, the standard one and the debug-info one. The naming convention appears to be YYYYmmDDHHMMSS.full.version.git_hash_of_hadoop_lzo_project.arch, to allow you to upgrade when either the packaging code or the original hadoop lzo code changes.&lt;br /&gt;&lt;br /&gt;The RPMs installed the following Java and native bits (note that the packager timestamps the jars):&lt;br /&gt;&lt;br /&gt;&lt;i&gt; rpm -ql cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b-1.x86_64&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt; &lt;br /&gt;&lt;pre&gt;/usr/lib/hadoop-0.20/lib/cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b.jar&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.a&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.la&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so&lt;br /&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0&lt;br 
/&gt;/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.0.0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;i&gt; rpm -ql cloudera-hadoop-lzo-debuginfo-20110414162014.0.4.10.0.g2bd0d5b-1.x86_64&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;pre&gt;/usr/lib/debug&lt;br /&gt;/usr/lib/debug/usr&lt;br /&gt;/usr/lib/debug/usr/lib&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib/native&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.0.0.debug&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.debug&lt;br /&gt;/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.debug&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Hadoop Configuration Changes&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;After installing the bits via RPMs, there were a couple of changes necessary to get Hadoop to recognize the new codec.&lt;/div&gt;&lt;br /&gt;In core-site.xml:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;property&amp;gt;&lt;br /&gt;  &amp;lt;name&amp;gt;io.compression.codecs&amp;lt;/name&amp;gt;&lt;br /&gt;  &amp;lt;value&amp;gt;&lt;br /&gt;    org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,&lt;br /&gt;    com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,&lt;br /&gt;    org.apache.hadoop.io.compress.BZip2Codec&lt;br /&gt;  &amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;  &amp;lt;name&amp;gt;io.compression.codec.lzo.class&amp;lt;/name&amp;gt;&lt;br /&gt;  &amp;lt;value&amp;gt;com.hadoop.compression.lzo.LzoCodec&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br 
/&gt;registers the codec in the codec factory.  &lt;br /&gt;In mapred-site.xml: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;property&amp;gt;&lt;br /&gt;   &amp;lt;name&amp;gt;mapred.compress.map.output&amp;lt;/name&amp;gt;&lt;br /&gt;   &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;&lt;br /&gt; &amp;lt;/property&amp;gt;&lt;br /&gt; &amp;lt;property&amp;gt;&lt;br /&gt;   &amp;lt;name&amp;gt;mapred.map.output.compression.codec&amp;lt;/name&amp;gt;&lt;br /&gt;   &amp;lt;value&amp;gt;com.hadoop.compression.lzo.LzoCodec&amp;lt;/value&amp;gt;&lt;br /&gt; &amp;lt;/property&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;sets intermediate output to be lzo compressed. After pushing configs out to all nodes in the cluster, I restarted the cluster. The next step was to verify that lzo was installed correctly.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Validation&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There were some hiccups I ran into during validation -- all pilot error, but I wanted to put them all in one place for next time. My validation steps looked like this:&lt;br /&gt;&lt;br /&gt;(1) create an lzo file that was greater than my block size.&lt;br /&gt;(2) upload and index it.&lt;br /&gt;(3) run a mapreduce using the default IdentityMapper&lt;br /&gt;(4) verify that multiple mappers were run from the one lzo file.&lt;br /&gt;(5) verify that the output was the same size and format as the input.&lt;br /&gt;&lt;br /&gt;My first mistake: I lzo compressed a set of files. &lt;b&gt;&lt;i&gt;&lt;u&gt;The splittable lzo code only works with a single file&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;. This took me a while to figure out -- mostly due to tired brain. 
After I had catted the files together into a single file, then lzo'd that file, I was able to upload it to HDFS and index it:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;hadoop jar /usr/lib/hadoop/lib/cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b.jar com.hadoop.compression.lzo.LzoIndexer /tmp/out.lzo&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This created an index file. From this great &lt;a href="http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/"&gt;article&lt;/a&gt; on the Cloudera site: "Once the index file has been created, any LZO-based input format can split compressed data by first loading the index, and then nudging the default input splits forward to the next block boundaries."&lt;br /&gt;&lt;br /&gt;Since I had an uploaded, indexed file at this point, I moved on to steps 3 and 4. Before I could make the IdentityMapper, I needed to get the LZO bits on my Mac so that the IdentityMapper could run.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Detour: Getting the Bits on my Mac&lt;/b&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I dev on a Mac, but run the cluster on CentOS (I can already feel the &lt;a href="http://teddziuba.com/2011/03/osx-unsuitable-web-development.html"&gt;wrath of Ted Dziuba&lt;/a&gt; coming down from on high). 
I found the instructions&amp;nbsp;&lt;a href="http://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1"&gt;here&lt;/a&gt;&amp;nbsp;adequate for getting my changes to the IdentityMapper code to compile.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Back to Validation&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;I ran an IdentityMapper on the original source &lt;i&gt;(side note: in 0.20, to run IdentityMapper, just don't specify a mapper; the default Mapper class implements pass-through mapping)&lt;/i&gt;. I watched the cluster to make sure that the original file was split out across mappers. It wasn't. I was stumped -- I knew this was something simple, but couldn't see what it was.&lt;br /&gt;&lt;br /&gt;After a gentle reminder from Cloudera Support (one of many in the last couple of days, actually :), I &lt;b&gt;&lt;i&gt;set my input format class to LzoTextInputFormat&lt;/i&gt;&lt;/b&gt;, which -- as the same article above mentions in the next sentence -- "splits compressed data by first loading the index, and then nudges the default input splits forward to the next block boundaries. With these nudged splits, each mapper gets an input split that is aligned to block boundaries, meaning it can more or less just wrap its InputStream in an LzopInputStream and be done." 
When I had used the default TextInputFormat, the mapreduce ran, but the compressed input was not split.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;job.setInputFormatClass(LzoTextInputFormat.class);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once I had observed splitting behavior from my indexed lzo file by confirming multiple map tasks, I made sure that output was recompressed as lzo by setting FileOutputFormat properties:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;FileOutputFormat.setCompressOutput(job, true);&lt;br /&gt;FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is different from the instructions in &lt;a href="http://oreilly.com/catalog/9780596521981"&gt;Hadoop: The Definitive Guide&lt;/a&gt;, and I found it after some googling around. The instructions in the book -- setting properties in the Configuration object -- did not work, most likely because the book was written for an earlier version of Hadoop.&lt;br /&gt;&lt;br /&gt;Once I had added those lines to my Tool subclass, I was able to get compressed output that matched my compressed input: the exact result I was looking for when validating using the IdentityMapper.&lt;/div&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2853601386614805529/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2853601386614805529'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2853601386614805529'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html' title='Rolling out splittable lzo on CDH3'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6343133085638535208</id><published>2011-04-11T22:15:00.000-07:00</published><updated>2011-06-15T21:45:48.713-07:00</updated><title type='text'>HDFS file size vs allocation</title><content type='html'>Recently, I had to understand HDFS at a deeper level that had nothing to do with running mapreduce jobs or writing to the FileSystem API. 
Specifically, I had to understand the way that HDFS interacts with the underlying filesystem, and the difference between actual HDFS file size and the way HDFS calculates available storage when using quotas.&lt;br /&gt;&lt;br /&gt;We recently discovered a bunch of files that were much smaller than our allocated block size -- on average they took up roughly 1/10th of an allocated block.&lt;br /&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;This was not the &lt;a href="http://www.cloudera.com/blog/2009/02/the-small-files-problem/"&gt;standard small file problem&lt;/a&gt;,&amp;nbsp;where the namenode requires too much memory to track metadata for large (10s of millions) numbers of files at 150 bytes of metadata per file.&lt;/div&gt;&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;My immediate conclusion was that these small files were effectively taking up a block at a time, and that we were running out of space -- fast! -- &amp;nbsp;because that was the behavior I thought I was seeing at the HDFS level -- I thought that &amp;nbsp;storage was allocated a block at a time, and quotas were determined based on available blocks.&lt;br /&gt;&lt;br /&gt;That last statement is partially correct. Storage is allocated a block -- actually a block * replication factor -- at a time. However &lt;b&gt;&lt;i&gt;quotas are determined based on available bytes&lt;/i&gt;&lt;/b&gt;. A space quota, &lt;a href="http://hadoop.apache.org/common/docs/current/hdfs_quota_admin_guide.html"&gt;according to the docs&lt;/a&gt; is "&lt;span class="Apple-style-span" style="font-family: Verdana, Helvetica, sans-serif; line-height: 15px;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;a hard limit on the number of bytes used by files in the tree rooted at that directory. 
Block allocations fail if the quota would not allow a full block to be written."&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;This is what that means: &lt;b&gt;&lt;i&gt;&lt;u&gt;the only time files are measured in blocks is at block allocation time&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;. The rest of the time, files are measured in bytes. The space quota is calculated against the number of bytes, not blocks, left in the cluster. That number of bytes is converted to the number of blocks (not bytes) that would be required to store a file when a user tries to upload a file. &lt;b&gt;&lt;i&gt;&lt;u&gt;The key here is that space is calculated in blocks at allocation time, so no matter how small a file is, you will always need 1 block * replication factor available to put it in the cluster.&lt;/u&gt;&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;HDFS Operational Details&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;I spent some time asking, researching, and re-reading &lt;a href="http://oreilly.com/catalog/9780596521981"&gt;the book&lt;/a&gt;, and found that making analogies from a standard filesystem to understand HDFS helped me immensely -- to a point (more on that later).&lt;br /&gt;&lt;br /&gt;In a standard filesystem, an inode contains file metadata, like permissions, ownership, last time changed, etc, in addition to a set of 
pointers that point to all blocks that comprise the file. Inodes are kept in a specific location in the filesystem and are used to access files.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The inode and block equivalents in HDFS are distributed across the namenode and the datanodes.&lt;br /&gt;&lt;br /&gt;The namenode maintains file system metadata, which is analogous to the inodes in a standard FS. This metadata is stored in {dfs.name.dir}/current. Datanodes contain blocks of data, stored as block files in the underlying filesystem.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;On the datanode, HDFS stores block data in files in the directory specified by dfs.data.dir, which defaults to {hadoop.tmp.dir}/dfs/data/current. HDFS may create subdirectories underneath that dir to balance out files across directories (many filesystems have a file-per-directory limit). The raw data per block is kept in two files: a blk_NNNN file, and a corresponding blk_NNNN_XXXX.meta file, which contains the block checksum used in block integrity checks.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;The block file and checksum file information is periodically sent to the namenode as a blockreport -- e.g. 
at HDFS startup (HDFS enters&amp;nbsp;&lt;a href="http://safemode/"&gt;safemode&lt;/a&gt;&amp;nbsp;while the namenode processes block reports from its constituent datanodes). Note that each datanode has no idea which block files map to which actual files. It just tracks the blocks. This makes the namenode very critical to HDFS functionality.&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;To summarize: the metadata that inodes maintain in a standard FS is maintained in the HDFS namenode, and actual file data that is maintained in filesystem blocks in a standard FS is maintained in HDFS blocks on datanodes, which store that block data in block files, maintain checksums of the block data for integrity checking, and update the namenode with information about the blocks they manage.&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;FileSystem Analogies That Do and Don't Work&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In a standard filesystem, disks have a minimum amount of data that they can read or write; this is called a disk block. Unix disk blocks are 512 bytes. &amp;nbsp;FileSystems also have minimum read/write filesystem blocks that are typically 1-2kb.&lt;br /&gt;&lt;br /&gt;Files on a standard filesystem are typically much larger than a block in size. Since most files are not exactly X blocks in size, the 'remainder' of the file that does not fill up a block still takes up that much space on the system. 
In general (&lt;a href="http://en.wikipedia.org/wiki/ReiserFS"&gt;ReiserFS&lt;/a&gt; being one exception) the difference between the file's real size and its block size &amp;nbsp;-- the slack -- cannot actually be used for any other file.&lt;br /&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;In HDFS, if a file is smaller than a block in size, it does not take up an entire HDFS block on disk. There is no concept of HDFS block 'slack space'. A small file takes up as many bytes as it would in a normal filesystem because it is stored as a block file in the normal filesystem. This is where the definition of HDFS block differs from a traditional file block, and this is where my mental model of HDFS as a filesystem failed me :)&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;While the file and block analogy is valid in HDFS, the size of the blocks makes the difference between file allocated size (always represented in blocks) and file actual size (always in bytes) much larger than it would be on a traditional file system. &lt;b&gt;&lt;i&gt;So you can't treat allocated vs actual size as equivalents, like you effectively can on a traditional filesystem where the block size to file size ratio is relatively tiny.&amp;nbsp;&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Small Files on the Datanodes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At allocation time, a &amp;nbsp;small file will require a single block file per datanode. Note that the actual number of blocks required to store that file on the cluster depends on HDFS replication policy, which defaults to 3. 
So factoring in replication, a small (less than 1 block) file is replicated as three identical block files on separate nodes.&lt;br /&gt;&lt;br /&gt;That block file is the same size as the small file -- large files would span several blocks and be split into block size files -- a large file that was 350MB big on a system with 128MB block size would be split into 3 blocks, the first two of 128MB, the last one of 94MB. Each of those would be replicated according to the replication policy of the cluster. The only files that don't take up space on the datanodes are zero byte files, which still take up space on the namenode.&lt;br /&gt;&lt;br /&gt;Regardless of actual size, at allocation time, HDFS treats a small file as having a minimum size of one HDFS block per datanode when it is calculating available disk space. &amp;nbsp;So, even if a file is really small, if there are fewer than three blocks available on the cluster, the file cannot be stored on the system.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Space Quotas&lt;/span&gt;&lt;br /&gt;HDFS has fewer replicated blocks left than it needs to store a file only when it is either running out of space or, more commonly, when there is a space quota on the directory that the file is being copied to. Calculating storage cost in blocks allows HDFS to safely store data to a known maximum size, no matter what the actual size of the file is. 
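&lt;br /&gt;&lt;br /&gt;To make the block arithmetic above concrete, here is a minimal Python sketch. It is a simplified model and not HDFS code: the function names are hypothetical, and the allocation check assumes that each new block is charged against the quota as one full block per replica, which is how I read the documentation quoted earlier.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the block files one replica is split into."""
    if file_size == 0:
        return []  # zero byte files only cost namenode metadata
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block file is only as big as the leftover data
    return sizes


def bytes_stored(file_size, replication):
    """Actual bytes used on the datanodes once the file is written."""
    return file_size * replication


def can_allocate_block(remaining_quota, block_size, replication):
    """Assumed allocation rule: a full block per replica must fit in the remaining quota."""
    return remaining_quota >= block_size * replication
```
&lt;/pre&gt;
With a 128MB block size and a replication factor of 3, this sketch splits a 350MB file into block files of 128MB, 128MB, and 94MB, and it refuses a new block whenever the remaining quota is under 384MB, even for a 12MB file that would only ever occupy 36MB on disk.&lt;br /&gt;&lt;br /&gt;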
HDFS will only permit new block creation if there is enough disk space to create a block on N datanodes, where N is the replication factor.&lt;br /&gt;&lt;br /&gt;This &lt;a href="http://www.michael-noll.com/blog/2011/03/28/hadoop-space-quotas-hdfs-block-size-replication-and-small-files/"&gt;article &lt;/a&gt;shows how HDFS block size, combined with the replication factor, not file actual size, determines available space.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Is this really a problem? Sort of...it's a matter of efficiency. Space quotas are checked by the amount of remaining space on a datanode disk. If a block file takes up 12MB on a system that has a 128MB block size, there are effectively 116MB available to be added into the available bytes for the space quota -- for a replication factor of 3, that would be 348MB available, or 2.72 blocks. While you could argue that effectively .72 blocks of that space is wasted, 2 blocks of that space is still available for quota calculations. While 2.72 blocks is less than the minimum amount of space required to store a file of _any_ size in an HDFS with a replication factor of 3, if you were to have 2 small files of 12MB, you have 5.44 blocks available across the system -- effectively if you always mod 'leftover space' by replication factor, at most you are wasting replication factor # of blocks.&lt;br /&gt;&lt;br /&gt;Granted that's not the most efficient use of disk, but it's not as if a small file takes up a 'virtual' block that gets factored in the next time a file is copied into the cluster.&lt;br /&gt;&lt;br /&gt;The bigger problem with small files is the lack of efficiency that is encountered in mapreduce operations. 
Reducing the number of mappers being used and traversing blocks of data at a time is not possible with small files -- one mapper is spun up per file, and the overhead involved in copying the jar file to the task tracker node, starting up the JVM, etc, only makes sense if there is a substantial amount of data to process. You can't go wrong with large files -- they will split across blocks, which are processed more efficiently.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6343133085638535208?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6343133085638535208/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6343133085638535208'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6343133085638535208'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html' title='HDFS file size vs allocation'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2846324287843277874</id><published>2011-03-14T21:41:00.000-07:00</published><updated>2011-03-21T22:10:20.435-07:00</updated><title type='text'>Setting up YCSB for low latency data store 
testing</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;Overview&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;When confronted with a problem, my first instinct is to look around to see where that problem has been handled before. This is because &amp;nbsp;I believe that &lt;a href="http://teddziuba.com/2010/10/taco-bell-programming.html"&gt;code is a liability&lt;/a&gt;, and I want to minimize risk by using code that has been vetted, tested, and put into production by others, and only add to it when necessary.&lt;br /&gt;&lt;br /&gt;Right now I have several problems around storing and accessing lots of data in real time. I have several diverse use cases that span applications, but the one thing all of these use cases have in common is that there is no need for transactional integrity. There is also a need for scale beyond which a traditional RDBMS can provide. &amp;nbsp;The first (un) requirement and the second very urgent requirement are pushing me towards (open source) low-latency, &lt;a href="http://en.wikipedia.org/wiki/NoSQL"&gt;NOSQL&lt;/a&gt; data stores.&lt;br /&gt;&lt;br /&gt;The two kinds of NOSQL data stores I'm looking at are Document Stores and Key-Value stores. Here is a great &lt;a href="http://ayende.com/Blog/archive/2010/04/11/that-no-sql-thing-ndash-document-databases.aspx"&gt;post&lt;/a&gt; discussing the differences between the two.&lt;br /&gt;&lt;br /&gt;Some of the questions I need answered to address projects in progress and planned:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;What Document Stores and Key-Value Stores out there have heavy adoption rates, a corporate sponsor, other community support that indicate good performance and support?&amp;nbsp;&lt;/li&gt;&lt;li&gt;How much 'eventual consistency' can an application live with? 
If data doesn't need to be transactional, can it really be eventually consistent?&amp;nbsp;&lt;/li&gt;&lt;li&gt;Is there a Document Store that is fast enough to act as a Key-Value store, since it would be easier to manage one piece of software than two?&amp;nbsp;&lt;/li&gt;&lt;li&gt;How do different Key Value stores compare to one another? Anecdotal evidence is one thing, hard data that I can refer to makes me feel much better.&lt;/li&gt;&lt;li&gt;What happens when I shut a node down? How hard is it to restore?&lt;/li&gt;&lt;li&gt;What are the costs of maintenance of different stores? How hard are they to set up?&lt;/li&gt;&lt;/ol&gt;I wanted to have questions 1 and 2 narrow down the range somewhat, &amp;nbsp;then evaluate that range against 3 and 4 to filter out slower candidates, leaving me with a smaller set to run by questions 5 and 6.&lt;br /&gt;&lt;br /&gt;In order to answer 3 and 4 above, I need to compare and contrast both Doc and KV stores in an 'apples to apples' way to gauge performance.&lt;br /&gt;&lt;br /&gt;I was psyching myself up to write a generic test framework, when someone pointed me to&amp;nbsp;&lt;a href="http://www.brianfrankcooper.net/"&gt;Brian Cooper&lt;/a&gt;&amp;nbsp;&amp;nbsp;and &lt;a href="https://github.com/brianfrankcooper/YCSB"&gt;YCSB&lt;/a&gt;, the Yahoo Cloud Serving Benchmark. 
&amp;nbsp;I had originally dismissed it as being out of date, but a quick perusal of the code on GitHub convinced me that updating it would not be that hard because it cleanly separates specific database calls from core functionality.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;YCSB implements different database client abstraction layers, and provides good documentation on how to set them up:&amp;nbsp;https://github.com/brianfrankcooper/YCSB/wiki/getting-started.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Not Quite Ready For Prime Time&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Before I could fully use YCSB, I had to fix up a couple of things. There are patches submitted for some of these fixes in the root project, but they hadn't been accepted yet. It made more sense for me to fork a repo and make the changes I needed (and push them if they hadn't already been pushed up to the upstream origin repo):&amp;nbsp;https://github.com/arunxarun/YCSB&lt;br /&gt;&lt;br /&gt;Here are some of the fixes I added; they haven't been integrated into the master repo yet:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The Cassandra7Client needed to be retrofitted to &lt;a href="https://github.com/brianfrankcooper/YCSB/pull/24"&gt;use ByteBuffers instead of byte[]s&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;The MongoDbClient was t&lt;a href="https://github.com/brianfrankcooper/YCSB/pull/26"&gt;hrowing a ClassCastException in the insert() method&lt;/a&gt; because it was casting a double encoded in a string to an Integer.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The MongoDbClient was &lt;a href="https://github.com/brianfrankcooper/YCSB/pull/27"&gt;not connecting to non localhost MongoDB instances&lt;/a&gt; because it wasn't appending the database name to the base database url.&lt;/li&gt;&lt;li&gt;There was no truncate functionality. 
For a Document Store like Mongo, this meant I had to manually truncate the db every time I wanted to reload data. &lt;a href="https://github.com/brianfrankcooper/YCSB/pull/28"&gt;I implemented the truncate method in the DB abstract class&lt;/a&gt; (and pushed it to the adaptor classes I used) so that I could do this via YCSB.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;I'm going to continue adding functionality -- right now I'm in the middle of adding delete functionality because we want to benchmark that as well -- &amp;nbsp;to my fork and pushing it up if I think it could be useful for other people.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Running A WorkLoad&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In YCSB terminology, a &lt;b&gt;&lt;i&gt;workload&lt;/i&gt;&lt;/b&gt; is a defined set of operations on a database. Workloads are stored as flat files, and executed by specific classes that extend the abstract Workload class. I'm using the CoreWorkload class (the default) for now, and may extend later. CoreWorkload lets me set the proportion of Reads vs Writes vs Updates vs Deletes in a separate property file. There are &lt;a href="https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads"&gt;default core workload files&lt;/a&gt; stored in the $YCSB_HOME/workloads directory. They break out like this:&lt;br /&gt;&lt;div class="p1"&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;workloada = 50/50 read/update ratio&lt;/li&gt;&lt;li&gt;workloadb = 95/5 read/update ratio&lt;/li&gt;&lt;li&gt;workloadc = 100/0 read/update ratio&lt;/li&gt;&lt;li&gt;workloadd = 95/5 read/insert ratio&lt;/li&gt;&lt;li&gt;workloade = 95/5 scan/insert ratio&lt;/li&gt;&lt;li&gt;workloadf = 50/50 read/read-modify-write ratio&lt;/li&gt;&lt;/ul&gt;Because they are property files, workload files can be copied/tweaked as needed. 
&amp;nbsp;If needed, I can also override the CoreWorkload class to do something different, but I haven't had to do that yet, even though I've added new functionality.&lt;br /&gt;&lt;br /&gt;I followed the section on &lt;a href="https://github.com/brianfrankcooper/YCSB/wiki/Running-a-Workload"&gt;Running a Workload&lt;/a&gt;; below are my notes in addition to those instructions.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Building YCSB&lt;/b&gt;&lt;br /&gt;Pretty self explanatory: there is an ant target for each db client you wish to compile with:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;b&gt;&lt;i&gt;ant dbcompile-[DB Client Name]&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Just make sure all of the jars your DB Client class needs are in the $YCSB_HOME/db/[Client DB]/lib directory. Note that sometimes, like in the case of Mongo, you may have to find those jars (slf4j, log4j) in other places.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data Store Setup&lt;/b&gt;&lt;br /&gt;There are some generic setup steps for all Data Stores:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create a [namespace/schema-like-element] called 'userspace'. For example, in Cassandra this would be a keyspace. In Mongo, a database, etc.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Create a [table-like element] in 'userspace', called 'data'. Again, in Cassandra this would be a column family, in Mongo, a collection, in HBase a column family, in MySQL a table.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;DB Specific details are found on the &lt;a href="https://github.com/brianfrankcooper/YCSB/wiki/Using-the-Database-Libraries"&gt;usage page&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Running YCSB&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When running YCSB, make sure you specify the jar files used for the DB Client. 
The first command you will run is the load command:&amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="line-height: 22px;"&gt;&lt;b&gt;&lt;i&gt;java -cp $YCSB_INSTALL/db/[DB Client dir]/lib/*:$YCSB_INSTALL/build/ycsb.jar com.yahoo.ycsb.CommandLine -db [DB Client class]&amp;nbsp;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="line-height: 22px;"&gt;&lt;b&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="line-height: 22px;"&gt;Once you are on the commandline make sure you can connect and see the namespace/keyspace you've created.&amp;nbsp;&lt;/span&gt;With that sanity check done, it's time to&amp;nbsp;run a workload. In order to do this I also need to load some data. This is done using the command line client from ycsb.jar:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;&lt;span class="Apple-style-span" style="line-height: 22px;"&gt;&lt;span class="Apple-style-span" style="font-style: normal; font-weight: normal;"&gt;&lt;b&gt;&lt;i&gt;java -cp $YCSB_INSTALL/db/[DB Client dir]/lib/*:$YCSB_INSTALL/build/ycsb.jar&amp;nbsp;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;&lt;/span&gt;&amp;nbsp;com.yahoo.ycsb.Client -db &lt;span class="Apple-style-span" style="font-style: normal; font-weight: normal; line-height: 22px;"&gt;&lt;b&gt;&lt;i&gt;[DB Client class]&amp;nbsp;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&amp;nbsp;-p [commandline props] -P [property files] -s -load &amp;nbsp;&amp;gt; out&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Some explanation of the available commandline parameters, note that in the above I'm running with one thread, no target ops, and loading the database via the -load parameter.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;-threads n: execute using n threads (default: 1) - can also be specified as the&amp;nbsp;"threadcount" property using -p&lt;/li&gt;&lt;li&gt;-target n: attempt to do n operations per second (default: unlimited) - can also&amp;nbsp;be 
specified as the "target" property using -p&lt;/li&gt;&lt;li&gt;-load:  run the loading phase of the workload&lt;/li&gt;&lt;li&gt;-t:  run the transactions phase of the workload (default)&lt;/li&gt;&lt;li&gt;-db dbname: specify the name of the DB to use (default: com.yahoo.ycsb.BasicDB) -&amp;nbsp;can also be specified as the "db" property using -p&lt;/li&gt;&lt;li&gt;-P propertyfile: load properties from the given file. Multiple files can&amp;nbsp;be specified, and will be processed in the order specified&lt;/li&gt;&lt;li&gt;-p name=value:  specify a property to be passed to the DB and workloads;&amp;nbsp;multiple properties can be specified, and override any&amp;nbsp;values in the propertyfile&lt;/li&gt;&lt;li&gt;-s:  show status during run (default: no status)&lt;/li&gt;&lt;li&gt;-l label:  use label for status (e.g. to label one experiment out of a whole batch)&lt;/li&gt;&lt;li&gt;-truncate, my own &lt;a href="https://github.com/arunxarun/YCSB/commit/69325cce31bac722f163bc75d8cecb694360dc19"&gt;special addition&lt;/a&gt;, to clean out data stores between runs.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;Some notes on the properties files I'm loading: the first one specifies the actual workload configuration. I'm using workloads/workloada, which looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;recordcount=100000&lt;br /&gt;operationcount=100000&lt;br /&gt;workload=com.yahoo.ycsb.workloads.CoreWorkload&lt;br /&gt;&lt;br /&gt;readallfields=true&lt;br /&gt;&lt;br /&gt;readproportion=0.5&lt;br /&gt;updateproportion=0.5&lt;br /&gt;scanproportion=0&lt;br /&gt;insertproportion=0&lt;br /&gt;&lt;br /&gt;requestdistribution=zipfian&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When I specify later properties files, they override the values set in the previous ones (commandline props override everything). 
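&lt;br /&gt;&lt;br /&gt;As a rough illustration of that precedence, here is a small Python sketch. It is not YCSB code (YCSB itself is Java): the helper names and the contents of the second property file are made up for the example, and the parser only handles plain name=value lines.&lt;br /&gt;&lt;br /&gt;
&lt;pre&gt;
```python
def parse_properties(text):
    """Parse simple name=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def layer_properties(property_files, commandline_props):
    """Apply property files in the order given, then command line props on top."""
    merged = {}
    for text in property_files:
        merged.update(parse_properties(text))  # later files win over earlier ones
    merged.update(commandline_props)  # command line props win over every file
    return merged


# First file: a trimmed copy of workloads/workloada.
workloada = "recordcount=100000\noperationcount=100000\nreadproportion=0.5\nupdateproportion=0.5\n"
# Second file: a hypothetical override file, not one that ships with YCSB.
bigger_run = "# my larger data set\nrecordcount=5000000\n"

effective = layer_properties([workloada, bigger_run], {"threadcount": "4"})
```
&lt;/pre&gt;
Here recordcount comes from the second file, operationcount still comes from workloada, and threadcount comes from the command line.&lt;br /&gt;&lt;br /&gt;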
I take advantage of this by creating other property files that override recordcount, insertioncount, and set db specific properties that are accessed in the DB Client classes.&lt;br /&gt;&lt;br /&gt;The output of the run is the average, min, max, 95th and 99th percentile latency for each operation type (read, update, etc.), a count of the return codes for each operation, and a histogram of latencies for each operation.&lt;br /&gt;&lt;br /&gt;The histogram shows the number of calls that were returned within the specified number of milliseconds. For example:&lt;br /&gt;&lt;pre&gt;0 45553&lt;br /&gt;1 2344&lt;br /&gt;2 399&lt;br /&gt;3 25&lt;br /&gt;4 5&lt;br /&gt;5 0&lt;br /&gt;&lt;/pre&gt;reads like this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;45553 calls returned in 0 ms&lt;/li&gt;&lt;li&gt;2344 calls returned in 1ms&lt;/li&gt;&lt;li&gt;399 calls returned in 2 ms&lt;/li&gt;&lt;li&gt;....&lt;/li&gt;&lt;/ul&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;That's All For Now...&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;YCSB as it is provides a very solid foundation for me to do testing across candidate data stores. &amp;nbsp;While it is very well documented, the code could use some love. I intend to give it enough love to evaluate the data stores I want to (in my fork), and push that love upstream. 
In the future I may end up writing a DB Client for some of the commercial stores we need to evaluate, as well as fix things that bug me.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-2846324287843277874?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2846324287843277874/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2011/03/setting-up-ycsb-for-low-latency-data.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2846324287843277874'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2846324287843277874'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2011/03/setting-up-ycsb-for-low-latency-data.html' title='Setting up YCSB for low latency data store testing'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7422768655186229885</id><published>2011-01-05T21:46:00.000-08:00</published><updated>2011-07-19T21:24:33.079-07:00</updated><title type='text'>Setting up CDH3 Hadoop on my new Macbook Pro</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: large;"&gt;A New Machine&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;div style="margin: 0px;"&gt;I'm fortunate enough to have recently received a Macbook Pro, 2.8 GHz Intel dual core, with 8GB RAM. 
&amp;nbsp;This is the third time I've turned a vanilla mac into a ninja coding machine, and following my design principle of "first time = coincidence, second time = annoying, third time = pattern", I've decided to write down the details for the next time. &lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Baseline&lt;/span&gt;&lt;br /&gt;This section details the pre-hadoop installs I did.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Java&lt;/b&gt;&lt;br /&gt;Previously I was running on Leopard, i.e. 10.5, and had to install &lt;a href="http://landonf.bikemonkey.org/static/soylatte/"&gt;soylatte&lt;/a&gt; to get the latest version of Java. In Snow Leopard, java jdk 1.6.0_22 is installed by default. That's good enough for me, for now.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Gcc, etc&lt;/b&gt;.&lt;br /&gt;In order to get these on the box, I had to &lt;a href="http://developer.apple.com/technologies/xcode.html"&gt;install XCode&lt;/a&gt;, making sure to check the 'linux dev tools' option.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;MacPorts&lt;/b&gt;&lt;br /&gt;I installed &lt;a href="http://www.macports.org/"&gt;MacPorts&lt;/a&gt; in case I needed to upgrade any native libs or tools.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Eclipse&lt;/b&gt;&lt;br /&gt;I downloaded the &lt;a href="http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/helios/SR1/eclipse-jee-helios-SR1-macosx-cocoa-x86_64.tar.gz"&gt;64 bit Java EE version of Helios&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Tomcat&lt;/b&gt;&lt;br /&gt;Tomcat is part of my daily fun, and t&lt;a href="http://www.malisphoto.com/tips/tomcatonosx.html"&gt;hese instructions to install tomcat6&lt;/a&gt; were helpful. 
One thing to note is that in order to access the tomcat manager panel, you also need to specify&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;role rolename="manager"/&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;prior to defining &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;user username="admin" password="password" roles="standard,manager,admin"/&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Also, I run tomcat standalone (no httpd), so the mod_jk install part didn't apply. Finally, I chose not to daemonize tomcat because this is a dev box, not a server, and the instructions for compiling and using &lt;a href="http://commons.apache.org/daemon/jsvc.html"&gt;jsvc&lt;/a&gt;&amp;nbsp;for 64 bit sounded iffy at best.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Hadoop&lt;/span&gt;&lt;br /&gt;I use the &lt;a href="http://www.cloudera.com/hadoop/"&gt;CDH&lt;/a&gt; distro. The install was amazingly easy, and their support rocks. Unfortunately, they don't have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo mac node. This is what I want my mac to have (for starters):&lt;br /&gt;&lt;ol&gt;&lt;li&gt;distinct processes for namenode, job tracker node, and datanode/task tracker nodes.&lt;/li&gt;&lt;li&gt;formatted HDFS&lt;/li&gt;&lt;li&gt;Pig 0.8.0&lt;/li&gt;&lt;/ol&gt;I'm not going to try to auto start hadoop because (again) this is a dev box, and start-all.sh should handle bringing up the JVMs around namenode, job tracker, datanode/tasktracker.&lt;br /&gt;&lt;br /&gt;I am installing CDH3, because I've been running it in &lt;a href="https://wiki.cloudera.com/display/DOC/CDH3+Deployment+in+Pseudo-Distributed+Mode"&gt;pseudo-mode&lt;/a&gt; on my Ubuntu dev box for the last month and have had no issues with it. 
Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.&lt;br /&gt;&lt;br /&gt;All of the CDH3 Tarballs can be found at&amp;nbsp;http://archive.cloudera.com/cdh/3/, and damn, that's a lot of tarballs. &lt;br /&gt;&lt;br /&gt;I downloaded &lt;a href="http://archive.cloudera.com/cdh/3/hadoop20-0.20.2+737.releasenotes.html"&gt;hadoop 0.20.2+737&lt;/a&gt;; it's (currently) the latest version out there. Because this is my new dev box, I decided to forego the usual security motivated setup of the hadoop user. When this decision comes back to bite me, I'll be sure to update this post. In fact, for ease of permissions/etc, I decided to install under my home dir, under &amp;nbsp;a CDH3 dir, so I could group all CDH3 related installs together. I symlinked the hadoop-0.20.2+737 dir to hadoop, and I'll update it if CDH3 updates their version of hadoop.&lt;br /&gt;&lt;br /&gt;After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Pseudo Mode Config&lt;/b&gt;&lt;br /&gt;I'm going to set up Hadoop in pseudo distributed mode, just like I have on my Ubuntu box. Unlike Debian/Red Hat CDH distros, where this is an apt-get or yum command, I need to set up conf files on my own. &lt;br /&gt;&lt;br /&gt;Fortunately the example-confs subdir of the Hadoop install has a conf.pseudo subdir. 
I needed to modify the following in core-site.xml:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;name&amp;gt;hadoop.tmp.dir&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;value&amp;gt;&lt;i&gt;&lt;b&gt;changed_to_a_valid_dir_I_own&lt;/b&gt;&lt;/i&gt;&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;lt;/property&amp;gt;&lt;br /&gt;&lt;br /&gt;and the following in hdfs-site.xml:&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;!-- specify this so that running 'hadoop namenode -format' formats the right dir --&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;name&amp;gt;dfs.name.dir&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;value&amp;gt;&lt;i&gt;&lt;b&gt;changed_to_a_different_dir_I_own&lt;/b&gt;&lt;/i&gt;&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;nbsp; &amp;lt;/property&amp;gt;&lt;br /&gt;&lt;br /&gt;I also had to create masters and slaves files in the example-confs/conf.pseudo directory:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;echo localhost &amp;gt; masters&lt;br /&gt;echo localhost &amp;gt; slaves&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;mv ./conf install-conf&lt;br /&gt;ln -sf ./example-confs/conf.pseudo conf&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Pig&lt;/span&gt;&lt;br /&gt;Installing Pig is as simple as downloading the tar, setting the path up, and going, sort of. The first time I ran pig, it tried to connect to the default install location of hadoop, /usr/lib/hadoop-0.20/. 
I made sure to set HADOOP_HOME to point to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;More To Come&lt;/span&gt; &lt;br /&gt;This pseudo node install was relatively painless. I'm going to continue to install Hadoop/HDFS based tools that may need more (HBase) or less (Hive) configuration, and update in successive posts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7422768655186229885?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7422768655186229885/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2011/01/setting-up-cdh3-hadoop-on-my-new.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7422768655186229885'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7422768655186229885'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2011/01/setting-up-cdh3-hadoop-on-my-new.html' title='Setting up CDH3 Hadoop on my new Macbook Pro'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5093753634886563695</id><published>2010-12-20T22:43:00.000-08:00</published><updated>2010-12-24T15:30:01.245-08:00</updated><title type='text'>Pig SPLITs, JOINs, and COGROUPs to 
manipulate multiple relations</title><content type='html'>I've been playing around with Pig and UDFs for the last couple of weeks as we try to convert an application from using SQL to do ETL to using Pig for the same transforms.&lt;br /&gt;&lt;br /&gt;In this particular application, we need to 'thread' logged messages together by fields that they can be joined on. Different messages represent different state around a single meta-state, kind of like a session, that unifies the different messages. &amp;nbsp; Messages can have a specific type; let's call those A, B, C, and D. The joining rules are: &lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A joins B on field y&lt;/li&gt;&lt;li&gt;B joins C on field y&lt;/li&gt;&lt;li&gt;D joins A on field x,y,z&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Split&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The first step prior to joining messages is to separate them into relations that only contain A, B, C, or D messages using the Pig &lt;a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT"&gt;SPLIT&lt;/a&gt; statement. SPLIT works like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;SPLIT relation INTO something IF condition, something_else IF other_condition, ...;&amp;nbsp;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;/span&gt;Basically SPLIT works much like a case statement, and I needed to write UDFs to implement the condition tests by comparing the input GMT against the specified day.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Writing UDFs for the SPLIT&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In previous posts I've written eval UDFs. Those take input and transform it to something else. In this case I needed to implement filter UDFs. Filter UDFs return a boolean value based on their input. &lt;br /&gt;&lt;br /&gt;I've found that the 'top down' approach works well when designing UDFs. 
By that I mean write the UDFs as they would be used in a script:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;SPLIT RAW_DATA INTO A IF isA(), B IF isB(), C IF isC(), D IF isD();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;and then implement them. Because each filter UDF returns a single boolean, I need four of them, one for each test in the SPLIT statement above. I'm basically going to implement the pattern: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class IsA extends FilterFunc {&lt;br /&gt;&lt;br /&gt;    @Override&lt;br /&gt;    public Boolean exec(Tuple someTuple) throws IOException {&lt;br /&gt;        return testForA(someTuple);&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    protected Boolean testForA(Tuple someTuple) {&lt;br /&gt;        ..... // determine if this is a type A, or not.&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So the SPLIT statement above works as advertised, partitioning the original raw data out by message type. &lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;JOINing Relations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The next part of threading the messages together is to &lt;a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28inner%29"&gt;JOIN&lt;/a&gt; them along common fields. The JOIN statement here matches relations on a single field: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;JOINED_AB = JOIN A BY y, B BY y;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;NOTE that this JOIN is an inner join; &lt;a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28outer%29"&gt;outer joins&lt;/a&gt; are a whole other beast. 
The inner join simply concatenates all fields of A and B together. So the JOINED_AB relation looks like:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;A::x,A::y,A::p,B::q,B::y,B::z&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If you want to have an authoritative value of y for each tuple of JOINED_AB, you would need to explicitly generate it:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;JOINED_AB = FOREACH JOINED_AB GENERATE A::y as y, .....;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In the case above, recall that &lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A joins B on field y&lt;/li&gt;&lt;li&gt;B joins C on field y&lt;/li&gt;&lt;li&gt;D joins A on field x,y,z&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;To knit these relations together, you would:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;JOINED_AB = JOIN A BY y, B BY y;&lt;br /&gt;&lt;br /&gt;JOINED_AB = FOREACH JOINED_AB GENERATE B::y as y,*;&lt;br /&gt;&lt;br /&gt;JOINED_AB_C = JOIN JOINED_AB BY y, C BY y;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;At this point we want to join D to JOINED, but that needs to be done along a multiple column match, and the JOINs above each match on a single column. It's time to use &lt;a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#COGROUP"&gt;COGROUP&lt;/a&gt;. 
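&lt;br /&gt;&lt;br /&gt;As an aside, the Pig Latin reference describes the BY clause of JOIN as accepting a parenthesized list of fields, so a multi-column inner join may also be an option. A minimal sketch, assuming D and JOINED both expose fields named x, y, and z (JOINED_D is a hypothetical relation name, not one used elsewhere in this post):&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- sketch only: inner join on a composite key (JOINED_D is a made-up name)&lt;br /&gt;JOINED_D = JOIN JOINED BY (x, y, z), D BY (x, y, z);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The difference is in the shape of the result: a JOIN like this produces flat tuples, whereas COGROUP keeps the matching tuples of each relation in separate bags.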
&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;COGROUPing Relations&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The first thing we need to do (for clarity) is to regenerate some of the fields in the JOINED relation:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;JOINED = FOREACH JOINED GENERATE A::x as x, A::y as y, A::z as z;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This allows us to COGROUP without having to dereference by sub-tuple:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ALL_DATA = COGROUP JOINED BY (x,y,z), D BY (x,y,z);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This relation is actually composed of all fields of A, B, C, and D, but because we joined A, B, and C into JOINED before joining it to D, the tuple structure looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ALL_DATA: (x,y,z), {JOINED_AB_C: { JOINED_AB::x,JOINED_AB::y,JOINED_AB::z,JOINED_AB::A::field1,&lt;br /&gt;JOINED_AB::B::field2}, D: {x,y,z,..}}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In other words, just as a &lt;a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#GROUP"&gt;GROUP&lt;/a&gt; takes tuples of a single relation and binds them by matching fields, creating a group key and a bag that holds the matching tuples, COGROUP takes tuples of different relations, binds them by matching fields, and creates one bag per relation that holds that relation's tuples for the group key. 
In fact the COGROUP and GROUP operations are the same, it's just common practice to use COGROUP when grouping multiple relations, GROUP when grouping the same relation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5093753634886563695?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5093753634886563695/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/12/pig-splits-joins-and-cogroups-to.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5093753634886563695'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5093753634886563695'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/12/pig-splits-joins-and-cogroups-to.html' title='Pig SPLITs, JOINs, and COGROUPs to manipulate multiple relations'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5574383641421134692</id><published>2010-12-01T18:17:00.000-08:00</published><updated>2010-12-03T11:58:53.022-08:00</updated><title type='text'>Writing a custom PIG Loader</title><content type='html'>&lt;span style="font-size: x-large;"&gt;Foreward&amp;nbsp; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I've been pretty happy using the default pig loader, which takes as input the delimiter of a CSV, and loads tuples into memory as 
specified:&lt;br /&gt;&lt;br /&gt;A = LOAD '/csv/input/inputs*' USING PigStorage() AS (field1,field2,..fieldN)&lt;br /&gt;&lt;br /&gt;However I'm in the middle of doing some transforms on a csv with ~ 226 fields. Yikes. For most of these transforms, we don't need all 226 fields; in fact we probably only need a reasonable subset, but which reasonable subset depends on what we are trying to do. Ideally I'd like to be able to extract the values I want into a tuple like this:&lt;br /&gt;&lt;br /&gt;A = LOAD 'somefile' using CustomLoader('1','4','45','100'...) as (timestamp:long, id:long, url:chararray...);&lt;br /&gt;&lt;br /&gt;So, it's time to write a custom loader.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;Setup&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Hadoop&lt;/span&gt;&lt;br /&gt;I first installed the&amp;nbsp; CDH3 distro of hadoop -- specifically the &lt;a href="https://wiki.cloudera.com/display/DOC/Hadoop+Deployment+%28CDH3%29+in+Pseudo-Distributed+Mode"&gt;pseudo mode configuration&lt;/a&gt;, which runs all hadoop core services, i.e. namenode, datanode, jobtracker and tasktracker, on a single box. I then installed hadoop-pig. Cloudera makes this easy by leveraging apt-get and installing to the standard *nix hierarchical locations. The CDH3 version of Hadoop uses&amp;nbsp; &lt;a href="http://www.freelists.org/post/hllug/The-Magic-Behind-etcalternatives"&gt;/etc/alternatives&lt;/a&gt;&amp;nbsp; to allow for easy version switching, and logs reside in the usual /var/log location. &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo apt-get install hadoop-0.20-conf-pseudo&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;After starting core services as described in the link, I then installed CDH3 pig (version 0.7.0) via apt-get:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;sudo apt-get install hadoop-pig&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;CDH3 Pig installs the pig shell script in /usr/bin, and provides libs in /usr/lib/pig. 
In order to run the pig shell, you need to set JAVA_HOME to /usr/lib/jvm/java-6-sun.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Mavenization&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;With the necessary services installed, I set up a maven project, mainly for brainless dependency management:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;mvn archetype:create  -DarchetypeGroupId=org.apache.maven.archetypes&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;-DgroupId=org.arunxarun.data.prototypes  -DartifactId=pigloader&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I then wanted to bring in the pig jars as dependencies. I found hadoop-0.20-core in the mvnrepository, but could not find pig.jar or pig-core.jar in any maven repository. So I installed the pig and pig-core jars to my local repository from the /usr/lib/pig directory where they had been put by the apt-get install. I did that after creating versionless symlinks to the real jars whose names contained version information:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;mvn install:install-file -Dfile=/usr/lib/pig/pig-core.jar -DgroupId=org.apache.hadoop -DartifactId=hadoop-pig-core -Dversion=0.7.0 -Dpackaging=jar&lt;br /&gt;&lt;br /&gt;mvn install:install-file -Dfile=/usr/lib/pig/pig.jar -DgroupId=org.apache.hadoop -DartifactId=hadoop-pig -Dversion=0.7.0 -Dpackaging=jar&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Finally, I made sure that the dependencies were referenced in my pom file: &lt;br /&gt;&lt;pre&gt;&amp;lt;dependency&amp;gt;&lt;br /&gt;      &amp;lt;groupId&amp;gt;org.apache.hadoop&amp;lt;/groupId&amp;gt;&lt;br /&gt;      &amp;lt;artifactId&amp;gt;hadoop-pig-core&amp;lt;/artifactId&amp;gt;&lt;br /&gt;      &amp;lt;version&amp;gt;0.7.0&amp;lt;/version&amp;gt;&lt;br /&gt;    &amp;lt;/dependency&amp;gt;&lt;br /&gt;    &amp;lt;dependency&amp;gt;&lt;br /&gt;      &amp;lt;groupId&amp;gt;org.apache.hadoop&amp;lt;/groupId&amp;gt;&lt;br /&gt;      &amp;lt;artifactId&amp;gt;hadoop-pig&amp;lt;/artifactId&amp;gt;&lt;br /&gt;      
&amp;lt;version&amp;gt;0.7.0&amp;lt;/version&amp;gt;&lt;br /&gt;    &amp;lt;/dependency&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;Implementation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As of 0.7.0, Pig loaders extend the &lt;a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup"&gt;LoadFunc&lt;/a&gt; abstract class. This means they need to override four methods:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;getInputFormat()&lt;/b&gt; this method returns to the caller an instance of the InputFormat that the loader supports. The actual load process needs an instance to use at load time, and doesn't want to place any constraints on how that instance is created.&lt;/li&gt;&lt;li&gt;&lt;b&gt;prepareToRead() &lt;/b&gt;is called prior to reading a split. It passes in the reader used during the reads of the split, as well as the actual split. The implementation of the loader usually keeps the reader, and may want to access the actual split if needed. &lt;/li&gt;&lt;li&gt;&lt;b&gt;setLocation()&lt;/b&gt; Pig calls this to communicate the load location to the loader, which is responsible for passing that information to the underlying InputFormat object. This method can be called multiple times, so there should be no state associated with the method (unless that state gets reset when the method is called).&lt;/li&gt;&lt;li&gt; &lt;b&gt;getNext()&lt;/b&gt; Pig calls this to get the next tuple from the loader once all setup has been done. If this method returns null, Pig assumes that all&amp;nbsp; information in the split passed via the prepareToRead() method has been processed.&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;Here is the current implementation: note that the constructor takes a varargs set of Strings, which is the only kind of argument that can be used with a Pig Loader. 
Also note from above that RecordReader is set in prepareToRead, but actually used in getNext().&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class CustomLoader extends LoadFunc {&lt;br /&gt;&lt;br /&gt; private static final String DELIM = "\t";&lt;br /&gt; private static final int DEFAULT_LIMIT = 226;&lt;br /&gt; private int limit = DEFAULT_LIMIT;&lt;br /&gt; private RecordReader reader;&lt;br /&gt; private List&amp;lt;Integer&amp;gt; indexes;&lt;br /&gt; private TupleFactory tupleFactory;&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * Pig Loaders only take string parameters. The CTOR is really the only interaction&lt;br /&gt;  * the user has with the Loader from the script.&lt;br /&gt;  * @param indexesAsStrings&lt;br /&gt;  */&lt;br /&gt; public CustomLoader(String...indexesAsStrings) {&lt;br /&gt;  this.indexes = new ArrayList&amp;lt;Integer&amp;gt;();&lt;br /&gt;  for(String indexAsString : indexesAsStrings) {&lt;br /&gt;   indexes.add(Integer.valueOf(indexAsString));&lt;br /&gt;  }&lt;br /&gt;  &lt;br /&gt;  tupleFactory = TupleFactory.getInstance();&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; @Override&lt;br /&gt; public InputFormat getInputFormat() throws IOException {&lt;br /&gt;  return new TextInputFormat();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * the input in this case is a TSV, so split it, make sure that the requested indexes are valid,&lt;br /&gt;  * and collect the requested fields into a tuple.&lt;br /&gt;  */&lt;br /&gt; @Override&lt;br /&gt; public Tuple getNext() throws IOException {&lt;br /&gt;  Tuple tuple = null;&lt;br /&gt;  List&amp;lt;String&amp;gt; values = new ArrayList&amp;lt;String&amp;gt;();&lt;br /&gt;  &lt;br /&gt;  try {&lt;br /&gt;   boolean notDone = reader.nextKeyValue();&lt;br /&gt;   if (!notDone) {&lt;br /&gt;       return null;&lt;br /&gt;   }&lt;br /&gt;   Text value = (Text) reader.getCurrentValue();&lt;br /&gt;   &lt;br /&gt;   if(value != null) {&lt;br /&gt;    String parts[] = value.toString().split(DELIM);&lt;br /&gt;    &lt;br /&gt;    for(Integer index : indexes) {&lt;br /&gt;     // reject indexes beyond the configured limit or beyond the fields actually present&lt;br /&gt;     if(index &amp;gt;= limit || index &amp;gt;= parts.length) {&lt;br /&gt;      throw new IOException("index "+index+ " is out of bounds: max index = "+(Math.min(limit, parts.length) - 1));&lt;br /&gt;     } else {&lt;br /&gt;      values.add(parts[index]);&lt;br /&gt;     }&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    tuple = tupleFactory.newTuple(values);&lt;br /&gt;   }&lt;br /&gt;   &lt;br /&gt;  } catch (InterruptedException e) {&lt;br /&gt;   // add more information to the runtime exception condition.&lt;br /&gt;   int errCode = 6018;&lt;br /&gt;   String errMsg = "Error while reading input";&lt;br /&gt;   throw new ExecException(errMsg, errCode,&lt;br /&gt;     PigException.REMOTE_ENVIRONMENT, e);&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  return tuple;&lt;br /&gt;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void prepareToRead(RecordReader reader, PigSplit pigSplit)&lt;br /&gt;   throws IOException {&lt;br /&gt;  this.reader = reader; // note that for this Loader, we don't care about the PigSplit.&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void setLocation(String location, Job job) throws IOException {&lt;br /&gt;  FileInputFormat.setInputPaths(job, location); // the location is assumed to be comma separated paths.&lt;br /&gt;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;Testing&lt;/span&gt;&lt;br /&gt;Testing a Pig UDF requires two steps: basic unit testing and integration testing via a script. 
I'm including this section because it also shows how the loader is accessed via Pig Latin.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Unit Testing: Mocking the Reader&lt;/span&gt;&lt;br /&gt;I've implemented a MockRecordReader that I can pass into my CustomLoader via prepareToRead(). The MockRecordReader will be accessed when getNext() is called. Note that I've only implemented the methods I need. This is by no means a fully functional RecordReader:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class MockRecordReader extends RecordReader&amp;lt;Long, Text&amp;gt; {&lt;br /&gt;&lt;br /&gt; private BufferedReader reader;&lt;br /&gt; private long key;&lt;br /&gt; private Text value;&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * call this to load the file&lt;br /&gt;  * @param fileLocation&lt;br /&gt;  * @throws FileNotFoundException&lt;br /&gt;  */&lt;br /&gt; public MockRecordReader(String fileLocation) throws FileNotFoundException {&lt;br /&gt;  reader = new BufferedReader(new FileReader(fileLocation));&lt;br /&gt;  key = 0;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void close() throws IOException {&lt;br /&gt;  reader.close();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public Long getCurrentKey() throws IOException, InterruptedException {&lt;br /&gt;  return key;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public Text getCurrentValue() throws IOException, InterruptedException {&lt;br /&gt;  // the line read by the most recent nextKeyValue() call&lt;br /&gt;  return value;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public float getProgress() throws IOException, InterruptedException {&lt;br /&gt;  // don't need this for unit testing&lt;br /&gt;  return 0;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void initialize(InputSplit arg0, TaskAttemptContext arg1)&lt;br /&gt;   throws IOException, InterruptedException {&lt;br /&gt;  // not initializing anything during unit testing.&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public boolean nextKeyValue() throws IOException, InterruptedException {&lt;br /&gt;  // advance to the next line; report false once the file is exhausted&lt;br /&gt;  String line = reader.readLine();&lt;br /&gt;  if(line == null) {&lt;br /&gt;   value = null;&lt;br /&gt;   return false;&lt;br /&gt;  }&lt;br /&gt;  key++;&lt;br /&gt;  value = new Text(line);&lt;br /&gt;  return true;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Implementing unit tests using the MockRecordReader is super easy. Note that I load the MockRecordReader up with some fake data for testing purposes.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class CustomLoaderTest {&lt;br /&gt;&lt;br /&gt; @Test&lt;br /&gt; public void testValidInput() throws Exception{&lt;br /&gt;  MockRecordReader reader = new MockRecordReader("src/test/resources/valid1line_hit_data.tsv");&lt;br /&gt;  &lt;br /&gt;  CustomLoader custLoader = new CustomLoader("0","2","4");&lt;br /&gt;  &lt;br /&gt;  custLoader.prepareToRead(reader, null);&lt;br /&gt;  &lt;br /&gt;  Tuple tuple = custLoader.getNext();&lt;br /&gt;  &lt;br /&gt;  assertNotNull(tuple);&lt;br /&gt;  &lt;br /&gt;  String ts = (String)tuple.get(0);&lt;br /&gt;  assertNotNull(ts);&lt;br /&gt;  assertEquals("1130770920",ts);&lt;br /&gt;  &lt;br /&gt;  String language = (String)tuple.get(1);&lt;br /&gt;  assertEquals("en-ca",language);&lt;br /&gt;  String someCt = (String)tuple.get(2);&lt;br /&gt;  assertEquals("675",someCt);&lt;br /&gt;  &lt;br /&gt; }&lt;br /&gt; &lt;br /&gt; @Test(expected=IOException.class)&lt;br /&gt; public void testInvalidInput() throws Exception {&lt;br /&gt;  MockRecordReader reader = new MockRecordReader("src/test/resources/valid1line_hit_data.tsv");&lt;br /&gt;  &lt;br /&gt;  CustomLoader custLoader = new CustomLoader("300");&lt;br /&gt;  &lt;br /&gt;  custLoader.prepareToRead(reader, null);&lt;br 
/&gt;  &lt;br /&gt;  Tuple tuple = custLoader.getNext();&lt;br /&gt; }&lt;br /&gt;  &lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;Integration Testing&lt;/span&gt;&lt;br /&gt;Now that I know I haven't generated any NPEs from basic usage (NOTE: there are plenty more tests that I could do around bad format), it's integration test time. Integration testing a loader via Pig Latin is pretty simple: load data, then dump it, and validate that it looks like it should. Right now this is manual, basically running the script below, but output could/should be automatically validated.&lt;br /&gt;&lt;br /&gt;Note that in order to use the UDF I've written, I need to specifically register it as shown in the first line below. &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;register '../../../target/CustomLoader-1.0-SNAPSHOT.jar'&lt;br /&gt;&lt;br /&gt;-- the loader is fully path specified, and args are passed in using single quotes.&lt;br /&gt;-- the file being loaded exists at the specified location in HDFS&lt;br /&gt;&lt;br /&gt;A = LOAD '/test/hit_data.tsv' USING com.foo.bar.CustomLoader('0','2','6','19') AS (zero:long,two:chararray,six:long,nineteen:chararray);&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;C =  GROUP A BY zero;&lt;br /&gt;&amp;nbsp;&lt;/pre&gt;&lt;pre&gt;-- this forces pig to execute the query plan up to the DUMP, which means invoking the loader. 
&lt;br /&gt;&lt;br /&gt;DUMP C;&lt;br /&gt;&lt;br /&gt;-- note that the same loader can be invoked with a different number of arguments, and&lt;br /&gt;-- fields don't have to be cast&lt;br /&gt;-- the file being loaded exists at the specified location in HDFS&lt;br /&gt;&lt;br /&gt;B = LOAD '/test/hit_data.tsv' USING com.foo.bar.CustomLoader('100','200');&lt;br /&gt;&lt;br /&gt;DUMP B;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;Conclusion&lt;/span&gt; &lt;br /&gt;&lt;span style="font-size: x-large;"&gt;&lt;span style="font-size: small;"&gt;Writing the code took about 10 minutes. Testing it took much longer. That seems to be the pattern for me when writing (simple) UDFs. What I've noticed about Pig scripts and UDFS is that in order to validate functionality throughout the script/UDF lifecycle you always need to validate the generated Tuples to feel confident that changes have been made without regression.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-large;"&gt;&lt;span style="font-size: small;"&gt;Other than the lack of automation around integration testing, the actual Loader works as advertised -- it might need to change to accommodate new requirements,&amp;nbsp; but it will work just fine for prototypical work with multi column CSV files. 
&lt;/span&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5574383641421134692?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5574383641421134692/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5574383641421134692'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5574383641421134692'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/12/writing-custom-pig-loader.html' title='Writing a custom PIG Loader'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4873676085785908512</id><published>2010-11-17T20:18:00.000-08:00</published><updated>2010-11-17T20:46:23.699-08:00</updated><title type='text'>Pull Parsing with STAX</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Foreward&lt;/span&gt;&lt;br /&gt;Due to some events beyond my control, I lost the source code for parsing my &lt;a href="http://developer.garmin.com/schemas/tcx/v2/"&gt;Garmin TCX&lt;/a&gt; data into CSV format. 
I had not gotten around to Git-ifying my source, and also had not backed it up via JungleDisk/Mozy, so the original Ruby code that used the &lt;a href="http://www.jamesh.id.au/articles/libxml-sax/libxml-sax.html"&gt;LibXML SAX parser&lt;/a&gt; is gone. &lt;br /&gt;&lt;br /&gt;That might not necessarily be a bad thing. In &lt;a href="http://www.saxproject.org/"&gt;SAX &lt;/a&gt;parsing, events happen without any surrounding context. It is up to the programmer to supply the context, and doing that in a legible, maintainable way with the SAX event driven model is a challenge. When I had discovered issues with the parsing code, fixing those issues required a lot of time to determine the actual state at the time the bug occurred. SAX parsing code has a significant maintenance penalty. &lt;br /&gt;&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Document_Object_Model"&gt;DOM &lt;/a&gt;parsing is much more straightforward because you navigate the structure of the XML, and implicitly get the associated context. Unfortunately, it is prohibitively expensive because it requires that the entire document get loaded into memory. &lt;br /&gt;&lt;br /&gt;&lt;a href="http://download.oracle.com/docs/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP3.html"&gt;STAX &lt;/a&gt;parsing is a reasonable compromise between the two extremes. It streams the file (i.e. only loading into memory what it needs), while allowing the developer to navigate the structure of the XML. In other words, you have context without memory overhead. &lt;br /&gt;&lt;br /&gt;STAX can be used to read or write XML files -- in my application, that converted .TCX files to a CSV format, I was focused on reading XML, not writing it. For reading files, STAX provides two different APIs. The cursor based API uses XMLStreamReader. The iterator based API uses XMLEventReader. 
The key difference between the two is that the iterator based API treats events as first class objects and allows the user to peek ahead at the next element to be fetched. This supplies the user with more context than the cursor based API. That context comes with additional resource consumption, but still much less than loading an entire DOM into memory. &lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Onward&lt;/span&gt;&lt;br /&gt;In my first STAX program, I wanted to see how far I could get with the cursor based API. Specifically, given its better performance characteristics, would it provide enough context for me to write easily maintainable code? &lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;A Moment...&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;A quick side note on why (again) I'm choosing to convert an XML stream to CSV. XML is great because DTDs and Schemas provide a way to validate document integrity when there are large numbers of optional elements. In the case of TCX, there are few optional elements -- the elements that exist always contain the same kinds of data. With an unchanging format, CSV makes more sense because it is a more compact representation of a stable data set. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Implementation Details&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;In my re-implementation of TCX-&amp;gt;CSV parsing code, I needed to transform a set of nested parameters into two different CSV formats. In order to explain what I needed to do, I need to go into detail about what the Garmin TCX XML looks like, and what I wanted to extract from it. 
&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;?xml version="1.0" encoding="UTF-8" standalone="no" ?&amp;gt;&lt;br /&gt;&amp;lt;TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd"&amp;gt;&lt;br /&gt;&lt;br /&gt;  &amp;lt;Activities&amp;gt;&lt;br /&gt;    &amp;lt;Activity Sport="Biking"&amp;gt;&lt;br /&gt;      &amp;lt;Id&amp;gt;2010-11-04T22:02:59Z&amp;lt;/Id&amp;gt;&lt;br /&gt;      &amp;lt;Lap StartTime="2010-11-04T22:02:59Z"&amp;gt;&lt;br /&gt;        &amp;lt;TotalTimeSeconds&amp;gt;1798.6000000&amp;lt;/TotalTimeSeconds&amp;gt;&lt;br /&gt;        &amp;lt;DistanceMeters&amp;gt;15412.0742188&amp;lt;/DistanceMeters&amp;gt;&lt;br /&gt;        &amp;lt;MaximumSpeed&amp;gt;12.8037510&amp;lt;/MaximumSpeed&amp;gt;&lt;br /&gt;        &amp;lt;Calories&amp;gt;581&amp;lt;/Calories&amp;gt;&lt;br /&gt;        &amp;lt;AverageHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t"&amp;gt;&lt;br /&gt;          &amp;lt;Value&amp;gt;140&amp;lt;/Value&amp;gt;&lt;br /&gt;        &amp;lt;/AverageHeartRateBpm&amp;gt;&lt;br /&gt;        &amp;lt;MaximumHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t"&amp;gt;&lt;br /&gt;          &amp;lt;Value&amp;gt;158&amp;lt;/Value&amp;gt;&lt;br /&gt;        &amp;lt;/MaximumHeartRateBpm&amp;gt;&lt;br /&gt;        &amp;lt;Intensity&amp;gt;Active&amp;lt;/Intensity&amp;gt;&lt;br /&gt;        &amp;lt;Cadence&amp;gt;0&amp;lt;/Cadence&amp;gt;&lt;br /&gt;        &amp;lt;TriggerMethod&amp;gt;Manual&amp;lt;/TriggerMethod&amp;gt;&lt;br /&gt;        &amp;lt;Track&amp;gt;&lt;br /&gt;          &amp;lt;Trackpoint&amp;gt;&lt;br /&gt;            &amp;lt;Time&amp;gt;2010-11-04T22:03:00Z&amp;lt;/Time&amp;gt;&lt;br /&gt;            &amp;lt;Position&amp;gt;&lt;br /&gt;              
&amp;lt;LatitudeDegrees&amp;gt;47.5834731&amp;lt;/LatitudeDegrees&amp;gt;&lt;br /&gt;              &amp;lt;LongitudeDegrees&amp;gt;-122.2491668&amp;lt;/LongitudeDegrees&amp;gt;&lt;br /&gt;            &amp;lt;/Position&amp;gt;&lt;br /&gt;            &amp;lt;AltitudeMeters&amp;gt;17.4411621&amp;lt;/AltitudeMeters&amp;gt;&lt;br /&gt;            &amp;lt;DistanceMeters&amp;gt;26.8590393&amp;lt;/DistanceMeters&amp;gt;&lt;br /&gt;            &amp;lt;HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t"&amp;gt;&lt;br /&gt;              &amp;lt;Value&amp;gt;85&amp;lt;/Value&amp;gt;&lt;br /&gt;            &amp;lt;/HeartRateBpm&amp;gt;&lt;br /&gt;            &amp;lt;SensorState&amp;gt;Absent&amp;lt;/SensorState&amp;gt;&lt;br /&gt;          &amp;lt;/Trackpoint&amp;gt;&lt;br /&gt;          ...&lt;br /&gt;        &amp;lt;/Track&amp;gt;&lt;br /&gt;      &amp;lt;/Lap&amp;gt;&lt;br /&gt;      ...&lt;br /&gt;    &amp;lt;/Activity&amp;gt;&lt;br /&gt;    ...&lt;br /&gt;  &amp;lt;/Activities&amp;gt;&lt;br /&gt;&amp;lt;/TrainingCenterDatabase&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I want to extract two kinds of data out of the XML stream above: &lt;br /&gt;&lt;ol&gt;&lt;li&gt;Lap summary data. Lap summary data is good for high level comparisons of effort. For laps of roughly the same duration, the basic components of the summary can be compared across laps.&lt;/li&gt;&lt;li&gt;Trackpoint data. 
Trackpoint data (elevation, heart rate, lat/long) can be analyzed/transformed across arbitrary intervals to measure input effort and output speed.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;Lap Summary Data will look like this:&lt;br /&gt;&lt;i&gt;activity_id,lap_id,total_time,total_distance,max_speed,total_calories,average_heartrate,max_heartrate&lt;/i&gt;&lt;br /&gt;Trackpoint detail data will look like this:&lt;/div&gt;&lt;div&gt;&lt;i&gt;lap_id,trackpoint_id,time,latitude,longitude,altitude,distance,heartrate&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Both files may end up being used when correlating track points to their parent laps and activities.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Initializing the STAX Parser&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;I've created a class TCXPullParser to parse Garmin TCX data. In the constructor I initialize the STAX parser:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;XMLStreamReader parser;&lt;br /&gt;         .....&lt;br /&gt;         /**&lt;br /&gt;  * ctor, with all file names to write to and read from.&lt;br /&gt;  * &lt;br /&gt;  * @param lapSummaryWriter&lt;br /&gt;  * @param trackDetailWriter&lt;br /&gt;  * @param fileToParse&lt;br /&gt;  * @throws IOException&lt;br /&gt;  * @throws XMLStreamException&lt;br /&gt;  */&lt;br /&gt; public TCXPullParser(CSVWriter lapSummaryWriter, CSVWriter trackDetailWriter,&lt;br /&gt;     String fileToParse) throws IOException, XMLStreamException {&lt;br /&gt;  &lt;br /&gt;  FileInputStream fis = new FileInputStream(fileToParse);&lt;br /&gt;  XMLInputFactory factory = XMLInputFactory.newInstance();&lt;br /&gt;  parser = factory.createXMLStreamReader(fis);&lt;br /&gt;                ...&lt;br /&gt; }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Initialization is simple: the parser is created from the XMLInputFactory and reads from the FileInputStream opened on the input file name. 
From this point my primary access to the file is through the parser object. I use the parser object to advance the cursor (by calling next()), inspect the type of event (the return value of parser.next()), and grab text (parser.getText()). These three methods, with some additional functionality I've added, give me enough context to actually top-down parse the XML.&amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Pull Parsing&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;One advantage of using pull parsing is that the code is in charge of when events are fired. This lets us do things like skip processing/move directly to an element that we are interested in by using the parser.next() method and checking the returned event type:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;/**&lt;br /&gt; * skips parser to the start element of the specified element name, while&lt;br /&gt; * stopElementName has not been encountered.&lt;br /&gt; * &lt;br /&gt; * @param parser&lt;br /&gt; * @param elementName&lt;br /&gt; * @param stopElementName&lt;br /&gt; *          if we get this far, we've gone too far.&lt;br /&gt; * @return true if element is found&lt;br /&gt; * @throws Exception&lt;br /&gt; */&lt;br /&gt;protected boolean skipTo(XMLStreamReader parser, String elementName,&lt;br /&gt;    String stopElementName) throws Exception {&lt;br /&gt; boolean found = false;&lt;br /&gt; int parseType = parser.getEventType();&lt;br /&gt; while (parser.hasNext()) {&lt;br /&gt;     parseType = parser.next();&lt;br /&gt;     if (parseType != XMLStreamReader.START_ELEMENT&lt;br /&gt;             &amp;amp;&amp;amp; parseType != XMLStreamReader.END_ELEMENT) {&lt;br /&gt;             // skip text, comments, etc.: getLocalName() throws on those events&lt;br /&gt;             continue;&lt;br /&gt;     }&lt;br /&gt;     String elName = parser.getLocalName();&lt;br /&gt;        if (parseType == XMLStreamReader.START_ELEMENT) {&lt;br /&gt;  if (elName.equals(elementName)) {&lt;br /&gt;         found = true;&lt;br /&gt;      break;&lt;br /&gt;  } else if (elName.equals(stopElementName)) {&lt;br /&gt;      // in the case 
where we are looking across parallel elements&lt;br /&gt;      // or into a container element,&lt;br /&gt;      // stop when we find the stop element&lt;br /&gt;     found = false;&lt;br /&gt;    break;&lt;br /&gt;  }&lt;br /&gt;         } else if (parseType == XMLStreamReader.END_ELEMENT&lt;br /&gt;         &amp;amp;&amp;amp; elName.equals(stopElementName)) {&lt;br /&gt;  // in the case where we are looking within a container element, stop&lt;br /&gt;  // when we reach the end of that container element&lt;br /&gt;  found = false;&lt;br /&gt;  break;&lt;br /&gt;     }&lt;br /&gt; }&lt;br /&gt; return found;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I typically use skipTo() to move to the next instance of an element before its containing element's end tag is reached. For example, when I'm parsing the contents of a Trackpoint tag:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&amp;lt;Trackpoint&amp;gt;&lt;br /&gt;   &amp;lt;Time&amp;gt;2010-11-04T22:03:00Z&amp;lt;/Time&amp;gt;&lt;br /&gt;   &amp;lt;Position&amp;gt;&lt;br /&gt;   &amp;lt;LatitudeDegrees&amp;gt;47.5834731&amp;lt;/LatitudeDegrees&amp;gt;&lt;br /&gt;   &amp;lt;LongitudeDegrees&amp;gt;-122.2491668&amp;lt;/LongitudeDegrees&amp;gt;&lt;br /&gt;   &amp;lt;/Position&amp;gt;&lt;br /&gt;   &amp;lt;AltitudeMeters&amp;gt;17.4411621&amp;lt;/AltitudeMeters&amp;gt;&lt;br /&gt;   &amp;lt;DistanceMeters&amp;gt;26.8590393&amp;lt;/DistanceMeters&amp;gt;&lt;br /&gt;   &amp;lt;HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t"&amp;gt;&lt;br /&gt;   &amp;lt;Value&amp;gt;85&amp;lt;/Value&amp;gt;&lt;br /&gt;   &amp;lt;/HeartRateBpm&amp;gt;&lt;br /&gt;   &amp;lt;SensorState&amp;gt;Absent&amp;lt;/SensorState&amp;gt;&lt;br /&gt;&amp;lt;/Trackpoint&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This is the code to parse that data:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;/**&lt;br /&gt; * parses a single trackpoint and writes an output line to the trackDetail CSV&lt;br /&gt; * writer.&lt;br /&gt; * &lt;br /&gt; * @param parser&lt;br 
/&gt; * @throws Exception&lt;br /&gt; */&lt;br /&gt;protected void parseTrackPoint(XMLStreamReader parser, String lapId,&lt;br /&gt;    String trackPointId) throws Exception {&lt;br /&gt; trackDetailWriter.writeArg(lapId);&lt;br /&gt; trackDetailWriter.writeArg(trackPointId);&lt;br /&gt; skipTo(parser, TIME, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getTimeValue(parser, parser.next()));&lt;br /&gt; skipTo(parser, LAT, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getValue(parser, parser.next()));&lt;br /&gt; skipTo(parser, LONG, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getValue(parser, parser.next()));&lt;br /&gt; skipTo(parser, ALT, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getValue(parser, parser.next()));&lt;br /&gt; skipTo(parser, DIST, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getValue(parser, parser.next()));&lt;br /&gt; skipTo(parser, HEARTRATE, TRACKPOINT);&lt;br /&gt; trackDetailWriter.writeArg(getValue(parser, parser.next()));&lt;br /&gt; trackDetailWriter.flushArgs();&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;With skipTo in place, I needed to extract the data from the XML. The text inside an element arrives as a CHARACTERS event, which I reach by calling parser.next() after hitting the enclosing tag's START_ELEMENT. The text itself is read via XMLStreamReader.getText(): &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;/**&lt;br /&gt; * extracts a double value from a character stream.&lt;br /&gt; * &lt;br /&gt; * @param parser&lt;br /&gt; * @param parseType&lt;br /&gt; * @return the double, or -1 if the element is not CHARACTERS. 
Will also&lt;br /&gt; *         throw a runtime exception if parsing fails.&lt;br /&gt; */&lt;br /&gt;private double getValue(XMLStreamReader parser, int parseType) {&lt;br /&gt; if (parseType == XMLStreamConstants.CHARACTERS) {&lt;br /&gt;  return Double.parseDouble(parser.getText());&lt;br /&gt; } else {&lt;br /&gt;  return -1;&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The combination of skipTo and getValue() allows me to extract Double values from the XML. I'm using Double to validate the format of the value, even though I'm going to persist that value back as a string. When extracting timestamps, I extract the data to a long:&lt;br /&gt;&lt;pre&gt;private long getTimeValue(XMLStreamReader parser, int parseType)&lt;br /&gt;    throws ParseException {&lt;br /&gt; if (parseType == XMLStreamConstants.CHARACTERS) {&lt;br /&gt;  String raw = parser.getText();&lt;br /&gt;  String date = raw.substring(0, raw.indexOf('T'));&lt;br /&gt;  String time = raw.substring(raw.indexOf('T') + 1, raw.indexOf('Z'));&lt;br /&gt;  // raw looks like 2010-11-04T22:03:00Z, so the date part is year-month-day&lt;br /&gt;  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd-HH:mm:ss");&lt;br /&gt;  // the trailing Z means the timestamp is in UTC&lt;br /&gt;  sdf.setTimeZone(TimeZone.getTimeZone("UTC"));&lt;br /&gt;  Date actual = sdf.parse(date + '-' + time);&lt;br /&gt;  return actual.getTime();&lt;br /&gt; } else {&lt;br /&gt;  return -1;&lt;br /&gt; }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Separating Reading XML from Writing CSV&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;In the code above there are calls to a trackDetailWriter object. I chose to separate the writing of the CSV from the reading of the XML in order to test the XML reading logic more easily. 
This simplified things a lot, it allowed me to pass in CSVWriter objects, it relieved the parsing code of having to manage/open/close destination files, and it allowed me to write test implementations of CSVWriter that stored the data in memory for me to check during unit tests.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public interface CSVWriter {&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * write a single arg&lt;br /&gt;  * @param arg&lt;br /&gt;  * @throws Exception&lt;br /&gt;  */&lt;br /&gt; public void writeArg(Object arg) throws Exception;&lt;br /&gt; &lt;br /&gt; /**&lt;br /&gt;  * flush all pending args as a single CSV line&lt;br /&gt;  * @throws Exception &lt;br /&gt;  */&lt;br /&gt; public void flushArgs() throws Exception;&lt;br /&gt; &lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The default implementation (used at runtime) looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;/**&lt;br /&gt; * &lt;br /&gt; * @author Arun Jacob&lt;br /&gt; *&lt;br /&gt; * push comma separated values to a file. 
&lt;br /&gt; */&lt;br /&gt;public class DefaultCSVWriterImpl implements CSVWriter {&lt;br /&gt;&lt;br /&gt; private FileWriter writer;&lt;br /&gt; private StringBuffer buffer;&lt;br /&gt; &lt;br /&gt; public DefaultCSVWriterImpl(String fileName) throws IOException {&lt;br /&gt;  writer = new FileWriter(fileName);&lt;br /&gt;  buffer = new StringBuffer();&lt;br /&gt; }&lt;br /&gt;  &lt;br /&gt; /**&lt;br /&gt;  * close the file: REQUIRED for all file writers.&lt;br /&gt;  * @throws Exception&lt;br /&gt;  */&lt;br /&gt; public void close() throws Exception {&lt;br /&gt;  writer.flush();&lt;br /&gt;  writer.close();&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void writeArg(Object arg) throws Exception {&lt;br /&gt;  buffer.append(arg);&lt;br /&gt;  buffer.append(",");&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void flushArgs() throws Exception {&lt;br /&gt;  writeToFile(buffer);&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * flush the contents of the buffer to file&lt;br /&gt;  * @param buffer&lt;br /&gt;  * @throws IOException&lt;br /&gt;  */&lt;br /&gt; private void writeToFile(StringBuffer buffer) throws IOException {&lt;br /&gt;  if(buffer.charAt(buffer.length()-1) == ',') {&lt;br /&gt;   // remove the last comma before writing. 
&lt;br /&gt;   writer.write(buffer.toString().substring(0,buffer.length()-1));&lt;br /&gt;  } else {&lt;br /&gt;   writer.write(buffer.toString());&lt;br /&gt;  }&lt;br /&gt;  // end the row so each flush produces its own CSV line&lt;br /&gt;  writer.write("\n");&lt;br /&gt;  resetBuffer(buffer);&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; /**&lt;br /&gt;  * clear out the StringBuffer&lt;br /&gt;  * @param buffer&lt;br /&gt;  */&lt;br /&gt; private void resetBuffer(StringBuffer buffer) {&lt;br /&gt;  if(buffer.length() &amp;gt; 0) {&lt;br /&gt;   buffer.delete(0,buffer.length());&lt;br /&gt;  }&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The test implementation (used to verify that values are being pulled from the XML correctly) looks like this: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class TestTrackPointCSVWriter implements CSVWriter {&lt;br /&gt; &lt;br /&gt; static final String TRACKPOINTID = "TrackPointId";&lt;br /&gt;&lt;br /&gt; static final String LAPID = "LapId";&lt;br /&gt;&lt;br /&gt; static final String ACTIVITYID = "activityId";&lt;br /&gt;&lt;br /&gt; List&amp;lt;Object&amp;gt; args;&lt;br /&gt; Map&amp;lt;String,Object&amp;gt; argsMap;&lt;br /&gt; public TestTrackPointCSVWriter() {&lt;br /&gt;  args = new ArrayList&amp;lt;Object&amp;gt;();&lt;br /&gt;  argsMap = new HashMap&amp;lt;String,Object&amp;gt;();&lt;br /&gt; }&lt;br /&gt; &lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void writeArg(Object arg) throws Exception {&lt;br /&gt;  args.add(arg);&lt;br /&gt;  &lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; @Override&lt;br /&gt; public void flushArgs() throws Exception {&lt;br /&gt;  argsMap.put(LAPID, args.get(0));&lt;br /&gt;  argsMap.put(TRACKPOINTID, args.get(1));&lt;br /&gt;  argsMap.put(TCXPullParser.TIME, args.get(2));&lt;br /&gt;  argsMap.put(TCXPullParser.LAT,args.get(3));&lt;br /&gt;  argsMap.put(TCXPullParser.LONG, args.get(4));&lt;br /&gt;  argsMap.put(TCXPullParser.ALT, args.get(5));&lt;br /&gt;  argsMap.put(TCXPullParser.DIST, args.get(6));&lt;br /&gt;  argsMap.put(TCXPullParser.HEARTRATE, 
args.get(7));&lt;br /&gt;  args.clear();&lt;br /&gt;  &lt;br /&gt; }&lt;br /&gt;&lt;br /&gt; &lt;br /&gt; /**&lt;br /&gt;  * validation method&lt;br /&gt;  * @param key&lt;br /&gt;  * @return&lt;br /&gt;  */&lt;br /&gt; public Object get(String key) {&lt;br /&gt;  return argsMap.get(key);&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Conclusion&lt;/span&gt;&lt;br /&gt;I'm not sure what I was thinking when I &lt;a href="http://arunxjacob.blogspot.com/2010/09/data-mining-my-gpshrm-data-step-1.html"&gt;wrote the original SAX parser for TCX data&lt;/a&gt;, other than I just like to write in Ruby. The additional context that I get by being able to pull tags instead of getting them pushed at me makes the code much easier to follow and therefore maintain.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4873676085785908512?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4873676085785908512/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/11/pull-parsing-with-stax.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4873676085785908512'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4873676085785908512'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/11/pull-parsing-with-stax.html' title='Pull Parsing with STAX'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7213121971131700133</id><published>2010-09-24T22:11:00.000-07:00</published><updated>2010-10-05T23:20:38.845-07:00</updated><title type='text'>Datamining my GPS/HRM Data, Step 2: Pig for ETL</title><content type='html'>&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Overview&lt;/span&gt;&lt;br /&gt;Now that I've &lt;a href="http://arunxjacob.blogspot.com/2010/09/data-mining-my-gpshrm-data-step-1.html"&gt;extracted 3 years worth of GPS/HRM data into CSV format&lt;/a&gt;, I want to get some basic summary information. Specifically, I want&lt;br /&gt;&lt;ul&gt;&lt;li&gt;the total distance covered for running, biking and 'other' (usually skate skiing) per month.&amp;nbsp;&lt;/li&gt;&lt;li&gt;average run and ride mileage per month&lt;/li&gt;&lt;li&gt;average running and cycling pace per month&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Pretty no brainer stuff that I could probably write a quick 'n dirty ruby script to do for my paltry 40MB of data.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But...I want to use &lt;a href="http://hadoop.apache.org/pig/"&gt;Pig&lt;/a&gt;. While this may seem like a candidate for the "&lt;a href="http://tirian.org/?p=20"&gt;Cutting Butter With A Chainsaw&lt;/a&gt;" award, I'm actually trying to show that&amp;nbsp;Pig Scripting is pretty damn useful for &lt;a href="http://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;ETL&lt;/a&gt;. The nice thing about Pig is that when I'm dealing with 40GB or TB of input data, the same script can be run across a multi node cluster without changes to the basic logic. &amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'm also going to try and capture what I've learned from being in the trenches with Pig over the last couple of months. 
We use it at work to process TBs of log data, and while it definitely has its warty aspects, I find that it is well suited for basic summary, grouping, and filtering operations.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Pig Setup&lt;/span&gt;&lt;/div&gt;&lt;div&gt;I downloaded and set up Pig as described &lt;a href="http://hadoop.apache.org/pig/docs/r0.7.0/setup.html"&gt;here&lt;/a&gt;. Pig assumes that you've got &lt;a href="http://hadoop.apache.org/common/docs/current/single_node_setup.html"&gt;Hadoop installed&lt;/a&gt;, and specifically DFS started,&amp;nbsp;as well. &amp;nbsp;I start DFS from the bin directory of my hadoop install: start-dfs.sh.&lt;br /&gt;I'm running Pig in pseudo-local mode (with my local box set up as a single node hadoop cluster):&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;pig -f &lt;b&gt;&lt;i&gt;pigfile.pig&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I symlinked the pig shell script to /usr/bin/pig for ease of use. I have HADOOP_HOME and JAVA_HOME defined; the pig shell script uses both, and assumes that it is located in a bin directory in a standard pig install.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Pig Latin, UDFs, and Tuples&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Pig scripts are written in Pig Latin. I'm going to cover a small subset of Pig Latin in this post; see the full references, part &lt;a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html"&gt;I&lt;/a&gt; and &lt;a href="http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html"&gt;II&lt;/a&gt;,&amp;nbsp;for comprehensive overviews. 
Where Pig Latin falls short, Pig has User Defined Functions that allow loading, storing, and transforming data. &lt;a href="http://hadoop.apache.org/pig/docs/r0.7.0/udf.html"&gt;UDFs&lt;/a&gt; are written in Java and extend defined interfaces.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Pig Latin and UDFs manipulate Tuples to generate sets of Tuples. Tuples are groups of 1..N values, where a value can be an int, a float, a string, another tuple, or a bag (set) of values. In Pig, you see a lot of&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; X = {do something in pig on} Y;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;where Y is an original set of Tuples and X is the transformed result. Transformations of Data are done as Map-Reduce jobs. The complexity of what you are transforming determines how many map-reduce jobs are run.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;The Goods&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Load the data.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;I'm loading the detail files generated from the XML-&amp;gt;csv transform I did last time. I loaded all detail files into Pig using the built-in PigStorage() UDF, passing ',' as the field delimiter because PigStorage splits on tabs by default. I also specify the name and type of every field in each row:&lt;/div&gt;&lt;div&gt;&lt;pre&gt;A = LOAD '/csv/input/details*' USING PigStorage(',') AS (activityId:chararray, lapId:chararray, time:chararray, latitude:float, longitude:float, alt:float,dist:float,hr:int);&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;This gives me a tuple A that contains data as specified above. Note that this tuple represents all of the rows of loaded data. Also note that I used a wildcard to load all files in at the same time. 
PigStorage() also allows the user to separate locations by comma, which is really great when loading from multiple locations.&lt;br /&gt;&lt;br /&gt;Note that the locations above are in HDFS. &amp;nbsp;I uploaded all of my data into HDFS using the hadoop fs -put command.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Extract the month from the timestamp using a custom UDF.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;Right now I've got the timestamp formatted as a character array. I'm going to need to extract month data and add it to the list of tuple fields in order to get monthly averages. This is where the simple functionality of the pig script is supplanted by a UDF.&lt;br /&gt;&lt;br /&gt;UDFs, as mentioned before, are used to load, store, and transform/extract data. I'm going to write one that extracts the month as an integer between 1-12 &amp;nbsp;from the &lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;time&lt;/span&gt; string. 
Note in the listing below I broke out the actual formatting into a separate function that I could unit test w/o having to create a Tuple instance.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;pre&gt;package com.infovoracious.udfs;&lt;br /&gt;&lt;br /&gt;import java.io.IOException;&lt;br /&gt;import java.util.Calendar;&lt;br /&gt;import org.apache.pig.EvalFunc;&lt;br /&gt;import org.apache.pig.data.Tuple;&lt;br /&gt;&lt;br /&gt;public class GetMonth extends EvalFunc&amp;lt;Integer&amp;gt; {&lt;br /&gt;&lt;br /&gt;  // the script passes the time field as the only argument&lt;br /&gt;  private static final int TIME_INDEX = 0;&lt;br /&gt;&lt;br /&gt;  @Override&lt;br /&gt;  public Integer exec(Tuple input) throws IOException {&lt;br /&gt;    if (input == null || input.size() == 0)&lt;br /&gt;      return null;&lt;br /&gt;    try {&lt;br /&gt;      // looks like this: 2010-02-24T14:22:29Z&lt;br /&gt;      String str = (String) input.get(TIME_INDEX);&lt;br /&gt;&lt;br /&gt;      // Calendar.MONTH is zero-based, hence the + 1&lt;br /&gt;      return DateUtils.dateFromFormattedTime(str, Calendar.MONTH) + 1;&lt;br /&gt;    } catch (Exception e) {&lt;br /&gt;      throw new IOException("Caught exception processing input row ", e);&lt;br /&gt;    }&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;
I also created a UDF, GetWeekOfMonth, to generate the week of the month; it follows the same pattern, using Calendar.WEEK_OF_MONTH. I'm storing both the month and the week of the month as additional columns.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Load the UDFs.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In order for my script to see the UDFs I have just written, I need to load the jar that the UDFs reside in, and alias the methods so that they can be called in the script.&lt;br /&gt;&lt;br /&gt;This is done by using the REGISTER keyword at the top of the script to load the jar, and the DEFINE keyword to alias UDFs. Note that when I define the UDF, I'm specifying its constructor. In this case, GetMonth and GetWeekOfMonth both have default (no parameter) constructors. 
&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;REGISTER foo.jar;&lt;br /&gt;DEFINE extract_month com.infovoracious.udfs.GetMonth();&lt;br /&gt;DEFINE extract_week com.infovoracious.udfs.GetWeekOfMonth();&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I invoke the methods like this: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- the load from last time (',' because PigStorage splits on tabs by default)&lt;br /&gt;A = LOAD '/csv/input/details*' USING PigStorage(',') AS (activityId:chararray, lapId:chararray, time:chararray, latitude:float, longitude:float, alt:float,dist:float,hr:int);&lt;br /&gt;&lt;br /&gt;-- adding columns to a tuple.&lt;br /&gt;B = FOREACH A GENERATE *, extract_month(time) as month , extract_week(time) as week;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Group by month and week of month.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;Now that we have the columns, we can group by them. Grouping by month and week of month collects the rows under each unique (month, week of month) pair:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;C = GROUP B BY (month,week);&lt;br /&gt;&lt;/pre&gt;This creates a tuple that looks like this:&lt;br /&gt;&lt;pre&gt;{month, week}, array length 1..N of {activityId,lapId,time,latitude,longitude,alt,dist,hr}&lt;br /&gt;&lt;/pre&gt;actual values would look like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;{1,1},[{..},{..},..]&lt;br /&gt;{1,2},[{..}...]&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt; &lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Summarize data.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;With columns now grouped, I can summarize data from the grouped tuples. I want to summarize mileage per week. 
I can do that as follows (showing the whole script for continuity):&lt;br /&gt;&lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- the load from last time (',' because PigStorage splits on tabs by default)&lt;br /&gt;A = LOAD '/csv/input/details*' USING PigStorage(',') AS (activityId:chararray, lapId:chararray, time:chararray, latitude:float, longitude:float, alt:float,dist:float,hr:int);&lt;br /&gt;&lt;br /&gt;-- adding columns to a tuple.&lt;br /&gt;B = FOREACH A GENERATE *, extract_month(time) as month , extract_week(time) as week;&lt;br /&gt;&lt;br /&gt;C = GROUP B BY (month,week);&lt;br /&gt;&lt;br /&gt;-- the tuple now looks like group(month,week) B(tuple contents), &lt;br /&gt;-- so I reference dist as a member of the B tuple.&lt;br /&gt;&lt;br /&gt;D = FOREACH C GENERATE *, SUM(B.dist) as total_dist;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;b&gt; &lt;/b&gt;&lt;br /&gt;&lt;b&gt; &lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt; &lt;/b&gt;&lt;b&gt;Flatten the grouping columns.&lt;/b&gt;&lt;/span&gt; &lt;br /&gt;Using the FLATTEN keyword to remove tuple nesting has different effects depending on where you use it. 
If you use FLATTEN to flatten the grouped columns, it merely removes the tuple from around the grouped columns:&lt;br /&gt;&lt;pre&gt;{month, week}, [array of tuples]&lt;br /&gt;&lt;/pre&gt;becomes&lt;br /&gt;&lt;pre&gt;month, week, [array of tuples].&lt;br /&gt;&lt;/pre&gt;Flattening out the array generates a new row for each array element:&lt;br /&gt;&lt;pre&gt;1,2,[{a,b},{c,d}]&lt;br /&gt;&lt;/pre&gt;becomes&lt;br /&gt;&lt;pre&gt;1,2,a,b&lt;br /&gt;&lt;/pre&gt;and&lt;br /&gt;&lt;pre&gt;1,2,c,d&lt;br /&gt;&lt;/pre&gt;In this case I want to flatten out the grouping columns:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;-- the load (',' because PigStorage splits on tabs by default)&lt;br /&gt;A = LOAD '/csv/input/details*' USING PigStorage(',') AS (activityId:chararray, lapId:chararray, time:chararray, latitude:float, longitude:float, alt:float,dist:float,hr:int);&lt;br /&gt;&lt;br /&gt;-- adding columns to a tuple.&lt;br /&gt;B = FOREACH A GENERATE *, extract_month(time) as month , extract_week(time) as week;&lt;br /&gt;&lt;br /&gt;-- grouping&lt;br /&gt;C = GROUP B BY (month,week);&lt;br /&gt;&lt;br /&gt;-- summing (keeping the group key and discarding what I don't need)&lt;br /&gt;D = FOREACH C GENERATE group, SUM(B.dist) as total_dist;&lt;br /&gt;&lt;br /&gt;-- flattening:&lt;br /&gt;-- note how I'm referring to the grouping columns by the 'group' special keyword.&lt;br /&gt;-- note how total_dist is not scoped by an enclosing tuple b/c of the way it was generated above.&lt;br /&gt;E = FOREACH D GENERATE FLATTEN(group),total_dist;&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;span style="font-size: large;"&gt;&lt;b&gt;Store it.&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;Finally, I want to store the results of my work. This is done using the default PigStorage() store UDF:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;STORE E into '$some_dir';&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note that I didn't need to specify 'using PigStorage()'. 
I also used a variable, which I pass into the pig script as a key-value pair, like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;pig -p some_dir=foo...&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;And that's it! I've summed my mileage from each week per month over the last three years in six lines of script! &lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Conclusion&lt;/span&gt;&lt;/div&gt;Pig is a powerful tool to clean up and perform basic operations on data.&amp;nbsp; When you know what you need to do, and have to do it on a lot of data, it works remarkably well. That said, there are a couple of things that start to hurt over time:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;When you write a lot of Pig, you start wanting to re-use sections of script, i.e. make them functions. Except you can't. Yet. So you end up either cutting and pasting (bad), or saving variables off for later use (confusing when you try to use them later) instead of writing script based UDFs (preferred).&amp;nbsp;&lt;/li&gt;&lt;li&gt;There is no conditional execution of an operation. I can't test a variable state and then execute one or another statement based on that variable. What this means in real world use is that I do all of my conditional testing in bash, prior to invoking the pig script, and build up variables based on the results of those conditional tests. Those variables then get accessed by operations.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Delimited text via PigStorage worked for the data I was using above because none of the field values contain the delimiter. PigStorage splits on tabs by default and accepts a different delimiter as an argument, e.g. PigStorage(','), but that raises the question of what happens when someone has injected your delimiter into the data you are trying to process.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Higher level workflow is not solved with Pig. 
We started out running pig via a set of cron jobs, which soon turned into an admin's worst nightmare. Besides just being hard to maintain, we were also not taking advantage of the actual sources and destinations of pig data to schedule the work. In other words, if Pig job A produced intermediate format B and then continued running, we had no way of starting Pig Job C that loaded format B until Pig job A completed.&amp;nbsp; We are currently evaluating &lt;a href="http://github.com/yahoo/oozie"&gt;oozie&lt;/a&gt; at work to see if it can improve on cron+bash (a low bar, I know).&lt;/li&gt;&lt;/ol&gt;Warts and all, Pig sure beats writing map reduces for simple grouping/aggregation functionality. We still write map reduces when we are doing more complex operations, and our ETL job flows end up looking like a combination of pig, raw map-reduce, and HDFS access. But Pig really allows us to get away from a lot of boilerplate work (and maintenance!) for processing the data. Which gives us more time to analyze it.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-large;"&gt;Next&lt;/span&gt;&lt;br /&gt;Of course, just getting summary statistics on my data hasn't really scratched the itch that started me off in the first place. While I now have an idea of my general fitness month to month, I still don't know enough about the amount of hard versus easy running/cycling I did, or how fast I was going when I was going hard or easy. Answering those questions will help me determine my fitness, which I previously defined as heart rate vs pace, accounting for terrain. 
&amp;nbsp;In order to get a metric for that definition of fitness, I need to define what 'hard' and 'easy' really are for me, and analyze how specific values of those workouts have changed for the better or worse over time.&amp;nbsp;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7213121971131700133?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7213121971131700133/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/09/datamining-my-gpshrm-data-step-2-pig.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7213121971131700133'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7213121971131700133'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/09/datamining-my-gpshrm-data-step-2-pig.html' title='Datamining my GPS/HRM Data, Step 2: Pig for ETL'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5383187541737349520</id><published>2010-09-07T22:45:00.000-07:00</published><updated>2010-10-05T21:15:20.286-07:00</updated><title type='text'>Data Mining My GPS/HRM Data: Step 1, Formatting the Data</title><content type='html'>I've been wanting to &lt;a href="http://arunxjacob.blogspot.com/2009/03/thoughts-on-gps-data-analysis.html"&gt;analyze the data 
from my Garmin 305 for a while now&lt;/a&gt;. I've been a casual runner/biker/hardcore data geek for a while now, and last year I &lt;a href="http://dethrockroolz.blogspot.com/2010/08/race-results-lake-meridan-olympic-dist.html"&gt;started doing triathlons&lt;/a&gt;, which means even more data to analyze. While I've always been curious, I just haven't had a great 'need' to analyze it until now...&lt;br /&gt;&lt;br /&gt;I'm switching from a primarily heart rate based training program to a &lt;a href="http://books.google.com/books?id=cNBCPvJioXMC&amp;amp;printsec=frontcover&amp;amp;dq=brain+training+for+runners&amp;amp;source=bl&amp;amp;ots=g0fji6fhgb&amp;amp;sig=6DAsimPsvmPmzpgGC9x6xzsjbAE&amp;amp;hl=en&amp;amp;ei=Zx6HTImBDYOasAPA75HVCg&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=2&amp;amp;ved=0CCcQ6AEwAQ#v=onepage&amp;amp;q&amp;amp;f=false"&gt;'pace based' training program&lt;/a&gt;. The former had me training within specific heart rate zones, the latter has me running at specific pace ranges. I'm a 'measurer', and I'm curious to see how effective (or not!) the pace based training program is. &amp;nbsp;Fortunately I have one device that tracks both heart rate and speed, and I can analyze the effect that the pace based training has relative to the effect that the heart rate based training has on my overall fitness.&lt;br /&gt;&lt;br /&gt;In this case I'm measuring fitness as a combination of heart rate vs terrain covered, i.e. hilly vs flat, vs pace. 
In order to measure my fitness up to now and going forward, I need to answer the following questions:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;how much time did I spend in 'recovery' mode, where my heart rate was &amp;lt; 70% max?&lt;/li&gt;&lt;li&gt;how much time did I spend in 'pain cave' mode, where my heart rate was &amp;gt; 85-90% max?&lt;/li&gt;&lt;li&gt;how much faster (or slower) did I get at the same heart rate over the last year?&amp;nbsp;&lt;/li&gt;&lt;li&gt;for the new pace based program, how much time am I spending at the different paces, i.e. recovery pace, base pace, marathon pace, 1/2 marathon pace, 10k pace, 5k pace, 1 mile pace?&amp;nbsp;&lt;/li&gt;&lt;li&gt;what is my average heart rate for those paces?&amp;nbsp;&lt;/li&gt;&lt;li&gt;how much faster (slower) am I getting?&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;I'm hoping to answer these questions using several approaches and several technologies that I've been using at work, and others that I've been itching to try.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The first thing I needed to do prior to doing any analysis was to convert the data into a format that I could easily operate on. The data is exported from the device into a format called tcx, which is schema-validated XML.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I need the data in csv format, mainly because the tools I want to process the data with are all hadoop based, and while I've read that it is not only possible, but easy, &lt;a href="http://gregorowicz.blogspot.com/2008/08/using-hadoop.html"&gt;to process XML with hadoop&lt;/a&gt;, &amp;nbsp;hadoop works best with csv formats. 
XML is a nice format for nested data, and this is nested data, with the following structure&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;activities&lt;/li&gt;&lt;ul&gt;&lt;li&gt;activity&lt;/li&gt;&lt;ul&gt;&lt;li&gt;laps&lt;/li&gt;&lt;ul&gt;&lt;li&gt;lap -- contains summary averages from trackpoint data (see below)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;trackpoints&lt;/li&gt;&lt;ul&gt;&lt;li&gt;trackpoint -- contains snapshot heart rate, altitude, distance, etc.&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;XML is especially good when there are optional attributes. CSV tends to suck with optional attributes, because nothing is optional. In this case none of the attributes are optional, so ultimately XML is overkill for storing this data.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I collapsed this structure into two csv lists: summary data and detail data, because I planned to act on summary and detail data separately.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The summary data contains lap summary data:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;activity id, lap id, total time, total distance, max speed, max heart rate, average heart rate, calories, number of trackpoints.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;The detail data contains trackpoint data:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;lap id,time, latitude, longitude, altitude, distance, heart rate&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;To do the conversion, I used the &lt;a href="http://libxml.rubyforge.org/rdoc/"&gt;ruby libxml Sax parser&lt;/a&gt;.&amp;nbsp;In order to use the libxml sax parser, I needed to create a callback handler that implemented the methods I wanted to override.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;class PostCallbacks&lt;br /&gt;&amp;nbsp;&amp;nbsp;include XML::SaxParser::Callbacks&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;def on_start_element_ns(element, attributes, prefix, uri, 
namespaces)&lt;br /&gt;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;def on_characters(chars) &lt;br /&gt;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;end&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;def on_end_element_ns(name, prefix, uri) &lt;br /&gt;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;end&lt;br /&gt;end&lt;br /&gt;&lt;/code&gt;&lt;/div&gt;&lt;br /&gt;In the callback handler, I maintained state to track nested XML objects. Typically I would assign state in the on_start_element_ns() method, act on that state in the on_characters() method, and release state in the on_end_element_ns() method. I would also flush my results to disk occasionally to avoid taking up an unreasonable amount of memory.&lt;br /&gt;&lt;br /&gt;I had about 40 MB of data from the previous 3 years, which was parsed into csv files in approximately 44 seconds. I'm more than happy with that performance right now, because this is essentially a one-off job to get the data.&lt;br /&gt;&lt;br /&gt;Next Up: setting up a data processing pipeline using &lt;a href="http://hadoop.apache.org/pig/"&gt;Pig&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5383187541737349520?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5383187541737349520/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/09/data-mining-my-gpshrm-data-step-1.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5383187541737349520'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5383187541737349520'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/09/data-mining-my-gpshrm-data-step-1.html' title='Data Mining My GPS/HRM Data: Step 1, Formatting the Data'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1781130968197627532</id><published>2010-08-27T21:34:00.000-07:00</published><updated>2010-08-28T09:09:28.334-07:00</updated><title type='text'>Beware the Emerging God Object</title><content type='html'>Most &lt;a href="http://en.wikipedia.org/wiki/God_object"&gt;God Objects&lt;/a&gt;&amp;nbsp;are pretty obvious. Bloated, side effect filled, their existence guaranteed by collective fear about refactoring them to a state where their previous functionality is not reproducible.&lt;br /&gt;&lt;br /&gt;The God Object that derailed me for the last couple of days was not bloated, and its functions (for the most part) did not have side effects. It did a lot, but did not seem to be overwrought. &amp;nbsp;In fact, I'm pretty sure it only had a minor &lt;a href="http://www.urbandictionary.com/define.php?term=God%20complex"&gt;God Complex&lt;/a&gt; until I started refactoring it!&lt;br /&gt;&lt;br /&gt;As a software engineer, my goal is to produce the most effective, robust software with the least amount of effort. 
So, in an attempt to not repeat last week's thrash-fest, I'm going to catalog the warning signs, the resolution, and the corrected approach, because I'm pretty sure I'm going to be in this situation again.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Task:&lt;/b&gt;&lt;br /&gt;We needed to make an existing prototype table driven. The prototype functionality configured and ran A/B tests. Its configuration data was hardcoded into an &lt;a href="http://download.oracle.com/javase/tutorial/java/javaOO/enum.html"&gt;enum&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Warning Signs Before We Even Started:&lt;/b&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The core logic of the A/B test configuration &amp;nbsp;resided in a single object.&lt;/li&gt;&lt;li&gt;Despite having several distinct (logical) sub components, this object was responsible for serializing itself and those sub objects.&lt;/li&gt;&lt;li&gt;This object existed in both our configuration time and runtime environments, which have significant differences, the biggest one being that at configuration time, all objects are read/written to a relational database, and at runtime, they are read from a BDB store. Most conceptual objects in the system have config time and runtime implementations.&lt;/li&gt;&lt;li&gt;We could not get rid of this object because it was still being used in production. We would have to implement a phased approach to upgrading it.&amp;nbsp;&lt;/li&gt;&lt;li&gt;I did not know the codebase well.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Our sprints are a week, so we felt compelled to fit an end-to-end task into a week.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;The first four warning signs are simply facts. The last two are things that I should pay more attention to. If I don't know the codebase well, and I'm trying to make significant improvements in a week, my chances of success go way down. 
I chose to ignore that reality.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Warning Signs Once We Started:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The table driven logic did not mesh well with the existing logic. The developer who had written the code had serialized all sub objects into a JSON string. The object was then serialized into a BDB file as a set of attributes and the JSON string, not an object graph. Trying to work with the new table data (an object graph) meant that:&lt;/li&gt;&lt;li&gt;Any change I made to the code had &lt;a href="http://net.pku.edu.cn/~course/cs201/2003/mirrorWebster.cs.ucr.edu/Page_AoAWin/HTML/IntroductionToProcedures4.html"&gt;side effects&lt;/a&gt;. This is because I was trying to preserve the original signature of the object to maintain backward compatibility with the rest of the system. This became obvious when:&lt;/li&gt;&lt;li&gt;The original code, fairly straightforward, became conditional-heavy. Instead of acknowledging the additional complexity,&amp;nbsp;&lt;/li&gt;&lt;li&gt;I was choosing to 'hack' more because I knew that we would have time to rework the code. I knew I was hacking because&lt;/li&gt;&lt;li&gt;The developer I was working with started to get more confused with every checkin I made.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;b&gt;Do Over!&lt;/b&gt;&lt;br /&gt;When his brow started cramping up from being so furrowed, I realized that things were wrong. We grabbed a whiteboard and went to work, depicting the original workflow, how adding table driven configurations changed that workflow, and how to change the existing workflow as little as possible to reduce the amount of rework that we would have to do in the runtime portion of the code. &amp;nbsp;Most of the good ideas came from our dev lead, who had been silently (and then not so silently) observing us trying to make sense of it all. 
I mention that because he was operating at a distance from a lot of the complexity, and his solution avoided most of it. I felt that he out of all of us was able to make the necessary leap out of the stew of constraints and details, and reframe the problem in a way that made it a much less complex problem. That's the genius of software. That's what I do, but only sometimes. That's why I'm writing this down.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Problems and Their Solutions:&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The old functionality contained both runtime behavior and configuration behavior. We chose to separate the two, resulting in a POJO configuration data object used at configuration time, and the 'action' object that knew how to initialize given one of those data objects and contained runtime behavior.&lt;/li&gt;&lt;li&gt;More &lt;a href="http://en.wikipedia.org/wiki/Separation_of_concerns"&gt;separation of concerns&lt;/a&gt;: The old way of storing data in a JSON string &amp;nbsp;and the new table driven way of storing data did not belong together in the same object. &amp;nbsp;Even though we separated the old object into a data object and an 'action' object, we chose to keep the old data separate from the new data, and access both through getters/setters, which insulated us from the underlying implementations.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Composition: because we didn't have enough time to rewrite the infrastructure that wrote the configuration object into a BDB, we needed the new data to be persisted into the configuration object as part of the JSON string. 
Fortunately, once we separated out the data from the configuration object, we could compose the JSON string outside of the configuration object using the POJOs, then set it prior to serializing the configuration object to the BDB file.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;b&gt;Conclusions:&lt;/b&gt;&lt;br /&gt;The warning signs before start played out into more significant problems before they were fully resolved. And, when I think back on the process, I realize I was uncomfortable with the changes I was making, but I chose to push on. Because time was relatively compressed, I found myself making 'poor choices', which ultimately backfired. Next time (these are notes to myself, the 'you' I'm talking to is 'me' :)&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Acknowledge risks up front. Not a lot of time to get it done? That's a risk. Don't know the codebase? That's another risk (assuming the codebase is non trivial).&amp;nbsp;&lt;/li&gt;&lt;li&gt;Examine the old code and determine if there are concerns that should be separated before anything else is done.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Determine the scope of those changes to see if they can be met in the desired time frame.&lt;/li&gt;&lt;li&gt;If not, break the project into smaller chunks.&amp;nbsp;&lt;/li&gt;&lt;li&gt;If it feels easy it's right. If it feels hard, it's less right. Be more right.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Do not hack. When you find yourself hacking, stop. Back up and revisit initial assumptions. Step away from the problem.&lt;/li&gt;&lt;li&gt;Discuss ideas often, with others. If someone else can understand it easily, it can't be that bad. 
If, on the other hand, their brow starts to wrinkle, and your blood pressure starts to go up because there are not words in the English language to describe what you need to describe, it might just be worse than you think.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;b&gt;More Conclusions:&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;God objects never started out as God Objects. They started out as high level concepts -- "oh, I need something that does X", and mutate because of bolted on additional functionality.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Most developers don't set out to create a God Object. In this specific case I was trying to save effort by reducing the amount of change to the original code. However, the additional effort to explain and understand my 'bolt ons' was costing real time and effort.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The earlier you de-factor complexity out of a God Object, the better. Decomposing layered functionality into discrete objects makes it really easy for other people to recognize what those objects are for, and not overlay them with functionality that addresses other concerns.&amp;nbsp;&lt;/li&gt;&lt;li&gt;Conversely, the longer you wait, the harder it is. God Objects tend to have code with lots of side-effects and implicit assumptions. 
Refactoring those successfully is hard, even with comprehensive unit tests.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1781130968197627532?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1781130968197627532/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/08/beware-emerging-god-object.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1781130968197627532'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1781130968197627532'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/08/beware-emerging-god-object.html' title='Beware the Emerging God Object'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4225356513050049881</id><published>2010-04-21T21:48:00.000-07:00</published><updated>2010-04-23T23:09:09.230-07:00</updated><title type='text'>Synchronization Redux</title><content type='html'>In my last post, I discussed how I had refactored the synchronization around some data structures to reduce the need to lock those data structures except when necessary. 
My solution was cleaner in that the synchronization was localized, but the actual synchronization was still a little bit hairy in that it required a recheck inside the lock:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;int newVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&lt;br /&gt;&amp;nbsp;// we only want one thread to refresh the cache, and we &lt;br /&gt;&amp;nbsp;// want the other ones to keep using the old data structures.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;if(newVersion != cacheVersion &amp;amp;&amp;amp; inProgress == false) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;synchronized(lock) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// first thread is in b/c cacheVersion has not been updated yet. &amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; // All other threads &amp;nbsp;&amp;nbsp;&amp;nbsp;evaluate to false.        &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;int checkVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;if(checkVersion != cacheVersion) {&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; }&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp; }&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;}   &lt;/code&gt;&lt;br /&gt;&lt;br /&gt;But I was (a) stuck and (b) had other code to refactor under the gun. So I didn't think about it much until today in the code review. I'm really glad I code reviewed with John, because he is a relentless simplifier who lives by the 'less code' motto.&lt;br /&gt;&lt;br /&gt;Right away I could tell that the whole recheck thing wasn't sitting well with him. The more I explained, the more concerned he looked, until he spotted a potential race condition between threads that might have piled up outside the synchronized block when the cache version had incremented while it was being reloaded. These threads would immediately reload even when they didn't have to, which didn't affect the integrity of the in memory data structures, but was definitely a bad use of cpu.  
He was also insistent that there had to be an easier way, and under his direction we arrived at it:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&amp;nbsp;int newVersion = 0;&lt;br /&gt;&amp;nbsp;// lock is shorter&lt;br /&gt;&amp;nbsp;synchronized(lock) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;newVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&amp;nbsp;      &lt;br /&gt;&amp;nbsp;&amp;nbsp;// leave right away if up to date or a reload is in progress&lt;br /&gt;&amp;nbsp;&amp;nbsp;if(cacheVersion == newVersion || inProgress == true) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;return;&lt;br /&gt;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;      &lt;br /&gt;&amp;nbsp;&amp;nbsp;inProgress = true;&lt;br /&gt;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;               &lt;br /&gt;&amp;nbsp;          &lt;br /&gt;&amp;nbsp;try {&lt;br /&gt;&amp;nbsp;&amp;nbsp;// load from cache&lt;br /&gt;&amp;nbsp;&lt;br /&gt;&amp;nbsp;} catch (Exception e) {&lt;br /&gt;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;} finally {&lt;br /&gt;&amp;nbsp;&amp;nbsp;synchronized(lock) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// set the version. &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;cacheVersion = newVersion;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// we are done here.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;inProgress = false;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;        &lt;br /&gt;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;A couple of things to note: &lt;br /&gt;(1) This solution is definitely easier and more performant in addition to being correct. The lock is not held during the reload, and when it is released other threads will exit the function immediately if a cache rewrite is in progress.&lt;br /&gt;(2) The key thing that I learned from watching John solve the problem is that he wasn't happy until it was dead simple. When I don't take this approach, or abandon it under pressure, there are some potentially costly ramifications.  
The original solution had a complexity smell around it that I chose to ignore. The problem with ignoring that smell is that it comes back to haunt you after you've forgotten why you wrote it in the first place. The solution is, of course, to write code that is simple enough to re-understand months from now. Or simple enough for anyone not familiar with your code to understand.&lt;br /&gt;(3) The lock at the bottom is not actually necessary since (a) the variables being set are volatile and (b) only one thread will get this far. However, since only one thread is getting this far, it is not a performance issue, and actually lends some clarity to the resetting of test conditions. &lt;br /&gt;(4) I still feel like this is somewhat a work in progress...I'm not convinced that the lock in (3) is really required (even though we both signed off on it). But I do feel a lot better about the current algorithm. &lt;br /&gt;(5) I love code reviews. The knowledge transfer is huge. I'm always looking for ways to strip my code down, and the insight I get from a good reviewer (like John) raises my game tremendously.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4225356513050049881?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4225356513050049881/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/04/synchronization-redux.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4225356513050049881'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4225356513050049881'/><link rel='alternate' type='text/html' 
href='http://arunxjacob.blogspot.com/2010/04/synchronization-redux.html' title='Synchronization Redux'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-8157594588287449799</id><published>2010-04-15T22:42:00.000-07:00</published><updated>2010-04-22T09:26:39.185-07:00</updated><title type='text'>Synchronization Requirements</title><content type='html'>Recently I was lucky enough to pull the short straw and refactor the dreaded FooManager (name changed to protect the innocent). FooManager was a horrendously complex, overwrought class that suffered from at least two God Complexes, and my team had been playing hot potato with it for the last couple of months, since we learned we had inherited it.&lt;br /&gt;&lt;br /&gt;My mission was to make FooManager understandable/maintainable without requiring heavy ingestion of psychotropic substances or becoming suicidal. After doing the usual de-Godification and Cleverectomy procedures, i.e. choosing decent abstractions, removing static methods, etc, I was left with a fundamental synchronization dilemma.&amp;nbsp; It was far from the most annoying part of FooManager, but it was definitely the most interesting to refactor. &lt;br /&gt;&lt;br /&gt;FooManager received updates for each specific Foo subclass from different caches. It registered for those updates using a callback interface. The cache notified FooManager whenever it changed. This seemed like an innocent enough pub/sub pattern. 
But it made managing the data structures in FooManager -- the ones that held pre-computed indexes of Foos by name, Foos by id, Foos by other id, etc -- tricky.&lt;br /&gt;&lt;br /&gt;All access to these data structures was done within read/write locks. I think the motivation behind using a read/write lock instead of a more generic synchronization mechanism was to optimize reads at the expense of writes. This made sense, given that writes were infrequent and driven by periodic cache changes, and reads did not need immediate access to new data. But it meant that any time you wanted to access the structures, you did so within the confines of a read/write lock. Even when all you wanted to do was read the data, you needed to acquire the read lock on it. &lt;br /&gt;&lt;br /&gt;The event-callback and the read/write lock had a &lt;a href="http://c2.com/cgi/wiki?AreDesignPatternsMissingLanguageFeatures"&gt;Design-Pattern-as-a-Hammer&lt;/a&gt; smell to them, and as I thought about it some more, I realized why.&lt;br /&gt;&lt;br /&gt;(1) The data structures under read/write locks did not have to be immediately consistent with the cache, they had to be eventually consistent with the cache. That meant that a cache change did not require immediate and total synchronization.&lt;br /&gt;(2) The read/write lock was overkill, but necessary because of the asynchronous nature of the cache refill.&lt;br /&gt;(3) The cost of reloading from the cache was minimal, since the cache data structures were in memory by the time the notification occurred.&lt;br /&gt;(4) What we really wanted to do was only block when a thread was updating the data structures. We didn't want other threads re-updating the data structures. But they shouldn't be blocked from accessing the current data structures while the update was going on. 
&lt;br /&gt;&lt;br /&gt;The eventual consistency and low reload overhead of the system made it possible to do away with the event callback interface and poll the cache for an update by checking a cache version number via a synchronized method. Here's how it worked:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Every request would check the cache version.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The first thread that got an updated version number would enter a lock and update a set of temporary data structures.&amp;nbsp;&lt;/li&gt;&lt;li&gt;All other threads would block on the lock, enter once the original writer thread had exited, make the version check, get the same version that they had entered with, and not attempt a reload.&lt;/li&gt;&lt;/ol&gt;&lt;code&gt;&lt;br /&gt;synchronized protected int checkExpired(int currentVersion) throws Exception{&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;int newVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&amp;nbsp;&amp;nbsp;if(newVersion != currentVersion) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// update temp data structures.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// assign temps to member data structures&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;...&lt;br /&gt;&amp;nbsp;&amp;nbsp;}&lt;br /&gt;&amp;nbsp;&amp;nbsp;return newVersion;&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;Prior to every request, I checked the cache version and reloaded if necessary: &lt;br /&gt;&lt;code&gt;&lt;br /&gt;&amp;nbsp;// cacheVersion is a member variable of FooManager&lt;br /&gt;&amp;nbsp;cacheVersion = checkExpired(cacheVersion);&lt;br /&gt;&amp;nbsp;// now access structure w/o locks.&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This was, well, OK. 
It was definitely simpler, but now all reader threads were blocked while waiting for the first one to update the data structures.&lt;br /&gt;&lt;br /&gt;Thinking back to the original conditions, I remembered that as long as the data structures were eventually updated, all non-writing threads would be just as happy using the original structures in an unblocked manner. &lt;b&gt;In other words, we only needed to make sure that one thread reloaded the original structures. &lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This required synchronization at a more granular level. I removed synchronization from the checkExpired() method and, inside it, guarded the reload code with a conditional statement followed by a lock. Any thread that got past the conditional and then waited on the lock could exit once (a) the lock was released and (b) a second version check told it that it didn't need to do the cache reload. &lt;br /&gt;&lt;code&gt;&lt;br /&gt;protected void checkExpired() {&lt;br /&gt;&lt;br /&gt;&amp;nbsp;int newVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&lt;br /&gt;&amp;nbsp;// we only want one thread to refresh the cache, and we &lt;br /&gt;&amp;nbsp;// want the other ones to keep using the old data structures.&lt;br /&gt;&lt;br /&gt;&amp;nbsp;if(newVersion != cacheVersion &amp;amp;&amp;amp; !inProgress) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;synchronized(lock) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// first thread is in b/c cacheVersion has not been updated yet.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;// All other threads evaluate the re-check below to false.
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;int checkVersion = specialFooCache.getCacheVersion();&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;if(checkVersion != cacheVersion) {   &lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;inProgress = true;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// reload from cache&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.....&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// set data structures so that subsequent threads get&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// the new data&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;....&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// set the version to the one we re-read under the lock. &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;cacheVersion = checkVersion;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;// we are done here.&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;inProgress = false;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;} // end version recheck&lt;br /&gt;&amp;nbsp;&amp;nbsp;} // end lock. &lt;br /&gt;&amp;nbsp;} // end version and inProgress check &lt;br /&gt;} // end checkExpired&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Note that in order to prevent thread-local caching and allow all threads to get immediate access to the data structures after they were reassigned, I declared the data structures that I was updating as &lt;a href="http://www.javamex.com/tutorials/synchronization_volatile.shtml"&gt;volatile&lt;/a&gt;. The cacheVersion and inProgress fields are also read outside the lock, so they need the same treatment.&lt;br /&gt;&lt;br /&gt;The simplification of the cache update process made it a whole lot easier to understand. I think that the conditions -- specifically the eventual consistency -- allowed us some latitude in when reader threads actually got the data. 
There isn't a specific by the numbers approach to solving synchronization issues, but it always helps to understand what you are synchronizing, and more importantly,&amp;nbsp; what you don't have to synchronize.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-8157594588287449799?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/8157594588287449799/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/04/synchronization-requirements.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8157594588287449799'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8157594588287449799'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/04/synchronization-requirements.html' title='Synchronization Requirements'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1213039479098248697</id><published>2010-04-01T22:56:00.000-07:00</published><updated>2010-04-01T22:59:33.361-07:00</updated><title type='text'>Using Google Maps API v3 + JQuery.</title><content type='html'>My side project to &lt;a href="http://arunxjacob.blogspot.com/2009/03/thoughts-on-gps-data-analysis.html"&gt;display and do some detailed data mining of my GPS exercise data&lt;/a&gt;, has been languishing for the better 
part of a year while my day job(s) have been taking most of my day and night time. Garmin has a pretty decent Mac desktop program that provides graphing, graph zooming, and stats by mile, but I'd rather have all of that functionality (and more!) via a web UI. I'd also like to integrate the results of the data mining into that UI in a useful way, i.e. mining relative effort over similar terrain to track actual fitness over time. &lt;br /&gt;&lt;br /&gt;I decided that this project is going to be about having fun and nothing more, and as such decided to write the UI first, because all of my work these days is server side, Java, so 'fun' for me involves more dynamic languages, i.e. JavaScript and Ruby.&lt;br /&gt;&lt;br /&gt;I wanted to display my gps data as a route on a map, which meant getting up to speed on the Google Maps API, and writing a quick dummy server that could dump out some route data for me.&lt;br /&gt;&lt;br /&gt;I decided to use the latest &lt;a href="http://code.google.com/apis/maps/documentation/v3/"&gt;Google Maps API v3 &lt;/a&gt;, and of course &lt;a href="http://jquery.com/"&gt;JQuery&lt;/a&gt; for the front end work, and mocked up a quick backend server using &lt;a href="http://www.sinatrarb.com/"&gt;Sinatra&lt;/a&gt;. I can always redo that backend in something more robust once I want to actually deploy, but for now getting the data to the page is more important than how fast that data is retrieved, or machine resources consumed by serving that data.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Part 1: Displaying The Map&lt;/h3&gt;&lt;br /&gt;I needed to include the google maps api v3 js:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;http://maps.google.com/maps/api/js?sensor=false&lt;/code&gt;&lt;br /&gt;and the latest JQuery:&lt;br /&gt;&lt;code&gt;http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Then in the ready function, I created a map by pointing it to a div and passing in my options. 
&lt;br /&gt;&lt;code&gt;&lt;br /&gt;$(document).ready(function() {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; var myLatlng = new google.maps.LatLng(47.5615, -122.2168);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; var myOptions = {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; zoom: 12,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; center: myLatlng,&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; mapTypeId: google.maps.MapTypeId.TERRAIN&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; };&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; var map = new google.maps.Map($("#map_canvas")[0], myOptions);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // ... the route-drawing code from Part 4 goes here, where map is in scope&lt;br /&gt;});&lt;br /&gt;&lt;/code&gt;Note above that I'm assigning the map to a div with id = "map_canvas".&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Part 2: Parsing the GPS data&lt;/h3&gt;Eventually I'd like to upload data directly from my device, but for now I'm going to skip that part and 'pretend' I've already done it. GPS data is exported in the &lt;a href="http://developer.garmin.com/schemas/tcx/v2/"&gt;TCX&lt;/a&gt; format, which is Garmin-proprietary, but is the easiest to use right now. My current desktop program has the ability to export 1 to N days' worth of data into tcx. &lt;br /&gt;&lt;br /&gt;At some point parsing tcx using a DOM based parser was going to start hurting, so I decided to use a SAX based parser from the start. My usual choice for quick n dirty XML/HTML parsing, hpricot, was therefore not an option. 
I investigated &lt;a href="http://nokogiri.org/"&gt;nokogiri&lt;/a&gt;, but eventually settled on &lt;a href="http://libxml.rubyforge.org/rdoc/index.html"&gt;libxml&lt;/a&gt;, mostly because the rdoc on sax parsing was very clear, and it was &lt;a href="http://snippets.dzone.com/posts/show/7962"&gt;much faster for sax parsing&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;I mainly wanted to parse lat-long data out of the tcx file and dump the coordinates into another file in JSON format. Here is my 5-minute, hacked-together code:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'xml' # libxml-ruby; provides the XML namespace used below&lt;br /&gt;&lt;br /&gt;class PostCallbacks&lt;br /&gt;include XML::SaxParser::Callbacks&lt;br /&gt;&lt;br /&gt;def initialize(write_file) &lt;br /&gt;&amp;nbsp; @state="unset"&lt;br /&gt;&amp;nbsp; @write_file = File.open(write_file,"w")&lt;br /&gt;&amp;nbsp; @buffer = "{\"data\" : ["&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;def on_start_element_ns(element, attributes, prefix, uri, namespaces)&lt;br /&gt;&lt;br /&gt;&amp;nbsp; if element == 'LatitudeDegrees'&lt;br /&gt;&amp;nbsp;&amp;nbsp; @state = "in_lat"&lt;br /&gt;&amp;nbsp; elsif element == 'LongitudeDegrees'&lt;br /&gt;&amp;nbsp;&amp;nbsp; @state = "in_long"&lt;br /&gt;&amp;nbsp; end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;def on_characters(chars) &lt;br /&gt;&amp;nbsp; if(@state=="in_lat") &lt;br /&gt;&amp;nbsp;&amp;nbsp; @buffer += "{\"lat\": #{chars}"&lt;br /&gt;&amp;nbsp; elsif(@state == "in_long")&lt;br /&gt;&amp;nbsp;&amp;nbsp; @buffer +=  ", \"long\": #{chars}},"&lt;br /&gt;&amp;nbsp; end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;def on_end_element_ns(element,prefix,uri)  &lt;br /&gt;&lt;br /&gt;&amp;nbsp; @state="unset"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;def on_end_document() &lt;br /&gt;&lt;br /&gt;&amp;nbsp; # drop the trailing comma, then close the array and the object&lt;br /&gt;&amp;nbsp; @buffer = @buffer.slice(0,@buffer.length-1)&lt;br /&gt;&amp;nbsp; @buffer += ("]}")&lt;br /&gt;&amp;nbsp; @write_file.puts(@buffer)&lt;br /&gt;&amp;nbsp; @write_file.close()&lt;br /&gt;end&lt;br /&gt;&amp;nbsp;&lt;/code&gt;&lt;br 
/&gt;&lt;code&gt;end&lt;br /&gt;&lt;br /&gt;parser = XML::SaxParser.file(ARGV[0])&lt;br /&gt;parser.callbacks = PostCallbacks.new(ARGV[1])&lt;br /&gt;parser.parse&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Part 3: Serving The GPS Data Up&lt;/h3&gt;My goal here was to basically dump that generated file as a response to an AJAX request. &lt;a href="http://www.sinatrarb.com/"&gt;Sinatra&lt;/a&gt; is perfect for delivering quick services like this. I used Sinatra's DSL to handle a GET request as follows:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;require 'rubygems'&lt;br /&gt;require 'sinatra'&lt;br /&gt;require 'open-uri'&lt;br /&gt;require 'json'&lt;br /&gt;&lt;br /&gt;&amp;nbsp; get '/sample_path.json' do&lt;br /&gt;&amp;nbsp;&amp;nbsp; content_type :json&lt;br /&gt;&amp;nbsp;&amp;nbsp; File.open("../output/out.json") do | file | &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; file.read # the whole file becomes the response body&lt;br /&gt;&amp;nbsp;&amp;nbsp; end&lt;br /&gt;&lt;br /&gt;&amp;nbsp; end&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;../output/out.json is the file that the parser in Part 2 wrote the lat-long data into. &lt;br /&gt;I ended up spending a lot of time debugging my get method (http://localhost:7000/sample_path.json). In Firefox I was unable to get a response back from my GET request. I googled around and &lt;a href="http://support.mozilla.com/tiki-view_forum_thread.php?comments_parentId=628602&amp;amp;forumId=1"&gt;apparently there are some compatibility issues with Firefox 3.6.2 and JQuery&lt;/a&gt;. 
I was, however, able to get the code to work in Safari, and I'm considering downgrading to FF 3.5 because I haven't seen those kinds of problems with that browser, and Firebug is an essential part of my debugging library.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Part 4: Drawing The Data On the Map&lt;/h3&gt;The JS code that made the request loads the results into a google.maps.MVCArray, which it then uses to create a polyline superimposed on the map:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;var url = "http://localhost:4567/sample_path.json";&lt;br /&gt;$.ajax({&lt;br /&gt;&amp;nbsp; type: "GET",&lt;br /&gt;&amp;nbsp;     url: url,&lt;br /&gt;&amp;nbsp;     beforeSend: function(x) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;      if(x &amp;amp;&amp;amp; x.overrideMimeType) {&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;       x.overrideMimeType("application/json;charset=UTF-8");&lt;br /&gt;&amp;nbsp;      }&lt;br /&gt;&amp;nbsp;     },&lt;br /&gt;&amp;nbsp;     dataType: "json",&lt;br /&gt;&amp;nbsp;     success: function(data,success){&lt;br /&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;       var latLongArr = data['data'];&lt;br /&gt;&amp;nbsp;&amp;nbsp;      var pathCoordinates = new google.maps.MVCArray();&lt;br /&gt;&amp;nbsp;&amp;nbsp;      for(var i = 0; i &amp;lt; latLongArr.length; i++) {&amp;nbsp; &lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; // each coordinate is put into a LatLng.&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; var latlng = new&amp;nbsp; google.maps.LatLng(latLongArr[i]['lat'],&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; latLongArr[i]['long']);&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; pathCoordinates.insertAt(i,latlng);&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; }&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; // and this is where we actually draw 
it.&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; var polyOptions = {&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; path: pathCoordinates,&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; strokeColor: '#ff0000',&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; strokeOpacity: 1.0,&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; strokeWeight: 1&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; };&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; poly = new google.maps.Polyline(polyOptions);      poly.setMap(map);&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;&amp;nbsp; }&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;code&gt;&amp;nbsp;});&amp;nbsp;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Result&lt;/h3&gt;Not super impressive, but a good start!&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/_NqxvfgwIOvA/S7WG5HrGT3I/AAAAAAAAGJo/yoJw0wYTsyM/s1600/sample_map.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="400" src="http://2.bp.blogspot.com/_NqxvfgwIOvA/S7WG5HrGT3I/AAAAAAAAGJo/yoJw0wYTsyM/s400/sample_map.png" width="370" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1213039479098248697?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1213039479098248697/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/04/using-google-maps-api-v3-jquery.html#comment-form' title='2 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1213039479098248697'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1213039479098248697'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2010/04/using-google-maps-api-v3-jquery.html' title='Using Google Maps API v3 + JQuery.'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_NqxvfgwIOvA/S7WG5HrGT3I/AAAAAAAAGJo/yoJw0wYTsyM/s72-c/sample_map.png' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5883038542041553474</id><published>2010-03-09T22:06:00.000-08:00</published><updated>2010-03-09T22:06:24.437-08:00</updated><title type='text'>Don't Drive by Dumb Dogmatic Data</title><content type='html'>My posting has fallen off severely since taking up the new &lt;a href="http://richrelevance.com/"&gt;job&lt;/a&gt;, mostly because I have had nothing intelligent to say while I'm trying to ramp up on the technology as well as the business drivers behind the technical decisions being made.&amp;nbsp; However, something did come up today that I want to remember, and I've heard more than once that &lt;a href="http://lifehacker.com/5477231/it-didnt-happen-if-you-didnt-write-it-down"&gt;if you don't write it down, you don't remember it&lt;/a&gt;.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;I've been thinking a lot about numbers. In this brave new world, a lot of companies are 'data driven'. 
That means they make their decisions based on the data that is around them, and if &lt;a href="http://management.about.com/od/metrics/a/Measure2Manage.htm"&gt;they can't measure it, they can't manage it&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;That is a statement that sounds basically logical, assuming that&lt;br /&gt;(1) what is being measured is clearly understood by everyone, and&lt;br /&gt;(2) changes that occur to the measurement correlate well to overall system state.&lt;br /&gt;&lt;br /&gt;In the engineering world,&amp;nbsp; people don't have a lot of patience with numbers that are not explainable. So the mythical 'sales numbers' that drive entire sales teams off of cliffs every quarter are usually sneered at by engineers, who hold themselves up as the high priests and priestesses of logic.&lt;br /&gt;&lt;br /&gt;For engineering teams, numbers like &lt;a href="http://www.javvin.com/hardware/TPS.html"&gt;TPS&lt;/a&gt;, &lt;a href="http://www.sixsigmaspc.com/dictionary/MTTF-meantimetofailure.html"&gt;MTTF&lt;/a&gt;, etc, are not only easy to conceptualize, but changes in them are good indicators of system functionality. More importantly to engineering organizations, you don't have to be a software developer to understand what a decrease in TPS or MTTF means to the business.&lt;br /&gt;&lt;br /&gt;So engineering management is always looking for other numbers that encapsulate system health. Again, this is a perfectly reasonable goal, because good metrics serve as a useful abstraction layer around the grimy bits of the sausage factory. However, I think that the quest for engineering metrics is one more piece of evidence of how rational starting points end up being ridiculous the moment logic is abandoned in favor of dogma. 
&lt;br /&gt;&amp;nbsp; &lt;br /&gt;We've all laughed ourselves silly at the old stories of measuring programmer productivity by &lt;a href="http://en.wikipedia.org/wiki/Source_lines_of_code"&gt;lines of code written&lt;/a&gt;, but what are the programmers of tomorrow going to laugh at? My first candidate would be &lt;i&gt;&lt;b&gt;measuring the quality of unit tests by the&amp;nbsp; unit test code coverage metric&lt;/b&gt;&lt;/i&gt; -- specifically what percentage of total lines of code are covered by unit  tests. &lt;br /&gt;&lt;br /&gt;These days unit test code coverage is easy to get. We get ours from a &lt;a href="http://www.ibm.com/developerworks/java/library/j-cobertura/"&gt;Cobertura&lt;/a&gt; plugin for Maven. &lt;br /&gt;Code coverage is one of those measurements that initially sounds really good. If the test coverage decreases, that's bad, right? If it increases, well, good job to the developers! &lt;br /&gt;&lt;br /&gt;Wait, not so fast. If line coverage is supposed to be an indicator of quality, that implies that just because a test causes code to be exercised, the test is good. But wait,&amp;nbsp; I can write lots of tests with zero assertions. I've verified that in very specific cases there are no NPEs, but that's about it. &lt;br /&gt;&lt;br /&gt;If I take unit test line coverage to represent the quantity of unit tests written without looking at the number of assertions being made per test, I'm only seeing part of the picture. If unit test coverage number changes indicate that testing is or is not being done on new code, there could be lots of false positives and negatives. For example, when I add a bunch of code in a finally block, and the function I'm  adding that code to is in a unit test, my line coverage goes up without  me actually writing any more tests. Conversely, if I'm  adding that finally block to a function that is not covered, my line coverage goes down. 
In either case, do the corresponding line coverage increases and decreases actually mean anything about the quality of the tests written? &lt;br /&gt;&lt;br /&gt;What are good indicators of unit test quality if line coverage is misleading? As someone who writes a lot of unit tests, I would venture that test quality has some correlation to assertion density, with some caveats. In other words, what and how much is being checked when a method is tested?&amp;nbsp; Assuming that the tested method returns a value, there is at a minimum one thing to check. If the value is a structure, there are more.&lt;br /&gt;&lt;br /&gt;In any case, high assertion density usually means that verification is being taken seriously, and also that any changes to the code have to pass all assertions - or the assertions need to be changed to match the new code. Either case requires explicit validation of the contract put in place by the assertions in the unit test. Note that assertion density is only valid when measuring direct output -- if a test is verifying&amp;nbsp; data that is not a direct output of the method being tested, is it &lt;a href="http://en.wikipedia.org/wiki/Side_effect_%28computer_science%29"&gt;condoning code side effects&lt;/a&gt;? Assertion density needs to be normalized by the number of acceptable assertions, i.e. the number of things you can check in the return value, if there is a return value. The assertion density metric should score badly if data that is not explicitly related to the method output is being checked. But maybe that would be conflating the concerns of side-effect-free code and high quality unit testing. 
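&lt;br /&gt;&lt;br /&gt;A small Java sketch of the zero-assertion problem, under made-up names (AssertionDensitySketch, discountedPrice, and both test methods are hypothetical, and plain boolean methods stand in for a real test framework): both tests execute every line of the method, so line coverage can't tell them apart, but only the one that checks the return value notices the deliberately planted bug.&lt;br /&gt;&lt;pre&gt;
```java
public class AssertionDensitySketch {
    // Hypothetical method under test, with a deliberate bug:
    // the discount is added instead of subtracted.
    public static int discountedPrice(int price, int discount) {
        int result = price + discount; // bug: should be price - discount
        return result;
    }

    // Zero assertions: this runs every line of discountedPrice(), so line
    // coverage is complete, yet it can only fail if an exception is thrown.
    public static boolean coverageOnlyTest() {
        discountedPrice(100, 20);
        return true;
    }

    // One assertion on the direct output of the method under test.
    public static boolean assertingTest() {
        return discountedPrice(100, 20) == 80;
    }

    public static void main(String[] args) {
        System.out.println("coverage-only test passed: " + coverageOnlyTest());
        System.out.println("asserting test passed: " + assertingTest());
    }
}
```
&lt;/pre&gt;Running main should report the coverage-only test as passing and the asserting test as failing, even though the two produce identical line coverage for discountedPrice().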
&lt;br /&gt;&lt;br /&gt;Another metric in unit testing that correlates with good coverage is conditional branch coverage.&amp;nbsp; If I can assume that every block of code may contain one or more possibly nested conditional statements,&amp;nbsp; then I know that I've at least got decent coverage when a high percentage of conditional branches are covered. I don't think that branch coverage means a lot without assertion density checks, but it does mean a lot more than simple line coverage. Ironically, Cobertura provides branch coverage, but all of the QA managers I've worked with have gravitated towards line coverage as the more meaningful metric.&lt;br /&gt;&lt;br /&gt;Ideally I would like to see a number based on assertion density and branch coverage. This number would behave well across a wide range of assertion and branch coverage input, sort of like the &lt;a href="http://en.wikipedia.org/wiki/File:Half_Your_Age_Plus_Seven_Graph.JPG"&gt;half your age plus seven dating metric&lt;/a&gt;.&amp;nbsp; That would make it meaningful, and a good measurement to drive test quality 'up and to the right'.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5883038542041553474?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5883038542041553474/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2010/03/dont-drive-by-dumb-dogmatic-data.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5883038542041553474'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5883038542041553474'/><link rel='alternate' type='text/html' 
href='http://arunxjacob.blogspot.com/2010/03/dont-drive-by-dumb-dogmatic-data.html' title='Don&apos;t Drive by Dumb Dogmatic Data'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6722331304611936432</id><published>2009-12-21T22:52:00.000-08:00</published><updated>2010-01-13T20:29:16.099-08:00</updated><title type='text'>Impedance Mismatch</title><content type='html'>&lt;h3&gt;Post Mortem:&lt;/h3&gt;Long story short: every time I've had a project fail, one recurring theme has been bad technology choices. There have been others, like group dysfunctionality aka 'collective asshat fatigue', where the entire group stops functioning to avoid dealing with one or more aberrant personalities. Bad project scoping/definition also contribute to failure rates, but I don't think that there is much intersection between groundbreaking work and the prototypical well defined, well understood project.&amp;nbsp; So it would seem that issues of scoping as well as team dynamics are&amp;nbsp; (a) exacerbated by and therefore (b) secondary to the bad technology choices that put the project in jeopardy.&lt;br /&gt;&lt;br /&gt;I thought I had gotten better at detecting when I was making a bad technology choice, but I recently made one. Fortunately we were able to turn things around, but it was hard.&amp;nbsp; In the interest of not making this class of bad decision again -- because the definition of insanity is to do the same thing and expect different results --&amp;nbsp; I want to dissect what went wrong, what went right, and what I learned. &lt;br /&gt;&lt;h3&gt;The Choice:&lt;/h3&gt;I recently started a new job. 
My first task was to jump in and assist on a prototype project by writing feed parsers that would parse millions of rows of comma separated values from feeds into various records in an SQL database. I was initially constrained to using Java.&amp;nbsp; I took a standard object=row approach, persisting single objects at a time.&lt;br /&gt;&lt;br /&gt;The code was written test-first (TDD), separation of concerns was good, all unit tests passed (coverage was good). Several hundred lines of code were required to parse, clean, validate, and insert data from various feeds into the database. I did not use an ORM; I used straight SQL via JDBC.&lt;br /&gt;&lt;h3&gt;The Results:&lt;/h3&gt;The performance was horrendous. Inserting several million rows took hours, hours that we simply didn't have. The performance impacted the effectiveness of every downstream operation, and was jeopardizing the success of the overall project. Not a good way to start a new job :)&lt;br /&gt;&lt;h3&gt;The Workaround:&lt;/h3&gt;We ended up ditching Java completely and relying on existing unix commandline tools to parse the files, insert into temp tables, and do bulk updates/inserts of rows from those temp tables into the main, 'canonical' tables.&amp;nbsp;&amp;nbsp; In other words, the 100s of lines of Java parsing and insertion code that I wrote in a week or so (counting unit testing) and frantically reworked several times to try and speed up were replaced by something like this:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;cat rawfile.csv | cut -d, -f1 | tr '[:upper:]' '[:lower:]' | sed -e 's/\r$//' | sort | uniq | psql -c "copy tablename from stdin using delimiters ','" | actual_queries.pl&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;This took a multi-hour load down to 5 minutes. 
There was a bunch of pre-formatting prior to inserting into the database, and a perl script that ran afterwards, using DBI to copy/update from the temp table.&lt;br /&gt;&lt;br /&gt;In general, the one rule that emerged was 'do as much processing as possible before going to the db'. For example, determining set exclusion/intersection, which is something I would have definitely gone to code or SQL for, could be done on the command line with the &lt;a href="http://www.computerhope.com/unix/ucomm.htm"&gt;comm&lt;/a&gt; utility:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;comm -12 &amp;lt;(sort file1) &amp;lt;(sort file2)&amp;nbsp; # lines common to file1 and file2 (the intersection)&lt;br /&gt;comm -13 &amp;lt;(sort file1) &amp;lt;(sort file2)&amp;nbsp; # lines that appear only in file2&lt;br /&gt;# etc.&lt;br /&gt;&lt;/code&gt;&lt;br /&gt;&lt;h3&gt;Conclusions:&lt;/h3&gt;&lt;h2&gt;Conclusion 1: Tests Still Required.&lt;/h2&gt;The good thing about piping a bunch of common unix tools together is that they have been around for a long, long time, which means you don't have to worry about the integrity of the data as much as you have to worry about using the tool options correctly. The bad thing about this approach is that the only kind of testing is integration testing, and it is easy to blow off when the initial solution works (or seems to).&lt;br /&gt;&lt;br /&gt;After getting bitten when the queries worked but the data had integrity issues that manifested in the logic,&amp;nbsp; we ended up writing a bunch of scripts that verified data integrity by making queries and inspecting result sets. We also leveraged the database, adding constraints that would allow the script to fail fast and alert us to schema integrity issues, like duplicate rows. &lt;br /&gt;&lt;h2&gt;Conclusion 2: It's Not the Database's Fault.&lt;/h2&gt;The database is a very convenient scapegoat, but the truth&amp;nbsp; is that I spoon fed data into the database, and I could only move as fast as I could move my spoon (in Java). 
The better approach is to bulk feed data into the database, via bulk copies and bulk inserts/updates. Again, verification/validation scripts and constraints are required.&lt;br /&gt;&lt;h2&gt;Conclusion 3: SQL Good, ORM Bad.&lt;/h2&gt;The truth is that we could have done this in Java, had we just used the same SQL we ended up using in the Workaround. My mistake when using Java was to put on my ORM blinders, which are great for when I want to pretend that there is some arbitrary data store underneath my code. This works until it doesn't, usually at 12AM the day of a release.&lt;br /&gt;&lt;br /&gt;Multiple FAILS mean I'm done pretending the database is some fuzzy abstract data 'store', because I will use one when I want to mine data along arbitrary axes -- in other words, I'll use a database precisely to use SQL and not some mapping to it.&amp;nbsp; SQL is a mature and extremely powerful way to ask open ended questions of a schema.&amp;nbsp; If I don't want to ask open ended questions of my data, I shouldn't use a database, because that's what they're built for.&amp;nbsp; BTW I haven't used Hive or Pig yet, but these seem to be the QL solutions for much larger datasets than the one I was working with. &lt;br /&gt;&lt;h2&gt;Conclusion 4: When in Doubt, Go Cheap, Go Fast.&lt;/h2&gt;However, just because we could have done it in Java doesn't mean we should have. Perl or Ruby or Python or Bash and the plethora of solid utilities available will now always be my first option when putting together a data input operation at this particular scale.&lt;br /&gt;&lt;br /&gt;I think there will always be those opportunities that present themselves as vaguely defined chances to hit it big. 
Instead of taking lots of time up front to define the work involved at the expense of the actual opportunity, I'm going to move ahead with cheap and fast technologies that let me change path extremely quickly, because I'm sure I will need to at least once during the course of an ill defined project. &lt;br /&gt;&lt;h2&gt;Conclusion 5: Keep the blinders off!&lt;/h2&gt;This entire experience was a huge reminder to me to be open minded about choosing the right tool for the job. This was an instance where I let the technology choice mandate my implementation decisions, instead of the other way around. Every time I do that, I get screwed. If, instead of putting my head down and running as fast as I could,&amp;nbsp; I had initially asked questions about the duration of the project, the intention of the code, the performance constraints on data input, etc, I could have easily justified the use of Perl/unix tools/raw SQL, and saved a lot of late nights/coding angst.&lt;br /&gt;&lt;h2&gt;Conclusion 6: It's The People, Stupid&lt;/h2&gt;One thing that overwhelmingly shone through even in the grimmest of moments was the quality and class of the people I was working with. They all stayed completely focused on the solution, and did not point any fingers even when to do so would have been more than understandable. Furthermore, they were able to keep their sense of humor intact. 
While I didn't necessarily enjoy making this big of a mistake at a new job, the level of teamwork, professionalism, and respect from my new co-workers was complete confirmation of my reasons for jumping ship from my old company.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6722331304611936432?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6722331304611936432/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/12/impedance-mismatch.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6722331304611936432'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6722331304611936432'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/12/impedance-mismatch.html' title='Impedance Mismatch'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7459548493941439915</id><published>2009-10-15T21:59:00.000-07:00</published><updated>2009-10-16T15:40:01.145-07:00</updated><title type='text'>Using Zookeeper to Implement a Redundant Persistence Service</title><content type='html'>&lt;h3&gt;Preface&lt;/h3&gt;I've &lt;a href="http://arunxjacob.blogspot.com/2009/08/zookeeper-and-concurrency.html"&gt;previously detailed&lt;/a&gt; how we use &lt;a 
href="http://hadoop.apache.org/zookeeper/"&gt;Zookeeper&lt;/a&gt; to generate unique sequenceIDs for items created by a pool of worker processes. Zookeeper is the ideal solution for this problem because of its strict order/version enforcing that allows completely non-blocking access to data, its redundancy, and its easy-to-understand node and data paradigm.&lt;br /&gt;&lt;br /&gt;Our current approach has multiple clients (one per worker process) requesting the latest sequence ID from a Zookeeper node. These sequence IDs are critical to the integrity of our data: they cannot be duplicated. Using Zookeeper version information, multiple clients request values and try to update those values with a specific version. This update will fail if someone else has updated the version in the meantime, so the client handles that failure, gets the latest value, increments it, and tries to update again. Again, this is a non-blocking approach to updating a value, so there is no system level bottleneck.&lt;br /&gt;&lt;h3&gt;Redundancy Required&lt;/h3&gt;In order to make our system truly redundant, we need to account for what happens if all zookeeper instances go offline, by persisting the latest generated sequence IDs -- again we absolutely need to not duplicate IDs to maintain system integrity.&amp;nbsp; When we persist sequence IDs,&amp;nbsp; it is possible to restart all zookeeper instances and have them pick up where they left off. Note that we can reduce the amount of persisting needed by 'reserving' a future ID, persisting it, and only modifying it when the generated IDs actually get to that value. In other words, persist ID 100, and update that value to 200 when the system generates ID = 100. 
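&lt;br /&gt;&lt;br /&gt;Here is a minimal sketch of that reservation arithmetic, assuming a block size of 100. The names SequenceReservation, reservationFor and resumeFrom are made up for illustration and are not taken from our service; only the modulo check mirrors the process() method shown further down, and the sketch assumes the reservation is persisted before any later ID is handed out.&lt;br /&gt;&lt;pre&gt;
```java
public class SequenceReservation {

    // Block size (an assumption for this sketch); plays the role of
    // RESERVE_AMOUNT in the process() method later in this post.
    static final long RESERVE_AMOUNT = 100L;

    // Returns the value to persist when the generated ID lands on a block
    // boundary, or -1 when nothing needs to be persisted for this ID.
    static long reservationFor(long currentSequenceID) {
        if (currentSequenceID % RESERVE_AMOUNT == 0) {
            return currentSequenceID + RESERVE_AMOUNT;
        }
        return -1L;
    }

    // After a full restart, start handing out IDs from the persisted
    // reservation instead of the last ID that was actually generated.
    static long resumeFrom(long persistedReservation) {
        return persistedReservation;
    }

    public static void main(String[] args) {
        System.out.println(reservationFor(100L)); // 200
        System.out.println(reservationFor(157L)); // -1
        System.out.println(resumeFrom(200L));     // 200
    }
}
```
&lt;/pre&gt;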
This maintains ID uniqueness across system restarts at the loss of up to 100 values, which is a decent tradeoff.&lt;br /&gt;&lt;h3&gt;Persistence via Watches&lt;/h3&gt;The simplest implementation of a persistence service takes advantage of Zookeeper's &lt;a href="http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperProgrammers.html#ch_zkWatches"&gt;watch&lt;/a&gt; functionality, which lets a client register for notifications when a node goes away or its value changes. The client gets notified when a watched value changes, and receives an Event object with the details of the change. In our case, the client is a Persistence Service, which retrieves the location of the updated data from the Event object, retrieves the data, and determines whether it needs to reserve a future block of numbers as described above. Note that Zookeeper Watches are defined to be one time triggers, so it is necessary to reset the watch if you want to keep receiving notifications about the data of a specific node, or the children of a specific node. &lt;br /&gt;&lt;br /&gt;You watch data by registering a Watcher callback interface with Zookeeper. The Watcher interface declares the process() method, which handles the WatchedEvent parameter. The following process() method determines when to persist the next reserved value. 
&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public void process(WatchedEvent event) {&lt;br /&gt;        Stat stat = new Stat();&lt;br /&gt;        &lt;br /&gt;        EventType type = event.getType();&lt;br /&gt;        String node = event.getPath();&lt;br /&gt;        if(type.equals(EventType.NodeDataChanged)) {&lt;br /&gt;            &lt;br /&gt;            try {&lt;br /&gt;                &lt;br /&gt;                byte inData[] = getData(sessionZooKeeper,event.getPath(),stat);&lt;br /&gt;                currentSequenceID = SequenceHelper.longFromByteArray(inData);&lt;br /&gt;                &lt;br /&gt;                if(currentSequenceID % RESERVE_AMOUNT == 0) {&lt;br /&gt;                    persistSequenceID(node,currentSequenceID+RESERVE_AMOUNT);&lt;br /&gt;                }&lt;br /&gt;                &lt;br /&gt;            } catch (Exception e) {&lt;br /&gt;                logger.error(e);&lt;br /&gt;            }&lt;br /&gt;            &lt;br /&gt;        }&lt;br /&gt;        else if(type.equals(EventType.NodeDeleted)) {&lt;br /&gt;            logger.error("data node "+event.getPath()+" was deleted");&lt;br /&gt;        }&lt;br /&gt;        &lt;br /&gt;        // every time you process a watch you need to re-register for the next one. &lt;br /&gt;        try {&lt;br /&gt;            addWatch(sessionZooKeeper,this,node);&lt;br /&gt;        } catch (Exception e) {&lt;br /&gt;            throw new RuntimeException(e);&lt;br /&gt;        }&lt;br /&gt;        &lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;In this case, I point Zookeeper to the Watcher object when I retrieve a ZooKeeper instance:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public ZooKeeper getZooKeeper() throws Exception {&lt;br /&gt;        &lt;br /&gt;        return new ZooKeeper(hostString,DEFAULT_TIMEOUT,watcher);&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Watchers, as is implied above, are bound to the lifecycle of the Zookeeper object they are passed into. 
Watchers can also be established when checking to see if a node exists, or when retrieving children of a node, because the client may want to be notified if a node is created or if children of a node are created. &lt;br /&gt;&lt;h3&gt;Redundancy and Leader Election/Succession&lt;/h3&gt;Of course, a persistence service is useless if it is not redundant, especially if the integrity of our data requires us to persist reserved sequence IDs.&amp;nbsp; We only want one process at a time persisting sequence IDs. If that process goes down, we want another process to step in immediately.&amp;nbsp; In other words, we want a single leader to be selected from a group of 2 or more independent processes, and we want immediate succession if that leader were to go away.&lt;br /&gt;&lt;br /&gt;In order to handle Leader Election and Leader Succession, the multiple persistence processes create and watch&amp;nbsp; &lt;a href="http://hadoop.apache.org/zookeeper/docs/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes"&gt;Sequential-Ephemeral&lt;/a&gt; nodes. Sequential-Ephemeral nodes have the following properties:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;They automatically increment if there is a previous numbered node.&lt;/li&gt;&lt;li&gt;They go away if the ZooKeeper instance&amp;nbsp; that created them goes away.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;The persistence processes use these two properties when they start up:&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;They create a Sequential-Ephemeral node (note, this requires that they have a ZooKeeper instance open for as long as they are alive, so that the node sticks around).&lt;/li&gt;&lt;li&gt;They check to see if there is a lower numbered Sequential-Ephemeral node.&lt;/li&gt;&lt;li&gt;If not, they are the leader, and they register to watch for changes on the nodes used to track sequence IDs.&lt;/li&gt;&lt;li&gt;If there is a lower numbered sequential-ephemeral node, they register to watch that node. 
Specifically, they want to get notified if that node goes away.&lt;/li&gt;&lt;li&gt;They only want to watch the nearest lower numbered node. This avoids a 'swarm' that would happen if all lower nodes watched the single leader node. So succession is guaranteed to be orderly and only done by one node at a time.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;pre&gt;public synchronized void startListening() throws Exception {&lt;br /&gt;       &lt;br /&gt;       // create a persistent node to store Sequential-Ephemerals under.&lt;br /&gt;       if(checkExists(sessionZooKeeper, SEQUENCE_MAINTAINER_LEADER_NODE) == false) {&lt;br /&gt;           createNode(sessionZooKeeper, SEQUENCE_MAINTAINER_LEADER_NODE, null,&lt;br /&gt;              CreateMode.PERSISTENT);&lt;br /&gt;       }&lt;br /&gt;       &lt;br /&gt;       // sequential nodes end in /n_&lt;br /&gt;&lt;br /&gt;       String path = SEQUENCE_MAINTAINER_LEADER_NODE+PREFIX;&lt;br /&gt;       createdNode= createNode(sessionZooKeeper, path, null,&lt;br /&gt;           CreateMode.EPHEMERAL_SEQUENTIAL);&lt;br /&gt;       &lt;br /&gt;       // this method transforms the sequential node string to a number&lt;br /&gt;       // for easy comparision&lt;br /&gt;       int sequence = sequenceNum(createdNode);&lt;br /&gt;       &lt;br /&gt;       // only watch the sequence if we are the lowest node.&lt;br /&gt;       if(doesLowerNodeExist(sequence) == false) {&lt;br /&gt;           logger.debug("this node is the primary node");&lt;br /&gt;           isPrimary = true;&lt;br /&gt;           loadSequenceMaintainers();&lt;br /&gt;       }&lt;br /&gt;       else &lt;br /&gt;       {&lt;br /&gt;           // this node is a backup, watch the next node for failure.&lt;br /&gt;           isPrimary = false;&lt;br /&gt;           &lt;br /&gt;           watchedNode = getNextLowestNode(sequence);&lt;br /&gt;           if(watchedNode != null) {&lt;br /&gt;               logger.info("this node is not primary, it is watching 
"+watchedNode);&lt;br /&gt;               boolean added = super.addWatch(sessionZooKeeper,this,watchedNode);&lt;br /&gt;               if(added == false) {&lt;br /&gt;                   throw new SequenceIDDoesntExist(watchedNode);&lt;br /&gt;               }&lt;br /&gt;           }&lt;br /&gt;           else {&lt;br /&gt;               throw new SequenceIDDoesntExist(watchedNode);&lt;br /&gt;           }&lt;br /&gt;       }&lt;br /&gt;       &lt;br /&gt;       isListening = true;&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In the code above, I've abstracted a lot of the details of interacting with Zookeeper under methods. Hopefully the method names make it clear what I'm doing. There is enough specific documentation about the &lt;a href="http://hadoop.apache.org/zookeeper/docs/r3.2.1/api/index.html"&gt;ZooKeeper API,&lt;/a&gt; I'm more interested in showing the logic behind electing a leader. &lt;br /&gt;&lt;br /&gt;The code above shows that only one process will be the leader. The other processes will be watching the next lowest node, waiting for it to fail. 
This watching, of course, is done via the Watcher::process() method:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public void process(WatchedEvent event) {&lt;br /&gt;        &lt;br /&gt;        EventType type = event.getType();&lt;br /&gt;        String path = event.getPath();&lt;br /&gt;        &lt;br /&gt;        if((path != null &amp;amp;&amp;amp; path.equals(watchedNode)) &amp;amp;&amp;amp; (type != null &amp;amp;&amp;amp;&lt;br /&gt;           type.equals(EventType.NodeDeleted))) {&lt;br /&gt;            &lt;br /&gt;            try {&lt;br /&gt;                int watchSequence = this.sequenceNum(watchedNode);&lt;br /&gt;                if(this.doesLowerNodeExist(watchSequence) == false) {&lt;br /&gt;                    logger.debug("watcher of "+watchedNode+" is now primary");&lt;br /&gt;                    isPrimary = true;&lt;br /&gt;                    watchedNode = null;&lt;br /&gt;                    // now you are the leader! the previous leader has gone away.&lt;br /&gt;                    // note you are no longer watching anyone, so no need to&lt;br /&gt;                    // re-register the watch.&lt;br /&gt;                    loadSequenceMaintainers();&lt;br /&gt;                }&lt;br /&gt;                else {&lt;br /&gt;                    logger.debug("watcher of "+watchedNode+" is not yet primary");&lt;br /&gt;                    // there is lower node that is not the leader. &lt;br /&gt;                    // so watch it instead. 
&lt;br /&gt;                    watchedNode = getNextLowestNode(watchSequence);&lt;br /&gt;                    boolean success = addWatch(sessionZooKeeper, this, watchedNode);&lt;br /&gt;                    if(success == false) {&lt;br /&gt;                        throw new SequenceIDDoesntExist(watchedNode);&lt;br /&gt;                    }&lt;br /&gt;                }&lt;br /&gt;            } catch (Exception e) {&lt;br /&gt;                // fail fast for now&lt;br /&gt;                logger.error(e.getMessage());&lt;br /&gt;                throw new RuntimeException(e);&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;         ......&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note that if an intermediate process fails, the 'watching' process then gets the next available lowest node to watch and watches it. If all intermediate processes fail, the watching process becomes the leader.&lt;br /&gt;&lt;h3&gt;Dynamic Node Detection&lt;br /&gt;&lt;/h3&gt;&lt;div&gt;Once these long lived services are up and running, I don't want to have to restart them if I am adding another sequence node to be tracked. We do this all of the time, because we are running multiple instances of the data pipeline and tracking sequences across all of them. &lt;br /&gt;&lt;/div&gt;&lt;div&gt;This again can be handled by setting a watch, this time on the children of a top level node, and restricting sequence node creation to directly underneath that node. In other words, have a zookeeper node called /all-sequences and stick sequence-1...sequence-N underneath it. We set the watch on the node when we check to see if it has children: &lt;br /&gt;&lt;/div&gt;&lt;pre&gt;children = zooKeeper.getChildren(sequenceParentNode, true);&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;Passing true registers the default Watcher, the one that was passed in when the zooKeeper instance was created. In the process() handler, we detect whether children have been deleted or added. 
Unfortunately, we can only detect that the children underneath a node have changed, so it is up to us to determine which ones have been deleted and which ones have been added:&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;public void process(WatchedEvent event) {&lt;br /&gt;        &lt;br /&gt;        EventType type = event.getType();&lt;br /&gt;        String path = event.getPath();&lt;br /&gt;        ....&lt;br /&gt;        if(path != null &amp;amp;&amp;amp; path.startsWith(sequenceParentNode) &amp;amp;&amp;amp; (type != null)) {&lt;br /&gt;            // getting this notification implies that you are a primary b/c that &lt;br /&gt;            // is the only way to register for it. &lt;br /&gt;            try {&lt;br /&gt;                if(type.equals(EventType.NodeChildrenChanged)) {&lt;br /&gt;                    // figure out missing and new nodes (expensive, but this &lt;br /&gt;                    // only happens when a node is manually added)&lt;br /&gt;                    List&amp;lt;String&amp;gt; newSequences = new ArrayList&amp;lt;String&amp;gt;();&lt;br /&gt;                    List&amp;lt;String&amp;gt; missing = new ArrayList&amp;lt;String&amp;gt;();&lt;br /&gt;                    List&amp;lt;String&amp;gt; children = this.getChildren(sessionZooKeeper,&lt;br /&gt;                        sequenceParentNode,false);&lt;br /&gt;                    for(String child : children) {&lt;br /&gt;                        if(this.sequenceMaintainers.containsKey(child) == false) {&lt;br /&gt;                            newSequences.add(child);&lt;br /&gt;                        }&lt;br /&gt;                    }&lt;br /&gt;                    &lt;br /&gt;                    for(String currentSequence : sequenceMaintainers.keySet()) {&lt;br /&gt;                        if(children.contains(currentSequence) == false) {&lt;br /&gt;                            missing.add(currentSequence);&lt;br /&gt;                        }&lt;br /&gt;                    }&lt;br /&gt;                    &lt;br /&gt;
                    // add new sequences to watch list&lt;br /&gt;                    for(String child : newSequences) {&lt;br /&gt;                        String sequencePath = sequenceParentNode+"/"+child;&lt;br /&gt;                        sequenceMaintainers.put(child,&lt;br /&gt;                           new SequenceMaintainer(s3Accessor,&lt;br /&gt;                             hosts,sequencePath,true));&lt;br /&gt;                    }&lt;br /&gt;                    &lt;br /&gt;                    // stop tracking sequences whose nodes were deleted&lt;br /&gt;                    for(String child : missing) {&lt;br /&gt;                        sequenceMaintainers.remove(child);&lt;br /&gt;                    }&lt;br /&gt;                    &lt;br /&gt;                }&lt;br /&gt;                &lt;br /&gt;                // watches are one time triggers, so re-register on the parent node&lt;br /&gt;                boolean success = addWatch(sessionZooKeeper, this, &lt;br /&gt;                   sequenceParentNode);&lt;br /&gt;                if(success == false) {&lt;br /&gt;                    throw new SequenceIDDoesntExist(sequenceParentNode);&lt;br /&gt;                }&lt;br /&gt;            } catch (Exception e) {&lt;br /&gt;                // fail fast for now&lt;br /&gt;                e.printStackTrace();&lt;br /&gt;                throw new RuntimeException(e);&lt;br /&gt;            }&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;I'm pretty amazed at how much I was able to leverage two main points of functionality -- nodes and watches -- to build a persistent sequence watching service. 
Once again it seems that picking the right primitives is what makes ZooKeeper so practical for distributed synchronization.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7459548493941439915?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7459548493941439915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/10/using-zookeeper-to-implement-redundant.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7459548493941439915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7459548493941439915'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/10/using-zookeeper-to-implement-redundant.html' title='Using Zookeeper to Implement a Redundant Persistence Service'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4242039644368373322</id><published>2009-10-01T20:57:00.000-07:00</published><updated>2009-10-01T21:06:29.183-07:00</updated><title type='text'>Quick Webserver setup with Jersey and Jetty</title><content type='html'>&lt;h3&gt;No More Hand Rolling&lt;br /&gt;&lt;/h3&gt;In our data pipeline, we have different components that we communicate with via web services.&amp;nbsp; In the beginning, there were only three commands needed: pause, restart, and reload. 
So I wrote a quick Servlet, loaded up embedded Jetty, and called it good. The Servlet contained some method handling code to parse path strings:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;String path = request.getPathInfo();&lt;br /&gt;        &lt;br /&gt;        if(path.equals(ROOT)) {&lt;br /&gt;            response.setContentType("text/html");&lt;br /&gt;            response.setStatus(HttpServletResponse.SC_OK);&lt;br /&gt;            response.getWriter().println("name");&lt;br /&gt;            response.getWriter().println("getInfo()");&lt;br /&gt;        }&lt;br /&gt;        else if(path.equals(STATS)) {&lt;br /&gt;            response.setContentType("text/json");&lt;br /&gt;            response.setStatus(HttpServletResponse.SC_OK);&lt;br /&gt;            &lt;br /&gt;            JSONObject responseObject = new JSONObject();&lt;br /&gt;            responseObject.put("service", name);&lt;br /&gt;            responseObject.put("status",status.toString().toLowerCase());&lt;br /&gt;            responseObject.put("statistics", jobStatsAnalyzer.toJSON());&lt;br /&gt;            response.getWriter().println(responseObject);&lt;br /&gt;        }&lt;br /&gt;        else if(path.startsWith(SERVICE_PREFIX)) {&lt;br /&gt;            response.setContentType("text/html");&lt;br /&gt;            &lt;br /&gt;            responseCode =  &lt;br /&gt;              processServiceRequest(path.substring(SERVICE_PREFIX.length(),&lt;br /&gt;                path.length()),response);&lt;br /&gt;            response.setStatus(responseCode);&lt;br /&gt;            &lt;br /&gt;        }&lt;br /&gt;        else {&lt;br /&gt;           ......&lt;br /&gt;        }&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;And the processServiceRequest call is equally complex because it has to parse the next section of the path. 
Still, because there were only three methods and it took little time to code up, I felt fine about hand rolling the Servlet, even though there was a lot of boilerplate path parsing code.&lt;br /&gt;&lt;br /&gt;One of our components now needs (much) more path handling. It is a validation component -- it collects transformation exceptions thrown by the view processing components. Those exceptions are dumped to an SQS queue, picked up by the validator, and dumped into a Lucene index that allows us to query for exceptions by various exception metadata. The validator needs to expose a more complex REST interface&amp;nbsp; that allows a data curator to find exceptions (resources) by that metadata (sub resources), e.g. by exception type. They can then fix the root cause of the exceptions, and validate that the exceptions go away via a set of scripts that query the validator web service.  &lt;br /&gt;&lt;br /&gt;One option to extend the current web service functionality would be to subclass the custom Servlet, but that's a lot more boilerplate code and I know that we are probably going to need to extend another component in another way, which would mean more code. More code to debug, more code to maintain, more code to understand. &lt;br /&gt;&lt;br /&gt;&lt;a href="https://jersey.dev.java.net/"&gt;Jersey&lt;/a&gt;, the reference implementation of JAX-RS (JSR 311), allows you to compose restful services using annotations.&amp;nbsp; It is an alternative to hand rolling servlets that lets you declaratively bind HTTP methods (GET/PUT/POST/etc), paths, and handler functions. It handles serializing data from POJOs to XML/JSON and vice versa. I had been wanting to check it out for some time now, but simply didn't have a concrete need to do so.&lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Jersey And Jetty&lt;/h3&gt;&lt;div&gt;I decided to stick with Jetty as my servlet container because launching an embedded instance was so brain-dead simple. 
But I decided to use the Jersey servlet and see how hard it would be to re-implement my hand rolled servlet. Binding the Jersey servlet to Jetty uses Jetty's ServletHolder class to instantiate the Jersey servlet and initialize its annotation location as well as the location of the resources it is going to use to handle web requests. The code below shows how the Jetty ServletHolder is initialized with the Jersey ServletContainer (which actually implements the standard Servlet interface) and then bound to a context that allows the ServletContainer to handle all requests to the server. &lt;br /&gt;&lt;/div&gt;&lt;pre&gt;sh = new ServletHolder(ServletContainer.class);&lt;br /&gt;        &lt;br /&gt;sh.setInitParameter("com.sun.jersey.config.property.resourceConfigClass", RESOURCE_CONFIG);&lt;br /&gt;sh.setInitParameter("com.sun.jersey.config.property.packages", handlerPackageLocation);&lt;br /&gt;        &lt;br /&gt;server = new Server(port);&lt;br /&gt;        &lt;br /&gt;Context context = new Context(server, "/", Context.SESSIONS);&lt;br /&gt;context.addServlet(sh, "/*");&lt;br /&gt;server.start();&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;The com.sun.jersey.config.property.packages parameter points to the location of the Jersey annotated resources that the Jersey ServletContainer uses when handling requests. Those resources are simply POJOs (Plain Old Java Objects) marked up with Jersey annotations.&lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Jersey Resources&lt;/h3&gt;&lt;div&gt;In order to parse a specific path, you create an object and annotate its class with @Path. A method in the &lt;a href="http://en.wikipedia.org/wiki/Plain_Old_Java_Object"&gt;POJO&lt;/a&gt; that carries an HTTP method annotation but no @Path of its own is bound to that path by default. You can also parse subpaths by binding them to other methods via the @Path annotation. 
Here is an example: &lt;br /&gt;&lt;/div&gt;&lt;pre&gt;@Path("/")&lt;br /&gt;public class DefaultMasterDaemonService {&lt;br /&gt;&lt;br /&gt;        &lt;br /&gt;    private ServiceHandler serviceHandler;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;    // this one handles root requests&lt;br /&gt;    @GET&lt;br /&gt;    @Produces("text/plain")&lt;br /&gt;    public String getInformation() {&lt;br /&gt;        return serviceHandler.getInfo();&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    // this one handles /stats requests&lt;br /&gt;    @GET&lt;br /&gt;    @Path("/stats")&lt;br /&gt;    @Produces("application/json")&lt;br /&gt;    public DaemonStatus getStatus() {&lt;br /&gt;        return serviceHandler.getStatus();&lt;br /&gt;        &lt;br /&gt;    }&lt;br /&gt;    .....&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;Basic Annotations&lt;br /&gt;&lt;/h3&gt;&lt;div&gt;There are a couple of annotations above worth discussing in addition to the @Path annotation.&lt;br /&gt;The HTTP method that is bound to the POJO method is specified via the @GET, @POST, @PUT, @DELETE, and @HEAD annotations.&lt;br /&gt;The returned content Mime type is specified with the @Produces annotation. In the example above, a request to the root path returns some informational text, and a request to /stats returns JSON. &lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Returning JSON and XML&lt;br /&gt;&lt;/h3&gt;&lt;div&gt;In order to return JSON/XML, you need to leverage JAXB annotations to make your data objects serializable to JSON/XML. Note: remember to always include a default constructor on your data objects. Otherwise you get exceptions trying to serialize those objects.&lt;br /&gt;&lt;br /&gt;I also found that unless I did _not_ declare getters and setters, I would also get serialization errors. 
I had not seen this before, and therefore assume that it is something specific to Jersey Serialization.&lt;br /&gt;&lt;br /&gt;Here is an example of a JAXB annotated object that I use to return Status:&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;@XmlRootElement()&lt;br /&gt;public class DaemonStatus {&lt;br /&gt;    // apparently methods to access these are added at serialization time??&lt;br /&gt;    @XmlElement&lt;br /&gt;    public String serviceName;&lt;br /&gt;    @XmlElement&lt;br /&gt;    public String status;&lt;br /&gt;    @XmlElement&lt;br /&gt;    public JobStatsData jobStatsData;&lt;br /&gt;    &lt;br /&gt;    // need this one!&lt;br /&gt;    public DaemonStatus() {&lt;br /&gt;        &lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    public DaemonStatus(String serviceName,String status,JobStatsData jobStatsData) {&lt;br /&gt;        this.serviceName = serviceName;&lt;br /&gt;        this.status = status;&lt;br /&gt;        this.jobStatsData = jobStatsData;&lt;br /&gt;        &lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;So all I needed to do to get JSON/XML in my return type was to create a JAXB annotated object, and specify what I wanted the method to produce via the Jersey @Produces annotation. Less code = more fun!&lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Parameterizing Path Components&lt;br /&gt;&lt;/h3&gt;&lt;div&gt;Our components have Pause/Restart/Reload functionality accessible via the http://host/services/{pause|restart|reload} path, using POST. 
Jersey lets me parameterize the last part of the path, which makes the syntax of the command explicit and leaves me to code string matching only for the parameterized part:&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;pre&gt;@POST&lt;br /&gt;@Path("/service/{action}")&lt;br /&gt;public void doAction(@PathParam("action") String action) throws Exception {&lt;br /&gt;        &lt;br /&gt;  if(action.equals(MasterDaemon.PAUSE)) {&lt;br /&gt;    serviceHandler.pause();&lt;br /&gt;  }&lt;br /&gt;  else if(action.equals(MasterDaemon.RELOAD)) {&lt;br /&gt;    serviceHandler.reload();&lt;br /&gt;  }&lt;br /&gt;  else if(action.equals(MasterDaemon.RESUME)) {&lt;br /&gt;    serviceHandler.resume();&lt;br /&gt;  }&lt;br /&gt;  else {&lt;br /&gt;    throw new Exception("No such action supported: "+action);&lt;br /&gt;  }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;I've delegated the real meat of the action to a serviceHandler component, but this kind of path handling is about as easy as it gets. Note that the action parameter is specified via the @PathParam annotation directly in the method argument list. &lt;br /&gt;&lt;/div&gt;&lt;h3&gt;Conclusion&lt;br /&gt;&lt;/h3&gt;&lt;div&gt;I only really scratched the surface of what Jersey can do. In my case I don't have to parse query parameters, but that is easily done by adding a @QueryParam argument to the handler method in the same way I specified the @PathParam. Query params are not limited to strings, either: JAX-RS will also convert them to primitives, or to types that have a single-String constructor or a static valueOf(String) method.&lt;br /&gt;&lt;br /&gt;I really liked how quickly I was able to toss out my hand coded servlet and trade up to the Jersey one. Other people on the team were able to wire up rich REST interfaces on several components almost immediately, which let all of us go back to focusing on real requirements. 
&lt;br /&gt;&lt;br /&gt;I usually 'cast a jaundiced eye' towards anything that even has a hint of framework in it, but Jersey was super fast to learn and using it instead of hand coded servlets has already saved us a lot of time and finger strain.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4242039644368373322?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4242039644368373322/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/10/quick-webserver-setup-with-jersey-and.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4242039644368373322'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4242039644368373322'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/10/quick-webserver-setup-with-jersey-and.html' title='Quick Webserver setup with Jersey and Jetty'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-9160343658222868891</id><published>2009-09-10T22:10:00.000-07:00</published><updated>2009-09-11T15:22:27.913-07:00</updated><title type='text'>Using ExecutorCompletionService to synchronize multithreaded workflow</title><content type='html'>Today I ran into a problem where I needed to make sure that one multithreaded phase of 
processing had completely ended before starting another. Specifically, I was retrieving documents from S3 to load into a Lucene index, and wanted to retry all document requests that had failed due to S3 flakiness or connectivity issues, i.e. standard distributed computing error conditions.&lt;br /&gt;&lt;br /&gt;In other situations requiring synchronization between threads, I've used a &lt;a href="http://arunxjacob.blogspot.com/2009/08/using-javautilconcurrentcountdownlatch.html"&gt;CountDownLatch&lt;/a&gt;. This works really well when you know the exact number of threads that you need to synchronize. You initialize the latch with the number of threads that you are synchronizing. When they finish work they decrement the latch, and when the latch count reaches 0 you continue processing.&lt;br /&gt;&lt;br /&gt;This time was different because instead of synchronizing threads, I was trying to halt processing until all asynchronously submitted tasks had completed. I was queuing up several hundred thousand tasks into a thread pool, and did not know when that thread pool would be finished with the work, or how many threads would be running when the entire job completed, or even exactly how many tasks I had to run -- that number depended on the number of documents being fetched, which is always growing.&lt;br /&gt;&lt;br /&gt;Fortunately my situation was not a unique one. 
&amp;nbsp;I figured that the first place to look was the &lt;a href="http://jcp.org/en/jsr/detail?id=166"&gt;java concurrency library&lt;/a&gt;, and when I did some research, I found that&amp;nbsp;&lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ExecutorCompletionService.html"&gt;ExecutorCompletionService&lt;/a&gt; was exactly what I needed.&lt;br /&gt;&lt;br /&gt;ExecutorCompletionService works with a supplied &lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Executor.html"&gt;Executor&lt;/a&gt; using &lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Runnable.html"&gt;Runnable&lt;/a&gt;/&lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Callable.html"&gt;Callable&lt;/a&gt; tasks. I decided to use Callable tasks, as they allowed me to return and inspect a value, and throw exceptions. &amp;nbsp;As those tasks complete, they are placed onto a queue that can be accessed via the poll() or take() methods. This approach simplified my life:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;It allowed me to use tasks and task status to control logic flow instead of threads and thread status.&amp;nbsp;&lt;/li&gt;&lt;li&gt;It freed me up from having to know how many threads or tasks I was working with.&lt;/li&gt;&lt;li&gt;It provided me with an access point to each task that I could use to analyze the results.&lt;/li&gt;&lt;li&gt;When the ExecutorCompletionService queue was empty, I knew it was time to then retry all failed results.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;Point (1) above is the foundation for the points that follow. When I used Future and Callable to implement the work I needed to do, I was able to return the results I cared about and process them. 
Specifically, it allowed the code that ran the original key fetch loop to not worry about tracking and storing exceptions, which made for much simpler looping logic.&lt;br /&gt;&lt;br /&gt;ExecutorCompletionService is a great example of how picking the right primitives makes it easy to compose highly functional helper objects. In this case the primitives involved were a (thread pool) executor and a (linked blocking) queue. &lt;i&gt;(side note: I don't mean to sound like such a concurrent lib fanboy, but I'm really happy I didn't have to write this code myself, and really happy that the choice of primitives that the authors used enabled creation of classes like the ExecutorCompletionService. This stuff used to take significant effort on my part, and significant bugs were usually introduced :)&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;Here is an example of the ExecutorCompletionService in action, with the relevant bits &lt;b&gt;&lt;i&gt;bolded and italicized&lt;/i&gt;&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public void buildEntireIndex() throws Exception {&lt;br /&gt;&lt;br /&gt;        boolean moreKeys = true;&lt;br /&gt;        int submittedJobCt = 0;&lt;br /&gt;        // tracking all failed keys for later retry&lt;br /&gt;        Map&amp;lt;String,Throwable&amp;gt; failedKeys = new HashMap&amp;lt;String,Throwable&amp;gt;();&lt;br /&gt;        &lt;br /&gt;        // ErrorCapturingExecutor is subclassed from ThreadPoolExecutor. 
&lt;br /&gt;        // I override ThreadPoolExecutor.afterExecute() to queue and later analyze exceptions.&lt;br /&gt;        LinkedBlockingQueue&amp;lt;Runnable&amp;gt; linkedBlockingQueue = new LinkedBlockingQueue&amp;lt;Runnable&amp;gt;();&lt;br /&gt;        ErrorCapturingExecutor executor = new ErrorCapturingExecutor(threadCount, &lt;br /&gt;            threadCount, 0L, TimeUnit.SECONDS, linkedBlockingQueue);&lt;br /&gt;        &lt;br /&gt;        // CompletionService works with the supplied executor and queues up tasks as they finish.&lt;br /&gt;        &lt;b&gt;&lt;i&gt;executorCompletionService = new ExecutorCompletionService&amp;lt;BuildResults&amp;gt;(executor);&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;        &lt;br /&gt;        &lt;br /&gt;        String lastKey = null;&lt;br /&gt;        &lt;br /&gt;        while(moreKeys) {    &lt;br /&gt;            Set&amp;lt;String&amp;gt; keys =  viewKeyRetriever.retrieveKeys(lastKey);&lt;br /&gt;            &lt;br /&gt;            if(keys.size() &amp;gt; 0) {&lt;br /&gt;                String array[] = new String[keys.size()];&lt;br /&gt;                keys.toArray(array); &lt;br /&gt;                lastKey = array[keys.size()-1];&lt;br /&gt;                &lt;br /&gt;                // I need to keep the number of waiting tasks bounded, so wait while the queue is over its limit. 
&lt;br /&gt;                while(linkedBlockingQueue.size() &amp;gt; MAXQUEUESIZE) {&lt;br /&gt;                    Thread.sleep(WAITTIME);&lt;br /&gt;                }&lt;br /&gt;                &lt;br /&gt;                // this is where I actually submit tasks&lt;br /&gt;                &lt;b&gt;&lt;i&gt;processKeys(keys);&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;                &lt;br /&gt;                // I only know how many jobs I need to wait for when I've retrieved all keys.&lt;br /&gt;                submittedJobCt++; &lt;br /&gt;                &lt;br /&gt;            }&lt;br /&gt;            else {&lt;br /&gt;                moreKeys = false;&lt;br /&gt;            }&lt;br /&gt;            &lt;br /&gt;        }&lt;br /&gt;&lt;br /&gt;        &lt;br /&gt;        // I use the ExecutorCompletionService queue to check on jobs as they complete. &lt;br /&gt;        for(int i = 0; i &amp;lt; submittedJobCt;i++) {&lt;br /&gt;            &lt;br /&gt;            &lt;b&gt;&lt;i&gt;Future&amp;lt;BuildResults&amp;gt; finishedJob = executorCompletionService.take();&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;            &lt;br /&gt;            // at this point, all I really care about is failures that I need to retry.&lt;br /&gt;            BuildResults results = finishedJob.get();&lt;br /&gt;            &lt;br /&gt;            if(results.hasFailures()) {&lt;br /&gt;                failedKeys.putAll(results.getFailedKeys());&lt;br /&gt;            }&lt;br /&gt;            &lt;br /&gt;            indexBuilderMonitor.update(results);&lt;br /&gt;        }&lt;br /&gt;        &lt;br /&gt;        // I can use failedKeys to retry processing on keys that have failed. 
&lt;br /&gt;        // Logic omitted for clarity.&lt;br /&gt;        &lt;br /&gt;        ...&lt;br /&gt;        executor.shutdown();&lt;br /&gt;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The processKeys method is shown below. I broke it out because I needed to call it again when re-processing failed keys.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;private void processKeys(Set&amp;lt;String&amp;gt; keys) {&lt;br /&gt;        // the builder builds the index from the content retrieved using the passed in keys.&lt;br /&gt;        final IndexBuilder builder = indexBuilderFactory.createBuilder(keys);&lt;br /&gt;        &lt;br /&gt;            &lt;br /&gt;          &lt;b&gt;&lt;i&gt;executorCompletionService.submit(new Callable&amp;lt;BuildResults&amp;gt;() {&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;          @Override  &lt;br /&gt;          public BuildResults call() throws Exception {&lt;br /&gt;              BuildResults buildResults = null; &lt;br /&gt;              try {&lt;br /&gt;                  // buildResults contains all information I need to post process the task.&lt;br /&gt;                  buildResults = builder.build();&lt;br /&gt;              } catch (Exception e) {&lt;br /&gt;                  throw e; // caught by ErrorCapturingExecutor&lt;br /&gt;              }&lt;br /&gt;            &lt;br /&gt;              return buildResults;&lt;br /&gt;          }&lt;br /&gt;&lt;br /&gt;        &lt;br /&gt;        });&lt;br /&gt;        &lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-9160343658222868891?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://arunxjacob.blogspot.com/feeds/9160343658222868891/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/using-executorcompletionservice-to.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/9160343658222868891'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/9160343658222868891'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/using-executorcompletionservice-to.html' title='Using ExecutorCompletionService to synchronize multithreaded workflow'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7157208477907698855</id><published>2009-09-09T11:54:00.000-07:00</published><updated>2009-09-09T20:14:29.845-07:00</updated><title type='text'>Using ThreadLocal to pass in dummy components</title><content type='html'>Synchronization of data access across multiple threads is tricky. &amp;nbsp;While Java's &lt;a href="http://www.ibm.com/developerworks/java/library/j-rtj3/index.html"&gt;threading primitives&lt;/a&gt; are fairly easy to understand and use, there can be unintended performance consequences of making an object thread safe. Depending on how objects synchronize other objects, you can also end up with deadlocks that are no fun to debug. 
For example, if one thread holds the lock on object1 and tries to lock object2 while another thread holds object2 and tries to lock object1, you're in for a long night.&lt;br /&gt;&lt;br /&gt;In general, anything that reduces synchronization of data across threads reduces the potential for unintended consequences. An alternative to making an object thread-safe is to make it thread-local. Thread-local objects provide a separate copy for each thread. Each thread can only see its local instance of that object, and is free to modify it at will, without needing to synchronize.&lt;br /&gt;&lt;br /&gt;Thread-local variables used to have significant performance issues, and there have been &lt;a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5025230"&gt;bugs&lt;/a&gt; in versions prior to 1.6. Also, it is easy to run out of memory with large numbers of threads using large thread-local objects. But assuming you go in with your eyes open, reducing synchronization across the threads in your application is good for performance and can significantly reduce complexity.&lt;br /&gt;&lt;br /&gt;Another benefit of thread-local variables (as if simplification and performance gains aren't enough) is that they make it easy to swap in stub components at unit test time. Why would you do this instead of passing in the component? I ended up using thread-local variables for my components when I had to instantiate an object via the Class.forName() method, and didn't know/want to know about how to wire up dependent components. 
It's a trick I want to remember (so I'm writing it down :)&lt;br /&gt;&lt;br /&gt;I implement the component as a thread-local variable via an anonymous inner class:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;ThreadLocal&amp;lt;SequenceClient&amp;gt; sequenceClientLocal = new ThreadLocal&amp;lt;SequenceClient&amp;gt;() {&lt;br /&gt;        @Override&lt;br /&gt;        protected SequenceClient initialValue() {&lt;br /&gt;            SequenceClient sequenceClient =  null;&lt;br /&gt;            try {&lt;br /&gt;                sequenceClient = SequenceClientImpl.getInstance(hosts,hexIdNode);&lt;br /&gt;            } catch (Exception e) {&lt;br /&gt;                // creation failed: this thread sees a null client&lt;br /&gt;                sequenceClient = null;&lt;br /&gt;            }&lt;br /&gt;            return sequenceClient;&lt;br /&gt;        }&lt;br /&gt;    };&lt;br /&gt;    &lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In order to swap this default value out for a stub, I add a setter to override it:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public void setSequenceClientLocal(ThreadLocal&amp;lt;SequenceClient&amp;gt; sequenceClientLocal) {&lt;br /&gt;        this.sequenceClientLocal = sequenceClientLocal;&lt;br /&gt;    }&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;At unit test time, I can stub in a dummy class by calling the setter: &lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class TestCollapserAgent {&lt;br /&gt;&lt;br /&gt;  @Before&lt;br /&gt;  public void setUp() throws Exception {&lt;br /&gt;&lt;br /&gt;    sequenceClient = new DummySequenceClientImpl(1);&lt;br /&gt;&lt;br /&gt;    collapserAgent.&lt;b&gt;setSequenceClientLocal&lt;/b&gt;(new ThreadLocal&amp;lt;SequenceClient&amp;gt;() {&lt;br /&gt;            @Override&lt;br /&gt;            protected SequenceClient initialValue() {&lt;br /&gt;                // every thread gets the same dummy instance, which is fine for a test&lt;br /&gt;                return sequenceClient;&lt;br /&gt;            }&lt;br /&gt;        });&lt;br /&gt;    ....&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;  // unit tests follow....&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7157208477907698855?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7157208477907698855/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/using-threadlocal-to-pass-in-dummy.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7157208477907698855'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7157208477907698855'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/using-threadlocal-to-pass-in-dummy.html' title='Using ThreadLocal to pass in dummy components'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6568283180609662002</id><published>2009-09-03T22:02:00.000-07:00</published><updated>2009-09-03T22:04:02.145-07:00</updated><title type='text'>The Ratio of Ceremony vs Essence, aka Framework Debt</title><content type='html'>This is going to be the kind of post I had sworn off of: a lot of opinion, with some rambling mixed in. 
I apologize in advance for that, but occasionally I need to vent about things that deeply disturb me, and venting tends to be nonlinear.&lt;br /&gt;&lt;br /&gt;I just spent the last week trying to work with a legacy system component that was implemented using the &lt;a href="http://www.springsource.org/"&gt;Spring&lt;/a&gt; framework. This component read data from a database into a &lt;a href="http://lucene.apache.org/java/docs/"&gt;Lucene&lt;/a&gt; index wrapped by &lt;a href="http://www.compass-project.org/docs/2.2.0/reference/html/introduction.html#overview"&gt;Compass&lt;/a&gt;. At the time of implementation, the lead engineer was using &lt;a href="http://java.sun.com/developer/technicalArticles/J2EE/jpa/"&gt;JPA&lt;/a&gt;, to load database records into POJOs, which he then annotated so that they could be serialized via &lt;a href="http://java.sun.com/developer/technicalArticles/WebServices/jaxb/"&gt;JAXB&lt;/a&gt;, which enabled Compass to read them in as Lucene Documents. Whew!&lt;br /&gt;&lt;br /&gt;Because time was limited and the code was already in production, I decided to ignore my fundamental misgivings about frameworks and Java Acronyms, and make the minimal modifications to the existing source that would get it to take input from S3 instead of a database. &lt;br /&gt;&lt;br /&gt;After a day of struggle, I had figured out what was going on, and was astounded by the amount of code required just to set up the relatively simple business logic. When I hit a 'schema not found' error trying to load the application.xml, I gave up, ripped out the business logic, and re-implemented the entire thing in a matter of hours.&amp;nbsp;With a lot less code. I know that the original implementation of the Spring based code took a week or so to write.&lt;br /&gt;&lt;br /&gt;The massive increase in efficiency is not because I'm a brilliant coder. I wish I was, but I've worked with brilliant coders and I'm not one of them. 
It's because the actual business logic was pretty minimal. The logic required to implement and maintain the Spring application required a lot of code that could only be described as Ceremonial, as opposed to Essential business logic. I first read about &lt;a href="http://blog.thinkrelevance.com/2008/4/1/ending-legacy-code-in-our-lifetime"&gt;Ceremonial vs Essential code&lt;/a&gt; here, the night after I had exorcised Spring from the logic. The timing couldn't have been more appropriate. &lt;br /&gt;&lt;br /&gt;What is Ceremonial code? It is code that has nothing to do with implementing a business requirement. In Spring, I define Ceremonial code as:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Configuration as code&lt;/li&gt;&lt;li&gt;Dependency Injection&lt;/li&gt;&lt;li&gt;(Pedantic use of) Interfaces&lt;/li&gt;&lt;/ol&gt;The three examples above are not terribly bad, in fact they come from decent intentions ("the road to hell..."). But put together they have an exponentially bad effect. They are, when added to a developer's blind belief in the goodness of all things Frameworky, the Four Horsemen of the (Framework) Apocalypse.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Configuration As Code&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;Separating configuration into a data file is inherently a good idea. You don't want to hardcode variables that you would then have to rebuild the application to change. I'm not sure how this basically sound idea warped into "hey, let's put EVERYTHING into configuration", but the biggest problem with this approach is that now part of the logic is in code, the other part is in a massive XML file. You need both to understand the control flow of the application, so &amp;nbsp;you spend a lot of time toggling back and forth, saying "What class is being called? Oh, let me check in xml configuration. Oh, that's the class. Great. What was I doing?" 
Maybe some people see this kind of rapid mental stack management as interesting and novel brain training. I see it as wasting time, time that I could be spending either coding or testing a feature that someone is paying me to implement.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Dependency Injection&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;This, too, starts off as a great idea. One of the big issues people (OK, I'm talking about myself, but using the word 'people' to get some more legitimacy) had with EJB 2.0 code was that it was really hard to test. You had to have the whole stack up and running, and validating the integrity of different components in the stack was so hard we just didn't do it.&lt;br /&gt;&lt;br /&gt;Dependency Injection/Inversion of Control allows you to parameterize the components of an object, making that object really easy to test. Just expose components with getters and setters, and you can dummy them up and test that object in true isolation! Again, there is still nothing really flawed at this point. &lt;br /&gt;&lt;br /&gt;The major flaw in Dependency Injection comes at implementation. Objects need all of their components in a known, initialized state in order to function effectively. Dependency Injection as implemented in Spring is usually done in the configuration file. Objects that are created in the configuration file have all of their components set in their configuration.&lt;br /&gt;&lt;br /&gt;It is very easy to miss setting a component in the configuration file. This means that the object will initialize in a bad state that only becomes apparent when you try to use it. People use constructors because they can specify components as parameters to the constructor, which is an explicit way of saying "This object needs components X, Y, and Z to run".&lt;br /&gt;&lt;br /&gt;Using a constructor provides a foolproof way to successfully initialize an object without having to test for initialization success. 
If the constructor returns, you're good. If not, you know that the object is not usable. &lt;br /&gt;&lt;br /&gt;To be configurable this way in Spring, objects must (a) have a default (no-argument) public constructor and (b) expose all of their required components via setters. There is no way to enforce that setup was done correctly, so the developer has to spend time looking at the getters and setters of the object to determine what components they need to supply at configuration time. When I compare that effort to the effort of looking at the constructor parameters, it feels very inefficient.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Pedantic Use of Interfaces&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;The goal of the Java &lt;a href="http://java.sun.com/docs/books/tutorial/java/concepts/interface.html"&gt;Interface&lt;/a&gt; is to (a) separate functionality from initialization, and (b) provide a contract that a caller and callee can communicate across. This makes sense in the following two cases:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;You have a complex object and you only want to expose a part of it to a caller. For example, you have a parent class and you want to expose a callback interface to the child class.&lt;/li&gt;&lt;li&gt;You have multiple implementations of the same functionality and you don't want the caller to care about which object they are calling.&amp;nbsp;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;What I see all over Java-Land, and especially in Spring, is interfaces being used because someone got pedantic about separating functionality from initialization. I fail to see the use of an interface that only abstracts all of the methods of a single class. You're writing two structures when one could do the job just fine. Actually, you end up writing three structures: the interface, the implementation, and a factory object, which is more ceremonial code. 
Even if you need the interface, you could still have the implementation object return an instance of itself, typed as the interface, via a static factory method:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;public class S3AccessorImpl implements S3Accessor {&lt;br /&gt;&lt;br /&gt;    private static final int DEFAULT_SET_SIZE = 1000;&lt;br /&gt;    private S3Service service;&lt;br /&gt;    private Logger logger;&lt;br /&gt;&lt;br /&gt;    // callers only ever see the S3Accessor interface&lt;br /&gt;    public static S3Accessor getInstance(AWSCredentials creds) throws S3ServiceException {&lt;br /&gt;        return new S3AccessorImpl(creds);&lt;br /&gt;    }&lt;br /&gt;    &lt;br /&gt;    &lt;br /&gt;    // not public, so creation goes through getInstance()&lt;br /&gt;    protected S3AccessorImpl(AWSCredentials creds) throws S3ServiceException {&lt;br /&gt;        logger = Logger.getLogger(this.getClass());&lt;br /&gt;        service = new RestS3Service(creds);&lt;br /&gt;&lt;br /&gt;    }&lt;br /&gt;    ...&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;In spite of my comments above, I am a fan of using interfaces as the boundaries between components because they make unit testing easier. But I'm not entirely sold on abstracting the creation of an object to a Factory that returns the interface that object implements -- not when the above method (a) hides creation from the caller and (b) doesn't require an extra class with a single 'createFoo' method. &lt;br /&gt;&lt;br /&gt;Also, I don't understand always writing interfaces first, then implementation classes second. I tend to implement classes until I have a real need for an interface, e.g. during unit testing when I am going to substitute a 'dummy' component for a real one.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;My recent experience with Spring has reminded me of the existence of 'Framework Debt'. 
Framework Debt is the Technical Debt incurred by implementing a solution with a given Framework. In short, it is determined by the ratio of time spent writing and maintaining ceremonial code vs the amount of time spent writing and maintaining essential business code. The problem with most frameworks, Spring included, is that they do not distinguish between ceremonial and essential code, because to them, it's _all_ essential code. To work in that particular framework, ceremonial code really is essential. But having to maintain and understand a bunch of logic that has nothing to do with application functionality seems inherently wrong to me.&lt;br /&gt;&lt;br /&gt;I actually do like some frameworks I've run into. &lt;a href="http://rubyonrails.org/"&gt;Rails&lt;/a&gt; is great because of its 'convention over configuration', but that is another kind of technical debt. Fortunately it is pretty low in Rails, and as a result applications can be rapidly developed in Rails without losing maintainability. But even Rails feels too heavy for me at times. I do write apps that don't need the overhead of MVC. 
For these apps, &lt;a href="http://www.sinatrarb.com/"&gt;Sinatra&lt;/a&gt; allows me to quickly get path routing out of the way and concentrate on the underlying code.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6568283180609662002?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6568283180609662002/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/ratio-of-ceremony-vs-essence-aka.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6568283180609662002'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6568283180609662002'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/09/ratio-of-ceremony-vs-essence-aka.html' title='The Ratio of Ceremony vs Essence, aka Framework Debt'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4818163001605632193</id><published>2009-08-27T21:08:00.000-07:00</published><updated>2009-08-27T22:50:44.366-07:00</updated><title type='text'>Zookeeper and Concurrency</title><content type='html'>Recently I ran into a problem that seemed to require more of a solution than I was willing to implement. 
We are currently migrating an application from a single-worker, single-database setup to multiple workers running in parallel, using S3 as the primary means of storage. &lt;i&gt;(Side note: this migration is only possible because the application doesn't actually require any of the unplanned, interactive queries that only a database is good at. The choice of a database as the persistence mechanism for this application was not a popular one, and only grew less popular as more time was spent putting out database-related fires than implementing new features)&lt;/i&gt;. One of the legacy requirements of the system, which supports an in-production website, was that the IDs for all new items had to be distinct integers. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Without this legacy requirement I would have put in some kind of GUID scheme and called it an (early) day. The existence of previous items with IDs that other services relied upon made a GUID generation scheme impossible. However, the requirement of distinct integers requires coordination between the agents, which need to make sure they are not creating the same ID for different objects. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My initial thought was to implement a simple service that provided an integer that it would auto-increment with every request. The big problem with this approach is that the service would be a massive, non-redundant bottleneck unless it too was replicated, and then it would be faced with the same problem that the original workers faced with respect to keeping integers synchronized across different processes. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So I had a bad feeling about starting down that path, and was putting it off, when a colleague at work suggested that I check out &lt;a href="http://hadoop.apache.org/zookeeper/"&gt;Zookeeper&lt;/a&gt;. 
Zookeeper was created by Yahoo! Research specifically to solve the kind of synchronization problems that I was having, in a highly performant, fault-tolerant way. In other words, this was the service that I was trying not to write :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Zookeeper at 10000 feet consists of multiple servers that maintain a &lt;a href="http://hadoop.apache.org/zookeeper/docs/current/zookeeperOver.html#sc_dataModelNameSpace"&gt;hierarchical namespace&lt;/a&gt; consisting of nodes that can have child nodes. Each node can have associated data, limited to under 1MB, meant to be used for coordination/synchronization. &lt;div&gt;&lt;br /&gt;&lt;div&gt;Zookeeper, in the words of its creators, "needed to be general enough to address our coordination needs and simple enough to implement a correct high performance service. We found that we were able to achieve [this] by trading strong synchronization for strong ordering guarantees and a wait-free interface."&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What that means is that node access is 'no wait': when you ask for data you get it, but you do not have an exclusive lock on the data. This is quite different from the mutex-based locking model that I'm used to, and at first I didn't see how I could use this to guarantee unique IDs to multiple agents creating multiple items without getting an exclusive lock on the data and making a modification. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What I didn't get (until another colleague walked me through some of his code) is that any and all changes to the data are accepted and versioned. When I request data, I also get back a Stat object carrying the version of that data. When I submit data, I can specify that the submit will only succeed if the data (and therefore its version) hasn't been updated from the version that I have. 
So if I get node data that is an integer ID, increment it, and try to write it back, two things can happen (excluding connection loss, which must also be dealt with):&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;I can return successfully, meaning that my change was accepted because no one else had made changes since I had retrieved the data.&lt;/li&gt;&lt;li&gt;I can get a Bad Version exception, which means I need to get the data again, and try to re-increment the new value.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt; The code below shows the method that requests, receives, and attempts to increment the data:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;public String getNextId() throws Exception {&lt;br /&gt;     ZooKeeper zk = null;&lt;br /&gt;     String hexId = null;&lt;br /&gt;  &lt;br /&gt;     boolean keepGoing = true;&lt;br /&gt;  &lt;br /&gt;     while(keepGoing) {&lt;br /&gt;         try {&lt;br /&gt;             // getData fills nodeStat with the version that was read&lt;br /&gt;             Stat nodeStat = new Stat();&lt;br /&gt;             Stat setDataStat  = null;&lt;br /&gt;             zk = getZooKeeper();&lt;br /&gt;             byte data[] = getDataWithRetries(zk,&lt;br /&gt;                  sequenceName, &lt;br /&gt;                  nodeStat);&lt;br /&gt;             ByteBuffer buf = ByteBuffer.wrap(data);&lt;br /&gt;          &lt;br /&gt;             long value = buf.getLong();&lt;br /&gt;          &lt;br /&gt;             value++;&lt;br /&gt;          &lt;br /&gt;             buf.rewind();&lt;br /&gt;          &lt;br /&gt;             buf.putLong(value);&lt;br /&gt;          &lt;br /&gt;          &lt;br /&gt;             try {&lt;br /&gt;                 setDataStat = setDataWithRetries(&lt;br /&gt;                        zk,sequenceName,&lt;br /&gt;                        buf.array(),nodeStat);&lt;br /&gt;                 hexId = Long.toHexString(value);&lt;br /&gt;                 break;&lt;br /&gt;             }&lt;br /&gt;             catch(KeeperException e) {&lt;br /&gt;                 // BADVERSION: another agent updated the node first, so loop and re-read.&lt;br /&gt;                 // Anything else is a real failure and goes to the caller.&lt;br /&gt;                 if(!e.code().equals(Code.BADVERSION)) {&lt;br /&gt;                     throw e;&lt;br /&gt;                 }&lt;br /&gt;             }&lt;br /&gt;         } finally {&lt;br /&gt;             // always need to close out the session!&lt;br /&gt;             if(zk != null) {&lt;br /&gt;                 zk.close();&lt;br /&gt;             }&lt;br /&gt;         }&lt;br /&gt;     }&lt;br /&gt;  &lt;br /&gt;     return hexId;&lt;br /&gt; }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div&gt;I've wrapped calls to Zookeeper with a getZooKeeper() method, and pass the retrieved ZooKeeper instance into two methods: getDataWithRetries() and setDataWithRetries(). Both methods try to recover from connection losses as best they can.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;The getDataWithRetries() method takes a ZooKeeper instance, the path to the node being accessed, and a Stat structure that will contain retrieved data version information. It returns the retrieved data in a byte array. Note how in this method I'm only going to recover from connection losses, because this is a read operation. 
&lt;/div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;   protected byte[] getDataWithRetries(&lt;br /&gt;          ZooKeeper zooKeeper,&lt;br /&gt;          String path,&lt;br /&gt;          Stat nodeStat) throws Exception{&lt;br /&gt;    &lt;br /&gt;      int i = 0;&lt;br /&gt;    &lt;br /&gt;      while(i &lt; RETRY_COUNT)   {&lt;br /&gt;          try {&lt;br /&gt;                i++;&lt;br /&gt;                // success: return right away, even on the last attempt&lt;br /&gt;                return zooKeeper.getData(path, &lt;br /&gt;                      false, &lt;br /&gt;                      nodeStat);&lt;br /&gt;           } &lt;br /&gt;           catch(KeeperException e) {&lt;br /&gt;                 // only a lost connection is worth retrying&lt;br /&gt;                 if(!e.code().equals(Code.CONNECTIONLOSS)) &lt;br /&gt;                 {&lt;br /&gt;                     throw e;&lt;br /&gt;                 }&lt;br /&gt;             }&lt;br /&gt;         }&lt;br /&gt;    &lt;br /&gt;      // only reached when every attempt lost its connection&lt;br /&gt;      throw new KeeperException.ConnectionLossException();&lt;br /&gt;  }&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once I have the data, I increment it. Note that Zookeeper data is always kept as a byte array, so I convert it in order to increment it:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;ByteBuffer buf = ByteBuffer.wrap(data);&lt;br /&gt;            &lt;br /&gt;long value = buf.getLong();&lt;br /&gt;            &lt;br /&gt;value++;&lt;br /&gt;            &lt;br /&gt;buf.rewind();&lt;br /&gt;            &lt;br /&gt;buf.putLong(value);&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;and then try to submit it back to Zookeeper. 
This is where things get interesting. If someone has modified the data before I could get back to it, I need to get the new value of the data and try again. In the setDataWithRetries() method below, I only handle connection exceptions, and blow out if there is a BADVERSION exception:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&lt;br /&gt;protected Stat setDataWithRetries(&lt;br /&gt;            ZooKeeper zooKeeper,&lt;br /&gt;            String path, &lt;br /&gt;            byte data[],&lt;br /&gt;            Stat stat) throws Exception{&lt;br /&gt;    &lt;br /&gt;      int i = 0;&lt;br /&gt;      Stat statFromSet = null;&lt;br /&gt;    &lt;br /&gt;      while(i &lt; RETRY_COUNT) {&lt;br /&gt;            try {&lt;br /&gt;                 i++;&lt;br /&gt;                 // conditional write: rejected with BADVERSION if the node&lt;br /&gt;                 // has changed since stat was read&lt;br /&gt;                 statFromSet = zooKeeper.setData(path, &lt;br /&gt;                      data,&lt;br /&gt;                      stat.getVersion());&lt;br /&gt;                 break;&lt;br /&gt;             }&lt;br /&gt;             catch(KeeperException e) {&lt;br /&gt;                 if(e.code().equals(Code.CONNECTIONLOSS)) &lt;br /&gt;                 {&lt;br /&gt;                     continue;&lt;br /&gt;                 }&lt;br /&gt;                 else if(e.code().equals(Code.BADVERSION)) &lt;br /&gt;                 {&lt;br /&gt;                     // differentiating for debug purposes&lt;br /&gt;                     throw e;&lt;br /&gt;                 }&lt;br /&gt;                 else {&lt;br /&gt;                     throw e;&lt;br /&gt;                 }&lt;br /&gt;             }&lt;br /&gt;         }&lt;br /&gt;&lt;br /&gt;      // statFromSet is still null only if every attempt lost its connection&lt;br /&gt;      if(statFromSet == null) {&lt;br /&gt;          throw new KeeperException.ConnectionLossException();&lt;br /&gt;      }&lt;br /&gt;    &lt;br /&gt;      return statFromSet;&lt;br /&gt;    &lt;br /&gt;  }&lt;br /&gt;&lt;/pre&gt;
&lt;br /&gt;The calling code of setDataWithRetries() handles the BADVERSION exception by getting the data again, and retrying the submit:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt; try {&lt;br /&gt;    setDataStat = setDataWithRetries(zk,&lt;br /&gt;        sequenceName,&lt;br /&gt;        buf.array(),&lt;br /&gt;        nodeStat);&lt;br /&gt;    hexId = Long.toHexString(value);&lt;br /&gt;    break;&lt;br /&gt; }&lt;br /&gt; catch(KeeperException e) {&lt;br /&gt;     if(!e.code().equals(Code.BADVERSION)) &lt;br /&gt;     {&lt;br /&gt;           throw e;&lt;br /&gt;     }&lt;br /&gt; }              &lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;So each agent tries to get an ID until it succeeds, at which point it knows it has a unique one. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The thing I really like about the strong versioning and ordering approach, now that I understand it, is that it acknowledges concurrency and makes it easy to deal with. Locking, on the other hand, seems like an attempt to abstract away the concurrency by enforcing serialization, which works OK when you are managing machine-local or process-local resources, but can have huge performance impacts when you are trying to synchronize access across multiple machines. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The next thing I'm considering using Zookeeper for is configuration changes. Right now I push configuration changes out to my worker nodes by hand and force a restart via their web service interface. I would like them to be able to reload themselves automatically when state changes. 
This is a step up from the simple code detailed in this post: it means I need to use Zookeeper's notification capabilities to alert listening processes when the configuration changes. &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4818163001605632193?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4818163001605632193/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/08/zookeeper-and-concurrency.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4818163001605632193'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4818163001605632193'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/08/zookeeper-and-concurrency.html' title='Zookeeper and Concurrency'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1665915607101041335</id><published>2009-08-11T10:54:00.001-07:00</published><updated>2009-08-11T11:28:02.791-07:00</updated><title type='text'>Using java.util.concurrent.CountDownLatch to synchronize startup/shutdown</title><content type='html'>I've been a big fan of the &lt;a href="http://jcp.org/en/jsr/detail?id=166"&gt;Java concurrency library&lt;/a&gt; since I stumbled upon it a while back. 
Before it came along, I was relegated to writing my own thread pools, schedulers, etc., which meant, of course, that I was relegated to introducing lots of subtle and deviant bugs into code that had nothing to do with the actual product I was delivering. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The java.util.concurrent library freed me up to go ahead and focus on what I was really trying to deliver instead of re-inventing a hard-to-write wheel. Plus its authors are way smarter than me about concurrency. I highly recommend reading &lt;a href="http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601"&gt;Java Concurrency In Practice&lt;/a&gt;, even if you don't code in Java, because the concurrency issues they discuss are universal, even if the solutions are in Java. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the latest installment of 'how java.util.concurrent made me a happier, more productive developer', I was implementing a web service layer to control the run state of a set of worker threads. These workers needed to be started/stopped/paused/resumed/{insert favorite action here}. I didn't want to continue processing on the calling thread (the web service start/stop/etc methods) until I was sure that the action requested by the caller had completed across all worker threads, which run asynchronously.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My first thought was to write a pair of interfaces that allowed me to synchronize when an action was requested and when it was completed. 
In other words:&lt;/div&gt;&lt;div&gt;&lt;pre&gt;&lt;br /&gt;public interface Worker {&lt;br /&gt;public void start(Master master);&lt;br /&gt;public void stop(Master master);&lt;br /&gt;public void pause(Master master);&lt;br /&gt;// ...&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;public interface Master {&lt;br /&gt;public void started(Worker worker);&lt;br /&gt;public void stopped(Worker worker);&lt;br /&gt;public void paused(Worker worker);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;The problem with these interfaces and this design is that every time I needed to add an action to Worker, I needed to add a corresponding 'completed' message to Master. Also, the implementation of Master would need to track each worker against a worker pool, and scan that pool to see if an action was completed. That is clearly way too much work to write, let alone understand 3 months later. Also, I knew that this was a pretty common problem, probably solved by the concurrency lib. So I cracked open the book....&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;java.util.concurrent.CountDownLatch is, in the words of the guys who wrote the library, "a synchronizer that can delay the progress of threads until it reaches its terminal state". Hmm. 
Using the synchronizer frees me up from having to track the specific kind of state of N specific workers:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre&gt;public interface Worker {&lt;br /&gt; public enum RunState {&lt;br /&gt;  STOPPING,&lt;br /&gt;  STOPPED,&lt;br /&gt;  STARTING,&lt;br /&gt;  STARTED,&lt;br /&gt;  PAUSING,&lt;br /&gt;  PAUSED,&lt;br /&gt;  RESTARTING&lt;br /&gt; }&lt;br /&gt; public boolean start(TaskCompleted listener) throws Exception;&lt;br /&gt; public void stop(TaskCompleted listener) throws Exception;&lt;br /&gt; public void pause(TaskCompleted listener) throws Exception;&lt;br /&gt; public void restart(TaskCompleted listener) throws Exception;&lt;br /&gt; public void reload(TaskCompleted listener) throws Exception;&lt;br /&gt; public RunState getState() throws Exception;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;public interface TaskCompleted {&lt;br /&gt;public void completed();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;In the code above, &lt;div&gt;(1) I no longer care about which action is completed, or which worker completed the action, which means&lt;/div&gt;&lt;div&gt;(2) I am no longer keeping state for X workers in order to return from the call. 
&lt;/div&gt;&lt;div&gt;(3) The TaskCompleted interface can be implemented as an anonymous class in the response.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The CountDownLatch is pretty simple: await() blocks until its internal count reaches zero:&lt;/div&gt;&lt;pre&gt;private void pauseWorkers() throws Exception {&lt;br /&gt;   final CountDownLatch waitForPause = new CountDownLatch(workers.size());&lt;br /&gt;   for(WorkerBase worker : workers) {&lt;br /&gt;       // each worker counts the latch down once when its pause completes&lt;br /&gt;       worker.pause(new TaskCompleted() {&lt;br /&gt;           public void completed() {&lt;br /&gt;               waitForPause.countDown();&lt;br /&gt;           }&lt;br /&gt;       });&lt;br /&gt;   }&lt;br /&gt;&lt;br /&gt;   // blocks until every worker has called completed()&lt;br /&gt;   waitForPause.await();&lt;br /&gt;&lt;br /&gt;   // and now we're paused.&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;In the code above, the CountDownLatch is initialized to the number of workers I have. I iterate through the list of workers and perform the 'pause' action on them. Then I wait for the latch to get counted down to 0 before proceeding. I am keeping no state on the workers; I only care when they've completed their requested action. I suppose that I could replace the anonymous implementation with an actual (dirt simple) implementation that takes the latch and counts it down. 
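&lt;br /&gt;&lt;br /&gt;That named implementation could look something like the sketch below. Only TaskCompleted and CountDownLatch come from this post; LatchCountDown, PauseDemo, pauseAsync and pauseAll are hypothetical names, pauseAsync is just a stand-in for a real worker's pause() method, and the timed await() is an assumption added for the sketch rather than what pauseWorkers() does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Same single-method callback as the TaskCompleted interface in the post.
interface TaskCompleted {
    void completed();
}

// Hypothetical named replacement for the anonymous class:
// counts the latch down once for every completed() call.
final class LatchCountDown implements TaskCompleted {
    private final CountDownLatch latch;

    LatchCountDown(CountDownLatch latch) {
        this.latch = latch;
    }

    @Override
    public void completed() {
        latch.countDown();
    }
}

final class PauseDemo {
    // Stand-in for worker.pause(listener): the "worker" finishes on its
    // own thread and then reports back through the listener.
    static void pauseAsync(final TaskCompleted listener) {
        new Thread(new Runnable() {
            @Override
            public void run() {
                listener.completed();
            }
        }).start();
    }

    // Returns true if every simulated worker reported in before the timeout.
    static boolean pauseAll(int workerCount, long timeoutMillis) {
        CountDownLatch waitForPause = new CountDownLatch(workerCount);
        // One listener instance can be shared by all workers.
        TaskCompleted listener = new LatchCountDown(waitForPause);
        for (int i = 0; i != workerCount; i++) {
            pauseAsync(listener);
        }
        try {
            return waitForPause.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the pause as incomplete.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Calling PauseDemo.pauseAll(3, 5000) starts three stand-in workers and returns true once all of them have reported in. The plain await() used in pauseWorkers() blocks indefinitely if any worker never calls completed(); the timed variant returns false instead, which may or may not be what a web service caller wants.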
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1665915607101041335?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1665915607101041335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/08/using-javautilconcurrentcountdownlatch.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1665915607101041335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1665915607101041335'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/08/using-javautilconcurrentcountdownlatch.html' title='Using java.util.concurrent.CountDownLatch to synchronize startup/shutdown'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6119445195353731974</id><published>2009-07-16T16:11:00.000-07:00</published><updated>2009-07-16T17:41:30.110-07:00</updated><title type='text'>Upgrading to Eclipse Galileo 3.5 from Ganymede 3.4 on Mac OSX</title><content type='html'>These are my notes on what I had to do to upgrade to Eclipse &lt;a href="http://www.eclipse.org/org/press-release/20090624_galileo.php"&gt;Galileo&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Why?&lt;/b&gt;&lt;br /&gt;&lt;div&gt;My main motivation was to have 1.6 be my 
default JDK. With Ganymede I had to set my default JAVA_HOME env var to point to 1.5, and point 1.6-dependent apps -- like my command-line mvn builds -- to the (non-default) 1.6 JDK. That's exactly the kind of thing I forget 5 minutes after I do it. &lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;What?&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;Just in case I need to do this again: as far as I could tell, upgrading to a major version of Eclipse currently requires a full, clean install, which means no associated plugins. So I'm writing down the plugins I need to install, where to get them, etc. &lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;Base Install&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;I installed the 32-bit Cocoa version of Galileo from http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/galileo/R/eclipse-jee-galileo-macosx-cocoa.tar.gz&lt;/div&gt;&lt;br /&gt;&lt;div&gt;The difference between Cocoa and Carbon, and between 32 and 64 bit, is explained in detail &lt;a href="http://eclipse.dzone.com/articles/eclipse-galileo-mac-cocoa-or"&gt;here&lt;/a&gt;.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;The tar file unpacks to an eclipse directory: make sure you move your old version out of this dir if that's where you have it!&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;Plugin Installs&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;These were installed in order (Maven required Subversion).&lt;/div&gt;&lt;br /&gt;&lt;div&gt;(1) Subversion Plugin: I followed &lt;a href="http://blogs.open.collab.net/svn/2009/06/subversion-eclipse35-easy.html"&gt;these instructions&lt;/a&gt; to install subclipse.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;(2) Maven Plugin: I installed from http://m2eclipse.sonatype.org/update.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;Follow Up&lt;/b&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;(1) I needed to change my JAVA_HOME environment var to point to my 1.6 install (I use &lt;a 
href="http://landonf.bikemonkey.org/static/soylatte/"&gt;soylatte&lt;/a&gt;).&lt;div&gt;&lt;br /&gt;&lt;div&gt;(2) I needed to upgrade my subversion client to &gt; 1.4; otherwise I saw an 'unable to launch default SVN client' error when trying to browse my SVN repo. I downloaded &lt;a href="http://blogs.open.collab.net/svn/2009/06/subversion-eclipse35-easy.html"&gt;the latest svn client&lt;/a&gt;, restarted eclipse, and all was well.&lt;/div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6119445195353731974?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6119445195353731974/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/07/upgrading-to-eclipse-galileo-35-from.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6119445195353731974'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6119445195353731974'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/07/upgrading-to-eclipse-galileo-35-from.html' title='Upgrading to Eclipse Galileo 3.5 from Ganymede 3.4 on Mac OSX'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-324361261775224571</id><published>2009-06-30T15:20:00.000-07:00</published><updated>2009-06-30T19:33:05.522-07:00</updated><title type='text'>Running Zookeeper on the Mac with Soylatte and Eclipse 3.4</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I've been using &lt;/span&gt;&lt;/span&gt;&lt;a href="http://hadoop.apache.org/zookeeper/docs/r3.1.1/"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;Zookeeper&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt; to store a sequence number that a large number of processes can access and increment in a coordinated manner.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;Zookeeper has a nice, simple interface, and exposes a set of primitives that easily allow me to implement guaranteed synchronized access to my magic sequence number. 
I'll post more later on the specific solution, but right now I want to detail some of the issues I've run into and the workarounds I've put in place.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I run on a Mac (OSX/Leopard), use &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.eclipse.org/"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;Eclipse&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt; 3.4 for my Java development, and use &lt;/span&gt;&lt;/span&gt;&lt;a href="http://landonf.bikemonkey.org/static/soylatte/"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;soylatte&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt; for my JDK. I think a lot of other people run with this setup. 
I'm using Zookeeper 3.1.1.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;My initial setup steps:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I downloaded Zookeeper, untarred it, and installed it in /usr/local.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I created a symlink from zookeeper-3.1.1 to zookeeper.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;From that dir I ran sudo ./bin/zkServer.sh start.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I immediately ran into a strange issue: I could connect to the Zookeeper instance:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  
style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;ZooKeeper zk = new ZooKeeper("127.0.0.1:2181",ZookeeperBase.DEFAULT_TIMEOUT,this);&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;but could not create a node on it:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;zk.create("/HELLO", foo, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I kept getting timeouts. I've written my code to be 'timeout proof' because connection loss errors are to be expected under load in distributed environments, but I do kick out after 5 retries. 
Besides, I wouldn't expect to get the ConnectionLoss error when I am connecting to a localhost instance.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;It turns out that &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg00915.html"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;there have been Soylatte NIO issues with Zookeeper&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;. 
I talked to Satish (he's on the email thread in the link, and we both work at Evri), and he said he had success using the latest version of Java 1.6 that the Mac 'officially' supports.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;I switched to the latest &lt;/span&gt;&lt;/span&gt;&lt;a href="http://support.apple.com/kb/HT1856"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;Apple-supported Java 1.6 version&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;. When I pointed my Java binaries at 1.6, Zookeeper worked great, but Eclipse couldn't restart -- some more online research showed that &lt;/span&gt;&lt;/span&gt;&lt;a href="http://rolfje.wordpress.com/2008/12/28/eclipse-341-osx-and-java-16/"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;this was another known issue&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new 
roman';"&gt;So in the end: I&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;(1) created a java16 symlink to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin/java&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;(2) used that symlink in zkServer.sh&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span"  style="font-family:'times new roman';"&gt;(3) kept my $JAVA_HOME pointing to 1.5 by symlinking /System/Library/Frameworks/JavaVM.framework/Versions/1.5 to /System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-324361261775224571?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/324361261775224571/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/ive-been-using-zookeeper-to-store.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/324361261775224571'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/324361261775224571'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/ive-been-using-zookeeper-to-store.html' title='Running Zookeeper on the Mac with 
Soylatte and Eclipse 3.4'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1150022808215404713</id><published>2009-06-27T13:27:00.000-07:00</published><updated>2009-06-27T22:06:07.424-07:00</updated><title type='text'>Seattle Rock n Roll Marathon Race Report</title><content type='html'>Well, the &lt;a href="http://www.rnrseattle.com/"&gt;big event&lt;/a&gt; has come and gone. And I'm happy to have crossed the finish line under my own power! It was a great first marathon experience. To summarize, after not being able to run very much in the last six weeks due to 'life happening' and some late breaking bronchitis, I decided to run the marathon with no particular time goal in mind. However, if I could come close to my original goal of sub 4 hours, that would be gravy :) &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I ended up getting 4:10 (according to my watch, which I stopped during a short but necessary bathroom break). The race was going according to sub 4 pace and plan until mile 21, when massive calf cramps in both legs dramatically slowed me down. Cramping of any kind is frustrating, because you have to basically shut it down completely.  I wasn't even breathing hard, but I simply could not move any faster. I was able to fight the cramps off for a while, but by mile 25 they were pretty constant, and spreading from my calves to my groin and quads. In the last half mile I was cramping with every step, but I was basically in a tunnel of people at that point, and they cheered me on through the cramps. 
While it wasn't really 'fun' at the time, I'll never forget struggling down the finishing stretch and the support the crowd gave me that got me through it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The highlight of my race was seeing my friends Lori and Anthony at the 18.5 mile mark (when I was still feeling pretty good), with &lt;a href="http://www.facebook.com/photo.php?pid=1916894&amp;amp;id=590989067"&gt;their 'Run Arun' sign&lt;/a&gt;. Lori was training for the marathon when she injured her knee. In her place I would have been pouting and eating bon-bons on the couch, but she and Anthony came out and cheered us all on -- making the killer sign, giving me pretzels, gatorade and hugs (which must have been pretty gross since I was drenched in sweat). Thanks guys, seeing you both after the long grind uphill was exactly what I needed!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a first marathon, the Seattle Rock n Roll was perfect. The weather was great, the bands were great (especially for a guy that doesn't train with an iPod), the water stations were perfectly placed, and the course was challenging, with the biggest hill coming on at mile 15-18, running up the false flat of highway 99. For me,  the toughest part was running past the turnoff towards the finish on mile 23, and knowing I still had 3 long, cramp filled miles to go.  That was mental. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My cramping was probably due to the lack of recent training. I had trained for a specific pace, and kept that pace for the first 21 miles. But the time off caught up with me in the end, and my target pace was probably too fast for my current level of fitness. I guess I'll have to claim a 'moral' victory.  
And there will always be next year!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt; &lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1150022808215404713?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1150022808215404713/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/seattle-rock-n-roll-marathon-race.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1150022808215404713'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1150022808215404713'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/seattle-rock-n-roll-marathon-race.html' title='Seattle Rock n Roll Marathon Race Report'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5763362979942491308</id><published>2009-06-23T20:55:00.000-07:00</published><updated>2009-06-30T08:41:04.280-07:00</updated><title type='text'>Streaming Data with a Worker/Agent based approach</title><content type='html'>&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Where I was going....&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;In my last 
&lt;a href="http://arunxjacob.blogspot.com/2009/06/streaming-data-with-hadoop-and-sqs-work.html"&gt;post&lt;/a&gt; I described how, at work, we were investigating using &lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; in a non batch setting. I mentioned that despite not using Hadoop's ability to collate keyed data from large data sets, we were still investigating Hadoop because of the built in robustness of the system:&lt;div&gt;&lt;ul&gt;&lt;li&gt;Nodes are checked via 'heartbeat'.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Task status is centrally tracked.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Failed tasks are retried.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Work is pulled from the central JobTracker by TaskTrackers. &lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;The basic pain points of maintaining highly available and robust functionality across a cluster of machines are taken care of, and that was the primary motivator for us to try to stream data across a batch driven system. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, as we moved into implementation it became fairly obvious that we were pounding a square peg into a round hole. &lt;a href="http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/"&gt;A lot has been written&lt;/a&gt; about how Hadoop and HDFS don't work particularly well with small files -- the recommended solutions usually involve concatenating those files into something bigger to reduce the number of seeks per map job. While these problems were understandable in a system optimized to process huge amounts of data in batch, waiting to batch up large files wasn't an option given the low latency requirement of our end users. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Especially disconcerting was the amount of work (and code) spent bundling queued work items into small files, and submitting those files as individual jobs. 
The standard worker model -- having multiple processes with multiple threads per process running on multiple machines access &lt;a href="http://aws.amazon.com/sqs/"&gt;SQS&lt;/a&gt; and process the data -- seemed so much simpler than creating artificial batches.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;A Swift Change of Direction&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The rewrite took a matter of hours, dropped out a lot of code, and was a minor change to the overall architecture, which uses SQS to transition between workflow states, and S3 to persist the results of data transformations. The move away from Hadoop was limited to intermediate worker processes -- we still use Hadoop to get the data into the system, because we are collating data across a set of keys when importing data.  The latency went from somewhat indeterminate across mini batches to the average processing time per thread. And the workers were easily built on the &lt;a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Callable.html"&gt;Callable&lt;/a&gt; interface -- developers could implement new workers by overriding a single method that took a string as input. When latency of the system went up, simply adding more machines running more processes would take care of the problem. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;Distributed Availability and Retry Logic&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Of course, that simplicity came with a price tag -- we lost the distributed bookkeeping that Hadoop provided.  
Specifically, we would have to implement:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;thread and process failure detection&lt;br /&gt;&lt;/li&gt;&lt;li&gt;machine failure detection&lt;br /&gt;&lt;/li&gt;&lt;li&gt;retry logic&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;All of these are non-trivial to implement. However, our need to stream instead of batch data meant that we would have ended up having to do the retry logic differently than Hadoop does anyway. We need to catch and retry data failures at a work item level, not at an arbitrarily determined file split level. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our retry logic is pretty simple, and uses S3 to persist workflow state per work item. We traverse a list of items in the queue, determine which ones have 'stalled out', and submit them to the appropriate queue as part of a retry. At the same time we clean up work items that have been fully processed, and get average processing time per workflow process. These three things are best done in an asynchronous manner, as -- you guessed it -- Hadoop jobs. They need to take advantage of Hadoop's collation functionality. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our thread failure logic is also pretty simple. Because I'm starting up Callable tasks and making them run until I shut them down, I can check to see if any of them have finished prematurely by calling isDone() on the Futures returned when submitting them to the ExecutorService. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Process failure can be monitored (and logged) by a watchdog program. Repeated process failure in this case is symptomatic of an uncaught exception being thrown in one of the process threads. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Machine failure is also easy to monitor. 
I need to expose a simple service on each machine to detect process and thread failures, and if that process is not reachable, I can assume that the machine is offline. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;These may be fairly limited and crude methods of getting a highly available system in place, but they feel like the right primitives to implement because while I don't know why the system is going to fail, each of these methods gives me a way to know how it is failing.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: bold;"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;The Conclusion (so far)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The morals of the story at this point are:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;frameworks can be extremely powerful if used for their strengths, and extremely limiting if used for their secondary benefits. When it feels like I'm pounding a square peg into a round hole, I probably am. I think this is called 'design smell', and now that I know what it smells like, I'll start backing up a lot sooner in an effort to find the right tool for the job.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;It is always a good sign when a refactoring drops out lots of code.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Having to implement the availability and robustness of the system we are writing has actually made it easier to understand. Even though we are implementing functionality that we once got for free, at least we understand the limitations of the availability and robustness solutions we put in place. 
&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5763362979942491308?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5763362979942491308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/streaming-data-with-workeragent-based.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5763362979942491308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5763362979942491308'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/streaming-data-with-workeragent-based.html' title='Streaming Data with a Worker/Agent based approach'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6531124977087586689</id><published>2009-06-08T21:40:00.001-07:00</published><updated>2009-06-09T13:06:33.189-07:00</updated><title type='text'>Streaming Data with Hadoop and SQS -- a work in progress</title><content type='html'>&lt;span class="Apple-style-span"  style=" ;font-family:Times;"&gt;&lt;div style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 
0px; margin-left: 0px; padding-top: 3px; padding-right: 3px; padding-bottom: 3px; padding-left: 3px; width: auto; font: normal normal normal 100%/normal Georgia, serif; text-align: left; "&gt;&lt;div&gt;&lt;a href="http://hadoop.apache.org/"&gt;Hadoop&lt;/a&gt; is by design a batch oriented system. Take a bunch of data, run a series of maps and reduces with it across X machines, and come back when it's done. A Hadoop cluster has high throughput and high latency. In other words, it takes a while, but a lot of stuff gets done.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At &lt;a href="http://evri.com/"&gt;work&lt;/a&gt;, I'm leading a team that is implementing a data processing pipeline. The primary use case that we need to address involves quickly handling changes in the entity data we care about -- people, places, and things that act or act upon other people, places and things. Specifically, this means:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;detect that there has been a change&lt;/li&gt;&lt;li&gt;get the changed data into the system&lt;/li&gt;&lt;li&gt;apply a series of transforms to it for downstream rendering and searching systems&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;This use case requires a streaming solution. Each piece of data needs to be transformed and pushed into production relatively quickly. Note that there is an acceptable latency. If, for instance, Famous Actor X dies, our system needs to detect it and update the data within the hour. However, detecting/updating data within a day would be too slow to be useful to someone wanting to know what was up with Famous Actor X.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At first glance, Hadoop is not a good fit for this problem. 
It takes file based inputs and produces file based outputs, so any individual piece of data moving through the system is limited by the speed at which an entire input set can be run through a cluster.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, Hadoop has the following features that make it ideal for distributing work.&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;The &lt;a href="http://en.wikipedia.org/wiki/MapReduce"&gt;MapReduce&lt;/a&gt; abstraction is a very powerful one that can be applied to many different kinds of data transformation problems. &lt;/li&gt;&lt;li&gt;The primary &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Mapper.html"&gt;Mapper&lt;/a&gt; and &lt;a href="http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/Reducer.html"&gt;Reducer&lt;/a&gt; interfaces are simple enough to allow many different developers to ramp up on the system in minimal time.&lt;/li&gt;&lt;li&gt;&lt;a href="http://hadoop.apache.org/core/docs/current/hdfs_design.html"&gt;HDFS&lt;/a&gt; allows the developer to not worry about where they put job data.&lt;/li&gt;&lt;li&gt;Cluster setup and maintenance, thanks to companies like &lt;a href="http://cloudera.com/"&gt;Cloudera&lt;/a&gt; (who fixed the issues I was seeing with S3, thanks!), is taken care of.&lt;/li&gt;&lt;li&gt;The logic around distributing the work is completely partitioned away from the business logic that actually comprises the work.&lt;/li&gt;&lt;li&gt;Jobs that fail are re-attempted.&lt;/li&gt;&lt;li&gt;I'm sure there's more...&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;As our system scales, we can only assume that the number of concurrent inputs and therefore the system load will grow. 
If we were to take the lowest initial effort route and write our own multithreaded scheduler, we would have a much more straightforward solution.&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;However, as the workload grows to swamp a single machine, we would eventually end up having to deal with the headaches of distributed computing -- protocol, redundancy, failover/retry, synchronization, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In fact, I was fully intending to write a quick scheduler app to stream data and throw it away when we reached a scale that required distribution, at which point I was going to use Hadoop. However I soon realized that I would be solving problems -- scheduling, data access, retry logic -- and those were just the initial non distributed issues -- that were already addressed by Hadoop.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Still, there is the high latency/inherent batchiness of Hadoop. In order to work around the high latency problem, we're trying to enable streaming between MapReduces via &lt;a href="http://aws.amazon.com/sqs/"&gt;SQS&lt;/a&gt;. The input of the entity data into the system and the various transforms of that data can be treated as a series of MapReduce clusters, where the transformation is done during the map, and any necessary collation is done during the reduce. The Reduce phase of each MapReduce can send a notification for the piece of data it has to the next MapReduce cluster.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Of course, it is really inefficient to run MapReduce over a single piece of data at a time. So the listener on each SQS queue buckets incoming messages using a max messages/max time threshold to aggregate data. When a threshold is reached, the system then writes the collected messages to a file that it then starts a MapReduce job on. 
This is mini batching, and as long as it delivers the data within the specified system latency, it's all good.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What we are doing is definitely a work in progress. The reason is that there are several 'dials' in this process that need to be tuned and tweaked based on the size of the input data.&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;number of messages/amount of time to wait. Of course, the larger the input files, the more 'efficient' the system. On the other hand, making those files larger may imply an increase in latency.&lt;/li&gt;&lt;li&gt;number of concurrent jobs -- there is a sweet spot here -- it (again) doesn't make sense to launch 100 concurrent jobs. We will need to pay attention to how long a job takes before deciding to adjust the number of concurrent jobs up or down.&lt;/li&gt;&lt;li&gt;number of transformations -- the bucketing required implies that every transform has a built in latency that factors into the overall latency.&lt;/li&gt;&lt;li&gt;cluster size -- it makes no sense to run 20 nodes where 2 would suffice, but there will be times when 20 is necessary.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;Some of the dials may be adjustable on the fly -- sending messages to a cluster via an 'admin' queue allows us to change message size/max time/concurrent job numbers dynamically. Other dials may require a stop-reconfig-start of a cluster. One benefit of using SQS is that no messages are lost while we tweak these 'hard stop' dials.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We're not doing a classical MapReduce where we transform data and then collate it under a distinct key. We're doing simple transformations with the data. The reduce piece isn't really needed, because there is no collation to be done. 
Plenty of people use MapReduce this way, primarily because it allows them to easily decompose their logic into parallel processing units.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We are even going further by undermining one of the key strengths of MapReduce, the ability to operate on large chunks of data, and instead running lots of smaller jobs concurrently. The Hadoop framework makes this possible. I'm not sure how optimal this is, and expect that we will be tweaking the message size and concurrency dials for some time. One of the advantages that the underlying Hadoop framework offers us is flexibility, as evidenced by the kinds of dials we can tweak to get optimal system throughput and latency.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I'll keep updating as we learn more about how the system behaves, and what the 'right' settings of the dials are based on the amount of data being input. I don't know if this is necessarily the most correct or elegant way to build a pipeline, but I do know that it is a good sign the Hadoop framework is this flexible.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6531124977087586689?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6531124977087586689/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/streaming-data-with-hadoop-and-sqs-work.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6531124977087586689'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6531124977087586689'/><link rel='alternate' 
type='text/html' href='http://arunxjacob.blogspot.com/2009/06/streaming-data-with-hadoop-and-sqs-work.html' title='Streaming Data with Hadoop and SQS -- a work in progress'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-307317737222546023</id><published>2009-06-01T11:58:00.000-07:00</published><updated>2009-06-01T12:09:52.066-07:00</updated><title type='text'>logrotate and scripts that just cant let go.</title><content type='html'>I just found out that a logrotate job I had configured X months ago wasn't working when the lead came up to me and said 'what the $#!k is a 5GB file doing in my /var/log?' He was pissed because this was the second time logrotate had not been configured correctly (both times, my fault). The first time, we discovered logrotate.conf cannot have comments in it. This time, it looked like logrotate had run, but the script had kept the filehandle to the old log file open and was continuing to log to the rotated file.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;One way to fix this is to send a HUP to the process in the postrotate script of the logrotate config. This would mean I had to modify the script to trap the HUP signal, release the filehandle, get a handle to the new (same name) log file, and keep rolling on. I decided not to do this because it involved modifying the original script and I didn't have that much time to relearn things.&lt;br /&gt;&lt;br /&gt;The second way, which is the one I ended up using, was to kill and restart the process during postrotate. 
Here is the config file in /etc/logrotate.d&lt;br /&gt;&lt;br /&gt;/var/log/response_process.log {&lt;br /&gt; daily&lt;br /&gt; missingok&lt;br /&gt; rotate 52&lt;br /&gt; compress&lt;br /&gt; delaycompress&lt;br /&gt; notifempty&lt;br /&gt; create 640 root adm&lt;br /&gt; sharedscripts&lt;br /&gt;&lt;b&gt;&lt;i&gt; postrotate&lt;br /&gt;   kill $(ps ax | grep process.rb | grep -v 'grep' | awk '{print $1}')&lt;br /&gt;   ruby process.rb -logfile:/var/log/response_process.log &gt; /dev/null 2&gt;&amp;amp;1&lt;br /&gt; endscript&lt;br /&gt;&lt;/i&gt;&lt;/b&gt;}&lt;br /&gt;&lt;br /&gt;kind of hacky, but I didn't have to expend any mental effort making sure that the filehandle was truly closed in the script.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-307317737222546023?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/307317737222546023/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/logrotate-and-scripts-that-just-cant.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/307317737222546023'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/307317737222546023'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/06/logrotate-and-scripts-that-just-cant.html' title='logrotate and scripts that just cant let go.'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-3963609623541422872</id><published>2009-05-13T14:39:00.001-07:00</published><updated>2009-05-15T22:03:57.477-07:00</updated><title type='text'>Running a Hadoop 0.20  Cluster using S3 as input/output</title><content type='html'>&lt;div&gt;I've been changing a database ETL application into a set of MapReduces up on EC2. I need s3 as my input and output for each MapReduce, and was excited to see that Hadoop had s3 filesystem support built in. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;After stumbling through the ec2 scripts in 18.3, and finding a &lt;a href="http://www.cloudera.com/blog/2009/05/11/using-clouderas-hadoop-amis-to-process-ebs-datasets-on-ec2/"&gt;much easier go of it via the Cloudera scripts&lt;/a&gt;, I ran into a &lt;a href="https://issues.apache.org/jira/browse/HADOOP-3361"&gt;blocking issue&lt;/a&gt; (for me, anyway) with the version of Hadoop (based on 18.3) on the Cloudera AMI -- there were issues writing to S3 as the output. &lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I then started looking at 0.19.0, where the issue was fixed, but found &lt;a href="https://issues.apache.org/jira/browse/HADOOP-4684"&gt;another issue&lt;/a&gt; (again, s3 related, this time reading the input directory). I was able to reproduce this issue on my local box immediately, which saved some time. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This left me with 0.20.0, which claimed to have both issues fixed. I tested 0.20.0 on my local box with a small data set, and it passed. The next step was to build an AMI with Hadoop 0.20.0 on it, deploy that AMI to a reasonable sized cluster, and try to get through an entire run of my 133 million record input set, which was estimated to reduce to a 7.5 million record output set. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I decided to start by using the 0.20.0 src/contrib/ec2 scripts. The learning experience working with the original src/contrib/ec2 files in 18.3, and then working with the Cloudera scripts allowed me to move much faster this time. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to build an image using the scripts, you need to specify the following (in addition to the account access variables detailed &lt;a href="http://www.cloudera.com/blog/2009/05/11/using-clouderas-hadoop-amis-to-process-ebs-datasets-on-ec2/"&gt;here&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;HADOOP_VERSION -- I set to 0.20.0&lt;/div&gt;&lt;div&gt;S3_BUCKET -- I used my own bucket to store the AMI.&lt;/div&gt;&lt;div&gt;INSTANCE_TYPE -- Amazon small and medium instances are 32 bit, Large and XLarge instances are 64 bit. Specifying INSTANCE_TYPE  lets the shell load the correct base OS image. &lt;/div&gt;&lt;div&gt;JAVA_BINARY_URL -- the download link to the version of Java you want to use. Note this varies depending on the architecture (i386 or X86_64). For i386 I used: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;http://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_Developer-Site/en_US/-/USD/VerifyItem-Start/jdk-6u13-linux-i586.bin?BundledLineItemUUID=teRIBe.ohNsAAAEhz0pwgkAW&amp;amp;OrderID=yI1IBe.oPMYAAAEhuEpwgkAW&amp;amp;ProductID=RGtIBe.ou1AAAAEfpVYcydOO&amp;amp;FileName=/jdk-6u13-linux-i586.bin&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note that I then had to change my JAVA_VERSION to match the minor version specified: i.e. for the link above I had to set JAVA_VERSION  to 1.6.0_13.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now that all variables were configured, I ran &lt;/div&gt;&lt;div&gt;&lt;pre&gt;hadoop-ec2 create-image&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;to create the exact Hadoop image I needed. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;With the image created I then used &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;hadoop-ec2 initialize-cluster mycluster 20 &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;to create a 20-node cluster. I logged in, and the first thing I noticed was that JobTracker was not running on the master, and TaskTracker was not running on the slaves, even though they were specified to start right after the NameNode and DataNode (respectively) in the shell file executed at AMI boot time:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;if [ "$IS_MASTER" == "true" ]; then&lt;br /&gt;# MASTER&lt;br /&gt;...&lt;br /&gt;# Hadoop&lt;br /&gt;# only format on first boot&lt;br /&gt;[ ! -e /mnt/hadoop/dfs ] &amp;amp;&amp;amp; "$HADOOP_HOME"/bin/hadoop namenode -format&lt;br /&gt;&lt;br /&gt;"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode&lt;br /&gt;"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker&lt;br /&gt;else&lt;br /&gt;# SLAVE&lt;br /&gt;...&lt;br /&gt;# Hadoop&lt;br /&gt;&lt;br /&gt;"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode&lt;br /&gt;"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker&lt;br /&gt;&lt;br /&gt;fi&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So I ran the following command to get the (internal) names of the slave nodes (from my laptop):&lt;/div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;ec2-describe-instances | grep -w 'infocloud' | grep -ve 'infocloud-cluster.*' | awk '{print $5}'&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-family:Georgia;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;In this line I grep for my security group (infocloud) 
and then exclude the non-AMI lines that contain my cluster name (infocloud-cluster.*), and finally print the fifth field of each remaining line. This gives me a list of (Amazon EC2) internal domain names, like this:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;domU-12-31-39-02-B4-F3.compute-1.internal&lt;br /&gt;domU-12-31-39-00-B5-12.compute-1.internal&lt;br /&gt;domU-12-31-39-00-5D-E3.compute-1.internal&lt;br /&gt;domU-12-31-39-00-56-46.compute-1.internal&lt;br /&gt;domU-12-31-39-00-58-51.compute-1.internal&lt;br /&gt;domU-12-31-39-00-A8-B6.compute-1.internal&lt;br /&gt;domU-12-31-39-00-85-D1.compute-1.internal&lt;br /&gt;domU-12-31-39-01-74-22.compute-1.internal&lt;br /&gt;domU-12-31-39-00-E8-94.compute-1.internal&lt;br /&gt;domU-12-31-39-00-C6-13.compute-1.internal&lt;br /&gt;domU-12-31-39-00-DC-65.compute-1.internal&lt;br /&gt;domU-12-31-39-00-4D-D3.compute-1.internal&lt;br /&gt;domU-12-31-39-01-5C-B6.compute-1.internal&lt;br /&gt;domU-12-31-39-00-B2-54.compute-1.internal&lt;br /&gt;domU-12-31-39-00-66-06.compute-1.internal&lt;br /&gt;domU-12-31-39-00-E5-B7.compute-1.internal&lt;br /&gt;domU-12-31-39-00-68-06.compute-1.internal&lt;br /&gt;domU-12-31-39-00-88-46.compute-1.internal&lt;br /&gt;domU-12-31-39-00-7D-C8.compute-1.internal&lt;br /&gt;domU-12-31-39-00-A1-08.compute-1.internal&lt;br /&gt;domU-12-31-39-00-C2-15.compute-1.internal&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Note that the first node in this list is the master node.  
I echoed this output into a file that I then pushed up to the master:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;hadoop-ec2 push infocloud-cluster nodes.txt&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;and then wrote some ruby to parse it: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;HADOOP_HOME="/usr/local/hadoop-0.20.0"&lt;br /&gt;&lt;br /&gt;# name of this (master) host, without the trailing newline&lt;br /&gt;hostname = `hostname`.chomp&lt;br /&gt;&lt;br /&gt;File.open("nodes.txt") do | file |&lt;br /&gt;while(slave = file.gets)&lt;br /&gt;&lt;br /&gt;slave = slave.chomp&lt;br /&gt;# build the command fresh for each node so daemon names do not accumulate&lt;br /&gt;cmd = "ssh #{slave} #{HADOOP_HOME}/bin/hadoop-daemon.sh start"&lt;br /&gt;if(slave.start_with?(hostname))&lt;br /&gt;  cmd += " jobtracker"&lt;br /&gt;else&lt;br /&gt;  cmd += " tasktracker"&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;`#{cmd}`&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;So I start the job tracker on the master node (where I run the job from), and the task tracker on every other node. Note I shouldn't have to do this, and I'm still trying to figure out why the original command in the remote startup script didn't work. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once job tracker and task trackers were started, the cluster was effectively up. I'm going to see if I can get the remote startup script to work as designed, because that final step is hacky. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Finally, once I had started up the cluster successfully, I noticed that there was only one node configured to do reduces. I remedied this by changing my generated hadoop-site.xml, which, btw, is flagged deprecated for the 0.20.0 version (it still works, but probably not for the next version). 
The hadoop-site.xml is generated in hadoop-ec2-init-remote.sh, here is what I modified:&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;&amp;lt;?xml version="1.0"?&amp;gt;&lt;br /&gt;&amp;lt;?xml-stylesheet type="text/xsl" href="configuration.xsl"?&amp;gt;&lt;br /&gt;&amp;lt;configuration&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;hadoop.tmp.dir&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;/mnt/hadoop&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;fs.default.name&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;hdfs://$MASTER_HOST:50001&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.job.tracker&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;hdfs://$MASTER_HOST:50002&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;tasktracker.http.threads&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;80&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.reduce.parallel.copies&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;20&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.reduce.tasks&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;20&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.tasktracker.map.tasks.maximum&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;3&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.tasktracker.reduce.tasks.maximum&amp;lt;/name&amp;gt;&lt;br 
/&gt;&amp;lt;value&amp;gt;3&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.output.compress&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;mapred.output.compression.type&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;BLOCK&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;dfs.client.block.write.retries&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;3&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;fs.s3n.awsAccessKeyId&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;YOUR ACCESS KEY ID&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;property&amp;gt;&lt;br /&gt;&amp;lt;name&amp;gt;fs.s3n.awsSecretAccessKey&amp;lt;/name&amp;gt;&lt;br /&gt;&amp;lt;value&amp;gt;YOUR SECRET KEY&amp;lt;/value&amp;gt;&lt;br /&gt;&amp;lt;/property&amp;gt;&lt;br /&gt;&amp;lt;/configuration&amp;gt;&lt;br /&gt;&lt;/pre&gt;With these changes in place, I have started a hadoop job that has both s3 inputs and outputs.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-3963609623541422872?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/3963609623541422872/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/it-rained-on-my-cloud-computing-parade.html#comment-form' title='3 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3963609623541422872'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3963609623541422872'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/it-rained-on-my-cloud-computing-parade.html' title='Running a Hadoop 0.20  Cluster using S3 as input/output'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1446719540141321427</id><published>2009-05-11T15:15:00.000-07:00</published><updated>2009-05-11T15:25:21.279-07:00</updated><title type='text'>My Guest Post on Cloudera</title><content type='html'>I used to think that this blog had an audience of one -- me. Not even my mom reads this blog. However the guys from &lt;a href="http://cloudera.com/"&gt;Cloudera&lt;/a&gt; stumbled upon a recent &lt;a href="http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html"&gt;post&lt;/a&gt; I had written about getting &lt;a href="http://hadoop.apache.org/core/"&gt;Hadoop&lt;/a&gt; running in &lt;a href="http://aws.amazon.com/ec2/"&gt;EC2&lt;/a&gt;, and pointed me to their EC2 setup scripts, which pick up where src/contrib/ec2 left off. They let me write up a &lt;a href="http://www.cloudera.com/blog/2009/05/11/using-clouderas-hadoop-amis-to-process-ebs-datasets-on-ec2/"&gt;guest post&lt;/a&gt; on configuring and running a MapReduce job using their scripts. 
Check it out!&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1446719540141321427?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1446719540141321427/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/my-guest-post-on-cloudera.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1446719540141321427'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1446719540141321427'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/my-guest-post-on-cloudera.html' title='My Guest Post on Cloudera'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-3228339674899292604</id><published>2009-05-11T11:43:00.000-07:00</published><updated>2009-05-11T11:59:20.759-07:00</updated><title type='text'>A Reality Check</title><content type='html'>From an email I sent last Friday:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style:italic;"&gt;Hi, &lt;/span&gt;&lt;div&gt;&lt;span style="font-style:italic;"&gt;&lt;br /&gt;Last week,  Leela had developed an abscess (bacterial infection) behind her throat wall under the base of the skull, where all the nerves and muscles attach to the spine. It inflamed her neck and shoulder muscles and they spasmed.  
Over three days it had progressed from what we thought was a slight neck sprain to the point where her head was locked down to one side, she was running a fever, and was in a massive amount of pain. Yesterday I took her to the doctor first thing in the AM and he sent me directly to the ER at Childrens Hospital.&lt;br /&gt;&lt;br /&gt;After multiple tests, they figured out the abscess thing and did surgery to remove it around 7 last night. Because we and her doctor had originally thought it was a neck sprain and nothing more, the infection had time to spread and she needs to be hospitalized until it disappears from the surrounding tissue.&lt;br /&gt;&lt;br /&gt;The good news is that  -- unlike a lot of kids at Childrens -- the situation has a very high probability of a good resolution, and we were able to catch it before it occluded the carotid artery or her windpipe. Also, she's an extremely tough little girl, I thought so before this but after yesterday I'm convinced that she has the constitution of a Navy Seal. &lt;br /&gt;&lt;br /&gt;Childrens Hospital is an extremely hard place for a parent to be, but the staff is amazing. We are so lucky to have them here in Seattle. I have been in a lot of emergency rooms and I know that it would have been much harder to get the kind of treatment and attention that Leela received at Childrens at any other hospital (holy run on sentences, batman). We're not sure how long she is going to need to be hospitalized, it depends on how quickly she can recover. Lopa spent the night last night, and we'll be trading off over the next couple of days.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Update: She is coming home today (Monday) after going into the ER on Friday. It's amazing how fast kids bounce back. She had a rough Saturday, she was in pain, constipated, and still fighting the infection. But on Sunday she was back to herself again, giggling, teasing me, and smiling. 
&lt;/span&gt;&lt;span style="font-style:italic;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Before this whole thing went down, I had a big 'todo' list for the week and weekend:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(1) finish up guest blog post for cloudera.&lt;/div&gt;&lt;div&gt;(2) assemble extracted data up to cloud via MapReduce.&lt;/div&gt;&lt;div&gt;(3) Go on 26 mile slow run&lt;/div&gt;&lt;div&gt;(4) mow lawn and attempt to fix waterfall.&lt;/div&gt;&lt;div&gt;(5) Play goalie for Pele's Nightmare on Friday night.&lt;/div&gt;&lt;div&gt;(6) Take Kiran to his lacrosse game.&lt;/div&gt;&lt;div&gt;(7) Celebrate Mothers day down at Seward park on bikes.&lt;/div&gt;&lt;div&gt;(8) refactor some old code using java concurrency lib goodies.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;All of that got massively pre-empted, and as I sat in the ER looking at my little 5 year old daughter scream in pain, all of it really didn't matter anymore. The only thing that did matter was finding out what was wrong with her and fixing it. Fortunately the doctors were able to do just that, and through the weekend I was able to come up with a better todo list:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(1) be with my family and enjoy them.&lt;/div&gt;&lt;div&gt;(2) celebrate the little things in life.&lt;/div&gt;&lt;div&gt;(3) never take the people I love for granted.&lt;/div&gt;&lt;div&gt;(4) all that other stuff is gravy.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You would think that after watching my dad lose his battle with cancer 2 years ago that these priorities would be second nature, but that's not the way my mind works, I tend to 'glorify the mundane'. This past weekend was very hard, but at the same time it was a good reminder of what really matters. 
&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-3228339674899292604?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/3228339674899292604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/reality-check.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3228339674899292604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3228339674899292604'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/reality-check.html' title='A Reality Check'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-8462315393243839760</id><published>2009-05-07T14:43:00.000-07:00</published><updated>2009-05-07T14:44:47.031-07:00</updated><title type='text'>Video on Cloud Computing Amazon EC2</title><content type='html'>Again, more of a pointer than a blog post. 
&lt;br /&gt;&lt;br /&gt;&lt;object width="400" height="300"&gt;&lt;param name="allowfullscreen" value="true" /&gt;&lt;param name="allowscriptaccess" value="always" /&gt;&lt;param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=3616394&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" /&gt;&lt;embed src="http://vimeo.com/moogaloop.swf?clip_id=3616394&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="300"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;p&gt;&lt;a href="http://vimeo.com/3616394"&gt;Cloud Computing on Amazon AWS EC2&lt;/a&gt; from &lt;a href="http://vimeo.com/user1427769"&gt;ray@sharethis.com&lt;/a&gt; on &lt;a href="http://vimeo.com"&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-8462315393243839760?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/8462315393243839760/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/video-on-cloud-computing-amazon-ec2.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8462315393243839760'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8462315393243839760'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/video-on-cloud-computing-amazon-ec2.html' title='Video on Cloud Computing Amazon EC2'/><author><name>Arun 
Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-9026759881059909657</id><published>2009-05-07T13:40:00.000-07:00</published><updated>2009-05-07T13:41:21.002-07:00</updated><title type='text'>Economy of Hadoop on AWS</title><content type='html'>This is more of a note to myself, so I don't lose the video, and can fwd to others.&lt;br /&gt;&lt;br /&gt;&lt;object width="400" height="225"&gt;&lt;param name="allowfullscreen" value="true" /&gt;&lt;param name="allowscriptaccess" value="always" /&gt;&lt;param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=4211288&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" /&gt;&lt;embed src="http://vimeo.com/moogaloop.swf?clip_id=4211288&amp;amp;server=vimeo.com&amp;amp;show_title=1&amp;amp;show_byline=1&amp;amp;show_portrait=0&amp;amp;color=&amp;amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="400" height="225"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;p&gt;&lt;a href="http://vimeo.com/4211288"&gt;Big Data: On Cloud Computing and Hadoop&lt;/a&gt; from &lt;a href="http://vimeo.com/user1600800"&gt;Roberto Monge&lt;/a&gt; on &lt;a href="http://vimeo.com"&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-9026759881059909657?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/9026759881059909657/comments/default' title='Post Comments'/><link 
rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/economy-of-hadoop-on-aws.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/9026759881059909657'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/9026759881059909657'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/05/economy-of-hadoop-on-aws.html' title='Economy of Hadoop on AWS'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-3297842957062044473</id><published>2009-04-14T15:24:00.001-07:00</published><updated>2009-04-27T14:43:45.187-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Hadoop EC2 Amazon AWS'/><title type='text'>Configuring a Hadoop cluster on EC2</title><content type='html'>I've been ramping up on Amazon Elastic MapReduce, but now I need to process a 32GB file located in an Elastic Block Store, and there is no way I know of to get the AMIs that Amazon Elastic MapReduce starts up to mount an arbitrary EBS. So now it's time to roll my own Hadoop Cluster out on EC2. 
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I looked around for a while, and found this &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873"&gt;somewhat out of date tutorial&lt;/a&gt; by Tom White that pointed me to a set of ec2 helper scripts in the src/contrib subdir of the Hadoop installation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Unfortunately, those scripts did not get me 'all the way there', but they were a start.  I'm going to try to roll my changes into those ec2 helper scripts before I have to set up another cluster :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"   style="font-family:Monaco;font-size:100%;"&gt;&lt;span class="Apple-style-span"  style=" ;font-family:Georgia;"&gt;&lt;div&gt;&lt;div&gt;&lt;b&gt;Setting Up a Multi Node Hadoop Cluster on EC2&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Prior to setting up a Multi Node Hadoop Cluster, I set up a &lt;a href="http://hadoop.apache.org/core/docs/current/quickstart.html#Local"&gt;single node standalone installation&lt;/a&gt;. I recommend doing this because it allowed me to make sure my code worked, i.e. my jar file was valid, my Mapper and Reducer were working, etc. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order to set up a multi node hadoop cluster, the &lt;a href="http://public.yahoo.com/gogate/hadoop-tutorial/html/module7.html"&gt;standard Hadoop Cluster setup instructions&lt;/a&gt; applied to the EC2 environment meant that I would have to do the following:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;find an AMI with hadoop on it&lt;/li&gt;&lt;li&gt;bring up N+1 of those&lt;/li&gt;&lt;li&gt;make one the master and the rest the slaves&lt;/li&gt;&lt;li&gt;change the master config to account for the slaves&lt;/li&gt;&lt;li&gt;change the slave config to point to the master&lt;/li&gt;&lt;li&gt;allow Hadoop component port access between master and slave for namenode and datanode communication&lt;/li&gt;&lt;li&gt;start the system up.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;The scripts at  {hadoop src location}/src/contrib/ec2/bin use the ec2 API shells to attempt to do all of the above. They fall short in a couple of key areas, and need to be extended. I'm going to detail the necessary steps I took to get a cluster fully operational so that I can extend those scripts in the future. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What the scripts do:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Find AMIs and start up instances: N slave instances and 1 master instance. &lt;/li&gt;&lt;li&gt;Allow you to log into the master as well as push files out to it.&lt;/li&gt;&lt;li&gt;Generate a private/public key on the master, and push the public key out to the slaves to enable password-less ssh.  &lt;/li&gt;&lt;li&gt;Push the master hadoop-site.xml out to all slaves.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;What they do not do: &lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;they do not configure the master conf/slaves file to contain the IPs of all slaves.&lt;/li&gt;&lt;li&gt;they do not set up security groups with overridden port values specified in the /etc/rc.local of the AMI I was using. 
Those values are catted to conf/hadoop-site.xml. To be honest, there is no way they could actually be aware of those values unless the scripts were synchronized to the image, which they weren't.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Both of these mean that true distributed startup doesn't happen. But the failure is 'silent', so unless you are looking at the logs on multiple machines, you don't know that things are failing. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Initial Script Setup Steps&lt;/b&gt; &lt;/div&gt;&lt;div&gt;Here are the steps I used to get working with the scripts. Note that the AMI the scripts point to by default has version 0.17 of Hadoop installed.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(1) I configured my EC2_PRIVATE_KEY and EC2_CERT env vars to point to the .pem files I generated for them.&lt;br /&gt;(2) In  {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2-env.sh, I set the following env vars:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;AWS_ACCOUNT_ID={acct number}&lt;/li&gt;&lt;li&gt;AWS_ACCESS_KEY_ID={key id}&lt;/li&gt;&lt;li&gt;AWS_SECRET_ACCESS_KEY={secret key}&lt;/li&gt;&lt;li&gt;KEY_NAME={name of KeyPair you want to use} NOTE: on the KeyPair, the hadoop-ec2 scripts assume that the generated private key for your keypair resides in the same directory you configured your EC2_PRIVATE_KEY in.&lt;/li&gt;&lt;/ul&gt;(3) {hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} {number of desired nodes} to start up a cluster with the AMI configured at the S3_BUCKET location specified in the conf file. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, I thought the cluster was up and running, but when I tried to copy a large file to the cluster, I got this error: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;org.apache.hadoop.ipc.RemoteException: java.io.IOException: File&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;/user/root/input could only be replicated to 0 nodes, instead of 1&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at 
java.lang.reflect.Method.invoke(Method.java:597)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;I googled this and found out that it implied that my data nodes were failing (but I hadn't seen that!). I checked the masters and slaves files in the master machine's conf directory, and they only contained localhost, which meant that the master knew nothing about the slaves at startup. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I stopped Hadoop, changed the conf/slaves config file to include the Amazon internal names of all slaves,  and restarted. This time I could see the remote slave data nodes start up. So I tried the copy again, and got the same failure. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When I went out to a slave machine, I looked at the datanode log file in the log directory (on this AMI, configured at /mnt/hadoop/logs). I saw that the datanode service was trying to contact the master with no success. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This ended up being because of the security policy of EC2. In EC2, you need to explicitly configure which ports are accessible on each instance via EC2 Security Groups. 
In summary, the current scripts assumed the default ports from hadoop-default.xml, and some of those defaults were overridden in hadoop-site.xml.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Summary:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;extended conf/slaves to include IP addresses of slave instances.&lt;/li&gt;&lt;li&gt;added 50001 and 50002 access to the master security group (meaning that slave nodes could talk to the master on those ports)&lt;/li&gt;&lt;li&gt;added 50010 access to the slave security group (same meaning for master to slaves)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;At this point Hadoop was configured with 4 slave nodes and 1 master.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Attaching an EBS volume to the master and copying EBS data to HDFS&lt;/b&gt;&lt;/div&gt;&lt;div&gt;(1) My data source was located in an Elastic Block Store volume. A volume is attached to a running instance like so:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;ec2-attach-volume {volume ID} -i {instance ID} -d {device name on the instance, e.g. /dev/sda}&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(2) In order to actually access the data, you mount that volume like you would mount a drive (device first, then mount point):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;mount {name of device to mount, e.g. /dev/sda} {name of dir to map to}&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Running the Job&lt;/b&gt;&lt;/div&gt;&lt;div&gt;(1) Once I mounted the volume, I needed to log into the master to start the job. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} login &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(2) Then  I need to push the file to HDFS for processing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;hadoop fs -mkdir input&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;hadoop fs -copyFromLocal {location of articles.tsv} input&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(3) From my local box, I  push my jar out to the master:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;{hadoop src location}/src/contrib/ec2/bin/hadoop-ec2 {name of cluster} push {name of my jar file}&lt;/span&gt;&lt;/span&gt; &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(4) On the master box, I start the job: &lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;hadoop jar {name of jar} input output&lt;/span&gt;&lt;/span&gt; (NOTE: job will fail immediately if output exists in hdfs)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; 
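&lt;br /&gt;Because the datanode failures described in this post were silent, a quick reachability check run from a slave can save a trip through the logs. The following is a minimal Python sketch, not part of the hadoop-ec2 scripts: the function names are my own, the host name in the usage comment is hypothetical, and the ports are the ones listed in the summary above (50001 and 50002 on the master).&lt;br /&gt;
&lt;pre&gt;
```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
    conn.close()
    return True


def check_ports(host, ports):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_reachable(host, port) for port in ports}


# Usage, run from a slave instance (the host name below is hypothetical):
#   check_ports("ip-10-0-0-1.ec2.internal", [50001, 50002])
```
&lt;/pre&gt;
A False entry only says that the port did not accept a connection; it cannot tell a closed security group apart from a namenode that is not running, so the datanode log is still the place to confirm the cause.&lt;br /&gt;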
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-3297842957062044473?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/3297842957062044473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3297842957062044473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3297842957062044473'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html' title='Configuring a Hadoop cluster on EC2'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-3661546140913054747</id><published>2009-04-08T13:11:00.000-07:00</published><updated>2009-04-13T12:37:51.989-07:00</updated><title type='text'>Developing an Application on Amazon Elastic MapReduce</title><content type='html'>&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Follow up from &lt;/span&gt;&lt;a href="http://arunxjacob.blogspot.com/2009/04/how-i-got-rolling-in-cloud.html"&gt;&lt;span class="Apple-style-span"  style="color:#3333FF;"&gt;&lt;span 
class="Apple-style-span" style="font-size: medium;"&gt;last week&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="color:#3333FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;: &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I've been interested in parallel processing for a while, but didn't have the time/opportunity to get into it until recently, when a group of us were asked to rewrite a data extraction system that was built around a single (relational) database, and suffering from severe i/o contention.  &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The actual mechanics of what we were trying to do with that system lent themselves to parallel processing, specifically there was a lot of data transformation and aggregation that we needed to do -- in other words &lt;/span&gt;&lt;a href="http://labs.google.com/papers/mapreduce.html"&gt;&lt;span class="Apple-style-span"  style="color:#3333FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;map-reduce&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Right around the time that we decided to do the rewrite,  Amazon came out with &lt;/span&gt;&lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Elastic MapReduce&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. 
Prior to EMR, our options were to either set up a local &lt;/span&gt;&lt;a href="http://wiki.apache.org/hadoop/FrontPage"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Hadoop&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; cluster, or assemble one out on EC2. My experience with Hadoop was zero, and my experience with the map/reduce algorithm was limited to working through the &lt;/span&gt;&lt;a href="http://en.wikipedia.org/wiki/MapReduce"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;canonical Inverted Index example&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; in my head.  So I wanted to minimize the learning curve of setting up a cluster while still being able to validate our ideas. I don't know if we are eventually going to end up building our own AMIs for a custom Hadoop cluster, but not having to deal with those details at this point allows us to focus on the ideas instead of the infrastructure. Yay!&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;In my &lt;/span&gt;&lt;a href="http://arunxjacob.blogspot.com/2009/04/how-i-got-rolling-in-cloud.html"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;last post&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; I explained the steps I took to run the basic streaming sample. Next up, I wanted to run my own map reduce job. 
In order to develop an EMR application utilizing a custom jar, I took the following steps:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(1) Set up a single node hadoop installation on my dev box.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Because &lt;/span&gt;&lt;a href="http://aws.amazon.com/elasticmapreduce/#pricing"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Time Is Money on EC2&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;, I wanted to get my code as right as I could before running it in the cloud. The Hadoop site provides a great &lt;/span&gt;&lt;a href="http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Node_Cluster)"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;step by step install sheet for Leopard&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. 
&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(2) Went through the&lt;/span&gt;&lt;span class="Apple-style-span" style="line-height: 15px; "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; &lt;/span&gt;&lt;a href="http://hadoop.apache.org/core/docs/r0.18.3/mapred_tutorial.html"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;basic Hadoop Tutoria&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;l&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. This tutorial provides great explanations of each class the developer has to implement, as well as the helper classes. Whenever I got stuck, the answer was usually found in the tutorial.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(3) Installed the &lt;/span&gt;&lt;a href="http://www.alphaworks.ibm.com/tech/mapreducetools"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;IBM MapReduce Tools Plugin&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; for Eclipse. While I am not quite able to get it debugging a map/reduce job from my local Hadoop install, I can run my program in 'standalone' mode as a Java Application, as long as Hadoop is started on my system. 
&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(4) After debugging in the IDE, I ran the job as a standard hadoop job: &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;hadoop jar {name of jar} {input dir} {output dir}&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I chose  to export the jar from eclipse (instead of writing a build.xml), as part of this I got to specify the main class and autogenerate the MANIFEST.MF. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;The input directory and output directory needed to be specified as HDFS directories. The output directory could not exist. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span" style="line-height: normal; "&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;NOTE on the application driver: I took a clue from example code (eventually) after running into unzip errors which a couple of posts said were due to bad configuration. 
I derived the driver class from Configured and implemented the Tool interface.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(5) After it ran successfully as a local Hadoop job, I ran the job in the cloud: &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(a) I copied the jar up to S3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(b) I copied the input files up to S3&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;(c) I created a top level bucket for the output directories. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;I tried to run my job as it was configured for hadoop, with the input and output specified as s3n://{bucket-name} with no success. &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;In order to diagnose the problem, I enabled logging, which I should have done immediately. When configuring a job via the AWS Management console, select 'Advanced Options' in the 'configure EC2 instances' section. You need to specify the log destination using s3n://{bucket-name} format. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;With logging enabled, I saw the following exception: &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;java.lang.IllegalArgumentException: Path must be absolute: s3n://java-emr-test&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px; "&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;and it made no sense, even with &lt;/span&gt;&lt;/span&gt;&lt;a href="http://www.google.com/search?client=safari&amp;amp;rls=en-us&amp;amp;q=java.lang.IllegalArgumentException:+Path+must+be+absolute:&amp;amp;ie=UTF-8&amp;amp;oe=UTF-8"&gt;&lt;span class="Apple-style-span"  
style="color:#3366FF;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;googling&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;. I ran the &lt;/span&gt;&lt;/span&gt;&lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2272"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;cloudburst sample&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;, and noticed that they specified a full filename as the input parameter. When I did the same, my run succeeded. I'm not sure why I had to do this, and want to make sure I really need to do it, because it will mean an additional read step of intermediate output directories prior to subsequent map reduce steps. 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="line-height: 15px;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;Now that I've got this working through the UI, I'm going to look at the &lt;/span&gt;&lt;/span&gt;&lt;a href="http://github.com/entangledstate/amazon-elastic-mapreduce/blob/87507cbd7f40db6179d8825021bf4506b6416adc/README"&gt;&lt;span class="Apple-style-span"  style="color:#3366FF;"&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;entangledstate ruby tool&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span"  style="font-family:arial;"&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt; because it appears to allow programmatic configuration of multi step job flows.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-3661546140913054747?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/3661546140913054747/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/developing-application-on-amazon.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3661546140913054747'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3661546140913054747'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/developing-application-on-amazon.html' title='Developing an Application on Amazon Elastic MapReduce'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6587737190881164931</id><published>2009-04-06T16:14:00.000-07:00</published><updated>2009-04-06T16:58:02.785-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='map-reduce'/><category scheme='http://www.blogger.com/atom/ns#' term='cloud'/><category scheme='http://www.blogger.com/atom/ns#' term='amazon aws'/><title type='text'>How I got rolling in the cloud</title><content type='html'>I recently jumped at the chance to research re-implementing a project here at work in the Amazon cloud. I've been curious about running EC2 instances for a while, and when AMZN announced E&lt;a href="http://aws.amazon.com/elasticmapreduce/"&gt;lastic MapReduce&lt;/a&gt;, their cloud implementation that removed the need to hand assemble Hadoop clusters, I really didn't have any excuses left. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There was a bit of FUD involved in actually getting into an actual cloud -- creating running instances, talking to various services, etc, in addition to trying a new approach to a system that was quickly approaching non functional. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This FUD was complicated by my head cold and the medication I was taking, but despite that fog (yes, I always blame it on the drugs) I was able to muddle through and get something going. My notes (aka a series of pointers to other people's work): &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(1) I needed to view some sample code at &lt;span class="Apple-style-span" style="font-family: 'Times New Roman'; color: rgb(51, 51, 51); "&gt;&lt;span class="Apple-style-span" style="font-size: medium;"&gt;http://elasticmapreduce.s3.amazonaws.com/&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Times New Roman'; color: rgb(51, 51, 51); font-size: 12px; "&gt; &lt;/span&gt;, i.e. code that was not in my personal S3 store. I tried to build a couple of S3 browsers, and was about to embark on a yak shaving exercise due to a misconfigured ant build on my dev box when I decided to try &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=128"&gt;s3curl&lt;/a&gt; instead. s3curl and irb loaded with hpricot allowed me to get an XML listing of keys in a bucket, then parse the returned XML and download the source code files I wanted to see: specifically, the AWS Elastic MapReduce Freebase sample code. I'm 100% sure I could have done this via a UI, but really didn't want to get distracted trying to fix a secondary issue.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(2) For browsing and syncing my personal S3 store, I used the &lt;a href="https://addons.mozilla.org/en-US/firefox/addon/3247"&gt;S3 Firefox Organizer&lt;/a&gt; plugin. It is especially useful when inspecting the output of a map-reduce run. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(3) For configuring AMIs and binding EBS volumes of public instance data, I used &lt;a href="http://developer.amazonwebservices.com/connect/entry.jspa?externalID=609"&gt;ElasticFox&lt;/a&gt;, another Firefox plugin. The &lt;a href="http://ec2-downloads.s3.amazonaws.com/elasticfox-owners-manual.pdf"&gt;tutorial&lt;/a&gt; walks you through the details of how to generate a keypair, create an instance from an AMI, and bind to an EBS volume.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(4) The application I'm working on (for work) processes Wikipedia and Freebase, both of which can be painful and time-consuming to get dumps of. Freebase has done the 'right thing' and posted &lt;a href="http://aws.amazon.com/publicdatasets/"&gt;public instances&lt;/a&gt; of the Freebase data store as well as a 'cleaned up' version of the Wikipedia data store that is suitable for a PostgreSQL database. Just having these volumes available removes at least 4 hours of setup and maintenance time from our process.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(5) As part of their announcement, Amazon posted a &lt;a href="http://awsmedia.s3.amazonaws.com/pdf/introduction-to-amazon-elastic-mapreduce.pdf"&gt;tutorial&lt;/a&gt; on how to use Elastic MapReduce with Freebase data. This great PDF walked me through using the CLI to set up several different workflows using different mappers and reducers to find the most popular people in American Football. The mappers and reducers output data to S3 and SimpleDB, which was great for me to see since I didn't have a lot of familiarity with either.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That's it for now. I'm going to write more as I prototype key parts of the system and try to figure out the best way to implement them. 
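&lt;br /&gt;&lt;br /&gt;The key-listing step from note (1) can be sketched in a few lines. I did it with irb and hpricot; the version below is a rough Python equivalent using only the standard library, and it is a sketch under assumptions: it expects the standard S3 bucket-listing XML (Key elements in the 2006-03-01 namespace), the helper names are my own, and the sample keys are hypothetical stand-ins built locally rather than fetched from the real bucket.&lt;br /&gt;
&lt;pre&gt;
```python
import xml.etree.ElementTree as ET

# Namespace used by S3 bucket listing responses.
S3_NS = "http://s3.amazonaws.com/doc/2006-03-01/"


def list_keys(listing_xml, suffix=""):
    """Return the object keys in a bucket listing, optionally filtered by suffix."""
    root = ET.fromstring(listing_xml)
    keys = [el.text for el in root.iter("{%s}Key" % S3_NS)]
    return [key for key in keys if key.endswith(suffix)]


def build_sample_listing(keys):
    """Build a tiny stand-in for a bucket listing (no network call involved)."""
    root = ET.Element("{%s}ListBucketResult" % S3_NS)
    for key in keys:
        contents = ET.SubElement(root, "{%s}Contents" % S3_NS)
        ET.SubElement(contents, "{%s}Key" % S3_NS).text = key
    return ET.tostring(root)


# Hypothetical keys, for illustration only.
sample = build_sample_listing(["example/mapper.py", "example/README.txt"])
print(list_keys(sample, suffix=".py"))
```
&lt;/pre&gt;
In practice the XML would come from s3curl (or any signed GET on the bucket), and each returned key would then be fetched with a second request.&lt;br /&gt;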
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: 'trebuchet ms'; color: rgb(51, 51, 51); font-size: 13px; "&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6587737190881164931?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6587737190881164931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/how-i-got-rolling-in-cloud.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6587737190881164931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6587737190881164931'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/04/how-i-got-rolling-in-cloud.html' title='How I got rolling in the cloud'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-105304116090866655</id><published>2009-03-27T06:25:00.000-07:00</published><updated>2009-03-27T06:37:37.288-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GPS training'/><title type='text'>Thoughts on GPS data analysis.</title><content type='html'>Notes to self:&lt;div&gt;&lt;ol&gt;&lt;li&gt;I'm collecting craploads of GPS data, delimited by 'event' (i.e. 
a run/ride/ski).&lt;/li&gt;&lt;li&gt;I want to correlate that data to compare metadata associated with that GPS data (heart rate/pace)&lt;/li&gt;&lt;li&gt;I want to do this in a non n^2 fashion.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Some thoughts:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Similar 'events' usually center around the same GPS range. What is the margin of error for that range? &lt;/li&gt;&lt;li&gt;If I can quickly get similar events grouped by center, then I can chunk them up by distance. Every event has a starting point, and even though GPS data will not be identical, distance would be a good way to divide events into similar subsections.&lt;/li&gt;&lt;li&gt;Once I get similar event subsections, I can start comparing (i.e. graphing/analyzing) HR/pace data. &lt;/li&gt;&lt;li&gt;Of course, all of this assumes that I've got a system that displays basic statistics for each event, and I'm not there yet.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-105304116090866655?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/105304116090866655/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/thoughts-on-gps-data-analysis.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/105304116090866655'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/105304116090866655'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/thoughts-on-gps-data-analysis.html' title='Thoughts on GPS data analysis.'/><author><name>Arun 
Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-3088233938525820084</id><published>2009-03-16T20:48:00.000-07:00</published><updated>2009-03-27T06:32:01.297-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Leopard CouchDB'/><title type='text'>On the Couch(DB): Part 1: setup on mac OSX Leopard</title><content type='html'>&lt;div&gt;Why &lt;a href="http://couchdb.apache.org/"&gt;CouchDB&lt;/a&gt;? I'm running into situations where a traditional database is giving me headaches:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(1) At work, I'm seeing people spend more and more time maintaining the database and tweaking the schema behind an ETL application used to normalize input data into a standard format as the data being persisted starts to get into the TB  range.&lt;/div&gt;&lt;div&gt;(2) At home I want to write an app storing my GPS data from my runs and rides, and I have no idea what the metadata around that data is going to be. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The common denominator for both of these situations is that a traditional database schema, with very explicitly defined tables and relations, is a liability, either eventually, as in case 1, or initially, as in case 2.  In either case I'm more likely than not to be wrong with any schema design choices that I make. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In both situations, a more flexible approach would be to treat each discrete set of data as a bag of attributes (key value pairs) with  a couple of unique keys that  I can search on. 
I could see creating a Lucene or a BDB with those keys and a pointer to a file containing the rest of the data. This would work pretty well, until it was time to go to production or scale, or both. When I go to production, I want the data to be highly available. When I scale, I want the data to be partitionable. To do either I would have to extend my naive hashing scheme. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CouchDB is a document based database -- in other words each document is an attribute bag identified by 1..N keys -- that is fault tolerant and distributable, with incremental replication. In other words it is the extension my naive hashing scheme needs to &lt;a href="http://www.imdb.com/title/tt0032910/"&gt;be a real boy&lt;/a&gt; instead of a puppet. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Enough hype and bloviation...I need to install couchdb on a macbook pro running leopard. I followed the instructions found &lt;a href="http://blog.deadinkvinyl.com/2008/07/12/couchdb-on-macosx-leopard/"&gt;here&lt;/a&gt;, and quickly discovered I had neglected to &lt;a href="http://developer.apple.com/technology/Xcode.html"&gt;install XCode&lt;/a&gt; when I had upgraded to Leopard. Doh! &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;btw, if you ever do this, you'll find that macports has downloaded and staged components into /opt/local/var/macports/build, and may not be able to build them successfully after failure due to lack of compilers, etc. Removing the staged directories for the components that were downloaded is the best way to 'reset' macports. Trying to configure, build, and install macported components yourself is not :)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I did run into a minor bump: tcl was installed at 8.4.14_0 on my box, I needed to upgrade to 8.5.6_0 because sudo crapped out trying to build tk. 
I did this by hand:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;sudo port install tcl --version 8.5&lt;/div&gt;&lt;div&gt;sudo port deactivate tcl @8.4.14_0&lt;/div&gt;&lt;div&gt;sudo port activate tcl @8.5.6_0 (using sudo port list | grep tcl to get exact version numbers)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;OK, with XCode  and dependencies installed, I then followed the rest of the instructions, which, beyond building and installing couchdb, show you how to create the required couchdb system acct, install it as a service, and launch it at startup. All Leopard specific, but all highly convenient. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I also found this &lt;a href="http://jan.prima.de/~jan/plok/archives/142-CouchDBX-Revival.html"&gt;one step install&lt;/a&gt; -- good for getting up and going, but I think in the long run I want to build and understand couchdb a little more than this approach lets me.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next up: simple couchdb access.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-3088233938525820084?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/3088233938525820084/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/on-couchdb-part-1-setup-on-mac-osx.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3088233938525820084'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/3088233938525820084'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/on-couchdb-part-1-setup-on-mac-osx.html' title='On the Couch(DB): Part 1: setup on mac OSX Leopard'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-624927366886624029</id><published>2009-03-15T13:30:00.000-07:00</published><updated>2009-03-27T06:32:19.692-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='training running'/><title type='text'>Intervals -- why am I doing this again?</title><content type='html'>Today I got bitch slapped by reality, or actually the weather, which is one variant of reality. I was planning to do a long run of 16-18 miles, mixed in with ten 1-mile intervals. It was raining slightly, or 'misting' as we like to say around here (in Seattle we have many different terms for rain), nothing too hard, but it was 35 degrees, so I put on a shell instead of a tech tee, strapped my headlamp on (thanks to Daylight Savings, it is now pitch black at 6:00 AM), and headed out, full of ambition. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I had downloaded the day's workout into my Garmin. 10x1 mile at 8:40. I felt a little slow going out, but chalked it up to the disoriented feeling I was getting running in the rain, in the dark. It is fairly surreal -- my headlamp lights up a small cone of streaking raindrops, and very little else, so I start to zone out (and slow down).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When I started the intervals, I knew something was wrong. 
My get up and go had got up and went. And it was raining harder. No more mist, more like pseudo hail. Then snow, big fat flakes that chilled my face to numb as they fell and stuck.  Still, I soldiered on, trying to maintain a pace that seemed pretty achievable from the glow of my laptop that morning, but was proving to be a painful challenge on the hilly Mercer Island loop. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A younger and less intelligent me would have ground myself down into  a little nub trying to do the intervals, but I'm older, and less inclined to hurt for no reason, so I quit. I turned around, changed my plans from 18 miles to 11, and packed it in. I ran a couple of intervals on the way back, but by then my legs were so frozen that it felt like I was running through (very cold) molasses. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Later, in the shower, I analyzed the workout. Here's the thing. I know I'm fairly motivated. I know that I'm working hard. My heart rate told me so. And that ended up being the key.  Why am I trying to maintain some arbitrary pace up and down a hilly course? I should be running these intervals by heart rate -- by setting a target to stay above -- or below, depending on the goal of the interval -- I'll run at a specific effort, and the pace will be what it is. This feels a lot more real than trying to do intervals at X:30/mile 'because so and so said to', or because X:30 means I'll run a marathon in Y:45. Those are completely arbitrary, completely false goals, and I will crash and burn trying to make them work. 
On the other hand, I firmly believe that if I can run to a specific heart rate, I will see improvements in speed, because I will see improvements in efficiency.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-624927366886624029?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/624927366886624029/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/intervals-why-am-i-doing-this-again.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/624927366886624029'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/624927366886624029'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/intervals-why-am-i-doing-this-again.html' title='Intervals -- why am I doing this again?'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7050807074233303597</id><published>2009-03-13T21:21:00.000-07:00</published><updated>2009-03-27T06:32:38.926-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='training running'/><title type='text'>Old Guy Training</title><content type='html'>This 
whole marathon training thing has brought me face to face with my advancing years.&lt;br /&gt;The last time I seriously trained for running was when I was 30, training for my first 1/2 Marathon. I had never run 13 miles before and as a result I prepared as well as I could, which meant that I went out, ran as fast as I could for as long as I could, took the next day off, and did it again.&lt;br /&gt;&lt;br /&gt;This time around that approach really doesn't work. I just can't recover between efforts, even with a day off. Maybe it's the kids, maybe it's the job, maybe it's the additional 10 years. But after a long run or a speed workout I'm cooked for at least a day if not two.  However, I'm running much longer than I ever used to, and at least as fast.&lt;br /&gt;&lt;br /&gt;Which reminds me of my favorite 'old guy' story. Actually it's an 'old bull' story. There are two bulls, one young, one old, sitting at the top of a hill, overlooking a pasture, staring down at a bunch of attractive (to bulls, anyway) cows. The young bull says to the older one: "Hey Pops, let's run down the hill as fast as we can and fuck a cow!". The old bull turns, looks at the younger one and says "Hey kid, let's walk down and fuck 'em all."  I wish I had been this intelligent about training earlier, when I actually had the ability to recover quickly from a hard workout. But youth is definitely wasted on the young, and I can't turn back the clock.&lt;br /&gt;&lt;br /&gt;These days I'm running one long day, of which 1/2 to 3/4 is at marathon pace, and one fast day, consisting of mile splits at 1-2 minutes faster than marathon pace. Both runs include stretches of 'Gallow-walking', as recommended by Jeff Galloway. I usually walk until my pulse drops to &lt;&gt; 15 miles, the other one will be around 8-9 miles, and the easy runs all under 5.&lt;br /&gt;&lt;br /&gt;I'm hoping that the speed work as well as the marathon pace runs on the longer days will effectively get me faster. 
I definitely feel like speed is the last thing to come around. The other stuff, like stretching, climbing, pullups and squats, really helps balance out the pounding from the running. As the days get longer I'll probably skip one of the runs and get in a medium length bike ride.&lt;br /&gt;&lt;br /&gt;One benefit of being 40 is simply 10 more years of experience. In the last 10 years I have grown a lot as a person. I've become a father, I've lost my father to cancer, and my entire perspective has shifted from 'what can I do for fun next?' to 'what can I do for my family?'. The additional responsibility gives me the perspective to not take training so seriously. At the same time I am able to fully commit to training when I am doing it, because my time is so limited. Any pretensions of actually being 'elite' have disappeared completely. I'm doing this because I like it, not because I'm good at it. I can honestly say that I enjoy running more than I ever did before, because it gives me 2-3+ hours of silence in which I can work things out.&lt;br /&gt;&lt;br /&gt;Last week my long run was a relatively flat 18 miles -- in freezing hail and rain. I froze parts of my body I never want to freeze again. However the run itself was a great confidence booster -- mostly due to the weather. Training has been a constant adventure, and pushing these distances that I've never done before has been a lot of fun.&lt;br /&gt;&lt;br /&gt;The marathon course was published online yesterday, and it actually looks kind of hilly. So that, plus the fact that we're going to Disneyland next week, makes me want to 'burn one down' on Sunday, maybe a good 18 miles with hills. 
An easy run on Monday, climbing on Tuesday, and speedwork on Wednesday prior to leaving for CA will probably set me up just right to take it easy and recover while in the land of Mickey Mouse.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7050807074233303597?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7050807074233303597/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/old-guy-training.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7050807074233303597'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7050807074233303597'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/old-guy-training.html' title='Old Guy Training'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2441670976682067733</id><published>2009-03-12T11:32:00.000-07:00</published><updated>2009-03-27T06:33:08.053-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails ubuntu centos migration'/><title type='text'>Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 4: Trying not to get impaled by Vlad</title><content type='html'>This is a continuation from &lt;a 
href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_3583.html"&gt;Part 3&lt;/a&gt;:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I've &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to.html"&gt;installed mod rails&lt;/a&gt;&lt;/li&gt;&lt;li&gt;I've &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_12.html"&gt;built and migrated the database&lt;/a&gt; (and enabled script/console)&lt;/li&gt;&lt;li&gt;I've &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_3583.html"&gt;built and installed RRDTools&lt;/a&gt;.&lt;/li&gt;&lt;/ol&gt;In this part I get &lt;a href="http://rubyhitsquad.com/Vlad_the_Deployer.html"&gt;Vlad the Deployer&lt;/a&gt; working.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Local Machine Changes&lt;/span&gt;&lt;br /&gt;(1) In my local /config/deploy.rb I changed the :domain value to point to my new box:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;set :domain, "xen-5.evri.corp"&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(2) I tried to run vlad:setup. My remote install commands, run via SSH, were not being found.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Server Machine Changes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem with ssh seems to be that my user environment is not being established with a non-interactive terminal. Modifying /etc/profile, files in /etc/profile.d, or ~/.bashrc made no difference, even though that is &lt;a href="http://kleeschulte.blogspot.com/2008/06/how-environment-variables-really-work.html"&gt;how I understood non-interactive terminals get their env variables&lt;/a&gt;. The problem was that the default sshd_config does not allow user environment variables to be set via sshd. 
So in /etc/ssh/sshd_config, change&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;# PermitUserEnvironment no&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;to&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;PermitUserEnvironment yes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I reran vlad:setup&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-style: italic;"&gt;rake vlad:setup&lt;/span&gt; &lt;span style="font-style: italic;"&gt; rake full_vlad  (I've modified my rakefile as follows:&lt;/span&gt;  &lt;span style="font-style: italic;"&gt; task :full_vlad =&gt;['vlad:update','vlad:migrate'...all other tasks I need to do ])&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt; &lt;/span&gt;&lt;/span&gt;  As a final step, I asked OPS to map the old server name "dashboard" to the new server.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Are We There Yet?&lt;/span&gt;&lt;br /&gt;No.&lt;br /&gt;&lt;br /&gt;I run a set of ruby cron jobs on the machine (TODO: migrate these to use DaemonSpawn??). I migrate the cron settings over by&lt;br /&gt;&lt;br /&gt;crontab -l &gt; cron&lt;br /&gt;scp cron to new machine&lt;br /&gt;crontab cron&lt;br /&gt;&lt;br /&gt;Now I run rake full_vlad, and the server deploys after a full source update and database migration.&lt;br /&gt;&lt;br /&gt;Unfortunately, my graphs are not showing any data. wtf?&lt;br /&gt;&lt;br /&gt;I call rrdtool via IO.popen to extract my data from rrd files.  On the centos box, IO.popen is not returning a readable IO object. I cannot repro with script/console from the production box (i.e. that gives me valid IO).&lt;br /&gt;&lt;br /&gt;Some sleep and investigation reveal that the environment variables for my user are not being set when Ruby spawns a process. This, again, is unique to centos. 
So I hardcode the path of rrdtool in my IO.popen call, and am now receiving data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-2441670976682067733?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2441670976682067733/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_4994.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2441670976682067733'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2441670976682067733'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_4994.html' title='Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 4: Trying not to get impaled by Vlad'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7392815655645574538</id><published>2009-03-12T10:53:00.000-07:00</published><updated>2009-03-27T06:33:28.582-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails ubuntu centos migration rrd'/><title type='text'>Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 3: Installing RRD</title><content type='html'>Continued from &lt;a 
href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_12.html"&gt;Part 2&lt;/a&gt;:&lt;br /&gt;I've installed RRD before. I followed these &lt;a href="http://arunxjacob.wordpress.com/2008/06/19/rrdtool-on-ubuntu-704-feisty/"&gt;instructions&lt;/a&gt;, and took the following additional steps to install development headers for pango and libxml2 (missing from default install of centos)&lt;br /&gt;&lt;br /&gt;sudo yum install libxml2-devel&lt;br /&gt;&lt;br /&gt;sudo yum install pango-devel&lt;br /&gt;&lt;br /&gt;In the instructions above, I downloaded rrdtool source. Next I build and install rrdtool&lt;br /&gt;(1) ran make in rrdtool dir&lt;br /&gt;(2) sudo make install to install in /usr/local/rrdtool-1-3.6&lt;br /&gt;(3) modify the path in /etc/profile to include /usr/local/rrdtool-1-3.6/bin&lt;br /&gt;&lt;br /&gt;Next up, the home stretch, &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_4994.html"&gt;Impaling myself with Vlad&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7392815655645574538?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7392815655645574538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_3583.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7392815655645574538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7392815655645574538'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_3583.html' title='Migrating an RoR app 
from Ubuntu Feisty  to CentOS 5.2 Part 3: Installing RRD'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2126922885183378606</id><published>2009-03-12T10:16:00.000-07:00</published><updated>2009-03-27T06:33:47.921-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails ubuntu centos migration postgresql'/><title type='text'>Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 2: Database Migration</title><content type='html'>Part 2 in a series of 4&lt;br /&gt;&lt;br /&gt;In &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to.html"&gt;Part 1&lt;/a&gt;, I got Phusion Passenger up and running. Now I needed to migrate my database schema across. 
I use Postgres 8.3, which was not installed on the centos box.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Installing Postgres&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;I need to make sure postgres is (a) installed and (b) configured.&lt;br /&gt;&lt;br /&gt;(a) installing postgres: following steps outlined &lt;a href="http://yum.pgsqlrpms.org/howtoyum.php"&gt;here&lt;/a&gt;.&lt;br /&gt;(b) configuring my database and user: my database settings from database.yml are:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;"&gt;development:&lt;br /&gt;adapter: postgresql&lt;br /&gt;host: localhost&lt;br /&gt;database: deploy_monitor_development&lt;br /&gt;username: deploy_monitor&lt;br /&gt;password: deploy_monitor&lt;br /&gt;&lt;br /&gt;test:&lt;br /&gt;adapter: postgresql&lt;br /&gt;host: localhost&lt;br /&gt;database: deploy_monitor_test&lt;br /&gt;username: deploy_monitor&lt;br /&gt;password: deploy_monitor&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;production:&lt;br /&gt;adapter: postgresql&lt;br /&gt;host: localhost&lt;br /&gt;database: deploy_monitor&lt;br /&gt;username: deploy_monitor&lt;br /&gt;password: deploy_monitor&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Because this is a production box, I only need to configure the production database, meaning I only need to add the production user deploy_monitor and the database deploy_monitor.&lt;br /&gt;&lt;br /&gt;I followed the instructions from the postgresql site to &lt;a href="http://www.postgresql.org/docs/8.3/interactive/sql-createuser.html"&gt;add a user&lt;/a&gt; and &lt;a href="http://www.postgresql.org/docs/8.3/interactive/sql-createdatabase.html"&gt;create a database&lt;/a&gt;.&lt;br /&gt;I also needed to install the following gems:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;postgres (0.7.9.2008.01.28)&lt;/li&gt;&lt;li&gt;postgres-pr (0.5.1)&lt;/li&gt;&lt;/ul&gt;to enable Rails to connect to Postgres.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: 
bold;font-size:130%;"&gt;Console Access&lt;/span&gt;&lt;br /&gt;A lot of times I need to log into the box and check something via script/console, i.e. to run an ActiveRecord query.&lt;br /&gt;In order to run console, I need to build &lt;a href="http://tiswww.case.edu/php/chet/readline/rltop.html"&gt;readline&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;(1) Install readline and readline-devel&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;"&gt;yum install readline&lt;br /&gt;yum install readline-devel&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;(2) build/install ruby readline bindings&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;"&gt;cd {ruby src}/ext/readline&lt;br /&gt;ruby extconf.rb&lt;br /&gt;make&lt;br /&gt;sudo make install&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Continued: &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_3583.html"&gt;Part 3, Installing RRD&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-2126922885183378606?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2126922885183378606/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_12.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2126922885183378606'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2126922885183378606'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_12.html' title='Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 2: Database 
Migration'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4609375956715125321</id><published>2009-03-03T13:54:00.000-08:00</published><updated>2009-03-27T06:34:18.182-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails ubuntu centos migration mod_rails'/><title type='text'>Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 1: Setting up mod_rails</title><content type='html'>Notes from  migrating an app from an overloaded Ubuntu server box to a new, virgin centOS box. The app is a Rails app, with a postgres db, served up via mod_rails (passenger). The destination box has apache2 and Ruby 1.8.7 installed.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;My app uses phusion passenger, which is fairly easy to &lt;a href="http://www.modrails.com/documentation/Users%20guide.html#_installing_phusion_passenger"&gt;install&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Steps: &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;(1)  sudo gem install passenger&lt;br /&gt;(2) I then needed to generate the apache mod for passenger.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;sudo passenger-install-apache-2-module&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This step told me the following:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;* To install GNU C++ compiler:&lt;/span&gt; &lt;span style="font-style: italic;"&gt; Please run yum install gcc-c++ as root.&lt;/span&gt;  &lt;span style="font-style: italic;"&gt;* To install OpenSSL support for Ruby:&lt;/span&gt; &lt;span style="font-style: italic;"&gt; Please 
(re)install Ruby with OpenSSL support by downloading it from http://www.ruby-lang.org/.&lt;/span&gt;  &lt;span style="font-style: italic;"&gt;* To install Apache 2 development headers:&lt;/span&gt; &lt;span style="font-style: italic;"&gt; Please run yum install httpd-devel as root.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;Installing gcc-c++&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;sudo yum install gcc-c++, this was painless. &lt;a href="http://www.compwrite.com/index.php/2008/04/13/what-is-yum/"&gt;Yum&lt;/a&gt; is installed by default on the version of centos I was migrating to (5.2)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Enabling Ruby with Open SSL&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Unfortunately the company approved version of Ruby 1.8.7 did not include ext/openssl, so I needed to download, build, and install on my own.&lt;br /&gt;&lt;br /&gt;(1) download source&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;wget ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.7-p72.tar.gz&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(2) untar, cd to extracted dir, and run ./configure&lt;br /&gt;(3) make&lt;br /&gt;&lt;br /&gt;At this point, you usually sudo make install, but I didn't want to re-install ruby, I just wanted to add openssl to an existing installation.&lt;br /&gt;(4) cd extracted dir/ext/openssl&lt;br /&gt;(5) ruby extconf.rb&lt;br /&gt;(6) make&lt;br /&gt;(7) sudo make install&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;Onwards/Upwards&lt;/span&gt;&lt;br /&gt;I re-ran sudo passenger-install-apache-2-module, which completed successfully.  
As instructed, I pasted the following into /etc/httpd/conf.d/passenger.conf, because all files in conf.d are loaded by httpd.conf:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-style: italic;"&gt;LoadModule passenger_module /evri/ruby/lib/ruby/gems/1.8/gems/passenger-2.0.6/ext/apache2/mod_passenger.so&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;PassengerRoot /evri/ruby/lib/ruby/gems/1.8/gems/passenger-2.0.6&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;PassengerRuby /evri/ruby/bin/ruby&lt;/span&gt; &lt;/span&gt;&lt;br /&gt;I then put in the server directives to map my public directory as well as a specified VirtualHost to my app.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;"&gt;&amp;lt;Directory /var/www/rails/dashboard/current/public&amp;gt;&lt;br /&gt;Options FollowSymLinks&lt;br /&gt;AllowOverride none&lt;br /&gt;Order allow,deny&lt;br /&gt;Allow from all&lt;br /&gt;&amp;lt;/Directory&amp;gt;&lt;br /&gt;&lt;br /&gt;&amp;lt;VirtualHost&amp;gt;&lt;br /&gt;ServerAdmin webmaster@localhost&lt;br /&gt;ServerName xen-5&lt;br /&gt;ServerAlias xen-5.local xen-5.evri.corp&lt;br /&gt;DocumentRoot /var/www/rails/dashboard/current/public&lt;br /&gt;ErrorLog /var/www/rails/dashboard/current/log/server.log&lt;br /&gt;&lt;br /&gt;# Possible values include: debug, info, notice, warn, error, crit,&lt;br /&gt;# alert, emerg.&lt;br /&gt;LogLevel debug&lt;br /&gt;&lt;br /&gt;&amp;lt;/VirtualHost&amp;gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;Phusion Passenger aka mod_rails was configured.&lt;br /&gt;&lt;br /&gt;Next: &lt;a href="http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to_12.html"&gt;Database Migration&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4609375956715125321?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4609375956715125321/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4609375956715125321'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4609375956715125321'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/03/migrating-ror-app-from-ubuntu-feisty-to.html' title='Migrating an RoR app from Ubuntu Feisty  to CentOS 5.2 Part 1: Setting up mod_rails'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4569960014151645571</id><published>2009-02-14T21:23:00.000-08:00</published><updated>2009-03-27T06:35:03.109-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Flot JQuery javascript graphing'/><title type='text'>An Open Letter to the Creators of Flot</title><content type='html'>Dear creator(s) of Flot, who I only know from this blurb on your &lt;a href="http://code.google.com/p/flot/"&gt;google code page&lt;/a&gt;:&lt;br /&gt;&lt;p&gt;&lt;a name="Who's_behind_this?"&gt;The development so far has mostly been done by Ole Laursen, sponsored by &lt;/a&gt;&lt;a href="http://www.iola.dk/" 
rel="nofollow"&gt;IOLA&lt;/a&gt;, a small Danish web-development house focusing on Django and jQuery.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;I want to thank you for making such a kick-ass graphing tool. First of all, you have cleanly separated data from code. Isolating the actual graphing call to this method:&lt;/p&gt;&lt;p&gt;$.plot(...)&lt;br /&gt;&lt;/p&gt;&lt;p&gt;really allowed me to focus on what I should be focusing on when using a graphing tool: the data I wanted to present, instead of the mechanism I was going to use to present it.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Second, you have eliminated the need for me to write some kind of JS wrapper around the &lt;a href="http://code.google.com/apis/chart/"&gt;Google Chart API&lt;/a&gt;. I can't tell you how much it means to me, a deeply lazy person, to not have to do work, especially work that seemed peripheral to my main focus.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Third, thank you for basing your engine on &lt;a href="http://docs.jquery.com/"&gt;jQuery&lt;/a&gt;. As a lazy person, I love the way I can do transformational operations on &lt;a class="zem_slink" href="http://en.wikipedia.org/wiki/Document_Object_Model" title="Document Object Model" rel="wikipedia"&gt;DOM&lt;/a&gt; elements by chaining function calls. I'm hardly a JS guru, but jQuery makes me feel like a badass mofo with very little effort on my part. Flot does the same.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Fourth, I really like the way you've made it possible to &lt;a href="http://people.iola.dk/olau/flot/examples/interacting.html"&gt;interact with the data&lt;/a&gt; and &lt;a href="http://people.iola.dk/olau/flot/examples/selection.html"&gt;zoom in&lt;/a&gt;. 
For better or for worse, I've decided to write my own monitoring system, and this is the kind of functionality that gave me migraines, because it is essential to what I'm trying to do, but very time-consuming to get right.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&amp;lt;begin&amp;gt; I decided to write my own monitoring system because the monitoring system we had in place made the silly assumption that software doesn't migrate. Actually, this assumption isn't all that silly; we just happen to move services around from machine to machine, and any system that assumed that services were somewhat statically located (again, a valid assumption in most cases) was S.O.L. in our world. Also, by being smart about when to send data across, a custom monitoring system ends up scaling much better than traditional SNMP-based approaches. &amp;lt;end&amp;gt;&lt;/p&gt;&lt;p&gt; Ahem. Anyways, as the sole implementor of this system, I've been fielding a lot of complaints from both the business team and the devs about the quality of the graphs, the lack of precision, and the lack of interactivity. Being able to receive events on mouseovers and clicks lets me, as &lt;a class="zem_slink" href="http://www.frampton.com/" title="Peter Frampton" rel="homepage"&gt;Peter Frampton&lt;/a&gt; would say, "come alive". It's always nice to be able to exceed expectations.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Finally, Ole and other makers of Flot, thanks for making me look like a UI wizard, even though I'm purely a server-side guy. The fact that I was able to change out graphing engines in a couple of hours says a lot about how easy Flot is to use. You guys made a total pain-in-the-ass operation a lot of fun. 
Flot is a great tool, and I look forward to using it more in my app.&lt;/p&gt;&lt;p&gt;Flot resources:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/flot/"&gt;Home page&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://flot.googlecode.com/svn/trunk/API.txt"&gt;API&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://people.iola.dk/olau/flot/examples/"&gt;Examples&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4569960014151645571?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4569960014151645571/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/02/open-letter-to-creators-of-flot.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4569960014151645571'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4569960014151645571'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/02/open-letter-to-creators-of-flot.html' title='An Open Letter to the Creators of Flot'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-8071925041165131696</id><published>2009-01-30T21:11:00.000-08:00</published><updated>2009-03-27T06:35:32.453-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='training running'/><title type='text'>Marathon, Man!</title><content type='html'>Two weeks ago I turned 40. Usually, I get depressed around my birthdays, but I'm 40, and my knees and back actually feel pretty decent. I'm running 12 miles on my long runs, and while I'm not moving too fast, I am still moving.&lt;br /&gt;&lt;br /&gt;So, I'm thinking it's January, I'm already doing 12 miles, and it feels decent. Maybe it's time to step up and see if I can actually do a marathon...like in May or June. What better way to celebrate turning 40?&lt;br /&gt;&lt;br /&gt;I'm going to sleep on it. Actually, I'm going to see how the next couple of runs go. 
I'm stretching to 13, 14 miles over the next couple of weeks, and if those feel decent, I'll start to work in some speed.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-8071925041165131696?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/8071925041165131696/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2009/01/marathon-man.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8071925041165131696'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8071925041165131696'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2009/01/marathon-man.html' title='Marathon, Man!'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-8634710320920993589</id><published>2008-12-30T10:15:00.000-08:00</published><updated>2009-03-27T06:36:06.620-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='rails activeresource'/><title type='text'>Bulk Resource Uploads via ActiveResource</title><content type='html'>&lt;span style="font-weight: bold;font-size:130%;"&gt;Background:&lt;/span&gt;&lt;br /&gt;I recently had to reduce the across the wire trips for the monitoring app I had hastily thrown together because the amount of time spent making trips serializing and deserializing 
individual resources was beginning to affect monitoring performance. The &lt;a href="http://www.infoq.com/articles/pritchett-latency"&gt;Second Fallacy of Distributed Computing&lt;/a&gt; was beginning to rear its &lt;a href="http://www.dancewithshadows.com/nuts/2008/09/28/putin-rears-his-head-over-alaska-airspace-flash-game/"&gt;ugly Putinesque head&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I knew that this was coming, but &lt;a href="http://en.wikipedia.org/wiki/Optimization_%28computer_science%29" title="Optimization (computer science)" rel="wikipedia" class="zem_slink"&gt;premature optimization&lt;/a&gt; has never worked out for me, so I went with the default &lt;a href="http://api.rubyonrails.org/classes/ActiveResource/Base.html"&gt;ActiveResource&lt;/a&gt; approach -- everything is a resource, and a CRUD operation on a resource maps to the corresponding HTTP 'verb' -- until smoke started pouring out of my servers.&lt;br /&gt;&lt;br /&gt;My basic requirements:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create a web service that can store data for hundreds of individual datapoints at 5-minute intervals.&lt;/li&gt;&lt;li&gt;Those datapoints can come and go.&lt;/li&gt;&lt;li&gt;The implementor of the statistics gathering code really doesn't need to know the over-the-wire details of how their data is getting to my web service.&lt;/li&gt;&lt;/ol&gt;Implied in these requirements is the need for efficiency:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I shouldn't have to perform individual CRUD ops on each statistic every 5 minutes.&lt;/li&gt;&lt;li&gt;I shouldn't have to make an over-the-wire request for data every time I want to read that data.&lt;/li&gt;&lt;/ul&gt;From those implications I arrived at the following distilled technical requirements:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I need to bulk upload statistics, and create/update them in one transaction in order to reduce the need for individual CRUD ops. 
At this point I'm going to choose to '&lt;a href="http://en.wikipedia.org/wiki/Fail-fast" title="Fail-fast" rel="wikipedia" class="zem_slink"&gt;fail fast&lt;/a&gt;', aborting if a single create/update fails, so that I know if something is wrong.&lt;/li&gt;&lt;li&gt;I need to keep a client-side cache of those statistics around, only updating them when they've changed &lt;span style="font-style: italic;"&gt;(important aside: because this is a monitoring application, it is assumed that each statistic belongs to a single client, so there is no need for out-of-band updates)&lt;/span&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span style="font-weight: bold;font-size:130%;"&gt;The Juicy Bits&lt;/span&gt;&lt;br /&gt;I'd love to go into a long digression about how I explored every which way to do this, but I'll summarize by saying that my final solution had the following advantages:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Uses the existing ActiveResource custom method infrastructure&lt;/li&gt;&lt;li&gt;No custom routes need to be defined&lt;/li&gt;&lt;li&gt;Complexity is hidden from the user, restricted to the client-side upload_statistics call and the server-side POST handler method.&lt;/li&gt;&lt;li&gt;The priesthood of High REST will not need to crucify me at the side of the road.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;font-size:100%;"&gt;ActiveResource extension:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I needed to extend the default ActiveResource (AR). By default, AR is not aware of data model relationships. For example, invoking the &lt;span style="font-family:courier new;"&gt;to_xml&lt;/span&gt; method on an AR class only shows its attributes, even if you specify other classes to include, like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-family:courier new;font-size:85%;"&gt;ARDerivedClass.to_xml(:include=&amp;gt;[childClass])&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This limitation makes being smart about bulk updates pretty hard. 
I needed to introduce the notion of a client-side cache, initialized and synchronized as needed.&lt;br /&gt;&lt;br /&gt;My data model looks roughly like this:&lt;br /&gt;&lt;br /&gt;Monitor=&amp;gt;has many=&amp;gt;Statistics&lt;br /&gt;&lt;br /&gt;The default AR implementation of this looks like&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;class Statistic &amp;lt; ActiveResource::Base&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;I've extended it as follows:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Implemented an add_statistic method on Monitor that caches Statistic objects locally.&lt;/li&gt;&lt;li&gt;Added an upload_statistics method to Monitor that serializes the client-local statistics and then sends them to the server.&lt;/li&gt;&lt;li&gt;Modified the default POST handler for Statistic objects to handle bulk creates/updates.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Initially loaded the statistics cache on the client side.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Lazily synced the cache to the server side, updating on find and delete requests.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Client and Server code by Operation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I want to point out a couple of things in this code:&lt;br /&gt;&lt;br /&gt;(1) Cache loading is done in Monitor.initialize(). 
That way it gets called whether the client is retrieving or creating a Monitor.&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-weight: bold;"&gt;  &lt;/span&gt;&lt;span style="font-style: italic;"&gt;def initialize(attributes = {}, logger = Logger.new(STDOUT))&lt;br /&gt;if(@@logger == nil)&lt;br /&gt; @@logger = logger&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;# client-side cache of Statistic objects, keyed by statistic name&lt;br /&gt;@statistics = {}&lt;br /&gt;&lt;br /&gt;if(attributes["statistics"] != nil)&lt;br /&gt; &lt;span style="color: rgb(255, 0, 0);"&gt;attributes["statistics"].each do | single_stat_attributes|&lt;/span&gt;&lt;br /&gt;   @@logger.debug("loading #{single_stat_attributes["name"]}")&lt;br /&gt;   &lt;span style="color: rgb(255, 0, 0);"&gt;@statistics[single_stat_attributes["name"]] = Statistic.new(single_stat_attributes)&lt;/span&gt;&lt;br /&gt; end&lt;br /&gt;&lt;br /&gt;end&lt;br /&gt;super(attributes)&lt;br /&gt;end&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;This required the following modification on the Monitor controller (server) side:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;def index&lt;br /&gt;&lt;br /&gt;if(params[:name] == nil)&lt;br /&gt;@monitor_instances = Monitor.find(:all)&lt;br /&gt;else&lt;br /&gt;@monitor_instances = Monitor.find_all_by_name(params[:name])&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;respond_to do |format|&lt;br /&gt;format.html  #index.html.erb&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);font-size:100%;"&gt;format.xml  { render :xml =&amp;gt; @monitor_instances.to_xml(:include=&amp;gt;[:statistics]) }&lt;/span&gt;&lt;span style="color: rgb(255, 0, 0);font-size:100%;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="color: rgb(255, 0, 0);font-size:100%;"&gt;           format.json  { render :json =&amp;gt; @monitor_instances.to_json(:include=&amp;gt;[:statistics])}&lt;/span&gt;&lt;br /&gt;&lt;span style="color: rgb(0, 0, 0);"&gt;      &lt;/span&gt;end&lt;br /&gt;end&lt;/span&gt;&lt;span style="font-weight: 
bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;I needed to make sure that returned monitor instances included child statistics in order to load the client-side cache.&lt;br /&gt;(2) get_statistic and delete_statistic synchronize with the server side.&lt;br /&gt;(3) I've added a new upload_statistics method. I wanted to override save, but what I found at runtime is that the ActiveResource save method calls update, which loads statistics as attributes. This won't work for us because some of those attributes may not exist on the server side, so an 'update' operation is invalid. In upload_statistics, a &lt;a href="http://api.rubyonrails.org/classes/ActiveResource/CustomMethods.html#M001225"&gt;custom AR method&lt;/a&gt; posts the client-side cache of statistics to the StatisticsController on the server side:&lt;br /&gt;&lt;br /&gt;&lt;code style="font-style: italic;"&gt;&lt;span&gt;def upload_statistics&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;if(@statistics.length &amp;gt; 0)&lt;br /&gt;data = @statistics.to_xml&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;self.post(:statistics,{:bulk=&amp;gt;"true"},data)&lt;/span&gt;&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;end&lt;/span&gt;&lt;/code&gt;&lt;br /&gt;&lt;br /&gt;Note that the first parameter is the method name, the second is the param options, and the third is the actual post data (the serialized client-side map of the statistics). 
The actual path that this POST gets sent to is /monitor_instances/:id/statistics.xml&lt;br /&gt;&lt;br /&gt;On the server, I do not have to add/create any new routes, but I do need to make sure that the default POST handler checks for the bulk parameter and handles it accordingly.&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;# POST /statistics&lt;br /&gt;# POST /statistics.xml&lt;br /&gt;def create&lt;br /&gt;&lt;br /&gt;if(params[:bulk] == nil)&lt;br /&gt;# handle a single update&lt;br /&gt;else&lt;br /&gt;# handle a bulk update&lt;br /&gt;end&lt;br /&gt;end&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;&lt;span style="font-weight: bold;"&gt;Unmarshalling and Saving Stats on the Server Side&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;In the StatisticsController create handler, I need to unmarshal the XML into statistics. There are &lt;a href="http://www.xcombinator.com/2008/08/11/activerecord-from_xml-and-from_json-part-2/"&gt;these instructions&lt;/a&gt; to extend ActiveRecord via the standard lib/extensions.rb mechanism, but they won't work for me because I'm serializing a hash, not an array of Statistic objects. 
So I need to deserialize and create/update objects by 'hand', which actually isn't that hard:&lt;br /&gt;&lt;code&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-style: italic;"&gt;cmd = request.raw_post&lt;br /&gt;monitor_instance = MonitorInstance.find(params[:monitor_instance_id])&lt;br /&gt;logger.debug(cmd)&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;hash =  Hash.from_xml(cmd)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(255, 0, 0);"&gt;       hash["hash"].values&lt;/span&gt;.each do | options |&lt;br /&gt;# bind values with ? placeholders instead of interpolating request data into the SQL&lt;br /&gt;stat = Statistic.find(:first,&lt;br /&gt;:conditions=&amp;gt;["monitor_instance_id = ? and name = ?", params[:monitor_instance_id], options["name"]])&lt;br /&gt;&lt;br /&gt;if(stat == nil)&lt;br /&gt; # create a new Statistic object&lt;br /&gt;else&lt;br /&gt; # update existing statistic object&lt;br /&gt;end&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;respond_to do |format|&lt;br /&gt;statistics = Statistic.find(:all,&lt;br /&gt;:conditions=&amp;gt;["monitor_instance_id = ?", params[:monitor_instance_id]])&lt;br /&gt;format.xml  { render :xml =&amp;gt; statistics.to_xml, :status =&amp;gt; :created, :location =&amp;gt; monitor_instance_path(monitor_instance) }&lt;br /&gt;end&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;/code&gt;In the code above, I deserialize the XML payload using Hash.from_xml, which wraps the hash encoded in the XML data in an outer hash under the key "hash".&lt;br /&gt;&lt;br /&gt;To get to the original hash of statistics options, I had to extract them from that outer hash:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-size:85%;"&gt;hash = Hash.from_xml(cmd)&lt;br /&gt;hash["hash"].values.each do | options |&lt;br /&gt;# create / update the stat that corresponds to options["name"] under the monitor&lt;br /&gt;end&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;code style="color: rgb(0, 0, 0);"&gt;&lt;span style="color: rgb(0, 0, 
0);font-size:100%;"&gt;&lt;/span&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/code&gt;&lt;span style="font-weight: bold;"&gt;Summary&lt;/span&gt;&lt;br /&gt;This took a lot longer than expected, because I ran into issues with trying to use standard methods, i.e. save, that I still don't understand. However, I know a lot more about AR and how to extend it to do more intelligent sub resource handling.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-8634710320920993589?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/8634710320920993589/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/bulk-restful-uploads-via-activeresource.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8634710320920993589'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8634710320920993589'/><link rel='alternate' type='text/html' 
href='http://arunxjacob.blogspot.com/2008/12/bulk-restful-uploads-via-activeresource.html' title='Bulk Resource Uploads via ActiveResource'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-242473542190710983</id><published>2008-12-21T08:48:00.001-08:00</published><updated>2008-12-21T08:53:05.360-08:00</updated><title type='text'>Best advertainment webisode ever.</title><content type='html'>This one made me laugh so hard I pulled something.&lt;br /&gt;&lt;br /&gt;http://bewareofthedoghouse.com/VideoPage.aspx&lt;br /&gt;&lt;br /&gt;If this is the future of ads, I'm hooked!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-242473542190710983?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/242473542190710983/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/best-xmas-video-ever.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/242473542190710983'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/242473542190710983'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/best-xmas-video-ever.html' title='Best advertainment webisode ever.'/><author><name>Arun 
Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1130385562395704948</id><published>2008-12-18T20:04:00.000-08:00</published><updated>2008-12-18T21:12:41.193-08:00</updated><title type='text'>Sinatra, my new favorite prototype playground</title><content type='html'>About a week ago I was trying to get something ready for the &lt;a href="http://blog.evri.com/index.php/2008/12/17/evri-holiday-hackathon-yields-twitter-widget/"&gt;first annual Evri Hack-a-thon&lt;/a&gt;, a concentrated 2 day affair where we focused on putting together cool apps with the new &lt;a href="http://www.evri.com/developer/rest/index.html"&gt;Evri API&lt;/a&gt;. The event was a blast; I for one rediscovered how &lt;a href="http://twitter.com/evri/statuses/1054469646"&gt;fun&lt;/a&gt; writing code for code's sake really is.&lt;br /&gt;&lt;br /&gt;I was implementing a 'music browser' mostly in JavaScript, and needed a proxy server to make calls out to those services that didn't have &lt;a href="http://ajaxian.com/archives/jsonp-json-with-padding"&gt;JSONP&lt;/a&gt; support.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Slight digression here. JSONP is the coolest thing since sliced bread. I say that as someone who loves bread, even more so when it is sliced. The ability to retrieve data w/o a backend is so powerful I _almost_ understand why it's been seen as a &lt;a href="http://unclehulka.com/ryan/blog/archives/2005/12/12/jsonpyoure-joking-right/"&gt;Terrible, Horrible, No Good Hack&lt;/a&gt;. 
But not really, because it makes life as a developer so much easier.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;I wanted to spend most of my time in the JavaScript, not futzing with the backend server. Because I've been mostly coding in Ruby for the last year, that ruled out rolling up a quick Java Servlet -- I didn't want to spend any time installing Tomcat/Jetty and associated jars, and having to remember how that world worked. I also didn't want to write a Rails app -- that seemed ridiculous when I didn't have a data model.&lt;br /&gt;&lt;br /&gt;I looked around at a couple of lightweight Ruby frameworks, like &lt;a href="http://code.whytheluckystiff.net/camping/"&gt;Camping&lt;/a&gt; and &lt;a href="http://merbivore.com/"&gt;Merb&lt;/a&gt;. Camping would have required me to downgrade to Ruby 1.8.5, and Merb overwhelmed me with the volume of configuration choices. In other words, my ideal proxy server had to be stone cold simple because I simply didn't have the time for anything else.&lt;br /&gt;&lt;br /&gt;Enter &lt;a href="http://sinatra.rubyforge.org/"&gt;Sinatra&lt;/a&gt;. Elegant, concise, and witty, just like its &lt;a href="http://www.evri.com/person/frank-sinatra-0x2018d.html"&gt;namesake&lt;/a&gt;. Here is how you configure a path to &lt;span style="font-weight: bold; font-style: italic;"&gt;/json/getjswidgets &lt;/span&gt;in Sinatra:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;get "/json/getjswidgets" do&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  cb = params[:callback]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  href = params[:href]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  ...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;end&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;A couple of things to note in the example above:&lt;br /&gt;(1) params are retrieved with the params hash, just like in Rails. 
So this method was actually called as:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;/json/getjswidgets?callback={temp callback name}&amp;amp;href={some value}&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;(2) all paths are handled with the same 'get...do...end' syntax. It's that simple.&lt;br /&gt;&lt;br /&gt;Another example:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;get "/json/artists/:name/album" do&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  cb = params[:callback]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  name = params[:name]&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;  ....&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;end&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Note that the name parameter is embedded in the path, just as you would do in the standard routes.rb file in Rails.&lt;br /&gt;&lt;br /&gt;Once you get past the path routing (which takes about as long as it takes to read this sentence), Sinatra continues to be blissfully easy by allowing you to render the view via &lt;a href="http://www.ruby-doc.org/stdlib/libdoc/erb/rdoc/classes/ERB.html"&gt;erb&lt;/a&gt;, &lt;a href="http://builder.rubyforge.org/"&gt;builder&lt;/a&gt;, &lt;a href="http://haml.hamptoncatlin.com/"&gt;haml&lt;/a&gt;, and &lt;a href="http://haml.hamptoncatlin.com/docs/rdoc/classes/Sass.html"&gt;sass&lt;/a&gt;. 
You can render the view inline, or modularize it by putting the files in a &lt;span style="font-family:courier new;"&gt;views&lt;/span&gt; directory.&lt;br /&gt;&lt;br /&gt;Helper methods are defined in a helpers block:&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;helpers do&lt;br /&gt;def helper_method&lt;br /&gt;   ...&lt;br /&gt;end&lt;br /&gt;&lt;br /&gt;...&lt;br /&gt;end&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;Static assets are kept in a public directory -- again, Sinatra takes an "if it ain't broke..." approach that really minimizes the learning curve. Normally, I loathe the whole &lt;a href="http://www.imdb.com/title/tt0088258/quotes"&gt;"But Ours Go To Eleven!"&lt;/a&gt; mindset that I see in frameworks because it means that I have to once again learn another unique set of concepts to get anything done. Sinatra does the exact opposite in leveraging a well known, well used, well understood set of conventions/concepts from Rails while stripping the concept of a framework down to that which is as &lt;a href="http://www.quotedb.com/quotes/1360"&gt;simple as possible, but not simpler&lt;/a&gt;. 
Sinatra, you're my new &lt;a href="http://www.urbandictionary.com/define.php?term=bff"&gt;BFF&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1130385562395704948?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1130385562395704948/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/sinatra-my-new-favorite-prototype.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1130385562395704948'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1130385562395704948'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/sinatra-my-new-favorite-prototype.html' title='Sinatra, my new favorite prototype playground'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-6106746121886092493</id><published>2008-12-07T07:16:00.000-08:00</published><updated>2008-12-07T23:17:40.165-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Commuting'/><category scheme='http://www.blogger.com/atom/ns#' term='Singlepspeed'/><category scheme='http://www.blogger.com/atom/ns#' term='Cycling'/><title type='text'>Converting to a Single 
Speed</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_NqxvfgwIOvA/STv7W3khBLI/AAAAAAAACSE/2iJyMBRpNgg/s1600-h/DSCN6880.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 150px;" src="http://2.bp.blogspot.com/_NqxvfgwIOvA/STv7W3khBLI/AAAAAAAACSE/2iJyMBRpNgg/s200/DSCN6880.jpg" alt="" id="BLOGGER_PHOTO_ID_5277087758687470770" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Why Single Speed? A combination of 'luck' and timing has led me to re-rig my commuter bike as a single speed. The 'luck' part was a pulley on my circa 1995 rear derailleur exploding. The timing part is the rise of single speeds in general. I've been noticing the rise in single speeds in the last couple of years as a bike commuter. They look so....simple and maintenance free!&lt;br /&gt;&lt;br /&gt;I'm riding a 14-year-old &lt;a href="http://www.mtbr.com/cat/older-categories-bikes/bike/kona/kula/PRD_348991_91crx.aspx"&gt;Kona Kula&lt;/a&gt;, once my singletrack steed, now my urban commuting stalwart. The key thing about converting a bike with vertical dropouts into a single speed is that you can't slide the rear wheel back and forth to get the perfect chain tension. You need a chain tensioner. A chain tensioner is like a derailleur-lite that pulls the chain taut. There are several brands out there, all of which make a simple, bulletproof device.&lt;br /&gt;&lt;br /&gt;The other key thing about converting a standard bike into a singlespeed is what to do with your rear cluster. There are a number of freehub to singlespeed conversion kits out there that provide spacers and cogs to replace your freewheel.&lt;br /&gt;&lt;br /&gt;I ended up choosing the &lt;a href="http://www.performancebike.com/shop/profile.cfm?SKU=23062&amp;amp;subcategory_ID=5132"&gt;Forte Singlespeed conversion kit&lt;/a&gt;, made by the Performance Bicycle house brand. 
This was the only brand I found out there that offered the freehub spacers and cogs, as well as the chain tensioner, for by far the cheapest price -- for $25 I got everything, including 3 cogs to experiment with. Compare that to the &lt;a href="http://www.surlybikes.com/" title="Surly Bikes" rel="homepage" class="zem_slink"&gt;Surly&lt;/a&gt; solution, which was going to cost $50 for the chain tensioner, and $30 for the spacers, and $10+ for the cog.&lt;br /&gt;&lt;br /&gt;It also had what I considered to be a key feature: it allowed me to adjust the horizontal placement of the tensioner. This was important because I had no idea where I would be  placing the cog to line up with the chainring.&lt;br /&gt;&lt;br /&gt;I also wanted to try using an original cog and chainring, since I had replaced them a year ago and they weren't completely beat down yet. The preferred way to go is to do a clean replacement, but that would require a new chain and front chainring, and I wasn't sure that I could find a replacement front chainring without a special order.&lt;br /&gt;&lt;br /&gt;Installation was easy, and gave me a chance to clean my bike for the first time in 6 years!&lt;br /&gt;&lt;br /&gt;Tools required:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;a &lt;a href="http://www.parktool.com/products/category.asp?cat=4"&gt;cassette removal tool&lt;/a&gt; and &lt;a href="http://www.parktool.com/products/detail.asp?cat=4&amp;amp;item=SR-1"&gt;chainwhip&lt;/a&gt; for freewheel removal.&lt;/li&gt;&lt;li&gt;an allen wrench for the usual.&lt;/li&gt;&lt;li&gt;a &lt;a href="http://www.parktool.com/products/detail.asp?cat=26&amp;amp;item=CCP-4"&gt;crank puller&lt;/a&gt; to remove the inner chainring on the triple.&lt;/li&gt;&lt;li&gt;a &lt;a href="http://en.wikipedia.org/wiki/Chain_tool" title="Chain tool" rel="wikipedia" class="zem_slink"&gt;chain tool&lt;/a&gt; to break and resize the chain.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Step 1: remove the chainrings. 
The only way to get the inner chainring out is to &lt;a href="http://www.parktool.com/repair/readhowto.asp?id=103"&gt;remove the crank&lt;/a&gt; from the bike. The optimal position for the new chainring is in the middle position of the triple crank. But the chainring I wanted had 44 teeth and was too big to use in the middle position -- it rubbed the chainstay -- so I had to keep it in the outer position:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_NqxvfgwIOvA/STwEetGxuZI/AAAAAAAACSc/sLSZlF9IYdc/s1600-h/DSCN6881.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 150px;" src="http://2.bp.blogspot.com/_NqxvfgwIOvA/STwEetGxuZI/AAAAAAAACSc/sLSZlF9IYdc/s200/DSCN6881.jpg" alt="" id="BLOGGER_PHOTO_ID_5277097788921985426" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This is where the horizontal adjustability of the Forte chain tensioner became really useful. 
It let me slide the tensioner cog over to the outside with an allen wrench.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_NqxvfgwIOvA/STwFBSST4qI/AAAAAAAACSs/CjD4bOaJt40/s1600-h/DSCN6886.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_NqxvfgwIOvA/STwFBSST4qI/AAAAAAAACSs/CjD4bOaJt40/s200/DSCN6886.jpg" alt="" id="BLOGGER_PHOTO_ID_5277098383018025634" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Step 2: &lt;a href="http://www.parktool.com/repair/readhowto.asp?id=48"&gt;remove the freewheel&lt;/a&gt; using the chainwhip and the freewheel tool.&lt;br /&gt;Step 3: install the chain tensioner where the rear derailleur used to be.&lt;br /&gt;Step 4: position the singlespeed cog -- using spacers to fill up the freehub around that cog -- and the chain tensioner cog so that they are in line with the chainring. This is important. If you don't line things up, the chain will derail. In the picture below, note the spacers around the cog. Because I installed my chainring on the outermost position, I've had to position the cog at the outer end of the freehub (with only one spacer between it and the cassette lockring).&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_NqxvfgwIOvA/STv8I86rmYI/AAAAAAAACSM/z_qu-E0tJEY/s1600-h/DSCN6883.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 150px;" src="http://4.bp.blogspot.com/_NqxvfgwIOvA/STv8I86rmYI/AAAAAAAACSM/z_qu-E0tJEY/s200/DSCN6883.jpg" alt="" id="BLOGGER_PHOTO_ID_5277088619116075394" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Step 5: whip out the chain tool and resize the chain so that the chain tensioner is engaged (i.e. 
it has tension).&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_NqxvfgwIOvA/STv9LUmp6zI/AAAAAAAACSU/8osu2-hVzMA/s1600-h/DSCN6884.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 200px; height: 150px;" src="http://4.bp.blogspot.com/_NqxvfgwIOvA/STv9LUmp6zI/AAAAAAAACSU/8osu2-hVzMA/s200/DSCN6884.jpg" alt="" id="BLOGGER_PHOTO_ID_5277089759345896242" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I used a 16 tooth cog from my old freewheel, and my existing chain. This may not work because that cog was designed to be 'shiftable', and the ramps on the cog body may derail the chain. However, I wanted to give this a try before buying a new chain and front chainring.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-6106746121886092493?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/6106746121886092493/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/converting-to-single-speed.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6106746121886092493'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/6106746121886092493'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/12/converting-to-single-speed.html' title='Converting to a Single Speed'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_NqxvfgwIOvA/STv7W3khBLI/AAAAAAAACSE/2iJyMBRpNgg/s72-c/DSCN6880.jpg' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1839860763867708759</id><published>2008-11-26T13:28:00.000-08:00</published><updated>2008-11-26T22:03:18.203-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='Net::HTTP'/><category scheme='http://www.blogger.com/atom/ns#' term='Basic Authentication'/><title type='text'>Basic Auth over HTTP using Ruby, Net::HTTP</title><content type='html'>I'm writing this one down because it took way too long for me to stumble around it. The &lt;a href="http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html"&gt;Net::HTTP &lt;/a&gt;class provides http transport level page access. Most of the time I use &lt;a href="http://www.ruby-doc.org/core/classes/OpenURI.html"&gt;open-uri&lt;/a&gt;, which treats web pages like files, because that is, as the kids say, one hella fine way to roll.&lt;br /&gt;&lt;br /&gt;Too bad it doesn't work with Basic Auth.&lt;br /&gt;&lt;br /&gt;I've got a service at http://db-import that listens on 8080. It requires valid credentials. I want to get some type data from it and parse it with Hpricot.  
Normally I would do this like so:&lt;br /&gt;&lt;br /&gt; &lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;doc = open("http://db-import:8080/rest/entityTypes.xml") { |f| Hpricot(f) }&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;However, the requirement of basic auth makes me do this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;&lt;span style="font-family:courier new;"&gt;   ENTITY_TYPES_REQUEST = "http://db-import:8080/importapi/rest/entityTypes.xml"&lt;br /&gt; .....&lt;br /&gt;   uri = URI.parse(ENTITY_TYPES_REQUEST)&lt;br /&gt;   # Net::HTTP.start returns the value of its block, so capture the response from it&lt;br /&gt;   response = Net::HTTP.start(uri.host, uri.port) do |http|&lt;br /&gt;       req = Net::HTTP::Get.new(uri.path)&lt;br /&gt;       req.basic_auth user, pass&lt;br /&gt;       http.request(req)&lt;br /&gt;   end&lt;br /&gt;&lt;br /&gt;   doc = Hpricot(response.body)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A couple of things to note (that got me):&lt;br /&gt;&lt;ol&gt;&lt;li&gt;I needed to specify the hostname w/o the transport. Instead of "http://db-import", specify "db-import". Yeah, that's kind of obvious after the fact :).&lt;/li&gt;&lt;li&gt;HTTP.start only opens the connection; you then make all requests and process all responses within the connection block. 
So in the code above I first configure the request object with basic auth and then use it to make the request.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Not terribly hard, but I do tend to trip up on details and wanted to spare some pain the next time around.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1839860763867708759?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1839860763867708759/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/basic-auth-over-http-using-ruby-nethttp.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1839860763867708759'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1839860763867708759'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/basic-auth-over-http-using-ruby-nethttp.html' title='Basic Auth over HTTP using Ruby, Net::HTTP'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-8825440887981440820</id><published>2008-11-24T21:27:00.000-08:00</published><updated>2008-11-26T22:04:02.653-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='data'/><category scheme='http://www.blogger.com/atom/ns#' term='training'/><category scheme='http://www.blogger.com/atom/ns#' term='running'/><title type='text'>I dont 
like running (data) naked</title><content type='html'>Yesterday morning, at 6:45, I was on semi autopilot, stepping out the door for my morning run.  I grabbed my trusty &lt;a href="http://www.garmin.com/" title="Garmin" rel="homepage" class="zem_slink"&gt;Garmin&lt;/a&gt; 305, walked out the door, and hit the on button. And waited. And tried again. I figured my gloves were a little too thick, so I took one off and then pressed again. And pressed harder. Nothing.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://buy.garmin.com/shop/store/assets/images/products/010-00467-00/en/cf-md.jpg"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 150px; height: 150px;" src="https://buy.garmin.com/shop/store/assets/images/products/010-00467-00/en/cf-md.jpg" alt="" border="0" /&gt;&lt;/a&gt;After almost 2 years of pretty much day in day out use, in rain, wind, and snow, through heat and cold, thick and thin, my little friend had left the building. Operating on pure reflex, I plugged it back into the recharging cradle and went back outside, bereft.&lt;br /&gt;&lt;br /&gt;This was a truly sad moment for me. There I was, in the dark and cold, trying to get excited about going running without second by second updates on heart rate, pace, altitude, and distance covered. At that moment I realized that I was being ridiculous, even diva-like.  I mean, wasn't running for running's sake enough? Would I have even had this mental conversation 2 years ago?&lt;br /&gt;&lt;br /&gt;Well....no. Not at 0-dark-45 in the morning. When I'd rather be in bed, warm and comfortable, dozing in and out of consciousness. Instead, I'm standing in a slight drizzle with my headlight strapped on, bundled up from head to toe in waterproof yet breathable and oh-so-reflective winter running gear. 
It would be different if, say, I was running on the beach at &lt;a href="http://maps.google.com/maps?ll=22.0833333333,-159.5&amp;amp;spn=0.1,0.1&amp;amp;q=22.0833333333,-159.5%20%28Kauai%29&amp;amp;t=h" title="Kauai" rel="geolocation" class="zem_slink"&gt;Kauai&lt;/a&gt;, wearing shorts and a t-shirt. I don't think I would need motivation coming from my wrist-top computer.&lt;br /&gt;&lt;br /&gt;Then again, maybe I would. I mean the coolest thing about the GPS/HRM is that it tells a story, of where I've been and - literally - what I've done. It tells a story and then persists it, for later recall. When I upload my run to the computer, I get to see how &lt;del&gt;slow&lt;/del&gt; fast I went, the hills on the route, the overall distance, and I get to remember how I felt at specific points in the run. And if I don't remember, my heart rate tells me. It's sort of like a data photo album, where the mix of lat/long, altitude, and heart rate combine to give me a snapshot of how I felt at every point in the run.&lt;br /&gt;&lt;br /&gt;I took off on the run anyway, shamed by my dependence on data, determined to experience 'pure' running without instrumentation. And I actually did. I couldn't refer to my data feed, so I started to pay attention to my form, my breathing, my stride, my forward lean. I knew the mileage of the route I was running (6.23 to be exact), but didn't know exactly how far I had gone, or how far I had left. And although I knew that I was somewhere between 125 and 145 bpm, I had to pace myself by how I felt at that moment, not by how my watch was telling me I felt.&lt;br /&gt;&lt;br /&gt;So, yeah, I enjoyed it, a little. And I was actually resigned to a month of 'naked' running while I sent my little buddy back to Garmin to be refurbed. 
It is, after all, the middle of winter, and I'm not training for anything in particular, just doing long runs to justify eating all those XMas sugar cookies.&lt;br /&gt;&lt;br /&gt;I had just convinced myself that this whole zen running thing was good, really good. But when I went down to the garage to pack the HRM up so I could ship it back to Garmin for refurb, I noticed that it was on, telling me that it was fully charged. Slowly, disbelieving, I turned it back on, and watched it search fruitlessly for a satellite connection. "Are you indoors?" it asked me. It seemed a little irritated. I turned it off, put it back in the cradle, and went back upstairs -- all of a sudden tomorrow's early morning run is looking a lot more fun.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-8825440887981440820?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/8825440887981440820/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/i-dont-like-running-data-naked.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8825440887981440820'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/8825440887981440820'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/i-dont-like-running-data-naked.html' title='I dont like running (data) naked'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1468843475194640772</id><published>2008-11-18T11:26:00.001-08:00</published><updated>2008-11-18T11:35:46.075-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='JavaScript'/><category scheme='http://www.blogger.com/atom/ns#' term='console debugging'/><category scheme='http://www.blogger.com/atom/ns#' term='Firebug'/><title type='text'>Notes from the (Javascript) Noob: conditionally enable console debugging</title><content type='html'>Today I ran into a problem when the primary user of my monitoring app wanted to know why graphs weren't rendering for him. I checked the site from my machine and all looked good. I checked the site from another dev's machine, and again, everything was rendering. At this point I was confused.&lt;br /&gt;&lt;br /&gt;I knew it had to be something in the JavaScript rendering, so I had the user install &lt;a href="https://addons.mozilla.org/en-US/firefox/addon/1843"&gt;Firebug&lt;/a&gt;. Instead of a &lt;a href="http://en.wikipedia.org/wiki/JavaScript" title="JavaScript" rel="wikipedia" class="zem_slink"&gt;JS&lt;/a&gt; error (or 10), the page loaded fine. Hmmm. I then wanted to see if the requests I was firing from the page to the Google Charts API were actually going through. We tabbed to the FB net tab, which was disabled. 
When I had him enable that, plus the console, the graphs rendered.&lt;br /&gt;&lt;br /&gt;Doh! I was using console.log to check a value, and forgot that not everyone in the known universe runs with FB enabled. In order to continue to log, I've done this:&lt;br /&gt;&lt;br /&gt;function log(str) {&lt;br /&gt;    var c = window.console; if (c) {&lt;br /&gt;      console.log(str);&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;I'm kind of surprised that (a) I wasn't getting a 'console object not defined' in the naive install of FB (which evaluated JS, but had console/net logging turned off), and that (b) if console was present as implied by (a), that logging would degrade gracefully. But the code above works, and I'll take that over sheer speculation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1468843475194640772?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1468843475194640772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/notes-from-javascript-noob.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1468843475194640772'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1468843475194640772'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/notes-from-javascript-noob.html' title='Notes from the (Javascript) Noob: conditionally enable console debugging'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7467571263174479880</id><published>2008-11-09T21:26:00.000-08:00</published><updated>2008-11-26T22:04:36.700-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Leela'/><category scheme='http://www.blogger.com/atom/ns#' term='family'/><category scheme='http://www.blogger.com/atom/ns#' term='Kiran'/><title type='text'>Kiran and Leela and Pork n Beans</title><content type='html'>This morning, hopped up on (whole wheat) pancakes and (lite) syrup, the kids and I rocked out to &lt;a href="http://www.lyricsmode.com/lyrics/w/weezer/pork_and_beans.html"&gt;Weezer&lt;/a&gt;. In this age of &lt;a href="http://en.wikipedia.org/wiki/Rock_Band" title="Rock Band" rel="wikipedia" class="zem_slink"&gt;Rock Band&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/Guitar_Hero" title="Guitar Hero" rel="wikipedia" class="zem_slink"&gt;Guitar Hero&lt;/a&gt;, it might seem lame to jam with tennis rackets, but we're old school. 
Kiran, Leela, consider yourselves blackmailed :)&lt;br /&gt;&lt;br /&gt;&lt;object width="320" height="266" class="BLOG_video_class" id="BLOG_video-800c4e87d6a8e2f7" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"&gt;&lt;param name="movie" value="http://www.youtube.com/get_player"&gt;&lt;param name="bgcolor" value="#FFFFFF"&gt;&lt;param name="allowfullscreen" value="true"&gt;&lt;param name="flashvars" value="flvurl=http://v17.nonxt4.googlevideo.com/videoplayback?id%3D800c4e87d6a8e2f7%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1331321911%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D3F6633A8AA57B4133BD5479C6075CFDD3A265504.4DF7A8F997D02C9B9C316987CF05CCF507262AEE%26key%3Dck1&amp;amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3D800c4e87d6a8e2f7%26offsetms%3D5000%26itag%3Dw160%26sigh%3DiJ9Xd1NbuRIA0QChwqHHcbWUmRI&amp;amp;autoplay=0&amp;amp;ps=blogger"&gt;&lt;embed src="http://www.youtube.com/get_player" type="application/x-shockwave-flash"width="320" height="266" bgcolor="#FFFFFF"flashvars="flvurl=http://v17.nonxt4.googlevideo.com/videoplayback?id%3D800c4e87d6a8e2f7%26itag%3D5%26app%3Dblogger%26ip%3D0.0.0.0%26ipbits%3D0%26expire%3D1331321911%26sparams%3Did,itag,ip,ipbits,expire%26signature%3D3F6633A8AA57B4133BD5479C6075CFDD3A265504.4DF7A8F997D02C9B9C316987CF05CCF507262AEE%26key%3Dck1&amp;iurl=http://video.google.com/ThumbnailServer2?app%3Dblogger%26contentid%3D800c4e87d6a8e2f7%26offsetms%3D5000%26itag%3Dw160%26sigh%3DiJ9Xd1NbuRIA0QChwqHHcbWUmRI&amp;autoplay=0&amp;ps=blogger"allowFullScreen="true" /&gt;&lt;/object&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-7467571263174479880?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='enclosure' type='video/mp4' 
href='http://www.blogger.com/video-play.mp4?contentId=800c4e87d6a8e2f7&amp;type=video%2Fmp4' length='0'/><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/7467571263174479880/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/kiran-and-leela-and-pork-n-beans.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7467571263174479880'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/7467571263174479880'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/kiran-and-leela-and-pork-n-beans.html' title='Kiran and Leela and Pork n Beans'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2482505799895680127</id><published>2008-11-07T22:56:00.000-08:00</published><updated>2008-11-10T11:58:03.242-08:00</updated><title type='text'>I think I've got a Soccer Problem</title><content type='html'>When  I was seven,  all I liked to do was read. Read read read. My mom was and is a very wise woman and decided that being a wimpy, nerdy bookworm was the fast track to many beatdowns, and signed me up for &lt;a href="http://soccer.org/home.aspx"&gt;AYSO&lt;/a&gt; soccer.&lt;br /&gt;&lt;br /&gt;I hated the first season, didn't really understand what the hell was going on, and wanted to quit. I'm not sure why I didn't, but by the end of the second season  I really loved the game. 
I loved the smell of the field, the oranges at halftime, and the feeling of being part of something bigger than just me. I loved playing, touching the ball, and would dribble and shoot on an imaginary goal framed by trees for hours and hours after school.&lt;br /&gt;&lt;br /&gt;Note that love doesn't imply ability. I'm not overly coordinated, and that, coupled with a serious vision problem (brought on by all that reading) and my reluctance to wear glasses on the field, washed me out of soccer by high school. I really missed playing and got back into it when I turned 30.&lt;br /&gt;&lt;br /&gt;People who play soccer when they're older tend to fall into two camps. There are the ex college/high school studs/studettes, who have amazing touch and vision and ability. They know exactly where they are, where everyone else is, and what is going to happen next. Then there are the rest of us, hacks who occasionally get a good touch or light up a good run and feel that all too brief moment of being connected to the world's most amazing game.&lt;br /&gt;&lt;br /&gt;I'm a spaz, occasionally doing something nice, sometimes having great games, sometimes having terrible games, most of the time having randomly great and terrible moments in the same game. My only real gifts are speed and endurance, both of which are slowly disappearing as I get older. I can pass OK, and have decent field vision at times, but my first touch is more accidental than deliberate, I have no air game, and I have a pathetically wimpy shot.&lt;br /&gt;&lt;br /&gt;I've been on the same team for about six years. It's a great group of men and women, most of whom are much better than I am, and very patient. One thing I've noticed over the years is that we've started to focus less on the actual games and more on the beers after the game. It's just as fun to give each other crap after the game as it is to play. Sometimes more fun.&lt;br /&gt;&lt;br /&gt;Every season I swear it will be my last. 
In tonight's game I was trying to move the ball across the field with a defender at my hip. I tried to reverse on him when all of a sudden I found myself flat on the ground with a really bad calf cramp. I made it clear to the ref that the defender had nothing to do with me ending up on the ground, and limped off the field to enjoy the rest of the game as a spectator. I don't know why my body chose that moment to betray me, but it was enough to end my night.&lt;br /&gt;&lt;br /&gt;I'm not sure why I keep coming back. As mentioned above, my speed is no longer keeping me in the game. Seattle's dirt fields play havoc with my knees and ankles. Pacific Northwest weather in the late fall/early spring is the opposite of warm and dry. Guys in their 20s are starting to burn by me, making me feel like a slow old man. And lately, more often than not, I find myself in the middle of the game with no anticipation, consistently a 1/2 second too late to the ball, and (even though I now see 20/15 thanks to LASIK) completely tunnel visioned.&lt;br /&gt;&lt;br /&gt;But there are those moments, really brief ones, where occasionally I get a glimpse of what it is like to really &lt;span style="font-style: italic;"&gt;play&lt;/span&gt; the beautiful game. Tonight I got a pass, touched it to my inside, and moved the ball up the field. I could see everything, and it felt like I had all the time in the world. I drew a defender to me and flicked the ball to an open space right in front of my wing, who touched it once and lofted a beautiful high shot over the goalie's outstretched arms. It was textbook, it was beautiful, and for that brief moment I was not a spaz, I was a player. It's an elusive high that keeps me coming back looking for more.&lt;br /&gt;&lt;br /&gt;I'll take 400mg of ibu and walk off that leg cramp now. It hurts, but I think not playing would hurt worse. 
Maybe I'll quit next season.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-2482505799895680127?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2482505799895680127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/i-think-ive-got-soccer-problem.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2482505799895680127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2482505799895680127'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/i-think-ive-got-soccer-problem.html' title='I think I&apos;ve got a Soccer Problem'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-4492368229505642933</id><published>2008-11-03T11:15:00.000-08:00</published><updated>2008-11-03T11:22:13.410-08:00</updated><title type='text'>Crontarded (sigh)</title><content type='html'>Note to self: sometimes mistakes are painful. Sometimes they are funny. Sometimes they are both, and sometimes they are painful, but funny in retrospect. In any case, the best approach is to document it, so that it _never_happens_again. Here is an email I sent earlier today:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;Subject: HI! 
I'm an idiot!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;Hey Adam, you know when you came up to me and told me db-import was getting pegged every 6 hours? And Gil, you know when you were asking me one day around noon why you were handling requests every 1 minute? &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;well, in the crontab I was running a job like this:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;* */6 * * *&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;which is really a great way of saying: every 6 hours, run this task every 1 minute. You see, I knew that, I just didn't _know_ that. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;I've amended the offending crontab entry to :&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;00 */6 * * *&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;so that the job can run once and only once, every 6 hours, like God intended. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;So, I'm sorry. You both can join the long list of people that (a) should punch me or (b) should get a free beer from me. The way I'm f*cking up &lt;/span&gt;&lt;span style="font-family: courier new; font-style: italic;" class="Object" id="OBJ_PREFIX_DWT19"&gt;today&lt;/span&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;, that line will shortly be stretching around the block. 
&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-style: italic;"&gt;-- Arun&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Those of you keeping score at home will know that the score now reads:&lt;br /&gt;Compilers and OS's (not including windoze): 12,000&lt;br /&gt;Arun: 0&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-4492368229505642933?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/4492368229505642933/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/crontarded-sigh.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4492368229505642933'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/4492368229505642933'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/11/crontarded-sigh.html' title='Crontarded (sigh)'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-5843836568825906848</id><published>2008-10-30T17:12:00.000-07:00</published><updated>2008-10-30T17:43:43.062-07:00</updated><title type='text'>Logrotate: a tale of two config locations</title><content type='html'>I was trying to make sure my logfiles didn't grow disproportionately large by rotating them via &lt;a 
href="http://linuxcommand.org/man_pages/logrotate8.html"&gt;logrotate.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Logrotate has two config locations:&lt;br /&gt;&lt;br /&gt;/etc/logrotate.d -- this directory contains per-application config files, one for each set of logfiles you want to rotate.&lt;br /&gt;&lt;br /&gt;/etc/logrotate.conf -- this is the main config file. It sets the global defaults, on most distributions it pulls in everything under /etc/logrotate.d with an include directive, and it allows you to specify application specific log rotate settings as well.&lt;br /&gt;&lt;br /&gt;I'm writing this down now because I had forgotten that I had already configured logrotate for one of my applications in /etc/logrotate.d, and then added an entry for the same logs to the general config file logrotate.conf. When I tried to simulate log rotation by running with the -d (debug, dry run) parameter:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new; font-weight: bold; font-style: italic;"&gt;logrotate -d /etc/logrotate.conf&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I received a 'duplicate entry' error, which led me to (re)discover the application config files in /etc/logrotate.d.&lt;br /&gt;&lt;br /&gt;In general, I think it's a much better idea to do application level logrotate configuration in /etc/logrotate.d. 
It keeps files manageable and readable.&lt;br /&gt;&lt;br /&gt;Here is a sample logrotate config file:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;/var/www/rails/dashboard/current/log/*.log {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       weekly          # once a week&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       rotate 10       # keep 10 copies &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       copytruncate    # keep original file handle (but truncate file) &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       delaycompress   # delay compression until next rotation&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       compress        # compress it&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       notifempty      # do nothing if you don't need to&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;       missingok       # it's not a bad thing to not have a log file.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;&lt;/span&gt;&lt;span style="font-family: courier new;"&gt;}&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-5843836568825906848?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/5843836568825906848/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/logrotate-tale-of-two-config-locations.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5843836568825906848'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/8840067776782114927/posts/default/5843836568825906848'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/logrotate-tale-of-two-config-locations.html' title='Logrotate: a tale of two config locations'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-2963196645695464117</id><published>2008-10-30T15:49:00.000-07:00</published><updated>2008-10-30T17:11:31.349-07:00</updated><title type='text'>Generating a public key for those automated remote scripts</title><content type='html'>I've been using &lt;a href="http://rubyhitsquad.com/Vlad_the_Deployer.html"&gt;vlad&lt;/a&gt; to deploy apps lately, which makes deployment a breeze. However, for a complicated deploy, I'm usually asked at least 10 times to re-enter my password.&lt;br /&gt;&lt;br /&gt;I was originally too lazy to set up a public key, but after doing 10 deploys one day (another story), I reconsidered. After all, typing the same thing over and over again violates all kinds of fairly logical principles, &lt;a href="http://en.wikipedia.org/wiki/DRY"&gt;DRY&lt;/a&gt; for one.&lt;br /&gt;&lt;br /&gt;Here is what I did to set up a public key on my deployment server.&lt;br /&gt;&lt;br /&gt;On my client box:&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;ssh-keygen -t rsa&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;ssh-keygen will ask for a passphrase. 
Don't enter one -- a passphrase prompt kills the whole point of using a public key to automate ssh/scp actions!&lt;br /&gt;&lt;br /&gt;ssh-keygen generates two files in ~/.ssh:&lt;br /&gt;id_rsa -- your private key used in ssh authentication. It never leaves the client box.&lt;br /&gt;id_rsa.pub -- the public key you can spray out to machines you want to copy things to.&lt;br /&gt;&lt;br /&gt;Then copy the public key to the remote machine (note that this overwrites any authorized_keys2 file already there):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;scp ~/.ssh/id_rsa.pub machine:~/.ssh/authorized_keys2&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now your logins should be password (and pain!) free.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-2963196645695464117?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/2963196645695464117/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/generating-public-key-for-those.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2963196645695464117'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/2963196645695464117'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/generating-public-key-for-those.html' title='Generating a public key for those automated remote scripts'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-1147496422743428500</id><published>2008-10-21T15:43:00.000-07:00</published><updated>2008-10-21T15:52:24.771-07:00</updated><title type='text'>Creating Objects on the fly in Ruby</title><content type='html'>I'm writing a file parser where I want to plug in different parsing modules depending on the kind of file I need to parse.&lt;br /&gt;&lt;br /&gt;In order to do this without having to change code, I'm storing the configuration in a YAML file, like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;home_page_total_clicks:&lt;/span&gt;&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;span style="font-family: courier new;"&gt;  id: 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;  match_string: "home_page"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;  measurement: "total_clicks"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;  clazz_name: "DummyParser"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family: courier new;"&gt;  date_comparision: 1&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;I'm specifying that for pages categorized as 'home_page_total_clicks', I want to instantiate a class named "DummyParser". 
I'm thinking that in the future I could allow someone to assign an arbitrary parser to an arbitrary file type.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;The way Ruby allows you to instantiate classes from specified strings relies on the fact that all class names are constants, which you can retrieve by name:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;  clazz = Kernel.const_get(file_process_data.clazz_name)&lt;br /&gt;  processor = clazz.new(file_process_data)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: georgia;"&gt;and tada! a new processor -- note how I've assumed that a processor takes a file_process_data as an input. This will fail for processors whose initialize methods don't take a file_process_data. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8840067776782114927-1147496422743428500?l=arunxjacob.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://arunxjacob.blogspot.com/feeds/1147496422743428500/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/creating-objects-on-fly-in-ruby.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1147496422743428500'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/8840067776782114927/posts/default/1147496422743428500'/><link rel='alternate' type='text/html' href='http://arunxjacob.blogspot.com/2008/10/creating-objects-on-fly-in-ruby.html' title='Creating Objects on the fly in Ruby'/><author><name>Arun Jacob</name><uri>http://www.blogger.com/profile/17781797469431108786</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_NqxvfgwIOvA/SczXH0tot6I/AAAAAAAADI8/yVn7EnVncq8/S220/bio_pic.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-8840067776782114927.post-7105774835565089564</id><published>2008-10-14T12:21:00.000-07:00</published><updated>2008-10-14T21:29:53.279-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Total Immersion'/><category scheme='http://www.blogger.com/atom/ns#' term='Swimming'/><category scheme='http://www.blogger.com/atom/ns#' term='Breathing'/><title type='text'>Swimming Breakthrough this AM</title><content type='html'>&lt;span style="margin: 1em; display: block; float: right;" class="zemanta-img zemanta-action-dragged"&gt;&lt;a href="http://commons.wikipedia.org/wiki/Image:Swimming_dog_bgiu.jpg"&gt;&lt;img style="border: medium none ; display: block;" title="A dog swimming" alt="A dog swimming" src="http://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Swimming_dog_bgiu.jpg/202px-Swimming_dog_bgiu.jpg" height="143" width="202" /&gt;&lt;/a&gt;&lt;span class="zemanta-img-attribution"&gt;Image via &lt;a href="http://commons.wikipedia.org/wiki/Image:Swimming_dog_bgiu.jpg"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_0"&gt;Wikipedia&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;Some history: when I was twelve, my mom decided that the one way to 'drown proof' me and my sister was to get us on a swim team. My summers, and then my winters, became filled with laps and yards and intervals. Only one problem. I was a terrible swimmer.&lt;br /&gt;&lt;br /&gt;I had the technique of a drowning man (hence the title of my blog), coupled with low body fat -- in the 6% range -- and the upper body of a Kenyan marathoner. Eventually, in high school, I started to dread swimming so much I would actually have nightmares about going to practice. 
I dropped out to spare myself any more sleepless nights, and proceeded to apply any lung capacity I had built up swimming to inhaling bong hits, which eventually led to more sleepless nights, but that's another story.&lt;br /&gt;&lt;br /&gt;Fast forward to now, and I'm quickly closing in on 40. I've been an incredibly average bike racer, run some 1/2 marathons, and while my genetics don't point to world class anything, I do actually enjoy running and biking. So triathlons would seem like a natural next step, especially since soccer is becoming a beer and ibuprofen aided affair, and climbing takes too much time away from the kids right now.&lt;br /&gt;&lt;br /&gt;But the thought of swimming, and the indelible imprints of suffering through thousands of yards very slowly and painfully, kept me focused on other things. Until now. I decided to actually take the time to learn how to swim, via &lt;a href="http://www.totalimmersion.net/"&gt;Total Immersion&lt;/a&gt;. Another positive factor: my body fat has doubled, so I'm not quite the sinker I used to be.&lt;br /&gt;&lt;br /&gt;Total Immersion teaches you how to swim better via a series of progressive drills. These drills start out very basic, i.e. you are floating on your back and kicking. 
They build up from there, but the keys are&lt;br /&gt;&lt;ul&gt;&lt;li&gt;swimming "downhill" by keeping your head down instead of looking forward.&lt;/li&gt;&lt;li&gt;swimming on your side, and pivoting from side to side.&lt;/li&gt;&lt;li&gt;driving that pivot from your core&lt;/li&gt;&lt;li&gt;pushing your chest down into the water because the air in your lungs will help you float.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;barely kicking&lt;/li&gt;&lt;/ul&gt;   This was completely counter to the way I swam, which was with my head looking forward, using my arms to drag myself through the water until my shoulders hurt, kicking spasmodically to try and float as effortlessly as the much better swimmers around me.&lt;br /&gt;&lt;br /&gt;After a month of working on these drills,  I was swimming with much less effort than I ever had before, but I still felt that something was missing. I still felt that  I was expending a lot of energy, that it was hard to breathe, and that I was struggling to swim downhill.&lt;br /&gt;&lt;br /&gt;After reading and re-reading the drills section of the Total Immersion book, I decided to try using &lt;a href="http://www.totalimmersion.net/fistgloves.html"&gt;&lt;span class="blsp-spelling-error" id="SPELLING_ERROR_1"&gt;FistGloves&lt;/span&gt;&lt;/a&gt;. They are what they sound like: rubber gloves that force your hands closed. What that does is dramatically reduce the amount of surface area that you have to work with. The idea is that by reducing your hand area, you will be forced to concentrate on balance as well as stroke.&lt;br /&gt;&lt;br /&gt;Again, this is counter to what I had been taught. To work on stroke, our coaches used to give us paddles and pull buoys. The paddles increased the surface area during the pull, increasing the workload on the shoulders. The buoys were used to let us concentrate on pulling. 
The result was supposed to be increased strength that resulted in increased speed, but I always felt fast until  I took the paddles off, and then I felt slow. And heavy, especially since I had to put the buoy away.&lt;br /&gt;&lt;br /&gt;This morning was my first go with the &lt;span class="blsp-spellin
