Monday, December 21, 2009

Impedance Mismatch

Post Mortem:

Long story short: every time I've had a project fail, one recurring theme has been bad technology choices. There have been other factors, like group dysfunction, aka 'collective asshat fatigue', where the entire group stops functioning in order to avoid dealing with one or more aberrant personalities. Bad project scoping/definition also contributes to failure rates, but I don't think there is much intersection between groundbreaking work and the prototypical well-defined, well-understood project. So it would seem that issues of scoping, as well as team dynamics, are (a) exacerbated by and therefore (b) secondary to the bad technology choices that put the project in jeopardy.

I thought I had gotten better at detecting when I was making a bad technology choice, but I recently made one. Fortunately we were able to turn things around, but it was hard.  In the interest of not making this class of bad decision again -- because the definition of insanity is to do the same thing and expect different results --  I want to dissect what went wrong, what went right, and what I learned.

The Choice:

I recently started a new job. My first task was to jump in and assist on a prototype project by writing parsers that would take millions of rows of comma-separated values from feeds and turn them into various records in an SQL database. I was initially constrained to using Java. I took a standard object-per-row approach, persisting one object at a time.

The code was TDD'd, separation of concerns was good, and all unit tests passed (coverage was good). Several hundred lines of code were required to parse, clean, validate, and insert data from the various feeds into the database. I did not use an ORM; I used straight SQL via JDBC.
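
To make 'one object at a time' concrete, the insert path looked roughly like the sketch below. The table, columns, and parsing details are invented here, and the real code had a lot more cleaning and validation around it, but the shape is the same:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Row-at-a-time persistence: one INSERT per parsed CSV row.
// Table and column names are invented for illustration.
public class RowAtATimeLoader {

    public void load(Connection conn, List<String[]> parsedRows) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO feed_item (feed_id, symbol, price) VALUES (?, ?, ?)");
        try {
            for (String[] fields : parsedRows) {
                ps.setLong(1, Long.parseLong(fields[0]));
                ps.setString(2, fields[1]);
                ps.setBigDecimal(3, new BigDecimal(fields[2]));
                ps.executeUpdate(); // one round trip to the database per row
            }
        } finally {
            ps.close();
        }
    }
}

Every row pays for its own round trip to the database, which is where the hours went.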

The Results:

The performance was horrendous. Inserting several million rows took hours -- hours that we simply didn't have. The slowness rippled into every downstream operation and was jeopardizing the success of the overall project. Not a good way to start a new job :)

The Workaround:

We ended up ditching Java completely and relying on existing unix command-line tools to parse the files, insert into temp tables, and do bulk updates/inserts of rows from those temp tables into the main, 'canonical' tables. In other words, the hundreds of lines of Java parsing and insertion code that I had written in a week or so (counting unit tests) and frantically reworked several times to try to speed up were replaced by something like this:

cat rawfile.csv | cut -d, -f1 | tr '[:upper:]' '[:lower:]' | sed -e 's/\r//g' | sort | uniq | psql -c "copy tablename from stdin using delimiters ','" && actual_queries.pl


This took a multi-hour load down to about 5 minutes. There was a bunch of pre-formatting prior to inserting into the database, and a perl script that ran afterwards, using DBI to copy/update rows from the temp table.

In general, the one rule that emerged was 'do as much processing as possible before going to the db'. For example, determining set exclusion/intersection -- something I would have definitely gone to code or SQL for -- can be done at the command line with the comm utility:

comm -12 <(sort file1) <(sort file2) gives the intersection of file1 and file2.
comm -13 <(sort file1) <(sort file2) gives the lines unique to file2.
etc.
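
For contrast, here is roughly what the in-code version of that intersection looks like (file names invented) -- the kind of thing I would have reached for without thinking twice:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// In-code equivalent of comm -12: read both files into sets, keep the common lines.
public class Intersect {

    static Set<String> readLines(String path) throws IOException {
        Set<String> lines = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Set<String> common = readLines("file1");
        common.retainAll(readLines("file2")); // lines present in both files
        for (String line : common) {
            System.out.println(line);
        }
    }
}

Thirty-odd lines plus tests, versus a one-liner that has already been debugged by decades of users.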

Conclusions:

Conclusion 1: Tests Still Required.

The good thing about piping a bunch of common unix tools together is that they have been around for a long, long time, meaning you don't have to worry about the integrity of the data so much as about using the tool options correctly. The bad thing about this approach is that the only testing left is integration testing, and that is easy to blow off when the initial solution works (or seems to).

After getting bitten when the queries worked but the data had integrity issues that only surfaced later in the application logic, we ended up writing a bunch of scripts that verified data integrity by making queries and inspecting the result sets. We also leveraged the database itself, adding constraints that would make the load fail fast and alert us to schema-level integrity issues, like duplicate rows.
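
The checks themselves are nothing fancy -- constraints plus queries. A sketch of the two flavors, with invented table and column names (our actual checks were standalone scripts; JDBC is used here just to stay consistent with the other snippets in this post):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Two safety nets: a constraint that makes a bad load fail fast,
// and a verification query that inspects what actually landed.
public class IntegrityChecks {

    public void run(Connection conn) throws SQLException {
        Statement stmt = conn.createStatement();
        try {
            // Fail fast: duplicate rows now abort the load instead of lurking in the data.
            stmt.executeUpdate(
                    "ALTER TABLE feed_item ADD CONSTRAINT feed_item_uniq UNIQUE (feed_id, symbol)");

            // Verify: every staged row should have made it into the canonical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM feed_item_staging s "
                    + "LEFT JOIN feed_item f ON f.feed_id = s.feed_id AND f.symbol = s.symbol "
                    + "WHERE f.feed_id IS NULL");
            rs.next();
            int missing = rs.getInt(1);
            if (missing > 0) {
                throw new IllegalStateException(missing + " staged rows never reached feed_item");
            }
        } finally {
            stmt.close();
        }
    }
}

The constraint catches problems at load time; the queries catch the ones a constraint can't express.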

Conclusion 2: It's Not the Database's Fault.

The database is a very convenient scapegoat, but the truth is that I spoon-fed data into the database, and I could only move as fast as I could move my spoon (in Java). The better approach is to bulk feed data into the database, via bulk copies and bulk inserts/updates. Again, verification/validation scripts and constraints are required.

Conclusion 3: SQL Good, ORM Bad.

The truth is that we could have done this in Java, had we just used the same SQL we ended up using in the Workaround. My mistake when using Java was to put on my ORM blinders -- even without an ORM in play -- which are great when I want to pretend that there is some arbitrary data store underneath my code. This works until it doesn't, usually at 12 AM the day of a release.
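
To be concrete about 'the same SQL from Java': the PostgreSQL JDBC driver exposes the COPY command through its CopyManager API, so the bulk-copy-into-a-temp-table-then-merge approach doesn't require leaving Java at all. A sketch, with invented connection details and table names:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

// Same shape as the shell workaround, driven from Java:
// bulk COPY the cleaned file into a staging table, then merge into the canonical table.
// Connection details and table/column names are invented for illustration.
public class BulkLoader {

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/feeds", "feeds", "secret");

        // Requires a raw pgjdbc connection (not a pooled wrapper).
        CopyManager copier = ((PGConnection) conn).getCopyAPI();
        copier.copyIn("COPY feed_item_staging FROM STDIN WITH CSV",
                new FileReader("cleaned.csv"));

        // One set-based statement instead of millions of single-row inserts.
        Statement stmt = conn.createStatement();
        stmt.executeUpdate(
                "INSERT INTO feed_item (feed_id, symbol, price) "
                + "SELECT s.feed_id, s.symbol, s.price FROM feed_item_staging s "
                + "LEFT JOIN feed_item f ON f.feed_id = s.feed_id AND f.symbol = s.symbol "
                + "WHERE f.feed_id IS NULL");
        stmt.close();
        conn.close();
    }
}

The heavy lifting happens in one COPY and one set-based statement -- the same shape as the shell workaround, just driven from Java.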

Multiple FAILS mean I'm done pretending the database is some fuzzy, abstract data 'store', because I reach for a database precisely when I want to mine data along arbitrary axes -- in other words, I'll use a database precisely so I can use SQL, not some mapping to it. SQL is a mature and extremely powerful way to ask open-ended questions of a schema. If I don't want to ask open-ended questions of my data, I shouldn't use a database, because that's exactly what databases are built for. BTW, I haven't used Hive or Pig yet, but they seem to be the QL solutions for much larger datasets than the one I was working with.

Conclusion 4: When in Doubt, Go Cheap, Go Fast.

However, just because we could have done it in Java doesn't mean we should have. Perl or Ruby or Python or Bash and the plethora of solid utilities available will now always be my first option when putting together a data input operation at this particular scale.

I think there will always be opportunities that present themselves as vaguely defined chances to hit it big. Instead of taking lots of time up front to define the work involved, at the expense of the actual opportunity, I'm going to move ahead with cheap and fast technologies that let me change course extremely quickly, because I'm sure I will need to at least once during the course of an ill-defined project.

Conclusion 5: Keep the blinders off!

This entire experience was a huge reminder to me to be open-minded about choosing the right tool for the job. This was an instance where I let the technology choice dictate my implementation decisions, instead of the other way around. Every time I do that, I get screwed. If, instead of putting my head down and running as fast as I could, I had initially asked questions about the duration of the project, the intention of the code, the performance constraints on data input, etc., I could have easily justified the use of Perl/unix tools/raw SQL, and saved a lot of late nights and coding angst.

Conclusion 6: It's The People, Stupid

One thing that overwhelmingly shone through even in the grimmest of moments was the quality and class of the people I was working with. They all stayed completely focused on the solution, and did not point any fingers even when to do so would have been more than understandable. Furthermore, they were able to keep their sense of humor intact. While I didn't necessarily enjoy making this big of a mistake at a new job, the level of teamwork, professionalism, and respect from my new co-workers was complete confirmation of my reasons for jumping ship from my old company.