Wednesday, November 2, 2011

Schema On Read? Not so fast!

I just got back from HadoopWorld. I have many thoughts on what I saw and heard there, but that is probably a separate post. I've been trying to write something for the last 3 months, and HadoopWorld gave me the clarity I needed to finish it.

There was this phrase I kept hearing in the halls and meeting rooms... "Schema on Read". Schema on Read, in contrast to Schema on Write, gives you the freedom to define your schema after you've stored your data. Schema on Read sounds like Freedom. And who wouldn't like Freedom from the Tyranny of Schemas?

If, by the way, that phrase rings a bell, it may be because of this:

[Embedded video: a Downfall meme]
All Downfall Meme kidding aside, Schema on Read sure seems nice. Since nothing in the current crop of NoSQL storage engines enforces a schema down to the column level, can we just stop worrying about schemas altogether? What's the point of having one?

Schemas -- good for the consumer, bad for the producer. 

The point is that schemas are a guarantee to a consumer of the data. They guarantee that the data follows a specific format, which makes it easier to consume. That's great for the consumer. They get to write processes that can safely assume that the data has structure.

But... not without a cost, to someone. For the producer of the data, schemas can suck. Two reasons:
  1. Because you never get them right the first time, and you're left doing schema versioning and writing upgrade scripts to compensate for your sins of omission -- basically, you get nailed for not being omniscient when you designed the schema (a sketch of that upgrade-script pain follows the list).
  2. Because they don't handle variability well. If your data has a high rate of variability, your schema is guaranteed to change every time you encounter values/types/ranges that you didn't expect.
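To make reason #1 concrete, here's a minimal sketch in Python -- the field names and version numbers are hypothetical, not anyone's real migration code -- of the upgrade script you end up writing once you discover what v1 forgot:

    # Hypothetical upgrade script: migrate v1 records to v2 after
    # discovering that v1 never captured the user's country.
    def upgrade_v1_to_v2(record):
        record = dict(record)
        record["schema_version"] = 2
        # We weren't omniscient: v1 has no country field, so the best
        # we can do is backfill a placeholder and live with the gap.
        record.setdefault("country", "unknown")
        return record

    v1 = {"schema_version": 1, "user_id": 42, "name": "alice"}
    v2 = upgrade_v1_to_v2(v1)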
The various parties pumping freedom from schemas via NoSQL technologies seem to have an additional implicit message: even though you don't lock your data down to a schema, you still get the benefits of having one -- the data is still usable. Specifically, even if the data is undefined or only partially defined, you can still get value from it, simply because you're storing it. Is that true?

Sure it is. Sort of. Take a file in HDFS. If the file isn't formatted in a specific manner, can it still be processed? As long as you can split it, you can process it. Will the processing code need to account for an undefined range of textual boundary conditions? Absolutely. That code is guaranteed to break in arbitrary ways, because the format of the data is arbitrary.
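Here's a sketch of what that defensive code tends to look like -- plain Python, with a made-up tab-separated format standing in for whatever is actually sitting in HDFS:

    # Every assumption (separator, field count, types) has to be guarded,
    # and the except clause ends up doing a lot of silent heavy lifting.
    def parse_line(line):
        try:
            parts = line.strip().split("\t")   # guessing the separator
            if len(parts) < 3:                 # guessing the field count
                return None
            return {"user_id": int(parts[0]),  # guessing the type
                    "action": parts[1],
                    "ts": parts[2]}
        except ValueError:
            return None  # arbitrary input means arbitrary failure modes

    sample = ["42\tclick\t2011-11-02", "not even close", "\t\t"]
    records = [r for r in (parse_line(l) for l in sample) if r is not None]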

The same thing can happen with column families. Any code that processes a schema-free column family needs to be prepared to deal with any kind of column that gets added to that column family. Again, the processing code needs to be incredibly flexible to deal with potentially unconstrained change. Document stores are another example: even though the data is parseable, your processing code may need to achieve sentience a la Skynet in order to process it.
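A sketch of the same problem, with plain Python dicts standing in for rows in a schema-free column family (the column names are invented for illustration):

    # Any column may appear, disappear, or change type between rows.
    rows = [
        {"user_id": "42", "email": "a@example.com"},
        {"user_id": "43", "email_addr": "b@example.com"},  # renamed column
        {"user_id": "not-a-number", "clicks": "7"},        # surprise type
    ]

    cleaned = []
    for row in rows:
        # Probe for every column variant ever seen, ignore the rest...
        email = row.get("email") or row.get("email_addr")
        try:
            user_id = int(row.get("user_id", ""))
        except ValueError:
            continue  # ...and hope that skipping was the right call
        cleaned.append({"user_id": user_id, "email": email})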

So, yes, you can get value from randomly formatted data, if you can write bulletproof, highly adaptable code. That will eventually take over the world and produce cyborgs with Austrian accents that travel back in time.

But what about those of us who process (semi) structured data? Web logs, for example. Or (XML/JSON) data feeds. Things that have some kind of structure, where the meaning and presence -- aka the semantics -- of fields may change, but the separators between them don't. Do we really need freedom from the tyranny of something that guarantees structure when the things we're processing already have a basic structure?

Yes. Even though the format may be well defined, the semantics can be quite variable. Fields may be optional, mileage may vary. Putting a schematic constraint on all incoming data invalidates one of the key bonuses of big data platforms: if we had to adhere to a well-defined format just to load the data, we'd never get the chance to clean it. In the big data world, imposing a schema early on would not only suck, it would suck at scale.
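A quick illustration, with made-up log lines: the separators parse fine, but the semantics still bite, because an optional field silently shifts the meaning of the others:

    # Two log lines with identical separators but different semantics:
    # the referrer field is optional, so field position lies to you.
    lines = [
        "42|/index.html|http://referrer.example.com|200",
        "43|/about.html|200",   # optional referrer omitted
    ]

    for line in lines:
        fields = line.split("|")              # the format parses fine...
        status = int(fields[-1])              # ...but which field is which?
        referrer = fields[2] if len(fields) == 4 else None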

However, the moment we have done some kind of data cleansing and have data that is ready for consumption, it makes sense to bind that data to a schema. Note: by schema I mean a definition of the data that is machine readable. JSON keys and values work quite well, regardless of the actual storage engine.
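As a sketch of what I mean by machine readable -- plain Python, hypothetical field names, deliberately no particular schema library -- the schema itself is just JSON keys and values describing what a cleansed record must look like:

    import json

    # The schema is itself machine-readable JSON: keys name the fields,
    # values name the expected types.
    schema = json.loads('{"user_id": "int", "action": "string", "ts": "string"}')
    TYPES = {"int": int, "string": str}

    def validates(record, schema):
        return (set(record) == set(schema) and
                all(isinstance(record[k], TYPES[t])
                    for k, t in schema.items()))

    record = json.loads('{"user_id": 42, "action": "click", "ts": "2011-11-02"}')
    assert validates(record, schema)

The point isn't the validation function; it's that the schema is just data, so any storage engine (or none at all) can carry it alongside the records it describes.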

Because the moment you guarantee your data adheres to a schema, you liberate your data consumers. You give them... freedom! Freedom from... the tyranny of undefined data!

But wait, that's not all...

What else comes along for free? How about a quality bar by which you can judge all the data you ingest? If your data goes through a cleansing process, you can publish how much of it didn't make it through. Consumers can opt out of consuming the data if too much of it was lost.
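A sketch of that quality bar, with made-up numbers and an arbitrary placeholder threshold:

    # Publish how much of the raw data survived cleansing, and let
    # consumers decide whether that's good enough for them.
    records_in = 1000000
    records_out = 982000

    manifest = {
        "records_in": records_in,
        "records_out": records_out,
        "quality": records_out / float(records_in),
    }

    # A downstream consumer can now fail fast instead of discovering a
    # format change a month later in a financial report.
    MIN_QUALITY = 0.95  # placeholder threshold, picked by the consumer
    if manifest["quality"] < MIN_QUALITY:
        raise RuntimeError("upstream data failed the quality bar; opting out")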

And when your data changes (because it will), your downstream customers can opt out because none of the data would pass validation. This fast failure mode is much preferable to the one in which you discover, a month later, that your financial reports were wrong because of an undetected format change. That isn't an urban myth; it actually happened to -- ahem -- a friend of mine, who was left to contemplate the phrase 'pain is a very effective motivator' while scrambling to fix the issue :)

So what does this all mean?

While the costs of Schema on Write are (a) well known and (b) onerous, Schema on Read is not much better the moment you have to maintain processes that consume the data.

However, by leveraging the flexibility of Hadoop, Cassandra, HBase, Mongo, etc., and loading the data in without a schema, I can rationalize (clean) the data and apply a schema at that point. This gives the data producer freedom while they're discovering what the data actually looks like, and gives the data consumer freedom because they know what to expect. It also lets the data change over time in a controlled manner that my consumers can opt in to or out of.

That's not Schema on Read or Schema on Write; it's more like Eventual Schema. And I think it's a rational compromise between the two.