Friday, January 11, 2013

More on the Enterprise Data Platform: Data Requirements

In my last post, I talked in very general terms about an Enterprise Data Platform (EDP) and in very specific terms about what I consider to be a core requirement of any EDP: data governance. If I have a set of services and processes that provide data governance, I have a way to manage data. What kind of data am I trying to manage?

I'm primarily concerned with building systems that contain event data and reference data. Event data can be data copied from OLTP systems, user click streams, or machine data collected at regular intervals -- anything that signals that an event happened. Event data can be huge.

Reference data is data that can be used to classify/quantify aspects of an event.  If I'm looking at a click stream, a user ID is reference data that I can aggregate events by. In OLAP terms, events can be cast as facts, with reference data providing some of the dimension values.
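
For instance (the field names here are just illustrative assumptions, not anything prescribed by the platform), a single click event and the user reference record it points to might look like the sketch below, with the event playing the role of a fact and the user record supplying dimension values:

```python
# A single click-stream event: the "fact", in OLAP terms.
click_event = {
    "user_id": "u42",              # key into the reference data
    "url": "/products/widget",
    "ts": "2013-01-10T14:32:05",
}

# The reference record that event points to: it supplies dimension values
# (country, signup channel, etc.) we can classify the event by.
user_record = {
    "user_id": "u42",
    "country": "US",
    "signup_channel": "email",
}

# Joining on user_id is what lets us slice click events by country, channel, etc.
print(click_event["user_id"] == user_record["user_id"])  # -> True
```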

This data starts to get valuable when 'raw' event data is joined to reference data, and then in turn joined to other event data along a specific reference dimension. For example, aggregating user click streams and email campaign opens by user ID could be used to track the rate at which the email campaign actually generated new users.
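
As a rough illustration, here is a minimal Python sketch of that kind of join and aggregation. The in-memory lists and field names are hypothetical stand-ins for what would really be large event tables on the platform:

```python
# Hypothetical event data: email campaign opens and site signups, each
# tagged with the user ID -- the reference dimension we join and aggregate on.
email_opens = [
    {"user_id": "u1", "campaign": "jan_promo", "ts": "2013-01-02T10:00:00"},
    {"user_id": "u2", "campaign": "jan_promo", "ts": "2013-01-02T10:05:00"},
    {"user_id": "u3", "campaign": "jan_promo", "ts": "2013-01-03T09:30:00"},
]

signups = [
    {"user_id": "u2", "ts": "2013-01-02T11:00:00"},
    {"user_id": "u9", "ts": "2013-01-04T16:20:00"},
]

# Collapse both event streams along the user_id reference dimension.
opened = {e["user_id"] for e in email_opens}
signed_up = {e["user_id"] for e in signups}

# Users who both opened the campaign email and signed up.
converted = opened & signed_up

conversion_rate = len(converted) / float(len(opened)) if opened else 0.0
print("campaign conversion rate: %.2f" % conversion_rate)  # -> 0.33
```

At platform scale the same join would of course run as a distributed batch job rather than over in-memory lists, but the shape of the computation is the same.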

In addition to that kind of analytical use, event data can be used to classify users and/or determine their affinities. This kind of derivative data is typically referenced at runtime, by the same applications that are generating the event data.
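
To make that concrete, here is a hypothetical sketch of the runtime-facing pattern: a batch job derives each user's strongest affinity from raw events and publishes it to a key-value store, and the application looks it up by user ID at request time, falling back to a default when no entry exists yet. A plain dict stands in here for what would really be a distributed, availability-favoring store:

```python
# Stand-in for a distributed, availability-favoring key-value store.
affinity_store = {}

def batch_publish_affinities(events):
    """Periodic batch job: derive each user's top category from raw events."""
    counts = {}
    for e in events:
        per_user = counts.setdefault(e["user_id"], {})
        per_user[e["category"]] = per_user.get(e["category"], 0) + 1
    for user_id, by_category in counts.items():
        affinity_store[user_id] = max(by_category, key=by_category.get)

def get_affinity(user_id, default="general"):
    """Runtime lookup: prefer availability, tolerate slightly stale data."""
    return affinity_store.get(user_id, default)

batch_publish_affinities([
    {"user_id": "u1", "category": "sports"},
    {"user_id": "u1", "category": "sports"},
    {"user_id": "u1", "category": "news"},
])
print(get_affinity("u1"))        # -> sports
print(get_affinity("unknown"))   # -> general (fallback keeps the app serving)
```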

The two cases above are interesting because they have opposing requirements. Analytic data must first and foremost be consistent, especially when financial reporting and/or business decisions are being made from that data. Consistency in this case implies that when a value is written, the next read reflects the last change made to that value.

Runtime data must first and foremost be available, because unavailability of data may compromise application behavior. Moreover, preference or personalization data is derivative data, generated at a set interval as a batch process. If that data stays out of sync for a long time, the application will eventually grow 'stale', but when we talk about availability at the expense of consistency, the usual case is that any inconsistency between read state and written state is resolved in sub-second time.

Why is this important?

  1. Enterprises collect a lot of this data -- terabytes per day -- and that scale will swamp any single-machine database. 
  2. The enterprise must therefore store data on a storage platform that spans many machines -- a cluster. 
  3. The moment data is stored in a cluster, it is subject to the CAP theorem, which states that a distributed system cannot enforce Consistency, Availability, and Partition Tolerance at the same time. 
  4. While it is valid to desire a clustered system that favors either Consistency or Availability, enterprise-level requirements always require partition tolerance, so we can have one or the other, but not both. 
  5. This means that there are usually two main systems: an analytic-focused system and a runtime-facing system. The analytic-focused system favors consistency; the runtime-facing system favors availability.
  6. An EDP that wants to offer both analytic and runtime-facing data must therefore have at least two systems: one that is consistent, for analytic data, and one that is available, for runtime-facing data.
So now my definition of an EDP has evolved. It must have some fundamental Data Governance and, due to the scale enterprises operate at, must have 1..N analytics-focused platform(s) and 1..N runtime-facing platform(s). In my next post I'll focus on the requirements of the analytic-focused platform.