Friday, November 16, 2012

What is an Enterprise Data Platform, Anyway?

I've been trying to write about my requirements for a Data Platform for the last month. The problem is that I was trying to write this at the exact same time that my understanding of both the requirements and the Platform was shifting, so I ended up writing the requirements, coming back to them a week later, throwing them all out, and rewriting them again. This is revision number three, and I actually feel that I've stabilized my definition. The biggest change between now and a month ago is that I realized that I am actually thinking about an Enterprise Data Platform, which is different from a Data Platform.

First of all: What is a Data Platform? Here is a first crack at a definition: a Data Platform enables people to leverage data by facilitating data collection, storage, transformation, and access. In other words, I need to be able to get data in, persist it, access it, change it, and still access it. All of that is completely possible with {name your nosql storage engine of choice}. So why am I not calling {name your nosql storage engine of choice} an Enterprise Data Platform?

I have a short answer that needs a longer answer to explain it. The short answer: Enterprise Data Platforms have another requirement, which is the ability to manage all of that data.

The long answer: my experiences with databases and persistence technology started from the perspective of a developer in a startup, not an administrator or data modeler in an enterprise. I wanted to successfully collect, store, transform, and retrieve data, in order to fulfill the functional requirements of the applications I wrote. I found that a lot of the database admins and data modelers/architects I worked with actually got in the way of this fairly simple mission.

A lot of the constructs and restrictions imposed on the data model and the database by both parties seemed arbitrary, and when I asked them for their reasoning, their answers had no relevance to my single data set use case.  So, when I discovered Hadoop while working around a complex set of ETLs that had ground our database to a halt in 2009, I was elated, because both parties were sidelined and I could do what I needed to do. Their apoplectic reactions to 'schema on read' used to fill my heart with joy.  That makes my current perspective very ironic.

The change in perspective happened when I changed jobs. I went from being a developer who used Hadoop to deliver solutions to an architect/product manager position where I led a team that operationalized Hadoop, Cassandra, Mongo, and Solr and then started to store and curate enterprise-level data as a centralized service offering. The moment I started thinking past pure functionality and more about operations, I started to care about a lot of the things that people who manage data -- the people I used to disparage -- have been caring about for a while. The moment I became a data curator or steward of a set of very diverse and valuable data, I started to see real value in Data Governance, which was a term that previously made me roll my eyes in impatient disgust.

Data Governance is defined in Wikipedia as a set of processes around "data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization". As a developer in a startup tasked with doing a few things very well, I couldn't have cared less about data governance. I knew my data: it may have been vast, but it all had the same schema, which I knew and could code for.

However, as a member of a large enterprise storing many kinds of data who wants to use some of that data, I now must rely on data quality, definition, and security. Without those three I may be able to generate some value from that data, but the value is undermined by the lack of defined quality of the inputs, the lack of defined standards to normalize the data to, and the lack of security which means that the data could be compromised by a rogue user/process. Those 'constraints' on enterprise data are in place because they have a direct impact on the bottom line.

Breaking down Data Governance from the above definition, I get the following:
  1. Data resources must be discoverable: there must be a set of defined metadata that analysts can search for data by. 
    • Common metadata is most effective when there is a common data model, which becomes hard to enforce when an enterprise spans many different business units.
  2. Data structure must be describable so that analysts can consume it. 
  3. Data Quality must be known and identified every time new data is ingested. 
    • When I load data from an external source, downstream processes rely on it conforming to a schema. I need to score the data by how much of it conforms to that schema.
    • When Data Quality dips below a defined level, it must be treated as an operational issue. 
  4. Data Replication must be defined and enforceable.
    • Storage systems must allow users to set a replication factor to account for storage failures. 
  5. Data Replication must be defined, enforceable, and applicable per unique data type.
    • Replicating full data sets may be a business requirement: if so, a Data Platform must replicate data along the following dimensions: resource location (where) and range (how much). 
  6. The context of the data, e.g. what the units of the data are, must be defined in order for an analyst to successfully trust and consume the data. 
    • Unitless data is much less useful. Note that this implies common units across an enterprise, which is harder to do than one would think. Defining and encouraging a Master Data Model helps restrict units and meaning to allow different people to utilize data that they haven't produced. 
    • Data that has been vetted is 'trusted' which means that someone is standing behind it. Standards around trust are important when you have many owners of data. They imply standards around data quality and context. 
  7. Data must be secure.
    • Access must be restricted to specific owners. 
    • Those owners must be able to centrally manage permission granting to share data with other users. 
    • Users must be authenticated to the overall system, and can only access data that the data owners have authorized them to. 
    • All actions must be audited.
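
The scoring idea in requirement 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema, the field names, the `quality_score` and `check_batch` helpers, and the 0.95 threshold are all hypothetical, and "conforms" here simply means "has exactly the expected fields with the expected types".

```python
# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"id": int, "temperature_c": float, "site": str}


def conforms(record, schema):
    """True when the record has exactly the schema's fields with the expected types."""
    return record.keys() == schema.keys() and all(
        isinstance(record[name], expected) for name, expected in schema.items()
    )


def quality_score(records, schema):
    """Fraction of records that conform to the schema (1.0 for an empty batch)."""
    if not records:
        return 1.0
    return sum(conforms(r, schema) for r in records) / len(records)


def check_batch(records, schema, threshold=0.95):
    """Score a freshly ingested batch and report whether it meets the threshold.

    The threshold is an assumed operational setting, not a recommended value.
    """
    score = quality_score(records, schema)
    return score, score >= threshold


if __name__ == "__main__":
    batch = [
        {"id": 1, "temperature_c": 21.5, "site": "north"},
        {"id": 2, "temperature_c": 19.0, "site": "south"},
        {"id": 3, "temperature_c": "n/a", "site": "east"},   # wrong type
        {"id": 4, "temperature_c": 22.1, "site": "west"},
    ]
    score, ok = check_batch(batch, SCHEMA)
    if not ok:
        # In a real platform this is where an operational alert would be raised.
        print(f"data quality {score:.2f} is below threshold")
```

A real platform would likely score per field and per rule rather than per record, but even this coarse number is enough to turn "quality dipped" into something operations can alert on.
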
These requirements are foundational to a data platform that houses many diverse data sets. You could build one without them, but it would only work for a limited set of data. Which is why startups don't care about data governance, and why most NoSQL products are only now starting to think about security. They don't need to -- and to be clear, that's perfectly fine. But that doesn't work or scale at the enterprise level.
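
To make the security requirements in item 7 a little more concrete, here is a minimal in-memory sketch of owner-managed grants with an audit trail. Everything in it is hypothetical: the `DataCatalog` class, its method names, and the user and data set names are invented for illustration, and authentication is assumed to have already happened before any of these calls are made.

```python
from datetime import datetime, timezone


class DataCatalog:
    """Hypothetical registry of data sets, their owners, read grants, and an audit log."""

    def __init__(self):
        self._owners = {}    # data set name -> owning user
        self._grants = {}    # data set name -> set of users allowed to read
        self.audit_log = []  # every attempted action, allowed or not

    def _audit(self, user, action, dataset, allowed):
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "dataset": dataset,
            "allowed": allowed,
        })

    def register(self, owner, dataset):
        """Record a new data set; the owner can always read it."""
        self._owners[dataset] = owner
        self._grants[dataset] = {owner}
        self._audit(owner, "register", dataset, True)

    def grant(self, owner, dataset, user):
        """Only the owner of a data set may share it with another user."""
        allowed = dataset in self._owners and self._owners[dataset] == owner
        if allowed:
            self._grants[dataset].add(user)
        self._audit(owner, f"grant:{user}", dataset, allowed)
        return allowed

    def can_read(self, user, dataset):
        """Check a read request against the grants, auditing the attempt either way."""
        allowed = user in self._grants.get(dataset, set())
        self._audit(user, "read", dataset, allowed)
        return allowed


if __name__ == "__main__":
    catalog = DataCatalog()
    catalog.register("alice", "sensor_readings")
    catalog.can_read("bob", "sensor_readings")        # denied, but audited
    catalog.grant("alice", "sensor_readings", "bob")  # owner shares the data
    catalog.can_read("bob", "sensor_readings")        # now allowed
    for entry in catalog.audit_log:
        print(entry["user"], entry["action"], entry["allowed"])
```

The point of the sketch is that denied attempts are logged as well as successful ones, and that sharing goes through the owner rather than through whoever administers the storage engine.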

What would an Enterprise Data Platform that implemented the requirements above enable? It would enable qualified, standardized, and secure use of data for both analytics and runtime consumption. As a company runs different kinds of analytic efforts on its data, and generates runtime models from those analytic efforts, other analysts can reuse the input, intermediate, and generated data in different, unanticipated, "recombinant" ways because they can trust the data and can apply company standards to it. The platform would become an enabler, allowing analysts to discover new insights by removing the overhead of managing the data.

I have other requirements of an Enterprise Data Platform besides management, but management is foundational when you are handling very many, very diverse sets of data across multiple storage platforms. I hope to start addressing those other requirements shortly.