Sunday, December 8, 2013

Innovation Week Recap

I previously posted about our leadup to Innovation Week, which ended up being more like Innovation Week-and-a-half because it displaced a sprint that would have ended the week of Christmas -- a week that is pretty sparsely attended, with everyone out of town.

The 10 days of innovation ended up being much more successful than I had thought possible. There was some very out-of-the-box thinking, both in the realm of infrastructure and analytics, and some of these ideas have huge potential to shift how we think about big data.

The reasons I thought that things might not go so well (and what ended up happening):

  1. Lack of ideas. At the time I wrote the last post, we were up to 6. We topped out at 14, all well thought out and well presented. We had to narrow the ideas down to 5 based on people's availability -- we did that with a team-wide vote.
  2. Lack of management: we -- the management team -- had specifically decided to let the teams be self-organizing, and not interfere with them even if we saw them go off the rails. No one went off the rails, and teams organized around the work and the capabilities of the team members. We did make ourselves available for questions/advice, but other than that we sat back and observed.
  3. Technical roadblocks: the ideas we ended up voting in (as an entire team) had some steep technical hurdles. I wasn't sure if the teams could overcome those, and wasn't sure what they would do if they couldn't. Every team had at least one significant roadblock that they worked around with little to no guidance. 
  4. I'm a pessimist (realist?), and tend to prepare for worst-case scenarios. Apparently I overestimate myself and my management team's contributions :)
The presentations were great in that all except for one were live demos of working software. One key difference between these and standard demos is that the teams owned the ideas and were therefore much more invested in how the demos went.

We're taking the top ideas and starting new work that will get prioritized against existing deliverables. While I'm obviously excited about the ideas, some of which I consider to be fundamental game changers, I'm just as excited because of what I learned about leading teams.

Our best ideas come from our people, and when we guide them and set the target, they crush it. As management, our primary job should be to clearly communicate a vision of where the team needs to be, inspire them by giving them ownership and autonomy, and get obstacles out of their way.

Sometimes I feel like the best teams are the ones that build up ideas the way Barca moves the ball down the field:
 

There is no 'central control' -- there is just the idea (the ball) and the team, whose members support each other as they move the ball downfield. The magic happens because the team is focused on doing what it takes to move the ball, develop the attack, and put together a combination that finishes in the opponent's net. What blows me away is that each of these players has amazing skill, but they are so much more effective with one-touch passing and holding the triangle. I see the same thing on engineering teams that work well together. The top talent doesn't hold onto the ideas; they share them and make themselves available to move them along, and in doing so bring everyone up to their level. Seeing that happen without explicit guidance was the best part of Innovation Week for me.




Sunday, November 17, 2013

Hadoop Streaming with MRJob

Motivation to use Streaming:

Writing Java map-reduce jobs for simple tasks feels like 95% boilerplate, 5% custom code. Streaming is a much simpler interface into MapReduce, and it gives me the ability to tap into the rich data processing, statistical analysis, and NLP modules of Python.
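For context, a bare-bones streaming mapper is just a script that reads lines from stdin and writes tab-separated key/value pairs to stdout. The classic word-count mapper, shown here purely as an illustration (not one of our actual jobs), is about as small as it gets:

#!/usr/bin/env python
# toy Hadoop Streaming mapper: reads lines from stdin and
# emits "word<TAB>1" pairs on stdout for a downstream reducer to sum
import sys

for line in sys.stdin:
    for word in line.strip().split():
        sys.stdout.write('%s\t%d\n' % (word, 1))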

Motivation to use mrjob:

While the interface to Hadoop Streaming couldn't be simpler, not all of my jobs are simple 'one and done' map-reduces, and most of them require custom options. MRJob allows you to configure and run a single map followed by multiple reduces. It also does some blocking and tackling, letting me define custom arguments and pass them into specific jobs. Finally, mrjob can be pointed at an on-prem cluster or an Amazon EMR cluster -- and we are looking at running Amazon clusters for specific prototype use cases.

mrjob and streaming hurdles

The mrjob documentation is excellent for getting up and running with a simple job. I'm going to assume that you have read enough to know how to subclass MRJob, set up a map and a reduce function, and run it.

I'm going to discuss some of the things that weren't completely obvious to me after I had written my first job, or even my second job. Some of these things definitely made sense after I had read through the documentation, but it took multiple reads, some debug attempts on a live cluster, and some source code inspection.

Hurdle #1: passing arguments

My first job was basically a multi-dimensional grep: I wanted to walk input data that had timestamp information in a tab-delimited field and only process those lines that were in my specified date range. In order to do this I needed two range arguments that took date strings, so the mapper could do the range check. I also wanted to be able to apply specified regex patterns to those lines at map time. Because there were several regex patterns, I decided to put them in a file and parse them. So I needed to pass three arguments into my job, and those arguments were required by every mapper that got run in the cluster.

In order to pass arguments into my job, I had to override the configure_options() method of MRJob and use add_passthrough_option() for the range values, and add_file_option() for the file that held the regexes:

def configure_options(self):
    super(HDFSUsageByPathMatch, self).configure_options()
    self.add_passthrough_option("--startDateRange", type='string', help='...')
    self.add_passthrough_option("--endDateRange", type='string', help='')
    self.add_file_option("--filters")

All options were passed straight through to my job from the command-line:

python job.py --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json

I referenced them in an init function of my job class, which subclassed the MRJob class:

class MyJob(MRJob):
    ...
    def task_init(self):
        self.startDateRange = dateutil.parser.parse(self.options.startDateRange)
        self.endDateRange = dateutil.parser.parse(self.options.endDateRange)
        self.filters = parseJsonOptions(self.options.filters)
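
parseJsonOptions() isn't shown above -- it's just a small helper that loads the filters file. A minimal sketch looks something like this, assuming the filters file is simply a JSON array of regex pattern strings:

import json
import re

def parseJsonOptions(path):
    # assumption: the file is a JSON array of regex strings,
    # e.g. ["^/data/logs/.*", "^/user/hive/warehouse/.*"]
    with open(path) as f:
        patterns = json.load(f)
    return [re.compile(p) for p in patterns]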

This init method was specified in the MyJob.steps() override of the default MRJob method:

def steps(self):
    return [
        self.mr(mapper_init=self.task_init,
                .....
    ]

Something to note here: in the code I had written during development, I had neglected to really read the documentation, and as a result I had done all validation of my custom args using a standard optparse OptionParser in my main handler. This worked for me in inline mode, which is what I was developing in. It does not work at all when running the job on a cluster, and it took some source code digging to figure out why. Do as I say, not as I do :) In hadoop mode, the main MRJob script file is shipped to the mapper and reducer nodes and re-invoked with the step parameter set to the appropriate element of the steps array. The entry point into the script is the default main, and MRJob has a set of default parameters it needs to pass through to the subclassed job class. Overriding parameter handling in main effectively breaks MRJob when it tries to spawn mappers and reducers on worker nodes. MRJob handles the args for you; you need to let it do all arg parsing, and pass custom arguments as passthrough or file options.
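
To make that concrete, here is roughly what the mapper that consumes those options looks like -- a simplified sketch rather than the production code; the input line format (path, timestamp, bytes in tab-delimited fields) is just for illustration:

def mapper_filter_matches(self, _, line):
    # illustrative line format: <path>\t<timestamp>\t<bytes>
    fields = line.split('\t')
    timestamp = dateutil.parser.parse(fields[1])
    # skip anything outside the requested date range
    if timestamp < self.startDateRange or timestamp > self.endDateRange:
        return
    # only emit lines whose path matches one of the configured regexes
    for pattern in self.filters:
        if pattern.search(fields[0]):
            yield fields[0], int(fields[2])
            break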

Hurdle #2: passing python modules

This nuance has more to do with streaming than it does with mrjob. But it's worth understanding if you're going to leverage non-standard Python modules in your mapper or reducer code, and those modules have not been installed on all of your datanodes.

I was using the dateutil module because it makes parsing dates from strings super easy. On a single node, getting dateutil up and running is this hard:

easy_install python-dateutil

But when you're running a streaming job on a cluster, that isn't an option. Or, it wasn't an option for me because the ops team didn't give me sudoers permissions on the cluster nodes, and even if they did, I would have had to write the install script to ssh in, do the install, and roll back on error. Arrgh, too hard.

What worked for me was to
  1. Download the source code
  2. Zip it up (it arrived as a tar.gz)
  3. Change the extension of the zip file, because files that end in .zip are automatically moved to the lib folder of the task's working directory
  4. Access it from within my script by putting it onto the load path:
import sys

# dateutil.mod is the renamed zip file shipped alongside the job (see below);
# pointing sys.path inside the archive lets Python import the module via zipimport
sys.path.insert(0, 'dateutil.mod/dateutil')
import dateutil
...

I pass dateutil.mod in via add_file_option() in myjob.configure_options(). Leveraging the add_file_option() method puts dateutil.mod in the local hadoop job's working directory:

def configure_options(self):
    super(HDFSUsageByPathMatch, self).configure_options()
    ....
    self.add_file_option("--dateutil")

Three things to note from the above code: (1) dateutil.mod is the zip file, (2) I'm referencing a module within the zip file by its path location in that zip file, and (3) because I've renamed the file, it gets placed in the job working directory, which means it is on my path by default.

This is how I pass dateutil.mod into the job:

python job.py ... --dateutil dateutil.mod

Hurdle #3 (not quite cleared): chaining reduces vs map-reduces

As mentioned in the doc, it's super easy to chain reduces to do successive filtering and processing. Simply specify your multiple reduces in the steps() override:

def steps(self):
    return [
        self.mr(mapper_init=self.task_init,
                mapper=self.mapper_filter_matches,
                combiner=self.combiner_sum_usage,
                reducer=self.reducer_sum_usage),
        self.mr(reducer_init=self.task_init,
                reducer=self.reducer_filter_keys)
    ]

I haven't found it necessary to run successive map-reduces -- successive reduces work just as well in the use cases I've tried. When chaining reduces onto the end of your first map-reduce, the key emitted by the first map-reduce becomes the key seen by the next reduce.

What is not easy at this time is saving intermediate output to a non-intermediate location. While doing that is relatively straightforward in 'inline' mode, the approach suggested in the link won't work in hadoop mode, because MRJob invokes the Python script with the right --step-num argument based on what it sees in the steps() method.

I did read about the --cleanup option, but from what I understand the intermediate output dir of a complex job is based on a naming convention, not on something I can set. As this is somewhat of an edge case, I can work around it by chaining MRJob runs with Oozie.

Summary

What I've learned about MRJob is that while it does a great job of letting you set and pass options, and of letting you construct good workflows (assuming you don't care about intermediate output), it is so easy to use that I fell into the trap of believing that running locally on my machine was equivalent to running on a Hadoop cluster.

As I've found out several times above, that is not the case. For me the keys here are (1) let MRJob handle your job specific variables, (2) leverage the steps() method for your more complex flows, and (3) if you need to save intermediate output, chain your jobs using an external scheduler.
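
For reference, the only difference on the command line between the two modes is the runner flag -- something like the following, where the input paths are just placeholders:

# inline mode: runs in a single local process; great for development, but not a faithful test
python job.py -r inline --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json input.txt

# hadoop mode: runs on the cluster; this is where arg handling and module packaging actually get exercised
python job.py -r hadoop --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json hdfs:///path/to/input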

Friday, November 8, 2013

Innovation -- trying to break out beyond the buzzword

Innovation is the poster child of buzzword bingo.


It's hard not to have an allergic reaction to people that talk about it, because you can't talk about innovation and do it at the same time.

So why am I talking about Innovation instead of doing it? :)

A couple of months back, when we were revamping our development process, basically going from 'Scrum-in-name-only' to something much more genuine (and I've got to do a post on that), we wanted to give people a block of time to do something completely different from their day jobs. We wanted them to work with different people, outside of their usual teams, on ideas that they (not we) thought of. We wanted to break down some of the walls that naturally occur when you section large teams into smaller units to get work done efficiently.

We're sitting on some amazing data and have built some great infrastructure to manage it. These people are on the teams that work with that data and use that infrastructure day in and day out. They're smart. I know they have ideas for new data products, or for tools to make getting insights easier, but no time to actually work on them. Most importantly, I know their ideas are good ones, because I've seen multiple people make those ideas happen in spite of having no time to work on them. We have products that we've built because people have championed their ideas into the delivery stream. I wanted to make that easier. You shouldn't have to be Rocky Balboa to get a good idea off the ground.

In other words, that kind of effort shouldn't happen on nights and weekends, against all odds -- we need to reward that kind of creativity during business hours -- while balancing the delivery needs of the business.

'Innovation Week' is our collective attempt to do just that. One week a quarter is enough time to stop business as usual and try something completely different. Innovation Week is very much an experiment, one that could go well....or not.

The overall plan:

  1. Before:
    1. Announce the week. 
    2. Send out a 'request for ideas' email
    3. Review ideas in as many sessions as we needed:
      1. the idea 'author' presents their idea canvas.
      2. We go over the canvas, ask questions, offer suggestions.
  2. During: 
    1. Everyone sells their idea.
      1. Key to the selling: they need to ask for help where they need it.
    2. People provide their first, second, third choices.
    3. We assign people to ideas -- the reason we aren't just letting people choose is that we don't want imbalanced teams, and we want to make sure groups are diverse.
    4. The teams work on the ideas -- we are available to unblock any issues and provide guidance if asked. 
  3. After:
    1. Every team presents their work.
    2. The group stack-ranks all ideas.
    3. The top 3 get prizes. 
    4. The management team gives separate awards for 
      1. Business Value
      2. Completeness of Effort
      3. Disruptiveness (of the idea, of the technology being used, etc)


As the saying goes: 'no battle plan survives first contact with the enemy'. I was fairly nervous. What if no one had ideas? What if the team couldn't care less? What if they were as allergic to the I-word as I was?

Our first idea review meeting was last night. Instead of the 1 or 2 ideas we had predicted, we have (at last count) six. Instead of the vague, tech-aspirational ideas we thought we were going to see -- things along the lines of 'I want to play with technology X, here is a contrived attempt to justify that' -- we saw carefully thought-out solutions to problems our team was either working around or about to run into.

The discussion around the ideas was very positive and constructive -- the ideas that were presented got a lot of feedback and suggestions about how they could be better. The best part was getting individuals to see that they had good ideas and that exposing those ideas to the group would make them better. The best moment was when one of the quietest, most unassuming engineers got up and proceeded to unveil a completely awesome idea that was completely out of the box and completely powerful. At that point the energy in the room jacked up like a big wave.

After a while, work becomes work. We're lucky enough to be in a profession that requires as much creativity as it does precision. I wanted to put some meaning into what has become a term that is only applied with heavy irony.

We are early on in the process. I am going to document how this first Innovation Week goes -- expecting the unexpected, of course.

Right now, as noted, we are at the beginning. The management team has put a lot of work into setting up the idea generation, and we need to follow through by setting the teams up for success, picking the best ideas, then ruthlessly evangelizing those up the chain. It's a long journey but I think we made a great first step.



Sunday, October 13, 2013

Yesterday I woke up and realized I was a Product Owner.

Well, not yesterday. But I have had a recent revelation that I am no longer an engineer. And I'm not sure if I can ever really go back, even if I were to write production code again (I'm always writing code, but at this point it's more of a hobby than anything).

My journey into product ownership began innocently enough when I assumed technical leadership for a set of services built around Hadoop and other NoSQL platforms -- the Enterprise Data Platform I've posted about before -- and has taken over a year. In that year I have transitioned from a person who provides purely technical solutions to a product owner. What's the difference? To me it is simple: a technologist implements solutions the best way they possibly can to address a perceived problem. A product owner makes sure that the problem being solved actually matters to someone. Preferably before the technologist gets too far down the solution path.

In my last post I talked a lot about secondary vs primary value propositions. The Enterprise Data Platform I have been working on as a technologist is a group of services with secondary value propositions. The technologies used to build them are definitely awesome. But that's irrelevant. As a product owner I've got to make sure that those problems are worth solving to someone, because there is a significant investment of time and resources required to solve them -- even if the solutions don't end up getting used.

Nothing is more frustrating than burning effort on a solution that goes nowhere. Especially if that time was spent crafting a robust, scalable, well tested, well documented implementation of that solution. It's frustrating because that work resulted in a solution that either no one understood or no one wanted. In fact it was my frustration with having been through the good solution/bad product cycle several times that made me take the jump from technologist -- one of the people that implements solutions -- to product owner -- one of the people who decides whether implementation is worth it.

One of the hardest things about transitioning to a product owner from a purely technical role is being able to distinguish whether I should be doing something because I can, or because it solves a problem that people actually have. This is hard for me because as a technologist I tend to get sucked into solving hard problems before asking whether those problems should actually be solved. Ironically, I've found that the best technologists I've worked with are the ones that stand back, ask the bigger-picture questions, and use the answers to implement elegant, concise solutions.

Product owners need to do the same thing. I think the main job of a product owner is to learn what product the customer base really wants. Instead of plowing ahead with user stories and a feature roadmap, the best product owners I've seen are the ones that can step back from that process and ask the bigger picture questions, and use the answers to make a better product.

Answering big picture questions correctly allows product owners to fail fast -- throw away features that aren't working and iterate towards ones that do. And before doing any of that, the best product owners I know are the ones that can articulate the real value proposition of any effort before embarking on it. They may not know what or how they are going to deliver the effort, but they've nailed why the effort is worth making or not.

In my Product Owner self-education, I've learned (from lots of reading and even more bleeding) that clearly articulating the value proposition and true effort cost is critical to understanding whether the effort is worth exploring, and to selling that exploration process to the people financing it and the people working on it. One of the best things I've found to help clarify product vision is this product canvas from Shardul Metha.


I've been going through our product suite with this product canvas. That process, while painful, has clarified why we should continue doing certain things and why we should stop doing others.

The interesting thing about the canvas is that everything is linked. Here are some examples of what I mean:

  1. If Key Success Metrics don't link back to the Value Proposition through the Solution, it doesn't matter how obvious they are, they're wrong. 
  2. If you don't have viable Channels, no executive is going to sign up to be a Key Stakeholder, and it's going to be hard to find a customer that could be a champion. 
  3. If the Value Proposition doesn't address the problems that customers have more efficiently than the alternatives, the Business Value of the effort is weak. 
  4. An ill-defined or prohibitively expensive Cost Structure can weaken the best Value Proposition as well, because Business Value will be reduced.


Because everything is linked, I find myself revisiting what I had thought was 'obvious' or at least 'set'. Like the Value Prop. Or the Customer. Or even the Problem. But after several cycles what comes out of the other side is either very strong or very weak. This makes deciding which efforts to pursue much easier.

Back to my original problem: I'm selling secondary solutions. The product canvas helps by allowing me to clarify the customer I'm helping, the problem they have, my specific value proposition, and how I would go about fulfilling that value proposition. I've found that for these secondary solutions, linking them to primary solutions is critical. An example:

We've been working on a reliable data delivery service that allows engineering teams to stream their data into different endpoints. Teams can attach 'routing directions' to the data they send that enable it to land in HDFS, Mongo, Cassandra, ElasticSearch, and other endpoints. That is technically very cool. But we have not been able to clearly articulate to our leadership why we would do something like this, because as technologists, the advantage of having something like this as a service is obvious. Namely, people could reuse this service instead of building a custom one.

As a product owner I get a chance to look at the solution from another point of view. What problem am I trying to solve? How much does it cost to stand up and run? Are there commercial alternatives that we would be better off using? Can I find someone who is willing to try it and work with them to load test it? What other teams do I need to deliver it? Who do I need to sponsor it? And how am I going to get people to use it? For a secondary solution like this, perhaps the most important question to answer is: what primary problems would be easier to solve if this existed today, and how does that benefit the business?

None of those questions have anything to do with how we are solving the problem. But they are the minimum set of questions that need to be answered to justify continuing this effort. And, had I been as educated when we started the effort, this is the minimum set of questions we would have asked prior to doing any work. The answers to these questions, laid out in product canvas form, would have been a decision-making compass -- when confronted with choice A vs choice B, the product canvas would have given us the tools to make the best decision.

I think that this approach is fundamentally right, but it is one that I am only just starting down. The product canvas approach -- clarifying why before what and how -- is the obvious foundational first step in delivering a valuable product. Our teams are going through this journey, and my hope is to write down what is working, and what isn't, as we now try to deliver products whose value proposition we clearly understand :)

Thursday, September 26, 2013

Enterprise Data Platform: A Reboot and a Reality Check

The last post I wrote on the Enterprise Data Platform was in January. It's September. What happened?

A whole lot, actually. My understanding of what an Enterprise Data Platform is and how it needs to be 'sold' has changed dramatically. My role in this process has also changed. And my understanding of delivering software and running a team has grown and changed, and continues to change for the better.

What I've found out in the past year, among many other things, is that getting people to fund what I've been calling an Enterprise Data Platform is as much about education as it is about technical execution. I'm not talking about educating other people, I'm talking about educating myself. It has been an incredibly educational, humbling, uncomfortable, frustrating, awesome nine months.

When I look back at the posts I wrote early this year, one thing that is very murky throughout a reasonably well laid out argument for data management is a value proposition. I didn't know that when I wrote it, but I quickly found it out when I went to ask for money.

Here is the value disconnect: a system to collect, manage, and leverage data solves a secondary problem -- one that assumes a primary problem has already been solved. In other words: if I'm trying to get you to fund a data collection, storage, and management platform, I'm assuming that there is something already generating the data that needs to be stored. Netflix, Amazon, Google, my bank website, Blogspot -- they all solve primary problems. Those problems are easy to explain, regardless of how hard they are to implement. Solutions to primary problems have clear, concise, direct connections to value.

Solutions to secondary problems are optimizations of primary solutions. Decreasing time to insight on operational metrics of a website is an optimization. A great one, to be sure, but not necessary if the website is not getting any traffic.

Any solution to a secondary problem has an indirect value connection at best. Secondary solutions only make sense when the initial value proposition of the primary solution is diluted or reduced by a secondary problem. A system that doesn't scale to support a site whose popularity is exploding through the roof is a secondary problem. The secondary problems I see in my current role are operational in nature, and the solutions to them are optimizations. They can deliver huge value when done correctly.

"When done correctly". Three words that are seared into my brain. In the past year several things have happened while I've been trying to explain, again and again, why we should build a solution to a secondary problem, and while I've been trying to build that solution with limited resources:
  1. I've realized that the best way to solve a secondary problem is one primary problem at a time. Building a platform to optimize an undefined set of primary solutions is a risky, 'field of dreams' approach, and there are many ways to go awry. 
  2. I've become less of a technologist and more of a product owner. My last piece of production facing code will (hopefully) be retired in the next couple of months. There are much better engineers on the team, and I rely on them to deliver working software in the same way that they rely on me to come up with a useful product.
  3. Where I used to think about use cases and requirements and assume that these were valid, I now question and validate product direction up front. That involves use cases, but if the use case is invalid, why spend time extracting requirements from it? I spend more time thinking about validating the use case as cheaply as possible.  Requirements emerge and solidify as product direction takes shape -- doing them in advance of having a validated use case seems backwards. This insight radically changes the way software is delivered, and our teams are in the middle of this change process.
  4. The cost and recovery plan for any effort -- infrastructure and resources -- combined with time to recovery, is best defined as soon as you have validated use cases. Those plans need to change as validated features emerge that impact cost and recovery. In previous roles I had been 'sheltered' from that aspect of the business. I'm finding now that financial data is the ultimate data point that helps quantify whether value is being delivered.
This process is far from complete. I am continuing to learn every day, and while it can get very uncomfortable, it has been an amazing education. 

I've tried to write things down several times in the past 9 months. I haven't gotten very far because what I was writing didn't feel complete. Writing about the technology is only one side of the story. What I've learned in the past nine months is that there is a much bigger picture -- now that I'm starting to be able to externalize what I've learned, I'm excited to write about it. More soon...

Friday, January 11, 2013

More on the Enterprise Data Platform: Data Requirements

In my last post, I talked in very general terms about an Enterprise Data Platform (EDP) and in very specific terms around what I consider to be a core requirement of any EDP, data governance. If I have a set of services and processes that provide data governance, I have a way to manage data. What kind of data am I trying to manage?

I'm primarily concerned with building systems that contain event data and reference data. Event data can be data copied from OLTP systems, user click streams, or machine data collected at regular intervals -- anything that signals an event happening. Event data can be huge.

Reference data is data that can be used to classify/quantify aspects of an event.  If I'm looking at a click stream, a user ID is reference data that I can aggregate events by. In OLAP terms, events can be cast as facts, with reference data providing some of the dimension values.

This data starts to get valuable when 'raw' event data is joined to reference data, and then in turn joined to other event data along a specific reference dimension. For example, aggregating user click streams and email campaign opens by user ID could be used to track the rate at which the email campaign actually generated new users.
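
As a toy illustration of that kind of join -- pandas on made-up, in-memory data with invented column names, rather than a real cluster-scale job:

import pandas as pd

# event data: click stream and email-campaign opens (invented schemas)
clicks = pd.DataFrame({'user_id': [1, 2, 2, 3], 'page': ['/home', '/signup', '/buy', '/home']})
opens = pd.DataFrame({'user_id': [2, 3, 4], 'campaign': ['fall_promo', 'fall_promo', 'fall_promo']})

# reference data: user dimension, including whether the user is newly acquired
users = pd.DataFrame({'user_id': [1, 2, 3, 4], 'is_new': [False, True, True, False]})

# join events to reference data along the user_id dimension
opened_and_clicked = opens.merge(clicks, on='user_id').merge(users, on='user_id')

# e.g. fraction of campaign opens that came from new users who then clicked something
rate = opened_and_clicked[opened_and_clicked.is_new].user_id.nunique() / float(opens.user_id.nunique())
print(rate)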

In addition to that kind of analytical use, event data can be used to classify users and/or determine their affinities. This kind of derivative data is typically referenced at runtime, by the same applications that are generating the event data.

The two cases above are interesting because they have opposing requirements. Analytic data must first and foremost be consistent, especially when financial reporting and/or business decisions are being made from that data. Consistency in this case implies that when a value is written, the next read reflects the last change made to that variable.

Runtime data must first and foremost be available, because unavailability of data may compromise application behavior. Besides, preference or personalization data is derivative data, generated on a set interval as a batch process. Being out of sync for a long time will eventually mean the application grows 'stale', but when we talk about availability at the expense of consistency, the usual case is that any inconsistency between read state and written state is resolved in sub-second time.

Why is this important?

  1. Enterprises collect a lot of this data -- TB/day -- and that scale will swamp any single machine based database. 
  2. The enterprise must therefore store data on a storage platform that spans many machines -- a cluster. 
  3. The moment data is stored in a cluster, it is subject to the CAP theorem, which states that a distributed system cannot enforce Consistency, Availability, and Partition Tolerance at the same time. 
  4. While it is valid to desire a clustered system that favors either Consistency or Availability, enterprise-level requirements always require Partition Tolerance, so we can have one or the other, but not both.
  5. This means that there are usually two main systems: an analytics-focused system and a runtime-facing system. The analytics-focused system favors consistency; the runtime-facing system favors availability.
  6. An EDP that wants to offer both analytic and runtime-facing data must therefore have at least two systems: one that is consistent for analytic data, the other available for runtime-facing data.
So now my definition of an EDP has evolved. It must have some fundamental Data Governance, and, due to the scale enterprises operate at, must have 1..N analytics-focused platform(s) and 1..N runtime-facing platform(s). In my next post I'll focus on the analytics-focused platform requirements.