Monday, May 15, 2017

Kubernetes Development Environment - Minimizing Configuration Drift, Maximizing Developer Speed

One of the most appealing aspects of Container as a Service offerings like Kubernetes is that operational capability is mostly built in. At a high level, applications and services are deployed to clusters with manifest files that contain configuration, required containers, and needed resources, while the actual mechanics of the deployment are abstracted away from the end user.
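As a rough sketch (the name, image, and resource numbers below are placeholders, and the exact apiVersion has shifted across Kubernetes releases), a manifest for a simple stateless service looks something like this:

```yaml
# Hypothetical manifest: declare what to run (container image, replica count,
# resource requests) and Kubernetes handles scheduling, restarts, and rollouts.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-registry/my-app:1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
```

Everything below that declaration - which nodes run the pods, how a rolling update proceeds, what happens when a container dies - is handled by the cluster.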

While PaaS offerings like Cloud Foundry offer the same level of operational automation, I believe Kubernetes has gained more traction than Cloud Foundry because it is possible to move the stateful parts of an application into Kubernetes and run them as-is by mounting persistent volumes into their containers. I think persistent disks are still a work in progress in Cloud Foundry; standard practice for persistence there is to expose a service that provides persistence (e.g. a database or file store) via a CF cluster service broker and bind to it.

Once a legacy app is containerized and ported into Kubernetes, it is possible to rapidly refactor it from a monolith into purpose-built microservices. Normally, without some kind of container orchestration, deployment complexity would increase dramatically. Kubernetes automates that deployment complexity away.

This is very exciting, especially to teams that are saddled with legacy codebases, facing faster-moving competitors, and needing to deliver key features to capture or maintain market share. These teams know what they need to deliver, they just can't do it fast enough. Velocity is key. Any technology that allows teams to deliver features faster also lets them iterate on those features to gain and extend market fit.

However, while porting a legacy codebase into Kubernetes is relatively straightforward, it is not possible to 'lift and shift' the engineering team's skillset. So on day one after the port, engineering teams are still faced with a huge step function in required knowledge that prevents them from moving as quickly as they could.

Making a long-term successful migration to Kubernetes means ensuring that the engineering teams that own the migrated applications and services can fulfill the promise of the system - by quickly refactoring and extending the software from its legacy state to a decomposed, flexible architecture that enables fast feature ideation.

To be able to make that leap, developers who have been focused on a legacy codebase have to ramp up on several potentially new concepts:
  1. Microservice architectural patterns - so they can start to refactor their monolithic legacy applications.
  2. The Docker toolchain, so they can own the creation of containers for their applications and services.
  3. Key concepts of Kubernetes, e.g. pods, deployments, services, labels, liveness checks, etc., so they can understand how their application is deployed, patched, and scaled.
  4. Operational aspects of Kubernetes, like logging, attaching to containers, and checking pod and service state, so they can debug and fix application issues once apps have been deployed.
  5. Containerization in general - how containers are the unit of deployment and how that can be leveraged for tasks, tests, and ensuring that application configuration remains the same from their laptop to production.
What I don't address in that list: the very significant additional operational complexity of managing a Kubernetes cluster. Teams that need to run on premise - for instance teams that work for Financial Services companies - would need to come up to speed on that operational complexity as well.

Even in a managed Kubernetes instance, the complexity of multi-service applications can pose problems for engineers used to coding to and debugging against a monolithic codebase.

One of the advantages of writing to a monolithic codebase is that the development environment allows for rapid iteration. Making changes to a monolithic codebase and then starting up the server on your local machine to validate those changes makes for a seamless, fast loop. However, most local development environments don't resemble production environments at all, which leads to production problems that have configuration mismatches as a root cause.

The ideal development environment would allow for fast iteration in an environment that parallels production as closely as possible, enabling rapid validation.




When an application or service is composed of several sub-services that are in turn composed of multiple containers, setting up a good development environment gets much harder and that fast loop slows down, reducing the gains from container orchestration. Prior to Minikube, development was either done with a bunch of jury-rigged containers (which meant massive configuration drift from the production environment) or against a remote Kubernetes cluster, which preserved configuration at the cost of development speed because validation required deployment to that remote cluster.


Minikube runs Kubernetes on a single VM node, which removes the latency of deploying to a remote cluster, but a deployment is still required to validate each change. It is a faster alternative to a remote cluster, but deployment to the cluster is still part of the loop.
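Roughly, the loop looks like this (service and file names are placeholders, not from a real repo):

```sh
# Build against the Docker daemon inside the Minikube VM, so no registry
# push/pull is needed.
eval $(minikube docker-env)
docker build -t microservice2:dev .

# Deploy (or re-deploy) the change, then grab the NodePort URL to exercise it.
kubectl apply -f microservice2-deployment.yaml
minikube service microservice2 --url
```

Every change still round-trips through an image build and a deployment before it can be exercised.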



This still didn't give me the speed I wanted. As a developer:
  1. I would rather spend more time in a 'local environment', e.g. developing against live services while running a local version of the service I'm writing. I would rather validate without having to deploy every single time.
  2. I would always be developing a process that made calls to other services. I didn't want to have to 'fake' those services or let my dev environment configuration drift from production.
To get this behavior, I took advantage of the environment variables that Kubernetes injects for each deployed service. If I was working on microservice 1, which made calls to microservice 2, I would run microservice 2 in Minikube and export the environment variables that normally hold microservice 2's IP and port, setting the IP to my Minikube instance and the port to the NodePort that Minikube exposed microservice 2 on.
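Concretely, it looked something like this - assuming a Service named microservice2 exposed as a NodePort, with variable names following the <SERVICE>_SERVICE_HOST / <SERVICE>_SERVICE_PORT convention that Kubernetes uses inside pods:

```sh
# Point the service-discovery variables that microservice 1 expects at the
# copy of microservice 2 running in Minikube instead of an in-cluster address.
export MICROSERVICE2_SERVICE_HOST=$(minikube ip)
export MICROSERVICE2_SERVICE_PORT=$(kubectl get svc microservice2 \
  -o jsonpath='{.spec.ports[0].nodePort}')

# Now run microservice 1 locally; its calls to microservice 2 resolve to the
# instance running in Minikube.
```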


That setup let me run microservice 1 in my existing development environment and move through the code-debug-test loop as quickly as I used to with a monolithic application.

I've posted the code on GitHub. The README shows how I set up for inter-service development. As noted above, I ensured I was pointing at the Docker runtime on the Minikube VM, I patched deployments instead of re-deploying, and (just like the original article above) I used Makefiles to automate the commands I kept running (a sketch of such a Makefile follows the list):
  1. Building the Docker image and deploying to Minikube
  2. Enabling my service to run locally by setting environment variables
  3. Pushing the image to Minikube by patching the deployment
  4. Generating a service URL once I've deployed to Kubernetes so that I can validate the deployed app.
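A sketch of what such a Makefile can look like (image, deployment, and variable names here are placeholders, not the ones in the actual repo; in practice the image needs a unique tag per build so that patching triggers a new rollout):

```make
IMAGE  = microservice2
TAG   ?= dev
DEPLOY = microservice2

# Build the image against the Docker daemon inside the Minikube VM.
build:
	eval $$(minikube docker-env) && docker build -t $(IMAGE):$(TAG) .

# Print the export lines for local development; pull them into the current
# shell with: eval "$(make env)"
env:
	@echo "export MICROSERVICE2_SERVICE_HOST=$$(minikube ip)"
	@echo "export MICROSERVICE2_SERVICE_PORT=$$(kubectl get svc $(DEPLOY) -o jsonpath='{.spec.ports[0].nodePort}')"

# Push the new image into the running deployment by patching its pod template.
patch:
	kubectl patch deployment $(DEPLOY) -p \
	  '{"spec":{"template":{"spec":{"containers":[{"name":"$(IMAGE)","image":"$(IMAGE):$(TAG)"}]}}}}'

# Get the URL of the deployed service for validation.
url:
	minikube service $(DEPLOY) --url
```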
I like this pattern: it allowed me to develop as fast as I normally do, while keeping my development environment essentially aligned with the production environment.



Saturday, March 11, 2017

Google Next '17 Recap

I just returned from Google Next '17. It was my first time at Next, or at a Google event. I had already been looking at Google services for both professional and personal work, and the agenda promised some answers to questions I had, and probably more questions that I'd need to answer.

Some of the reactions to Google Next are interesting. This one was disappointed in the focus on enterprise customers, wanting more visionary focus and less SAP :) I get the sentiment, but the reality I live in is the one that Google wants to transform.

The one thing I came away with from Next '17 is that there is a viable on-ramp for teams to move faster by delegating a lot of the undifferentiated heavy lifting they do on premise today to managed services in the cloud. That's because of what those services are and how they are implemented. Those services are what pulls teams to the public cloud.

Managed services have always been one of the core values of public cloud, along with general purpose IaaS. In the past few years, and especially in the past year, the service diversity and capability of all three major public cloud providers has made managed services the overwhelming core value of public cloud for most companies. IaaS is becoming more and more of an implementation detail.

The graph below from this post by Subbu Allamaraju explains where the value of public cloud is for most companies. Most companies have fewer than 1,000 servers, and would benefit more from the increased market share that comes from getting product to customers faster than from any economy of scale. That's true regardless of which public cloud services you consume. They're managed for you, so you get to focus more on building your product and spend less time operating it.




Services are great, but not if they lock you into a single cloud. Ideally, a team uses services that exist on another cloud provider or are open source, so at worst they could be run on another provider's IaaS. If a service is proprietary, it needs to provide more value than a non-proprietary version would.

So services are awesome, except when they're proprietary. Sort of. I have grouped Google offerings into three sections: Open - based on OSS, Proprietary - not OSS but still attractive, and Future State. I will try to explain the value proposition I see in each one. This is not the entire list of services, just the ones that really fit the problems I've been running into lately.

Open


Google Container Engine (GKE) is based on Kubernetes. GKE went GA in August 2015, and momentum really started to pick up in 2016. Kubernetes has caught and passed Cloud Foundry as the default OSS way to think about running apps both on prem and across all public clouds. Microsoft offers managed Kubernetes, and many teams run their own clusters on AWS.

One reason for the shift in momentum is that it is much easier to move legacy architectures to Kubernetes than it is to move them to Cloud Foundry. While Cloud Foundry is now working with containers, it doesn't work with state - state is assumed to be outside of the cluster. This makes migrating any typical architecture over more involved.

At a minimum, moving an app to run on Cloud Foundry would require it to be refactored to use a service broker for connections to stateful backend services. In contrast, moving a legacy application to Kubernetes can require as little as a config change to point at the service endpoint of the stateful service.
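For example, in a hypothetical fragment of the app's Deployment (names are placeholders), the change amounts to pointing the existing database-host setting at the in-cluster Service name:

```yaml
# Fragment of a pod spec: the app keeps reading DB_HOST/DB_PORT as before,
# they just resolve to the database's Service inside the cluster now.
      containers:
      - name: legacy-app
        image: my-registry/legacy-app:1.0
        env:
        - name: DB_HOST
          value: legacy-db       # cluster DNS name of the database Service
        - name: DB_PORT
          value: "5432"
```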

I'm not saying it's completely plug-and-play -- the stateful service needs to be configured to meet required replication policies and leverage persistent volumes -- but the hardcore 12factor requirements around statelessness are not present in Kubernetes. Because of this, migration to 12factor microservices can become much more incremental, which increases the chance of migration success.
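On the storage side, a minimal sketch (names and size are placeholders) is a PersistentVolumeClaim that the stateful service mounts, rather than pushing its state out of the cluster:

```yaml
# Hypothetical claim; the database pod references it under spec.volumes
# (persistentVolumeClaim.claimName: legacy-db-data) and mounts it with a
# volumeMount, so its data survives pod rescheduling.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: legacy-db-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```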

Kubernetes, like clustered anything, is hard to configure and operate. The discussion on Kubernetes networking I attended on the last day of Next '17 reinforced that there is significant complexity that makes running Kubernetes on premise much harder than using a managed service. I'm very excited to start experimenting with Kubernetes via Google Container Engine because I don't want to trip over that complexity.

I'm also excited to play with Minikube as part of a rational Kubernetes dev stack. Minikube is a great step forward for Kubernetes developer tooling, and I think it's going to massively accelerate adoption. It greatly simplifies developing more complex Kubernetes apps that have multiple services. Developers can accurately replicate cluster state and service access patterns with Minikube, and push changes to a deploy pipeline with more confidence.

DataFlow (based on Apache Beam) is exciting to me because of the unified processing model for batch and streaming. Prior to Beam, the cognitive overhead of the number of options for both batch and streaming was overwhelming. Being able to reason across a single system, even if that system uses different underlying processing technologies, makes it possible for teams to provide value faster because they don't have to ramp up on very diverse APIs and concepts. We can provide the same insights regardless of delivery mode.

Proprietary


Despite what I said about not being locked into a single cloud, there are Google-specific technologies that make me consider making exceptions to the rule. All of these are in the data processing/machine learning realm. Google, IMO, is far ahead of the other providers with respect to getting intelligence from data. In this case I think the lock-in is worth it due to the significant leverage gained.

The scale potential of Spanner is very exciting. The implementation is not a silver bullet - it forces you to reason about locality of data at the schema level - but in my opinion this is much better than the locality-ignorant schemas of traditional RDBMS systems that force people to horizontally partition, and therefore silo, data in response to scale.

BigTable is very appealing because of its natural fit for time series data. A lot of the data I play with is event based, and so most insights come from aggregating events. The kinds of in-stream insights done in DataFlow would be greatly expanded if DataFlow could access state in BigTable.

The ability to make queries over BigTable with BigQuery is also really exciting. I had previously thought of BigQuery as a way to reason over object storage alone, but the ability to unify how to reason across multiple sources of data is simplifying, much in the same way that reasoning over batch vs stream processing is simplifying.

At this point it would be logical to ask if the services I'm most excited about exist on other public clouds. Lock-in needs to be weighed against immediate value. Microsoft does offer Kubernetes as an option in its Container Service, along with Mesos DC/OS and Docker Swarm. Apache Beam can be run on other IaaS and is in fact being built to use Spark as a runner.

Spanner, BigTable, and BigQuery are unique to Google, but the patterns (RDBMS, columnar storage, batch SQL across heterogeneous data sources) have OSS analogues. I don't feel as locked in as I would if I were building services at Amazon using their purely proprietary services. But I am more locked in on the data processing side, because Google is ahead of the scaled data processing curve when compared to the other public cloud vendors.

Future State


Beyond processing and storing data at scale, these services are the most exciting to me because they have the potential to democratize machine learning the way public cloud originally democratized self-service compute, storage, and network. One is based on open source, the other is proprietary; both are examples of how Google operates from a future state.

Google Machine Learning Engine (TensorFlow) is intriguing. When I wrote a neural network as part of a class, I spent so much time struggling with setting it up that I lost perspective on the problem we were solving. Anything that purports to help me keep that perspective by making neural net construction (and the tuning associated with backpropagation) easier gets me excited.

The Vision API - I'm assuming this is (partially) based on TensorFlow, given the limited image recognition work I've done in the past. Just playing around with the API gives me about 20 new ideas I'd like to work into current personal projects. Now I just need a time management API...

I'm hoping to get some time to play with a lot of these in the next few weeks, and document my (mis) adventures here. It's been fun reading about the tech and doing 'hello world' applications, but I'd like to apply them to my current problem domain and see how far I get by leveraging them.