Waving Not Drowning: Things that I think are worth remembering... (Arun Jacob)

Google Next '17 Recap (March 11, 2017)
<div dir="ltr" style="text-align: left;" trbidi="on">
I just returned from Google Next '17. It was my first time at Next, or at a Google event. I had already been looking at Google services for both professional and personal work, and the agenda promised some answers to questions I had, and probably more questions that I'd need to answer.<br />
<br />
Some of the reactions to Google Next are interesting. <a href="https://architecht.io/googles-uncomfortable-turn-as-enterprise-it-vendor-8b3236c477b0#.3yhhmu57r">This one</a> was disappointed in the focus on enterprise customers, wanting more visionary focus and less SAP :) I get the sentiment, but the reality I live in is the one that Google wants to transform.<br />
<br />
The one thing I came away with from Next '17 is that there is a viable on-ramp for teams to move faster by delegating a lot of the undifferentiated heavy lifting they do on premise today to managed services in the cloud. Because of what those services are and how they are implemented, they are what is pulling teams to the public cloud. <br />
<br />
Managed services have always been one of the core values of public cloud, along with general purpose IaaS. In the past few years, and especially in the past year, the service diversity and capability of all three major public cloud providers has made managed services the overwhelming core value of public cloud for most companies. IaaS is becoming more and more of an implementation detail. <br />
<br />
The graph below from this <a href="https://m.subbu.org/dont-build-private-clouds-9a54b3d30c8b#.28linfdlp">post</a> by Subbu Allamaraju explains where the value of public cloud is for most companies. Most companies have fewer than 1000 servers, and would benefit more from the increased market share that comes from getting product to customers faster than they would from any economy of scale. That's true regardless of which public cloud services you consume. They're managed for you, so you get to focus more on building your product, and spend less time operating it.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://cdn-images-1.medium.com/max/1600/1*Qb6QODEkTAYdt49qJRnWDQ.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="377" src="https://cdn-images-1.medium.com/max/1600/1*Qb6QODEkTAYdt49qJRnWDQ.png" width="640" /></a></div>
<br />
<br />
Services are great, but not if they lock you into a single cloud. Ideally, a team uses services that exist on another cloud provider or are open source, so that at worst they could be run on another provider's IaaS. If a service is proprietary, it needs to provide more value than a non-proprietary version would.<br />
<br />
So services are awesome, except when they're proprietary. Sort of. I have grouped Google offerings into three sections: Open - based on OSS, Proprietary - not OSS but still attractive, and Future State. I will try to explain the value proposition I see in each one. This is not the entire list of services, but the ones that really fit the problems I've been running into lately.<br />
<br />
<h3 style="text-align: left;">
Open</h3>
<br />
<a href="https://cloud.google.com/container-engine/">Google Container Engine (GKE)</a>, based on <a href="https://kubernetes.io/">Kubernetes</a>, GKE<b> </b> <a href="http://venturebeat.com/2015/08/26/google-launches-container-engine-out-of-beta/">GA'd in August 2015</a>, and momentum really started to pick up in 2016. Kubernetes has caught and passed <a href="https://www.cloudfoundry.org/">Cloud Foundry</a> as the default OSS way to think about running apps both on prem and across all public clouds. Microsoft offers managed Kubernetes, and many teams run their own cluster on AWS.<br />
<br />
One reason for the shift in momentum is that it is much easier to move legacy architectures to Kubernetes than it is to move them to Cloud Foundry. While Cloud Foundry is now working with containers, it doesn't work with state - state is assumed to be outside of the cluster. This makes migrating any typical architecture over more involved.<br />
<br />
At a minimum, moving an app to run on Cloud Foundry would require it to be refactored to use a <a href="https://docs.cloudfoundry.org/services/api.html">service broker</a> for connections to stateful backend services. In contrast, moving a legacy application to Kubernetes would require a config change to point to the service endpoint of the stateful service.<br />
<br />
I'm not saying it's completely plug and play -- the stateful service needs to be configured to meet required replication policies, and leverage persistent volumes -- but the <a href="https://12factor.net/processes">hardcore 12factor requirements around statelessness</a> are not present in Kubernetes. Because of this, migration to 12factor microservices can become much more incremental, which increases the chance of migration success.<br />
<br />
Kubernetes, like clustered anything, is hard to configure and operate. The <a href="https://www.youtube.com/watch?v=y2bhV81MfKQ&index=203">discussion on Kubernetes networking</a> I attended on the last day of Next '17 reinforced that there is significant complexity that makes running Kubernetes on premise much harder than using a managed service. I'm very excited to start experimenting with Kubernetes via Google Container Engine because I don't want to trip over that complexity.<br />
<br />
I'm also excited to play with <a href="https://github.com/kubernetes/minikube">Minikube</a> as part of a rational Kubernetes dev stack. Minikube is a great step forward for Kubernetes developer tooling, and I think it's going to massively accelerate adoption. It greatly simplifies developing more complex Kubernetes apps that have multiple services. Developers can accurately replicate cluster state and service access patterns with Minikube, and push changes to a deploy pipeline with more confidence.<br />
<br />
<a href="https://cloud.google.com/dataflow/">DataFlow</a> (based on <a href="https://beam.apache.org/">Apache Beam</a>) is exciting to me because of the unified processing model around batch and streaming. Prior to Beam, cognitive overhead for the number of options for both batch and streaming was overwhelming. Being able to reason across a single system, even if that system uses different underlying processing technologies, makes it possible for teams to provide value faster because they don't have to ramp up on very diverse APIs and concepts. We can provide the same insights regardless of delivery mode<br />
<br />
<h3 style="text-align: left;">
Proprietary</h3>
<br />
Despite what I said about not being locked into a single cloud, there are Google specific technologies that make me consider making exceptions to the rule. All of these are in the data processing/machine learning realm. Google, IMO, is far ahead of the other providers with respect to getting intelligence from data. In this case I think the lock-in is worth it due to the significant leverage gained.<br />
<br />
The scale potential of <a href="https://cloud.google.com/spanner">Spanner</a> is very exciting. The implementation is not a silver bullet - it forces you to reason about locality of data at the schema level - but in my opinion this is much better than the locality-ignorant schemas of traditional RDBMS systems that force people to horizontally partition and therefore silo data in response to scale.<br />
<br />
<a href="https://cloud.google.com/bigtable/">BigTable</a> is very appealing because of its natural fit to time series data. A lot of the data I play with is event based, and so most insights come from aggregating events . The kinds of in stream insights done in DataFlow would be greatly expanded if it can access state in BigTable.<br />
<br />
The ability to make queries over BigTable with <a href="https://cloud.google.com/bigquery/">BigQuery</a> is also really exciting. I had previously thought of BigQuery as a way to reason over object storage alone, but the ability to unify how to reason across multiple sources of data is simplifying, much in the same way that reasoning over batch vs stream processing is simplifying.<br />
<br />
At this point it would be logical to ask if the services I'm most excited about exist on other public clouds. Lock in needs to be weighed against immediate value. Microsoft does offer Kubernetes as an option in its Container Service, along with Mesos DC/OS and Docker Swarm. Apache Beam can be run on other IaaS and is in fact being built to use Spark as a runner.<br />
<br />
Spanner, BigTable, and BigQuery are unique to Google, but the patterns (RDBMS, Columnar Storage, batch SQL across heterogeneous data sources) have OSS analogues. I don't feel as locked in as I would building services at Amazon, using their purely proprietary services. But I am more locked in on the data processing side because Google is ahead of the scaled data processing curve when compared to the other public cloud vendors.<br />
<br />
<h3 style="text-align: left;">
Future State</h3>
<br />
Beyond processing and storing data at scale, these services are the most exciting to me because they have the potential to democratize machine learning the way public cloud originally democratized self service compute, storage, and network. One is based on open source, the other is proprietary, both are examples of how Google operates from a future state.<br />
<br />
<a href="https://cloud.google.com/ml-engine/">Google Machine Learning Engine</a> (<a href="https://www.tensorflow.org/">Tensorflow</a>)<b> </b>is intriguing. When I wrote a neural network as part of a class, I spent so much time struggling with setting it up that I lost perspective on the problem we were solving. Anything that purports to help me keep this perspective by making neural net construction (and the tuning associated with backpropagation) easier is something that gets me excited.<br />
<br />
<a href="https://cloud.google.com/vision/">The Vision API</a> - I'm assuming this is (partially) based on Tensorflow, because of the limited image recognition work I've done in the past. Just playing around with the API gives me about 20 new ideas I'd like to work into current personal projects. Now I just need a time management API...<br />
<br />
I'm hoping to get some time to play with a lot of these in the next few weeks, and document my (mis)adventures here. It's been fun reading about the tech and doing 'hello world' applications, but I'd like to apply them to my current problem domain and see how far I get by leveraging them. </div>
Changing Roles...again (November 26, 2016)
<div dir="ltr" style="text-align: left;" trbidi="on">
So much for that new year's resolution to post more -- it's coming up on a year since my last post. About two months ago I moved from Product Management back into Engineering. In September, I left the HPE Stackato product management team and took a senior engineering management position at Zonar Systems, a Seattle-based vehicle telematics company.<br />
<br />
The last two years were hard, but the lessons I learned were ultimately worth the struggle, because I learned so much.<br />
<br />
I learned to execute better, from being very specific about what I wanted out of every meeting, to always driving discussions to closure, even if it is across several meetings and emails.<br />
<br />
I learned how important it is as a Product Owner to define the arc of a backlog - the way product themes are implemented via epics, stories, and ultimately tasks -- and how important it is to keep the backlog well defined.<br />
<br />
I learned how inspirational leaders can bring out the best in a team, and how uninspiring leaders can demotivate the same team.<br />
<br />
I also learned a lot about myself. I did have some misgivings about the role, and as time went on those misgivings were shown to be true. Unfortunately I ignored those initial instincts, and by the time I started listening to them, the compromises I had made to stay in the role were significant, and I didn't know if I had compromised so much that I was 'trapped' in the role.<br />
<br />
The one thing I really missed was thinking about the 'How'. As a Product Manager, I loved thinking about the Why and the What, but in order to be effective as a Product Manager I had to delegate the How completely.<br />
<br />
When I realized that being a Product Manager was not for me, I sat back and made a list of what I wanted to do next, and after some searching, found that position at Zonar.<br />
<br />
I've been very happy digging into the new job. Ironically, it's the things that I learned as a product manager that have been the most critical to my success in the first two months. That's great, because there was a period of time where I was really questioning the move to product management and whether it was a mistake. At this point it looks like the best possible move to have made.<br />
<br />
I would like to say that I was that smart and forward looking, and intentional about my career path. The truth was that I was never thinking that far ahead. What I really did was follow my curiosity, which is what I've always done. I started in software because I was interested in writing the logic behind the applications I used. The job choices I made as a software engineer were motivated by the chance to learn new technologies and skills to solve harder and more interesting problems. That's been the story of my career and my life, and so far it has panned out well. </div>
Changing Roles (January 11, 2016)
<div dir="ltr" style="text-align: left;" trbidi="on">
It has been almost 2 years since I've posted anything in this blog. In July 2014 I decided to leave Disney, and more significantly, change roles. At Disney I had been leading the engineering and analytics teams of a central data service. In my not-so-new role at HP I'm on the product management team for the Helion Development Platform, which in its current incarnation is known as <a href="http://www8.hp.com/us/en/cloud/helion-devplatform-overview.html">HPE Helion Stackato</a>, a <a href="https://www.cloudfoundry.org/">Cloud Foundry</a> based Platform as a Service.<br />
<br />
I was (and still am!) excited to do Product Management. As an engineer I have seen how essential the consistent application of product vision and stewardship can be to an engineering effort. I did not know if I really had product vision or was just delusional, but I wanted to find out either way...<br />
<br />
The best part about this job is that I get to think - a lot - about how to make software development better for the average enterprise engineer. The struggle is real! Most engineering teams still write IaaS based applications like they were running on bare metal servers, and that leads to a world of hurt, because an application distributed across IaaS provided resources has to contend with underlying network, compute, and storage service failures.<br />
<br />
The promise of <a href="https://en.wikipedia.org/wiki/Platform_as_a_service">Platform as a Service</a> is that it enables easy development and deployment of cloud native applications - applications that take advantage of the elasticity of the cloud while dealing with the ephemerality of the underlying IaaS. Cloud native is a concept that takes most engineering organizations some time to get their head around.<br />
<br />
As a result, there is a significant educational aspect to my role, which I love. I get to help people focus on creating value with software, something that has enthralled me since I was 14 years old teaching myself BASIC on my Dad's IBM PC/AT.<br />
<br />
I get to present my thoughts to various captive audiences, either at customer onsite visits or conferences. Here is a presentation from HP Discover 2015 that I think captures the problem we are trying to solve and the best approaches to take to solve the problem.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/FqMCu96MxOE/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/FqMCu96MxOE?feature=player_embedded" width="320"></iframe></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In 2016 I'm really excited because of the acceleration and rapid maturity of several key technologies that have the potential to play very well together. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Containers, mostly via <a href="https://www.docker.com/">Docker</a>, have brought easy authoring and immutable infrastructure into mainstream software development. As an example, the other day, instead of hand installing Kafka and Zookeeper onto a VM and then doing it again by hand when I needed to grow my test Kafka cluster, I just typed "docker run...", pointing to a <a href="https://hub.docker.com/r/spotify/kafka/">Kafka/ZK image</a> built and published to Docker Hub by <a href="http://spotify.com/">Spotify</a>. I got to take advantage of all of their hard work, and save the potential multiple hours required to get that image working correctly. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<a href="http://kubernetes.io/">Kubernetes</a> and Cloud Foundry are viable orchestration mechanisms for distributed, container based applications, handling deployment, scaling, and failure remediation -- deployment and scaling are things that we used to do by hand, late at night, on pins and needles. Dealing with failure usually meant doubling your hardware, or writing complex startup scripts customized to each application. Both approaches are quite differently opinionated, and I can see the merits of each one for different use cases, sometimes in the same overall application stack! </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<a href="http://mesos.apache.org/">Mesos</a> has emerged as an intermediate resource management layer that abstracts the underlying IaaS away. The emergence of next generation big data workloads running on Mesos is something that really excites me as an ex service owner who struggled to justify value of the insights provided by using big data technologies against the steep costs of hardware. Having a layer that allocates a finite resources and maximizes resource allocation across very diverse workloads makes the start up cost of new experimental data investigation much lower, and therefore much more likely. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
All of these technologies are evolving at an incredible rate. I'm excited to see, and hopefully play a role in delivering, the next generation of platforms that make these technologies easy to consume and manage, and allow engineering teams to focus on features instead of infrastructure. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
I'm trying to post more this year - the last 18 months have been a heads down, get it done, tactical march. That was great, but I think I miss key insights when I don't occasionally digest and reflect on what is going on around me. I hope to do more of that here over the next year. It's a new year's resolution, hopefully one that will last longer than the one I made about not eating sugar :)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<br />
<br /></div>
Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure? (March 18, 2014)
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="p1">
<h3 style="text-align: left;">
A Swing And A Miss</h3>
In my <a href="http://arunxjacob.blogspot.com/2014/02/making-sense-of-unstructured-text-in.html">last post</a> I had trained a Bayesian classifier using a dataset pulled from <a href="http://sitejabber.com/">sitejabber.com</a>, which provides reviews of ecommerce sites. I had pulled that data for a single site. I then trained and tested the data -- and found that even though my classifier performed at 83%, it had completely mis-classified all positive reviews.<br />
<br />
As noted in the <a href="http://arunxjacob.blogspot.com/2014/02/making-sense-of-unstructured-text-in.html">last post</a>, only 20% of my original review data was positive -- 55 records out of the total of approximately 275 records. This leads me to three questions:<br />
<ol style="text-align: left;">
<li>Did I really test and validate the data in the most effective way?</li>
<li>As this is a bayesian classifier, will increasing the amount of positive data help the classifier identify positive data more effectively?</li>
<li>If so, how do I go about increasing the amount of positive data from a finite data set? </li>
</ol>
<h3 style="text-align: left;">
Giving the classifier another chance with K fold validation</h3>
<div>
Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and train validations. That doesn't give me guaranteed coverage of my entire data set or a valid, reproducible process upon which I can try improvements.</div>
<div>
<br /></div>
<div>
I can get that coverage and reproducibility by using <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation">k fold validation</a> across the data.<br />
<br />
K fold validation works like this:<br />
<br />
<ol>
<li>break the dataset into K equivalent subsets.</li>
<li>hold one of the subsets out for testing.</li>
<li>use all of the other subsets for training. </li>
<li>train the data on the k-1 subsets, test it on the kth subset. </li>
<li>rotate through all subsets - repeat 2-4, holding out a different subset each time. </li>
<li>average the accuracy of all test+train processes.</li>
</ol>
When I re-ran my tests using K-fold validation with 10 folds, I got an average accuracy across the entire dataset of <b>84.6%</b>, which is different from the 83% score that I had gotten doing 'randomized' tests.<br />
<br />
In this baseline run I implicitly 'stratified' my test and training data -- all test and training data folds had the same proportion of positive and negative reviews, in this case there was roughly a 1:5 ratio between positive and negative reviews.<br />
<br />
Getting data to be k foldable involves two steps: dividing into k folds, and building test data from one fold and training data from all of the rest. I've done these steps separately so that I can iterate through all k test and training sets with the same k folds.<br />
<br />
This is the method I used to split an array into k folds:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> partitionArray(self,partitions, array):</i></div>
<div class="p1">
<i> <span class="s2">"""</span></i></div>
<div class="p2">
<i> @param partitions - the number of partitions to divide array into</i></div>
<div class="p2">
<i> @param array - the array to divide</i></div>
<div class="p2">
<i> @return an array of the partitioned array parts (array of <span class="s3">subarrays</span>)</i></div>
<div class="p2">
<i> """</i></div>
<div class="p1">
<i> nextOffset = incrOffset = len(array)/partitions</i></div>
<div class="p1">
<i> remainder = len(array)%partitions</i></div>
<div class="p1">
<i> lastOffset = <span class="s4">0</span></i></div>
<div class="p1">
<i> partitionedArray = []</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">for</span> i <span class="s1">in</span> range(partitions):</i></div>
<div class="p1">
<i> partitionedArray.append(array[lastOffset:nextOffset])</i></div>
<div class="p1">
<i> lastOffset= nextOffset</i></div>
<div class="p1">
<i> nextOffset += incrOffset</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> partitionedArray[i].extend(array[incrOffset:incrOffset + remainder])</i></div>
<div class="p3">
<i> </i></div>
<br />
<div class="p1">
<i> <span class="s1">return</span> partitionedArray</i></div>
<div class="p1">
<i><br /></i></div>
<div class="p1">
This is the method I used to build test and train sets, holding out the partition specified by the <i>iteration</i> parameter. It assumes I'm handing it two k-partitioned arrays, one with bad reviews and one with good reviews.</div>
<div class="p1">
<br /></div>
<div class="p1">
<i> <span class="s1">def</span> buildKFoldValidationSets(self,folds,iteration, reviewsByRating):</i></div>
<div class="p1">
<i> <span class="s2">"""</span></i></div>
<div class="p2">
<i> build test and training sets</i></div>
<div class="p2">
<i> @param iteration - the offset of the arrays to hold out</i></div>
<div class="p2">
<i> @param reviewsByRating - the set of reviews to build from</i></div>
<div class="p2">
<i> @return test and training arrays</i></div>
<div class="p2">
<i> """</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> test = []</i></div>
<div class="p1">
<i> test.extend(reviewsByRating[<span class="s3">1</span>][iteration])</i></div>
<div class="p1">
<i> test.extend(reviewsByRating[<span class="s3">5</span>][iteration])</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> training = []</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">for</span> i <span class="s1">in</span> range(folds):</i></div>
<div class="p1">
<i> <span class="s1">if</span> i == iteration:</i></div>
<div class="p1">
<i> <span class="s1">continue</span></i></div>
<div class="p1">
<i> training.extend(reviewsByRating[<span class="s3">1</span>][i])</i></div>
<div class="p1">
<i> training.extend(reviewsByRating[<span class="s3">5</span>][i])</i></div>
<div class="p3">
<i><br /></i></div>
<div class="p1">
</div>
<div class="p1">
<i> <span class="s1">return</span> training, test</i></div>
</div>
<h3 style="text-align: left;">
Increasing The Data Set with Sampling</h3>
<div>
How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach. In Bayesian classification, there is no information that depends on the sentence structure of the review text or the sequence of words, just words and word frequency counts. And the features (the words) are assumed to be independent from one another.</div>
<div>
<br /></div>
<div>
How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews. In order to increase the positive data set I need to generate more positive reviews.<br />
<br />
Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those bags of words need to line up to the overall word frequency distribution of the entire review set. </div>
<div>
<br /></div>
<div>
One way to do this is to build the data from the data that already exists, by taking random samples from an array that contains all the words across the set of positive reviews in the training data.<br />
<br />
Pretend the following sentence is actually a review:</div>
<div>
<i style="background-color: white;"> The big big green caterpillar ate the small green leaf.</i></div>
<div>
<br />
putting the words in an array that looks like this: </div>
<div>
<span style="background-color: white;"> <i>somearray = ['t</i><i>he','big','big','green','caterpillar','ate','the','small','green','leaf'</i><i>]</i></span></div>
<div>
<br /></div>
<div>
I can sample that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf', and a 1/5 chance of being 'big'. I can extend the sample set to be as large as I want -- covering multiple sentences, a review, multiple reviews, etc.<br />
<br />
In this case I'm 'sampling with replacement', meaning that I don't remove the sample I get from the sampled set, which means that the probability of picking a word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.<br />
<br />
In Python sampling with replacement looks like this:<br />
<br />
<span style="background-color: white;"> <i>word = somearray[random.randint(0,len(somearray))] </i></span><br />
<br /></div>
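As a quick sanity check (this snippet is mine, not part of the original analysis code), sampling with replacement a large number of times reproduces the word probabilities of the source array:<br />
<pre>import random
from collections import Counter

somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

# draw 10,000 samples with replacement and look at the empirical frequencies
samples = [somearray[random.randint(0, len(somearray)-1)] for _ in range(10000)]
counts = Counter(samples)

print counts['big'] / 10000.0   # roughly 0.2
print counts['leaf'] / 10000.0  # roughly 0.1</pre>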
<div>
I use this method to create reviews comprised of words randomly selected from the distribution of positive training words, and make sure the review length is the average length of all real positive reviews:<br />
<br />
<div class="p1">
<i style="background-color: white;"><span class="s1">def</span> createReview(self,textFreqDist,reviewLength):</i></div>
<div class="p1">
<i style="background-color: white;"> <span class="s2">"""</span></i></div>
<div class="p2">
<i style="background-color: white;"> @param textFreqDist - the array containing the frequency distribution of words to choose from.</i></div>
<div class="p2">
<i style="background-color: white;"> @param reviewLength - the length of the review (in words) to build</i></div>
<div class="p2">
<i style="background-color: white;"> @return the generated review as a string</i></div>
<div class="p2">
<i style="background-color: white;"> """</i></div>
<div class="p1">
<i style="background-color: white;"> randLen = len(textFreqDist)</i></div>
<div class="p1">
<i style="background-color: white;"> reviewStr = <span class="s2">""</span></i></div>
<div class="p3">
<i style="background-color: white;"> </i></div>
<div class="p1">
<i style="background-color: white;"> <span class="s1">for</span> <span class="s3">i</span> <span class="s1">in</span> range(reviewLength):</i></div>
<div class="p1">
<i style="background-color: white;"> reviewStr += (textFreqDist[random.randint(<span class="s4">0</span>,randLen-<span class="s4">1</span>)] + <span class="s2">' '</span>)</i></div>
<div class="p3">
<i style="background-color: white;"> </i></div>
<span style="background-color: white;"><span style="background-color: #cccccc;"><br /></span>
</span><br />
<div class="p1">
<i style="background-color: white;"> <span class="s1">return</span> reviewStr</i></div>
<h3>
A Cautionary Note on Overfitting</h3>
</div>
<div>
When I first did the positive 'boost', I was getting really good results....really, really good results. 99% accuracy on a test set was a number that seemed too good to be true. And it was. </div>
<div>
<br /></div>
<div>
In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. </div>
<div>
<br /></div>
<div>
When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. </div>
<h3 style="text-align: left;">
New Test Results</h3>
<div>
I ran the same 10 fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified K fold validation -- by boosting the number of positive reviews with resampling of the training word data, I am altering the positive to negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same. The <a href="https://github.com/arunxarun/reviewanalysis">code</a> used to train and test the data is the same as <a href="http://arunxjacob.blogspot.com/2014/01/making-sense-of-unstructured-text-in_22.html">before</a>.</div>
<div>
<br /></div>
<div>
My test results averaged to <b>89.4%</b>, an improvement from <b>84.6%</b>. However, when I look at the errors more closely, I see that most of the errors are still due to mis-classifying positive reviews, which is interesting, given that I've boosted positive training data to be 50% of the training set. In the base training run my best effort mis-classified 60% of the positive reviews, and my worst efforts mis-classified 100% of the positive reviews. In the boosted training run my best effort mis-classified 20% of the positive reviews, and my worst effort mis-classified 60% of the positive reviews. </div>
<h3 style="text-align: left;">
Summary</h3>
<div>
This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, I only had to account for word frequency. </div>
<div>
<br /></div>
<div>
If I were to do this over again, I would do the following: </div>
<div>
<ol style="text-align: left;">
<li>If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.</li>
<li>k-fold validation from the beginning to get the average accuracy per approach.</li>
<li>investigate mis-classification errors before doing any optimization! </li>
</ol>
</div>
<div>
Most of my time was spent analyzing and building the optimal training data set. The biggest improvement made was not in tweaking the algorithm, but 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was to not examine training errors immediately.</div>
<div>
<br /></div>
<div>
To get improvements, I couldn't treat the algorithm as a black box, I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this approach wouldn't work with a classifier that assumed some level of dependence between words in a review text -- I'd have to calculate that dependency in order to generate reviews. </div>
<div>
<br /></div>
<div>
A final note: this is a classifier that was trained on a single source of reviews. That's great to classify more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approaches taken would work if we had travel review site training data. </div>
<div>
<br /></div>
Potential next steps include:<br />
<ol style="text-align: left;">
<li>Getting (more) new data from a different source.</li>
<li>Trying the bayes classifier on that data</li>
<li>Trying a different classifier, e.g. the maxent classifier on the same data</li>
<li>Going deeper into sentiment: what entities were positive / negative sentiment directed at?</li>
</ol>
<br />
<br /></div>
</div>
Making Sense of Unstructured Text in Online Reviews Part 3: Trying to Improve Classifier Accuracy (February 4, 2014)
<div dir="ltr" style="text-align: left;" trbidi="on">
This post is part of a series where I try to classify online review text in more and more concrete ways. Right now I'm training a classifier to accurately classify one (bad) vs five (good) starred reviews. In the <a href="http://arunxjacob.blogspot.com/2014/01/making-sense-of-unstructured-text-in_22.html">last post</a> I had done some initial training and testing of an NLTK Bayesian classifier. In this post I want to see if I can improve the accuracy score of my classifier by getting smarter about which features I include.<br />
<br />
In the last post I had experimented with varying the size of the feature set, and had found that while encoding more features into a classifier during training helps accuracy, there is an eventual accuracy ceiling. My feature set came from taking the top N words from a frequency distribution of all words in the reviews text. Here is what the accuracy curve looks like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-YTDUDHOxxdc/Uu8qyhxj2CI/AAAAAAAA6DQ/kxm9p6N3ItM/s1600/Screen+Shot+2014-02-02+at+9.35.46+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="552" src="https://3.bp.blogspot.com/-YTDUDHOxxdc/Uu8qyhxj2CI/AAAAAAAA6DQ/kxm9p6N3ItM/s1600/Screen+Shot+2014-02-02+at+9.35.46+PM.png" width="640" /></a></div>
One other way to improve accuracy is to address the 'quality' of the feature set by looking at features not only in terms of their frequency across the training corpus, but also in terms of their relative frequencies across classifications.<br />
<br />
In the review classification done so far, individual words are the features. I'm going to try to 'tune' feature sets in several different ways -- I have no idea if these will work, but they seem reasonable. I'm going to call these attempts hypotheses, because my goal is to prove them to be true or false, with relatively minimal effort.<br />
<h3 style="text-align: left;">
Hypothesis 1: Throw away features with a low 'frequency differential'</h3>
My hypothesis is that there are features that have a much higher chance of being in a negative review than a positive review, and vice versa. Those are the features that we want to keep. Other features are ones that have approximately the same chance of being in either type of review (positive or negative).<br />
<br />
<b><i> P(review rating | features) = P(features, review rating)</i></b><br />
<b><i><br /></i></b>
In the equation above, the <i style="font-weight: bold;">P(features, review rating) </i>term is the multiplied probabilities of each P(feature, review rating). If I'm looking for a higher overall probability that a document is one star over five star or vice versa, <i>having per feature probabilities that are similar for one star or five star reviews means that my overall probabilities for one and five star will be close to equal, which could tip classification results 'the other way' and increase my error rate</i>.<br />
<div>
<b><i><br /></i></b></div>
<div>
I can validate this hypothesis by filtering out those low probability differential features and keeping the ones that have a high probability differential: a high difference between P(feature, review rating) for {1 star, 5 star} ratings.</div>
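As a toy illustration of the idea (made-up numbers, not taken from the review data): a word with nearly identical frequencies in both classes barely moves the product of probabilities, while a word with a large differential dominates it:<br />
<pre># hypothetical per-class word frequencies
p_1star = {'the': 0.050, 'shipping': 0.020}
p_5star = {'the': 0.049, 'shipping': 0.001}

# 'the' contributes almost nothing to separating the classes...
print p_1star['the'] / p_5star['the']            # ~1.02
# ...while 'shipping' pulls the product strongly toward 1 star
print p_1star['shipping'] / p_5star['shipping']  # 20.0</pre>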
<div>
<h3 style="text-align: left;">
Building The Feature Set</h3>
</div>
<div>
I had trained and tested the classifier by taking input data, splitting it into a test and a training set, then training and testing. I will recreate that process now to get the raw data so that I can 'remove' common terms with low probability:</div>
<div>
<br /></div>
<div>
<div class="p1">
<span style="background-color: #cccccc;"> <i>sjr = SiteJabberReviews(pageUrl,filename)</i></span></div>
<div class="p1">
<i style="background-color: #cccccc;"> sjr.load()</i></div>
<div class="p1">
<i style="background-color: #cccccc;"> asd = AnalyzeSiteData()</i></div>
<div class="p1">
<i> </i><br />
I ended up recoding the building of the training and test data so that the data sets being built had a more even distribution of ratings across them:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> generateLearningSetsFromReviews(self,reviews, ratings,buckets):</i></div>
<div class="p1">
<i> </i></div>
<div class="p4">
<i><span class="s4"> </span># check to see that percentages sum to 1</i></div>
<div class="p4">
<i><span class="s4"> </span># get collated sets of reviews by rating. </i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> val = <span class="s5">0.0</span></i></div>
<div class="p1">
<i> <span class="s1">for</span> pct <span class="s1">in</span> buckets.values():</i></div>
<div class="p1">
<i> val += pct</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">if</span> val > <span class="s5">1.0</span>:</i></div>
<div class="p2">
<i><span class="s4"> </span><span class="s1">raise</span><span class="s4"> </span>'percentage values must be floats and must sum to 1.0'</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> reviewsByRating = defaultdict(list)</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">for</span> reviewSet <span class="s1">in</span> reviews:</i></div>
<div class="p1">
<i> <span class="s1">for</span> rating <span class="s1">in</span> ratings:</i></div>
<div class="p1">
<i> reviewList = [(self.textBagFromRawText(review.text), rating) </i></div>
<div class="p1">
<i><span class="s1"> for</span> review <span class="s1">in</span> reviewSet.reviewsByRating[rating]]</i></div>
<div class="p1">
<i> reviewsByRating[rating].extend(reviewList)</i></div>
<div class="p1">
<i> random.shuffle(reviewsByRating[rating]) <span class="s6"># mix up reviews from different reviewSets</span></i></div>
<div class="p3">
<i> </i></div>
<div class="p3">
<i> </i></div>
<div class="p4">
<i><span class="s4"> </span># break collated sets across all ratings into percentage buckets</i></div>
<div class="p1">
<i> learningSets = defaultdict(list) </i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">for</span> rating <span class="s1">in</span> ratings:</i></div>
<div class="p1">
<i> sz = len(reviewsByRating[rating]) </i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> lastidx = <span class="s5">0</span></i></div>
<div class="p1">
<i> <span class="s1">for</span> (bucketName, pct) <span class="s1">in</span> buckets.items():</i></div>
<div class="p1">
<i> idx=lastidx + int(pct*sz)</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> learningSets[bucketName].extend(reviewsByRating[rating][lastidx:idx])</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> lastidx = idx</i></div>
<div class="p3">
<i> </i></div>
<br />
<div class="p1">
<i> <span class="s1">return</span> learningSets</i></div>
</div>
<div class="p1">
<i></i><br />
<i><br /></i>When I built up the training data using this method, the sets were returned in the buckets[] array:<br />
<br />
<div class="p1">
<i>buckets = asd.generateLearningSetsFromReviews([sjr],[<span class="s1">1</span>,<span class="s1">5</span>],{<span class="s2">'training'</span>: <span class="s1">0.8</span>,<span class="s2">'test'</span>:<span class="s1">0.2</span>})</i></div>
<br />
<div class="p2">
<br /></div>
Each training set in this list is actually an array of (textBag, rating) tuples:<br />
<i> buckets = [[(bagOfText,rating)...],[..]]</i></div>
<div class="p1">
<i><br /></i>
I want to get frequency distributions of common terms from one and five star reviews in the training data, so that I can find terms with a high probability differential: <i> </i><i> </i><br />
<i><br /></i>
<i> </i><i> </i><i> </i><i># get common terms and frequency differentials</i><br />
<i><br /></i>
<i> </i><i> </i><i> allWords1 = [w <span class="s1">for</span> (textBag,rating) <span class="s1">in</span> buckets[<span class="s2">'training'</span>] <span class="s1">for</span> w <span class="s1">in</span> textBag <span class="s1">if</span> rating == <span class="s3">1</span>]</i><br />
<div class="p1">
<i> fd1 = FreqDist(allWords1)</i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> allWords5 = [w <span class="s1">for</span> (textBag,rating) <span class="s1">in</span> buckets[<span class="s2">'training'</span>] <span class="s1">for</span> w <span class="s1">in</span> textBag <span class="s1">if</span> rating == <span class="s3">5</span>]</i></div>
<div class="p1">
<i> fd5 = FreqDist(allWords5)</i></div>
<i><br /></i>
<i> </i><i> </i><i> </i><i>commonTerms = [w for w in fd1.keys() if w in fd5.keys()]</i><br />
<i><br /></i>
<i> </i><i> </i><i> </i><i># now get frequency differentials</i><br />
<i><br /></i>
<i> </i><i> </i><i> </i><i>commonTermFreqs = [(w,fd1.freq(w),fd5.freq(w),abs(fd1.freq(w)-fd5.freq(w))) </i><br />
<i> </i><i> </i><i> </i><i> </i><i>for w in commonTerms]</i><br />
<i><br /></i>
<i> </i><i> </i><i> </i><i>commonTermFreqs.sort(key=itemgetter(3),reverse=True)</i><br />
<i><br /></i>
Now we've got common terms, sorted by their absolute differential between frequency distributions in 1 and 5 star reviews.<br />
<br />
If I plot this distribution:<br />
<br />
<pre>import matplotlib.pyplot as plt  # assumed import

freqdiffs = [diff for (a,b,c,diff) in commonTermFreqs]
plt.plot(freqdiffs)
plt.show()</pre>
<br />
I can see that it falls off sharply:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-OCdxEdfU3uc/UvHOQ4b78LI/AAAAAAAA6Do/5mvXYfUKgBU/s1600/Screen+Shot+2014-02-04+at+9.37.55+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="552" src="https://3.bp.blogspot.com/-OCdxEdfU3uc/UvHOQ4b78LI/AAAAAAAA6Do/5mvXYfUKgBU/s1600/Screen+Shot+2014-02-04+at+9.37.55+PM.png" width="640" /></a></div>
This looks like a <a href="http://en.wikipedia.org/wiki/Zipf's_law">Zipfian distribution</a>: <i>"given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word..."</i> </div>
<div class="p1">
<br />
The shape of this distribution implies that only a small subset of the terms actually have a frequency differential that really 'matters' in the hypothesis -- all terms aren't needed. I can start arbitrarily by keeping all terms with a frequency differential > 0.001 to quickly test the hypothesis. That leaves 131 of the original 688 common terms.<br />
<br />
Note that in getting and filtering common terms, I have not yet retained the terms that <b><i>very strongly</i></b> signal one review rating or the other: those are the terms that exist in <b>only</b> one review rating corpus. Even though the absence of those terms in the other corpus would make the calculation go to zero, they are 'smoothed out' by including them in that corpus and adding a very small value to the frequency of all terms in that corpus, which guarantees that there are no terms with a zero frequency and the Bayesian calculation won't zero out.<br />
<br />
I would need to add those terms into the set of terms that we filter by.<br />
<br />
The full set of filtered terms is comprised of both uncommon and filtered common words:<br />
<br />
<pre>filterTerms = [w for (w,x,y,diff) in commonTermFreqs if diff > 0.001]
fd1Only = [w for w in fd1.keys() if w not in fd5.keys()]
filterTerms.extend(fd1Only)
fd5Only = [w for w in fd5.keys() if w not in fd1.keys()]
filterTerms.extend(fd5Only)

defaultWordSet = set(filterTerms) # rename so I don't have to rewrite the encoding method</pre>
<br />
And I use those words as features identified at encoding time:<br />
<br />
<div class="p1">
<span class="s1"> <i>def</i></span><i> emitDefaultFeatures(tokenizedText):</i></div>
<div class="p1">
<i> <span class="s2">'''</span></i></div>
<div class="p2">
<i> @param tokenizedText: an array of text features</i></div>
<div class="p2">
<i> @return: a feature map from that text.</i></div>
<div class="p2">
<i> '''</i></div>
<div class="p1">
<i> tokenizedTextSet = set(tokenizedText)</i></div>
<div class="p1">
<i> featureSet = {}</i></div>
<div class="p1">
<i> <span class="s1">for</span> text <span class="s1">in</span> defaultWordSet:</i></div>
<div class="p1">
<i> featureSet[<span class="s2">'contains:%s'</span>%text] = text <span class="s1">in</span> tokenizedTextSet</i></div>
<div class="p3">
<i> </i></div>
<br />
<div class="p1">
<i> <span class="s1">return</span> featureSet</i></div>
<h3 style="text-align: left;">
Testing The Hypothesis</h3>
Now I can train the classifier: <i>asd.encodeData()</i> takes care of encoding features from the training and test sets by calling <i>emitDefaultFeatures()</i> for each review.<br />
<br />
<div class="p1">
<i>encodedTrainSet = asd.encodeData(rawTrainingSetData,emitDefaultFeatures )</i></div>
<div class="p1">
<i> classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)</i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)</i></div>
<br />
<div class="p1">
<i> <span class="s1">print</span> nltk.classify.accuracy(classifier, encodedTestSet)</i></div>
</div>
<div class="p1">
<br /></div>
<div class="p1">
And I get an accuracy of 0.83, the same accuracy I got with no manipulation of the feature set, which is 0.02 less than my optimal accuracy. Whoops.<br />
<h3 style="text-align: left;">
Detailed Error Analysis</h3>
<div>
There is one other step I can take to understand the accuracy of the classifier, and that is to analyze the errors made on the test set. If I know how I mis-classified the data, that can help me improve the classifier.</div>
<div>
<br /></div>
<div>
<div class="p1">
<i>shouldBeClassed1 = []</i></div>
<div class="p1">
<i> shouldBeClassed5 = []</i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> <span class="s1">for</span> (textbag, rating) <span class="s1">in</span> buckets[<span class="s2">'test'</span>]:</i></div>
<div class="p1">
<i> testRating = classifier.classify(emitDefaultFeatures(textbag))</i></div>
<div class="p1">
<i> <span class="s1">if</span> testRating != rating:</i></div>
<div class="p1">
<i> <span class="s1">if</span> rating == <span class="s3">1</span>:</i></div>
<div class="p1">
<i> shouldBeClassed1.append(textbag)</i></div>
<div class="p1">
<i> <span class="s1">else</span>:</i></div>
<div class="p1">
<i> shouldBeClassed5.append(textbag)</i></div>
</div>
<div>
<br /></div>
<div>
A quick check on the error arrays shows me that I've only made mistakes on the reviews that should be classified as positive:</div>
<div>
<br /></div>
<div>
<i> >>> print len(shouldBeClassed5)</i></div>
<div>
<i> 11</i></div>
<div>
<br /></div>
<div>
Wait a minute. That number looks familiar. Let me review the raw data again: </div>
<div>
<div class="p1">
<span class="s1"> >>>>print</span> len(sjr.reviewsByRating[<span class="s2">5</span>])</div>
<div class="p1">
55</div>
<div class="p1">
>>>><span class="s1">print</span> int(<span class="s2">0.2</span>*len(sjr.reviewsByRating[<span class="s2">5</span>]))</div>
<div class="p1">
11</div>
<div class="p1">
<br /></div>
<div class="p1">
This data shows that I mis-classified all 11 positive reviews in the test data: my error analysis found eleven mis-classified positive reviews, and I only had 11 positive reviews in the test set based on an 80% training/20% testing split.</div>
<div class="p1">
<br /></div>
<div class="p1">
A quick reversion to the original test method (which collected features from a FreqDist of all terms in the training data) shows that it mis-classified all 11 positive reviews as well.</div>
</div>
<h3 style="text-align: left;">
Summary</h3>
This was one attempt to improve classifier accuracy by trying something reasonable with the feature set -- removing features whose probability differential across 1 star and 5 star review corpuses was very small.<br />
<br />
While the numbers initially looked 'decent', deeper analysis shows that my classifier completely mis-classified positive reviews. In the future I'll do error analysis of classifiers before trying to theorize about what could make the classifier more accurate.<br />
<br />
Looking closer at the data, the data set had 55 total positive reviews and 273 total negative reviews. In other words only 20% of my data was actually positive review data.<br />
<br />
I had originally scraped only one reviewed site for data, but now I think I'm going to need to scrape more sites to get a more representative set of positive review data so that the classifier has more training examples.<br />
<br />
In my next post I'm going to try to collect a more representative 'set' of data, and also take a slightly different approach to validating my classifier. I'm going to do error analysis up front and attempt to correct my classifier based on the errors I see, then test the classifier against new test data -- testing a fixed classifier against the data I used to fix it will give me a false sense of accuracy, because the test data used to do error analysis has in effect become training data.<br />
<br />
<br />
<br />
<br /></div>
</div>
</div>
Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis (January 22, 2014)
<div dir="ltr" style="text-align: left;" trbidi="on">
In <a href="http://arunxjacob.blogspot.com/2014/01/making-sense-of-unstructured-text-in.html">part 1</a> I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).<br />
<br />
To do classification I'm going to stand on the shoulders of the giants -- specifically the giants who wrote and maintain the <a href="http://nltk.org/">NLTK</a> package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."<br />
<h3 style="text-align: left;">
Brief Recap</h3>
I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.<br />
To review: Here is how I would download data for all reviews from a review site:<br />
<br />
(note I've only implemented a 'scraper' for one site, http://sitejabber.com)<br />
<br />
<pre>from sitejabberreviews import SiteJabberReviews
from analyzesitedata import AnalyzeSiteData

pageUrl = 'reviews/www.zulily.com'
filename = "siteyreviews.pkl"

sjr = SiteJabberReviews(pageUrl,filename)
sjr.download(True) # this saves the reviews to the file specified above</pre>
<div class="p3">
<br /></div>
<div class="p3">
Once I've downloaded the data, I can always load it up from that file later:</div>
<div class="p3">
<br /></div>
<div class="p3">
<i> </i><i><span class="s1">pageUrl = </span>'reviews/www.zulily.com'</i></div>
<div class="p1">
<i><span class="s1"> filename = </span>"siteyreviews.pkl"</i></div>
<div class="p2">
<i> </i></div>
<div class="p3">
<i> sjr = SiteJabberReviews(pageUrl,filename) </i></div>
<div class="p3">
<i> sjr.load()</i><br />
<h3 style="text-align: left;">
Next Step: Bayesian Classification</h3>
<div>
NLTK comes with several built in classifiers, including a <a href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">Bayesian classifier</a>. There are much better explanations of Bayes theory than I could possibly provide, but the basic theory as it applies to text classification is this: the occurrence of a word across bodies of previously classified documents can be used to classify other documents as being in one of the input classifications. The existence of previously classified documents implies that the Bayesian classifier is a <a href="http://en.wikipedia.org/wiki/Supervised_learning">supervised classifier</a>, which means it must be trained with data that has already been classified.<br />
<br />
This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:<br />
<br />
<b><i>P(review rating | features) = P(features, review rating)/P(features)</i></b><br />
<br />
<div>
In other words, the probability that a review has a specific rating given its features is the probability of those features occurring together with that rating (as previously observed in already-rated documents), divided by the probability of the features themselves. Since the features are the words in the review, the denominator is the same no matter what the rating is, so that term effectively 'drops out' when comparing ratings. So the probability that a review has a specific rating is proportional to the product of the probabilities of the review's terms appearing in previously observed documents with that same rating.<br />
<br />
<b><i> P(review rating | features) &#8733; P(features, review rating)</i></b><br />
<br /></div>
<div>
</div>
<div>
This isn't completely true: there are some complexities in the details. For example: while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature) = 0, as this would make the above equation go to zero. In order to avoid that there are smoothing techniques that can be applied. These techniques basically apply a very small increment to the count of all features (including zero valued ones) so that there are no zero values, but the probability distribution essentially stays the same. The size of the increment depends on the values of the probabilities in the probability distribution of P(feature, label) for all features for a specific label.<br />
<br />
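To make the smoothing idea concrete, here's a minimal sketch in plain Python. The counts are made up, and this is not the exact estimator NLTK uses internally (I believe its NaiveBayesClassifier defaults to an expected likelihood estimate, which applies a similar small increment) -- it's just meant to show why the increment keeps a zero-count feature from zeroing out the whole product:<br />
<br />
<pre>
# Illustrative add-one ('Laplace') smoothing -- made-up counts, not NLTK internals.

# hypothetical counts of a feature appearing in 1-star vs 5-star reviews
counts = {
    'terrible': {'1-star': 40, '5-star': 0},   # never seen in a 5-star review
    'love':     {'1-star': 2,  '5-star': 35},
}
totals = {'1-star': 500, '5-star': 300}         # total feature observations per class
vocab_size = 3000                               # distinct features seen in training

def smoothed_prob(feature, label):
    # without the +1, P('terrible' | 5-star) would be 0, and any review
    # containing 'terrible' could never be classified as a 5-star review
    return (counts[feature][label] + 1.0) / (totals[label] + vocab_size)

for label in ('1-star', '5-star'):
    print("P(terrible | %s) = %.6f" % (label, smoothed_prob('terrible', label)))
</pre>
<br />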
Review data is awesome training data because there's lots of it, I can get it <a href="http://arunxjacob.blogspot.com/2014/01/making-sense-of-unstructured-text-in.html">easily</a>, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews, training it with one star and five star review data. This is a pretty simple, binary approach to review classification. </div>
</div>
<div>
<h3 style="text-align: left;">
Feature Set Generation, Training, and Testing</h3>
To train and initially test the NLTK Bayesian classifier, I need to do the following:<br />
<ol style="text-align: left;">
<li>Extract train and test data from my review data.</li>
<li>Encode train and test data.</li>
<li>Train the classifier with the encoded training data</li>
<li>Test the classifier with the encoded test data.</li>
<li>Investigate errors during the test</li>
<li>Modify training set and repeat as needed.</li>
</ol>
I've written a helper method to generate training and test data:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> generateTestAndTrainingSetsFromReviews(self,reviews, key, trainSetPercentage):</i></div>
<div class="p1">
<i> # generate tuples of (textbag,rating)</i></div>
<div class="p1">
<i> reviewList = [(self.textBagFromRawText(review.text), key) </i></div>
<div class="p1">
<i><span class="s1"> for</span> review <span class="s1">in</span> reviews.reviewsByRating[key]]</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> <span class="s1">return</span> reviewList[: int(trainSetPercentage*len(reviewList))],</i></div>
<div class="p1">
<i> reviewList[int(trainSetPercentage*len(reviewList)):] </i></div>
<br />
The <i>generateTestAndTrainingSetsFromReviews()</i> method calls <i>textBagFromRawText()</i>. In that method I create an array of words by splitting the raw text into sentences, lowercasing, and stripping punctuation and stop words:<br />
<br />
<div class="p1">
<i><span class="s1"> def</span> textBagFromRawText(self,rawText):</i></div>
<div class="p1">
<i> <span class="s2">'''</span></i></div>
<div class="p2">
<i> @param rawText: a string of whitespace delimited text, 1..n sentences</i></div>
<div class="p2">
<i> @return: the word tokens in the text, stripped of non text chars including punctuation</i></div>
<div class="p2">
<i> '''</i></div>
<div class="p1">
<i> rawTextBag = [] </i></div>
<div class="p1">
<i> sentences = re.split(<span class="s2">'[\.\(\)?!&,]'</span>,rawText)</i></div>
<div class="p1">
<i> <span class="s1">for</span> sentence <span class="s1">in</span> sentences:</i></div>
<div class="p1">
<i> lowered = sentence.lower()</i></div>
<div class="p1">
<i> parts = lowered.split()</i></div>
<div class="p1">
<i> rawTextBag.extend(parts)</i></div>
<div class="p3">
<i> </i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i> textBag = [w <span class="s1">for</span> w <span class="s1">in</span> rawTextBag <span class="s1">if</span> w <span class="s1">not</span> <span class="s1">in</span> stopwords.words(<span class="s2">'</span><span class="s3">english</span><span class="s2">'</span>)] </i></div>
<br />
<div class="p1">
<i> <span class="s1">return</span> textBag</i></div>
<br />
I generate test and training data for one and five star reviews using <i>generateTestAndTrainingSetsFromReviews()</i>:<br />
<br />
<i> # load helper objects</i><br />
<div class="p1">
<i> sjr = SiteJabberReviews(pageUrl,filename)</i></div>
<div class="p1">
<i> sjr.load()</i></div>
<div class="p1">
<i> asd = AnalyzeSiteData()</i></div>
<div class="p1">
<i><br /></i></div>
<div class="p1">
<i> trainingSet1, testSet1 = asd.</i><i> generateTestAndTrainingSetsFromReviews</i><i>(sjr, <span class="s1">1</span>, <span class="s1">0.8</span>)</i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> trainingSet5, testSet5 = asd.</i><i> generateTestAndTrainingSetsFromReviews</i><i>(sjr, <span class="s1">5</span>, <span class="s1">0.8</span>)</i></div>
<div class="p1">
<i> </i></div>
<div class="p1">
<i> rawTrainingSetData = []</i></div>
<div class="p1">
<i> rawTrainingSetData.extend(trainingSet1)</i></div>
<div class="p1">
<i> rawTrainingSetData.extend(trainingSet5)</i></div>
<div class="p2">
<i> random.shuffle(rawTrainingSetData)</i></div>
<div class="p2">
<i><br /></i></div>
<div class="p1">
<i> rawTestSetData = []</i></div>
<div class="p1">
<i> rawTestSetData.extend(testSet1)</i></div>
<br />
<div class="p1">
<i> rawTestSetData.extend(testSet5)</i></div>
<i> random.shuffle(rawTestSetData)</i><br />
<br />
With training and test data built, I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.<br />
<br />
<div class="p1">
<i># for raw Training Data, generate all words in the data</i></div>
<div class="p2">
<i> </i><br />
<i> all_words = [w <span class="s1">for</span> (words, <span class="s2">condition</span>) <span class="s1">in</span> rawTrainingSetData <span class="s1">for</span> w <span class="s1">in</span> words]</i></div>
<div class="p2">
<i> fdTrainingData = FreqDist(all_words)</i></div>
<div class="p1">
<i><span class="s3"> </span></i><br />
<i> # take an arbitrary subset of these</i></div>
<div class="p2">
<i> defaultWordSet = fdTrainingData.keys()[:<span class="s4">1000</span>]</i></div>
<div class="p3">
<i> </i></div>
<div class="p2">
<i> <span class="s1">def</span> emitDefaultFeatures(tokenizedText):</i></div>
<div class="p2">
<i> <span class="s5">'''</span></i></div>
<div class="p4">
<i> @param tokenizedText: an array of text features</i></div>
<div class="p4">
<i> @return: a feature map from that text.</i></div>
<div class="p4">
<i> '''</i></div>
<div class="p2">
<i> tokenizedTextSet = set(tokenizedText)</i></div>
<div class="p2">
<i> featureSet = {}</i></div>
<div class="p2">
<i> <span class="s1">for</span> text <span class="s1">in</span> defaultWordSet:</i></div>
<div class="p2">
<i> featureSet[<span class="s5">'contains:%s'</span>%text] = text <span class="s1">in</span> tokenizedTextSet</i></div>
<div class="p3">
<i> </i></div>
<i> <span class="s1">return</span> featureSet</i><br />
<div class="p1">
<br /></div>
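As a quick illustration of what that produces (the words here are hypothetical -- in practice the returned map has one 'contains:' entry per word in defaultWordSet), calling emitDefaultFeatures() on a tokenized review looks something like this:<br />
<br />
<pre>
# hypothetical call, assuming 'love', 'shipping' and 'terrible' are in defaultWordSet
emitDefaultFeatures(['love', 'fast', 'shipping'])
# => {'contains:love': True,
#     'contains:shipping': True,
#     'contains:terrible': False,
#     ... one entry per word in defaultWordSet ...}
</pre>
<br />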
<div class="p1">
That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: </div>
<div class="p1">
<br /></div>
<div class="p1">
<i><span class="s1"> def</span> encodeData(self,trainSet,encodingMethod):</i></div>
<div class="p1">
<i> <span class="s1">return</span> [(encodingMethod(tokenizedText), rating) <span class="s1">for</span> (tokenizedText, rating) <span class="s1">in</span> trainSet]</i></div>
<div>
<div>
<br /></div>
<div>
(Aside: I love <a href="http://www.pythonforbeginners.com/lists/list-comprehensions-in-python/">list comprehensions</a>!) With the encoding method in place, we can encode the training data and train the classifier as follows:</div>
<div>
<br />
<div class="p1">
<i>encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)</i></div>
<div class="p1">
</div>
<div class="p1">
<i> classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)</i></div>
<div class="p1">
<br /></div>
</div>
<div>
Once we have trained the classifier, we will test its accuracy against test data. As we already know the classification of the test data, accuracy is simple to calculate.<br />
<br />
<div class="p1">
<i>encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)</i></div>
<div class="p1">
<i> <span class="s1">print</span> nltk.classify.accuracy(classifier, encodedTestSet)</i></div>
<div>
<br />
This gives me an accuracy of 0.83, meaning 83% of the time I will be correct. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?<br />
<br />
It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy drops to 75%.<br />
<br />
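NLTK also makes it easy to peek inside the trained classifier and at the individual mistakes it makes. A minimal sketch against the objects defined above (show_most_informative_features() and classify() are part of the NaiveBayesClassifier API):<br />
<br />
<pre>
# which 'contains:word' features most strongly separate 1-star from 5-star reviews
classifier.show_most_informative_features(20)

# walk the encoded test set and print what the classifier got wrong
for featureSet, actualRating in encodedTestSet:
    guess = classifier.classify(featureSet)
    if guess != actualRating:
        print("guessed %s, actual was %s" % (guess, actualRating))
</pre>
<br />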
<h3 style="text-align: left;">
Summary</h3>
The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.<br />
<br />
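One possible way to measure that differential, sketched against the variables defined above -- the per-class FreqDists are built the same way as fdTrainingData, and the 0.01 cutoff is an arbitrary placeholder rather than a value I've validated:<br />
<br />
<pre>
from nltk import FreqDist

# one frequency distribution per rating, built from the raw training tuples
fd1 = FreqDist(w for (words, rating) in rawTrainingSetData if rating == 1 for w in words)
fd5 = FreqDist(w for (words, rating) in rawTrainingSetData if rating == 5 for w in words)

def probDifferential(word):
    # difference between the word's relative frequency in 1-star and 5-star text
    return abs(fd1.freq(word) - fd5.freq(word))

# keep only features whose probability differs meaningfully across the two classes
groomedWordSet = [w for w in defaultWordSet if probDifferential(w) > 0.01]
</pre>
<br />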
<br />
<br /></div>
</div>
</div>
</div>
</div>
<div>
<div>
<div>
</div>
</div>
</div>
</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com21tag:blogger.com,1999:blog-8840067776782114927.post-29840171809047266512014-01-04T22:49:00.001-08:002014-01-04T22:49:01.201-08:00Making Sense of Unstructured Text in Online Reviews, Part 1<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
<b>Introduction</b></h3>
I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason for the great time we had was the amount of up front research that we put into finding the right places to stay, by researching the hell out of them via <a href="http://tripadvisor.com/">tripadvisor</a> reviews.<br />
<br />
After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kind of issues (focusing on the dislikes here) grew or abated over time.<br />
<br />
I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction. <br />
<br />
One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!<br />
<br />
I've broken this effort into two+ phases: getting the data, analyzing/profiling the data, and tbd next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)<br />
<h3 style="text-align: left;">
Phase 1: Getting The Data</h3>
I had been out of the screen scraping loop for a while. I had heard of <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, the python web scraping utility. But I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.<br />
<br />
Here is the bulk of the logic I used to pull pagination data and then use that to navigate to review pages from <a href="http://sitejabber.com/">sitejabber.com</a> (I'm focusing on ecommerce sites first):<br />
<br />
<div class="p1">
<span class="s1"> </span><i># first get the pages we need to navigate to to get all reviews for this site. </i></div>
<div class="p2">
<i> page = urllib2.urlopen(self.pageUrl)</i></div>
<div class="p2">
<i> soup = <b>BeautifulSoup(page)</b></i></div>
<div class="p3">
<i> </i></div>
<div class="p2">
<i> pageNumDiv = <b>soup.find(<span class="s2">'</span><span class="s3">div</span><span class="s2">'</span>,{<span class="s2">'class'</span>:<span class="s2">'page_numbers'</span>})</b></i></div>
<div class="p3">
<i> </i></div>
<div class="p2">
<i> anchors = <b>pageNumDiv.find_all(<span class="s2">'a'</span>)</b></i></div>
<div class="p3">
<i> </i></div>
<div class="p2">
<i> urlList = []</i></div>
<div class="p2">
<i> urlList.append(self.pageUrl)</i></div>
<div class="p2">
<i> <span class="s4">for</span> anchor <span class="s4">in</span> anchors:</i></div>
<div class="p2">
<i> urlList.append(self.base + <b>anchor[<span class="s2">'</span><span class="s3">href</span><span class="s2">'</span>]</b>)</i></div>
<div class="p3">
<i> </i></div>
<div class="p1">
<i><span class="s1"> </span># with all pages set, pull each page down and extract review text and rating.</i></div>
<div class="p2">
<i> <span class="s4">for</span> url <span class="s4">in</span> urlList: </i></div>
<div class="p2">
<i> page = urllib2.urlopen(url)</i></div>
<div class="p2">
<i> soup = <b>BeautifulSoup(page)</b></i></div>
<div class="p2">
<i> divs = <b>soup.find_all(<span class="s2">'</span><span class="s3">div</span><span class="s2">'</span>,id=re.compile(<span class="s2">'ReviewRow-.*'</span>))</b></i></div>
<div class="p3">
<i> </i></div>
<div class="p3">
<i> </i></div>
<div class="p2">
<i> <span class="s4">for</span> div <span class="s4">in</span> divs:</i></div>
<div class="p2">
<i> text = <b>div.find(<span class="s2">'p'</span>,id=re.compile(<span class="s2">'ReviewText-.*'</span>)).text</b></i></div>
<div class="p1">
</div>
<div class="p2">
<i> rawRating = <b>div.find(itemprop=<span class="s2">'ratingValue'</span>)[<span class="s2">'content'</span>]</b></i></div>
<div class="p1">
<br /></div>
<div class="p1">
Note the need to download the page first: I used <b>urllib2.urlopen()</b> to get the page. I then created a BeautifulSoup representation of the page:</div>
<div class="p1">
<br /></div>
<div class="p1">
<i><b> soup = BeautifulSoup(page)</b></i></div>
<div class="p1">
<i><b><br /></b></i></div>
<div class="p1">
Once I had that, it was a matter of finding what I needed. I used <b><i>find()</i></b> and <b><i>find_all()</i></b> to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:</div>
<div class="p1">
<br /></div>
<div class="p1">
<i><b><span class="s1"> for</span> div <span class="s1">in</span> divs:</b></i></div>
<div class="p1">
<i><b> text = div.find(<span class="s2">'p'</span>,id=re.compile(<span class="s2">'ReviewText-.*'</span>)).text</b></i></div>
<div class="p1">
</div>
<div class="p1">
<b><i>rawRating = div.find(itemprop=<span class="s1">'ratingValue'</span>)[<span class="s1">'content'</span>]</i></b></div>
<div class="p1">
<b><i><br /></i></b></div>
<div class="p1">
<b style="font-style: italic;">text </b>above retrieves inner text from any element. Element attributes are accessed as keys from the element, like the <i style="font-weight: bold;">'content'</i> one above. The rawRating value was actually pulled from a meta tag that was in the ReviewText div above: </div>
<div class="p1">
<br /></div>
<div class="p1">
<b><span style="font-family: monospace; white-space: pre-wrap;">&lt;meta itemprop="ratingValue" content="1.0" /&gt;</span></b></div>
<div class="p1">
<br /></div>
<div class="p1">
</div>
<div class="p4">
find()/find_all() are very powerful; a lot more detail is provided in the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/">documentation</a>. They can search by element ID, by specific attributes (the itemprop attribute above is an example), and regexes can be used to match multiple elements. </div>
<div class="p4">
<br /></div>
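For reference, here are the search forms I ended up using above, collected in one place (the tag names and attributes are from the sitejabber pages; other sites' markup will obviously differ, and the filename is just a placeholder for any saved page):<br />
<br />
<pre>
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('saved_review_page.html').read())

soup.find('div', {'class': 'page_numbers'})           # by tag name + attribute value
soup.find_all('div', id=re.compile('ReviewRow-.*'))   # regex match on an attribute
soup.find(itemprop='ratingValue')                     # by an arbitrary attribute name
</pre>
<br />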
<div class="p4">
Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a <b><i>reviewsByRating</i></b> map:</div>
<div class="p4">
<br /></div>
<div class="p1">
<span class="s1"> <i>for</i></span><i> div <span class="s1">in</span> divs:</i></div>
<div class="p1">
<i> text = div.find(<span class="s2">'p'</span>,id=re.compile(<span class="s2">'ReviewText-.*'</span>)).text</i></div>
<div class="p1">
<i> rawRating = div.find(itemprop=<span class="s2">'ratingValue'</span>)[<span class="s2">'content'</span>]</i></div>
<div class="p2">
<i> </i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> <b>r = Review(text,rawRating)</b></i></div>
<div class="p2">
<i> </i></div>
<div class="p1">
<i> <span class="s1">if</span> self.reviewsByRating.has_key(r.rating):</i></div>
<div class="p1">
<i> <b>reviews = self.reviewsByRating[r.rating]</b></i></div>
<div class="p1">
<i> <span class="s1">else</span>:</i></div>
<div class="p1">
<i> reviews = []</i></div>
<div class="p1">
<i> <b>self.reviewsByRating[r.rating] = reviews</b></i></div>
<div class="p2">
<i> </i></div>
<div class="p4">
<i> <b>reviews.append(r) </b></i></div>
<div class="p4">
<br /></div>
<div class="p4">
and flushed that map to disk using <a href="https://wiki.python.org/moin/UsingPickle">pickle</a>:</div>
<div class="p4">
<br /></div>
<div class="p1">
<i><span class="s1"> def</span> saveToDisk(self):</i></div>
<div class="p1">
<i> <span class="s1">with</span> open(self.filename,<span class="s2">'w'</span>) <span class="s1">as</span> f:</i></div>
<div class="p4">
</div>
<div class="p1">
<i> <b>pickle.dump(self.reviewsByRating,f)</b></i></div>
<div class="p1">
<br /></div>
<div class="p1">
This lets me load the data from the file without having to scrape it again:</div>
<div class="p1">
<br /></div>
<div class="p1">
<i><span class="s1"> def</span> load(self):</i></div>
<div class="p1">
<i> <span class="s1">with</span> open(self.filename,<span class="s2">'r'</span>) <span class="s1">as</span> f:</i></div>
<div class="p1">
</div>
<div class="p1">
<i> <b>self.reviewsByRating = pickle.load(f)</b></i></div>
<div class="p1">
<i><b><br /></b></i></div>
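One piece not shown above is the Review wrapper itself. A minimal sketch of what it needs to do -- not necessarily the exact class I wrote -- is to hold the text and normalize the scraped rating string to an int so it can be used as a map key:<br />
<br />
<pre>
class Review(object):
    '''Holds one review's text and its rating, normalized to an int
    so it can be used as a key in the reviewsByRating map.'''
    def __init__(self, text, rawRating):
        self.text = text
        self.rating = int(float(rawRating))   # rawRating arrives as a string like "1.0"
</pre>
<br />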
<div class="p1">
Next step will be to start investigating the data. </div>
<div class="p1">
<i><b><br /></b></i></div>
</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com5tag:blogger.com,1999:blog-8840067776782114927.post-37904887278874244382013-12-08T21:07:00.001-08:002013-12-08T21:15:38.790-08:00Innovation Week Recap<div dir="ltr" style="text-align: left;" trbidi="on">
I previously posted about our leadup to <a href="http://arunxjacob.blogspot.com/2013/11/innovation-trying-to-break-out-beyond.html">Innovation Week</a>, which ended up being more like Innovation Week-and-a-half because it shifted a sprint ending the week of Christmas, which is pretty sparsely attended due to everyone being out of town.<br />
<br />
The 10 days of innovation ended up being much more successful than I had thought possible. There was some very out of the box thinking, both in the realm of infrastructure and analytics, and some of these ideas have huge potential to shift how we think about big data.<br />
<br />
The reasons I thought that things might not go so well (and what ended up happening):<br />
<br />
<ol style="text-align: left;">
<li>Lack of ideas. At the time I wrote the last post, we were up to 6. <b><i>We topped out at 14, all well thought out and presented. We had to narrow the ideas down to 5 based on people's availability -- we did that with a team-wide vote.</i></b> </li>
<li>Lack of management: we -- the management team -- had specifically decided to let the teams be self organizing, and not interfere with them even if we saw them go off the rails. <b><i>No one went off the rails, and teams organized around the work and the capabilities of the team members. We did make ourselves available for questions/advice, but other than that we sat back and observed.</i></b> </li>
<li>Technical roadblocks: the ideas we ended up voting in (as an entire team) had some steep technical hurdles. I wasn't sure if the teams could overcome those, and wasn't sure what they would do if they couldn't. <b><i>Every team had at least one significant roadblock that they worked around with little to no guidance. </i></b></li>
<li>I'm a <strike>pessimist</strike> realist, and tend to prepare for worst case scenarios. <b><i>Apparently I overestimate my own and my management team's contributions :)</i></b></li>
</ol>
<div>
The presentations were great in that all except for one were live demos of working software -- one key difference between this and standard demos is that <b><i>the teams owned the ideas and were therefore much more invested in how the demos went.</i></b></div>
<div>
<br /></div>
<div>
We're taking the top ideas and starting new work that will get prioritized against existing deliverables. While I'm obviously excited about the ideas, some of which I consider to be fundamental game changers, I'm just as excited because of what I learned about leading teams. </div>
<div>
<br /></div>
<div class="p1">
<b><i>Our best ideas come from our people, and when we guide them and set the target, they crush it. As management our primary job should be to clearly communicate a vision of where the team needs to be, inspire them by giving them ownership and autonomy, and get obstacles out of their way. </i></b></div>
<div class="p1">
<br /></div>
<div class="p1">
Sometimes I feel like the best teams are the ones that build up ideas the way Barca moves the ball down the field:<br />
</div>
<div class="p1">
<iframe allowfullscreen="" frameborder="0" height="315" src="//www.youtube.com/embed/M7INnQGoBkE" width="560"></iframe>
</div>
<div class="p1">
<br />
There is no 'central control'; there is just the idea -- the ball -- and the team, whose members support each other as they move the ball downfield, and the magic that happens because the team is focused on doing what it takes to move the ball, develop the attack, and put together a combination that finishes in the opponent's net. What blows me away is that each of these players has amazing skill, but they are so much more effective with one touch passing and holding the triangle. I see the same thing on engineering teams that work well together. The top talent doesn't hold onto the ideas; they share them and make themselves available to move them along, and in doing so bring everyone up to their level. Seeing that happen without explicit guidance was the best part of Innovation Week for me.<br />
<br />
<br />
<br />
<br /></div>
</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com4tag:blogger.com,1999:blog-8840067776782114927.post-38824998886256389452013-11-17T21:00:00.000-08:002014-01-21T21:33:51.233-08:00Hadoop Streaming with MRJob<div dir="ltr" style="text-align: left;" trbidi="on">
<b>Motivation to use Streaming:</b><br />
<br />
Writing Java map-reduces for simple jobs feels like 95% boilerplate, 5% custom code. Streaming is a much simpler interface into MapReduce, and it gives me the ability to tap into the rich data processing, statistical analysis, and NLP modules of Python. <br />
<br />
<b>Motivation to use mrjob:</b><br />
<br />
While the interface to <a href="http://hadoop.apache.org/docs/r1.1.2/streaming.html">Hadoop Streaming</a> couldn't be simpler, not all of my jobs are simple 'one and done' map-reduces, and most of them require custom options. MRJob allows you to configure and run a single map and multiple reduces. It also does some blocking and tackling, allowing me to define custom arguments and pass them into specified jobs. Finally, mrjob can be applied to an on prem cluster or an Amazon cluster -- and we are looking at running Amazon clusters for specific prototype use cases.<br />
<br />
<b>mrjob and streaming hurdles</b><br />
<div>
<br /></div>
The <a href="http://pythonhosted.org/mrjob/index.html">mrjob documentation</a> is excellent for getting up and running with a simple job. I'm going to assume that you have read enough to know how to subclass MRJob, set up a map and a reduce function, and run it.<br />
<br />
I'm going to discuss some of the things that weren't completely obvious to me <i>after </i>I had written my first job, or even my second job. Some of these things definitely made sense after I had read through the documentation, but it took multiple reads, some debug attempts on a live cluster, and some source code inspection.<br />
<br />
<b>Hurdle #1: passing arguments</b><br />
<b><br /></b>
My first job was basically a multi dimensional grep: I wanted to walk input data that had timestamp information in a tab delimited field and only process those lines that were in my specified date range. In order to do this I needed two range arguments that took date strings to do the range check in the mapper. I also wanted to be able to apply specified regex patterns to those lines at map time. Because there were several regex patterns, I decided to put them in a file and parse them. So I needed to pass three arguments into my job, and those arguments were required for every mapper that got run in the cluster.<br />
<br />
In order to pass arguments into my job, I had to override the configure_options() method of MRJob and use add_passthrough_option() for the range values, and <i>add_file_option()</i> for the file that held the regexes:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> configure_options(self):</i></div>
<div class="p1">
<i> super(HDFSUsageByPathMatch,self).configure_options()</i></div>
<div class="p2">
<i><span class="s2"> self.add_passthrough_option(</span>"--startDateRange"<span class="s2">,type=</span>'string'<span class="s2">,help=</span>'...'<span class="s2">)</span></i></div>
<div class="p2">
<i><span class="s2"> self.add_passthrough_option(</span>"--endDateRange"<span class="s2">,type=</span>'string'<span class="s2">,help=</span>''<span class="s2">)</span></i></div>
<div class="p3">
<i><span class="s2"> </span></i><i>self.add_file_option(<span class="s3">"--filters"</span>)</i></div>
<br />
All options were passed straight through to my job from the command-line:<br />
<br />
<i>python job.py --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json</i><br />
<div>
<i><br /></i></div>
I referenced them in an init function of my job class, which subclassed the MRJob class:<br />
<br />
<i>class MyJob:</i><br />
<i> </i> ...<br />
<div class="p1">
<i><span class="s1"> def</span> task_init(self):</i></div>
<div class="p1">
<i> self.startDateRange = dateutil.parser.parse(self.options.startDateRange)</i></div>
<div class="p1">
<i> self.endDateRange = dateutil.parser.parse(self.options.endDateRange)</i></div>
<i> self.filters = parseJsonOptions(self.options.filters)</i><br />
<br />
This init method was specified in the MyJob.steps() override of the default MRJob method:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> steps(self):</i></div>
<div class="p1">
<i> <span class="s1">return</span> [</i></div>
<i><br /></i>
<br />
<div class="p1">
<i> self.mr(mapper_init = self.task_init,</i></div>
<div class="p1">
<i> .....</i></div>
<div class="p1">
<i> ]</i></div>
<div class="p1">
<br /></div>
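For completeness, here is roughly what the mapper referenced in steps() looks like. This is a sketch rather than the exact production code -- it assumes the timestamp is the first tab delimited field and that self.filters holds compiled regexes, both of which are assumptions about the input format rather than facts from the job itself:<br />
<br />
<pre>
# sketch of mapper_filter_matches -- assumes tab delimited lines with a
# leading timestamp, and that task_init has populated self.filters
def mapper_filter_matches(self, _, line):
    fields = line.split('\t')
    timestamp = dateutil.parser.parse(fields[0])
    if not (self.startDateRange &lt;= timestamp &lt;= self.endDateRange):
        return
    for pattern in self.filters:
        if pattern.search(line):
            yield pattern.pattern, 1    # emit one count per matching pattern
            break
</pre>
<br />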
<div class="p1">
Something to note here: In the code I had written during development, I had neglected to really read the documentation and as a result I had previously done all validation of my custom args using a standard optparse OptionParser in my main handler. This worked for me in inline mode, which is what I was developing in. It does not work at all when running the job on a cluster, and it took some source code digging to figure out why. Do as I say, not as I do :) In <i>hadoop</i> mode, the main MRJob script file is passed to mapper and reducer nodes with the step parameter set to the appropriate element in the steps array. The entry point into the script is the default <i>main, </i>and MRJob has a set of default parameters it needs to pass through to the MRJob subclassed job class. Overriding parameter handling in main effectively breaks MRJob when it tries to spawn mappers and reducers on worker nodes. <i style="font-weight: bold;">MRJob handles the args for you, and you need to let it handle all arg parsing, and pass custom arguments as passthrough or file options. </i><br />
<br />
<b>Hurdle #2: passing python modules</b></div>
<br />
This nuance has more to do with streaming than it does with mrjob. But it's worth understanding if you're going to leverage non-standard Python modules in your mapper or reducer code, and those modules have not been installed on all of your datanodes.<br />
<br />
I was using the dateutil module because it makes parsing dates from strings super easy. On a single node, getting dateutil up and running is this hard:<br />
<br />
<i>easy_install python-dateutil</i><br />
<br />
But when you're running a streaming job on a cluster, that isn't an option. Or, it wasn't an option for me because the ops team didn't give me sudoers permissions on the cluster nodes, and even if they did, I would have had to write the install script to ssh in, do the install, and roll back on error. Arrgh, too hard.<br />
<br />
What worked for me was to<br />
<ol style="text-align: left;">
<li>Download the source code</li>
<li>Zip it up (it arrived in tar.gz)</li>
<li>Change the extension of the zip file because files that end in .zip are automatically moved to the lib folder of the task's working directory</li>
<li>Access it from within my script by putting it into the load path: </li>
</ol>
<div class="p1" style="text-align: left;">
<i><span class="s1">sys.path.insert(</span><span class="s2">0</span><span class="s1">,</span>'dateutil.mod/<span class="s3">dateutil</span>'<span class="s1">)</span><br /><span class="s4">import</span> <span class="s3">dateutil</span><br />...</i></div>
<br />
I'm passing dateutil.mod as a file passed in via <i>add_file_option()</i> in <i>myjob.configure_options(). </i>Leveraging the <i>add_file_option()</i> method puts dateutil.mod in the local hadoop job's working directory:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> configure_options(self):</i></div>
<div class="p1">
<i> super(HDFSUsageByPathMatch,self).configure_options()</i><br />
<i> ....</i></div>
<div class="p2">
<i> </i><i>self.add_file_option(<span class="s3">"--dateutil"</span>)</i><br />
<i><br /></i></div>
<div class="p2">
Three things to note from the above code: (1) dateutil.mod is the zip file, (2) I'm referencing a module within the zip file by its path location in that zipfile, and (3) because I've renamed the file, it gets placed in the job working directory, which means it is on my path by default. </div>
<div class="p2">
<br />
This is how I pass dateutil.mod into the job:<br />
<br />
<i>python job.py ... --dateutil dateutil.mod</i><br />
<br /></div>
<b>Hurdle #3 (not quite cleared): chaining reduces vs map-reduces</b><br />
<b><br /></b>
As mentioned in the doc, it's super easy to chain reduces to do successive filtering and processing. Simply specify your multiple reduces in the steps() override:<br />
<br />
<div class="p1">
<i><span class="s1">def</span> steps(self):</i></div>
<div class="p1">
<i> <span class="s1">return</span> [</i></div>
<div class="p1">
<i> self.mr(mapper_init = self.task_init,</i></div>
<div class="p1">
<i> mapper=self.mapper_filter_matches,</i></div>
<div class="p1">
<i> combiner=self.combiner_sum_usage,</i></div>
<div class="p1">
<i> reducer=self.reducer_sum_usage),</i></div>
<div class="p1">
<i> self.mr(reducer_init = self.task_init,</i></div>
<div class="p1">
<i> reducer=self.reducer_filter_keys)</i></div>
<br />
<div class="p1">
<i> ] </i></div>
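To make the chaining a little more concrete, here is roughly what those two reducers could look like -- a sketch under the assumption that the first map-reduce emits (key, usage) pairs and that the chained reduce filters keys against the same self.filters patterns (hence the reducer_init above); the real job may differ:<br />
<br />
<pre>
# first reduce: sum the usage values emitted by the mapper/combiner for each key
def reducer_sum_usage(self, key, values):
    yield key, sum(values)

# chained reduce: consumes the (key, total) pairs produced above; because
# reducer_init points at task_init, self.filters is available here too
def reducer_filter_keys(self, key, totals):
    if any(pattern.search(key) for pattern in self.filters):
        yield key, sum(totals)
</pre>
<br />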
<b><br /></b>I haven't found it necessary to run successive mapreduces -- successive reduces work just as well in the use cases I've tried. When chaining reduces to the end of your first mapreduce, you can specify the key value from the first mapreduce as the key value in the next reduce.<br />
<br />
What is <b><i>not</i></b> easy at this time is the ability to save intermediate output to a non intermediate location. While <a href="http://stackoverflow.com/questions/14730735/mapreduce-mrjob-saving-results-persistently">doing that</a> is relatively straightforward in 'inline' mode, the approach suggested in the link won't work in hadoop mode because MRJob is invoking the python script with the right --step-num argument based on what it sees in the steps() method.<br />
<br />
I did <a href="http://pythonhosted.org/mrjob/guides/configs-all-runners.html">read about</a> the <i>--cleanup</i> option, but from what I understand the intermediate output dir of a complex job is based on a naming convention, not on something I can set. As this is somewhat of an edge case, I can work around it by chaining MRJob runs with Oozie.<br />
<br />
<b>Summary</b><br />
<b><br /></b>
What I've learned about MRJob is that while it does a great job of allowing you to set and pass options, and allows you to construct good workflows (assuming you don't care about intermediate output), it is so easy to use that I fell into the trap of believing that running locally on my machine was equivalent to running on a hadoop cluster.<br />
<br />
As I've found out several times above, that is not the case. For me the keys here are (1) let MRJob handle your job specific variables, (2) leverage the steps() method for your more complex flows, and (3) if you need to save intermediate output, chain your jobs using an external scheduler.<br />
<br /></div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com32tag:blogger.com,1999:blog-8840067776782114927.post-16297978741232503162013-11-08T16:52:00.001-08:002013-11-08T16:52:25.556-08:00Innovation -- trying to break out beyond the buzzword<div dir="ltr" style="text-align: left;" trbidi="on">
Innovation is the poster child of buzzword bingo.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/cgeLY7CL5IE?feature=player_embedded' frameborder='0'></iframe></div>
<br />
It's hard not to have an allergic reaction to people that talk about it, because you can't talk about innovation and do it at the same time.<br />
<br />
So why am I talking about Innovation instead of doing it? :)<br />
<br />
A couple of months back, when we were revamping our development process, basically going from 'Scrum-in-name-only' to something much more genuine (and I've got to do a post on that), we wanted to give people a block of time to do something completely different from their day jobs. We wanted them to work with different people, outside of their usual teams, on ideas that they (not we) thought of. We wanted to break down some of the walls that naturally occur when you section large teams into smaller units to get work done efficiently.<br />
<br />
We're sitting on some amazing data and have built some great infrastructure to manage it. These people are on the teams that work with that data and use that infrastructure day in and day out. They're smart. I know they have ideas on new data products, or tools to make getting insights easier, but no time to actually work on them. Most importantly, I know their ideas are good ones, because I've seen multiple people make those ideas happen in spite of having no time to work on them. We have products that we've built because people have championed their ideas into the delivery stream. I wanted to make that easier. You shouldn't have to be Rocky Balboa to get a good idea off the ground.<br />
<br />
In other words, that kind of effort shouldn't happen on nights and weekends, against all odds -- we need to reward that kind of creativity during business hours -- while balancing the delivery needs of the business.<br />
<br />
'Innovation Week' is our collective attempt to do just that. One week a quarter is enough time to stop business as usual and try something completely different. Innovation Week is very much an experiment, one that could go well....or not.<br />
<br />
The overall plan:<br />
<br />
<ol style="text-align: left;">
<li>Before:</li>
<ol>
<li>Announce the week. </li>
<li>Send out a 'request for ideas' email</li>
<li>Review ideas in as many sessions as we needed:</li>
<ol>
<li>the idea 'author' presents their idea canvas.</li>
<li>We go over the canvas, ask questions, offer suggestions.</li>
</ol>
</ol>
<li>During: </li>
<ol>
<li>Everyone sells their idea.</li>
<ol>
<li>key in the selling: they need to ask for help where they need it.</li>
</ol>
<li>People provide their first, second, third choices.</li>
<li>We assign people to ideas -- the reason we aren't going to just let people choose is that we don't want imbalanced teams, and we want to make sure groups are diverse. </li>
<li>The teams work on the ideas -- we are available to unblock any issues and provide guidance if asked. </li>
</ol>
<li>After:</li>
<ol>
<li>Every team presents their work.</li>
<li>The group stack ranks all ideas. </li>
<li>The top 3 get prizes. </li>
<li>The management team gives separate awards for </li>
<ol>
<li>Business Value</li>
<li>Completeness of Effort</li>
<li>Disruptiveness (of the idea, of the technology being used, etc)</li>
</ol>
</ol>
</ol>
<br />
<br />
As the saying goes: 'no battle plan survives first contact with the enemy'. I was fairly nervous. What if no one had ideas? What if the team couldn't care less? What if they were as allergic to the I-word as I was?<br />
<br />
Our first idea review meeting was last night. Instead of the 1 or 2 ideas we had predicted, we have (at last count) six. Instead of the vague, tech-aspirational ideas we thought we were going to see -- things in the vein of 'I want to play with technology X, here is a contrived attempt to justify that' -- we saw carefully thought out resolutions to problems our team was either working around or about to go through.<br />
<br />
The discussion around the ideas was very positive and constructive -- the ideas that were presented got a lot of feedback and suggestions about how they could be better. The best part was getting individuals to see that they had good ideas and that exposing them to the group would make those ideas better. The best moment was when one of the quietest, most unassuming engineers got up and proceeded to unveil a completely awesome idea that was completely out of the box and completely powerful. At that point the energy in the room jacked up like a big wave.<br />
<br />
After a while, work becomes work. We're lucky enough to be in a profession that requires as much creativity as it does precision. I wanted to put some meaning into what has become a term that is only applied with heavy irony.<br />
<br />
We are early on in the process. I am going to document how this first Innovation week goes -- expecting the unexpected, of course.<br />
<br />
Right now, as noted, we are at the beginning. The management team has put a lot of work into setting up the idea generation, and we need to follow through by setting the teams up for success, picking the best ideas, then ruthlessly evangelizing those up the chain. It's a long journey but I think we made a great first step.<br />
<br />
<br />
<br /></div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com1tag:blogger.com,1999:blog-8840067776782114927.post-76417894502677288862013-10-13T21:11:00.003-07:002013-10-14T08:58:04.458-07:00Yesterday I woke up and realized I was a Product Owner.<div dir="ltr" style="text-align: left;" trbidi="on">
Well, not yesterday. But I have had a recent revelation that I am no longer an engineer. And I'm not sure if I can ever really go back, even if I were to write production code again (I'm always writing code, but at this point it's more of a hobby than anything).<br />
<br />
My journey into product ownership began innocently enough when I assumed technical leadership for a set of services built around Hadoop and other NoSQL Platforms -- the Enterprise Data Platform I've <a href="http://arunxjacob.blogspot.com/2012/11/what-is-enterprise-data-platform-anyway.html">posted</a> about before -- and has taken over a year. In that year I have transitioned from a person that provides purely technical solutions to a product owner. What's the difference? To me it is simple: a technologist implements solutions the best way they possibly can to address a perceived problem. A product owner makes sure that the problem being solved actually matters to someone. Preferably before the technologist gets too far down the solution path.<br />
<br />
In my last <a href="http://arunxjacob.blogspot.com/2013/09/enterprise-data-platform-reboot-and.html">post</a> I talked a lot about secondary vs primary value propositions. The Enterprise Data Platform I have been working on as a technologist is a group of services that have secondary value propositions. The technologies used to solve them are definitely awesome. But that's irrelevant. As a product owner I've got to make sure that those problems are worth solving to someone, because there is a significant investment of time and resources required to solve problems. <b><i>Even if the solutions to those problems don't end up getting used. </i></b><br />
<br />
Nothing is more frustrating than burning effort on a solution that goes nowhere. Especially if that time was spent crafting a robust, scalable, well tested, well documented implementation of that solution. It's frustrating because that work resulted in a solution that either no one understood or no one wanted. In fact it was my frustration with having been through the good solution/bad product cycle several times that made me take the jump from technologist -- one of the people that implements solutions -- to product owner -- one of the people who decides whether implementation is worth it.<br />
<br />
One of the hardest things about transitioning to a product owner from a purely technical role is being able to distinguish whether I should be doing something because I can, or because it solves a problem that people actually have. This is hard for me because as a technologist I tend to get sucked into solving hard problems before asking whether those problems should actually be solved. Ironically I've found that the best technologists I've worked with are the ones that stand back, ask the bigger picture questions, and use the answers to implement elegant, concise solutions.<br />
<br />
Product owners need to do the same thing. I think the main job of a product owner is to learn what product the customer base really wants. Instead of plowing ahead with user stories and a feature roadmap, the best product owners I've seen are the ones that can step back from that process and ask the bigger picture questions, and use the answers to make a better product.<br />
<br />
Answering big picture questions correctly allows product owners to fail fast -- throw away features that aren't working and iterate towards ones that do. And before doing any of that, the best product owners I know are the ones that can articulate the real value proposition of any effort before embarking on it. They may not know <b><i>what or how</i></b> they are going to deliver the effort, but they've nailed <b><i>why</i></b> the effort is worth making or not.<br />
<br />
In my Product Owner self education, I've learned (from lots of reading and even more bleeding) that clearly articulating the value proposition and true effort cost is critical to understanding whether the effort is worth exploring and to selling that exploration process to the people financing it and the people working on it. One of the best things I've found to help clarify product vision is this <a href="http://streetsmartproductmanager.com/product-canvas/">product canvas from Shardul Metha</a>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-2W5u4BR1RDw/Uk5KFrUyP1I/AAAAAAAA130/C6gWRIvPcHI/s1600/Screen+Shot+2013-10-03+at+9.04.10+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="449" src="http://1.bp.blogspot.com/-2W5u4BR1RDw/Uk5KFrUyP1I/AAAAAAAA130/C6gWRIvPcHI/s640/Screen+Shot+2013-10-03+at+9.04.10+PM.png" width="640" /></a></div>
<br />
I've been going through our product suite with this product canvas. That process, while painful, has clarified why we should continue doing certain things and why we should stop doing others.<br />
<br />
The interesting thing about the canvas is that everything is linked. Here are some examples of what I mean:<br />
<br />
<ol style="text-align: left;">
<li>If Key Success Metrics don't link back to the Value Proposition through the Solution, it doesn't matter how obvious they are, they're wrong. </li>
<li>If you don't have viable Channels, no executive is going to sign up to be a Key Stakeholder, and it's going to be hard to find a customer that could be a champion. </li>
<li>If the Value Proposition doesn't address the problems that customers have more efficiently than the alternatives, the Business Value of the effort is weak. </li>
<li>An ill defined or prohibitively expensive Cost Structure can weaken the best Value Proposition as well because Business Value will be reduced.</li>
</ol>
<br />
<br />
Because everything is linked, I find myself revisiting what I had thought was 'obvious' or at least 'set'. Like the Value Prop. Or the Customer. Or even the Problem. But after several cycles what comes out of the other side is either very strong or very weak. This makes deciding which efforts to pursue much easier.<br />
<br />
Back to my original problem: I'm selling secondary solutions. The product canvas helps by allowing me to clarify the customer I'm helping, the problem they have, my specific value proposition, and how I would go about fulfilling that value proposition. I've found that for these secondary solutions, linking them to primary solutions is critical. An example:<br />
<br />
We've been working on a reliable data delivery service that allows engineering teams to stream their data into different endpoints. Teams can build in 'routing directions' to the data they send that will enable it to land in HDFS, Mongo, Cassandra, ElasticSearch, and other endpoints. That is technically very cool. But we have not been able to clearly articulate to our leadership why we would build something like this, because <b><i>as technologists, the advantage of having something like this as a service is obvious</i></b>. Namely, people could reuse this service instead of building a custom one.<br />
<br />
As a product owner I get a chance to look at the solution from another point of view. What problem am I trying to solve? How much does it cost to stand up and run? Are there commercial alternatives that we would be better off using? Can I find someone that is willing to try it and work with them to load test it? What other teams do I need to deliver it? Who do I need to sponsor it? And how am I going to get people to use it? For a secondary solution like this, perhaps the most important question to answer is: <b><i>what primary problems would be easier to solve if this existed today, and how does that benefit the business? </i></b><br />
<br />
None of those questions have anything to do with how we are solving the problem. But they are the minimum set of questions that need to be answered to justify continuing this effort. And, had I been as educated when we had started the effort, this is the minimum set of questions that we would have asked prior to doing any work. The answers to these questions laid out in product canvas form would have been a decision making compass -- when confronted with choice A vs choice B, the product canvas would have given us the tools to make the best decision.<br />
<br />
I think that this approach is fundamentally right, but it is one that I am only just starting. The product canvas approach -- clarifying why before what and how-- is obviously the first foundational step in delivering a valuable product. Our teams are going through this journey, and my hope is to write down what is working, and what isn't, as we now try to deliver products whose value proposition we clearly understand :) </div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com0tag:blogger.com,1999:blog-8840067776782114927.post-11127801214818817472013-09-26T07:31:00.002-07:002013-09-26T07:31:17.527-07:00Enterprise Data Platform: A Reboot and a Reality Check<div dir="ltr" style="text-align: left;" trbidi="on">
The last post I wrote on the Enterprise Data Platform was in January. It's September. What happened?<br />
<br />
A whole lot, actually. My understanding of what an Enterprise Data Platform is and how it needs to be 'sold' has changed dramatically. My role in this process has also changed. And my understanding of delivering software and running a team has grown and changed and continues to change for the better.<br />
<br />
What I've found out in the past year, among many other things, is that getting people to fund what I've been calling an Enterprise Data Platform is as much about education as it is about technical execution. I'm not talking about educating other people, I'm talking about educating myself. It has been an incredibly educational, humbling, uncomfortable, frustrating, awesome nine months.<br />
<br />
When I look back at the <a href="http://arunxjacob.blogspot.com/2012/11/what-is-enterprise-data-platform-anyway.html">posts</a> I wrote early this year, one thing that is very murky throughout a reasonably well laid out argument for data management is a value proposition. I didn't know that when I wrote it, but I quickly found it out when I went to ask for money.<br />
<br />
Here is the value disconnect: a system to collect, manage, and leverage data solves a <b><i>secondary</i></b> problem -- one that assumes a <b><i>primary</i></b> problem has already been solved. In other words: if I'm trying to get you to fund a data collection, storage, and management platform, I'm assuming that there is something that is already generating the data that needs to be stored. Netflix, Amazon, Google, my bank website, Blogspot, all solve primary problems. Those problems are easy to explain, regardless of how hard they are to implement. Solutions to primary problems have clear, concise, direct connections to value.<br />
<br />
Solutions to secondary problems are optimizations of primary solutions. Decreasing time to insight on operational metrics of a website is an optimization. A great one, to be sure, but not necessary if the website is not getting any traffic.<br />
<br />
Any solution to a secondary problem has an indirect value connection at best. Secondary solutions only make sense when the initial value proposition of the primary problem is diluted or reduced due to a secondary problem. A system that doesn't scale to support a site whose popularity is exploding through the roof is a secondary problem. The secondary problems I see in my current role are operational in nature, and the solutions to them are optimizations. They can deliver huge value when done correctly.<br />
<br />
"When done correctly". Three words that are seared into my brain. In the past year several things have happened while I've been trying to explain, again and again, why we should build a solution to a secondary problem, and while I've been trying to build that solution with limited resources:<br />
<ol style="text-align: left;">
<li>I've realized that the best way to solve a secondary problem is one primary problem at a time. Building a platform to optimize an undefined set of primary solutions is a risky, 'field of dreams' approach, and there are many ways to go awry. </li>
<li>I've become less of a technologist and more of a product owner. My last piece of production facing code will (hopefully) be retired in the next couple of months. There are much better engineers on the team, and I rely on them to deliver working software in the same way that they rely on me to come up with a useful product.</li>
<li>Where I used to think about use cases and requirements and assume that these were valid, I now question and validate product direction up front. That involves use cases, but if the use case is invalid, why spend time extracting requirements from it? I spend more time thinking about validating the use case as cheaply as possible. Requirements emerge and solidify as product direction takes shape -- doing them in advance of having a validated use case seems backwards. This insight radically changes the way software is delivered, and our teams are in the middle of this change process.</li>
<li>The cost and recovery plan for any effort -- infrastructure and resources -- combined with time to recovery, is best defined as soon as you have validated use cases. Those plans need to change as validated features emerge that impact cost and recovery. In previous roles I had been 'sheltered' from that aspect of the business. I'm finding now that financial data is the ultimate data point that helps quantify whether value is being delivered.</li>
</ol>
<div>
This process is far from complete. I am continuing to learn every day, and while it can get very uncomfortable, it has been an amazing education. </div>
<div>
<br /></div>
<div>
I've tried to write things down several times in the past 9 months. I haven't gotten very far because what I was writing didn't feel complete. Writing about the technology is only one side of the story. What I've learned in the past nine months is that there is a much bigger picture -- now that I'm starting to be able to externalize what I've learned, I'm excited to write about it. More soon...</div>
<div>
<br /></div>
</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com6tag:blogger.com,1999:blog-8840067776782114927.post-38287634487245598972013-01-11T21:19:00.002-08:002013-01-11T21:19:56.275-08:00More on the Enterprise Data Platform: Data Requirements <div dir="ltr" style="text-align: left;" trbidi="on">
In my <a href="http://arunxjacob.blogspot.com/2012/11/what-is-enterprise-data-platform-anyway.html">last post</a>, I talked in very general terms about an Enterprise Data Platform (EDP) and in very specific terms around what I consider to be a core requirement of any EDP, data governance. If I have a set of services and processes that provide data governance, I have a way to manage data. What kind of data am I trying to manage?<br />
<br />
I'm primarily concerned with building systems that contain event data and reference data. Event data can be data copied from <a href="http://en.wikipedia.org/wiki/Online_transaction_processing">OLTP</a> systems, user click streams, or machine data collected at regular intervals -- anything that signals an event happening. Event data can be huge.<br />
<br />
Reference data is data that can be used to classify/quantify aspects of an event. If I'm looking at a click stream, a user ID is reference data that I can aggregate events by. In <a href="http://en.wikipedia.org/wiki/Online_analytical_processing">OLAP</a> terms, events can be cast as facts, with reference data providing some of the dimension values.<br />
<br />
This data starts to get valuable when 'raw' event data is joined to reference data, and then in turn joined to other event data along a specific reference dimension. For example, aggregating user click stream and email campaign opens by user ID could be used to track the rate at which the email campaign actually generated new users.<br />
<br />
In addition to that kind of analytical use, event data can be used to classify users and/or determine their affinities. This kind of derivative data is typically referenced at runtime, by the same applications that are generating the event data.<br />
<br />
The two cases above are interesting because they have opposing requirements. Analytic data must first and foremost be consistent, especially when financial reporting and/or business decisions are being made from that data. Consistency in this case implies that when a value is written, the next read reflects the last change made to that variable.<br />
<br />
Runtime data must first and foremost be available, because unavailable data may compromise application behavior. Preference or personalization data is derivative data, typically regenerated on a set interval as a batch process, so strict consistency matters less. Being out of sync for a long time would eventually make the application feel 'stale', but when we favor availability at the expense of consistency, the usual case is that any inconsistency between read state and written state is resolved in sub-second time.<br />
<br />
Why is this important?<br />
<br />
<ol style="text-align: left;">
<li>Enterprises collect a lot of this data -- TB/day -- and that scale will swamp any single machine based database. </li>
<li>The enterprise must therefore store data on a storage platform that spans many machines -- a cluster. </li>
<li>The moment data is stored in a cluster, it is subject to the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a>, which states that a distributed system cannot enforce Consistency, Availability, and Partition Tolerance at the same time. </li>
<li>While it is valid to desire a clustered system that favors either Consistency or Availability, enterprise-level requirements always require partition tolerance, so we can have one or the other, but not both. </li>
<li>This means that there are usually two main systems, an analytic focused system, and a runtime facing system. The analytic focused system favors consistency, the runtime facing system favors availability.</li>
<li>An EDP that wants to offer both analytic and runtime facing data must have at least two systems, one which is consistent for analytic data, the other which is available for runtime facing data.</li>
</ol>
<div>
So now my definition of an EDP has evolved. It must have some fundamental Data Governance, and due to the scale enterprises operate at, must have 1..N analytics focused platform(s) and 1..N runtime facing platform(s). In my next post I'll focus on the analytic focused platform requirements. </div>
<div>
<br /></div>
<br /></div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com2tag:blogger.com,1999:blog-8840067776782114927.post-8455353930388316312012-11-16T18:38:00.002-08:002013-07-22T21:43:58.896-07:00What is an Enterprise Data Platform, Anyway?<div dir="ltr" style="text-align: left;" trbidi="on">
I've been trying to write about my requirements for a Data Platform for the last month. The problem is that I was trying to write this at the exact same time that my understanding of both the requirements and the Platform was shifting, so I ended up writing the requirements, coming back to them a week later, throwing them all out, and rewriting them again. This is revision number three, and I actually feel that I've stabilized my definition. The biggest change between now and a month ago is that I realized that I am actually thinking about an Enterprise Data Platform, which is different than a Data Platform.<br />
<br />
First of all: What is a Data Platform? Here is a first crack at a definition: <b><i>a Data Platform enables people to leverage data by facilitating data collection, storage, transformation, and access</i></b>. In other words, I need to be able to get data in, persist it, access it, change it, and still access it. All of that is completely possible with {name your nosql storage engine of choice}. So why am I not calling {name your nosql storage engine of choice} an Enterprise Data Platform?<br />
<br />
I have a short answer that is explainable with a longer answer. The short answer: <b><i>Enterprise Data Platforms have another requirement: the ability to manage all of that data. </i></b><br />
<br />
The long answer: my experiences with databases and persistence technology started from the perspective of a developer in a startup, not an administrator or data modeler in an enterprise. I wanted to successfully collect, store, transform, and retrieve data, in order to fulfill the functional requirements of the applications I wrote. I found that a lot of the database admins and data modelers/architects I worked with actually got in the way of this fairly simple mission.<br />
<br />
A lot of the constructs and restrictions imposed on the data model and the database by both parties seemed arbitrary, and when I asked them for their reasoning, their answers had no relevance to my single data set use case. So, when I discovered Hadoop while working around a complex set of ETLs that had ground our database to a halt in 2009, I was elated, because both parties were sidelined and I could do what I needed to do. Their apoplectic reactions to 'schema on read' used to fill my heart with joy. That makes my current perspective very ironic.<br />
<br />
The change in perspective happened when I changed jobs. I went from being a developer that used Hadoop to deliver solutions to an architect/product manager position where I led a team that operationalized Hadoop, Cassandra, Mongo, and Solr and then started to store and curate enterprise level data as a centralized service offering. The moment I started thinking past pure functionality and more about operations, I started to care about a lot of the things that people who manage data -- the people I used to disparage -- have been caring about for a while. The moment I became a data curator or steward of a set of very diverse and valuable data, I started to see real value in Data Governance, which was a term that previously made me roll my eyes in impatient disgust.<br />
<br />
Data Governance is defined in <a href="http://en.wikipedia.org/wiki/Data_governance">wikipedia</a> as a set of processes around <b><i>"data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization"</i></b>. As a developer in a startup tasked with doing a few things very well, I couldn't care less about data governance. I know my data: it may be vast, but it has the same schema, which I know and can code for.<br />
<br />
However, as a member of a large enterprise storing many kinds of data who wants to use some of that data, I now must rely on data quality, definition, and security. Without those three I may be able to generate some value from that data, but the value is undermined by the lack of defined quality of the inputs, the lack of defined standards to normalize the data to, and the lack of security which means that the data could be compromised by a rogue user/process. Those 'constraints' on enterprise data are in place because they have a direct impact on the bottom line.<br />
<br />
Breaking down Data Governance from the above definition, I get the following:<br />
<ol style="text-align: left;">
<li><b><i>Data resources must be discoverable: there must be a set of defined metadata that analysts can search for data by. </i></b></li>
<ul>
<li>common metadata is most effective when there is a common data model, which becomes hard to enforce when an enterprise spans many different business units. </li>
</ul>
<li><b><i>Data structure must be describable so that analysts can consume it. </i></b></li>
<ul>
<li><a href="http://arunxjacob.blogspot.com/2011/11/schema-on-read-not-so-fast.html">Schema is a contract, even if it is not applied up front</a>. Without that contract, processing code gets much more complex. </li>
<li>Any effort to describe structure must account for versioning, because all data input changes over time. </li>
</ul>
<li><b><i>Data Quality must be known and identified every time new data is ingested. </i></b></li>
<ul>
<li>When I load data from an external source, downstream processes rely on it conforming to a schema. I need to score the data by how much of it conforms to that schema.</li>
<li>When Data Quality dips below a defined level, it must be treated as an operational issue. </li>
</ul>
<li><b><i>Data Replication must be defined and enforceable.</i></b></li>
<ul>
<li>Storage systems must allow users to set a replication factor to account for storage failures. </li>
</ul>
<li><b><i>Data Replication must be defined, enforceable, and applicable per unique data type.</i></b></li>
<ul>
<li>Replicating full data sets may be a business requirement: if so, a Data Platform must replicate data along the following dimensions: resource location (where) and range (how much). </li>
</ul>
<li><b><i>The context of the data, e.g. what the units of the data are, must be defined in order for an analyst to successfully trust and consume the data. </i></b></li>
<ul>
<li>Unitless data is much less useful. Note that this implies common units across an enterprise, which is harder to do than one would think. Defining and encouraging a Master Data Model helps restrict units and meaning to allow different people to utilize data that they haven't produced. </li>
<li>Data that has been vetted is 'trusted' which means that someone is standing behind it. Standards around trust are important when you have many owners of data. They imply standards around data quality and context. </li>
</ul>
<li><b><i>Data must be secure</i></b></li>
<ul>
<li>Access must be restricted to specific owners. </li>
<li>Those owners must be able to centrally manage permission granting to share data with other users. </li>
<li>Users must be authenticated to the overall system, and can only access data that the data owners have authorized them to. </li>
<li>All actions must be audited.</li>
</ul>
</ol>
These requirements are foundational to a data platform that houses many diverse data sets. You could build one without them, but it would only work for a limited set of data. Which is why startups don't care about data governance, and why most NoSQL products are only now starting to think about security. They don't need to -- and to be clear, that's perfectly fine. But that doesn't work or scale at the enterprise level.<br />
<br />
What would an Enterprise Data Platform that implemented the requirements above enable? It would enable qualified, standardized, and secure use of data for both analytics and runtime consumption. As a company runs different kinds of analytic efforts on its data and generates runtime models from those efforts, other analysts can reuse the input, intermediate, and generated data in different, unanticipated, "recombinant" ways because they can trust the data and can apply company standards to it. The platform would become an enabler, allowing analysts to discover new insights by removing the overhead of managing the data.<br />
<br />
I have other requirements of a Data Management Platform besides management, but management is a foundational aspect when managing very many, very diverse sets of data across multiple storage platforms. I hope to start addressing those shortly.<br />
<br />
<br /></div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com1tag:blogger.com,1999:blog-8840067776782114927.post-69098245751610335782012-07-12T22:18:00.000-07:002017-03-13T14:24:01.803-07:00Calculating Conditional Probability with MapReduce part 4: Applying Correlation and Independence to get better results<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: black; color: white;">I</span><span style="background-color: white;">n my last <a href="http://arunxjacob.blogspot.com/2012/06/calculating-conditional-probability-via.html">post</a> I discussed using mapreduce to calculate item to item conditional probabilities, in other words, the probability that you will click on page B given that you have clicked on page A.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">In this post I'm going to discuss how to determining whether a user click on page A is independent of a user click on page B. I went over this in the <a href="http://arunxjacob.blogspot.com/2012/03/calculating-conditional-probability-via.html">first post</a>, but it's worth revisiting because taking the theory literally leads to confusion about why it is relevant for recommending products or content.</span><br />
<h4 style="text-align: left;">
<span style="background-color: white;">
Independence and Correlation</span></h4>
<span style="background-color: white;">Some review of the theory using common sense: if I roll two dice together, it's obvious that the probability of getting a 6 on the first dice is independent of the probability of getting a 6 on the second dice. If, on the other hand, I have 4 blue balls and 4 red balls in a drawer, and retrieved a blue ball the first time I reached for a ball, the probability of pulling out a red ball has gone up because now there are 3 blue balls and 4 red balls left. In the first scenario,</span><br />
<span style="background-color: white;"><br /></span>
<b><i><span style="background-color: white;">P(roll a 6 with dice A) | P(roll a 6 with dice B) = P(roll a 6 with dice A)</span></i></b><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">in the second: </span><br />
<span style="background-color: white;"><br /></span>
<span style="color: white;"><b><i style="background-color: white;">P(red ball) | P(blue ball) = # of red balls/(# of red balls + # of blue balls left). </i></b></span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">Given that I've reduced the number of blue balls, <b>P(red | blue)</b> is > <b>P(red in the original case), </b>I'm more likely to pick a red ball if I've picked a blue ball first.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">How does that translate into whether to recommend page B given the user has viewed page A? URIs aren't balls in a drawer. The theory of independent events vs dependent events doesn't really apply to page views...until you consider that the implications of using the independence test actually make sense when deciding whether to recommend page B given the user is viewing page A.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">Given that we know P(B | A) and P(B), and we know that </span><br />
<ol style="text-align: left;">
<li><span style="background-color: white;">for independent events P(B | A) = P(B)</span></li>
<li><span style="background-color: white;">for negatively correlated events P(B | A) < P(B)</span></li>
</ol>
<div>
<span style="background-color: white;">then any calculated P(B | A) significantly <= P(B) means that recommending page B given the user is on page A is likely to disengage the user, because many more viewers are likely to view page B without viewing page A than they are to view page B after viewing page A. </span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">Calculating Probabilities Again</span></div>
<div>
<span style="background-color: white;">From above, we need to calculate P(B | A) and P(B) to determine if B should be recommended when the user is viewing A. We have P(B | A) from the last <a href="http://arunxjacob.blogspot.com/2012/06/calculating-conditional-probability-via.html">post</a>: the output of our last reducer was </span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;"> <b>uriX uriY=P(Y | X) uriZ=P(Z | X)...</b></span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;">As for calculating raw probabilities, we can do this by taking the output from our group-by-user mapreduce from the second <a href="http://arunxjacob.blogspot.com/2012/05/calculating-conditional-probability-via.html">post</a>, which looked like</span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;"> <b>user1 pageA:num_views_A_by_user1 pageB:num_views_B_by_user1...</b></span></div>
<div>
<b><span style="background-color: white;"><br /></span></b></div>
<div>
<span style="background-color: white;">A map job that submitted key:value pairs from each line of that output, and a reduce job that grouped by (page uri) key and summed the counts would give a total count of page views per page.</span><br />
<span style="background-color: white;"><br /></span>
<pre><span style="background-color: white;">public static class AggregateViewCountsMapper extends
Mapper<longwritable longwritable="" text=""> {
@Override
public void map(final LongWritable key, final Text value,
final Context context) {
// assume output of GroupViewsByUser:
// user1\turi1:count1\turi2:count2...\turiN:countN
String raw = value.toString();
String fields[] = raw.split("\t");
long totalCount = 0;
for(int i = 1; i < fields.length; i++ ) {
String rawFieldData = fields[i];
String tokens[] = rawFieldData.split(":");
if(tokens.length != 2) {
context.getCounter(AggregateViewCounters.BAD_URI_SUBFORMAT).increment(1);
continue;
}
// we care about both tokens
try {
String count = tokens[1].replaceAll("\\s","");
Long actualCount = new Long(count);
context.write(new Text(tokens[0]), new LongWritable(actualCount));
totalCount += actualCount;
} catch (IOException e) {
context.getCounter(
AggregateViewCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
.increment(1);
e.printStackTrace();
} catch (InterruptedException e) {
context.getCounter(
AggregateViewCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
.increment(1);
e.printStackTrace();
}
}
// write total counts to a guaranteed unique key
try {
context.write(new Text(UNIQUE_KEY),
new LongWritable(totalCount));
} catch (IOException e) {
context.getCounter(AggregateViewCounters.
MAP_CONTEXT_WRITE_IO_EXCEPTION)
.increment(1);
e.printStackTrace();
} catch (InterruptedException e) {
context.getCounter(AggregateViewCounters.
MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
.increment(1);
e.printStackTrace();
}
}
}
}
}
</longwritable></span></pre>
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">A special key (<b>UNIQUE_KEY</b> above) could be used to sum up all views per user.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">The reducer is a simple aggregator, the final output would look like this:</span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<b><span style="background-color: white;">pageA total_view_A_ct</span></b></div>
<div>
<b><span style="background-color: white;">pageB total_view_B_ct</span></b></div>
<div>
<b><span style="background-color: white;">....</span></b></div>
<div>
<b><span style="background-color: white;">UNIQUE_KEY total_view_all_pages_ct</span></b></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;">The number of lines of output is equal to the total number of pages in your candidate set+1.</span><br />
<span style="background-color: white;"><br /></span>
<h4 style="text-align: left;">
<span style="background-color: white;">
Notes on Final Calculations</span></h4>
<span style="background-color: white;">I'm leaving this one as a mental exercise:</span><br />
<span style="background-color: white;">If the data set is small (<10k pages), you could write some script to calculate probabilities of pages A..N by dividing </span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<b><span style="background-color: white;">total views of pageA / pageTotal count</span></b></div>
<div>
<span style="background-color: white;">..</span></div>
<div>
<b><span style="background-color: white;">total views of pageX / pageTotal count</span></b></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;">with those counts in place, you can now build a model of views and keep each item B if </span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;"> <b>P(B | A) > P(B)</b></span></div>
<div>
<span style="background-color: white;"><br /></span></div>
<div>
<span style="background-color: white;">Note that building the model consists of getting the item:item output, and filtering items as above.</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">If the data set is larger, and the overall size of the set spans several blocks, then processing the results via mapreduce becomes cumbersome if you don't know the total view count of all pages. I think that this last step definitely requires storing the raw probabilities in a lookup table, then processing the conditional probabilities and filtering out the lists by doing the comparision between P(B | A) and P(B).</span><br />
<span style="background-color: white;"><br /></span>
<span style="background-color: white;">How to build the model and store the data above for a more optimal retrieval during model build time is definitely a completely new blog post :) </span></div>
</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com0tag:blogger.com,1999:blog-8840067776782114927.post-48077946966544861792012-06-10T21:46:00.002-07:002012-07-16T20:32:18.782-07:00Calculating Conditional Probability via Mapreduce part 3: item to item mappings<div dir="ltr" style="text-align: left;" trbidi="on">
In my <a href="http://arunxjacob.blogspot.com/2012/05/calculating-conditional-probability-via.html">last post</a> I gave an example of code that grouped views by user ID. In this post I'm going to show how to take the output of that mapreduce and generate item to item mappings that show the conditional probability of URIs being clicked on, given that a specific URI was clicked on.<br />
<br />
The output from the views-by-user-ID reducer had the key, followed by a tab, followed by tab delimited URI and count pairs (colon separated).<br />
<br />
<b style="font-style: italic;">user1</b><b><span style="color: red;">\t</span></b><b style="font-style: italic;">uriX:3</b><b><span style="color: red;">\t</span><i>uriY:4</i><span style="color: red;">\t</span><i>uriZ:12....</i></b><br />
<b><i>user2</i></b><span style="color: red;"><b>\t</b></span><b><i>uriW:5</i><span style="color: red;">\t</span><i>uriQ:5</i></b><br />
<b><i>...</i></b><br />
<br />
We want to generate output that shows the following: given that the user has clicked on X, how many times did they click on Y vs Z, or P? In this output, we are not concerned with specific users, but user behavior in general. We want to sum up general conditional probabilities.<br />
<br />
In the mapper, our input is as above: a more concrete example is :<br />
<br />
<b>user1<span style="color: red;">\t</span>http://foo.com/page1:12<span style="color: red;">\t</span>http://foo.com/page2:23<span style="color: red;">\t</span>http://foo.com/page3:3</b><br />
<br />
In the mapper we generate permutations of the different key pairs (and ignore the counts for now). In the example above we would generate:<br />
<br />
key=>value<br />
http://foo.com/page1=>http://foo.com/page2<br />
http://foo.com/page1=>http://foo.com/page3<br />
http://foo.com/page2=>http://foo.com/page1<br />
http://foo.com/page2=>http://foo.com/page3<br />
http://foo.com/page3=>http://foo.com/page1<br />
http://foo.com/page3=>http://foo.com/page2<br />
<br />
Here is the mapper code to generate the permutations<br />
<pre>public static class ItemToItemGroupingMapper extends
        Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(final LongWritable key, final Text value,
            final Context context) {
        // this job takes the input from GroupViewsByUser:
        // user1\turiX:3\turiY:4\turiZ:12
        String raw = value.toString();
        String fields[] = raw.split("\t");
        if (fields.length == 1) {
            context.getCounter(ItemToItemCounters.BAD_MAP_INPUT_FORMAT)
                .increment(1);
            return;
        }
        // fields[0] is the user ID, we skip that for item-item co-occurrence.
        for (int i = 1; i < fields.length; i++) {
            // permute all combinations of co-occurrences
            for (int j = 1; j < fields.length; j++) {
                if (j == i) {
                    continue; // dont count self co-occurrences
                } else {
                    String partsI[] = fields[i].split(GroupViewsByUser.INTERNAL_DELIMITER);
                    String partsJ[] = fields[j].split(GroupViewsByUser.INTERNAL_DELIMITER);
                    if (partsI.length != 2 || partsJ.length != 2) {
                        context.getCounter(ItemToItemCounters.BAD_MAP_INPUT_FORMAT)
                            .increment(1);
                        return;
                    }
                    try {
                        LOGGER.info(partsI[0] + "=>" + partsJ[0]);
                        context.write(new Text(partsI[0]), new Text(partsJ[0]));
                    } catch (IOException e) {
                        context.getCounter(ItemToItemCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                            .increment(1);
                        e.printStackTrace();
                    } catch (InterruptedException e) {
                        context.getCounter(ItemToItemCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                            .increment(1);
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}
</pre>
<br />
In the reducer, these pairs arrive grouped by key, and we can aggregate the co-occurrence counts for each URI. Note that we're not applying any normalized weighting at map time: from the example above, the 12 hits of page1 vs. the 23 hits of page2 don't factor into the counts, because we're ignoring raw view counts and focusing purely on co-occurrences.<br />
<br />
We could do some calculations on these pairs at reduce time -- for a specific URI X, we would sum the co-occurrences for each unique URI Y and divide by the total co-occurrences of all URIs with X, to get the share of X's co-occurrences that Y makes up. We could then reverse sort the URIs by this co-occurrence probability to produce output like this for uriX:<br />
<br />
<b><i>uriX<span style="color: red;">\t</span>uriY=P(Y | X)<span style="color: red;">\t</span>uriZ=P(Z | X)</i></b><br />
<b><i><br /></i></b><br />
a real version would look like:<br />
<br />
<pre>/foo/bar.html /foo/baz.html=0.5 /foo/car.html=0.5</pre>
<br />
As expected, in the example above, the probabilities all sum to 1.0.<br />
<br />
The reducer code to aggregate counts is below. Its more complex than the previous GroupViewsByUser code because of the probability calculation and reverse sorting.<br />
<br />
<br />
<pre>public void reduce(Text key, Iterable<Text> input, Context context) {
    Map<Text, Integer> URItoCount = new HashMap<Text, Integer>();
    Iterator<Text> it = input.iterator();
    long totalCount = 0;
    // these are co-occurrences: sum them.
    while (it.hasNext()) {
        Text value = new Text(it.next());
        Integer count = URItoCount.get(value);
        if (count == null) {
            URItoCount.put(value, Integer.valueOf(1));
        } else {
            URItoCount.put(value, Integer.valueOf(count + 1));
        }
        totalCount++;
    }
    // emit tab delimited and from greatest to least:
    // uri2=prob(uri2 | uri1)    uri3=prob(uri3 | uri1)
    Map<Double, List<Text>> byProb = new HashMap<Double, List<Text>>();
    List<Double> sortedSet = new ArrayList<Double>();
    for (Map.Entry<Text, Integer> entry : URItoCount.entrySet()) {
        double probability = ((double) entry.getValue()) / totalCount;
        List<Text> list = byProb.get(probability);
        if (list == null) {
            list = new ArrayList<Text>();
            byProb.put(probability, list);
            sortedSet.add(probability);
        }
        list.add(entry.getKey());
    }
    // sort from greatest to least.
    Collections.sort(sortedSet);
    Collections.reverse(sortedSet);
    // emit uri:value where value = count / total
    StringBuffer sbuf = new StringBuffer();
    for (double probability : sortedSet) {
        List<Text> uris = byProb.get(probability);
        for (Text uri : uris) {
            sbuf.append(uri.toString()).append(INTERNAL_DELIMITER).append(probability).append(FIELD_DELIMITER);
        }
    }
    // convert the buffer to hadoop Text
    String coOccurrenceBuffer = sbuf.toString();
    LOGGER.debug("reduce: key = " + key.toString() + ", value = "
            + coOccurrenceBuffer);
    Text allViews = new Text(coOccurrenceBuffer);
    try {
        context.write(key, allViews);
    } catch (IOException e) {
        LOGGER.error(e);
        context.getCounter(
                GroupByUserCounters.REDUCE_CONTEXT_WRITE_IO_EXCEPTION)
                .increment(1);
    } catch (InterruptedException e) {
        LOGGER.error(e);
        context.getCounter(
                GroupByUserCounters.REDUCE_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                .increment(1);
    }
}
</pre>
Are we done yet? Well....no. While it's nice to see conditional probabilities, they don't tell us anything about whether any two URIs are correlated. That's <a href="http://arunxjacob.blogspot.com/2012/07/calculating-conditional-probability.html">next</a>. Sample code can be found at https://github.com/arunxarun/cpsamples.</div>
Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com1tag:blogger.com,1999:blog-8840067776782114927.post-27053027510298276462012-05-15T21:54:00.000-07:002012-08-03T11:03:09.223-07:00Calculating Conditional Probability via Mapreduce part 2: group by userID<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
In my last <a href="http://arunxjacob.blogspot.com/2012/03/calculating-conditional-probability-via.html">post</a> I walked through the basics of conditional probability in order to get around to actually discussing how I would implement it using mapreduce.<br />
<br />
The whole calculation of Conditional Probability is a little contrived -- I admitted that up front -- my goal was to provide a 'non word count' example of mapreduce. It also requires multiple mapreduce steps. I'll provide the context around those steps to show how they link together to calculate conditional probability and correlation, both of which are required if we are going to recommend or avoid recommending products when a specific product is shown.<br />
<br />
The entire source for all samples (as I write them) can be found at:<br />
<br />
https://github.com/arunxarun/cpsamples<br />
<br />
<br />
<span style="font-size: large;"><b>Starting Assumptions: the User View Log.</b></span><br />
Log format is dependent on who wrote the application, but if you were using something like <a href="http://en.wikipedia.org/wiki/Common_Log_Format">Common Log Format</a>, you would see the following kind of log:<br />
<br />
<b><i><span style="font-size: x-small;">127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /products/p1234.html HTTP/1.0" 200 2326
</span></i></b><br />
<br />
where:<br />
<br />
<ol style="text-align: left;">
<li>frank is the userid</li>
<li>The timestamp is the bracketed next field</li>
<li>The quoted string afterword contains the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html">HTTP verb</a> plus the uri plus the protocol version</li>
<li>The next field contains the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html">HTTP status code</a></li>
<li>The final field denotes the size of the response returned to the client, in bytes.</li>
</ol>
<br />
<b><span style="font-size: large;">Mapreduce 1: group by UserID</span></b><br />
The first mapreduce will group all pages by user during the map phase and sum page counts by user during the reduce phase. It does this by parsing the input lines and emitting the user and URI as a pair into the context:<br />
<pre>/**
 * the key is the user, the value is the URI
 *
 */
public static class GroupViewsByUserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(final LongWritable key, final Text value, final Context context) {
        // assume common log format:
        // 127.0.0.1\t-\tfrank\t[10/Oct/2000:13:55:36 -0700]\t"GET /apache_pb.gif HTTP/1.0"\t200\t2326\n
        String raw = value.toString();
        String fields[] = raw.split("\t");
        if (fields.length < 6 || fields.length > 7) {
            context.getCounter(GroupByUserCounters.BAD_LOG_FORMAT).increment(1);
            return;
        }
        String userIdText = fields[2];
        String rawHttpInfo = fields[4];
        String uriComponents[] = rawHttpInfo.split(" ");
        if (uriComponents.length < 3) {
            context.getCounter(GroupByUserCounters.BAD_URI_SUBFORMAT).increment(1);
            return;
        }
        Text userId = new Text(userIdText);
        Text uri = new Text(uriComponents[1]);
        try {
            context.write(userId, uri);
        } catch (IOException e) {
            LOGGER.error(e);
            context.getCounter(GroupByUserCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                .increment(1);
        } catch (InterruptedException e) {
            LOGGER.error(e);
            context.getCounter(GroupByUserCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                .increment(1);
        }
    }
}
</pre>
<br />
<br />
The reduce phase will aggregate counts per URI for each user into a text based map. Note that while Sequence Files are a more efficient way to write this information, we're using Text output to minimize complexity.<br />
<br />
<b style="font-style: italic;">user1</b><span style="color: red;"><b>\t</b></span><b style="font-style: italic;">/pages/foo.html:2</b><b><span style="color: red;">\t</span></b><b style="font-style: italic;">/pages/bar.html:1</b><span style="color: red;"><b>\t</b></span><b style="font-style: italic;">/pages/baz.html:1</b><br />
<br />
<pre>/**
 * emit user: uri1:count1...uriN:countN
 *
 */
public static class GroupViewsByUserReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> input, Context context) {
        Map<Text, Integer> URItoCount = new HashMap<Text, Integer>();
        Iterator<Text> it = input.iterator();
        // aggregate views by uri.
        while (it.hasNext()) {
            Text value = new Text(it.next());
            Integer count = URItoCount.get(value);
            if (count == null) {
                URItoCount.put(value, Integer.valueOf(1));
            } else {
                URItoCount.put(value, Integer.valueOf(count + 1));
            }
        }
        // emit view:count format, tab delimited.
        StringBuffer sbuf = new StringBuffer();
        for (Map.Entry<Text, Integer> entry : URItoCount.entrySet()) {
            sbuf.append(entry.getKey().toString()).append(":").append(entry.getValue());
            sbuf.append("\t");
        }
        // convert the buffer to hadoop Text
        String viewBuffer = sbuf.toString();
        LOGGER.debug("reduce: key = " + key.toString() + ", value = " + viewBuffer);
        Text allViews = new Text(viewBuffer);
        try {
            context.write(key, allViews);
        } catch (IOException e) {
            LOGGER.error(e);
            context.getCounter(GroupByUserCounters.REDUCE_CONTEXT_WRITE_IO_EXCEPTION)
                .increment(1);
        } catch (InterruptedException e) {
            LOGGER.error(e);
            context.getCounter(GroupByUserCounters.REDUCE_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                .increment(1);
        }
    }
}
</pre>
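<br />
A minimal driver to wire the mapper and reducer above into a job could look something like the following. This is just an illustrative sketch; the actual job wiring in the cpsamples repo may differ.<br />
<pre>import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupViewsByUserDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "group views by user");
        job.setJarByClass(GroupViewsByUserDriver.class);
        job.setMapperClass(GroupViewsByUser.GroupViewsByUserMapper.class);
        job.setReducerClass(GroupViewsByUser.GroupViewsByUserReducer.class);
        // both the mapper and the reducer emit Text keys and Text values
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // args[0] = raw view logs, args[1] = output directory for the grouped views
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
</pre>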
<br />
<span style="font-size: large;"><b>A Final Note: Unit Testing</b></span><br />
I'd like to point out some quick helper classes that I used to aid my unit testing -- these are at<br />
<br />
https://github.com/arunxarun/cpsamples/tree/master/src/main/java/org/arunxarun/mapreduce/unitmocks.<br />
<br />
I've started to use these (and/or variants of these) to quickly validate that my mappers and reducers are emitting what I think they should: the bolded lines below allow me to retrieve mapped values and validate them at map time:<br />
<br />
<br />
<pre>    @Test
    public void testMapperValidInput() throws IOException, InterruptedException {
        GroupViewsByUser.GroupViewsByUserMapper mapper = new GroupViewsByUser.GroupViewsByUserMapper();
        <b>MockRecordWriter<Text, Text> rw = new MockRecordWriter<Text, Text>();
        Mapper<LongWritable, Text, Text, Text>.Context context = getMapperContext(mapper, rw);</b>
        LongWritable key = new LongWritable(1);
        Text value = new Text("127.0.0.1\t-\tfrank\t[10/Oct/2012:13:55:36 -0700]\t\"GET /products/prod2.html HTTP/1.0\"\t200\t1002\n");
        mapper.map(key, value, context);
        LongWritable key2 = new LongWritable(2);
        Text value2 = new Text("127.0.0.1\t-\tjoe\t[10/Oct/2012:14:35:22 -0700]\t\"GET /products/prod1.html HTTP/1.0\"\t200\t2326\n");
        mapper.map(key2, value2, context);
        <b>Map<Text, Text> map = rw.getMap();</b>
        assertNotNull(map);
        assertTrue(map.size() == 2);
        assertNotNull(map.get(new Text("joe")));
        assertNotNull(map.get(new Text("frank")));
    }
</pre>
</div>
The getMapperContext() method uses most of these stubs (that return empty values) when it returns a context object that can be passed into the mapper.<br />
<br />
<pre>    /**
     * return a valid mapper context.
     *
     * @param mapper
     * @param rw
     * @return
     * @throws IOException
     * @throws InterruptedException
     */
    private Mapper<LongWritable, Text, Text, Text>.Context getMapperContext(
            Mapper<LongWritable, Text, Text, Text> mapper,
            <b>MockRecordWriter<Text, Text> rw</b>) throws IOException,
            InterruptedException {
        Configuration conf = new Configuration();
        TaskAttemptID taskId = new TaskAttemptID();
        Mapper<LongWritable, Text, Text, Text>.Context context = mapper.new Context(
                conf, taskId, <b>new MockRecordReader<LongWritable, Text>()</b>, rw,
                <b>new MockOutputCommitter()</b>, <b>new MockStatusReporter()</b>,
                <b>new MockInputSplit()</b>);
        return context;
    }
</pre>
<br />
The reducer logic is also testable:<br />
<br />
<br />
<pre>    @Test
    public void testReducerValidInput() throws IOException,
            InterruptedException {
        GroupViewsByUser.GroupViewsByUserReducer reducer =
                new GroupViewsByUser.GroupViewsByUserReducer();
        Configuration conf = new Configuration();
        TaskAttemptID taskId = new TaskAttemptID("foo", 1, true, 1, 12);
        <b>MockRecordWriter<Text, Text> dwr = new MockRecordWriter<Text, Text>();</b>
        <b>Reducer<Text, Text, Text, Text>.Context context = getReducerContext(reducer, dwr, conf, taskId);</b>
        Iterable<Text> input = new Iterable<Text>() {
            @Override
            public Iterator<Text> iterator() {
                List<Text> list = new ArrayList<Text>();
                list.add(new Text("http://foobar.com"));
                list.add(new Text("http://foobar.com"));
                list.add(new Text("http://foobar.com"));
                list.add(new Text("http://foobar.com"));
                return list.iterator();
            }
        };
        Text key = new Text("userkey");
        reducer.reduce(key, input, context);
        <b>Map<Text, Text> map = dwr.getMap();</b>
        assertNotNull(map);
        assertTrue(map.size() == 1);
        assertNotNull(map.get(new Text("userkey")));
        Text value = map.get(key);
        assertEquals("http://foobar.com:4\t", value.toString());
    }
</pre>
<br />
Again, the getReducerContext() method returns a mocked up context that is passed into the reducer and evaluated. The MockRecordWriter can be interrogated for results. Here is that getReducerContext():<br />
<br />
<br />
<pre>    /**
     * return a valid reducer context.
     *
     * @param reducer
     * @param dwr
     * @param conf
     * @param id
     * @return
     * @throws IOException
     * @throws InterruptedException
     */
    private Reducer<Text, Text, Text, Text>.Context getReducerContext(
            Reducer<Text, Text, Text, Text> reducer,
            MockRecordWriter<Text, Text> dwr,
            Configuration conf,
            TaskAttemptID id) throws IOException,
            InterruptedException {
        MockOutputCommitter doc = new MockOutputCommitter();
        MockStatusReporter dsr = new MockStatusReporter();
        MockRawKeyValueIterator drkv = new MockRawKeyValueIterator();
        Reducer<Text, Text, Text, Text>.Context context = reducer.new Context(
                conf, id, drkv, new MockCounter(), new MockCounter(), dwr,
                doc, dsr, null, Text.class, Text.class);
        return context;
    }
</pre>
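<br />
The mock classes themselves live in the repo linked above. As an illustration of the general idea, a MockRecordWriter along these lines just captures writes into a Map that the test can inspect -- this sketch assumes one value per key, which is all the tests above need; the real class may differ.<br />
<pre>import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MockRecordWriter<K, V> extends RecordWriter<K, V> {
    private final Map<K, V> map = new HashMap<K, V>();

    @Override
    public void write(K key, V value) {
        // remember the last value written for each key
        map.put(key, value);
    }

    @Override
    public void close(TaskAttemptContext context) {
        // nothing to clean up in the mock
    }

    public Map<K, V> getMap() {
        return map;
    }
}
</pre>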
In my <a href="http://arunxjacob.blogspot.com/2012/06/calculating-conditional-probability-via.html">next post</a> I'll discuss how to take the output of this job and generate item to item mappings.<br />
<br /></div>Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com1tag:blogger.com,1999:blog-8840067776782114927.post-84098116308750086392012-03-18T22:01:00.000-07:002012-07-16T19:47:41.381-07:00Calculating Conditional Probability via Mapreduce Part 1: the theory<div dir="ltr" style="text-align: left;" trbidi="on">
<b><span style="font-size: large;">Motivation:</span></b><br />
<br />
A colleague of mine was bemoaning the lack of non trivial mapreduce examples out there, and wanted to teach a class on how to apply mapreduce to "something other than word count". This sounded like a great idea to me, so I volunteered to provide a mapreduce implementation of Conditional Probability (CP) calculations from user clickstream data.<br />
<br />
CP calculation is one of many collaborative filtering techniques used to deliver personalized product/content recommendations to users at sites like Amazon, Pandora, Netflix, etc. Conditional Probability is the probability of one event happening once another has occurred. Doing this for page views can be used to determine which page views are correlated. Suggesting correlated pages to a user can greatly enhance their browsing and ultimately their purchasing experience.<br />
<br />
For a web application, page correlation is possible to determine because user actions map to page views of products or content that are logged to the application servers. From those logs we can determine which pages are more likely to be viewed together by calculating potential correlations between all available pages. For any large scale website, this means that GBs of data per day need to be processed to calculate CP and correlation -- applying map-reduce in this case allows large amounts of log data to be processed in parallel.<br />
<br />
Implementing CP calculations via mapreduce is a great way to start to understand how collaborative filtering is done at scale -- it is initially non intuitive, but ends up being a great way to understand how mapreduce works. Of course, CP calculation is one part of a larger set of steps, one of which must be determining page correlation once CP has been calculated.<br />
<br />
For me, it was necessary to understand Conditional Probability in order to correctly implement the CP calculation as well as understand how to use it to determine whether events are correlated. And that in turn requires a basic understanding of Probability. <i>Note: my interpretation of basic probability concepts is solely mine, and I welcome all comments and corrections :)</i><br />
<br />
<b><span style="font-size: large;">Probability Overview:</span></b><br />
<br />
The probability that an event occurs can be stated as P(A) = (total count of event A)/(total count of all events). For example, if I say there is a 5% probability that I will introduce a bug into our source code, that means that 5 out of every 100 commits I make contain bugs (people on my team would tend to round that number upward, but they only have anecdotal evidence so far:)<br />
<br />
At this point it's useful to draw a picture. The <a href="http://en.wikipedia.org/wiki/Probability_space">Probability Space</a> for a set of potential events is the representation of all possible events that can happen, with their assigned probabilities, expressed as P(Event).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-01a_Zgpv2n0/T2Kd8BnC7UI/AAAAAAAAhnU/GIQ0Mj1X-Yg/s1600/Probability_Space.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-01a_Zgpv2n0/T2Kd8BnC7UI/AAAAAAAAhnU/GIQ0Mj1X-Yg/s400/Probability_Space.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
In the <a href="http://en.wikipedia.org/wiki/Venn_diagram">Venn Diagram</a> above, the intersection of P(A) and P(B) is shown in purple, and is expressed as A <b>∩</b> B. This is the probability space where both A and B occur, aka the intersection of P(A) and P(B). The complete space covered by A and B is expressed as A ∪ B aka the union of P(A) and P(B).<br />
<div>
<br /></div>
If two events are mutually exclusive, that means they can never occur together. In the above example, Event C is mutually exclusive with both A and B.<br />
<br />
Independent events are never mutually exclusive events. Since independent events by definition may occur together, they cannot be mutually exclusive. For example: I can wear a blue shirt and/or commit a bug. These two events <b><i>may be</i></b> independent. They <b><i>are not</i></b> mutually exclusive. I cannot wear a blue shirt and wear no shirt at the same time. Those two events are mutually exclusive.<br />
<br />
So events that can co-occur may be independent -- in other words they may not be correlated. If A and B are independent, then the probability that one happens does not depend on whether the other happens. They may also be dependent -- B may occur more frequently after A occurs. I'll discuss determining independence more after providing a definition of Conditional Probability.<br />
<br />
<b><span style="font-size: large;">Conditional Probability, Defined and Applied</span></b><br />
<br />
Conditional Probability, the probability of B occurring when A has occurred, is written as P(B | A). From the Venn Diagram above, we are explicitly interested in the B <b>∩</b> A area -- the intersection of A and B -- but only when A occurs. This can be expressed as<br />
<br />
<b>P(B | A) = P(B ∩ A)/P(A)</b><br />
<br />
A Venn Diagram doesn't really show the 'in between state' of A (occurred) and B (not occurred) too well. A better way to visualize CP is to use a <a href="http://www.statisticshowto.com/articles/how-to-use-a-probability-tree-for-probability-questions/">Probability Tree</a>. In this tree, each branch represents a set of probabilities for a single event. Events are represented as nodes. Subsequent events are sub branched from the original event nodes. The sum of each set of branches that share a common origin node must be 1 (or 100%).<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-k_24Mkhq-h4/T2awHUVev-I/AAAAAAAAiGU/7K-NRAzmNmM/s1600/Probability_Tree_Weighted.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-k_24Mkhq-h4/T2awHUVev-I/AAAAAAAAiGU/7K-NRAzmNmM/s1600/Probability_Tree_Weighted.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div>
In the tree above, I'm using some new notation. A' is the complement of A, which means that P(A') is the probability that A will not happen. The complementary probability is accounted for explicitly in a probability tree, where it is only 'implied' in a venn diagram.<br />
<br /></div>
This tree shows the possible state map for A and B, in a specific order. If A occurs, there is a probability that B will or will not occur, noted by the conditional probabilities. The same set of possible B states is available if A doesn't occur. The probability tree makes it clear that conditional probabilities represent the state where A (or A') has occurred.<br />
<br />
The tree above is incomplete because it doesn't assign actual probabilities to the states. Possible probabilities are shown below.<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-17fFVgBA_GI/T2axUgyLBLI/AAAAAAAAiGg/B1rBPp7xsuw/s1600/Probability_Tree_Weighted2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-17fFVgBA_GI/T2axUgyLBLI/AAAAAAAAiGg/B1rBPp7xsuw/s1600/Probability_Tree_Weighted2.png" /></a></div>
To get the actual probability of an event, you traverse the tree and multiply the values of the branches traversed. This works because of the previous definitions. Remember that the probability of both A and B occurring is P(B <b>∩ </b>A). From the equation above,<br />
<br />
P(B | A) = P(B <b>∩ </b>A)/P(A) -><br />
<b>P(B </b><b>∩ </b><b>A) = P(A)*P(B | A)</b><br />
<br />
P(A) = 25%, P(B | A) = 33%, P(B<b>∩</b>A) = 25%*33% <b>= 8.25%</b><br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
The probability tree above defined an event space with two events that co-occur. One thing it doesn't give us is P(B). But we can get P(B), assuming that A is the only other event in the space, by adding up end states. P(B) can be expressed as P(B<b>∩</b>A) + P(B<b>∩</b>A'). If we traverse A->B to get P(B<b>∩</b>A), we get P(A)*P(B | A). If we traverse A'->B to get P(B<b>∩</b>A'), we get P(A')*P(B | A'). The sum of these states is the total probability of B happening in this event space:<br />
<br />
P(B) = P(A)*P(B | A) + P(A')*P(B | A')<br />
= 25%*33% +75%*55% = 8.25% +41.25% <b>= 49.5%.</b><br />
<br />
<b><span style="font-size: large;">Conditional Probability and Independent Events</span></b><br />
<br />
All of that is neat (really!), but how do we tie it back to providing real value to our site? We do this by realizing how CP helps determine whether an event is independent or not. If A and B are truly independent, then<br />
<br />
P(B | A) = P(B).<br />
<br />
In other words, <i style="font-weight: bold;">B is independent from A when B happens regardless of whether A happens. </i>An example of this is two consecutive rolls of a dice. The probability of getting a 1 on the second roll is not affected by the value obtained from the first roll. <br />
<br />
Contrast this with two dependent variables. If I had 5 pairs of black socks in a drawer (B), and 5 pairs of white socks (W), and wanted to calculate the probability of getting a black pair P(B) when pulling a pair out of my drawer, it would originally be 50%, <b><i>but would change because there would be one pair less of black or white socks after my first pull</i></b>.<br />
<br />
If P(B | A) = P(B), and<br />
<br />
P(B | A) = P(B <b>∩ </b>A)/P(A), then for independent events,<br />
<br />
P(B) = P(B <b>∩ </b>A)/P(A) because P(B | A) = P(B) and<br />
<br />
<b>P(B)*P(A) = P(B </b><b>∩ </b><b>A)</b><br />
<br />
<b>So if B is independent from A, the probability of B happening after A is the same as the probability of B multiplied by the probability of A.</b><br />
<br />
If P(B <b>∩ </b>A) > P(B)*P(A), the two are <b><i>positively correlated</i></b> -- there is a greater possibility of B occurring when A occurs than the two occurring independently.<br />
<br />
If P(B <b>∩ </b>A) < P(B)*P(A), the two are <b><i>negatively correlated</i></b> -- there is less of a possibility of B occurring when A occurs than the two occurring independently.<br />
<br />
<br />
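Applying this to the probability tree above: P(A) = 25% and P(B) = 49.5%, so if A and B were independent we would expect <b>P(B ∩ A) = P(B)*P(A) = 49.5% * 25% = 12.375%</b>. The tree actually gives <b>P(B ∩ A) = 8.25%</b>, which is less than 12.375%, so in that event space B is negatively correlated with A -- a view of A makes a subsequent view of B less likely, and B would be a poor recommendation to show on page A.<br />
<br />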
Variable (in)dependence is something to consider when recommending products or content. If we recommend content that we think is correlated when it is (a) independent or (b) negatively correlated, that recommendation may drown out other recommendations that are actually positively correlated, but whose probabilities are much less than the independent or negatively correlated item.<br />
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>Conclusion (So Far)</b></span></div>
<div>
<br /></div>
<div>
As discussed above, calculating CP is part of a larger effort to determine whether pages are correlated, independent, or anti-correlated. To do this for two events A and B it is necessary to know P(B | A), P(B), and P(A). In my <a href="http://arunxjacob.blogspot.com/2012/05/calculating-conditional-probability-via.html">next post</a> I'll provide a high level algorithm to get these numbers and determine correlated page views from a generic server log containing pages viewed by user ID. </div>
<div>
<br /></div>
<br /></div>Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com5tag:blogger.com,1999:blog-8840067776782114927.post-13428030529503888452011-11-02T07:05:00.000-07:002012-01-23T15:55:41.461-08:00Schema On Read? Not so fast!<div dir="ltr" style="text-align: left;" trbidi="on">
I just got back from <a href="http://www.hadoopworld.com/">HadoopWorld</a>. I have many thoughts on what I saw and heard there, but that is probably a separate post. I've been trying to write something for the last 3 months, and HadoopWorld gave me the clarity I needed to finish it.<br />
<br />
There was this phrase I kept hearing in the halls and meeting rooms....<a href="http://howsoftwareisbuilt.com/2010/01/06/interview-with-amr-awadallah-cloudera-cto/">"Schema on Read"</a>. Schema on Read, in contrast to Schema on Write, gives you the freedom to define your schema after you've stored your data. Schema on Read sounds like Freedom. And who wouldn't like Freedom from the Tyranny of Schemas?<br />
<br />
If, by the way, that phrase rings a bell, it may be because of this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.youtube.com/embed/hEqQMLSXQlY?feature=player_embedded' frameborder='0'></iframe></div>
<br />
All <a href="http://knowyourmeme.com/memes/downfall-hitler-reacts#.TrDZbFY9wgw">Downfall Meme</a> kidding aside, Schema on Read sure seems nice. Because there is nothing in any of the current NoSQL storage engines that enforces a schema down to the column level, can we just not worry about schemas? What's the point?<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Schemas -- good for the consumer, bad for the producer. </span></b><br />
<br />
The point is that schemas are a guarantee to a consumer of the data. They guarantee that the data follows a specific format, which makes it easier to consume. That's great for the consumer. They get to write processes that can safely assume that the data has structure. <br />
<br />
But...not without a cost, to someone. For the producer of the data, schemas can suck. Several reasons:<br />
<ol>
<li>Because you never get them right the first time, and you're left doing schema versioning and writing upgrade scripts to compensate for your sins of omission -- basically you get nailed for not being omniscient when you designed the schema. </li>
<li>Because they don't do well with variability. If you have data with a high rate of variability, your schema is guaranteed to change every time you encounter values/types/ranges that you didn't expect. </li>
</ol>
The various parties pumping freedom from schemas via NoSQL technologies seem to have an additional implicit message -- that <i>even though you don't have to lock down your data to a schema, you still get the benefits of having one -- the data is still usable</i>. Specifically, if you don't define the data, or only partially define it, you can still get value from it. Because you're storing it. Is that true?<br />
<br />
Sure it is. Sort of. Take a file in HDFS. If the file isn't formatted in a specific manner, can it still be processed? As long as you can split it, you can process it. Will the processing code need to account for an undefined range of textual boundary conditions? Absolutely. That code will be guaranteed to break arbitrarily because the format of the data is arbitrary.<br />
<br />
The same thing can happen with column families. Any code that processes a schema-free column family needs to be prepared to deal with any kind of column that is added into that column family. Again, the processing code needs to be incredibly flexible to deal with potentially unconstrained change. Document stores are another example where even though the data is parse-able, your processing code may need to achieve sentience a la <a href="http://en.wikipedia.org/wiki/Skynet_(Terminator)">Skynet</a> in order to process it.<br />
<br />
So, yes, you can get value from randomly formatted data, if you can write bulletproof, highly adaptable code. That will eventually take over the world and produce cyborgs with Austrian accents that travel back in time.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://cache.gawker.com/assets/images/io9/2011/04/terminator-3-the-redemption.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="320" src="http://cache.gawker.com/assets/images/io9/2011/04/terminator-3-the-redemption.jpg" width="222" /></a></div>
But what about those of us that process (semi) structured data? Web logs, for example. Or (XML/JSON) data feeds. Things that have some kind of structure, where the meaning and presence -- aka the semantics -- of fields may change but the separators between them don't. Do we really need freedom from the tyranny of something that guarantees structure when we are processing things that have a basic structure?<br />
<br />
Yes. Even though format may be well defined, semantics can be quite variable. Fields may be optional, mileage may vary. Putting some kind of schematic constraint on all data invalidates one of the key bonuses of big data platforms -- we wouldn't be able to clean it because we wouldn't be able to load it if we had to adhere to some kind of well defined format. In the big data world, imposing a schema early on would not only suck, it would suck at scale.<br />
<br />
However, the moment we have done some kind of data cleansing, and have data that is ready for consumption, it makes sense to bind that data to a schema. Note: by schema I'm talking about a definition of the data that is machine readable. JSON keys and values work quite well, regardless of the actual storage engine.<br />
<br />
Because the moment you guarantee your data can adhere to a schema, you liberate your data consumers. You give them...freedom! Freedom from....the tyranny of undefined data!<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">But wait, that's not all...</span></b><br />
<br />
What else comes along for free? How about a quality bar by which you can judge all data you ingest? If your data goes through a cleansing process, you could publish how much data didn't make it through the cleansing process. Consumers could opt out of data consumption if too much data was lost.<br />
<br />
And when your data changes (because it will) your downstream customers can opt out because none of the data would pass validation. This <a href="http://martinfowler.com/ieeeSoftware/failFast.pdf">fast failure</a> mode is much preferred to the one in which you discover that your financial reports were wrong after a month because of an undetected format change. That isn't an urban myth, it actually happened to -- <i>ahem</i> -- a friend of mine, who was left to contemplate the phrase that '<a href="http://www.childhoodaffirmations.com/stage7/motivation.html">pain is a very effective motivator</a>' while scrambling to fix the issue :)<br />
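<br />
To make the idea concrete, here is a rough sketch of what that kind of gate could look like. This is purely illustrative -- the class name, method names, and the stand-in "schema check" are invented; a real implementation would validate each record against the published, machine readable schema:<br />
<br />
<pre>// Hypothetical sketch: validate cleansed records and track the reject rate so
// downstream consumers can fail fast (or opt out) when the data changes shape.
import java.util.List;

public class SchemaGate {

    private long seen = 0;
    private long rejected = 0;
    private final double maxRejectRate;

    public SchemaGate(double maxRejectRate) {
        this.maxRejectRate = maxRejectRate;
    }

    // 'record' is a parsed line; the field count stands in for a real schema check.
    public boolean accept(List<String> record, int requiredFieldCount) {
        seen++;
        boolean valid = (record != null && record.size() == requiredFieldCount);
        if (!valid) {
            rejected++;
        }
        return valid;
    }

    // Publish this alongside the data set; consumers can opt out if it is too high.
    public double rejectRate() {
        return seen == 0 ? 0.0 : (double) rejected / seen;
    }

    public boolean withinTolerance() {
        return rejectRate() <= maxRejectRate;
    }
}
</pre><br />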
<br />
<b><span class="Apple-style-span" style="font-size: large;">So what does this all mean?</span></b><br />
While the costs of Schema on Write are (a) well known and (b) onerous, Schema on Read is not much better the moment you have to maintain processes that consume the data.<br />
<br />
However, by leveraging the flexibility of Hadoop, Cassandra, HBase, Mongo, etc, and loading the data in without a schema, I can then rationalize (clean) the data and apply a schema at that point. This provides freedom to the data producer while they are discovering what the data actually looks like, and freedom to the data consumer because they know what to expect. It also lets me change over time in a controlled manner that my consumers can opt in or out of.<br />
<br />
That's not Schema on Read or Schema on Write, it's more like Eventual Schema. And I think it's a rational compromise between the two.<br />
<br /></div>Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com2tag:blogger.com,1999:blog-8840067776782114927.post-28536013866148055292011-04-18T20:42:00.000-07:002011-04-18T20:42:58.456-07:00Rolling out splittable lzo on CDH3Until splittable lzo, compression options in HDFS were limited. Gzip generated unsplittable output -- great for reducing allocated block usage, terrible for mapreduce efficiency. Bz2 generated splittable output, but took far too long to be effectively used in production.<br />
<br />
When we wanted to start incorporating compression into our storage procedures, splittable lzo was the only rational option to ensure parallel processing of compressed files.<br />
<br />
We had tried to use bz2 compression on files prior to ingestion, but it took much longer -- approximately 20x as long as gzip compression on the same file.<br />
<br />
For a 1GB text file, <br />
<ul><li><b><i>gzip -1</i></b> took ~ 25 seconds (actually, this is strange. I was expecting gzip to be slightly faster than lzo)</li>
<li><b><i>lzo -1</i></b> took ~ 9 seconds, indexing took another 4.</li>
<li><b><i>bzip2 -1</i></b> took ~ 3 minutes. </li>
</ul><br />
I set the max speed of each compression routine to provide a relative benchmark: in reality we would be running at a slower speed that increased compression. <br />
<br />
<br />
<span class="Apple-style-span" style="font-size: large;"><b>Installing The Bits</b></span><br />
<br />
<br />
The java and native source for splittable lzo can be found at <a href="https://github.com/kevinweil/hadoop-lzo">https://github.com/kevinweil/hadoop-lzo</a>. If you're using the Cloudera distro, you should use the <a href="https://github.com/toddlipcon/hadoop-lzo">https://github.com/toddlipcon/hadoop-lzo</a> fork.<br />
<br />
The cluster I was installing splittable lzo on was running Centos and walled off from the rest of the world. I found it easiest to generate RPMs on a box with the same architecture, then install those RPMs on all nodes in the cluster. I did this using the <a href="https://github.com/toddlipcon/hadoop-lzo-packager">https://github.com/toddlipcon/hadoop-lzo-packager</a> code, which takes the native and java components and installs them to the right locations. Note that since I was building on a Centos box, I ran<br />
<br />
<pre>./run.sh --no-deb
</pre><br />
to build RPMs only. There were two rpms, the standard one and the debug-info one. The naming convention appears to be YYYYmmDDHHMMSS.full.version.git_hash_of_hadoop_lzo_project.arch, to allow you to upgrade when either the packaging code or the original hadoop lzo code changes.<br />
<br />
The RPMs installed the following java and native bits (note that the packager timestamps the jars):<br />
<br />
<i> rpm -ql cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b-1.x86_64</i><br />
<div><i><br />
</i> <br />
<pre>/usr/lib/hadoop-0.20/lib/cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b.jar
/usr/lib/hadoop-0.20/lib/native
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.a
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.la
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0
/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.0.0
</pre><br />
<i> rpm -ql cloudera-hadoop-lzo-debuginfo-20110414162014.0.4.10.0.g2bd0d5b-1.x86_64</i><br />
<div><br />
<pre>/usr/lib/debug
/usr/lib/debug/usr
/usr/lib/debug/usr/lib
/usr/lib/debug/usr/lib/hadoop-0.20
/usr/lib/debug/usr/lib/hadoop-0.20/lib
/usr/lib/debug/usr/lib/hadoop-0.20/lib/native
/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64
/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.0.0.debug
/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.0.debug
/usr/lib/debug/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/libgplcompression.so.debug
</pre><div><br />
</div><div><span class="Apple-style-span" style="font-size: large;"><b>Hadoop Configuration Changes</b></span><br />
<br />
After installing the bits via RPMs, there were a couple of changes necessary to get Hadoop to recognize the new codec.</div><br />
In core-site.xml:<br />
<br />
<pre><property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</pre><br />
registers the codec in the codec factory. <br />
In mapred-site.xml: <br />
<br />
<pre><property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</pre><br />
sets intermediate output to be lzo compressed. After pushing configs out to all nodes in the cluster, I restarted the cluster. The next step was to verify that lzo was installed correctly.<br />
<br />
<span class="Apple-style-span" style="font-size: large;"><b>Validation</b></span><br />
<br />
There were some hiccups I ran into during validation -- all pilot error, but I wanted to put them all in one place for next time. My validation steps looked like this:<br />
<br />
(1) create an lzo file that was greater than my block size.<br />
(2) upload and index it.<br />
(3) run a mapreduce using the default IdentityMapper<br />
(4) verify that multiple mappers were run from the one lzo file.<br />
(5) verify that the output was the same size and format as the input.<br />
<br />
My first mistake: I lzo compressed a set of files. <b><i><u>The splittable lzo code only works with a single file</u></i></b>. This took me a while to figure out -- mostly due to tired brain. After I had catted the files together into a single file, then lzo'd that file, I was able to upload it to HDFS and index it:<br />
<br />
<pre>hadoop jar /usr/lib/hadoop/lib/cloudera-hadoop-lzo-20110414162014.0.4.10.0.g2bd0d5b.jar com.hadoop.compression.lzo.LzoIndexer /tmp/out.lzo
</pre><br />
This created an index file. From this great <a href="http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/">article</a> on the Cloudera site: "Once the index file has been created, any LZO-based input format can split compressed data by first loading the index, and then nudging the default input splits forward to the next block boundaries."<br />
<br />
Since I had an uploaded, indexed file at this point, I moved to step 3 and 4. Before I could make the IdentityMapper, I needed to get the LZO bits on my mac so that the IdentityMapper could run.<br />
<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><b>Detour: Getting the Bits on my Mac</b></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I dev on a Mac, but run the cluster on Centos (I can already feel the <a href="http://teddziuba.com/2011/03/osx-unsuitable-web-development.html">wrath of Ted Dziuba</a> coming down from on high). I found the instructions <a href="http://code.google.com/a/apache-extras.org/p/hadoop-gpl-compression/wiki/FAQ?redir=1">here</a> to be adequate to get the changes I needed to make to the IdentityMapper code to compile. </div><div><br />
</div><div><b>Back to Validation</b></div><br />
I ran an IdentityMapper on the original source <i>(side note: in 0.20, to run IdentityMapper, just don't specify a mapper, the default Mapper class implements pass through mapping)</i>. I watched the cluster to make sure that the original file was split out across mappers. It wasn't. I was stumped -- I knew this was something simple, but couldn't see what it was. <br />
<br />
After a gentle reminder from Cloudera Support (one of many in the last couple of days, actually:), I <b><i>set my input format class to LzoTextInputFormat</i></b>, which -- as the same article above mentions in the next sentence -- "splits compressed data by first loading the index, and then nudges the default input splits forward to the next block boundaries. With these nudged splits, each mapper gets an input split that is aligned to block boundaries, meaning it can more or less just wrap its InputStream in an LzopInputStream and be done." When I had used the default TextInputFormat, the mapreduce was working, but the input was being compressed and not split.<br />
<br />
<pre>job.setInputFormatClass(LzoTextInputFormat.class);
</pre><br />
Once I had observed splitting behavior from my indexed lzo file by confirming multiple map tasks, I made sure that output was recompressed as lzo by setting FileOutputFormat properties:<br />
<br />
<pre>FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class) ;
</pre><br />
This is different from instructions in <a href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</a>, and I found it after some googling around. The instructions in the book -- setting properties in the Configuration object -- did not work, most likely because the book was written for an earlier version of Hadoop. <br />
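<br />
Putting the snippets above together, a Tool subclass for this validation run might look roughly like the sketch below. It is an illustration rather than the exact job I ran: the job name and paths are placeholders, and the LzoTextInputFormat/LzopCodec classes are assumed to come from the hadoop-lzo jar installed earlier.<br />
<br />
<pre>// Hypothetical driver pulling the lzo-specific settings into one place.
// Assumes the hadoop-lzo jar (installed via the RPMs above) is on the classpath.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.hadoop.compression.lzo.LzopCodec;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class LzoIdentityJob extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "lzo-identity");
        job.setJarByClass(LzoIdentityJob.class);

        // no mapper specified: the default Mapper passes records through unchanged
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // read the lzo index and split the input across mappers
        job.setInputFormatClass(LzoTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // recompress the output as lzo
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new LzoIdentityJob(), args));
    }
}
</pre><br />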
<br />
Once I had added those lines to my Tool subclass, I was able to get compressed output that matched my compressed input: the exact result I was looking for when validating using the IdentityMapper.</div></div>Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com0tag:blogger.com,1999:blog-8840067776782114927.post-63431330856385352082011-04-11T22:15:00.000-07:002011-06-15T21:45:48.713-07:00HDFS file size vs allocationRecently, I had to understand HDFS at a deeper level that had nothing to do with running mapreduce jobs or writing to the FileSystem API. Specifically, I had to understand the way that HDFS interacts with the underlying filesystem, and the difference between actual HDFS file size and the way HDFS calculates available storage when using quotas.<br />
<br />
We recently discovered a bunch of files that were much smaller than our allocated block size -- on average they took up roughly 1/10th of an allocated block.<br />
<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">This was not the <a href="http://www.cloudera.com/blog/2009/02/the-small-files-problem/">standard small file problem</a>, where the namenode requires too much memory to track metadata for large (10s of millions) numbers of files at 150 bytes of metadata per file.</div><br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">My immediate conclusion was that these small files were effectively taking up a block at a time, and that we were running out of space -- fast! -- because that was the behavior I thought I was seeing at the HDFS level -- I thought that storage was allocated a block at a time, and quotas were determined based on available blocks.<br />
<br />
That last statement is partially correct. Storage is allocated a block -- actually a block * replication factor -- at a time. However <b><i>quotas are determined based on available bytes</i></b>. A space quota, <a href="http://hadoop.apache.org/common/docs/current/hdfs_quota_admin_guide.html">according to the docs</a> is "<span class="Apple-style-span" style="font-family: Verdana, Helvetica, sans-serif; line-height: 15px;"><span class="Apple-style-span" style="font-size: x-small;">a hard limit on the number of bytes used by files in the tree rooted at that directory. Block allocations fail if the quota would not allow a full block to be written."</span></span></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">This is what that means: <b><i><u>the only time files are measured in blocks is at block allocation time</u></i></b>. The rest of the time, files are measured in bytes. The space quota is calculated against the number of bytes, not blocks, left in the cluster. That number of bytes is converted to the number of blocks (not bytes) that would be required to store a file when a user tries to upload a file. <b><i><u>The key here is that space is calculated in blocks at allocation time, so no matter how small a file is, you will always need 1 block * replication factor available to put it in the cluster.</u></i></b><br />
<br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><span class="Apple-style-span" style="font-size: large;">HDFS Operational Details</span></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">I spent some time asking, researching, and re-reading <a href="http://oreilly.com/catalog/9780596521981">the book</a>, and found that making analogies from a standard filesystem to understand HDFS helped me immensely -- to a point (more on that later).<br />
<br />
In a standard filesystem, an inode contains file metadata, like permissions, ownership, last time changed, etc, in addition to a set of pointers that point to all blocks that comprise the file. Inodes are kept in a specific location in the filesystem and are used to access files. </div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"></div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">The inode and block equivalents in HDFS are distributed across the namenode and the datanode.<br />
<br />
The namenode maintains file system metadata, which is analogous to the inodes in a standard FS. This metadata is stored in {dfs.name.dir}/current. Datanodes contain blocks of data, stored as block files in the underlying filesystem.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">On the datanode, HDFS stores block data in files in the directory specified by dfs.data.dir, which defaults to {hadoop.tmp.dir}/dfs/data/current. HDFS may create subdirectories underneath that dir to balance out files across directories (many filesystems have a file-per-directory limit). The raw data per block is kept in two files, a blk_NNNN file, and a corresponding blk_NNNN_XXXX.meta file, which contains the block checksum, used in block integrity checks. </div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">The block file and checksum file information is periodically sent to the namenode as a blockreport -- i.e. at HDFS startup (HDFS enters <a href="http://safemode/">safemode</a> while the namenode processes block reports from it's consituent datanodes). Note that each datanode has no idea which block files map to which actual files. It just tracks the blocks. This makes the namenode very critical to HDFS functionality.</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">To summarize: the metadata that inodes maintain in a standard FS is maintained in the HDFS namenode, and actual file data that is maintained in filesystem blocks in a standard FS is maintained in HDFS blocks on datanodes, which store that block data in block files, maintain checksums of the block data for integrity checking, and update the namenode with information about the blocks they manage.</div><br />
<span class="Apple-style-span" style="font-size: large;">FileSystem Analogies That Do and Don't Work</span><br />
<br />
In a standard filesystem, disks have a minimum amount of data that they can read or write to, this is called a disk block. Unix disk blocks are 512 bytes. FileSystems also have minimum read/write filesystem blocks that are typically 1-2kb.<br />
<br />
Files on a standard filesystem are typically much larger than a block in size. Since most files are not exactly X blocks in size, the 'remainder' of the file that does not fill up a block still consumes a full block's worth of space on the system. In general (<a href="http://en.wikipedia.org/wiki/ReiserFS">ReiserFS</a> being one exception) the difference between the file's real size and its allocated block size -- the slack -- cannot actually be used for any other file.<br />
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">In HDFS, if a file is smaller than a block in size, it does not take up an entire HDFS block on disk. There is no concept of HDFS block 'slack space'. A small file takes up as many bytes as it would in a normal filesystem because it is stored as a block file in the normal filesystem. This is where the definition of HDFS block differs from a traditional file block, and this is where my mental model of HDFS as a filesystem failed me :)</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"><br />
</div><div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">While the file and block analogy is valid in HDFS, the size of the blocks makes the difference between file allocated size (always represented in blocks) and file actual size (always in bytes) much larger than it would be on a traditional file system. <b><i>So you can't treat allocated vs actual size as equivalents, like you effectively can on a traditional filesystem where the block size to file size ratio is relatively tiny. </i></b></div><br />
<span class="Apple-style-span" style="font-size: large;">Small Files on the Datanodes</span><br />
<br />
At allocation time, a small file will require a single block file per datanode. Note that the actual number of blocks required to store that file on the cluster depends on HDFS replication policy, which defaults to 3. So factoring in replication, a small (less than 1 block) file is replicated as three identical block files on separate nodes.<br />
<br />
That block file is the same size as the small file -- large files would span several blocks and be split into block size files -- a large file that was 350MB big on a system with 128MB block size would be split into 3 blocks, the first two of 128MB, the last one of 94MB. Each of those would be replicated according to the replication policy of the cluster. The only files that don't take up space on the datanodes are zero byte files, which still take up space on the namenode.<br />
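<br />
As a side note (not from the original write-up), this is easy to see through the FileSystem API: the block locations for a file report how many blocks it actually spans and how big each one is. A small, illustrative sketch, with the path taken as a program argument:<br />
<br />
<pre>// Illustrative snippet: list the blocks (and their sizes) backing a file in HDFS.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("file length (bytes): " + status.getLen()
            + ", block size (bytes): " + status.getBlockSize()
            + ", blocks: " + blocks.length);
        for (BlockLocation block : blocks) {
            // a small file has a single block whose length is the file's actual size
            System.out.println("  block bytes: " + block.getLength());
        }
    }
}
</pre><br />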
<br />
Regardless of actual size, at allocation time, HDFS treats a small file as having a minimum size of one HDFS block per datanode when it is calculating available disk space. So, even if a file is really small, if there are fewer than three blocks available on the cluster (one per replica), the file cannot be stored on the system.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">Space Quotas</span><br />
HDFS only has fewer blocks available than it needs to store a replicated file when it is either running out of space or, more commonly, when there is a space quota on the directory that the file is being copied to. Calculating storage cost in blocks allows HDFS to safely store data up to a known maximum size, no matter what the actual size of the file is. HDFS will only permit new block creation if there is enough disk space to create a block on N datanodes, where N is the replication factor.<br />
<br />
This <a href="http://www.michael-noll.com/blog/2011/03/28/hadoop-space-quotas-hdfs-block-size-replication-and-small-files/">article </a>shows how HDFS block size, combined with the replication factor, not file actual size, determines available space.<br />
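<br />
As an aside (not from that article), the byte-based accounting is also visible programmatically: ContentSummary reports the space quota and the replication-adjusted bytes consumed under a directory. A minimal, illustrative sketch, with the directory path taken as a program argument:<br />
<br />
<pre>// Illustrative snippet: compare a directory's space quota to the bytes consumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QuotaReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ContentSummary summary = fs.getContentSummary(new Path(args[0]));
        // getSpaceConsumed() is actual bytes * replication; getSpaceQuota() is -1 if unset
        System.out.println("space quota (bytes):    " + summary.getSpaceQuota());
        System.out.println("space consumed (bytes): " + summary.getSpaceConsumed());
        System.out.println("logical length (bytes): " + summary.getLength());
    }
}
</pre><br />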
<br />
<span class="Apple-style-span" style="font-size: large;">Conclusion</span><br />
<br />
Is this really a problem? Sort of...it's a matter of efficiency. Space quotas are checked by the amount of remaining space on a datanode disk. If a block file takes up 12MB on a system that has a 128MB block size, there are effectively 116MB available to be added back into the available bytes for the space quota -- for a replication factor of 3, that would be 348MB available, or roughly 2.7 blocks. While you could argue that effectively 0.7 blocks of that space is wasted, 2 blocks of that space are still available for quota calculations. While 2.7 blocks is less than the minimum amount of space required to store a file of _any_ size in an HDFS with a replication factor of 3, if you were to have 2 small files of 12MB, you would have roughly 5.4 blocks available across the system -- effectively, if you always mod 'leftover space' by the replication factor, at most you are wasting replication factor # of blocks.<br />
<br />
Granted that's not the most efficient use of disk, but it's not as if a small file takes up a 'virtual' block that gets factored in the next time a file is copied into the cluster.<br />
<br />
The bigger problem with small files is the lack of efficiency that is encountered in mapreduce operations. Reducing the number of mappers being used and traversing blocks of data at a time is not possible with small files -- one mapper is spun up per file, and the overhead involved in copying the jar file to the task tracker node, starting up the JVM, etc, only makes sense if there is a substantial amount of data to process. You can't go wrong with large files -- they will split across blocks, which are processed more efficiently.Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com12tag:blogger.com,1999:blog-8840067776782114927.post-28463242878432778742011-03-14T21:41:00.000-07:002011-03-21T22:10:20.435-07:00Setting up YCSB for low latency data store testing<span class="Apple-style-span" style="font-size: large;"><b>Overview </b></span><br />
When confronted with a problem, my first instinct is to look around to see where that problem has been handled before. This is because I believe that <a href="http://teddziuba.com/2010/10/taco-bell-programming.html">code is a liability</a>, and I want to minimize risk by using code that has been vetted, tested, and put into production by others, and only add to it when necessary.<br />
<br />
Right now I have several problems around storing and accessing lots of data in real time. I have several diverse use cases that span applications, but the one thing all of these use cases have in common is that there is no need for transactional integrity. There is also a need for scale beyond what a traditional RDBMS can provide. The first (un) requirement and the second very urgent requirement are pushing me towards (open source) low-latency, <a href="http://en.wikipedia.org/wiki/NoSQL">NOSQL</a> data stores.<br />
<br />
The two kinds of NOSQL data stores I'm looking at are Document Stores and Key-Value stores. Here is a great <a href="http://ayende.com/Blog/archive/2010/04/11/that-no-sql-thing-ndash-document-databases.aspx">post</a> discussing the differences between the two.<br />
<br />
Some of the questions I need answered to address projects in progress and planned:<br />
<ol><li>What Document Stores and Key-Value Stores out there have heavy adoption rates, a corporate sponsor, or other community support that indicates good performance and support? </li>
<li>How much 'eventual consistency' can an application live with? If data doesn't need to be transactional, can it really be eventually consistent? </li>
<li>Is there a Document Store that is fast enough to act as a Key-Value store? It would be easier to manage one piece of software than two. </li>
<li>How do different Key Value stores compare to one another? Anecdotal evidence is one thing, hard data that I can refer to makes me feel much better.</li>
<li>What happens when I shut a node down? How hard is it to restore?</li>
<li>What are the costs of maintenance of different stores? How hard are they to set up?</li>
</ol>I wanted to have questions 1 and 2 narrow down the range somewhat, evaluate that range against 3 and 4 to filter out slower candidates, leaving me with a smaller set to run by questions 5 and 6.<br />
<br />
In order to answer 3 and 4 above, I need to compare and contrast both Doc and KV stores in an 'apples to apples' way to gauge performance.<br />
<br />
I was psyching myself up to write a generic test framework, when someone pointed me to <a href="http://www.brianfrankcooper.net/">Brian Cooper</a> and <a href="https://github.com/brianfrankcooper/YCSB">YCSB</a>, the Yahoo Cloud Serving Benchmark. I had originally dismissed it as being out of date, but a quick perusal of the code on GitHub convinced me that updating it would not be that hard because it cleanly separates specific database calls from core functionality.<br />
<div><br />
</div><div>YCSB implements different database client abstraction layers, and provides good documentation on how to set them up: https://github.com/brianfrankcooper/YCSB/wiki/getting-started.<br />
<br />
<b><span class="Apple-style-span" style="font-size: large;">Not Quite Ready For Prime Time</span></b></div><div><br />
Before I could fully use YCSB, I had to fix up a couple of things. There are patches submitted for some of these fixes in the root project, but they hadn't been accepted yet. It made more sense for me to fork a repo and make the changes I needed (and push them if they hadn't already been pushed up to the upstream origin repo): https://github.com/arunxarun/YCSB<br />
<br />
Here are some of the fixes I added; they haven't been integrated into the master repo yet:<br />
<br />
<ol><li>The Cassandra7Client needed to be retrofitted to <a href="https://github.com/brianfrankcooper/YCSB/pull/24">use ByteBuffers instead of byte[]s</a>.</li>
<li>The MongoDbClient was t<a href="https://github.com/brianfrankcooper/YCSB/pull/26">hrowing a ClassCastException in the insert() method</a> because it was casting a double encoded in a string to an Integer. </li>
<li>The MongoDbClient was <a href="https://github.com/brianfrankcooper/YCSB/pull/27">not connecting to non localhost MongoDB instances</a> because it wasn't appending the database name to the base database url.</li>
<li>There was no truncate functionality. For a Document Store like Mongo, this meant I had to manually truncate the db every time I wanted to reload data. <a href="https://github.com/brianfrankcooper/YCSB/pull/28">I implemented the truncate method in the DB abstract class</a> (and pushed it to the adaptor classes I used) so that I could do this via YCSB.</li>
</ol></div><div>I'm going to continue adding functionality -- right now I'm in the middle of adding delete functionality because we want to benchmark that as well -- to my fork and pushing it up if I think it could be useful for other people.<br />
<br />
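For illustration, the truncate addition (item 4 in the list above) is roughly this shape. The signature and return value here are invented for the sketch, not the exact code in my fork:<br />
<br />
<pre>// Hypothetical shape of the change: a method on the DB abstraction that
// adaptor classes override with a store-specific implementation.
public abstract class DB {

    // ... existing init()/cleanup()/read()/insert()/update()/delete() elided ...

    /**
     * Remove all records from the given table. Defaults to "not supported" so
     * adaptors that haven't implemented it are unaffected.
     */
    public int truncate(String table) {
        return -1;
    }
}
</pre><br />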
</div><div><b><span class="Apple-style-span" style="font-size: large;">Running A WorkLoad</span></b></div><div><br />
</div><div>In YCSB terminology, a <b><i>workload</i></b> is a defined set of operations on a database. Workloads are stored as flat files, and executed by specific classes that extend the abstract Workload class. I'm using the CoreWorkload class (the default) for now, and may extend later. CoreWorkload lets me set the proportion of Reads vs Writes vs Updates vs Deletes in a separate property file. There are <a href="https://github.com/brianfrankcooper/YCSB/wiki/Core-Workloads">default core workload files</a> stored in the $YCSB_HOME/workloads directory. They break out like this:<br />
<div class="p1"></div><ul><li>workloada = 50/50 read/update ratio</li>
<li>workloadb = 95/5 read/update ratio</li>
<li>workloadc = 100/0 read/update ratio</li>
<li>workloadd = 95/5 read/insert ratio</li>
<li>workloade = 95/5 scan/insert ratio</li>
<li>workloadf = 50/50 read/read-modify-write ratio</li>
</ul>Because they are property files, workload files can be copied/tweaked as needed. If needed, I can also override the CoreWorkload class to do something different, but I haven't had to do that yet, even though I've added new functionality.<br />
<br />
I followed the section on <a href="https://github.com/brianfrankcooper/YCSB/wiki/Running-a-Workload">Running a Workload</a>, below are my notes in addition to those instructions. </div><div><br />
</div><div><b>Building YCSB</b><br />
Pretty self explanatory: there is an ant target for each db client you wish to compile with:<br />
<br />
<b><i>ant dbcompile-[DB Client Name]</i></b><br />
<br />
Just make sure all of the jars your DB Client class needs are in the $YCSB_HOME/db/[Client DB]/lib directory. Note that sometimes, like in the case of Mongo, you may have to find those jars (slf4j, log4j) in other places.<br />
<br />
<b>Data Store Setup</b><br />
There are some generic setup steps for all Data Stores:<br />
<ol><li>Create a [namespace/schema-like-element] called 'userspace'. For example, in Cassandra this would be a keyspace. In Mongo, a database, etc. </li>
<li>Create a [table-like element] in 'userspace', called 'data'. Again, in Cassandra this would be a column family, in Mongo, a collection, in HBase a column family, in MySQL a table. </li>
</ol></div><div>DB Specific details are found on the <a href="https://github.com/brianfrankcooper/YCSB/wiki/Using-the-Database-Libraries">usage page</a>.<br />
<br />
<b>Running YCSB</b><br />
<br />
When running YCSB, make sure you specify the jar files used for the DB Client. The first command you will run starts the YCSB command line shell, which is useful as a connectivity sanity check: </div><br />
<span class="Apple-style-span" style="line-height: 22px;"><b><i>java -cp $YCSB_INSTALL/db/[DB Client dir]/lib/*:$YCSB_INSTALL/build/ycsb.jar com.yahoo.ycsb.CommandLine -db [DB Client class] </i></b></span><br />
<span class="Apple-style-span" style="line-height: 22px;"><b><i><br />
</i></b></span><br />
<span class="Apple-style-span" style="line-height: 22px;">Once you are on the commandline make sure you can connect and see the namespace/keyspace you've created. </span>With that sanity check done, it's time to run a workload. In order to do this I also need to load some data. This is done using the command line client from ycsb.jar:<br />
<br />
<b><i>java -cp $YCSB_INSTALL/db/[DB Client dir]/lib/*:$YCSB_INSTALL/build/ycsb.jar com.yahoo.ycsb.Client -db [DB Client class] -p [commandline props] -P [property files] -s -load > out</i></b><br />
<br />
Some explanation of the available commandline parameters, note that in the above I'm running with one thread, no target ops, and loading the database via the -load parameter.<br />
<ul><li>-threads n: execute using n threads (default: 1) - can also be specified as the "threadcount" property using -p</li>
<li>-target n: attempt to do n operations per second (default: unlimited) - can also be specified as the "target" property using -p</li>
<li>-load: run the loading phase of the workload</li>
<li>-t: run the transactions phase of the workload (default)</li>
<li>-db dbname: specify the name of the DB to use (default: com.yahoo.ycsb.BasicDB) - can also be specified as the "db" property using -p</li>
<li>-P propertyfile: load properties from the given file. Multiple files can be specified, and will be processed in the order specified</li>
<li>-p name=value: specify a property to be passed to the DB and workloads; multiple properties can be specified, and override any values in the propertyfile</li>
<li>-s: show status during run (default: no status)</li>
<li>-l label: use label for status (e.g. to label one experiment out of a whole batch)</li>
<li>-truncate, my own <a href="https://github.com/arunxarun/YCSB/commit/69325cce31bac722f163bc75d8cecb694360dc19">special addition</a>, to clean out data stores between runs. </li>
</ul><br />
Some notes on the properties files I'm loading: the first one specifies the actual workload configuration. I'm using workloads/workloada, which looks like this:<br />
<br />
<pre>recordcount=100000
operationcount=100000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
</pre><br />
When I specify later properties files, they override the values set in the previous ones (commandline props override everything). I take advantage of this by creating other property files that override recordcount, insertioncount, and set db specific properties that are accessed in the DB Client classes.<br />
<br />
The output of the run is the average, min, max, 95th and 99th percentile latency for each operation type (read, update, etc.), a count of the return codes for each operation, and a histogram of latencies for each operation.<br />
<br />
The histogram shows the number of calls that were returned within the specified number of milliseconds. For example:<br />
<pre>0 45553
1 2344
2 399
3 25
4 5
5 0
</pre>reads like this:<br />
<ul><li>45553 calls returned in 0 ms</li>
<li>2344 calls returned in 1ms</li>
<li>399 calls returned in 2 ms</li>
<li>....</li>
</ul><span class="Apple-style-span" style="font-size: large;"><b>That's All For Now...</b></span><br />
<span class="Apple-style-span" style="font-size: large;"><b><br />
</b></span><br />
YCSB as it is provides a very solid foundation for me to do testing across candidate data stores. While it is very well documented, the code could use some love. I intend to give it enough love to evaluate the data stores I want to (in my fork), and push that love upstream. In the future I may end up writing a DB Client for some of the commercial stores we need to evaluate, as well as fix things that bug me.Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com4tag:blogger.com,1999:blog-8840067776782114927.post-74227686551862298852011-01-05T21:46:00.000-08:002011-07-19T21:24:33.079-07:00Setting up CDH3 Hadoop on my new Macbook Pro<span class="Apple-style-span" style="font-size: large;">A New Machine </span><br />
<div style="margin: 0px;">I'm fortunate enough to have recently received a Macbook Pro, 2.8 GHz Intel dual core, with 8GB RAM. This is the third time I've turned a vanilla mac into a ninja coding machine, and following my design principle of "first time = coincidence, second time = annoying, third time = pattern", I've decided to write down the details for the next time. </div><br />
<span class="Apple-style-span" style="font-size: large;">Baseline</span><br />
This section details the pre-hadoop installs I did.<br />
<br />
<b>Java</b><br />
Previously I was running on Leopard, i.e. 10.4, and had to install <a href="http://landonf.bikemonkey.org/static/soylatte/">soylatte</a> to get the latest version of Java. In Snow Leopard, java jdk 1.6.0_22 is installed by default. That's good enough for me, for now.<br />
<br />
<b>Gcc, etc</b>.<br />
In order to get these on the box, I had to <a href="http://developer.apple.com/technologies/xcode.html">install XCode</a>, making sure to check the 'linux dev tools' option.<br />
<br />
<b>MacPorts</b><br />
I installed <a href="http://www.macports.org/">MacPorts</a> in case I needed to upgrade any native libs or tools.<br />
<br />
<b>Eclipse</b><br />
I downloaded the <a href="http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/helios/SR1/eclipse-jee-helios-SR1-macosx-cocoa-x86_64.tar.gz">64 bit Java EE version of Helios</a>.<br />
<br />
<b>Tomcat</b><br />
Tomcat is part of my daily fun, and <a href="http://www.malisphoto.com/tips/tomcatonosx.html">these instructions to install tomcat6</a> were helpful. One thing to note is that in order to access the tomcat manager panel, you also need to specify<br />
<br />
<pre><role rolename="manager"/>
</pre><br />
prior to defining <br />
<br />
<pre><user username="admin" password="password" roles="standard,manager,admin"/>
</pre><br />
Also, I run tomcat standalone (no httpd), so the mod_jk install part didn't apply. Finally, I chose not to daemonize tomcat because this is a dev box, not a server, and the instructions for compiling and using <a href="http://commons.apache.org/daemon/jsvc.html">jsvc</a> for 64 bit sounded iffy at best.<br />
<br />
<span class="Apple-style-span" style="font-size: large;">Hadoop</span><br />
I use the <a href="http://www.cloudera.com/hadoop/">CDH</a> distro. The install was amazingly easy, and their support rocks. Unfortunately, they don't have a dmg that drops Hadoop on the box configured and ready to run, so I need to build up my own pseudo mac node. This is what I want my mac to have (for starters):<br />
<ol><li>distinct processes for namenode, job tracker node, and datanode/task tracker nodes.</li>
<li>formatted HDFS</li>
<li>Pig 0.8.0</li>
</ol>I'm not going to try to auto start hadoop because (again) this is a dev box, and start-all.sh should handle bringing up the JVMs around namenode, job tracker, datanode/tasktracker.<br />
<br />
I am installing CDH3, because I've been running it in <a href="https://wiki.cloudera.com/display/DOC/CDH3+Deployment+in+Pseudo-Distributed+Mode">pseudo-mode</a> on my Ubuntu dev box for the last month and have had no issues with it. Also, I want to run Pig 0.8.0, and that version may have some assumptions about the version of Hadoop that it needs.<br />
<br />
All of the CDH3 Tarballs can be found at http://archive.cloudera.com/cdh/3/, and damn, that's a lot of tarballs. <br />
<br />
I downloaded <a href="http://archive.cloudera.com/cdh/3/hadoop20-0.20.2+737.releasenotes.html">hadoop 0.20.2+737</a>, it's (currently) the latest version out there. Because this is my new dev box, I decided to forego the usual security motivated setup of the hadoop user. When this decision comes back to bite me, I'll be sure to update this post. In fact, for ease of permissions/etc, I decided to install under my home dir, under a CDH3 dir, so I could group all CDH3 related installs together. I symlinked the hadoop-0.20+737 dir to hadoop, and I'll update it if CDH3 updates their version of hadoop.<br />
<br />
After untarring to the directory, all that was left was to make sure the ~/CDH3/hadoop/bin directory was in my .profile PATH settings.<br />
<br />
<b>Pseudo Mode Config</b><br />
I'm going to set up Hadoop in pseudo-distributed mode, just like I have on my Ubuntu box. Unlike Debian/Red Hat CDH distros, where this is an apt-get or yum command, I need to set up conf files on my own. <br />
<br />
Fortunately the example-confs subdir of the Hadoop install has a conf.pseudo subdir. I needed to modify the following in core-site.xml:<br />
<br />
<property><br />
<name>hadoop.tmp.dir</name><br />
<value><i><b>changed_to_a_valid_dir_I_own</b></i></value><br />
</property><br />
<br />
and the following in hdfs-site.xml:<br />
<br />
<property><br />
<!-- specify this so that running 'hadoop namenode -format' formats the right dir --><br />
<name>dfs.name.dir</name><br />
<value><i><b>changed_to_a_different_dir_I_own</b></i></value><br />
</property><br />
<br />
I also had to create masters and slaves files in the example-confs/conf.pseudo directory:<br />
<br />
<pre>echo localhost > masters
echo localhost > slaves
</pre><br />
Finally, I symlinked the conf dir at the top level of the Hadoop install to example-confs/conf.pseudo after saving off the original conf:<br />
<br />
<pre>mv ./conf install-conf
ln -sf ./example-confs/conf.pseudo conf
</pre><br />
<span class="Apple-style-span" style="font-size: large;">Pig</span><br />
Installing Pig is as simple as downloading the tar, setting the path up, and going, sort of. The first time I ran pig, it tried to connect to the default install location of hadoop, /usr/lib/hadoop-0.20/. I made sure to set HADOOP_HOME to point to my install, and verified that the grunt shell connected to my configured HDFS (on port 8020).<br />
<br />
<span style="font-size: large;">More To Come</span> <br />
This pseudo node install was relatively painless. I'm going to continue to install Hadoop/HDFS based tools that may need more (HBase) or less (Hive) configuration, and update in successive posts.Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com5tag:blogger.com,1999:blog-8840067776782114927.post-50937536348865636952010-12-20T22:43:00.000-08:002010-12-24T15:30:01.245-08:00Pig SPLITs, JOINs, and COGROUPs to manipulate multiple relationsI've been playing around with Pig and UDFs for the last couple of weeks as we try to convert an application from using SQL to do ETL to using Pig for the same transforms.<br />
<br />
In this particular application, we need to 'thread' logged messages together by fields that they can be joined on. Different messages represent different state around a single meta-state, kind of like a session, that unifies the different messages. Messages can have a specific type; let's call those A, B, C, and D. The joining rules are: <br />
<br />
<ul><li>A joins B on field y</li>
<li>B joins C on field y</li>
<li>D joins A on field x,y,z</li>
</ul><br />
<span style="font-size: large;">Split</span><br />
<br />
The first step prior to joining messages is to separate them into relations that only contain A,B,C, or D messages using the Pig <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT">SPLIT</a> statement. SPLIT works like this:<br />
<br />
<pre>SPLIT relation INTO something IF condition, something else IF other condition, ...; </pre><br />
Basically SPLIT is a case statement, and I needed to write UDFs to implement the condition tests that determine each message's type.<br />
<br />
<span style="font-size: large;">Writing UDFs for the SPLIT</span><br />
<br />
In previous posts I've written eval UDFs. Those take input and transform it to something else. In this case I needed to implement filter UDFs. Filter UDFs return a boolean value based on their input. <br />
<br />
I've found that the 'top down' approach works well when designing UDFs. By that I mean write the UDFs as they would be used in script:<br />
<br />
<pre>SPLIT RAW_DATA INTO A IF isA(), B IF isB(), C IF isC(), D IF isD();
</pre><br />
and then implement them. Because of the boolean nature of the UDFs, I need to implement four different filter UDF classes, one for each test in the SPLIT statement above. Each one follows the same basic pattern: <br />
<br />
<pre>public class IsA extends FilterFunc {
@Override
public Boolean exec(Tuple someTuple) throws IOException {
return testForA(someTuple);
}
protected Boolean testForA(Tuple someTuple) {
..... // determine if this is a type A, or not.
}
}
</pre><br />
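For concreteness, here is what one of those filters might look like when filled in. This is a hedged sketch -- the field index and the type marker are invented for illustration, and it assumes the script passes the message type field to the UDF:<br />
<br />
<pre>// Illustrative only: a filled-in filter UDF, assuming the script passes the
// message type field to the UDF, e.g. SPLIT RAW_DATA INTO A IF isA(type), ...;
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class IsA extends FilterFunc {

    private static final String TYPE_A = "A";   // invented type marker

    @Override
    public Boolean exec(Tuple someTuple) throws IOException {
        // be defensive: a missing or null field is simply "not an A"
        if (someTuple == null || someTuple.size() == 0) {
            return false;
        }
        Object field = someTuple.get(0);
        return field != null && TYPE_A.equals(field.toString());
    }
}
</pre><br />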
So the SPLIT statement above works as advertised, partitioning the original raw data out by message type. <br />
<br />
<span style="font-size: large;">JOINing Relations</span><br />
<br />
The next part of threading the messages together is to <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28inner%29">JOIN</a> them along common fields. The JOIN statement joins two relations on a single field: <br />
<br />
<pre>JOINED_AB = JOIN A BY y, B BY y;
</pre><br />
NOTE that this JOIN is an inner join; <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#JOIN+%28outer%29">outer joins</a> are a whole other beast. It simply aggregates all fields of A and B together. So the JOINED_AB relation looks like:<br />
<br />
<pre>A::x,A::y,A::p,B::q,B::y,B::z
</pre><br />
If you want to have an authoritative value of y for each tuple of JOINED_AB, you would need to explicitly generate it:<br />
<br />
<pre>JOINED_AB = FOREACH JOINED_AB GENERATE A::y as y, .....;
</pre><br />
In the case above, recall that <br />
<br />
<ul><li>A joins B on field y</li>
<li>B joins C on field y</li>
<li>D joins A on field x,y,z</li>
</ul><br />
to knit these fields together, you would<br />
<br />
<pre>JOINED_AB = JOIN A BY y, B BY y;
JOINED_AB = FOREACH JOINED_AB GENERATE B::y as y, *;
JOINED_AB_C = JOIN JOINED_AB BY y, C BY y;
</pre><br />
At this point we want to join D to JOINED_AB_C, but that needs to be done along a multiple column match. JOIN only handles single column matches. It's time to use <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#COGROUP">COGROUP</a>. <br />
<br />
<span style="font-size: large;">COGROUPing Relations</span><br />
<br />
The first thing we need to do (for clarity) is to regenerate some of the fields in the JOINED_AB_C relation:<br />
<br />
<pre>JOINED_AB_C = FOREACH JOINED_AB_C generate A::x as x, A::y as y, A::z as z; </pre>This allows us to COGROUP without having to dereference by sub-tuple:<br />
<br />
<pre>ALL_DATA = COGROUP JOINED_AB_C BY (x,y,z), D BY (x,y,z);
</pre><br />
This relation is actually comprised of all fields of A, B, C, and D, but because we joined A, B, and C into JOINED_AB_C before joining it to D, the tuple structure looks like this:<br />
<br />
<pre>ALL_DATA: (x,y,z), {JOINED_AB_C: { JOINED_AB::x, JOINED_AB::y, JOINED_AB::z, JOINED_AB::A::field1, JOINED_AB::B::field2}, D: {x,y,z,..}}
</pre><br />
In other words, where a <a href="http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#GROUP">GROUP</a> takes members of the same relation and binds tuples by common fields, creating a group and a bag that holds the list of matching tuples, COGROUP takes members of different relations, binds them by common fields, and creates a bag per relation containing the tuples that share those field values. In fact the COGROUP and GROUP operations are the same; it's just common practice to use COGROUP when grouping multiple relations, and GROUP when grouping a single relation.Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com1tag:blogger.com,1999:blog-8840067776782114927.post-55743836414211346922010-12-01T18:17:00.000-08:002010-12-03T11:58:53.022-08:00Writing a custom PIG Loader<span style="font-size: x-large;">Foreword </span><br />
<br />
I've been pretty happy using the default pig loader, which takes as input the delimiter of a CSV, and loads tuples into memory as specified:<br />
<br />
A = LOAD '/csv/input/inputs*' USING PigStorage() AS (field1,field2,..fieldN)<br />
<br />
However I'm in the middle of doing some transforms on a csv with ~ 226 fields. Yikes. For most of these transforms, we don't need all 226 fields, in fact we probably only need a reasonable subset, but which reasonable subset depends on what we are trying to do. Ideally I'd like to be able to extract the values I want into a tuple like this:<br />
<br />
A = LOAD 'somefile' using CustomLoader(1,4,45,100...) as (timestamp:long, id:long, url:chararray...);<br />
<br />
So, it's time to write a custom loader.<br />
<br />
<span style="font-size: x-large;">Setup</span><br />
<br />
<span style="font-size: large;">Hadoop</span><br />
I first installed the CDH3 distro of hadoop -- specifically the <a href="https://wiki.cloudera.com/display/DOC/Hadoop+Deployment+%28CDH3%29+in+Pseudo-Distributed+Mode">pseudo mode configuration</a>, which runs all hadoop core services, i.e. namenode, datanode, jobtracker and tasktracker on a single box. I then installed hadoop-pig. Cloudera makes this easy by leveraging apt-get and installing to the standard *nix hierarchical locations. The CDH3 version of Hadoop uses <a href="http://www.freelists.org/post/hllug/The-Magic-Behind-etcalternatives">/etc/alternatives</a> to allow for easy version switching, and logs reside in the usual /var/log location. <br />
<br />
<pre>sudo apt-get install hadoop-0.20-conf-pseudo
</pre><br />
after starting core services as described in the link, I then installed CDH3 pig (version 0.7.0) via apt-get:<br />
<br />
<pre>sudo apt-get install hadoop-pig
</pre><br />
CDH3 Pig installs the pig shell script in /usr/bin, and provides libs in /usr/lib/pig. In order to run the pig shell, you need to set JAVA_HOME to /usr/lib/jvm/java-6-sun.<br />
<br />
<span style="font-size: large;">Mavenization</span><br />
<br />
With the necessary services installed, I set up a maven project, mainly for brainless dependency management:<br />
<br />
<pre>mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes </pre><pre>-DgroupId=org.arunxarun.data.prototypes -DartifactId=pigloader
</pre><br />
I then wanted to bring in the pig jars as dependencies. I found hadoop-0.20-core in the mvnrepository, but could not find pig.jar or pig-core.jar in any maven repository. So I installed the pig and pig-core jars to my local repository from the /usr/lib/pig directory where they had been put by the apt-get install. I did that after creating versionless symlinks to the real jars whose names contained version information:<br />
<br />
<pre>mvn install:install-file -Dfile=/usr/lib/pig/pig-core.jar -DgroupId=org.apache.hadoop -DartifactId=hadoop-pig-core -Dversion=0.7.0 -Dpackaging=jar
mvn install:install-file -Dfile=/usr/lib/pig/pig.jar -DgroupId=org.apache.hadoop -DartifactId=hadoop-pig -Dversion=0.7.0 -Dpackaging=jar
</pre><br />
Finally, I made sure that the dependencies were referenced in my pom file: <br />
<pre><dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-pig-core</artifactId>
<version>0.7.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-pig</artifactId>
<version>0.7.0</version>
</dependency>
</pre><br />
<span style="font-size: x-large;">Implementation</span><br />
<br />
As of 0.7.0, Pig loaders extend the <a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> abstract class. This means they need to override four methods:<br />
<ul><li><b>getInputFormat()</b> this method returns to the caller an instance of the InputFormat that the loader supports. The actual load process needs an instance to use at load time, and doesn't want to place any constraints on how that instance is created.</li>
<li><b>prepareToRead() </b>is called prior to reading a split. It passes in the reader used during the reads of the split, as well as the actual split. The implementation of the loader usually keeps the reader, and may want to access the actual split if needed. </li>
<li><b>setLocation()</b> Pig calls this to communicate the load location to the loader, which is responsible for passing that information to the underlying InputFormat object. This method can be called multiple times, so there should be no state associated with the method (unless that state gets reset when the method is called).</li>
<li> <b>getNext()</b> Pig calls this to get the next tuple from the loader once all setup has been done. If this method returns a NULL, Pig assumes that all information in the split passed via the prepareToRead() method has been processed. </li>
</ul>Here is the current implementation: note that the constructor takes a varargs set of Strings, which is the only kind of argument a Pig loader can take. Also note, from the list above, that the RecordReader is set in prepareToRead() but actually used in getNext().<br />
<br />
<pre>import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CustomLoader extends LoadFunc {

    private static final String DELIM = "\t";
    private static final int DEFAULT_LIMIT = 226;

    private int limit = DEFAULT_LIMIT;
    private RecordReader reader;
    private List<Integer> indexes;
    private TupleFactory tupleFactory;

    /**
     * Pig Loaders only take string parameters. The CTOR is really the only interaction
     * the user has with the Loader from the script.
     * @param indexesAsStrings
     */
    public CustomLoader(String... indexesAsStrings) {
        this.indexes = new ArrayList<Integer>();
        for (String indexAsString : indexesAsStrings) {
            indexes.add(new Integer(indexAsString));
        }
        tupleFactory = TupleFactory.getInstance();
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    /**
     * the input in this case is a TSV, so split it, make sure that the requested indexes are valid,
     * and build a tuple from the requested fields.
     */
    @Override
    public Tuple getNext() throws IOException {
        Tuple tuple = null;
        List<String> values = new ArrayList<String>();

        try {
            boolean notDone = reader.nextKeyValue();
            if (!notDone) {
                return null;
            }
            Text value = (Text) reader.getCurrentValue();

            if (value != null) {
                String parts[] = value.toString().split(DELIM);

                for (Integer index : indexes) {
                    if (index >= limit) { // indexes are zero based, so the last valid index is limit - 1
                        throw new IOException("index " + index + " is out of bounds: max index = " + (limit - 1));
                    } else {
                        values.add(parts[index]);
                    }
                }

                tuple = tupleFactory.newTuple(values);
            }

        } catch (InterruptedException e) {
            // add more information to the runtime exception condition.
            int errCode = 6018;
            String errMsg = "Error while reading input";
            throw new ExecException(errMsg, errCode,
                    PigException.REMOTE_ENVIRONMENT, e);
        }

        return tuple;
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit pigSplit)
            throws IOException {
        this.reader = reader; // note that for this Loader, we don't care about the PigSplit.
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location); // the location is assumed to be comma separated paths.
    }
}
</pre><span style="font-size: x-large;">Testing</span><br />
Testing a Pig UDF requires two steps: basic unit testing and integration testing via a script. I'm including this section because it also shows how the loader is accessed via Pig Latin.<br />
<br />
<span style="font-size: large;">Unit Testing: Mocking the Reader</span><br />
I've implemented a MockRecordReader that I can pass into my CustomLoader via prepareToRead(). The MockRecordReader will be accessed when getNext() is called. Note that I've only implemented the methods I need. This is by no means a fully functional RecordReader:<br />
<br />
<br />
<pre>import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class MockRecordReader extends RecordReader<Long, Text> {

    private BufferedReader reader;
    private long key;
    private boolean linesLeft;

    /**
     * call this to load the file
     * @param fileLocation
     * @throws FileNotFoundException
     */
    public MockRecordReader(String fileLocation) throws FileNotFoundException {
        reader = new BufferedReader(new FileReader(fileLocation));
        key = 0;
        linesLeft = true;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    @Override
    public Long getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        String line = reader.readLine();
        if (line != null) {
            key++;
            return new Text(line);
        }
        // no more lines: flag end of input rather than wrapping a null in a Text.
        linesLeft = false;
        return null;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // dont need this for unit testing
        return 0;
    }

    @Override
    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {
        // not initializing anything during unit testing.
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return linesLeft;
    }
}
</pre>
<br />
Implementing unit tests using the MockRecordReader is super easy. Note that I load the MockRecordReader up with a small file of fake data for testing purposes.<br />
<br />
<pre>import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import java.io.IOException;

import org.apache.pig.data.Tuple;
import org.junit.Test;

public class CustomLoaderTest {

    @Test
    public void testValidInput() throws Exception {
        MockRecordReader reader = new MockRecordReader("src/test/resources/valid1line_hit_data.tsv");
        CustomLoader custLoader = new CustomLoader("0", "2", "4");
        custLoader.prepareToRead(reader, null);

        Tuple tuple = custLoader.getNext();
        assertNotNull(tuple);

        String ts = (String) tuple.get(0);
        assertNotNull(ts);
        assertEquals("1130770920", ts);
        String language = (String) tuple.get(1);
        assertEquals("en-ca", language);
        String someCt = (String) tuple.get(2);
        assertEquals("675", someCt);
    }

    @Test(expected = IOException.class)
    public void testInvalidInput() throws Exception {
        MockRecordReader reader = new MockRecordReader("src/test/resources/valid1line_hit_data.tsv");
        CustomLoader custLoader = new CustomLoader("300");
        custLoader.prepareToRead(reader, null);
        custLoader.getNext(); // index 300 is past the field limit, so this should throw
    }
}
</pre><br />
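One more check worth sketching here (this test isn't from the original post; it assumes the same one-line fixture file and the MockRecordReader behavior above, plus a static import of org.junit.Assert.assertNull): getNext() should return null once the reader runs out of lines, because that's how Pig decides the split has been fully processed.<br />
<br />
<pre>    // hypothetical extra test for CustomLoaderTest: verifies the end-of-split contract
    // assumes src/test/resources/valid1line_hit_data.tsv contains exactly one line
    @Test
    public void testGetNextReturnsNullAtEndOfInput() throws Exception {
        MockRecordReader reader = new MockRecordReader("src/test/resources/valid1line_hit_data.tsv");
        CustomLoader custLoader = new CustomLoader("0");
        custLoader.prepareToRead(reader, null);

        assertNotNull(custLoader.getNext()); // the single line in the fixture becomes a tuple
        assertNull(custLoader.getNext());    // no lines left, so Pig sees the end of the split
    }
</pre><br />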
<span style="font-size: large;">Integration Testing</span><br />
Now that I know I haven't generated any NPEs from basic usage (note: there are plenty more tests I could write around malformed input), it's integration test time. Integration testing a loader via Pig Latin is pretty simple: load data, dump it, and validate that it looks like it should. Right now this is manual, basically running the script below, but the output could and should be validated automatically (a sketch of one way to do that follows the script).<br />
<br />
Note that in order to use the UDF I've written, I need to specifically register it as shown in the first line below. <br />
<br />
<pre>register '../../../target/CustomLoader-1.0-SNAPSHOT.jar'
-- the loader is fully path specified, and args are passed in using single quotes.
-- the file being loaded exists at the specified location in HDFS
A = LOAD '/test/hit_data.tsv' USING com.foo.bar.CustomLoader('0','2','6','19') AS (zero:long,two:chararray,six:long,nineteen:chararray);
C = GROUP A BY zero;
-- this forces pig to execute the query plan up to the DUMP, which means invoking the loader.
DUMP C;
-- note that the same loader can be invoked with a different number of arguments, and
-- fields don't have to be cast
-- the file being loaded exists at the specified location in HDFS
B = LOAD '/test/hit_data.tsv' USING com.foo.bar.CustomLoader('100','200');
DUMP B;
</pre><br />
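The DUMP-and-eyeball step above could be automated. Below is a minimal sketch (not from the original post) of driving the same load from Java via PigServer, which is part of the Pig API; the class name is hypothetical, and the jar path, input path, and loader class are the same placeholder values used in the script above.<br />
<br />
<pre>import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

// Sketch: run the load programmatically so the output can be checked by code
// instead of by inspecting a DUMP. Paths and the loader class name are the
// placeholder values from the Pig Latin script above.
public class CustomLoaderIntegrationCheck {

    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE); // or ExecType.LOCAL to skip the cluster
        pig.registerJar("target/CustomLoader-1.0-SNAPSHOT.jar");
        pig.registerQuery("A = LOAD '/test/hit_data.tsv' "
                + "USING com.foo.bar.CustomLoader('0','2','6','19') "
                + "AS (zero:long, two:chararray, six:long, nineteen:chararray);");

        // openIterator forces execution of the plan, just like DUMP does in the script.
        Iterator<Tuple> rows = pig.openIterator("A");
        while (rows.hasNext()) {
            Tuple t = rows.next();
            if (t.size() != 4) {
                throw new AssertionError("expected 4 fields but got " + t.size() + ": " + t);
            }
        }
        pig.shutdown();
    }
}
</pre><br />
This still isn't a full assertion suite, but it turns the dump-and-inspect loop into something a build could run.<br />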
<span style="font-size: x-large;">Conclusion</span> <br />
<span style="font-size: x-large;"><span style="font-size: small;">Writing the code took about 10 minutes. Testing it took much longer. That seems to be the pattern for me when writing (simple) UDFs. What I've noticed about Pig scripts and UDFS is that in order to validate functionality throughout the script/UDF lifecycle you always need to validate the generated Tuples to feel confident that changes have been made without regression.</span></span><br />
<br />
<span style="font-size: x-large;"><span style="font-size: small;">Other than the lack of automation around integration testing, the actual Loader works as advertised -- it might need to change to accommodate new requirements, but it will work just fine for prototypical work with multi column CSV files. </span></span>Arun Jacobhttp://www.blogger.com/profile/17781797469431108786noreply@blogger.com10