Monday, January 11, 2016

Changing Roles

It has been almost 2 years since I've posted anything in this blog.  In July 2014 I decided to leave Disney, and more significantly, change roles. At Disney I had been leading the engineering and analytics teams of a central data service. In my not-so-new role at HP I'm on the product management team for the Helion Development Platform, which in its current incarnation is known as HPE Helion Stackato, a Cloud Foundry based Platform as a Service.

I was (and still am!) excited to do Product Management. As an engineer I have seen how essential the consistent application of product vision and stewardship can be to an engineering effort. I did not know if I really had product vision or was just delusional, but I wanted to find out either way...

The best part about this job is that I get to think - a lot - about how to make software development better for the average enterprise engineer. The struggle is real! Most engineering teams still write IaaS based applications like they were running on bare metal servers, and that leads to a world of hurt, because an application distributed across IaaS provided resources has to contend with underlying network, compute, and storage service failures.

The promise of Platform as a Service is that it enables easy development and deployment of cloud native applications - applications that take advantage of the elasticity of the cloud while dealing with the ephemerality of the underlying IaaS. Cloud native is a concept that takes most engineering organizations some time to get their head around.

As a result, there is a significant educational aspect to my role, which I love. I get to help people focus on creating value with software, something that has enthralled me since I was 14 years old teaching myself BASIC on my Dad's IBM PC/AT.

I get to present my thoughts to various captive audiences, either at customer onsite visits or conferences. Here is a presentation from HP Discover 2015 that I think captures the problem we are trying to solve and the best approaches to take to solve the problem.

In 2016 I'm really excited because of the acceleration and rapid maturity of several key technologies that have the potential to play very well together. 

Containers, mostly via Docker, have brought easy authoring and immutable infrastructure into mainstream software development. As an example, the other day, instead of hand installing Kafka and Zookeeper onto a VM and then doing it again by hand when I needed to grow my test Kafka cluster, I just typed "docker run...", pointing to a Kafka/ZK image built and published to Docker Hub by Spotify. I got to take advantage of all of their hard work, and save the potential multiple hours required to get that image working correctly.  

Kubernetes and Cloud Foundry are viable orchestration mechanisms for distributed, container based applications, handling deployment, scaling, and failure remediation -- deployment and scaling are things that we used to do by hand, late at night, on pins and needles. Dealing with failure usually meant doubling your hardware, or writing complex startup scripts customized to each application. Both approaches are quite differently opinionated, and I can see the merits of each one for different use cases, sometimes in the same overall application stack! 

Mesos has emerged as an intermediate resource management layer that abstracts the underlying IaaS away. The emergence of next generation big data workloads running on Mesos is something that really excites me as an ex service owner who struggled to justify the value of the insights provided by big data technologies against the steep costs of hardware. Having a layer that allocates finite resources and maximizes resource utilization across very diverse workloads makes the start up cost of new experimental data investigation much lower, and therefore much more likely.

All of these technologies are evolving at an incredible rate. I'm excited to see, and hopefully play a role in delivering, the next generation of platforms that make these technologies easy to consume and manage, and allow engineering teams to focus on features instead of infrastructure. 

I'm trying to post more this year - the last 18 months have been a heads down, get it done, tactical march. That was great, but I think I miss key insights when I don't occasionally digest and reflect on what is going on around me. I hope to do more of that here over the next year. It's a New Year's resolution, hopefully one that will last longer than the one I made about not eating sugar :)

Tuesday, March 18, 2014

Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure?

A Swing And A Miss

In my last post I had trained a Bayesian classifier using a dataset pulled from a site that provides reviews of ecommerce sites. I had pulled that data for a single site. I then trained and tested the classifier -- and found that even though it scored 83% accuracy, it had completely mis-classified all positive reviews.

As noted in the last post, only about 17% of my original review data was positive -- 55 positive records against approximately 275 negative ones. This leads me to three questions:
  1. Did I really test and validate the data in the most effective way?
  2. As this is a Bayesian classifier, will increasing the amount of positive data help the classifier identify positive data more effectively?
  3. If so, how do I go about increasing data from a finite set of data? 

Giving the classifier another chance with K fold validation

Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and train validations. That doesn't give me guaranteed coverage of my entire data set or a valid, reproducible process upon which I can try improvements.

I can get that coverage and reproducibility by using k fold validation across the data.

K fold validation works like this:

  1. Break the dataset into K equally sized subsets.
  2. Hold one of the subsets out for testing.
  3. Use all of the other subsets for training.
  4. Train the classifier on the K-1 subsets, then test it on the Kth subset.
  5. Rotate through all subsets: repeat 2-4, holding out a different subset each time.
  6. Average the accuracy of all test+train processes.

When I re-ran my tests using K-fold validation with 10 folds, I got an average accuracy across the entire dataset of 84.6%, which is different from the 83% score that I had gotten doing 'randomized' tests.

In this baseline run I implicitly 'stratified' my test and training data -- all test and training data folds had the same proportion of positive and negative reviews, in this case there was roughly a 1:5 ratio between positive and negative reviews.

Getting data to be k foldable involves two steps: dividing into k folds, and building test data from one fold and training data from all of the rest. I've done these steps separately so that I can iterate through all k test and training sets with the same k folds.

This is the method I used to split an array into k folds:

def partitionArray(self,partitions, array):
        """
        @param partitions - the number of partitions to divide array into
        @param array - the array to divide
        @return an array of the partitioned array parts (array of subarrays)
        """
        incrOffset = len(array) // partitions
        remainder = len(array) % partitions
        lastOffset = 0
        nextOffset = incrOffset
        partitionedArray = []
        for i in range(partitions):
            # copy one fold's worth of items
            partitionedArray.append(array[lastOffset:nextOffset])
            lastOffset = nextOffset
            nextOffset += incrOffset
        # any leftover items go into the last partition
        partitionedArray[-1].extend(array[lastOffset:lastOffset + remainder])

        return partitionedArray

This is the method I used to build test and train sets, holding out the partition specified by the iteration parameter. It assumes I'm handing it two k-partitioned arrays, one with bad reviews and one with good reviews.

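def buildKFoldValidationSets(self,folds,iteration, reviewsByRating):
        """
        build test and training sets
        @param folds - the number of folds (k)
        @param iteration - the offset of the arrays to hold out
        @param reviewsByRating - the set of reviews to build from
        @return training and test arrays
        """
        test = []
        training = []
        for i in range(folds):
            # reviewsByRating is assumed to be a sequence of k-partitioned
            # arrays, one per rating (bad reviews, good reviews)
            for partitionedReviews in reviewsByRating:
                if i == iteration:
                    test.extend(partitionedReviews[i])
                else:
                    training.extend(partitionedReviews[i])

        return training, test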
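
To show how the two steps fit together, here is a minimal, self-contained sketch of the whole rotate-and-average loop. The function names (partition, stratified_k_fold), the toy data, and the stand-in "classifier" that always predicts one star are all hypothetical; this is not the code that produced the numbers in this post.

```python
def partition(items, k):
    """Split items into k contiguous folds; leftovers go into the last fold."""
    size = len(items) // k
    folds = [items[i * size:(i + 1) * size] for i in range(k)]
    folds[-1].extend(items[k * size:])
    return folds


def stratified_k_fold(items_by_label, k):
    """Yield (training, test) pairs, holding out fold i of every label in turn."""
    folds_by_label = [partition(items, k) for items in items_by_label]
    for held_out in range(k):
        training, test = [], []
        for folds in folds_by_label:
            for i, fold in enumerate(folds):
                (test if i == held_out else training).extend(fold)
        yield training, test


# Toy data: 10 positive (5 star) and 50 negative (1 star) reviews,
# mirroring the roughly 1:5 ratio described above.
positives = [("good %d" % i, 5) for i in range(10)]
negatives = [("bad %d" % i, 1) for i in range(50)]

accuracies = []
for training, test in stratified_k_fold([positives, negatives], 10):
    # Stand-in for train + test: a "classifier" that always predicts 1 star.
    correct = sum(1 for (_, rating) in test if rating == 1)
    accuracies.append(correct / float(len(test)))

average_accuracy = sum(accuracies) / len(accuracies)
```

Because every label is partitioned separately, each test fold keeps the same positive to negative ratio as the full data set, and every review is held out exactly once.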

Increasing The Data Set with Sampling

How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach.  In Bayesian classification, there is no information that depends on the sentence structure of the review text or the sequence of words, just words and word frequency counts. And the features (the words) are assumed to be independent from one another.

How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews.  In order to increase the positive data set I need to generate more positive reviews.

Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those bags of words need to line up to the overall word frequency distribution of the entire review set. 

One way to do this is to build the data from the data that already exists, by taking random samples from an array that contains all the words across the set of positive reviews in the training data.

Pretend the following sentence is actually a review:
         The big big green caterpillar ate the small green leaf.

Putting the words in an array looks like this: 
        somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

I can sample that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf', and a 1/5 chance of being 'big'.   I can extend the sample set to be as large as I want -- covering multiple sentences, a review, multiple reviews, etc.

In this case I'm 'sampling with replacement', meaning that I don't remove the sample I get from the sampled set, which means that the probability of picking a word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.

In Python sampling with replacement looks like this:

        # randint includes both endpoints, so the upper bound is len - 1
        word = somearray[random.randint(0,len(somearray)-1)] 
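
As a quick sanity check on the claim that sampling with replacement preserves the word distribution, here is a small sketch that draws many words from the caterpillar array and measures how often 'big' and 'leaf' come up. It uses random.choice (a standard library equivalent of indexing with a random offset); the sample size and the fixed seed are arbitrary choices for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

somearray = ['the', 'big', 'big', 'green', 'caterpillar',
             'ate', 'the', 'small', 'green', 'leaf']

# random.choice never removes anything, so every draw sees the same array.
samples = [random.choice(somearray) for _ in range(100000)]
counts = Counter(samples)

freq_big = counts['big'] / float(len(samples))    # expect roughly 1/5
freq_leaf = counts['leaf'] / float(len(samples))  # expect roughly 1/10
```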

I use this method to create reviews comprised of words randomly selected from the distribution of positive training words, and make sure the review length is the average length of all real positive reviews:

def createReview(self,textFreqDist,reviewLength):
        """
        @param textFreqDist -  the array containing the frequency distribution of words to choose from.
        @param reviewLength -  the length of the review (in words) to build
        @return the generated review as a string
        """
        randLen = len(textFreqDist)
        reviewStr = ""
        for i in range(reviewLength):
            # sample with replacement: pick a random word, leave the array untouched
            reviewStr += (textFreqDist[random.randint(0,randLen-1)] + ' ')

        return reviewStr

A Cautionary Note on Overfitting

When I first did the positive 'boost', I was getting really good results....really, really good results.  99% accuracy on a test set was a number that seemed too good to be true. And it was. 

In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. 

When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. 
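
The ordering described above (hold out first, then boost) can be sketched as follows. The function name boost_positive_reviews, the toy reviews, and the choice to boost until the positive count matches the negative count are assumptions made for illustration; this is not the original training code.

```python
import random


def boost_positive_reviews(train_pos, train_neg, rng):
    """Grow train_pos with synthetic reviews until it matches len(train_neg).

    Words are sampled (with replacement) from the positive TRAINING reviews only.
    """
    word_pool = [w for (bag, _) in train_pos for w in bag]
    avg_len = max(1, len(word_pool) // len(train_pos))
    boosted = list(train_pos)
    while len(boosted) < len(train_neg):
        bag = [rng.choice(word_pool) for _ in range(avg_len)]
        boosted.append((bag, 5))
    return boosted


# Hypothetical toy data: (bag of words, rating) tuples.
positives = [(['great', 'service'], 5), (['fast', 'shipping'], 5),
             (['love', 'it'], 5), (['superb', 'support'], 5)]
negatives = [(['never', 'arrived'], 1)] * 12

# 1. Hold out the test reviews FIRST ...
test_pos, train_pos = positives[:1], positives[1:]
test_neg, train_neg = negatives[:3], negatives[3:]

# 2. ... then boost using only what is left in the training set.
boosted_pos = boost_positive_reviews(train_pos, train_neg, random.Random(0))
generated_words = set(w for (bag, _) in boosted_pos for w in bag)
```

Since the word pool is built after the split, words that appear only in the held-out positive review ('great', 'service') can never leak into the generated training reviews.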

New Test Results

I ran the same 10 fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified K fold validation -- by boosting the number of positive reviews with resampling of the training word data, I am altering the positive to negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same.  The code used to train and test the data is the same as before.

My test results averaged to 89.4%, an improvement from 84.6%. However, when I look at the errors more closely, I see that most of the errors are still due to mis-classifying positive reviews, which is interesting, given that I've boosted positive training data to be 50% of the training set. In the base training run my best effort mis-classified 60% of the positive reviews, and my worst efforts mis-classified 100% of the positive reviews. In the boosted training run my best effort mis-classified 20% of the positive reviews, and my worst effort mis-classified 60% of the positive reviews. 


This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, I only had to account for word frequency. 

If I were to do this over again, I would do the following: 
  1. If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.
  2. Use k-fold validation from the beginning to get the average accuracy per approach.
  3. Investigate mis-classification errors before doing any optimization!

Most of my time was spent analyzing and building the optimal training data set.  The biggest improvement made was not in tweaking the algorithm, but 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was to not examine training errors immediately.

To get improvements, I couldn't treat the algorithm as a black box. I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this wouldn't work with an approach that assumed some level of dependence between words in a review text -- I'd have to calculate that dependency in order to generate reviews. 

A final note: this is a classifier that was trained on a single source of reviews. That's great to classify more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approaches taken would work if we had travel review site training data. 

Potential next steps include:
  1. Getting (more) new data from a different source.
  2. Trying the Bayes classifier on that data
  3. Trying a different classifier, e.g. the maxent classifier on the same data
  4. Going deeper into sentiment: what entities were positive / negative sentiment directed at?

Tuesday, February 4, 2014

Making Sense of Unstructured Text in Online Reviews Part 3: Trying to Improve Classifier Accuracy

This post is part of a series where I try to classify online review text in more and more concrete ways. Right now I'm training a classifier to accurately classify one (bad) vs five (good) starred reviews. In the last post I had done some initial training and testing of an NLTK Bayesian classifier. In this post I want to see if I can improve the accuracy score of my classifier by getting smarter about which features I include.

In the last post I had experimented with varying the size of the feature set, and had found that while encoding more features into a classifier during training helps accuracy, there is an eventual accuracy ceiling. My feature set came from taking the top N words from a frequency distribution of all words in the reviews text. Here is what the accuracy curve looks like:

One other way to improve accuracy is to address the 'quality' of the feature set by looking at features not only in terms of their frequency across the training corpus, but looking at their relative frequencies across classifications.

In the review classification done so far, individual words are the features. I'm going to try to 'tune' feature sets in several different ways -- I have no idea if these will work, but they seem reasonable. I'm going to call these attempts hypotheses, because my goal is to prove them to be true or false, with relatively minimal effort.

Hypothesis 1: Throw away features with a low 'frequency differential'

My hypothesis is that there are features that have a much higher chance of being in a negative review than a positive review, and vice versa. Those are the features that we want to keep. Other features are ones that have approximately the same chance of being in either type of review (positive or negative).

    P(review rating | features) = P(features, review rating)

In the equation above, the P(features, review rating) term is the multiplied probabilities of each P(feature, review rating). If I'm looking for a higher overall probability that a document is one star over five star or vice versa, having per feature probabilities that are similar for one star or five star reviews means that my overall probabilities for one and five star will be close to equal, which could tip classification results 'the other way' and increase my error rate.
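
To make that argument concrete, here is a small sketch with made-up numbers. The words, the per-feature probabilities, and the helper name class_score are all hypothetical; nothing here is taken from my review data.

```python
def class_score(features, prob_given_rating):
    """Multiply the per-feature probabilities for one rating."""
    score = 1.0
    for f in features:
        score *= prob_given_rating[f]
    return score


# Hypothetical per-feature probabilities for each rating.
p_one_star = {'the': 0.050, 'order': 0.020, 'refund': 0.010}
p_five_star = {'the': 0.050, 'order': 0.019, 'refund': 0.001}

# Features with near-identical probabilities barely move the ratio ...
similar = ['the', 'order']
ratio_similar = class_score(similar, p_one_star) / class_score(similar, p_five_star)

# ... while one feature with a large differential dominates it.
with_refund = ['the', 'order', 'refund']
ratio_with_refund = (class_score(with_refund, p_one_star) /
                     class_score(with_refund, p_five_star))
```

With only the two similar features the one star and five star scores are nearly tied; adding the single high-differential feature makes the one star score roughly ten times larger.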

I can validate this hypothesis by filtering out those low probability differential features and keeping the ones that have a high probability differential: a high difference between P(feature, review rating) for {1 star, 5 star} ratings.

Building The Feature Set

I had trained and tested the classifier by taking input data, splitting it into a test and a training set, then training and testing. I will recreate that process now to get the raw data so that I can 'remove' common terms with low probability:

            sjr = SiteJabberReviews(pageUrl,filename)
            asd = AnalyzeSiteData()

I ended up recoding the building of the training and test data so that the data sets being built had a more even distribution of ratings across them:

import random
from collections import defaultdict

def generateLearningSetsFromReviews(self,reviews, ratings,buckets):
        # check to see that percentages sum to 1
        val = 0.0
        for pct in buckets.values():
            val += pct
        if abs(val - 1.0) > 1e-6:
            raise ValueError('percentage values must be floats and must sum to 1.0')
        # get collated sets of reviews by rating. 
        reviewsByRating = defaultdict(list)
        for reviewSet in reviews:
            for rating in ratings:
                reviewList = [(self.textBagFromRawText(review.text), rating) 
                      for review in reviewSet.reviewsByRating[rating]]
                reviewsByRating[rating].extend(reviewList)
                random.shuffle(reviewsByRating[rating]) # mix up reviews from different reviewSets
        # break collated sets across all ratings into percentage buckets
        learningSets = defaultdict(list) 
        for rating in ratings:
            sz = len(reviewsByRating[rating]) 
            lastidx = 0
            for (bucketName, pct) in buckets.items():
                idx=lastidx + int(pct*sz)
                # each bucket gets its share of every rating
                learningSets[bucketName].extend(reviewsByRating[rating][lastidx:idx])
                lastidx  = idx

        return learningSets

When I built up the training data using this method, the sets were returned in the buckets dictionary, keyed by bucket name:

        buckets = asd.generateLearningSetsFromReviews([sjr],[1,5],{'training': 0.8,'test':0.2})

Each set in this dictionary is actually an array of (textBag, rating) tuples:
     buckets = {'training': [(bagOfText,rating)...], 'test': [..]}

I want to get frequency distributions of common terms from one and five star reviews in the training data, so that I can find terms with a high probability differential:                         

            # get common terms and frequency differentials

            allWords1 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 1]
            fd1 = FreqDist(allWords1)
            allWords5 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 5]
            fd5 = FreqDist(allWords5)

            commonTerms = [w for w in fd1.keys() if w in fd5.keys()]

            # now get frequency differentials

            commonTermFreqs = [(w,fd1.freq(w),fd5.freq(w),abs(fd1.freq(w)-fd5.freq(w))) 
                for w in commonTerms]

            # sort by frequency differential, largest first
            commonTermFreqs.sort(key=lambda t: t[3], reverse=True)

Now we've got common terms, sorted by the absolute differential between their frequencies in 1 and 5 star reviews.

If I plot this distribution:

            freqdiffs = [diff for (a,b,c,diff) in commonTermFreqs]

I can see that it falls off sharply:

This looks like a Zipfian distribution: "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word..." 

The shape of this distribution implies that only a small subset of  the terms actually have a frequency differential that really 'matters' in the hypothesis -- all terms aren't needed. I can start arbitrarily by keeping all terms with a frequency differential > 0.001 to quickly test the hypothesis. That leaves 131 of the original 688 common terms.

Note that in getting and filtering common terms, I have not retained the terms that very strongly signal one review rating or another: those would be the terms that exist only in one review rating corpus or another. Note that even though those terms do not exist in one or the other review corpus, and that would make the calculation go to zero, the non existent terms are 'smoothed out' by including them in the other corpus and adding a very small value to the frequency of all terms in that corpus, which guarantees that there are no terms with a zero frequency, and the Bayesian calculation won't zero out.

I would need to add those terms into the set of terms that we filter by.

The full set of filtered terms is comprised of both uncommon and filtered common words:

            filterTerms = [w for (w,x,y,diff) in commonTermFreqs if diff > 0.001]
            fd1Only = [w for w in fd1.keys() if w not in fd5.keys()]
            fd5Only = [w for w in fd5.keys() if w not in fd1.keys()]
            # add the terms that only show up in one rating's reviews
            filterTerms.extend(fd1Only)
            filterTerms.extend(fd5Only)

            defaultWordSet = set(filterTerms) # rename so I dont have to rewrite the encoding method 

And I use those words as features identified at encoding time:

            def emitDefaultFeatures(tokenizedText):
                """
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                """
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet

                return featureSet

Testing The Hypothesis

Now I can train the classifier: asd.encodeData() takes care of encoding features from the training and test sets by calling emitDefaultFeatures() for each review.

            encodedTrainSet = asd.encodeData(rawTrainingSetData,emitDefaultFeatures )
            classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
            encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)

            print nltk.classify.accuracy(classifier, encodedTestSet)

And I get an accuracy of 0.83, the same accuracy I got with no manipulation of the feature set, which is 0.02 less than my optimal accuracy. Whoops.

Detailed Error Analysis

There is one other step I can take to understand the accuracy of the classifier, and that is to analyze the errors made on the test set. If I know how I mis-classified the data, that can help me improve the classifier.

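            shouldBeClassed1 = []
            shouldBeClassed5 = []
            for (textbag, rating) in buckets['test']:
                testRating = classifier.classify(emitDefaultFeatures(textbag))
                if testRating != rating:
                    # file each error under the rating it should have received
                    if rating == 1:
                        shouldBeClassed1.append(textbag)
                    else:
                        shouldBeClassed5.append(textbag)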
A quick check on the error arrays shows me that I've only made mistakes on the reviews that should be classified as positive:

          >>> print len(shouldBeClassed5)
          11

Wait a minute. That number looks familiar. Let me review the raw data again: 
          >>> print len(sjr.reviewsByRating[5])
          55
          >>> print int(0.2*len(sjr.reviewsByRating[5]))
          11

This data shows that I mis-classified all 11 positive reviews in the test data, because my error analysis showed that I had eleven mis-classified positive reviews, and I only had 11 positive reviews in the test set based on an 80% training/20% testing split.

A quick reversal to the original test method (that collected features from a FreqDist of all terms in the training data) shows that I mis-classified all 11 positive reviews as well.
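
The gap between a 'decent' accuracy number and a useless positive classifier is easy to reproduce. In this sketch the 11 positive test reviews come from the numbers above; the 54 negative test reviews are an assumption, derived as int(0.2 * 273) from the 273 negative reviews in the data set. The helper name per_class_recall is hypothetical.

```python
def per_class_recall(gold, predicted):
    """Return {label: fraction of that label's items classified correctly}."""
    totals, hits = {}, {}
    for g, p in zip(gold, predicted):
        totals[g] = totals.get(g, 0) + 1
        if g == p:
            hits[g] = hits.get(g, 0) + 1
    return dict((label, hits.get(label, 0) / float(totals[label]))
                for label in totals)


# 11 positive test reviews (from the post); 54 negatives is an assumption,
# taken as int(0.2 * 273).
gold = [5] * 11 + [1] * 54
predicted = [1] * 65  # every review labelled one star

accuracy = sum(1 for g, p in zip(gold, predicted) if g == p) / float(len(gold))
recall = per_class_recall(gold, predicted)
```

Under those assumptions, labelling everything as one star scores about 0.83 overall while getting none of the five star reviews right, which is why looking at errors per rating matters more than the single accuracy figure.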


This was one attempt to improve classifier accuracy by trying something reasonable with the feature set -- removing features whose probability differential across 1 star and 5 star review corpuses was very small.

While the numbers initially looked 'decent', deeper analysis shows that my classifier completely mis-classified positive reviews. In the future I'll do error analysis of classifiers before trying to theorize about what could make the classifier more accurate.

Looking closer at the data, the data set had 55 total positive reviews and 273 total negative reviews.  In other words only about 17% of my data (55 of 328 reviews) was actually positive review data.

I had originally scraped only one reviewed site for data, but now I think I'm going to need to scrape more sites to get a more representative set of positive review data so that the classifier has more training examples.

In my next post I'm going to try to collect a more representative 'set' of data, and also take a slightly different approach to validating my classifier. I'm going to do error analysis up front and attempt to correct my classifier based on the errors I see, then test the classifier against new test data -- testing a fixed classifier against the data I used to fix it will give me a false sense of accuracy, because the test data used to do error analysis has in effect become training data.

Wednesday, January 22, 2014

Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package.  In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that  I could explore the data interactively. The source is at

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site,

    from sitejabberreviews import SiteJabberReviews
    from analyzesitedata import AnalyzeSiteData

    pageUrl = 'reviews/'
    filename = "siteyreviews.pkl"
    sjr = SiteJabberReviews(pageUrl,filename) # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

    pageUrl = 'reviews/'
    filename = "siteyreviews.pkl"
    sjr = SiteJabberReviews(pageUrl,filename)    

Next Step: Bayesian Classification

NLTK comes with several built in classifiers, including a Bayesian classifier. There are much better explanations of Bayes theory than I could possibly provide, but the basic theory as it applies to text classification is this:  the occurrence of a word across bodies of previously classified documents can be used to classify other documents as being in one of the input classifications. The existence of previously classified documents implies that the Bayesian classifier is a supervised classifier, which means it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

    P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features depends on the probabilities of those features as previously observed in other documents that have the specific rating / the probability that the review has those  features. Since the features are the words in the review, they are the same no matter what the rating is, so that term effectively 'drops out'.  So the probability that a review has a specific rating is the multiplied probabilities of the terms in the review being in previously observed documents that had the same rating.

    P(review rating | features) = P(features, review rating)

This isn't completely true: there are some complexities in the details. For example: while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature) = 0, as this would make the above equation go to zero. In order to avoid that there are smoothing techniques that can be applied. These techniques basically apply a very small increment to the count of all features (including zero valued ones) so that there are no zero values, but the probability distribution essentially stays the same. The size of the increment depends on the values of the probabilities in the probability distribution of P(feature, label) for all features for a specific label.
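
Here is a minimal sketch of that idea using plain additive smoothing. The function name smoothed_prob, the word counts, and the increment of 0.5 are illustrative assumptions; this is not NLTK's own implementation.

```python
from collections import Counter


def smoothed_prob(word, counts, vocabulary, increment=0.5):
    """Additive smoothing: add `increment` to every word count, then normalize."""
    total = sum(counts.values()) + increment * len(vocabulary)
    return (counts[word] + increment) / total


# Hypothetical word counts per rating.
one_star = Counter({'refund': 8, 'order': 12})
five_star = Counter({'great': 9, 'order': 11})
vocabulary = set(one_star) | set(five_star)

# 'refund' never appears in five star reviews, so its raw frequency is zero ...
p_unsmoothed = five_star['refund'] / float(sum(five_star.values()))
# ... but after smoothing it gets a small, non-zero probability.
p_smoothed = smoothed_prob('refund', five_star, vocabulary)
# The smoothed values still form a probability distribution over the vocabulary.
total_prob = sum(smoothed_prob(w, five_star, vocabulary) for w in vocabulary)
```

The unseen word no longer zeroes out the product of probabilities, and the words that were actually observed keep almost all of the probability mass.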

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews. I'll train the Bayesian classifier with one star and five star review data. This is a pretty simple, binary approach to review classification. 

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:
  1. Extract train and test data from my review data.
  2. Encode train and test data.
  3. Train the classifier with the encoded training data.
  4. Test the classifier with the encoded test data.
  5. Investigate errors during the test.
  6. Modify training set and repeat as needed.

I've written a helper method to generate training and test data:

def generateTestAndTrainingSetsFromReviews(self,reviews, key, trainSetPercentage):
       # generate tuples of (textbag,rating)
        reviewList = [(self.textBagFromRawText(review.text), key) 
           for review in reviews.reviewsByRating[key]]
        # split point between the training slice and the test slice
        splitIdx = int(trainSetPercentage*len(reviewList))
        return reviewList[:splitIdx], reviewList[splitIdx:]

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(). In that method I create an array of words after splitting the text into sentences and stripping punctuation and stop words:

    def textBagFromRawText(self, rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, stripped of non text chars including punctuation
        '''
        rawTextBag = []
        sentences = re.split(r'[\.\(\)?!&,]', rawText)
        for sentence in sentences:
            lowered = sentence.lower()
            parts = lowered.split()
            # collect the tokens from every sentence
            rawTextBag.extend(parts)
        # build the stop word set once rather than once per token
        englishStopwords = set(stopwords.words('english'))
        textBag = [w for w in rawTextBag if w not in englishStopwords]

        return textBag

I generate test and training data for one and five star reviews using  generateTestAndTrainingSetsFromReviews():

            # load helper objects
            sjr = SiteJabberReviews(pageUrl,filename)
            asd = AnalyzeSiteData()

            trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
            trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)
            # combine the one star and five star sets
            rawTrainingSetData = trainingSet1 + trainingSet5
            rawTestSetData = testSet1 + testSet5

With training and test data built, I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

            # for raw Training Data, generate all words in the data
            all_words = [w for (words, condition) in rawTrainingSetData for w in words]
            fdTrainingData = FreqDist(all_words)
            # take an arbitrary subset of these
            defaultWordSet = fdTrainingData.keys()[:1000]
            def emitDefaultFeatures(tokenizedText):
                '''
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                '''
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet
                return featureSet

That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: 

      def encodeData(self,trainSet,encodingMethod):
          return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the training data built, we can encode it and train the classifier as follows:

       encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
       classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we will test its accuracy against test data. As we already know the classification of the test data, accuracy is simple to calculate.

       encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
       print nltk.classify.accuracy(classifier, encodedTestSet)

This gives me an accuracy of 0.83, meaning 83% of the time I will be correct. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy drops to 75%.


The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.
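As a rough sketch of what that grooming could look like (this is an assumption about one possible approach, not something I have run against the review data), features can be ranked by how different their probabilities are in the two classes. The probabilities, the 1.5 cutoff, and the function name below are hypothetical, and the probabilities are assumed to be already smoothed so none are zero.

```python
def likelihood_ratio(word, probs_a, probs_b):
    """Ratio (always >= 1) of the larger to the smaller per-class probability."""
    high = max(probs_a[word], probs_b[word])
    low = min(probs_a[word], probs_b[word])
    return high / low

# Hypothetical smoothed probabilities of each word per review class.
one_star = {"scam": 0.30, "order": 0.21, "great": 0.05}
five_star = {"scam": 0.02, "order": 0.20, "great": 0.40}

# Words that are about equally likely in both classes carry little signal.
weak_features = [word for word in one_star
                 if likelihood_ratio(word, one_star, five_star) < 1.5]
print(weak_features)
```

A ratio close to 1 means the word shows up about as often in good reviews as in bad ones, so it is a candidate for removal from the feature set.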

Saturday, January 4, 2014

Making Sense of Unstructured Text in Online Reviews, Part 1


I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason for the great time we had was the amount of up front research we put into finding the right places to stay: we researched the hell out of them via TripAdvisor reviews.

After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kind of issues (focusing on the dislikes here) grew or abated over time.

I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!

I've broken this effort into two+ phases: getting the data, analyzing/profiling the data, and tbd next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)

Phase 1: Getting The Data

I had been out of the screen scraping loop for a while. I had heard of BeautifulSoup, the Python web scraping utility, but I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. BeautifulSoup is the easy button as far as web scraping is concerned.

Here is the bulk of the logic I used to pull pagination data and then use that to navigate to the review pages (I'm focusing on ecommerce sites first):

        # first get the pages we need to navigate to to get all reviews for this site. 
        page = urllib2.urlopen(self.pageUrl)
        soup = BeautifulSoup(page)
        pageNumDiv = soup.find('div',{'class':'page_numbers'})
        anchors = pageNumDiv.find_all('a')
        urlList = []
        for anchor in anchors:
            urlList.append(self.base + anchor['href'])
        # with all pages set, pull each page down and extract review text and rating.
        for url in urlList:    
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page)
            divs = soup.find_all('div',id=re.compile('ReviewRow-.*'))
            for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']

Note the need to download the page first: I used urllib2.urlopen() to get the page. I then created a BeautifulSoup representation of the page:

    soup = BeautifulSoup(page)

Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:

    for div in divs:
        text = div.find('p',id=re.compile('ReviewText-.*')).text
        rawRating = div.find(itemprop='ratingValue')['content']

The text attribute above retrieves the inner text of any element. Element attributes are accessed as keys from the element, like the 'content' one above. The rawRating value was actually pulled from a meta tag that was in the ReviewRow div above:

    <meta itemprop="ratingValue" content="1.0"/>

find()/find_all() are very powerful, and a lot more detail is provided in the documentation. They can search by item ID or by specific attributes (the itemprop attribute above is an example), and regexes can be used to match multiple elements.

Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a reviewsByRating map:

            for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']
                r = Review(text,rawRating)
                # reuse the list for this rating if it exists, otherwise start a new one
                if self.reviewsByRating.has_key(r.rating):
                    reviews = self.reviewsByRating[r.rating]
                else:
                    reviews = []
                    self.reviewsByRating[r.rating] = reviews
                reviews.append(r)

and flushed that map to disk using pickle:

    def saveToDisk(self):
          with open(self.filename,'w') as f:
              pickle.dump(self.reviewsByRating, f)

This let me load the data from file without having to scrape it again:

    def load(self):
          with open(self.filename,'r') as f:
              self.reviewsByRating = pickle.load(f)
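For anyone trying this on Python 3, here is a minimal, self-contained sketch of the same group-by-rating and pickle round trip. It is not the code from my scraper: the sample reviews and the file name are made up, plain strings stand in for the Review wrapper class, and collections.defaultdict replaces the has_key() check.

```python
import os
import pickle
import tempfile
from collections import defaultdict

# Hypothetical (text, rating) pairs standing in for scraped reviews.
scraped = [("Never arrived.", 1), ("Fast shipping!", 5), ("Refund took weeks.", 1)]

# defaultdict creates the empty list the first time a rating is seen.
reviews_by_rating = defaultdict(list)
for text, rating in scraped:
    reviews_by_rating[rating].append(text)

# Python 3 pickles are bytes, so the file is opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), "reviews.pkl")
with open(path, "wb") as f:
    pickle.dump(dict(reviews_by_rating), f)

# Load the map back without scraping again.
with open(path, "rb") as f:
    restored = pickle.load(f)
print(restored)
```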

Next step will be to start investigating the data. 

Sunday, December 8, 2013

Innovation Week Recap

I previously posted about our leadup to Innovation Week, which ended up being more like Innovation Week-and-a-half because it shifted a sprint ending the week of Christmas, which is pretty sparsely attended due to everyone being out of town.

The 10 days of innovation ended up being much more successful than I had thought possible. There was some very out of the box thinking, both in the realm of infrastructure and analytics, and some of these ideas have huge potential to shift how we think about big data.

The reasons I thought that things might not go so well (and what ended up happening):

  1. Lack of ideas. At the time I wrote the last post, we were up to 6. We topped out at 14, all well thought out and presented. We had to narrow the ideas down to 5 based on people's availability -- we did that with a team-wide vote.
  2. Lack of management: we -- the management team -- had specifically decided to let the teams be self organizing, and not interfere with them even if we saw them go off the rails. No one went off the rails, and teams organized around the work and the capabilities of the team members. We did make ourselves available for questions/advice, but other than that we sat back and observed.
  3. Technical roadblocks: the ideas we ended up voting in (as an entire team) had some steep technical hurdles. I wasn't sure if the teams could overcome those, and wasn't sure what they would do if they couldn't. Every team had at least one significant roadblock that they worked around with little to no guidance.
  4. I'm a pessimist/realist, and tend to prepare for worst case scenarios. Apparently I overestimate my own and my management team's contributions :)

The presentations were great in that all except for one were live demos of working software -- one key difference between this and standard demos is that the teams owned the ideas and were therefore much more invested in how the demos went.

We're taking the top  ideas and starting new work that will get prioritized against existing deliverables. While I'm obviously excited about the ideas, some of which I consider to be fundamental game changers, I'm just as excited because of  what I learned  about leading teams. 

Our best ideas come from our people, and when we guide them and set the target, they crush it.  As management our primary job should be to clearly communicate a vision of where the team needs to be, inspire them by giving them ownership and autonomy, and get obstacles out of their way.  

Sometimes I feel like the best teams are the ones that build up ideas the way Barca moves the ball down the field:

There is no 'central control', there is just the idea -- the ball -- and the team, which supports each other as they move the ball downfield, and the magic that happens because the team is focused on doing what it takes to move the ball, develop the attack, and put together a combination that finishes in the opponent's net. What blows me away is that each of these players has amazing skill but they are so much more effective with one touch passing and holding the triangle. I see the same thing on engineering teams that work well together. The top talent doesn't hold onto the ideas, they share them and make themselves available to move it along, and in doing so bring everyone up to their level. Seeing that happen without explicit guidance was the best part of Innovation Week for me.

Sunday, November 17, 2013

Hadoop Streaming with MRJob

Motivation to use Streaming:

Writing Java map-reduces for simple jobs feels like 95% boilerplate, 5% custom code. Streaming is a much simpler interface into MapReduce, and it gives me the ability to tap into the rich data processing, statistical analysis and NLP modules of Python.

Motivation to use mrjob:

While the interface to Hadoop Streaming couldn't be simpler, not all of my jobs are simple 'one and done' map-reduces, and most of them require custom options. MRJob allows you to configure and run a single map and multiple reduces. It also does some blocking and tackling, allowing me to customize arguments and pass them into specified jobs. Finally, mrjob can be applied to an on prem cluster or an Amazon cluster - and we are looking at running Amazon clusters for specific prototype use cases.

mrjob and streaming hurdles

The mrjob documentation is excellent for getting up and running with a simple job. I'm going to assume that you have read enough to know how to subclass MRJob, set up a map and a reduce function, and run it.

I'm going to discuss some of the things that weren't completely obvious to me after I had written my first job, or even my second job. Some of these things definitely made sense after I had read through the documentation, but it took multiple reads, some debug attempts on a live cluster, and some source code inspection.

Hurdle #1: passing arguments

My first job was basically a multi dimensional grep: I wanted to walk input data that had timestamp information in a tab delimited field and only process those lines that were in my specified date range. In order to do this I needed two range arguments that took date strings to do the range check in the mapper. I also wanted to be able to apply specified regex patterns to those lines at map time. Because there were several regex patterns, I decided to put them in a file and parse them. So I needed to pass three arguments into my job, and those arguments were required for every mapper that got run in the cluster.
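Stripped of the mrjob plumbing, the per-line check is roughly the following. This is a sketch written for Python 3 rather than the job's actual code: the function names, the position of the timestamp field, the MM/DD/YY format, and the "keep the line if any pattern matches" rule are assumptions for illustration, and datetime.strptime stands in for dateutil.

```python
import re
from datetime import datetime

def parse_date(text):
    """Parse an MM/DD/YY date string (assumed format)."""
    return datetime.strptime(text, "%m/%d/%y")

def keep_line(line, start, end, patterns, date_field=0):
    """True if the line's timestamp is in range and any regex pattern matches."""
    fields = line.rstrip("\n").split("\t")
    stamp = parse_date(fields[date_field])
    if not (start <= stamp <= end):
        return False
    return any(pattern.search(line) for pattern in patterns)

# The three job arguments: two range bounds and a list of regex filters.
start = parse_date("01/01/13")
end = parse_date("12/01/13")
patterns = [re.compile(r"ERROR")]

lines = [
    "06/15/13\tERROR disk full",   # in range, matches
    "02/01/14\tERROR disk full",   # out of range
    "06/16/13\tINFO all good",     # in range, no match
]
kept = [line for line in lines if keep_line(line, start, end, patterns)]
print(kept)
```

In the real job, the range bounds and patterns are parsed once per mapper in an init function and the check runs inside the mapper for every input line.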

In order to pass arguments into my job, I had to override the configure_options() method of MRJob and use add_passthrough_option() for the range values, and add_file_option() for the file that held the regexes:

def configure_options(self):

All options were passed straight through to my job from the command-line:

python --startDateRange 01/01/13 --endDateRange 12/01/13 --filters filters.json

I referenced them in an init function of my job class, which subclassed the MRJob class:

class MyJob(MRJob):
    def task_init(self):
        self.startDateRange = dateutil.parser.parse(self.options.startDateRange)
        self.endDateRange = dateutil.parser.parse(self.options.endDateRange)
        self.filters = parseJsonOptions(self.options.filters)

This init method was specified in the MyJob.steps() override of the default MRJob method:

def steps(self):
        return [

   = self.task_init,

Something to note here: in the code I had written during development, I had neglected to really read the documentation, and as a result I had done all validation of my custom args using a standard optparse OptionParser in my main handler. This worked for me in inline mode, which is what I was developing in. It does not work at all when running the job on a cluster, and it took some source code digging to figure out why. Do as I say, not as I do :) In hadoop mode, the main MRJob script file is passed to mapper and reducer nodes with the step parameter set to the appropriate element in the steps array. The entry point into the script is the default main, and MRJob has a set of default parameters it needs to pass through to the MRJob subclassed job class. Overriding parameter handling in main effectively breaks MRJob when it tries to spawn mappers and reducers on worker nodes. MRJob handles the args for you: you need to let it handle all arg parsing, and pass custom arguments as passthrough or file options.

Hurdle #2: passing python modules

This nuance has more to do with streaming than it does with mrjob. But it's worth understanding if you're going to leverage non-standard Python modules in your mapper or reducer code, and those modules have not been installed on all of your datanodes.

I was using the dateutil module because it makes parsing dates from strings super easy. On a single node, getting dateutil up and running is this hard:

easy_install python-dateutil

But when you're running a streaming job on a cluster, that isn't an option. Or, it wasn't an option for me because the ops team didn't give me sudoers permissions on the cluster nodes, and even if they did, I would have had to write the install script to ssh in, do the install, and roll back on error. Arrgh, too hard.

What worked for me was to
  1. Download  the source code
  2. Zip it up (it arrived in tar.gz)
  3. Change the extension of the zip file because files that end in .zip are automatically moved to the lib folder of the task's working directory
  4. Access  it from within my script by putting it into the load path: 
import dateutil

I pass dateutil.mod into the job as a file via add_file_option() in myjob.configure_options(). Leveraging the add_file_option() method puts dateutil.mod in the local hadoop job's working directory:

def configure_options(self):

Three things to note from the above code: (1) dateutil.mod is the zip file, (2) I'm referencing a module within the zip file by its path location in that zipfile, and (3) because I've renamed the file, it gets placed in the job working directory, which means it is on my path by default.
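The mechanism underneath this is Python's built-in zip import support, which does not care about the file extension. Here is a small, self-contained Python 3 sketch of the idea. It builds its own throwaway archive, so demo_pkg, the demo-1.0 folder, and the demo_pkg.mod file name are all made up for illustration and are not the dateutil layout.

```python
import os
import sys
import tempfile
import zipfile

def build_module_archive(directory):
    """Create a zip archive with a non-.zip extension that holds one package."""
    archive_path = os.path.join(directory, "demo_pkg.mod")
    with zipfile.ZipFile(archive_path, "w") as archive:
        # The package sits under a versioned top-level folder, like a source download.
        archive.writestr("demo-1.0/demo_pkg/__init__.py",
                         "def greet():\n    return 'hello from the archive'\n")
    return archive_path

archive_path = build_module_archive(tempfile.mkdtemp())

# Put the folder *inside* the archive that contains the package on the import path.
sys.path.insert(0, os.path.join(archive_path, "demo-1.0"))

import demo_pkg  # resolved from inside the renamed zip file

message = demo_pkg.greet()
print(message)
```

On a cluster, the archive would already be sitting in the task's working directory, so the path entry would be relative to that directory rather than to a temp folder.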

This is how I pass dateutil.mod into the job:

python ... --dateutil dateutil.mod

Hurdle #3 (not quite cleared): chaining reduces vs map-reduces

As mentioned in the doc, it's super easy to chain reduces to do successive filtering and processing. Simply specify your multiple reduces in the steps() override:

def steps(self):
        return [
   = self.task_init,
   = self.task_init,


I haven't found it necessary to run successive mapreduces -- successive reduces work just as well in the use cases I've tried. When chaining reduces to the end of your first mapreduce, you can specify the key value from the first mapreduce as the key value in the next reduce.
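To show what re-keying between chained reduces means, here is a plain Python 3 simulation. It does not use mrjob or Hadoop: run_reduce() is a hypothetical stand-in for the sort-and-group step the framework does between stages, and the input lines and function names are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(pairs, reducer):
    """Group (key, value) pairs by key, as the shuffle would, and apply a reducer."""
    output = []
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output

def mapper(line):
    date, pattern = line.split("\t")
    yield (date, pattern), 1

def count_reducer(key, values):
    date, pattern = key
    # Re-key on date so the next reduce sees all patterns for one day together.
    yield date, (sum(values), pattern)

def top_pattern_reducer(date, counted):
    count, pattern = max(counted)
    yield date, pattern

lines = ["01/02/13\terror", "01/02/13\terror", "01/02/13\ttimeout", "01/03/13\ttimeout"]
mapped = [pair for line in lines for pair in mapper(line)]
counted = run_reduce(mapped, count_reducer)        # first reduce: count per (date, pattern)
result = run_reduce(counted, top_pattern_reducer)  # second reduce: top pattern per date
print(result)
```

The first reduce emits a new key (the date alone), and that is what the second reduce groups on. No extra map step is needed in between.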

What is not easy at this time is the ability to save intermediate output to a non intermediate location. While doing that is relatively straightforward in 'inline' mode, the approach suggested in the link won't work in hadoop mode because MRJob is invoking the python script with the right --step-num argument based on what it sees in the steps() method.

I did read about the --cleanup option, but from what I understand the intermediate output dir of a complex job is based on a naming convention, not on something I can set. As this is somewhat of an edge case, I can work around it by chaining MRJob runs with Oozie.


What I've learned about MRJob is that while it does a great job of allowing you to set and pass options, and allows you to construct good workflows (assuming you don't care about intermediate output), it is so easy to use that I fell into the trap of believing that running local on my machine was equivalent to running on a hadoop cluster.

As I've found out several times above, that is not the case. For me the keys here are (1) let MRJob handle your job specific variables, (2) leverage the steps() method for your more complex flows, and (3) if you need to save intermediate output, chain your jobs using an external scheduler.