Tuesday, December 6, 2016

Culture Core Operating Principles v0.1


"Building a great culture" sounds like one of those HR webinars that you are compelled to watch, usually like this:




But as much as I like to make fun of Corporate Valuespeak, I have observed that if you want to like where you work, you either take conscious steps to create that environment, or "culture happens", and not always in a good way.  

I've had the chance to work everywhere from 4-person startups in a garage to enterprises of 100,000+ people. The places I've enjoyed the most are the ones where the leaders made a conscious, sustained effort to establish a set of shared values. 

When they did this, they spent a lot of time outlining the value in specific terms, but left how that value was expressed up to us. They repeated the value and its specifics until we could see how the values manifested in what we did. Once that happened, we started to own that value, and we carried it forward. 

Recently I had the chance to make my own conscious effort to set up the values I wanted to see in the organization I was a part of. I was giving a presentation to the engineering team. Lunch was served, so I had a captive audience. I called this presentation



Because as the new guy, I had the freedom to question assumptions and make assertions that I won't be able to as time goes on and I become a fully integrated team member.

I chose the term 'Core Operating Principles' to describe the culture I wanted to be a part of because I wanted to establish a set of guidelines for how we act and what we do in our day to day work. 

Principles are about execution, which is where I feel we - like every other engineering team in the world - have room to improve. 



I wanted to focus on principles of execution, not precise, detailed execution steps - because principles scale, instructions don't. Principles can be applied as a heuristic, detailed instructions cannot. 

I also put a version number on these Core Operating Principles. I love version numbers because they indicate that what is versioned is really a work in progress, not a final statement. I'm 100% sure that these will get refined over time.
Here they are, v0.1:




  • Releasing Wins - the more we release, the more chances we have to better solve a customer problem. Short release cycles let us build, measure, and learn from what we've built. These days, that's a competitive requirement. 
  • Clear Roles, Clear Process - it helps when we all know our basic roles, and have a good idea of the process that we are trying to follow (note, in this org we had spent some time clarifying what Product, Engineering - dev, qa, ops, and Program Management do)
  • We Are One Team - We are product focused, working together to deliver incremental, valuable releases of the product that meet the Definition of Done. 
  • We Finish Together - It is not unusual for a sprint to have blocking issues that no one saw coming. We team up to complete stories and resolve issues that prevent us from completing stories in priority order. 
  • We Own What We Build - We are committed to building, releasing, and operating a high quality product. There are no fences to throw work over: even if we hand something off, we make sure we are available to answer questions and work on any problems that come up. 
  • We Are Always Getting Better - as a team, we learn from what we go through:

      • Strive for Simple - the simplest possible solution wins, at all times. Simple scales, clever doesn't. Simple != Hacked together. Work still needs to meet the Definition of Done. 
      • Don't Solve Solved Problems - we have so much work to do here, what should we focus on, and what should we leverage other solutions for? We should strive to build what is our core competency, and where possible, outsource non core functionality that is commonly available.
      • The Best Idea Wins - no matter where or who it comes from. Assert your idea, and defend it, and let the best idea emerge. 
      • Disagree But Commit - if you've been unable to convince everyone that the idea they want to do is flawed, disagree but commit to implementing it. That way it will either fail or succeed much faster than it would if you were passively resisting it. I've learned the most from disagreeing but committing. Mostly I've learned that I can be wrong in many different ways....
      • Why Not Who - when things go wrong, focus on why they went wrong, not who made a mistake. Ask 'why' until the root cause is discovered, then make necessary changes.



  • We Question Reality - "that's just the way we do things around here" really isn't a game changer. Question things that you think could be better. Point them out, and better yet, point out a way to get the same results. Do we really have to do things a specific way? Or is there a better way to do them? Are the 'immovable constraints' actually  movable? We get better as a team when we question the status quo. 

    Now comes the hard part - putting these principles to work, refining them, and most importantly, owning them as a team. 

    In order for the team to buy off on those principles, they need to pick them up, try them on, apply them to specific situations, and see the value. 

    We've had some initial successes, which is really nice to see. But in order for these to stick, or more importantly, evolve to better principles, we need to keep being intentional about them - invoking them as a compass, to guide the decisions we make and the work we do. 

    Saturday, November 26, 2016

    Changing Roles...again

    So much for that New Year's resolution to post more -- it's coming up on a year since my last post. About two months ago I moved from Product Management back into Engineering. In September, I left the HPE Stackato product management team and took a senior engineering management position at Zonar Systems, a Seattle based vehicle telematics company.

    The last two years were hard, but the lessons I learned were ultimately worth the struggle.

    I learned to execute better, from being very specific about what I wanted out of every meeting, to always driving discussions to closure, even if it is across several meetings and emails.

    I learned how important it is as a Product Owner to define the arc of a backlog - the way product themes are implemented via epics, stories, and ultimately tasks -- and how important it is to keep the backlog well defined.

    I learned how inspirational leaders can unleash the best out of a team, and how non inspirational leaders can demotivate the same team.

    I also learned a lot about myself. I did have some misgivings about the role, and as time went on those misgivings were shown to be true. Unfortunately I ignored those initial instincts, and by the time I started listening to them, the compromises I had made to stay in the role were significant, and I didn't know if I had compromised so much that I was 'trapped' in the role.

    The one thing I really missed was thinking about the 'How'. As a Product Manager,  I loved thinking about the Why and the What, but in order to be effective as a Product Manager I had to delegate the How completely.

    When I realized that being a Product Manager was not for me,  I sat back and made a list of what I wanted to do next, and after some searching, found that position at Zonar.

    I've been very happy digging into the new job. Ironically, it's the things that I learned as a product manager that have been the most critical to my success in the first two months. That's great, because there was a period of time where I was really questioning the move to product management and whether it was a mistake. At this point it looks like the best possible move to have made.

    I would like to say that I was that smart and forward looking, and intentional about my career path. The truth was that I was never thinking that far ahead. What I really did was follow my curiosity, which is what I've always done. I started in software  because I was interested in writing the logic behind the applications I used.  The job choices I made as a software engineer were motivated by the chance to learn new technologies and skills to solve harder and more interesting problems. That's been the story of my career and my life, and so far it has panned out well. 

    Monday, January 11, 2016

    Changing Roles

    It has been almost 2 years since I've posted anything in this blog.  In July 2014 I decided to leave Disney, and more significantly, change roles. At Disney I had been leading the engineering and analytics teams of a central data service. In my not-so-new role at HP I'm on the product management team for the Helion Development Platform, which in its current incarnation is known as HPE Helion Stackato, a Cloud Foundry based Platform as a Service.

    I was (and still am!) excited to do Product Management. As an engineer I have seen how essential the consistent application of product vision and stewardship can be to an engineering effort. I did not know if I really had product vision or was just delusional, but I wanted to find out either way...

    The best part about this job is that I get to think - a lot - about how to make software development better for the average enterprise engineer. The struggle is real! Most engineering teams still write IaaS based applications like they were running on bare metal servers, and that leads to a world of hurt, because an application distributed across IaaS provided resources has to contend with underlying network, compute, and storage service failures.

    The promise of Platform as a Service is that it enables easy development and deployment of cloud native applications - applications that take advantage of the elasticity of the cloud while dealing with the ephemerality of the underlying IaaS. Cloud native is a concept that takes most engineering organizations some time to get their head around.

    As a result, there is a significant educational aspect to my role, which I love. I get to help people to focus on creating value with software, something that has enthralled me since I was 14 years old teaching myself BASIC on my Dad's IBM PC/AT.

    I get to present my thoughts to various captive audiences, either at customer onsite visits or conferences. Here is a presentation from HP Discover 2015 that I think captures the problem we are trying to solve and the best approaches to take to solve the problem.



    In 2016 I'm really excited because of the acceleration and rapid maturity of several key technologies that have the potential to play very well together. 

    Containers, mostly via Docker, have brought easy authoring and immutable infrastructure into mainstream software development. As an example, the other day, instead of hand installing Kafka and Zookeeper onto a VM and then doing it again by hand when I needed to grow my test Kafka cluster, I just typed "docker run...", pointing to a Kafka/ZK image built and published to Docker Hub by Spotify. I got to take advantage of all of their hard work, and save the potential multiple hours required to get that image working correctly.  

    Kubernetes and Cloud Foundry are viable orchestration mechanisms for distributed, container based applications, handling deployment, scaling, and failure remediation -- deployment and scaling are things that we used to do by hand, late at night, on pins and needles. Dealing with failure usually meant doubling your hardware, or writing complex startup scripts customized to each application. The two are opinionated in quite different ways, and I can see the merits of each one for different use cases, sometimes in the same overall application stack! 

    Mesos has emerged as an intermediate resource management layer that abstracts the underlying IaaS away. The emergence of next generation big data workloads running on Mesos is something that really excites me as an ex service owner who struggled to justify the value of the insights provided by big data technologies against the steep costs of hardware. Having a layer that allocates finite resources and maximizes resource utilization across very diverse workloads makes the start up cost of new experimental data investigation much lower, and therefore much more likely. 

    All of these technologies are evolving at an incredible rate. I'm excited to see, and hopefully play a role in delivering, the next generation of platforms that make these technologies easy to consume and manage, and allow engineering teams to focus on features instead of infrastructure. 

    I'm trying to post more this year - the last 18 months have been a heads down, get it done, tactical march. That was great, but I think I miss key insights when I don't occasionally digest and reflect on what is going on around me. I hope to do more of that here over the next year. It's a New Year's resolution, hopefully one that will last longer than the one I made about not eating sugar :)



    Tuesday, March 18, 2014

    Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure?

    A Swing And A Miss

    In my last post I had trained a Bayesian classifier using a dataset pulled from sitejabber.com, which provides reviews of ecommerce sites. I had pulled that data for a single site. I then trained and tested the classifier -- and found that even though it performed at 83% accuracy, it had completely mis-classified all positive reviews.

    As noted in the last post, only 20% of my original review data was positive -- 55 records out of the total of approximately 275 records. This leads me to three questions:
    1. Did I really test and validate the data in the most effective way?
    2. As this is a Bayesian classifier, will increasing the amount of positive data help the classifier identify positive data more effectively?
    3. If so, how do I go about increasing data from a finite set of data? 

    Giving the classifier another chance with K fold validation

    Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and training sets. That doesn't give me guaranteed coverage of my entire data set or a valid, reproducible process upon which I can try improvements.

    I can get that coverage and reproducibility by using k fold validation across the data.

    K fold validation works like this:

    1. break the dataset into K equivalent subsets.
    2. hold one of the subsets out for testing.
    3. use all of the other subsets for training. 
    4. train the classifier on the k-1 subsets, test it on the kth subset. 
    5. rotate through all subsets - repeat 2-4, holding out a different subset each time. 
    6. average the accuracy of all test+train processes.

    When I re-ran my tests using k-fold validation with 10 folds, I got an average accuracy across the entire dataset of 84.6%, which is slightly different from the 83% score that I had gotten doing 'randomized' tests.
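
    The steps above can be sketched as a small loop. This is only an illustration, not the code used for the numbers in this post: train_fn and accuracy_fn are hypothetical callables standing in for whatever classifier is being evaluated, the toy 'majority label' model and the generated data are made up, and the folds are built by interleaving records rather than by slicing contiguous chunks.

```python
def k_fold_accuracy(records, k, train_fn, accuracy_fn):
    """Average accuracy over k train/test rotations.

    records     - list of labeled examples
    train_fn    - hypothetical callable: training records -> model
    accuracy_fn - hypothetical callable: (model, test records) -> float
    """
    # step 1: split into k folds of near-equal size (interleaved)
    folds = [records[i::k] for i in range(k)]
    scores = []
    for held_out in range(k):
        # steps 2-3: hold one fold out for testing, gather the rest for training
        test = folds[held_out]
        training = [r for i, fold in enumerate(folds) if i != held_out for r in fold]
        # step 4: train on k-1 folds, test on the held-out fold
        model = train_fn(training)
        scores.append(accuracy_fn(model, test))
    # steps 5-6: every fold has been held out once, so average the scores
    return sum(scores) / len(scores)


# toy stand-ins: a "model" that always predicts the most common training label
def train_majority(training):
    labels = [label for _, label in training]
    return max(set(labels), key=labels.count)


def accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)


# made-up data: roughly one positive record for every three negative ones
data = [("text %d" % i, "pos" if i % 4 == 0 else "neg") for i in range(50)]
avg = k_fold_accuracy(data, 5, train_majority, accuracy)
```

    Because every record lands in exactly one test fold, the averaged score covers the whole data set, and rerunning it gives the same answer each time.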

    In this baseline run I implicitly 'stratified' my test and training data -- all test and training data folds had the same proportion of positive and negative reviews, in this case there was roughly a 1:5 ratio between positive and negative reviews.

    Getting data to be k foldable involves two steps: dividing into k folds, and building test data from one fold and training data from all of the rest. I've done these steps separately so that I can iterate through all k test and training sets with the same k folds.

    This is the method I used to split an array into k folds:

    def partitionArray(self,partitions, array):
            """
            @param partitions - the number of partitions to divide array into
            @param array - the array to divide
            @return an array of the partitioned array parts (array of subarrays)
            """
            # integer division, so the offsets are always valid slice indices
            nextOffset = incrOffset = len(array)//partitions
            lastOffset = 0
            partitionedArray = []
            
            for i in range(partitions):
                partitionedArray.append(array[lastOffset:nextOffset])
                lastOffset= nextOffset
                nextOffset += incrOffset
            
            # anything left after the last full partition (the remainder)
            # is appended to the last partition
            partitionedArray[-1].extend(array[lastOffset:])

            return partitionedArray

    This is the method I used to build test and train sets, holding out the partition specified by the iteration parameter. It assumes I'm handing it reviewsByRating, which maps each rating (1 for bad reviews, 5 for good reviews) to a k-partitioned array of reviews.

     def buildKFoldValidationSets(self,folds,iteration, reviewsByRating):
            """
            build test and training sets
            @param iteration - the offset of the arrays to hold out
            @param reviewsByRating - the set of reviews to build from
            @return test and training arrays
            """
            
            test = []
            test.extend(reviewsByRating[1][iteration])
            test.extend(reviewsByRating[5][iteration])
            
            training = []
        
            for i in range(folds):
                if i == iteration:
                    continue
                training.extend(reviewsByRating[1][i])
                training.extend(reviewsByRating[5][i])

            return training, test

    Increasing The Data Set with Sampling

    How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach.  In Bayesian classification, there is no information that depends on the sentence structure of the review text or the sequence of words, just words and word frequency counts. And the features (the words) are assumed to be independent from one another.

    How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews.  In order to increase the positive data set I need to generate more positive reviews.

    Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those bags of words need to line up to the overall word frequency distribution of the entire review set. 

    One way to do this is to build the data from the data that already exists, by taking random samples from an array that contains all the words across the set of positive reviews in the training data.

    Pretend the following sentence is actually a review:
             The big big green caterpillar ate the small green leaf.

    putting the words in an array that looks like this: 
            somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

    I can sample that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf', and a 1/5 chance of being 'big'.   I can extend the sample set to be as large as I want -- covering multiple sentences, a review, multiple reviews, etc.

    In this case I'm 'sampling with replacement', meaning that I don't remove the sample I get from the sampled set, which means that the probability of picking a word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.

    In Python sampling with replacement looks like this:

            # random.randint includes both endpoints, so the upper bound is len - 1
            word = somearray[random.randint(0,len(somearray)-1)] 
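
    As a quick check on those probabilities, here is a minimal sketch that counts the words in the example array and then draws a large sample from it. It uses random.choice from the standard library, which picks one element uniformly without modifying the list; the sample size and the fixed seed are arbitrary choices made only for this illustration.

```python
import random
from collections import Counter

somearray = ['the', 'big', 'big', 'green', 'caterpillar',
             'ate', 'the', 'small', 'green', 'leaf']

# exact probability of drawing each word in a single sample
counts = Counter(somearray)
probabilities = {word: n / len(somearray) for word, n in counts.items()}

# random.choice leaves somearray untouched, so repeated calls
# are samples with replacement from the same distribution
random.seed(0)  # fixed seed only so the sketch is repeatable
sampled = [random.choice(somearray) for _ in range(10000)]
observed = Counter(sampled)

# observed['big'] / 10000 should land close to probabilities['big']
```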

    I use this method to create reviews comprised of words randomly selected from the distribution of positive training words, and make sure the review length is the average length of all real positive reviews:

    def createReview(self,textFreqDist,reviewLength):
            """
            @param textFreqDist -  the array containing the frequency distribution of words to choose from.
            @param reviewLength -  the length of the review (in words) to build
            @return the generated review as a string
            """
            randLen = len(textFreqDist)
            reviewStr = ""
            
            for i in range(reviewLength):
                reviewStr += (textFreqDist[random.randint(0,randLen-1)] + ' ')
            

            return reviewStr

    A Cautionary Note on Overfitting

    When I first did the positive 'boost', I was getting really good results....really, really good results.  99% accuracy on a test set was a number that seemed too good to be true. And it was. 

    In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. 

    When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. 
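
    The ordering matters more than the details, so here is a minimal sketch of it with made-up reviews. The function name boost_positive_training and its parameters are hypothetical, not the code behind the results in this post; the point is only that the test reviews are split off first and never reach the function that builds the word pool.

```python
import random


def boost_positive_training(pos_train, neg_train, review_length, rng=random):
    """Grow the positive training reviews until they match the negatives.

    pos_train / neg_train - lists of reviews, each review a list of words.
    Only training reviews are passed in, so words from the held-out
    test reviews cannot leak into the sampling pool.
    """
    # pool of every word occurrence in the real positive training reviews
    word_pool = [w for review in pos_train for w in review]
    boosted = list(pos_train)
    while len(boosted) < len(neg_train):
        # sample with replacement so word probabilities stay constant
        boosted.append([rng.choice(word_pool) for _ in range(review_length)])
    return boosted


# made-up data: 3 positive and 8 negative reviews
pos = [["great", "service"], ["fast", "shipping"], ["love", "it"]]
neg = [["never", "arrived"]] * 8

# hold out the test reviews first, then boost only what is left
pos_test, pos_train = pos[:1], pos[1:]
neg_test, neg_train = neg[:2], neg[2:]
boosted_pos = boost_positive_training(pos_train, neg_train, review_length=2)
```

    In this toy run the held-out positive review contains 'great' and 'service', and neither word can appear in a generated review because neither is in the training pool.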

    New Test Results

    I ran the same 10 fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified K fold validation -- by boosting the number of positive reviews with resampling of the training word data, I am altering the positive to negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same.  The code used to train and test the data is the same as before.

    My test results averaged to 89.4%, an improvement from 84.6%. However, when I look at the errors more closely, I see that most of the errors are still due to mis-classifying positive reviews, which is interesting, given that I've boosted positive training data to be 50% of the training set. In the base training run my best effort mis-classified 60% of the positive reviews, and my worst efforts mis-classified 100% of the positive reviews. In the boosted training run my best effort mis-classified 20% of the positive reviews, and my worst effort mis-classified 60% of the positive reviews. 
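
    An overall accuracy number hides exactly this kind of per-rating error, so it helps to break errors out by the actual rating. The sketch below is an illustration only: the function name is hypothetical and the (actual, predicted) pairs are made up, not taken from my runs.

```python
from collections import defaultdict


def misclassification_by_label(results):
    """results - iterable of (actual_label, predicted_label) pairs.

    Returns {label: fraction of that label's reviews that were misclassified}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for actual, predicted in results:
        totals[actual] += 1
        if predicted != actual:
            errors[actual] += 1
    return {label: errors[label] / totals[label] for label in totals}


# made-up predictions for one fold: 5 five star and 20 one star reviews
fold_results = [(5, 5), (5, 5), (5, 1), (5, 1), (5, 1)] + [(1, 1)] * 19 + [(1, 5)]

rates = misclassification_by_label(fold_results)
overall_accuracy = sum(1 for a, p in fold_results if a == p) / len(fold_results)
```

    With these made-up numbers the fold scores well overall even though most of its positive reviews are wrong, which is the pattern described above.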

    Summary

    This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, I only had to account for word frequency. 

    If I were to do this over again, I would do the following: 
    1. If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.
    2. Use k-fold validation from the beginning to get the average accuracy per approach.
    3. Investigate mis-classification errors before doing any optimization! 

    Most of my time was spent analyzing and building the optimal training data set.  The biggest improvement made was not in tweaking the algorithm, but 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was to not examine mis-classification errors immediately.

    To get improvements, I couldn't treat the algorithm as a black box; I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this wouldn't work with an approach that assumed some level of dependence between words in a review text -- I'd have to calculate that dependency in order to generate reviews. 

    A final note: this is a classifier that was trained on a single source of reviews. That's great to classify more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approaches taken would work if we had travel review site training data. 

    Potential next steps include:
    1. Getting (more) new data from a different source.
    2. Trying the bayes classifier on that data
    3. Trying a different classifier, e.g. the maxent classifier on the same data
    4. Going deeper into sentiment: what entities were positive / negative sentiment directed at?


    Tuesday, February 4, 2014

    Making Sense of Unstructured Text in Online Reviews Part 3: Trying to Improve Classifier Accuracy

    This post is part of a series where I try to classify online review text in more and more concrete ways. Right now I'm training a classifier to accurately classify one (bad) vs five (good) starred reviews. In the last post I had done some initial training and testing of an NLTK Bayesian classifier. In this post I want to see if I can improve the accuracy score of my classifier by getting smarter about which features I include.

    In the last post I had experimented with varying the size of the feature set, and had found that while encoding more features into a classifier during training helps accuracy, there is an eventual accuracy ceiling. My feature set came from taking the top N words from a frequency distribution of all words in the reviews text. Here is what the accuracy curve looks like:

    One other way to improve accuracy is to address the 'quality' of the feature set by looking at features not only in terms of their frequency across the training corpus, but looking at their relative frequencies across classifications.

    In the review classification done so far, individual words are the features. I'm going to try to 'tune' feature sets in several different ways -- I have no idea if these will work, but they seem reasonable. I'm going to call these attempts hypotheses, because my goal is to prove them to be true or false, with relatively minimal effort.

    Hypothesis 1: Throw away features with a low 'frequency differential'

    My hypothesis is that there are features that have a much higher chance of being in a negative review than a positive review, and vice versa. Those are the features that we want to keep. Other features are ones that have approximately the same chance of being in either type of review (positive or negative).

        P(review rating | features) = P(features, review rating) / P(features)

    In the equation above, the P(features, review rating) term is the product of the individual P(feature, review rating) terms. P(features) is the same for every rating, so it doesn't affect which rating wins. If I'm looking for a higher overall probability that a document is one star over five star or vice versa, having per feature probabilities that are similar for one star or five star reviews means that my overall probabilities for one and five star will be close to equal, which could tip classification results 'the other way' and increase my error rate.
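
    To make that concrete, here is a minimal sketch with made-up numbers. The per-feature probabilities, the equal priors of 0.5, and the helper names are all assumptions for illustration; logs are used so that multiplying probabilities becomes adding, which is a common way to compare such scores.

```python
import math


def log_score(prior, feature_probs, features):
    """log of (prior * product of per-feature probabilities) for one rating."""
    return math.log(prior) + sum(math.log(feature_probs[f]) for f in features)


# made-up per-feature probabilities for each rating
probs_1_star = {"refund": 0.30, "order": 0.20, "site": 0.10}
probs_5_star = {"refund": 0.02, "order": 0.19, "site": 0.10}


def log_ratio(features):
    # > 0 favors one star, < 0 favors five star, near 0 means no real preference
    return (log_score(0.5, probs_1_star, features)
            - log_score(0.5, probs_5_star, features))


# features whose probabilities are nearly the same for both ratings
similar_only = log_ratio(["order", "site"])

# the same features plus one with a large probability differential
with_signal = log_ratio(["order", "site", "refund"])
```

    Here similar_only stays close to zero, so a little noise could tip the result either way, while with_signal is clearly positive because 'refund' is far more likely in one star reviews.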

    I can validate this hypothesis by filtering out those low probability differential features and keeping the ones that have a high probability differential: a high difference between P(feature, review rating) for {1 star, 5 star} ratings.

    Building The Feature Set

    I had trained and tested the classifier by taking input data, splitting it into a test and a training set, then training and testing. I will recreate that process now to get the raw data so that I can 'remove' common terms with a low probability differential:

                sjr = SiteJabberReviews(pageUrl,filename)
                sjr.load()
                asd = AnalyzeSiteData()
               
    I ended up recoding the building of the training and test data so that the data sets being built had a more even distribution of ratings across them:

    def generateLearningSetsFromReviews(self,reviews, ratings,buckets):
            
            # check to see that percentages sum to 1
            # get collated sets of reviews by rating. 
            
            val = 0.0
            for pct in buckets.values():
                val += pct
                
            if val > 1.0:
                # a bare string cannot be raised; raise an exception object instead
                raise ValueError('percentage values must be floats and must sum to 1.0')
            
            reviewsByRating = defaultdict(list)
                
            for reviewSet in reviews:
                for rating in ratings:
                    reviewList = [(self.textBagFromRawText(review.text), rating) 
                          for review in reviewSet.reviewsByRating[rating]]
                    reviewsByRating[rating].extend(reviewList)
                    random.shuffle(reviewsByRating[rating]) # mix up reviews from different reviewSets
                
            
            # break collated sets across all ratings into percentage buckets
            learningSets = defaultdict(list) 
            
            for rating in ratings:
                sz = len(reviewsByRating[rating]) 
                
                lastidx = 0
                for (bucketName, pct) in buckets.items():
                    idx=lastidx + int(pct*sz)
                    
                    learningSets[bucketName].extend(reviewsByRating[rating][lastidx:idx])
                    
                    lastidx  = idx
                    

            return learningSets


    When I built up the training data using this method, the sets were returned in the buckets dictionary, keyed by bucket name:

            buckets = asd.generateLearningSetsFromReviews([sjr],[1,5],{'training': 0.8,'test':0.2})


    Each learning set in this dictionary is actually an array of (textBag, rating) tuples:
         buckets = {'training': [(bagOfText,rating)...], 'test': [..]}

    I want to get frequency distributions of common terms from one and five star reviews in the training data, so that I can find terms with a high probability differential:                         

                # get common terms and frequency differentials

                allWords1 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 1]
                fd1 = FreqDist(allWords1)
                
                allWords5 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 5]
                fd5 = FreqDist(allWords5)

                commonTerms = [w for w in fd1.keys() if w in fd5.keys()]

                # now get frequency differentials

                commonTermFreqs = [(w,fd1.freq(w),fd5.freq(w),abs(fd1.freq(w)-fd5.freq(w))) 
                    for w in commonTerms]

                commonTermFreqs.sort(key=itemgetter(3),reverse=True)

    Now we've got common terms, sorted by their absolute differential between frequency distributions in 1 and 5 star reviews.
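
    The same calculation can be tried on a toy example without NLTK. This is a minimal sketch using collections.Counter in place of FreqDist; the helper name relative_freqs and the two tiny word lists are made up for illustration.

```python
from collections import Counter


def relative_freqs(words):
    """Map each word to its share of all word occurrences."""
    counts = Counter(words)
    total = len(words)
    return {w: n / total for w, n in counts.items()}


# made-up word lists standing in for the one and five star training text
one_star_words = "never arrived refund refund order site".split()
five_star_words = "great order fast order site love".split()

f1 = relative_freqs(one_star_words)
f5 = relative_freqs(five_star_words)

# terms present in both corpora, with their absolute frequency differential
common = [(w, f1[w], f5[w], abs(f1[w] - f5[w])) for w in f1 if w in f5]
common.sort(key=lambda row: row[3], reverse=True)
```

    In this toy data 'order' and 'site' are the only common terms. 'order' sorts first because it is twice as frequent in the five star text, while 'site' has the same frequency in both and so carries no signal.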

    If I plot this distribution:

                freqdiffs = [diff for (a,b,c,diff) in commonTermFreqs]
                plt.plot(freqdiffs)
                plt.show()

    I can see that it falls off sharply:

    This looks like a Zipfian distribution: "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word..." 

    The shape of this distribution implies that only a small subset of  the terms actually have a frequency differential that really 'matters' in the hypothesis -- all terms aren't needed. I can start arbitrarily by keeping all terms with a frequency differential > 0.001 to quickly test the hypothesis. That leaves 131 of the original 688 common terms.

    Note that in getting and filtering common terms, I have not retained the terms that most strongly signal one review rating or the other: those are the terms that exist in only one of the two review corpora. A term with zero frequency in one corpus would make the Bayesian calculation go to zero, but such terms are 'smoothed out' by including them in the other corpus and adding a very small value to the frequency of all terms in that corpus. This guarantees that there are no terms with a zero frequency, so the Bayesian calculation won't zero out.

    I would need to add those terms into the set of terms that we filter by.

    The full set of filter terms comprises both the uncommon words and the filtered common words:

                filterTerms = [w for (w,x,y,diff) in commonTermFreqs if diff > 0.001]
                # terms that only appear in one of the two corpora
                fd1Only = [w for w in fd1.keys() if w not in fd5]
                filterTerms.extend(fd1Only)
                fd5Only = [w for w in fd5.keys() if w not in fd1]
                filterTerms.extend(fd5Only)

                defaultWordSet = set(filterTerms) # rename so I don't have to rewrite the encoding method

    And I use those words as features identified at encoding time:

                def emitDefaultFeatures(tokenizedText):
                    '''
                    @param tokenizedText: an array of text features
                    @return: a feature map from that text.
                    '''
                    tokenizedTextSet = set(tokenizedText)
                    featureSet = {}
                    for text in defaultWordSet:
                        featureSet['contains:%s'%text] = text in tokenizedTextSet
                    

                    return featureSet

    Testing The Hypothesis

    Now I can train the classifier: asd.encodeData() takes care of encoding features from the training and test sets by calling emitDefaultFeatures() for each review.

                encodedTrainSet = asd.encodeData(rawTrainingSetData,emitDefaultFeatures )
                classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
                
                encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)

                print nltk.classify.accuracy(classifier, encodedTestSet)

    And I get an accuracy of 0.83, the same accuracy I got with no manipulation of the feature set, which is 0.02 less than my optimal accuracy. Whoops.

    Detailed Error Analysis

    There is one other step I can take to understand the accuracy of the classifier, and that is to analyze the errors made on the test set. If I know how I mis-classified the data, that can help me improve the classifier.

                shouldBeClassed1 = []
                shouldBeClassed5 = []
                
                for (textbag, rating) in buckets['test']:
                    testRating = classifier.classify(emitDefaultFeatures(textbag))
                    if testRating != rating:
                        if rating == 1:
                            shouldBeClassed1.append(textbag)
                        else:
                            shouldBeClassed5.append(textbag)

    A quick check on the error arrays shows me that I've only made mistakes on the reviews that should be classified as positive:

              >>> print len(shouldBeClassed5)
              11

    Wait a minute. That number looks familiar. Let me review the raw data again: 
              >>> print len(sjr.reviewsByRating[5])
              55
              >>> print int(0.2*len(sjr.reviewsByRating[5]))
              11

    This data shows that I mis-classified all 11 positive reviews in the test data: my error analysis showed eleven mis-classified positive reviews, and I only had 11 positive reviews in the test set based on an 80% training/20% testing split.

    A quick reversal to the original test method (that collected features from a FreqDist of all terms in the training data) shows that I mis-classified all 11 positive reviews as well.

    Summary

    This was one attempt to improve classifier accuracy by trying something reasonable with the feature set -- removing features whose probability differential across 1 star and 5 star review corpuses was very small.

    While the numbers initially looked 'decent', deeper analysis shows that my classifier completely mis-classified positive reviews. In the future I'll do error analysis of classifiers before trying to theorize about what could make the classifier more accurate.

    Looking closer at the data, the data set had 55 total positive reviews and 273 total negative reviews. In other words, only about 17% of my data (55 of 328 reviews) was actually positive review data.
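    The arithmetic behind that 0.83 is worth spelling out. The sketch below assumes the same 80/20 slicing as my helper method and assumes every one star test review was classified correctly (which is what the error arrays showed). The function name is hypothetical.

```python
def split_sizes(total, train_fraction):
    """Sizes of the (train, test) slices produced by reviewList[:n] and reviewList[n:]."""
    n_train = int(train_fraction * total)
    return n_train, total - n_train

neg_train, neg_test = split_sizes(273, 0.8)   # one star reviews
pos_train, pos_test = split_sizes(55, 0.8)    # five star reviews

# Labeling every test review as one star gets all negatives right
# and all positives wrong.
baseline_accuracy = neg_test / float(neg_test + pos_test)
positive_recall = 0 / float(pos_test)

print(neg_test, pos_test)
print(round(baseline_accuracy, 2))
```

    With 55 negative and 11 positive test reviews, always guessing 'one star' scores 55/66, or about 0.83. That matches the accuracy I measured, so the number says more about the class imbalance than about what the classifier learned.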

    I had originally scraped only one reviewed site for data, but now I think I'm going to need to scrape more sites to get a more representative set of positive review data so that the classifier has more training examples.

    In my next post I'm going to try to collect a more representative 'set' of data, and also take a slightly different approach to validating my classifier. I'm going to do error analysis up front and attempt to correct my classifier based on the errors I see, then test the classifier against new test data -- testing a fixed classifier against the data I used to fix it will give me a false sense of accuracy, because the test data used to do error analysis has in effect become training data.
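    One way to keep error analysis from contaminating the final test is a three-way split: train on one slice, do error analysis on a second, and only touch the third at the very end. This is a minimal sketch of that idea, not code from my repository; the function name, the 60/20/20 fractions, and the fake reviews are all assumptions.

```python
import random

def three_way_split(items, train_fraction=0.6, dev_fraction=0.2, seed=0):
    """Shuffle a copy of items and cut it into train, dev (error analysis) and test slices."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    n_dev = int(dev_fraction * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test

# Hypothetical data: 100 fake reviews, every fifth one rated 5 stars.
reviews = [("review %d" % i, 1 if i % 5 else 5) for i in range(100)]
train, dev, test = three_way_split(reviews)
print(len(train), len(dev), len(test))
```

    Errors found on the dev slice can drive changes to the features, while the test slice stays unseen until the classifier is fixed.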




    Wednesday, January 22, 2014

    Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

    In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

    To do classification I'm going to stand on the shoulders of the giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

    Brief Recap

    I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that  I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

    To review: Here is how I would download data for all reviews from a review site:

    (note I've only implemented a 'scraper' for one site, http://sitejabber.com)

        from sitejabberreviews import SiteJabberReviews
        from analyzesitedata import AnalyzeSiteData


        pageUrl = 'reviews/www.zulily.com'
        filename = "siteyreviews.pkl"
        
        sjr = SiteJabberReviews(pageUrl,filename)    

        sjr.download(True) # this saves the reviews to the file specified above

    Once I've downloaded the data, I can always load it up from that file later:

        pageUrl = 'reviews/www.zulily.com'
        filename = "siteyreviews.pkl"
        
        sjr = SiteJabberReviews(pageUrl,filename)    
        sjr.load()

    Next Step: Bayesian Classification

    NLTK comes with several built in classifiers, including a Bayesian classifier. There are much better explanations of Bayes theory than I could possibly provide, but the basic theory as it applies to text classification is this:  the occurrence of a word across bodies of previously classified documents can be used to classify other documents as being in one of the input classifications. The existence of previously classified documents implies that the Bayesian classifier is a supervised classifier, which means it must be trained with data that has already been classified.

    This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

        P(review rating | features) = P(features, review rating)/P(features)

    In other words, the probability that a review has a specific rating given its features depends on the probabilities of those features as previously observed in other documents that have the specific rating / the probability that the review has those  features. Since the features are the words in the review, they are the same no matter what the rating is, so that term effectively 'drops out'.  So the probability that a review has a specific rating is the multiplied probabilities of the terms in the review being in previously observed documents that had the same rating.

        P(review rating | features) = P(features, review rating)

    This isn't completely true: there are some complexities in the details. For example: while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature) = 0, as this would make the above equation go to zero. To avoid that, there are smoothing techniques that can be applied. These techniques basically apply a very small increment to the count of all features (including zero valued ones) so that there are no zero values, but the probability distribution essentially stays the same. The size of the increment depends on the values of the probabilities in the probability distribution of P(feature, label) for all features for a specific label.
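    To show how the multiplication and the smoothing fit together, here is a toy word-count Bayesian classifier in plain Python. It is a sketch under simplifying assumptions, not NLTK's implementation: it counts word occurrences rather than True/False features, the increment of 0.5 is an arbitrary choice, and the function names and reviews are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs, increment=0.5):
    """Estimate log P(label) and smoothed log P(word | label) from (words, label) pairs."""
    label_counts = Counter(label for _, label in labeled_docs)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_docs:
        word_counts[label].update(words)
        vocabulary.update(words)

    model = {}
    for label, n_docs in label_counts.items():
        # every word, seen or unseen for this label, gets the same small increment
        total = sum(word_counts[label].values()) + increment * len(vocabulary)
        log_likelihood = {
            w: math.log((word_counts[label][w] + increment) / total) for w in vocabulary
        }
        log_prior = math.log(n_docs / float(len(labeled_docs)))
        model[label] = (log_prior, log_likelihood)
    return model

def classify(model, words):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[w] for w in words if w in log_likelihood)
    return max(model, key=score)

docs = [  # hypothetical toy reviews
    (["never", "arrived", "refund"], 1),
    (["refund", "late"], 1),
    (["great", "fast", "shipping"], 5),
    (["love", "great"], 5),
]
model = train_naive_bayes(docs)
print(classify(model, ["great", "refund", "fast"]))
```

    Summing logs is the same as multiplying probabilities, but avoids underflow. Without the increment, "great" would have probability zero under the one star label and the whole product for that label would collapse, no matter what other words were present.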

    Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews. I'll train the Bayesian classifier with one star and five star review data. This is a pretty simple, binary approach to review classification.

    Feature Set Generation, Training, and Testing

    To train and initially test the NLTK Bayesian classifier, I need to do the following:
    1. Extract train and test data from my review data.
    2. Encode train and test data.
    3. Train the classifier with the encoded training data
    4. Test the classifier with the encoded test data.
    5. Investigate errors during the test
    6. Modify training set and repeat as needed.
    I've written a helper method to generate training and test data:

    def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
        # generate tuples of (textbag, rating)
        reviewList = [(self.textBagFromRawText(review.text), key)
                      for review in reviews.reviewsByRating[key]]

        # first slice is the training set, the remainder is the test set
        splitIndex = int(trainSetPercentage * len(reviewList))
        return reviewList[:splitIndex], reviewList[splitIndex:]

    The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(). In that method I create an array of words after splitting the text on sentence boundaries and punctuation, and then removing stop words:

    def textBagFromRawText(self, rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, stripped of non text chars including punctuation
        '''
        # requires: import re; from nltk.corpus import stopwords
        rawTextBag = []
        sentences = re.split(r'[.()?!&,]', rawText)
        for sentence in sentences:
            lowered = sentence.lower()
            parts = lowered.split()
            rawTextBag.extend(parts)

        # build the stop word set once instead of once per word
        stopWords = set(stopwords.words('english'))
        textBag = [w for w in rawTextBag if w not in stopWords]

        return textBag

    I generate test and training data for one and five star reviews using generateTestAndTrainingSetsFromReviews():

                # load helper objects
                sjr = SiteJabberReviews(pageUrl,filename)
                sjr.load()
                asd = AnalyzeSiteData()

                trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)

                trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)
                
                rawTrainingSetData = []
                rawTrainingSetData.extend(trainingSet1)
                rawTrainingSetData.extend(trainingSet5)
                random.shuffle(rawTrainingSetData)

                rawTestSetData = []
                rawTestSetData.extend(testSet1)

                rawTestSetData.extend(testSet5)
                random.shuffle(rawTestSetData)

    With training and test data built I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

                # for raw Training Data, generate all words in the data
                
                all_words = [w for (words, condition) in rawTrainingSetData for w in words]
                fdTrainingData = FreqDist(all_words)
                
                # take an arbitrary subset of these
                defaultWordSet = fdTrainingData.keys()[:1000]
                
                def emitDefaultFeatures(tokenizedText):
                    '''
                    @param tokenizedText: an array of text features
                    @return: a feature map from that text.
                    '''
                    tokenizedTextSet = set(tokenizedText)
                    featureSet = {}
                    for text in defaultWordSet:
                        featureSet['contains:%s'%text] = text in tokenizedTextSet
                    
                    return featureSet

    That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: 

          def encodeData(self,trainSet,encodingMethod):
              return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

    (Aside: I love list comprehensions!) With the training data built, we can encode it and train the classifier as follows:

           encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
           classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

    Once we have trained the classifier, we will test its accuracy against test data. As we already know the classification of the test data, accuracy is simple to calculate.

           encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
           print nltk.classify.accuracy(classifier, encodedTestSet)

    This gives me an accuracy of 0.83, meaning the classifier labeled 83% of the test reviews correctly. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?

    It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy  drops to 75%.

    Summary

    The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.



    Saturday, January 4, 2014

    Making Sense of Unstructured Text in Online Reviews, Part 1

    Introduction

    I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason for the great time we had was the amount of up front research that we put into finding the right places to stay, by researching the hell out of them via tripadvisor reviews.

    After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kind of issues (focusing on the dislikes here) grew or abated over time.

    I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

    One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!

    I've broken this effort into two+ phases: getting the data, analyzing/profiling the data, and tbd next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)

    Phase 1: Getting The Data

    I had been out of the screen scraping loop for a while. I had heard of BeautifulSoup, the python web scraping utility. But I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.

    Here is the bulk of the logic I used to pull pagination data and then use that to navigate to review pages from sitejabber.com (I'm focusing on ecommerce sites first)

            # first get the pages we need to navigate to to get all reviews for this site. 
            page = urllib2.urlopen(self.pageUrl)
            soup = BeautifulSoup(page)
            
            pageNumDiv = soup.find('div',{'class':'page_numbers'})
            
            anchors = pageNumDiv.find_all('a')
            
            urlList = []
            urlList.append(self.pageUrl)
            for anchor in anchors:
                urlList.append(self.base + anchor['href'])
            
            # with all pages set, pull each page down and extract review text and rating.
            for url in urlList:    
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page)
                divs = soup.find_all('div',id=re.compile('ReviewRow-.*'))
                
                        
                for div in divs:
                    text = div.find('p',id=re.compile('ReviewText-.*')).text
                    rawRating = div.find(itemprop='ratingValue')['content']

    Note the need to download the page first: I used urllib2.urlopen() to get the page. I then created a BeautifulSoup representation of the page:

        soup = BeautifulSoup(page)

    Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:

        for div in divs:
            text = div.find('p',id=re.compile('ReviewText-.*')).text
            rawRating = div.find(itemprop='ratingValue')['content']

    text above retrieves inner text from any element. Element attributes are accessed as keys from the element, like the 'content' one above. The rawRating value was actually pulled from a meta tag that was in the ReviewText div above: 

    <meta itemprop="ratingValue" content="1.0"/>

    find()/find_all() are very powerful, and the documentation covers them in a lot more detail. They can search by item ID or by specific attributes (the itemprop attribute above is an example), and regexes can be used to match multiple elements.

    Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a reviewsByRating map:

             for div in divs:
                    text = div.find('p',id=re.compile('ReviewText-.*')).text
                    rawRating = div.find(itemprop='ratingValue')['content']
                    
                    
                    r = Review(text,rawRating)
                    
                    if self.reviewsByRating.has_key(r.rating):
                        reviews = self.reviewsByRating[r.rating]
                    else:
                        reviews = []
                        self.reviewsByRating[r.rating] = reviews
                    
                    reviews.append(r) 

    and flushed that map to disk using pickle:

         def saveToDisk(self):
              # pickle data is binary, so open the file in binary mode
              with open(self.filename,'wb') as f:
                  pickle.dump(self.reviewsByRating,f)

    This let me load the data from file without having to scrape it again:

        def load(self):
              with open(self.filename,'rb') as f:
                  self.reviewsByRating = pickle.load(f)
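
    For anyone who wants to try the save/load round trip without the scraper, here is a standalone sketch. It uses plain strings instead of my Review objects and writes to a temporary directory; the function names and file name are placeholders.

```python
import os
import pickle
import tempfile

def save_to_disk(filename, reviews_by_rating):
    """Write the rating -> reviews map to disk; pickle files are opened in binary mode."""
    with open(filename, 'wb') as f:
        pickle.dump(reviews_by_rating, f)

def load_from_disk(filename):
    """Read the map back without re-scraping."""
    with open(filename, 'rb') as f:
        return pickle.load(f)

# Hypothetical stand-in for the scraped data: plain strings instead of Review objects.
reviews_by_rating = {1: ["never arrived"], 5: ["great service", "fast shipping"]}

path = os.path.join(tempfile.mkdtemp(), "reviews.pkl")
save_to_disk(path, reviews_by_rating)
restored = load_from_disk(path)
print(restored == reviews_by_rating)
```

    Pickling custom objects works the same way, as long as the class definition can be imported when the file is loaded.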

    Next step will be to start investigating the data.