Tuesday, March 18, 2014

Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure?

A Swing And A Miss

In my last post I trained a Bayesian classifier using a dataset pulled from sitejabber.com, which provides reviews of ecommerce sites. I had pulled that data for a single site. I then trained and tested the classifier -- and found that even though it scored 83% overall accuracy, it had completely mis-classified all positive reviews.

As noted in the last post, only 20% of my original review data was positive -- 55 records out of approximately 275 total records. This leads to three questions:
  1. Did I really test and validate the data in the most effective way?
  2. As this is a Bayesian classifier, will increasing the amount of positive data help the classifier identify positive reviews more effectively?
  3. If so, how do I go about increasing the positive data when the underlying dataset is finite?

Giving the Classifier Another Chance with K-Fold Validation

Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and training sets. That doesn't give me guaranteed coverage of my entire dataset, or a valid, reproducible process upon which I can try improvements.

I can get that coverage and reproducibility by using k-fold validation across the data.

K-fold validation works like this:

  1. break the dataset into k equivalent subsets (folds).
  2. hold one of the folds out for testing.
  3. train on the remaining k-1 folds, then test on the held-out fold.
  4. rotate through all folds -- repeat 2-3, holding out a different fold each time.
  5. average the accuracy of all test+train runs.

When I re-ran my tests using k-fold validation with 10 folds, I got an average accuracy across the entire dataset of 84.6%, which is different from the 83% score I had gotten doing 'randomized' tests.

In this baseline run I implicitly 'stratified' my test and training data -- every fold had the same proportion of positive to negative reviews, in this case roughly a 1:5 ratio.

Getting the data into k-fold shape involves two steps: dividing it into k folds, and building test data from one fold and training data from all of the rest. I've implemented these steps as separate methods so that I can iterate through all k test and training sets using the same k folds.

This is the method I used to split an array into k folds:

def partitionArray(self,partitions, array):
        """
        @param partitions - the number of partitions to divide array into
        @param array - the array to divide
        @return an array of the partitioned array parts (array of subarrays)
        """
        nextOffset = incrOffset = len(array) // partitions
        lastOffset = 0
        partitionedArray = []

        # carve out 'partitions' equal-sized slices
        for i in range(partitions):
            partitionedArray.append(array[lastOffset:nextOffset])
            lastOffset = nextOffset
            nextOffset += incrOffset

        # fold the remaining len(array) % partitions elements into the last partition
        partitionedArray[-1].extend(array[lastOffset:])

        return partitionedArray

This is the method I used to build test and training sets, holding out the partition specified by the iteration parameter. It assumes I'm handing it two k-partitioned arrays, keyed by rating -- one with bad reviews (rating 1) and one with good reviews (rating 5).

def buildKFoldValidationSets(self,folds,iteration, reviewsByRating):
        """
        build test and training sets
        @param folds - the total number of folds
        @param iteration - the offset of the fold to hold out for testing
        @param reviewsByRating - the k-partitioned reviews, keyed by rating (1 = negative, 5 = positive)
        @return training and test arrays
        """

        # the held-out fold (negative and positive partitions) becomes the test set
        test = []
        test.extend(reviewsByRating[1][iteration])
        test.extend(reviewsByRating[5][iteration])

        # every other fold goes into the training set
        training = []
        for i in range(folds):
            if i == iteration:
                continue
            training.extend(reviewsByRating[1][i])
            training.extend(reviewsByRating[5][i])

        return training, test
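
To tie these two methods together, here is a sketch of the kind of driver loop that averages accuracy across all 10 folds. The trainClassifier and scoreClassifier calls are hypothetical placeholders for whatever classifier training and accuracy-scoring code is being used -- they are not part of the methods above.

def runKFoldValidation(self, folds, reviewsByRating):
        """
        @param folds - the number of folds (k)
        @param reviewsByRating - dict of rating -> list of reviews (1 = negative, 5 = positive)
        @return the average test accuracy across all k folds
        """
        # partition each rating bucket into k folds so every fold keeps the same positive:negative ratio
        partitioned = {rating: self.partitionArray(folds, reviews)
                       for rating, reviews in reviewsByRating.items()}

        accuracies = []
        for iteration in range(folds):
            training, test = self.buildKFoldValidationSets(folds, iteration, partitioned)
            classifier = self.trainClassifier(training)                # hypothetical
            accuracies.append(self.scoreClassifier(classifier, test))  # hypothetical

        return sum(accuracies) / float(len(accuracies))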

Increasing The Data Set with Sampling

How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach. In Bayesian classification, nothing depends on the sentence structure of the review text or the sequence of words -- just the words and their frequency counts. And the features (the words) are assumed to be independent of one another.

How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews.  In order to increase the positive data set I need to generate more positive reviews.

Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those bags of words just need to line up with the overall word frequency distribution of the review set they're generated from.
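
To make that concrete, here is a minimal sketch -- not the classifier I actually used -- of how a naive Bayes score for one class can be computed from nothing but per-class word counts and a class prior. The example words and the 0.2/0.8 priors (mirroring the roughly 1:5 positive:negative ratio) are made up for illustration.

import math
from collections import Counter

def naiveBayesScore(words, classWordCounts, classPrior, vocabSize):
    """
    Score a bag of words against one class using only word frequency counts.
    classWordCounts maps word -> count in that class's training text.
    """
    totalWords = sum(classWordCounts.values())
    score = math.log(classPrior)
    for word in words:
        # add-one (Laplace) smoothing keeps unseen words from zeroing out the score
        score += math.log((classWordCounts.get(word, 0) + 1.0) / (totalWords + vocabSize))
    return score

# score the same bag of words against positive and negative counts and
# pick whichever class scores higher -- word order never enters into it
positiveCounts = Counter(['great', 'great', 'service', 'fast'])
negativeCounts = Counter(['terrible', 'slow', 'refund', 'refund'])
review = ['great', 'fast', 'service']
vocab = len(set(positiveCounts) | set(negativeCounts))
isPositive = (naiveBayesScore(review, positiveCounts, 0.2, vocab) >
              naiveBayesScore(review, negativeCounts, 0.8, vocab))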

One way to do this is to build new reviews from the data that already exists, by taking random samples from an array that contains all the words across the positive reviews in the training data.

Pretend the following sentence is actually a review:
         The big big green caterpillar ate the small green leaf.

Putting the words in an array gives me something that looks like this: 
        somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

I can sample from that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf' and a 1/5 chance of being 'big'. I can keep drawing samples to build up as much text as I want -- multiple sentences, a review, multiple reviews, etc.

In this case I'm 'sampling with replacement', meaning that I don't remove a sample from the sampled set after drawing it, so the probability of picking any given word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.

In Python sampling with replacement looks like this:

        word = somearray[random.randint(0,len(somearray)-1)]  # randint is inclusive at both ends
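
random.choice does the same single draw a little more readably; repeating it strings together a synthetic 'sentence' from the example array above (the output shown in the comment is just one possible draw):

import random

somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

# each call to random.choice is an independent draw, with replacement,
# from the word distribution encoded by the array
generated = ' '.join(random.choice(somearray) for _ in range(10))
print(generated)  # e.g. "green the big leaf the green big ate the caterpillar"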

I use this method to create reviews composed of words randomly sampled from the positive training word distribution, making sure each review's length is the average length of the real positive reviews:

def createReview(self,textFreqDist,reviewLength):
        """
        @param textFreqDist - a flat array of words to sample from; repeated words encode the frequency distribution
        @param reviewLength - the length of the review (in words) to build
        @return the generated review as a string
        """
        randLen = len(textFreqDist)
        reviewStr = ""

        # sample with replacement, one word at a time
        for i in range(reviewLength):
            reviewStr += (textFreqDist[random.randint(0,randLen-1)] + ' ')

        return reviewStr
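
For context, here is a sketch of how the generation step might be driven to 'boost' the positive side of the training set. It isn't my exact code: it assumes each review is represented as a list of words, that both inputs come only from the training folds (the test fold has already been held out -- see the next section), and it reuses the createReview method above.

def boostPositiveTrainingData(self, positiveReviews, negativeReviews):
        """
        @param positiveReviews - positive training-fold reviews, each a list of words
        @param negativeReviews - negative training-fold reviews, each a list of words
        @return the real plus synthetic positive reviews, enough for a 50/50 split
        """
        # flatten the real positive reviews into one word pool;
        # repeated words carry the frequency distribution
        wordPool = []
        for review in positiveReviews:
            wordPool.extend(review)

        # average length (in words) of a real positive review
        avgLength = sum(len(r) for r in positiveReviews) // len(positiveReviews)

        # generate synthetic positive reviews until positives match negatives
        synthetic = []
        while len(positiveReviews) + len(synthetic) < len(negativeReviews):
            synthetic.append(self.createReview(wordPool, avgLength).split())

        return positiveReviews + synthetic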

A Cautionary Note on Overfitting

When I first did the positive 'boost', I was getting really good results...really, really good results. 99% accuracy on a test set was a number that seemed too good to be true. And it was. 

In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. 

When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. 
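
Putting the pieces in the right order, the corrected per-fold flow looks roughly like this. It's a sketch that reuses the partitioned folds and the hypothetical trainClassifier/scoreClassifier placeholders from the earlier driver sketch, plus the boostPositiveTrainingData sketch above -- the key point is simply that the test fold is set aside before any word is sampled.

def runBoostedKFoldValidation(self, folds, partitioned):
        """
        @param folds - the number of folds (k)
        @param partitioned - dict of rating -> list of k folds (1 = negative, 5 = positive)
        @return the average test accuracy across all k folds
        """
        accuracies = []
        for iteration in range(folds):
            # 1. hold out the test fold FIRST -- it never feeds the word pool
            testSet = partitioned[1][iteration] + partitioned[5][iteration]

            # 2. gather training reviews from every other fold, keeping ratings separate
            negatives, positives = [], []
            for i in range(folds):
                if i == iteration:
                    continue
                negatives.extend(partitioned[1][i])
                positives.extend(partitioned[5][i])

            # 3. boost positives using words drawn only from the training folds
            trainingSet = negatives + self.boostPositiveTrainingData(positives, negatives)

            # 4. train on the boosted set, test on the untouched test fold
            classifier = self.trainClassifier(trainingSet)                 # hypothetical
            accuracies.append(self.scoreClassifier(classifier, testSet))   # hypothetical

        return sum(accuracies) / float(len(accuracies))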

New Test Results

I ran the same 10-fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified k-fold validation -- by boosting the number of positive reviews with resampled training word data, I am altering the positive-to-negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same. The code used to train and test the data is the same as before.

My test results averaged to 89.4%, an improvement from 84.6%. However, when I look at the errors more closely, I see that most of them are still due to mis-classifying positive reviews, which is interesting given that I've boosted positive training data to be 50% of the training set. In the baseline training run my best fold mis-classified 60% of the positive reviews, and my worst folds mis-classified 100% of them. In the boosted training run my best fold mis-classified 20% of the positive reviews, and my worst fold mis-classified 60%. 

Summary

This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, only for word frequency. 

If I were to do this over again, I would do the following: 
  1. If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.
  2. Use k-fold validation from the beginning to get the average accuracy of each approach.
  3. Investigate mis-classification errors before doing any optimization!

Most of my time was spent analyzing and building the optimal training data set. The biggest improvement came not from tweaking the algorithm, but from 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was not examining training errors immediately.

To get improvements, I couldn't treat the algorithm as a black box; I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this approach wouldn't work with a classifier that assumed some level of dependence between the words in a review text -- I'd have to model that dependency in order to generate reviews. 

A final note: this classifier was trained on a single source of reviews. That's great for classifying more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approach taken here would work if we had travel review site training data. 

Potential next steps include:
  1. Getting (more) new data from a different source.
  2. Trying the Bayes classifier on that data.
  3. Trying a different classifier, e.g. the MaxEnt classifier, on the same data.
  4. Going deeper into sentiment: what entities were positive / negative sentiment directed at?