Wednesday, January 22, 2014

Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants, specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site, http://sitejabber.com)

    from sitejabberreviews import SiteJabberReviews
    from analyzesitedata import AnalyzeSiteData


    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    

    sjr.download(True) # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    
    sjr.load()

Next Step: Bayesian Classification

NLTK comes with several built-in classifiers, including a naive Bayes classifier. There are much better explanations of Bayes' theorem than I could possibly provide, but the basic idea as it applies to text classification is this: how often a word occurs across bodies of previously classified documents can be used to classify other documents into one of those classifications. Relying on previously classified documents makes the Bayesian classifier a supervised classifier, which means it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

    P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features is the probability of seeing those features together with that rating (as previously observed in other documents that have the rating), divided by the probability that a review has those features at all. Since the features are the words in the review, that denominator is the same no matter which rating I'm considering, so it effectively 'drops out' when comparing ratings. The classifier also treats the words as independent of each other (the 'naive' part of naive Bayes), so the score for a specific rating is the multiplied probabilities of the terms in the review appearing in previously observed documents that had the same rating, weighted by how common that rating is overall.

    P(review rating | features) ∝ P(features, review rating)

This isn't completely true: there are some complexities in the details. For example, while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature, rating) = 0, as this would make the above equation go to zero. To avoid that, there are smoothing techniques that can be applied. These techniques basically add a very small increment to the count of all features (including zero valued ones) so that there are no zero values, but the probability distribution essentially stays the same. The size of the increment depends on the values of the probabilities in the probability distribution of P(feature, label) for all features for a specific label.
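
To make the smoothing idea concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `smoothedProbability`, the fixed `increment` of 0.5, and the counts are all assumptions I made up for illustration. As far as I can tell, NLTK's default estimator for this classifier works along similar lines, but treat that as an assumption too.

```python
from collections import Counter

def smoothedProbability(counts, value, numValues, increment=0.5):
    '''
    @param counts: Counter of how often each feature value was seen for one label
    @param value: the feature value to estimate a probability for
    @param numValues: number of distinct values the feature can take
    @param increment: amount added to every count (0.5 is an assumption)
    @return: smoothed estimate of P(value | label)
    '''
    total = sum(counts.values())
    return (counts[value] + increment) / float(total + increment * numValues)

# hypothetical counts: 'contains:refund' was True in 0 of 10 five star reviews
fiveStarCounts = Counter({True: 0, False: 10})

unsmoothed = fiveStarCounts[True] / float(sum(fiveStarCounts.values()))
smoothed = smoothedProbability(fiveStarCounts, True, numValues=2)

print(unsmoothed)  # zero, which would zero out the whole product
print(smoothed)    # small but non-zero
```

The smoothed estimates for True and False still add up to 1, so the distribution keeps its shape while the zero disappears.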

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews. I'll train the Bayesian classifier with one star and five star review data. This is a pretty simple, binary approach to review classification.

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:
  1. Extract train and test data from my review data.
  2. Encode train and test data.
  3. Train the classifier with the encoded training data.
  4. Test the classifier with the encoded test data.
  5. Investigate errors during the test.
  6. Modify training set and repeat as needed.

I've written a helper method to generate training and test data:

    def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
        # generate tuples of (textbag, rating) for every review with the given rating
        reviewList = [(self.textBagFromRawText(review.text), key)
                      for review in reviews.reviewsByRating[key]]

        # the first trainSetPercentage of the list is for training, the rest for testing
        splitIndex = int(trainSetPercentage * len(reviewList))
        return reviewList[:splitIndex], reviewList[splitIndex:]
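
As a quick illustration of the slicing, here is a standalone sketch of the same split on toy data. The helper name `splitByPercentage` and the sample list are hypothetical and not part of the repository.

```python
def splitByPercentage(items, trainSetPercentage):
    '''
    @param items: a list of labeled examples
    @param trainSetPercentage: fraction (0..1) of items to use for training
    @return: (trainingSet, testSet) tuple
    '''
    splitIndex = int(trainSetPercentage * len(items))
    return items[:splitIndex], items[splitIndex:]

# ten toy (textbag, rating) tuples standing in for real one star reviews
toyReviews = [(['word%d' % i], 1) for i in range(10)]

trainingSet, testSet = splitByPercentage(toyReviews, 0.8)
print('%d training, %d test' % (len(trainingSet), len(testSet)))
```

Because the slice is positional, which reviews end up in the test set depends on the order they were stored in.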

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(). In that method I create an array of words after splitting the text on punctuation, lowercasing it, and removing stop words:

    # needs at module level: import re; from nltk.corpus import stopwords
    def textBagFromRawText(self, rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, lowercased, with punctuation and stop words removed
        '''
        rawTextBag = []
        # split on sentence and clause punctuation, which also removes those characters
        sentences = re.split(r'[.()?!&,]', rawText)
        for sentence in sentences:
            rawTextBag.extend(sentence.lower().split())

        # build the stop word set once instead of reloading the list for every word
        englishStopwords = set(stopwords.words('english'))
        return [w for w in rawTextBag if w not in englishStopwords]
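
To show what comes out of that method, here is a standalone sketch of the same steps. It assumes a tiny hand-picked stop word set (`TOY_STOPWORDS`, a hypothetical stand-in for NLTK's much longer English list) so that it runs without the NLTK corpus installed. The function name `toyTextBag` is also made up.

```python
import re

# tiny stand-in for stopwords.words('english'); the real list is much longer
TOY_STOPWORDS = set(['i', 'the', 'it', 'was', 'and', 'a', 'my'])

def toyTextBag(rawText):
    '''
    @param rawText: a string of whitespace delimited text
    @return: lowercased word tokens with punctuation and stop words removed
    '''
    rawTextBag = []
    for sentence in re.split(r'[.()?!&,]', rawText):
        rawTextBag.extend(sentence.lower().split())
    return [w for w in rawTextBag if w not in TOY_STOPWORDS]

bag = toyTextBag("I loved it! The shipping was fast, and the price was great.")
print(bag)  # ['loved', 'shipping', 'fast', 'price', 'great']
```

Characters that are not in the split pattern, such as apostrophes and colons, stay attached to their words.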

I generate test and training data for one and five star reviews using  generateTestAndTrainingSetsFromReviews():

    import random

    # load helper objects
    sjr = SiteJabberReviews(pageUrl, filename)
    sjr.load()
    asd = AnalyzeSiteData()

    # 80% of each rating goes to training, 20% to testing
    trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
    trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)

    # combine both ratings and shuffle so the classes are interleaved
    rawTrainingSetData = trainingSet1 + trainingSet5
    random.shuffle(rawTrainingSetData)

    rawTestSetData = testSet1 + testSet5
    random.shuffle(rawTestSetData)

With training and test data built I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not, which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

    from nltk import FreqDist

    # for raw training data, generate all words in the data
    all_words = [w for (words, condition) in rawTrainingSetData for w in words]
    fdTrainingData = FreqDist(all_words)

    # take the first 1000 keys as features (NLTK 2's FreqDist.keys() should return
    # them most frequent first; NLTK 3 would need most_common(1000) instead)
    defaultWordSet = fdTrainingData.keys()[:1000]

    def emitDefaultFeatures(tokenizedText):
        '''
        @param tokenizedText: an array of text features
        @return: a feature map from that text.
        '''
        tokenizedTextSet = set(tokenizedText)
        featureSet = {}
        for text in defaultWordSet:
            featureSet['contains:%s' % text] = text in tokenizedTextSet

        return featureSet
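
Here is the same encoding on a toy scale, as a sketch. The three-word vocabulary `toyWordSet` and the function `emitToyFeatures` are hypothetical stand-ins for `defaultWordSet` and `emitDefaultFeatures`.

```python
# hypothetical three-word feature vocabulary, standing in for defaultWordSet
toyWordSet = ['great', 'refund', 'shipping']

def emitToyFeatures(tokenizedText):
    '''
    @param tokenizedText: an array of word tokens from one review
    @return: a map of 'contains:<word>' to True/False for every vocabulary word
    '''
    tokenizedTextSet = set(tokenizedText)
    return dict(('contains:%s' % word, word in tokenizedTextSet)
                for word in toyWordSet)

features = emitToyFeatures(['loved', 'shipping', 'fast', 'price', 'great'])
print(features)
```

Every review gets the same three keys, and only the True/False values differ. Words outside the vocabulary ('loved', 'fast', 'price') are ignored.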

That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: 

      def encodeData(self,trainSet,encodingMethod):
          return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the encoding method in place, we can encode the training data and train the classifier as follows:

       encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
       classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we will test its accuracy against test data. As we already know the classification of the test data, accuracy is simple to calculate.

       encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
       print nltk.classify.accuracy(classifier, encodedTestSet)

This gives me an accuracy of 0.83, meaning the classifier was correct on 83% of the test reviews. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all approximately 3000 words in the reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy drops to 75%.
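
For reference, the accuracy numbers above are just the fraction of test reviews whose predicted rating matches the known rating. The sketch below shows that calculation in plain Python. The names `toyAccuracy` and `toyClassify`, the one-rule 'classifier', and the test data are all made up for illustration; this is my understanding of the calculation, not NLTK's code.

```python
def toyAccuracy(classify, labeledFeatureSets):
    '''
    @param classify: a function mapping a feature map to a predicted rating
    @param labeledFeatureSets: a list of (featureMap, rating) tuples
    @return: fraction of items where the prediction matches the rating
    '''
    correct = sum(1 for (features, rating) in labeledFeatureSets
                  if classify(features) == rating)
    return correct / float(len(labeledFeatureSets))

def toyClassify(features):
    # hypothetical rule standing in for a trained classifier
    return 5 if features.get('contains:great') else 1

toyTestSet = [
    ({'contains:great': True}, 5),
    ({'contains:great': False}, 1),
    ({'contains:great': True}, 1),   # the rule gets this one wrong
    ({'contains:great': False}, 1),
]

print(toyAccuracy(toyClassify, toyTestSet))  # 0.75
```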

Summary

The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. In my next post I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data.



1 comment:

  1. I had originally written an encoding function that encoded the word as the value because I thought that the frequency of the feature value was being counted. That resulted in an incredibly low score. I read through the training method code and realized that what is being counted for each feature are the permutations of the value across each label. Also, the feature set needs to be invariant across the training examples. Doing this gave me the 'acceptable' baseline of 82% accuracy.
