Wednesday, January 22, 2014

Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that  I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site, http://sitejabber.com)

    from sitejabberreviews import SiteJabberReviews
    from analyzesitedata import AnalyzeSiteData


    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    

    sjr.download(True) # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    
    sjr.load()

Next Step: Bayesian Classification

NLTK comes with several built-in classifiers, including a Bayesian classifier. There are much better explanations of Bayes' theory than I could possibly provide, but the basic idea as it applies to text classification is this: the occurrence of words across bodies of previously classified documents can be used to classify other documents into one of those same classifications. The need for previously classified documents means the Bayesian classifier is a supervised classifier: it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

    P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features is the joint probability of those features occurring with that rating (as previously observed in rated documents), divided by the probability of those features on their own. Since the features are the words in the review, the denominator is the same no matter which rating we consider, so for the purpose of comparing ratings that term effectively 'drops out'. The score for a specific rating then reduces to the multiplied probabilities of the review's terms having appeared in previously observed documents with that same rating.

    P(review rating | features) ∝ P(features, review rating)
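
To make that concrete, here's a toy comparison with made-up numbers (the words and probabilities below are purely illustrative, not taken from the actual review data):

    # made-up per-word probabilities, illustrative only
    p_refund_given_1star, p_refund_given_5star = 0.30, 0.02
    p_love_given_1star, p_love_given_5star = 0.05, 0.40

    # naive Bayes multiplies the per-feature probabilities for each candidate rating
    # (priors and the shared denominator are ignored here)
    score_1star = p_refund_given_1star * p_love_given_1star  # 0.015
    score_5star = p_refund_given_5star * p_love_given_5star  # 0.008

    # 0.015 > 0.008, so a review containing both 'refund' and 'love'
    # looks more like a one star review under these numbers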

That simplified picture isn't the whole story: there are some complexities in the details. For example, while the strongest features would be the ones that appear in only one of the review classes, a Bayesian classifier can't work with P(feature) = 0 for the other class, since a single zero factor drives the whole product above to zero. To avoid that, smoothing techniques are applied. These techniques basically add a very small increment to the count of all features (including the zero-count ones) so that there are no zero values, while the probability distribution essentially stays the same. The size of the increment depends on the values in the probability distribution of P(feature, label) for all features of a specific label.
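
As a rough illustration of how that smoothing works -- this is a generic add-alpha ('Lidstone') sketch; NLTK's classifier handles smoothing internally, and the function name and numbers here are just for illustration:

    def smoothedProbability(featureCount, totalCount, numValues=2, alpha=0.5):
        # each 'contains:word' feature is binary (True/False), so numValues is 2;
        # a small constant is added to every count so nothing is ever exactly zero
        return (featureCount + alpha) / (totalCount + alpha * numValues)

    # e.g. a feature seen 0 times vs 60 times across 200 reviews of a given rating
    smoothedProbability(0, 200)    # ~0.0025 -- small, but no longer zero
    smoothedProbability(60, 200)   # ~0.3010 -- essentially the raw 60/200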

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews, training it with one-star and five-star review data. This is a pretty simple, binary approach to review classification.

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:
  1. Extract train and test data from my review data.
  2. Encode train and test data.
  3. Train the classifier with the encoded training data.
  4. Test the classifier with the encoded test data.
  5. Investigate errors during the test.
  6. Modify training set and repeat as needed.
I've written a helper method to generate training and test data:

    def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
        # generate tuples of (textbag, rating) for every review with this rating
        reviewList = [(self.textBagFromRawText(review.text), key)
                      for review in reviews.reviewsByRating[key]]

        # split into a training set and a test set at the requested percentage
        splitIndex = int(trainSetPercentage * len(reviewList))
        return reviewList[:splitIndex], reviewList[splitIndex:]

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(). In that method I create an array of words after splitting the raw text into sentences and stripping punctuation and stop words:

    def textBagFromRawText(self, rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, stripped of non text chars including punctuation
        '''
        # requires: import re, and from nltk.corpus import stopwords
        rawTextBag = []
        sentences = re.split('[\.\(\)?!&,]', rawText)
        for sentence in sentences:
            lowered = sentence.lower()
            parts = lowered.split()
            rawTextBag.extend(parts)

        # build the stop word set once instead of once per token
        stopwordSet = set(stopwords.words('english'))
        textBag = [w for w in rawTextBag if w not in stopwordSet]

        return textBag
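
For example, fed a made-up snippet of review text, the method produces a word bag roughly like this (illustrative output; the exact result depends on NLTK's English stop word list):

    asd = AnalyzeSiteData()
    asd.textBagFromRawText("I love this site. Fast shipping, great prices!")
    # -> ['love', 'site', 'fast', 'shipping', 'great', 'prices']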

I generate test and training data for one- and five-star reviews using generateTestAndTrainingSetsFromReviews():

    # load helper objects
    sjr = SiteJabberReviews(pageUrl, filename)
    sjr.load()
    asd = AnalyzeSiteData()

    # 80% of each rating's reviews go to training, the rest to test
    trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
    trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)

    # combine and shuffle the one and five star data (requires: import random)
    rawTrainingSetData = []
    rawTrainingSetData.extend(trainingSet1)
    rawTrainingSetData.extend(trainingSet5)
    random.shuffle(rawTrainingSetData)

    rawTestSetData = []
    rawTestSetData.extend(testSet1)
    rawTestSetData.extend(testSet5)
    random.shuffle(rawTestSetData)

With training and test data built, I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents; the presence (or absence) of those features in each document is what helps classify it. I'm flagging each feature as True if it is in the review text and False if it is not -- which lets the classifier build up feature frequency across the entire corpus and calculate the feature frequency per review type.

    # for the raw training data, build a frequency distribution over all words
    # (requires: from nltk import FreqDist)
    all_words = [w for (words, condition) in rawTrainingSetData for w in words]
    fdTrainingData = FreqDist(all_words)

    # take an arbitrary subset of these -- the first 1000 words
    defaultWordSet = fdTrainingData.keys()[:1000]

    def emitDefaultFeatures(tokenizedText):
        '''
        @param tokenizedText: an array of text features
        @return: a feature map from that text.
        '''
        tokenizedTextSet = set(tokenizedText)
        featureSet = {}
        for text in defaultWordSet:
            featureSet['contains:%s' % text] = text in tokenizedTextSet

        return featureSet
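
For a single tokenized review, the resulting feature map looks something like this (only a few of the 1000 entries shown, and the vocabulary words here are hypothetical):

    emitDefaultFeatures(['love', 'fast', 'shipping'])
    # -> {'contains:love': True, 'contains:fast': True, 'contains:shipping': True,
    #     'contains:refund': False, 'contains:terrible': False, ...}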

That feature map needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes tokenized text to an encoded feature set is here:

      def encodeData(self,trainSet,encodingMethod):
          return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the feature extractor defined, we can encode the training data and train the classifier as follows:

       encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
       classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we can test its accuracy against the test data. Since we already know the classification of the test data, accuracy is simple to calculate.

       encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
       print nltk.classify.accuracy(classifier, encodedTestSet)

This gives me an accuracy of 0.83, meaning 83% of the time the classifier is correct. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the training reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy drops to 75%.
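
One handy way to see which features are doing the work -- not something I used above, but built into NLTK's Naive Bayes classifier -- is to ask it for its most informative features:

    # prints the features whose probability differs most between the two ratings
    classifier.show_most_informative_features(20)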

Summary

The number of features obviously plays a role in accuracy, but only up to a point. I wonder what happens if we start removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be the ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.



Saturday, January 4, 2014

Making Sense of Unstructured Text in Online Reviews, Part 1

Introduction

I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason we had such a great time was the amount of up-front research we put into finding the right places to stay -- we researched the hell out of them via tripadvisor reviews.

After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kinds of issues (focusing on the dislikes here) grew or abated over time.

I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!

I've broken this effort into two+ phases: getting the data, analyzing/profiling the data, and tbd next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)

Phase 1: Getting The Data

I had been out of the screen scraping loop for a while. I had heard of BeautifulSoup, the Python web scraping utility, but I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.

Here is the bulk of the logic I used to pull pagination data and then navigate to review pages from sitejabber.com (I'm focusing on ecommerce sites first):

    # requires: import urllib2, import re, and from bs4 import BeautifulSoup
    # first get the pages we need to navigate to to get all reviews for this site
    page = urllib2.urlopen(self.pageUrl)
    soup = BeautifulSoup(page)

    pageNumDiv = soup.find('div', {'class': 'page_numbers'})
    anchors = pageNumDiv.find_all('a')

    urlList = []
    urlList.append(self.pageUrl)
    for anchor in anchors:
        urlList.append(self.base + anchor['href'])

    # with all pages collected, pull each page down and extract review text and rating
    for url in urlList:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        divs = soup.find_all('div', id=re.compile('ReviewRow-.*'))

        for div in divs:
            text = div.find('p', id=re.compile('ReviewText-.*')).text
            rawRating = div.find(itemprop='ratingValue')['content']

Note the need to download the page first: I used urllib2.urlopen() to get the page. I then created a BeautifulSoup representation of the page:

    soup = BeautifulSoup(page)

Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:

    for div in divs:
        text = div.find('p', id=re.compile('ReviewText-.*')).text
        rawRating = div.find(itemprop='ratingValue')['content']

.text above retrieves the inner text of any element. Element attributes are accessed as keys on the element, like the 'content' attribute above. The rawRating value was actually pulled from a meta tag inside the review div above:

    <meta itemprop="ratingValue" content="1.0"/>

find()/find_all() are very powerful; a lot more detail is provided in the documentation. They can search by element id, by specific attributes (the itemprop attribute above is an example), and they accept regexes to match multiple elements.
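
To make that concrete, here are the three lookup styles from the scraper code gathered in one place (the variable names are just for illustration):

    # match element ids against a regex
    divs = soup.find_all('div', id=re.compile('ReviewRow-.*'))

    # match on an arbitrary attribute, here the itemprop microdata attribute
    ratingTag = soup.find(itemprop='ratingValue')

    # match on a tag name plus a specific class
    pageNumDiv = soup.find('div', {'class': 'page_numbers'})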

Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a reviewsByRating map:

    for div in divs:
        text = div.find('p', id=re.compile('ReviewText-.*')).text
        rawRating = div.find(itemprop='ratingValue')['content']

        r = Review(text, rawRating)

        if self.reviewsByRating.has_key(r.rating):
            reviews = self.reviewsByRating[r.rating]
        else:
            reviews = []
            self.reviewsByRating[r.rating] = reviews

        reviews.append(r)
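
(As an aside, that if/else bookkeeping can be collapsed with dict.setdefault -- same behavior, just more compact:)

    r = Review(text, rawRating)
    self.reviewsByRating.setdefault(r.rating, []).append(r)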

I then flushed that map to disk using pickle:

    def saveToDisk(self):
        # requires: import pickle
        with open(self.filename, 'w') as f:
            pickle.dump(self.reviewsByRating, f)

this let me load the data from file without having to scrape it again:

    def load(self):
        with open(self.filename, 'r') as f:
            self.reviewsByRating = pickle.load(f)

The next step will be to start investigating the data.