Tuesday, March 18, 2014

Making Sense of Unstructured Text In Online Reviews Part 4: Sentiment Analysis: Is More Data The Cure?

A Swing And A Miss

In my last post I trained a Bayesian classifier using a dataset pulled from sitejabber.com, which provides reviews of ecommerce sites. I had pulled that data for a single site. When I trained and tested the classifier, I found that even though it scored 83% accuracy, it had completely mis-classified all of the positive reviews.

As noted in the last post, only 20% of my original review data was positive -- 55 records out of the total of approximately 275 records. This leads me to three questions:
  1. Did I really test and validate the data in the most effective way?
  2. As this is a Bayesian classifier, will increasing the amount of positive data help the classifier identify positive data more effectively?
  3. If so, how do I go about increasing data from a finite set of data? 

Giving the classifier another chance with K fold validation

Before trying anything, I'd like to understand whether my test and validation approach could be made more deterministic. I had previously run several iterations using randomly selected test and training splits. That doesn't give me guaranteed coverage of my entire data set or a valid, reproducible process upon which I can try improvements.

I can get that coverage and reproducibility by using k fold validation across the data.

K fold validation works like this:

  1. Break the dataset into k equally sized subsets (folds).
  2. Hold one fold out for testing.
  3. Train on the remaining k-1 folds.
  4. Test on the held-out fold.
  5. Rotate through all folds: repeat steps 2-4, holding out a different fold each time.
  6. Average the accuracy across all train+test runs.

When I re-ran my tests using k-fold validation with 10 folds, I got an average accuracy across the entire dataset of 84.6%, which differs from the 83% score I had gotten with the 'randomized' tests.

In this baseline run I implicitly 'stratified' my test and training data -- all test and training folds had the same proportion of positive and negative reviews: in this case, roughly a 1:5 ratio of positive to negative reviews.

Getting the data into k-foldable form involves two steps: dividing it into k folds, then building test data from one fold and training data from all of the rest. I've implemented these steps separately so that I can iterate through all k test and training sets using the same k folds.

This is the method I used to split an array into k folds:

def partitionArray(self,partitions, array):
        """
        @param partitions - the number of partitions to divide array into
        @param array - the array to divide
        @return an array of the partitioned array parts (array of subarrays)
        """
        nextOffset = incrOffset = len(array)/partitions
        remainder = len(array)%partitions
        lastOffset = 0
        partitionedArray = []

        for i in range(partitions):
            partitionedArray.append(array[lastOffset:nextOffset])
            lastOffset = nextOffset
            nextOffset += incrOffset

        # tack the leftover (len(array) % partitions) elements onto the last partition
        partitionedArray[-1].extend(array[lastOffset:lastOffset + remainder])

        return partitionedArray

This is the method I used to build test and train sets, holding out the partition specified by the iteration parameter. It assumes I'm handing it two k-partitioned arrays, one with bad reviews and one with good reviews.

def buildKFoldValidationSets(self, folds, iteration, reviewsByRating):
        """
        build test and training sets
        @param folds - the total number of folds
        @param iteration - the offset of the fold to hold out for testing
        @param reviewsByRating - the k-partitioned reviews to build from, keyed by rating
        @return training and test arrays
        """
        
        test = []
        test.extend(reviewsByRating[1][iteration])
        test.extend(reviewsByRating[5][iteration])
        
        training = []
    
        for i in range(folds):
            if i == iteration:
                continue
            training.extend(reviewsByRating[1][i])
            training.extend(reviewsByRating[5][i])

        return training, test
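
Putting these two methods together, the k-fold run itself is just a loop. Here is a minimal sketch of the driver, not my verbatim code: badReviews and goodReviews are stand-ins for the lists of (textBag, rating) tuples for 1 and 5 star reviews, and everything is shown as if it lives on the asd (AnalyzeSiteData) helper, reusing the encodeData() and emitDefaultFeatures() methods described in the earlier posts.

    # illustrative k-fold driver; badReviews / goodReviews are stand-in
    # lists of (textBag, rating) tuples for 1 and 5 star reviews
    folds = 10
    reviewsByRating = {
        1: asd.partitionArray(folds, badReviews),
        5: asd.partitionArray(folds, goodReviews),
    }

    accuracies = []
    for iteration in range(folds):
        training, test = asd.buildKFoldValidationSets(folds, iteration, reviewsByRating)

        encodedTrainSet = asd.encodeData(training, emitDefaultFeatures)
        classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

        encodedTestSet = asd.encodeData(test, emitDefaultFeatures)
        accuracies.append(nltk.classify.accuracy(classifier, encodedTestSet))

    # average accuracy across all folds
    print sum(accuracies)/len(accuracies)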

Increasing The Data Set with Sampling

How do I increase the set of positive data if there is no more data to be used? I can take advantage of the fact that I am using a Bayesian classifier, which takes a 'bag of words' approach.  In Bayesian classification, there is no information that depends on the sentence structure of the review text or the sequence of words, just words and word frequency counts. And the features (the words) are assumed to be independent from one another.

How does that help? My theory is that mis-classification happened because there wasn't enough positive review data to help the classifier recognize positive vs negative reviews.  In order to increase the positive data set I need to generate more positive reviews.

Knowing that the Bayesian classifier doesn't care about sentence structure or word interdependence allows me to treat reviews as bags of words and nothing more. The word frequency counts in those generated bags of words need to line up with the overall word frequency distribution of the real reviews they are generated from.

One way to do this is to build the data from the data that already exists, by taking random samples from an array that contains all the words across the set of positive reviews in the training data.

Pretend the following sentence is actually a review:
         The big big green caterpillar ate the small green leaf.

Putting the words into an array gives me this: 
        somearray = ['the','big','big','green','caterpillar','ate','the','small','green','leaf']

I can sample that array to build up another sentence. Each sampled word has a 1/10 chance of being 'leaf' and a 1/5 chance of being 'big'. I can extend the sample set to be as large as I want -- covering multiple sentences, a review, multiple reviews, etc.

In this case I'm 'sampling with replacement', meaning that I don't remove the sample I get from the sampled set, which means that the probability of picking a word does not change across samples. This is important because I want the words in any generated data to have the same probability distribution that they do in the real data, and my sample set is built from the real data.

In Python sampling with replacement looks like this:

        word = somearray[random.randint(0, len(somearray) - 1)] 
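
Equivalently, random.choice() picks a random element and leaves the pool intact, so it does the same sampling with replacement without the index arithmetic:

        word = random.choice(somearray)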

I use this method to create reviews composed of words randomly sampled from the distribution of positive training words, with the review length set to the average length of all real positive reviews:

def createReview(self,textFreqDist,reviewLength):
        """
        @param textFreqDist - the array of words to sample from; word frequency is
                              reflected by how often a word repeats in the array
        @param reviewLength - the length of the review (in words) to build
        @return the generated review as a string
        """
        randLen = len(textFreqDist)
        reviewStr = ""

        for i in range(reviewLength):
            reviewStr += (textFreqDist[random.randint(0, randLen - 1)] + ' ')

        return reviewStr
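
To actually 'boost' the positive side of the training data, I generate synthetic positive reviews until the positive count catches up with the negative count. The sketch below is illustrative rather than my verbatim code: positiveTraining and negativeTraining are stand-ins for the (textBag, rating) tuples in the training folds (the test fold has already been held out), and createReview() is the method above, shown here as if it lives on the asd helper.

    # build the sampling pool from every word in every positive training review
    positiveWordPool = [w for (textBag, rating) in positiveTraining for w in textBag]

    # average positive review length, in words
    avgLength = len(positiveWordPool)/len(positiveTraining)

    boosted = list(positiveTraining)
    while len(boosted) < len(negativeTraining):
        reviewStr = asd.createReview(positiveWordPool, avgLength)
        boosted.append((reviewStr.split(), 5))

    training = boosted + list(negativeTraining)
    random.shuffle(training)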

A Cautionary Note on Overfitting

When I first did the positive 'boost', I was getting really good results....really, really good results.  99% accuracy on a test set was a number that seemed too good to be true. And it was. 

In my code I had not 'held out' the test data prior to growing the training data. So my training data was being seeded with words from my test data, and I was 'polluting' my training and test process. While the classifier performed incredibly well on the test set, it would have performed relatively poorly on other data when compared to a classifier trained and tested on data that has been held apart. 

When I rewrote the training and testing process, I made sure to hold out test data prior to sampling from the training set. This meant that the terms in the positive review test data did not factor into the overall training data sample set. While those words may have been present in the training data sample set, they would be counted at a lower frequency, so the test process wouldn't be biased. 

New Test Results

I ran the same 10 fold validation process over training data whose positive review set had been boosted to be 50% of the overall training set. This isn't stratified k-fold validation -- by boosting the number of positive reviews with resampling of the training word data, I am altering the positive to negative ratio of the training set. Because the test data was held out of the boosting process, the ratio of positive to negative reviews in the test data remains the same. The code used to train and test the data is the same as before.

My test results averaged to 89.4%, an improvement from 84.6%. However, when I look at the errors more closely, I see that most of the errors are still due to mis-classifying positive reviews, which is interesting given that I've boosted positive training data to be 50% of the training set. In the base training run my best effort mis-classified 60% of the positive reviews, and my worst effort mis-classified 100% of the positive reviews. In the boosted training run my best effort mis-classified 20% of the positive reviews, and my worst effort mis-classified 60% of the positive reviews. 
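
The per-class percentages above come from tallying errors by rating rather than relying on the single accuracy number. Here is a minimal sketch of that tally for one fold, assuming classifier is the trained NaiveBayesClassifier, test is the held-out fold of (textBag, rating) tuples, and emitDefaultFeatures() is the feature encoder from the earlier posts:

    from collections import defaultdict

    errors = defaultdict(int)
    totals = defaultdict(int)

    for (textBag, rating) in test:
        totals[rating] += 1
        if classifier.classify(emitDefaultFeatures(textBag)) != rating:
            errors[rating] += 1

    for rating in sorted(totals.keys()):
        print rating, ':', errors[rating], 'of', totals[rating], 'mis-classified'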

Summary

This improvement makes sense because word frequency directly affects how the Bayesian classifier works. My 'boosting' effort worked because of the naive assumption of word independence in the classifier -- I didn't have to account for word dependencies, I only had to account for word frequency. 

If I were to do this over again, I would do the following: 
  1. If at all possible, get more data. Having only 5-10 positive reviews in the test set didn't give me a lot to work with -- it is hard to draw conclusions from such a small positive review set.
  2. Use k-fold validation from the beginning to get the average accuracy per approach.
  3. Investigate mis-classification errors before doing any optimization! 

Most of my time was spent analyzing and building the optimal training data set. The biggest improvement came not from tweaking the algorithm, but from 'boosting' the positive training data to increase the recognition of positive reviews. The biggest mistake I made was not examining training errors immediately.

To get improvements, I couldn't treat the algorithm as a black box; I had to know enough about how it functioned to prepare the data for an optimal classification score. Note that this approach wouldn't work with a classifier that assumed some level of dependence between the words in a review text -- I'd have to model that dependency in order to generate reviews. 

A final note: this is a classifier that was trained on a single source of reviews. That's great to classify more reviews about that ecommerce site, but the classifier would probably suck tremendously on a travel review site. However, the approaches taken would work if we had travel review site training data. 

Potential next steps include:
  1. Getting (more) new data from a different source.
  2. Trying the Bayes classifier on that data.
  3. Trying a different classifier, e.g. the maxent classifier, on the same data.
  4. Going deeper into sentiment: what entities were positive / negative sentiment directed at?


Tuesday, February 4, 2014

Making Sense of Unstructured Text in Online Reviews Part 3: Trying to Improve Classifier Accuracy

This post is part of a series where I try to classify online review text in more and more concrete ways. Right now I'm training a classifier to accurately classify one (bad) vs five (good) starred reviews. In the last post I had done some initial training and testing of an NLTK Bayesian classifier. In this post I want to see if I can improve the accuracy score of my classifier by getting smarter about which features I include.

In the last post I had experimented with varying the size of the feature set, and found that while encoding more features into a classifier during training helps accuracy, there is an eventual accuracy ceiling. My feature set came from taking the top N words from a frequency distribution of all words in the reviews text. Here is what the accuracy curve looks like:

Another way to improve accuracy is to address the 'quality' of the feature set by looking at features not only in terms of their frequency across the training corpus, but also in terms of their relative frequencies across classifications.

In the review classification done so far, individual words are the features. I'm going to try to 'tune' feature sets in several different ways -- I have no idea if these will work, but they seem reasonable. I'm going to call these attempts hypotheses, because my goal is to prove them to be true or false, with relatively minimal effort.

Hypothesis 1: Throw away features with a low 'frequency differential'

My hypothesis is that there are features that have a much higher chance of being in a negative review than a positive review, and vice versa. Those are the features that we want to keep. Other features are ones that have approximately the same chance of being in either type of review (positive or negative).

    P(review rating | features) = P(features, review rating)

In the equation above, the P(features, review rating) term is the multiplied probabilities of each P(feature, review rating). If I'm looking for a higher overall probability that a document is one star over five star or vice versa, having per feature probabilities that are similar for one star or five star reviews means that my overall probabilities for one and five star will be close to equal, which could tip classification results 'the other way' and increase my error rate.

I can validate this hypothesis by filtering out those low probability differential features and keeping the ones that have a high probability differential: a high difference between P(feature, review rating) for {1 star, 5 star} ratings.
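
To make the 'tipping' intuition concrete, here is a toy calculation with made-up probabilities (the numbers are purely illustrative): a term whose per-class probabilities are nearly equal contributes almost nothing to separating one star from five star reviews, while a term with a large differential dominates the product (or, in log space, the sum).

            import math

            # toy, made-up values of P(feature | rating)
            probs = {
                'shipping': {1: 0.010, 5: 0.009},  # low differential: weak signal
                'refund':   {1: 0.020, 5: 0.001},  # high differential: strong 1-star signal
            }

            for word, p in probs.items():
                # log-ratio contribution of this feature to the 1-star vs 5-star decision
                print word, math.log(p[1]/p[5])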

Building The Feature Set

I had trained and tested the classifier by taking the input data, splitting it into a test and a training set, then training and testing. I will recreate that process now to get the raw data so that I can 'remove' common terms with a low probability differential:

            sjr = SiteJabberReviews(pageUrl,filename)
            sjr.load()
            asd = AnalyzeSiteData()
           
I ended up recoding the building of the training and test data so that the data sets being built had a more even distribution of ratings across them:

def generateLearningSetsFromReviews(self,reviews, ratings,buckets):
        
        # check that the bucket percentages do not exceed 1.0
        val = 0.0
        for pct in buckets.values():
            val += pct

        if val > 1.0:
            raise ValueError('percentage values must be floats and must sum to 1.0')

        # collate reviews by rating across all review sets
        reviewsByRating = defaultdict(list)

        for reviewSet in reviews:
            for rating in ratings:
                reviewList = [(self.textBagFromRawText(review.text), rating) 
                      for review in reviewSet.reviewsByRating[rating]]
                reviewsByRating[rating].extend(reviewList)
                random.shuffle(reviewsByRating[rating]) # mix up reviews from different reviewSets

        # break the collated set for each rating into percentage buckets
        learningSets = defaultdict(list) 

        for rating in ratings:
            sz = len(reviewsByRating[rating]) 

            lastidx = 0
            for (bucketName, pct) in buckets.items():
                idx = lastidx + int(pct*sz)

                learningSets[bucketName].extend(reviewsByRating[rating][lastidx:idx])

                lastidx = idx

        return learningSets


When I built up the training data using this method, the sets were returned in the buckets dictionary, keyed by bucket name:

        buckets = asd.generateLearningSetsFromReviews([sjr],[1,5],{'training': 0.8,'test':0.2})


Each bucket in this dictionary maps to an array of (textBag, rating) tuples:
     buckets = {'training': [(bagOfText, rating), ...], 'test': [(bagOfText, rating), ...]}

I want to get frequency distributions of common terms from one and five star reviews in the training data, so that I can find terms with a high probability differential:                         

            # get common terms and frequency differentials

            allWords1 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 1]
            fd1 = FreqDist(allWords1)
            
            allWords5 = [w for (textBag,rating) in buckets['training'] for w in textBag if rating == 5]
            fd5 = FreqDist(allWords5)

            commonTerms = [w for w in fd1.keys() if w in fd5.keys()]

            # now get frequency differentials

            commonTermFreqs = [(w,fd1.freq(w),fd5.freq(w),abs(fd1.freq(w)-fd5.freq(w))) 
                for w in commonTerms]

            commonTermFreqs.sort(key=itemgetter(3),reverse=True)

Now we've got common terms, sorted by their absolute differential between frequency distributions in 1 and 5 star reviews.

If I plot this distribution:

            freqdiffs = [diff for (a,b,c,diff) in commonTermFreqs]
            plt.plot(freqdiffs)
            plt.show()

I can see that it falls off sharply:

This looks like a Zipfian distribution: "given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word..." 

The shape of this distribution implies that only a small subset of the terms actually have a frequency differential that really 'matters' for the hypothesis -- not all terms are needed. I can start arbitrarily by keeping all terms with a frequency differential > 0.001 to quickly test the hypothesis. That leaves 131 of the original 688 common terms.

Note that in getting and filtering common terms, I have not yet retained the terms that most strongly signal one review rating or the other: the terms that exist in only one of the two review corpora. Even though a term that is absent from one corpus would normally drive the Bayesian calculation to zero, such terms are 'smoothed out' by adding a very small value to the frequency of all terms in that corpus, which guarantees that no term has a zero frequency and the calculation won't zero out.

I need to add those single-corpus terms into the set of terms that I filter by.

The full set of filter terms is made up of both the filtered common words and the words that appear in only one of the two corpora:

            filterTerms = [w for (w,x,y,diff) in commonTermFreqs if diff > 0.001]
            fd1Only = [w for w in fd1.keys() if w not in fd5.keys()]
            filterTerms.extend(fd1Only)
            fd5Only = [w for w in fd5.keys() if w not in fd1.keys()]
            filterTerms.extend(fd5Only)

            defaultWordSet = set(filterTerms) # rename so I don't have to rewrite the encoding method 

And I use those words as features identified at encoding time:

            def emitDefaultFeatures(tokenizedText):
                '''
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                '''
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet
                

                return featureSet

Testing The Hypothesis

Now I can train the classifier: asd.encodeData() takes care of encoding features from the training and test sets by calling emitDefaultFeatures() for each review.

            encodedTrainSet = asd.encodeData(rawTrainingSetData,emitDefaultFeatures )
            classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
            
            encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)

            print nltk.classify.accuracy(classifier, encodedTestSet)

And I get an accuracy of 0.83, the same accuracy I got with no manipulation of the feature set, which is 0.02 less than my optimal accuracy. Whoops.

Detailed Error Analysis

There is one other step I can take to understand the accuracy of the classifier: analyzing the errors made on the test set. Knowing how the classifier mis-classified the data can help me improve it.

            shouldBeClassed1 = []
            shouldBeClassed5 = []
            
            for (textbag, rating) in buckets['test']:
                testRating = classifier.classify(emitDefaultFeatures(textbag))
                if testRating != rating:
                    if rating == 1:
                        shouldBeClassed1.append(textbag)
                    else:
                        shouldBeClassed5.append(textbag)

A quick check on the error arrays shows me that I've only made mistakes on the reviews that should be classified as positive:

          >>> print len(shouldBeClassed5)
          11

Wait a minute. That number looks familiar. Let me review the raw data again: 
          >>> print len(sjr.reviewsByRating[5])
          55
          >>> print int(0.2*len(sjr.reviewsByRating[5]))
          11

This shows that I mis-classified all 11 positive reviews in the test data: the error analysis found eleven mis-classified positive reviews, and the 80% training / 20% test split left only 11 positive reviews in the test set.

Quickly reverting to the original test method (which collected features from a FreqDist of all terms in the training data) shows that it mis-classified all 11 positive reviews as well.

Summary

This was one attempt to improve classifier accuracy by trying something reasonable with the feature set -- removing features whose probability differential across 1 star and 5 star review corpuses was very small.

While the numbers initially looked 'decent', deeper analysis shows that my classifier completely mis-classified positive reviews. In the future I'll do error analysis of classifiers before trying to theorize about what could make the classifier more accurate.

Looking closer, the data set had 55 positive reviews and 273 negative reviews in total. In other words, only about 20% of my data was actually positive review data.

I had originally scraped only one reviewed site for data, but now I think I'm going to need to scrape more sites to get a more representative set of positive review data so that the classifier has more training examples.

In my next post I'm going to try to collect a more representative 'set' of data, and also take a slightly different approach to validating my classifier. I'm going to do error analysis up front and attempt to correct my classifier based on the errors I see, then test the classifier against new test data -- testing a fixed classifier against the data I used to fix it will give me a false sense of accuracy, because the test data used to do error analysis has in effect become training data.




Wednesday, January 22, 2014

Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that  I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site, http://sitejabber.com)

    from sitejabberreviews import SiteJabberReviews
    from analyzesitedata import AnalyzeSiteData


    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    

    sjr.download(True) # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

    pageUrl = 'reviews/www.zulily.com'
    filename = "siteyreviews.pkl"
    
    sjr = SiteJabberReviews(pageUrl,filename)    
    sjr.load()

Next Step: Bayesian Classification

NLTK comes with several built in classifiers, including a Bayesian classifier. There are much better explanations of Bayes theory than I could possibly provide, but the basic theory as it applies to text classification is this:  the occurrence of a word across bodies of previously classified documents can be used to classify other documents as being in one of the input classifications. The existence of previously classified documents implies that the Bayesian classifier is a supervised classifier, which means it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

    P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features depends on the probabilities of those features as previously observed in other documents with that rating, divided by the probability that the review has those features. Since the features are the words in the review, the denominator is the same no matter what the rating is, so that term effectively 'drops out'. The probability that a review has a specific rating is then the multiplied probabilities of the terms in the review appearing in previously observed documents with the same rating.

    P(review rating | features) = P(features, review rating)

This isn't completely true: there are some complexities in the details. For example, while the strongest features would be the ones that have no presence in one of the review classes, a Bayesian classifier can't work with P(feature) = 0, as this would make the above equation go to zero. In order to avoid that, there are smoothing techniques that can be applied. These techniques apply a very small increment to the count of all features (including zero valued ones) so that there are no zero values, while the probability distribution essentially stays the same. The size of the increment depends on the values in the probability distribution of P(feature, label) across all features for a specific label.
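
As a rough illustration of the idea -- this is generic add-k smoothing, not NLTK's exact internal estimator -- adding a small constant to every feature count before normalizing gives unseen features a tiny non-zero probability instead of zeroing out the product:

    # rough add-k smoothing sketch; counts maps feature -> observed count
    # within one rating class, vocabulary is every feature the classifier knows about
    def smoothedProbability(feature, counts, vocabulary, k=0.5):
        total = sum(counts.values()) + k*len(vocabulary)
        return (counts.get(feature, 0) + k)/total

    counts = {'great': 3, 'fast': 2}            # e.g. counts from 5-star training reviews
    vocabulary = ['great', 'fast', 'terrible']

    print smoothedProbability('terrible', counts, vocabulary)  # small, but not zero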

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews by training it with one star and five star review data. This is a pretty simple, binary approach to review classification. 

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:
  1. Extract train and test data from my review data.
  2. Encode train and test data.
  3. Train the classifier with the encoded training data
  4. Test the classifier with the encoded test data.
  5. Investigate errors during the test
  6. Modify training set and repeat as needed.
I've written a helper method to generate training and test data:

def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
        # generate tuples of (textBag, rating)
        reviewList = [(self.textBagFromRawText(review.text), key) 
            for review in reviews.reviewsByRating[key]]
        
        splitIdx = int(trainSetPercentage*len(reviewList))
        return reviewList[:splitIdx], reviewList[splitIdx:]

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(), where I create an array of words after splitting the raw text into sentences and stripping punctuation and stop words:

 def textBagFromRawText(self,rawText):
        '''
        @param rawText: a string of whitespace delimited text, 1..n sentences
        @return: the word tokens in the text, stripped of non text chars including punctuation
        '''
        rawTextBag = []        
        sentences = re.split('[\.\(\)?!&,]',rawText)
        for sentence in sentences:
            lowered = sentence.lower()
            parts = lowered.split()
            rawTextBag.extend(parts)
         
        
        textBag = [w for w in rawTextBag if w not in stopwords.words('english')]    

        return textBag
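
For example, calling it (via the asd helper created below) on a made-up review sentence returns a lowercased bag of content words. The exact output depends on NLTK's English stopword list, but it looks roughly like this:

    rawText = "The dress arrived quickly, and the quality was great!"
    print asd.textBagFromRawText(rawText)
    # ['dress', 'arrived', 'quickly', 'quality', 'great']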

I generate test and training data for one and five star reviews using  generateTestAndTrainingSetsFromReviews():

            # load helper objects
            sjr = SiteJabberReviews(pageUrl,filename)
            sjr.load()
            asd = AnalyzeSiteData()

            trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
            
            trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)
            
            rawTrainingSetData = []
            rawTrainingSetData.extend(trainingSet1)
            rawTrainingSetData.extend(trainingSet5)
            random.shuffle(rawTrainingSetData)

            rawTestSetData = []
            rawTestSetData.extend(testSet1)

            rawTestSetData.extend(testSet5)
            random.shuffle(rawTestSetData)

With training and test data built, I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

            # for raw Training Data, generate all words in the data
            
            all_words = [w for (words, condition) in rawTrainingSetData for w in words]
            fdTrainingData = FreqDist(all_words)
            
            # take an arbitrary subset of these
            defaultWordSet = fdTrainingData.keys()[:1000]
            
            def emitDefaultFeatures(tokenizedText):
                '''
                @param tokenizedText: an array of text features
                @return: a feature map from that text.
                '''
                tokenizedTextSet = set(tokenizedText)
                featureSet = {}
                for text in defaultWordSet:
                    featureSet['contains:%s'%text] = text in tokenizedTextSet
                
                return featureSet

That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here: 

      def encodeData(self,trainSet,encodingMethod):
          return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the encoding helper in place, I can encode the training data and train the classifier as follows:

       encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
       classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we can test its accuracy against the test data. Since we already know the classification of the test data, accuracy is simple to calculate.

       encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
       print nltk.classify.accuracy(classifier, encodedTestSet)

This gives me an accuracy of 0.83, meaning that 83% of the time the classifier will be correct. That's pretty good, but I wonder if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy  drops to 75%.
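
The comparison across feature-set sizes is just a loop over how many of the top words to keep. Here is a minimal sketch, assuming the rawTrainingSetData, rawTestSetData, and fdTrainingData built above; the emitFeatures closure is a stand-in for the emitDefaultFeatures pattern, rebuilt for each feature-set size:

        for topN in [100, 1000, 3000]:
            wordSet = set(fdTrainingData.keys()[:topN])

            def emitFeatures(tokenizedText, wordSet=wordSet):
                tokens = set(tokenizedText)
                return dict(('contains:%s' % w, w in tokens) for w in wordSet)

            encodedTrainSet = asd.encodeData(rawTrainingSetData, emitFeatures)
            classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
            encodedTestSet = asd.encodeData(rawTestSetData, emitFeatures)

            print topN, nltk.classify.accuracy(classifier, encodedTestSet)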

Summary

The number of features obviously plays a role in accuracy, but only to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.



Saturday, January 4, 2014

Making Sense of Unstructured Text in Online Reviews, Part 1

Introduction

I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason for the great time we had was the amount of up front research that we put into finding the right places to stay, by researching the hell out of them via tripadvisor reviews.

After reading 100s of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me or why they didn't. I would want to be able to rank their likes and dislikes by type and magnitude, and make business decisions on whether to address them or not. I would also be interested in whether the same kind of issues (focusing on the dislikes here) grew or abated over time.

I could say the same thing about e-commerce sites. If I were in the business of selling someone something, and they really didn't like the way the transaction went, I'd like to know what they didn't like, and whether/how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

One nice thing about reviews is that they come with a quantitative summary: a rating. Every paragraph in a review section of a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!

I've broken this effort into two-plus phases: getting the data, analyzing/profiling the data, and TBD next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling the data, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there. Consider that a warning :)

Phase 1: Getting The Data

I had been out of the screen scraping loop for a while. I had heard of BeautifulSoup, the python web scraping utility. But I had never used it, and thought I was in for a long night of toggling between my editor and the documentation. Boy was I wrong. I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.

Here is the bulk of the logic I used to pull pagination data and then navigate to the review pages on sitejabber.com (I'm focusing on ecommerce sites first):

        # first get the pages we need to navigate to to get all reviews for this site. 
        page = urllib2.urlopen(self.pageUrl)
        soup = BeautifulSoup(page)
        
        pageNumDiv = soup.find('div',{'class':'page_numbers'})
        
        anchors = pageNumDiv.find_all('a')
        
        urlList = []
        urlList.append(self.pageUrl)
        for anchor in anchors:
            urlList.append(self.base + anchor['href'])
        
        # with all pages set, pull each page down and extract review text and rating
        for url in urlList:    
            page = urllib2.urlopen(url)
            soup = BeautifulSoup(page)
            divs = soup.find_all('div',id=re.compile('ReviewRow-.*'))
            
                    
            for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']

Note the need to download the page first: I used urllib2.urlopen() to get the page. I then created a BeautifulSoup representation of the page:

    soup = BeautifulSoup(page)

Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I needed. Any element returned is itself searchable, and has different ways to access its attributes:

    for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
               rawRating = div.find(itemprop='ratingValue')['content']

The .text accessor above retrieves the inner text from any element. Element attributes are accessed as keys on the element, like the 'content' one above. The rawRating value was actually pulled from a meta tag that was in the ReviewText div above: 

<meta itemprop="ratingValue" content="1.0"/>

find()/find_all() are very powerful; a lot more detail is provided in the documentation. They can search by element ID or by specific attributes (the itemprop attribute above is an example), and regexes can be used to match multiple elements. 

Crawling all of that data is fun but time consuming. I stored review text and rating data in a wrapper class, mapped by rating into a reviewsByRating map:

         for div in divs:
                text = div.find('p',id=re.compile('ReviewText-.*')).text
                rawRating = div.find(itemprop='ratingValue')['content']
                
                
                r = Review(text,rawRating)
                
                if self.reviewsByRating.has_key(r.rating):
                    reviews = self.reviewsByRating[r.rating]
                else:
                    reviews = []
                    self.reviewsByRating[r.rating] = reviews
                
                reviews.append(r) 

and flushed that map to disk using pickle:

     def saveToDisk(self):
          with open(self.filename,'w') as f:
              pickle.dump(self.reviewsByRating,f)

This let me load the data from the file without having to scrape it again:

    def load(self):
          with open(self.filename,'r') as f:
              self.reviewsByRating = pickle.load(f)

Next step will be to start investigating the data.