Waving Not Drowning: Making Sense of Unstructured Text in Online Reviews, Part 2: Sentiment Analysis

Wednesday, January 22, 2014

In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).

To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."

Brief Recap

I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis.

To review: Here is how I would download data for all reviews from a review site:

(note I've only implemented a 'scraper' for one site, http://sitejabber.com)

from sitejabberreviews import SiteJabberReviews
from analyzesitedata import AnalyzeSiteData

pageUrl = 'reviews/www.zulily.com'
filename = "siteyreviews.pkl"

sjr = SiteJabberReviews(pageUrl, filename)
sjr.download(True)  # this saves the reviews to the file specified above

Once I've downloaded the data, I can always load it up from that file later:

pageUrl = 'reviews/www.zulily.com'
filename = "siteyreviews.pkl"

sjr = SiteJabberReviews(pageUrl, filename)
sjr.load()

Next Step: Bayesian Classification

NLTK comes with several built-in classifiers, including a Bayesian classifier. There are much better explanations of Bayes' theorem than I could possibly provide, but the basic idea as it applies to text classification is this: how often a word occurs across bodies of previously classified documents can be used to classify other documents as belonging to one of those classifications. The need for previously classified documents means that the Bayesian classifier is a supervised classifier: it must be trained with data that has already been classified.

This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:

P(review rating | features) = P(features, review rating)/P(features)

In other words, the probability that a review has a specific rating given its features is the joint probability of seeing those features together with that rating in previously observed documents, divided by the probability that a review has those features at all. Since the features are the words in the review, they are the same no matter which rating is being considered, so that denominator effectively 'drops out' when comparing ratings. What is left is proportional to how common the rating is, multiplied by the probabilities of each term in the review appearing in previously observed documents that had the same rating (the 'naive' part is treating the terms as independent of each other).

P(review rating | features) ∝ P(features, review rating)
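
Written out under the 'naive' assumption that word features are independent of each other given the rating (an assumption the classifier makes, not something the data guarantees), the simplification looks roughly like this, where f_1 ... f_n are my shorthand for the individual word features:

```latex
\begin{align*}
P(\text{rating} \mid f_1, \dots, f_n)
  &= \frac{P(f_1, \dots, f_n, \text{rating})}{P(f_1, \dots, f_n)} \\
  &\propto P(f_1, \dots, f_n, \text{rating}) \\
  &= P(\text{rating}) \prod_{i=1}^{n} P(f_i \mid \text{rating})
\end{align*}
```

The denominator is the same for every rating, so comparing ratings only needs the last line.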

The simplification above isn't the whole story: there are some complexities in the details. For example, while the strongest features would be the ones that never appear in one of the review classes, a Bayesian classifier can't work with P(feature | rating) = 0, as this would make the whole product above go to zero. To avoid that, there are smoothing techniques that can be applied. These techniques basically add a very small increment to the count of every feature (including zero-valued ones) so that there are no zero values, while the probability distribution essentially stays the same. Depending on the technique, the size of the increment can depend on the values in the probability distribution of P(feature, label) across all features for a specific label.
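
To make the zero-probability problem concrete, here is a minimal sketch in plain Python (not NLTK's code). The four toy reviews, the `posterior` function, and the parameter `k` are all made up for illustration. With `k=0` a single unseen word zeroes out every rating; with `k=0.5` every count gets a small increment. I picked 0.5 because, as far as I know, it matches the expected likelihood estimate that `NaiveBayesClassifier.train` uses by default, but that is worth checking against the NLTK version in use.

```python
# Hypothetical toy corpus: (set of words in the review, star rating).
reviews = [
    ({"great", "fast"}, 5),
    ({"great", "love"}, 5),
    ({"terrible", "slow"}, 1),
    ({"terrible", "refund"}, 1),
]
vocabulary = sorted(set().union(*(words for words, _ in reviews)))
ratings = sorted({rating for _, rating in reviews})

def posterior(words, k):
    """Return P(rating | words) for each rating, using add-k smoothing.

    Each vocabulary word is a True/False feature, so a feature has 2
    possible values and the smoothed estimate is (count + k) / (n + 2k).
    """
    scores = {}
    for rating in ratings:
        docs = [w for w, r in reviews if r == rating]
        score = len(docs) / float(len(reviews))       # prior P(rating)
        for word in vocabulary:
            present = sum(1 for d in docs if word in d)
            # count of training reviews that agree with this review on the word
            count = present if word in words else len(docs) - present
            score *= (count + k) / (len(docs) + 2.0 * k)
        scores[rating] = score
    total = sum(scores.values())                      # plays the role of P(features)
    return {r: (s / total if total else 0.0) for r, s in scores.items()}

# "great" never appears in a 1-star review and "slow" never in a 5-star one.
unsmoothed = posterior({"great", "slow"}, k=0)
smoothed = posterior({"great", "slow"}, k=0.5)
print(unsmoothed)
print(smoothed)
```

Without smoothing both ratings score zero, so the classifier has nothing to compare; with smoothing the five star rating comes out ahead on this toy data.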

Review data is awesome training data because there's lots of it, I can get it easily, and it's all been rated. I'm going to use NLTK's Bayesian classifier to help me distinguish between positive and negative reviews. I'll train the Bayesian classifier with one star and five star review data. This is a pretty simple, binary approach to review classification.

Feature Set Generation, Training, and Testing

To train and initially test the NLTK Bayesian classifier, I need to do the following:

Extract train and test data from my review data.
Encode train and test data.
Train the classifier with the encoded training data.
Test the classifier with the encoded test data.
Investigate errors during the test.
Modify the training set and repeat as needed.

I've written a helper method to generate training and test data:

def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
    # generate tuples of (textbag, rating)
    reviewList = [(self.textBagFromRawText(review.text), key)
                  for review in reviews.reviewsByRating[key]]
    # the first trainSetPercentage of the reviews train, the rest test
    splitIndex = int(trainSetPercentage * len(reviewList))
    return reviewList[:splitIndex], reviewList[splitIndex:]
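
The split itself is just a slice at a computed index. Here is a standalone sketch of that idea; `split_train_test` and the placeholder data are hypothetical names for illustration, not part of the repository:

```python
def split_train_test(items, train_fraction):
    """Split a list into (train, test) at the given fraction."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

# Hypothetical (word list, rating) tuples standing in for real reviews.
labeled = [(["review", str(i)], 5) for i in range(10)]
train, test = split_train_test(labeled, 0.8)
print(len(train), len(test))
```

Because the slice takes the first 80% in order, any ordering in the source data (by date, say) carries over into the split, so it may be worth shuffling first if that matters.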

The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(). In that method I create a list of words after splitting the text on sentence punctuation, lowercasing it, and removing stop words:

import re
from nltk.corpus import stopwords

def textBagFromRawText(self, rawText):
    '''
    @param rawText: a string of whitespace delimited text, 1..n sentences
    @return: the word tokens in the text, stripped of non text chars including punctuation
    '''
    rawTextBag = []
    # split on sentence-level punctuation, then on whitespace
    sentences = re.split(r'[.()?!&,]', rawText)
    for sentence in sentences:
        lowered = sentence.lower()
        parts = lowered.split()
        rawTextBag.extend(parts)
    # build the stop word set once rather than once per word
    stopWords = set(stopwords.words('english'))
    textBag = [w for w in rawTextBag if w not in stopWords]
    return textBag
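
Here is a small standalone sketch of the same tokenizing idea, to show what a review looks like after it has been turned into a text bag. The `text_bag` function and the tiny `STOP_WORDS` set are stand-ins I made up so the snippet runs without downloading NLTK's stopwords corpus; the real method uses stopwords.words('english').

```python
import re

# Assumption: a tiny hand-picked stop list instead of NLTK's English list.
STOP_WORDS = {"the", "a", "and", "i", "it", "was", "is", "to", "of"}

def text_bag(raw_text):
    """Lowercase the text, split on punctuation and whitespace, drop stop words."""
    words = []
    for chunk in re.split(r"[.()?!&,]", raw_text):
        words.extend(chunk.lower().split())
    return [w for w in words if w not in STOP_WORDS]

bag = text_bag("The shipping was slow, and I never got a refund!")
print(bag)
```

What survives is the handful of words that carry the review's meaning, which is what the classifier gets to work with.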

I generate test and training data for one and five star reviews using generateTestAndTrainingSetsFromReviews():

import random

# load helper objects
sjr = SiteJabberReviews(pageUrl, filename)
sjr.load()
asd = AnalyzeSiteData()

# 80/20 train/test split for one star and five star reviews
trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)

rawTrainingSetData = []
rawTrainingSetData.extend(trainingSet1)
rawTrainingSetData.extend(trainingSet5)
random.shuffle(rawTrainingSetData)

rawTestSetData = []
rawTestSetData.extend(testSet1)
rawTestSetData.extend(testSet5)
random.shuffle(rawTestSetData)

With training and test data built I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents. The presence (or absence) of those features in each document is what helps classify the document. I'm flagging those features as True if they are in the review text and False if they are not -- which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.

from nltk import FreqDist

# for raw Training Data, generate all words in the data
all_words = [w for (words, condition) in rawTrainingSetData for w in words]
fdTrainingData = FreqDist(all_words)

# take an arbitrary subset of these
# (NLTK 2.x returns keys() ordered by decreasing frequency; under NLTK 3,
# fdTrainingData.most_common(1000) returns the 1000 most frequent words)
defaultWordSet = list(fdTrainingData.keys())[:1000]

def emitDefaultFeatures(tokenizedText):
    '''
    @param tokenizedText: an array of text features
    @return: a feature map from that text.
    '''
    tokenizedTextSet = set(tokenizedText)
    featureSet = {}
    for text in defaultWordSet:
        featureSet['contains:%s' % text] = text in tokenizedTextSet
    return featureSet
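
To show what one encoded review looks like, here is a small sketch with a made-up three-word vocabulary. `word_set` and `emit_features` are illustrative stand-ins for defaultWordSet and emitDefaultFeatures above:

```python
# Hypothetical fixed vocabulary; in the post this role is played by defaultWordSet.
word_set = ["great", "slow", "refund"]

def emit_features(tokens, vocabulary=word_set):
    """Map a token list to {'contains:<word>': bool} over a fixed vocabulary."""
    token_set = set(tokens)
    return {"contains:%s" % word: word in token_set for word in vocabulary}

features = emit_features(["shipping", "slow", "never", "got", "refund"])
print(features)
```

Every review produces a dictionary with exactly the same keys, and only the True/False values differ. Words outside the vocabulary ("shipping", "never", "got") are simply ignored.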

That featureSet needs to be associated with the rating of the review, which I've already done during test set generation. The method that takes raw text to encoded feature set is here:

def encodeData(self, trainSet, encodingMethod):
    return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]

(Aside: I love list comprehensions!) With the encoding method defined, we can encode the training data and train the classifier as follows:

encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)

classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)

Once we have trained the classifier, we will test its accuracy against test data. As we already know the classification of the test data, accuracy is simple to calculate.

encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)

print(nltk.classify.accuracy(classifier, encodedTestSet))
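
For anyone who wants to try the train-and-test step without scraping anything, here is a toy run on hand-built feature sets. The `feats` helper and the toy data are made up for illustration; the only NLTK calls are NaiveBayesClassifier.train, classify, and nltk.classify.accuracy. It also checks the accuracy by hand, since accuracy is just the fraction of test items that get the right label.

```python
import nltk

# Hypothetical helper: build a feature dict with the same keys every time.
def feats(great, terrible):
    return {"contains:great": great, "contains:terrible": terrible}

# Hand-built (features, rating) pairs standing in for encoded reviews.
toy_train = [
    (feats(True, False), 5),
    (feats(True, False), 5),
    (feats(False, True), 1),
    (feats(False, True), 1),
]
toy_test = [(feats(True, False), 5), (feats(False, True), 1)]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Count correct predictions by hand and compare with NLTK's helper.
correct = sum(1 for f, rating in toy_test if toy_classifier.classify(f) == rating)
manual_accuracy = correct / float(len(toy_test))
print(manual_accuracy, nltk.classify.accuracy(toy_classifier, toy_test))
```

On data this clean both numbers are perfect, which real reviews obviously won't be.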

On my review data, the classifier's accuracy comes out at 0.83, meaning it labeled 83% of the test reviews correctly. That's pretty good, but I'm wondering if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?

It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000 features. If I go the other way and shorten the feature set to use the top 100 features only, the accuracy drops to 75%.

Summary

The number of features obviously plays a role in accuracy, but only up to a point. I wonder what happens if we start looking at removing features that could dilute accuracy. For the Bayesian classifier, those kinds of features would be ones that have close to the same probability in both good and bad reviews. I'm going to investigate whether this kind of feature grooming results in better performance, not only on the test set but on a larger set of data, in my next post.
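
As a rough sketch of what that grooming might look like (not something I have run against the review data), one option is to compare how likely a feature is to be present in five star versus one star reviews and drop the ones where the ratio is close to 1. The function names, the toy data, the 0.5 smoothing increment, and the cutoff of 2.0 are all assumptions picked for illustration:

```python
# Hypothetical encoded data: (feature dict, rating), same keys in every dict.
encoded = [
    ({"contains:great": True,  "contains:order": True,  "contains:refund": False}, 5),
    ({"contains:great": True,  "contains:order": False, "contains:refund": False}, 5),
    ({"contains:great": False, "contains:order": True,  "contains:refund": True}, 1),
    ({"contains:great": False, "contains:order": False, "contains:refund": True}, 1),
]

def presence_ratio(feature, data, k=0.5):
    """Ratio of smoothed P(feature present | 5 stars) to P(feature present | 1 star)."""
    def p_present(rating):
        docs = [f for f, r in data if r == rating]
        hits = sum(1 for f in docs if f[feature])
        return (hits + k) / (len(docs) + 2.0 * k)
    return p_present(5) / p_present(1)

def informative_features(data, min_ratio=2.0):
    """Keep features whose ratio is at least min_ratio in either direction."""
    kept = []
    for feature in sorted(data[0][0]):
        ratio = presence_ratio(feature, data)
        if max(ratio, 1.0 / ratio) >= min_ratio:
            kept.append(feature)
    return kept

kept = informative_features(encoded)
print(kept)
```

Here 'order' shows up equally often in both classes and is dropped, while 'great' and 'refund' are kept. NLTK's classifier also has a show_most_informative_features() method that prints a similar kind of ranking for a trained model.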

Comments:

Arun Jacob, January 30, 2014 at 8:33 PM
I had originally written an encoding function that encoded the word as the value because I thought that the frequency of the feature value was being counted. That resulted in an incredibly low score. I read through the training method code and realized that what is being counted for each feature are the permutations of the value across each label. Also, the feature set needs to be invariant across the training examples. Doing this gave me the 'acceptable' baseline of 82% accuracy.