In part 1 I spent time explaining my motivations for exploring online reviews and talked about getting the data with BeautifulSoup, then saving it with Pickle. Now that I have the raw text and the associated rating for a set of reviews, I want to see if I can leverage the text and the ratings to classify other review text. This is a bit of a detour from finding out 'why' people liked a specific site or not, but it was a very good learning process for me (that is still going on).
To do classification I'm going to stand on the shoulders of giants -- specifically the giants who wrote and maintain the NLTK package. In its own words, "NLTK -- the Natural Language Toolkit -- is a suite of open source Python modules, data sets and tutorials supporting research and development in natural language processing."
Brief Recap
I put together some code to download and save, then reload and analyze the data. I wanted to build a set of classes I could easily manipulate from the command prompt, so that I could explore the data interactively. The source is at https://github.com/arunxarun/reviewanalysis. To review, here is how I would download data for all reviews from a review site:
(Note: I've only implemented a 'scraper' for one site, http://sitejabber.com)
from sitejabberreviews import SiteJabberReviews
from analyzesitedata import AnalyzeSiteData
pageUrl = 'reviews/www.zulily.com'
filename = "siteyreviews.pkl"
sjr = SiteJabberReviews(pageUrl, filename)
sjr.download(True) # this saves the reviews to the file specified above
Once I've downloaded the data, I can always load it up from that file later:
pageUrl = 'reviews/www.zulily.com'
filename = "siteyreviews.pkl"
sjr = SiteJabberReviews(pageUrl, filename)
sjr.load()
Next Step: Bayesian Classification
NLTK comes with several built-in classifiers, including a Bayesian classifier. There are much better explanations of Bayes' theorem than I could possibly provide, but the basic idea as it applies to text classification is this: the occurrence of words across bodies of previously classified documents can be used to classify other documents into one of those classifications. The need for previously classified documents means that the Bayesian classifier is a supervised classifier: it must be trained with data that has already been classified.
This is a bastardized version of Bayes' theorem as it applies to determining the probability that a review has a specific rating given the features (words) in it:
P(review rating | features) = P(features, review rating)/P(features)
In other words, the probability that a review has a specific rating given its features is the joint probability of those features occurring with that rating, divided by the probability of the features themselves. Since the features are the words of the review, they are the same no matter what the rating is, so the denominator effectively 'drops out' when comparing ratings. The score for a given rating is then proportional to the multiplied probabilities of the review's words as observed in previously rated documents with that rating:
P(review rating | features) ∝ P(features, review rating)
This isn't completely true: there are some complexities in the details. For example, while the strongest features would be the ones that appear in only one of the review classes, a Bayesian classifier can't work with P(feature) = 0, since a single zero would drive the whole product above to zero. To avoid that, smoothing techniques are applied. These techniques add a very small increment to the count of every feature (including the zero-count ones) so that no probability is exactly zero, while leaving the probability distribution essentially unchanged. The size of the increment depends on the values in the probability distribution of P(feature, label) for all features for a specific label.
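To make the smoothing idea concrete, here's a toy sketch (my own illustration with made-up word bags, not code from the review project) of add-one smoothing: every word probability gets a small boost so that a single unseen word can't zero out the product.

from collections import Counter

# two tiny hypothetical word bags standing in for rated reviews
oneStarWords = ['broken', 'refund', 'slow', 'refund']
fiveStarWords = ['love', 'fast', 'great', 'love', 'fast']
counts = {1: Counter(oneStarWords), 5: Counter(fiveStarWords)}
vocabulary = set(oneStarWords) | set(fiveStarWords)

def smoothedProb(word, rating, k=1.0):
    # P(word | rating) with add-k smoothing: never zero, distribution barely changes
    c = counts[rating]
    return (c[word] + k) / (sum(c.values()) + k * len(vocabulary))

def score(words, rating):
    # the naive Bayes product of per-word probabilities for this rating
    p = 1.0
    for w in words:
        p *= smoothedProb(w, rating)
    return p

# the 1-star score wins, even though 'slow' never appears in the 5-star bag
print(score(['refund', 'slow'], 1), score(['refund', 'slow'], 5))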
Review data is awesome training data because there's lots of it, it's easy to get, and it's all been rated. I'm going to use NLTK's Bayesian classifier to distinguish between positive and negative reviews by training it with one star and five star review data. This is a pretty simple, binary approach to review classification.
Feature Set Generation, Training, and Testing
To train and initially test the NLTK Bayesian classifier, I need to do the following:
- Extract train and test data from my review data.
- Encode the train and test data.
- Train the classifier with the encoded training data.
- Test the classifier with the encoded test data.
- Investigate errors during the test.
- Modify the training set and repeat as needed.
def generateTestAndTrainingSetsFromReviews(self, reviews, key, trainSetPercentage):
    # generate tuples of (textbag, rating) for every review with the given rating
    reviewList = [(self.textBagFromRawText(review.text), key)
                  for review in reviews.reviewsByRating[key]]
    # split into training and test portions
    splitIndex = int(trainSetPercentage * len(reviewList))
    return reviewList[:splitIndex], reviewList[splitIndex:]
The generateTestAndTrainingSetsFromReviews() method calls textBagFromRawText(), which splits the raw text into sentences, strips punctuation and stop words, and returns an array of words:
# requires: import re, and from nltk.corpus import stopwords
def textBagFromRawText(self, rawText):
    '''
    @param rawText: a string of whitespace delimited text, 1..n sentences
    @return: the word tokens in the text, stripped of non text chars including punctuation
    '''
    rawTextBag = []
    # split on sentence boundaries and punctuation
    sentences = re.split(r'[\.\(\)?!&,]', rawText)
    for sentence in sentences:
        lowered = sentence.lower()
        parts = lowered.split()
        rawTextBag.extend(parts)
    # drop English stop words
    textBag = [w for w in rawTextBag if w not in stopwords.words('english')]
    return textBag
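For example (the exact output depends on NLTK's English stop word list, so treat this as illustrative):

asd = AnalyzeSiteData()
print(asd.textBagFromRawText("I love this site. Fast shipping, great prices!"))
# roughly: ['love', 'site', 'fast', 'shipping', 'great', 'prices']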
I generate test and training data for one and five star reviews using generateTestAndTrainingSetsFromReviews():
# load helper objects (random.shuffle below also needs: import random)
sjr = SiteJabberReviews(pageUrl, filename)
sjr.load()
asd = AnalyzeSiteData()

trainingSet1, testSet1 = asd.generateTestAndTrainingSetsFromReviews(sjr, 1, 0.8)
trainingSet5, testSet5 = asd.generateTestAndTrainingSetsFromReviews(sjr, 5, 0.8)

rawTrainingSetData = []
rawTrainingSetData.extend(trainingSet1)
rawTrainingSetData.extend(trainingSet5)
random.shuffle(rawTrainingSetData)

rawTestSetData = []
rawTestSetData.extend(testSet1)
rawTestSetData.extend(testSet5)
random.shuffle(rawTestSetData)

With the training and test data built, I need to encode features with their associated ratings. For the Bayesian classifier, I need to encode the same set of features across multiple documents; the presence (or absence) of those features in each document is what helps classify it. I'm flagging each feature as True if it is in the review text and False if it is not, which allows the classifier to build up feature frequency across the entire corpus and calculate the feature frequency per review type.
# for the raw training data, generate the list of all words in the data
# (FreqDist is nltk's frequency distribution: from nltk import FreqDist)
all_words = [w for (words, condition) in rawTrainingSetData for w in words]
fdTrainingData = FreqDist(all_words)
# take an arbitrary subset of these as the feature vocabulary
defaultWordSet = fdTrainingData.keys()[:1000]
def emitDefaultFeatures(tokenizedText):
    '''
    @param tokenizedText: an array of text features
    @return: a feature map from that text.
    '''
    tokenizedTextSet = set(tokenizedText)
    featureSet = {}
    # flag each word in the fixed vocabulary as present/absent in this review
    for text in defaultWordSet:
        featureSet['contains:%s' % text] = text in tokenizedTextSet
    return featureSet
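As a quick illustration (the exact keys depend on which 1000 words ended up in defaultWordSet), a single tokenized review becomes a map of presence flags over that fixed vocabulary:

sampleFeatures = emitDefaultFeatures(['love', 'fast', 'shipping'])
# e.g. {'contains:love': True, 'contains:refund': False, ...} -- one entry per word in defaultWordSet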
That featureSet needs to be associated with the rating of the review, which I've already done during test and training set generation. The method that maps tokenized text to an encoded feature set, carrying the rating along, is here:
def encodeData(self, trainSet, encodingMethod):
    return [(encodingMethod(tokenizedText), rating) for (tokenizedText, rating) in trainSet]
(Aside: I love list comprehensions!) With the encoding method in place, we can encode the training data and train the classifier as follows:
encodedTrainSet = asd.encodeData(rawTrainingSetData, emitDefaultFeatures)
classifier = nltk.NaiveBayesClassifier.train(encodedTrainSet)
Once we have trained the classifier, we will test its accuracy against the test data. Since we already know the classification of the test data, accuracy is simple to calculate:
encodedTestSet = asd.encodeData(rawTestSetData, emitDefaultFeatures)
print nltk.classify.accuracy(classifier, encodedTestSet)
This gives me an accuracy of 0.83, meaning I will be correct 83% of the time. That's pretty good, but I wonder if I can do better. I picked an arbitrary set of features (the first 1000): what happens if I use all of the approximately 3000 words in the reviews as features?
It turns out that I get the same level of accuracy (83%) with 3000 features as I do with 1000. If I go the other way and shorten the feature set to the top 100 features only, the accuracy drops to 75%.
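Here's a sketch of how that comparison can be re-run for several feature set sizes. It reuses the objects defined above; emitFeaturesFor() is a hypothetical helper I'm introducing just for this experiment. (Note that in newer NLTK versions FreqDist.keys() is no longer frequency ordered, so fdTrainingData.most_common(size) would be the safer way to take a frequency-ordered slice.)

def emitFeaturesFor(wordSet):
    # build an encoder closed over a fixed vocabulary, like emitDefaultFeatures
    def emit(tokenizedText):
        tokens = set(tokenizedText)
        return dict(('contains:%s' % w, w in tokens) for w in wordSet)
    return emit

for size in (100, 1000, 3000):
    encoder = emitFeaturesFor(list(fdTrainingData.keys())[:size])
    encodedTrain = asd.encodeData(rawTrainingSetData, encoder)
    encodedTest = asd.encodeData(rawTestSetData, encoder)
    sizedClassifier = nltk.NaiveBayesClassifier.train(encodedTrain)
    print("%d features: %.2f" % (size, nltk.classify.accuracy(sizedClassifier, encodedTest)))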
I had originally written an encoding function that encoded the word itself as the feature value, because I thought the frequency of the feature value was being counted. That resulted in an incredibly low score. I read through the training method code and realized that what is being counted for each feature are the permutations of its value across each label, and that the feature set needs to be invariant across the training examples. Fixing that gave me the 'acceptable' baseline of 82% accuracy.
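One way to dig into the remaining errors is to look at which features the trained classifier considers most informative, and at the test reviews it gets wrong. The loop below is my own addition, not from the original code; it assumes encodedTestSet and rawTestSetData are in the same order, which holds because encodeData() preserves order.

# which presence/absence flags most strongly separate 1-star from 5-star reviews
classifier.show_most_informative_features(20)

# print the misclassified test reviews alongside the expected rating
for i, (featureSet, rating) in enumerate(encodedTestSet):
    guess = classifier.classify(featureSet)
    if guess != rating:
        print("expected %s, got %s: %s ..." % (rating, guess, ' '.join(rawTestSetData[i][0])[:80]))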