Saturday, January 4, 2014

Making Sense of Unstructured Text in Online Reviews, Part 1

Introduction

I just returned from a meticulously researched vacation to a small fishing village an hour north of Cabo San Lucas, Mexico. The main reason we had such a great time was the amount of up-front research we put into finding the right places to stay: we researched the hell out of them via TripAdvisor reviews.

After reading hundreds of reviews, it occurred to me that if I were running a hotel, I would want to know why people liked me and why they didn't. I would want to rank their likes and dislikes by type and magnitude, and make business decisions about whether to address them. I would also be interested in whether the same kinds of issues (focusing on the dislikes here) grew or abated over time.

I could say the same thing about e-commerce sites. If I were in the business of selling something, and the buyer really didn't like how the transaction went, I'd want to know what they didn't like and how many other people felt the same way, so I could respond in a way that reduces customer dissatisfaction.

One nice thing about reviews is that they come with a quantitative summary: a rating. Every block of review text on a review site maps to a rating. This is great because it allows me to pre-categorize text. It's free training data!
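To make that concrete: each scraped review becomes a (text, rating) pair, which is exactly the shape supervised learning expects. A tiny sketch (the review texts here are invented for illustration):

    # each review arrives pre-labeled with its star rating -- no manual
    # annotation required (texts below are invented for illustration)
    labeledReviews = [
        ("Order arrived two weeks late and support never replied.", 1),
        ("Fast shipping, item exactly as described.", 5),
    ]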

I've broken this effort into two-plus phases: getting the data, analyzing/profiling the data, and to-be-determined next steps. I'm very sure I need to get the data, I'm pretty sure I can take some first steps at profiling it, and from there on out it gets hazy. I know I want to determine why people like or don't like a site, but I don't have a very clear way to get there yet. Consider that a warning :)

Phase 1: Getting The Data

I had been out of the screen-scraping loop for a while. I had heard of BeautifulSoup, the Python HTML parsing library, but I had never used it, and I thought I was in for a long night of toggling between my editor and the documentation. Boy, was I wrong: I had data flowing in 30 minutes. Beautiful Soup is the easy button as far as web scraping is concerned.

Here is the bulk of the logic I used to pull pagination data and then navigate to the review pages on sitejabber.com (I'm focusing on e-commerce sites first):

    # requires: import re, urllib2; from bs4 import BeautifulSoup

    # First, get the pages we need to navigate in order to collect
    # all reviews for this site.
    page = urllib2.urlopen(self.pageUrl)
    soup = BeautifulSoup(page)

    pageNumDiv = soup.find('div', {'class': 'page_numbers'})
    anchors = pageNumDiv.find_all('a')

    # the page we started on, plus every page linked from the pagination div
    urlList = [self.pageUrl]
    for anchor in anchors:
        urlList.append(self.base + anchor['href'])

    # with all pages known, pull each one down and extract review text and rating
    for url in urlList:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page)
        divs = soup.find_all('div', id=re.compile('ReviewRow-.*'))

        for div in divs:
            text = div.find('p', id=re.compile('ReviewText-.*')).text
            rawRating = div.find(itemprop='ratingValue')['content']

Note the need to download the page first: I used urllib2.urlopen() to get the page, then created a BeautifulSoup representation of it:

    soup = BeautifulSoup(page)
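One small note: when no parser is named, Beautiful Soup picks whichever one it finds installed. Passing one explicitly (here the standard library's) keeps behavior consistent across machines:

    soup = BeautifulSoup(page, 'html.parser')  # explicit parser choice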

Once I had that, it was a matter of finding what I needed. I used find() and find_all() to get to the elements I wanted. Any element returned is itself searchable, and there are different ways to access its attributes:

    for div in divs:
        text = div.find('p', id=re.compile('ReviewText-.*')).text
        rawRating = div.find(itemprop='ratingValue')['content']

The .text property above retrieves the inner text of any element. Element attributes are accessed as keys on the element, like the 'content' one above. The rawRating value was actually pulled from a meta tag nested inside the ReviewRow div:

    <meta itemprop="ratingValue" content="1.0"/>

find() and find_all() are very powerful; a lot more detail and power is covered in the documentation. They can search by element ID, by specific attributes (the itemprop attribute above is one example), and regexes can be used to match multiple elements.
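To show those search modes side by side, here's a small self-contained sketch (the HTML is invented, modeled on the review markup above):

    import re
    from bs4 import BeautifulSoup

    html = """
    <div id="ReviewRow-1">
        <p id="ReviewText-1">Great service, fast shipping.</p>
        <meta itemprop="ratingValue" content="5.0"/>
    </div>
    """
    soup = BeautifulSoup(html)

    row = soup.find('div', id='ReviewRow-1')                     # exact id
    rows = soup.find_all('div', id=re.compile('ReviewRow-.*'))   # regex id match
    rating = soup.find(itemprop='ratingValue')['content']        # attribute search, key access
    text = row.find('p').text                                    # inner text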

Crawling all of that data is fun but time-consuming. I stored the review text and rating in a wrapper class, mapped by rating into a reviewsByRating map:

    for div in divs:
        text = div.find('p', id=re.compile('ReviewText-.*')).text
        rawRating = div.find(itemprop='ratingValue')['content']

        r = Review(text, rawRating)

        # group reviews into buckets keyed by rating
        if r.rating in self.reviewsByRating:
            reviews = self.reviewsByRating[r.rating]
        else:
            reviews = []
            self.reviewsByRating[r.rating] = reviews

        reviews.append(r)
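The Review wrapper itself isn't shown above; here is a minimal sketch of what it could look like, assuming its only job is to hold the text and parse the raw rating string (e.g. "1.0") into a number usable as a map key:

    class Review(object):
        # hypothetical reconstruction -- holds one review's text and its
        # rating, parsed from the raw string pulled from the meta tag
        def __init__(self, text, rawRating):
            self.text = text
            self.rating = int(float(rawRating))  # '1.0' -> 1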

I then flushed that map to disk using pickle:

    def saveToDisk(self):
        with open(self.filename, 'w') as f:
            pickle.dump(self.reviewsByRating, f)

This let me load the data from a file without having to scrape it again:

    def load(self):
        with open(self.filename, 'r') as f:
            self.reviewsByRating = pickle.load(f)

The next step will be to start investigating the data.
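As a first sanity check for phase 2, something like this hypothetical method (assuming the reviewsByRating map built above) would show how the reviews distribute across ratings:

    def profile(self):
        # count of reviews and average review length per rating bucket
        for rating in sorted(self.reviewsByRating.keys()):
            reviews = self.reviewsByRating[rating]
            avgLen = sum(len(r.text) for r in reviews) / float(len(reviews))
            print rating, len(reviews), avgLen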
