Thursday, July 12, 2012

Calculating Conditional Probability with MapReduce part 4: Applying Correlation and Independence to get better results

In my last post I discussed using MapReduce to calculate item-to-item conditional probabilities: in other words, the probability that you will click on page B given that you have clicked on page A.

In this post I'm going to discuss how to determine whether a user click on page A is independent of a user click on page B. I went over this in the first post, but it's worth revisiting, because taking the theory literally leads to confusion about why it is relevant for recommending products or content.

Independence and Correlation

Some review of the theory using common sense: if I roll two dice together, it's obvious that the probability of getting a 6 on the first die is independent of the probability of getting a 6 on the second die. If, on the other hand, I have 4 blue balls and 4 red balls in a drawer, and I retrieve a blue ball the first time I reach for a ball, the probability of pulling out a red ball next has gone up, because now there are 3 blue balls and 4 red balls left. In the first scenario,

P(roll a 6 with die A | roll a 6 with die B) = P(roll a 6 with die A)

in the second: 

P(red ball | blue ball) = # of red balls / (# of red balls + # of blue balls left).

Since I've reduced the number of blue balls, P(red | blue) = 4/7 ≈ 0.57 is greater than the original P(red) = 4/8 = 0.5: I'm more likely to pick a red ball if I've picked a blue ball first.

How does that translate into whether to recommend page B given the user has viewed page A? URIs aren't balls in a drawer, and the theory of independent vs. dependent events doesn't seem to apply to page views...until you consider that the implications of the independence test actually make sense when deciding whether to recommend page B while the user is viewing page A.

Given that we know P(B | A) and P(B), and we know that
  1. for independent events, P(B | A) = P(B)
  2. for negatively correlated events, P(B | A) < P(B)
then any calculated P(B | A) at or below P(B) means that recommending page B while the user is on page A is likely to disengage the user, because viewers are at least as likely to view page B without viewing page A as they are to view page B after viewing page A. For example, if 2% of all visitors view page B but only 1% of the visitors who viewed page A go on to view page B, then P(B | A) = 0.01 < P(B) = 0.02, and B makes a poor recommendation from A.

Calculating Probabilities Again

From above, we need to calculate P(B | A) and P(B) to determine if B should be recommended when the user is viewing A. We have P(B | A) from the last post: the output of our last reducer was

    uriX     uriY=P(Y | X) uriZ=P(Z | X)...


As for calculating the raw probabilities, we can do this by taking the output of our group-by-user MapReduce job from the second post, which looked like

   user1  pageA:num_views_A_by_user1  pageB:num_views_B_by_user1...

A map job that emits a key:value pair for each uri:count field on every line of that output, and a reduce job that groups by the page uri key and sums the counts, would give a total view count per page.

public static class AggregateViewCountsMapper extends
        Mapper<LongWritable, Text, Text, LongWritable> {

    // AggregateViewCounters (a counter enum) and UNIQUE_KEY are assumed to be
    // defined in the enclosing job class

    @Override
    public void map(final LongWritable key, final Text value,
            final Context context) {
        // assume output of GroupViewsByUser:
        // user1\turi1:count1\turi2:count2...\turiN:countN

        String raw = value.toString();
        String fields[] = raw.split("\t");

        long totalCount = 0;
        for (int i = 1; i < fields.length; i++) {

            String rawFieldData = fields[i];
            String tokens[] = rawFieldData.split(":");

            if (tokens.length != 2) {
                context.getCounter(AggregateViewCounters.BAD_URI_SUBFORMAT)
                        .increment(1);
                continue;
            }
            // we care about both tokens: the page uri and its view count

            try {
                String count = tokens[1].replaceAll("\\s", "");
                long actualCount = Long.parseLong(count);
                context.write(new Text(tokens[0]), new LongWritable(actualCount));
                totalCount += actualCount;
            } catch (IOException e) {
                context.getCounter(
                        AggregateViewCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                        .increment(1);
                e.printStackTrace();
            } catch (InterruptedException e) {
                context.getCounter(
                        AggregateViewCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                        .increment(1);
                e.printStackTrace();
            }
        }

        // write this line's total to a guaranteed unique key so the reducer
        // can accumulate a grand total of all page views
        try {
            context.write(new Text(UNIQUE_KEY), new LongWritable(totalCount));
        } catch (IOException e) {
            context.getCounter(
                    AggregateViewCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                    .increment(1);
            e.printStackTrace();
        } catch (InterruptedException e) {
            context.getCounter(
                    AggregateViewCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                    .increment(1);
            e.printStackTrace();
        }
    }
}


A special key (UNIQUE_KEY above) is written with each user's total; since it can't collide with any page uri, the reducer will sum those values into the total view count across all pages.

The reducer is a simple aggregator (sketched below); the final output would look like this:

pageA  total_view_A_ct
pageB  total_view_B_ct
....
UNIQUE_KEY total_view_all_pages_ct

The number of lines of output is equal to the total number of pages in your candidate set + 1.
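
Here's a minimal sketch of what that reducer could look like, assuming the mapper above and the usual Hadoop imports (the class name is illustrative, not taken from the earlier posts):

public static class AggregateViewCountsReducer extends
        Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(final Text key, final Iterable<LongWritable> values,
            final Context context) throws IOException, InterruptedException {
        // sum every partial count emitted for this page uri
        // (or, for UNIQUE_KEY, every per-user total)
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}

Because it only sums, the same class could also be registered as the job's combiner to cut down on shuffle traffic.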

Notes on Final Calculations

I'm leaving this one as a mental exercise:
If the data set is small (< 10k pages), you could write a script to calculate the probability of each page A..N by dividing

total views of pageA / total views of all pages (the UNIQUE_KEY count)
..
total views of pageN / total views of all pages

With those counts in place, you can now build a model of views and keep each item B (as a recommendation for item A) if

   P(B | A) > P(B)

Note that building the model consists of getting the item:item output, and filtering items as above.
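
For the small case, a sketch of that filtering step might look like the following, where loadViewCounts() and loadItemItem() are hypothetical helpers that read the two reducer outputs described above into memory:

// pageX -> total_view_X_ct, plus the UNIQUE_KEY grand total
Map<String, Long> viewCounts = loadViewCounts();
long totalViews = viewCounts.get(UNIQUE_KEY);
// pageA -> {pageB -> P(B | A)}
Map<String, Map<String, Double>> conditionals = loadItemItem();

Map<String, Map<String, Double>> model = new HashMap<String, Map<String, Double>>();
for (Map.Entry<String, Map<String, Double>> row : conditionals.entrySet()) {
    Map<String, Double> kept = new HashMap<String, Double>();
    for (Map.Entry<String, Double> candidate : row.getValue().entrySet()) {
        // every page in the item:item output also has a raw view count by construction
        double pB = (double) viewCounts.get(candidate.getKey()) / totalViews;
        // keep page B as a recommendation for page A only if P(B | A) > P(B)
        if (candidate.getValue() > pB) {
            kept.put(candidate.getKey(), candidate.getValue());
        }
    }
    model.put(row.getKey(), kept);
}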

If the data set is larger, and the overall size of the set spans several blocks, then processing the results via MapReduce becomes cumbersome if you don't know the total view count of all pages. I think this last step requires storing the raw probabilities in a lookup table, then making a pass over the conditional probabilities and filtering the lists by comparing P(B | A) to P(B).
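
If it comes to that, a rough sketch of the filtering pass might look like the mapper below. It assumes, hypothetically, that the raw probabilities have already been written out as uri<TAB>probability lines and shipped to every node (for example via the distributed cache), with the local path supplied under a made-up raw.probabilities.path configuration key:

public static class FilterByIndependenceMapper extends
        Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, Double> rawProbabilities = new HashMap<String, Double>();

    @Override
    protected void setup(final Context context) throws IOException {
        // load the page -> P(page) lookup table built in the previous step
        String path = context.getConfiguration().get("raw.probabilities.path");
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length == 2) {
                    rawProbabilities.put(fields[0], Double.parseDouble(fields[1]));
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    public void map(final LongWritable key, final Text value, final Context context)
            throws IOException, InterruptedException {
        // input: uriX\turiY=P(Y | X)\turiZ=P(Z | X)...
        String[] fields = value.toString().split("\t");
        StringBuilder kept = new StringBuilder();
        for (int i = 1; i < fields.length; i++) {
            String[] tokens = fields[i].split("=");
            if (tokens.length != 2) {
                continue;
            }
            Double pB = rawProbabilities.get(tokens[0]);
            double pBGivenA = Double.parseDouble(tokens[1]);
            // keep page B only if viewing page A makes viewing page B more likely
            if (pB != null && pBGivenA > pB) {
                kept.append('\t').append(fields[i]);
            }
        }
        context.write(new Text(fields[0]), new Text(kept.toString().trim()));
    }
}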

How to build the model and store the data above for more optimal retrieval at model build time is definitely a topic for a completely new blog post :)