In my last post I discussed using mapreduce to calculate item-to-item conditional probabilities: in other words, the probability that you will click on page B given that you have clicked on page A.
In this post I'm going to discuss how to determine whether a user's click on page A is independent of a user's click on page B. I went over this in the first post, but it's worth revisiting, because taking the theory literally leads to confusion about why it is relevant for recommending products or content.
Independence and Correlation
Some review of the theory, using common sense: if I roll two dice together, it's obvious that the probability of getting a 6 on the first die is independent of the probability of getting a 6 on the second die. If, on the other hand, I have 4 blue balls and 4 red balls in a drawer, and I retrieved a blue ball the first time I reached for a ball, the probability of pulling out a red ball has gone up, because now there are 3 blue balls and 4 red balls left. In the first scenario,

P(roll a 6 with die A | roll a 6 with die B) = P(roll a 6 with die A)
in the second:
P(red ball | blue ball) = # of red balls / (# of red balls + # of blue balls left)
Given that I've reduced the number of blue balls, P(red | blue) is greater than P(red) in the original case: I'm more likely to pick a red ball if I've picked a blue ball first.
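With the numbers above: P(red) = 4/8 = 0.5 before any draw, but after a blue ball is removed, P(red | blue) = 4/7 ≈ 0.57. The first draw changed the odds of the second draw, which is exactly what it means for the two events to be dependent.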
How does that translate into whether to recommend page B given the user has viewed page A? URIs aren't balls in a drawer, and the theory of independent vs dependent events doesn't literally apply to page views...until you consider that the implications of the independence test actually make sense when deciding whether to recommend page B while the user is viewing page A.
Given that we know P(B | A) and P(B), and we know that
- for independent events P(B | A) = P(B)
- for negatively correlated events P(B | A) < P(B)
- for positively correlated events P(B | A) > P(B)
then any calculated P(B | A) that is not significantly greater than P(B) means that recommending page B while the user is on page A is likely to disengage the user, because viewers are at least as likely to view page B without viewing page A as they are to view page B after viewing page A.
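As a made-up illustration: if 10% of all sessions include a view of page B, but only 2% of sessions that view page A go on to view page B, then P(B | A) = 0.02 < P(B) = 0.1. Page B is popular overall, but it's a poor recommendation from page A specifically.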
Calculating Probabilities Again
From above, we need to calculate P(B | A) and P(B) to determine if B should be recommended when the user is viewing A. We have P(B | A) from the last post: the output of our last reducer was
uriX uriY=P(Y | X) uriZ=P(Z | X)...
As for calculating raw probabilities, we can do this by taking the output from our group-by-user mapreduce from the second post, which looked like
user1 pageA:num_views_A_by_user1 pageB:num_views_B_by_user1...
A map job that submitted key:value pairs from each line of that output, and a reduce job that grouped by page uri and summed the counts, would give a total count of page views per page. The mapper could look something like this:
// Assumes the standard Hadoop imports: java.io.IOException,
// org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
// and org.apache.hadoop.mapreduce.Mapper. AggregateViewCounters is an
// enum of job counters, and UNIQUE_KEY is a constant that cannot
// collide with any page uri.
public static class AggregateViewCountsMapper extends
    Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    public void map(final LongWritable key, final Text value,
        final Context context) {
        // assume output of GroupViewsByUser:
        // user1\turi1:count1\turi2:count2...\turiN:countN
        String raw = value.toString();
        String fields[] = raw.split("\t");
        long totalCount = 0;
        for (int i = 1; i < fields.length; i++) {
            String rawFieldData = fields[i];
            String tokens[] = rawFieldData.split(":");
            if (tokens.length != 2) {
                context.getCounter(AggregateViewCounters.BAD_URI_SUBFORMAT).increment(1);
                continue;
            }
            // we care about both tokens: tokens[0] is the uri, tokens[1] the count
            try {
                String count = tokens[1].replaceAll("\\s", "");
                long actualCount = Long.parseLong(count);
                context.write(new Text(tokens[0]), new LongWritable(actualCount));
                totalCount += actualCount;
            } catch (NumberFormatException e) {
                context.getCounter(AggregateViewCounters.BAD_URI_SUBFORMAT).increment(1);
            } catch (IOException e) {
                context.getCounter(
                    AggregateViewCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                    .increment(1);
                e.printStackTrace();
            } catch (InterruptedException e) {
                context.getCounter(
                    AggregateViewCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                    .increment(1);
                e.printStackTrace();
            }
        }
        // write total counts to a guaranteed unique key
        try {
            context.write(new Text(UNIQUE_KEY),
                new LongWritable(totalCount));
        } catch (IOException e) {
            context.getCounter(
                AggregateViewCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
                .increment(1);
            e.printStackTrace();
        } catch (InterruptedException e) {
            context.getCounter(
                AggregateViewCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
                .increment(1);
            e.printStackTrace();
        }
    }
}
A special key (UNIQUE_KEY above) collects each user's total view count; summing it in the reducer gives the total number of views across all pages.
The reducer is a simple aggregator; the final output would look like this:
pageA total_view_A_ct
pageB total_view_B_ct
....
UNIQUE_KEY total_view_all_pages_ct
The number of lines of output is equal to the total number of pages in your candidate set, plus one line for the UNIQUE_KEY total.
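For completeness, here is a minimal sketch of that aggregating reducer, under the same assumptions as the mapper above (an outline, not production code):

public static class AggregateViewCountsReducer extends
    Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(final Text key, final Iterable<LongWritable> values,
        final Context context) throws IOException, InterruptedException {
        // key is either a page uri or UNIQUE_KEY (the all-pages total)
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}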
Notes on Final Calculations
I'm leaving this one as a mental exercise: if the data set is small (< 10k pages), you could write a script to calculate the probabilities of pages A..N by dividing
total views of pageA / total view count of all pages
...
total views of pageX / total view count of all pages
With those counts in place, you can now build a model of views and keep each item B as a recommendation for item A if
P(B | A) > P(B)
Note that building the model consists of getting the item:item output, and filtering items as above.
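To make that concrete, here's a rough sketch of the filtering step as a small standalone program. The file names are hypothetical, and I'm assuming both outputs fit comfortably in memory, which only holds in the small data set case:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class FilterRecommendations {
    public static void main(String[] args) throws IOException {
        // load "pageX<TAB>total_view_ct" lines; UNIQUE_KEY holds the grand total
        Map<String, Long> counts = new HashMap<>();
        long totalViews = 0;
        for (String line : Files.readAllLines(Paths.get("view_counts.tsv"))) {
            String[] fields = line.split("\t");
            long count = Long.parseLong(fields[1]);
            if ("UNIQUE_KEY".equals(fields[0])) {
                totalViews = count;
            } else {
                counts.put(fields[0], count);
            }
        }
        // raw probability of a page: total views of page / total views of all pages
        Map<String, Double> pRaw = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            pRaw.put(e.getKey(), (double) e.getValue() / totalViews);
        }
        // scan "uriX<TAB>uriY=P(Y|X)<TAB>uriZ=P(Z|X)..." and keep only
        // the items where P(B | A) > P(B)
        for (String line : Files.readAllLines(Paths.get("conditional_probs.tsv"))) {
            String[] fields = line.split("\t");
            StringBuilder kept = new StringBuilder(fields[0]);
            for (int i = 1; i < fields.length; i++) {
                String[] pair = fields[i].split("=");
                double pBgivenA = Double.parseDouble(pair[1]);
                Double pB = pRaw.get(pair[0]);
                if (pB != null && pBgivenA > pB) {
                    kept.append('\t').append(fields[i]);
                }
            }
            System.out.println(kept);
        }
    }
}

Everything this keeps is a pair where viewing A makes a view of B more likely than its baseline, which is exactly the P(B | A) > P(B) filter described above.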
If the data set is larger, and the overall size of the set spans several blocks, then processing the results via mapreduce becomes cumbersome if you don't know the total view count across all pages. I think this last step requires storing the raw probabilities in a lookup table, then processing the conditional probabilities and filtering the lists by comparing P(B | A) to P(B).
How to build the model and store the data above for a more optimal retrieval during model build time is definitely a completely new blog post :)