Sunday, June 10, 2012

Calculating Conditional Probability via Mapreduce part 3: item to item mappings

In my last post I gave an example of code that grouped views by user ID. In this post I'm going to show how to take the output of that mapreduce and generate item to item mappings that show the conditional probability of URIs being clicked on, given that a specific URI was clicked on.

The output from the views-by-user-ID reducer had the key, followed by a colon, followed by tab delimited URIs and counts (colon separated).

user1\turiX:3\turiY:4\turiZ:12....
user2\turiW:5\turiQ:5
...

We want to generate output that shows the following: given that the user has clicked on X, how many times did they click on Y vs Z, or P? In this output, we are not concerned with specific users, but user behavior in general. We want to sum up general conditional probabilities.

In the mapper, our input is as above: a more concrete example is :

user1\thttp://foo.com/page1:12\thttp://foo.com/page2:23\thttp://foo.com/page3:3

In the mapper we generate permutations of the different key pairs (and ignore the counts for now). In the example above we would generate:

key=>value
http://foo.com/page1=>http://foo.com/page2
http://foo.com/page1=>http://foo.com/page3
http://foo.com/page2=>http://foo.com/page1
http://foo.com/page2=>http://foo.com/page3
http://foo.com/page3=>http://foo.com/page1
http://foo.com/page3=>http://foo.com/page2

Here is the mapper code to generate the permutations
public static class ItemToItemGroupingMapper extends
   Mapper {

  @Override
  public void map(final LongWritable key, final Text value,
	final Context context) {
        // this job takes the input from GroupViewsByUser
	// user1\turiX:3\turiY:4\turiZ:12
	String raw = value.toString();
			
	String fields[] = raw.split("\t");

	if(fields.length == 1) {
		context.getCounter(ItemToItemCounters.BAD_MAP_INPUT_FORMAT)
		.increment(1);
		return;
	}
	// fields[0] is the user ID, we skip that for item-item co-occurrence.
	for(int i = 1; i < fields.length;i++) {
		// permute all combinations of co-occurrences
				
		for(int j = 1;j < fields.length;j++) {
			if(j == i) {
				continue; // dont count self co-occurrences
			}
			else {
		 	    String partsI[] = fields[i].split(GroupViewsByUser.INTERNAL_DELIMITER);
			    String partsJ[] = fields[j].split(GroupViewsByUser.INTERNAL_DELIMITER);
						
			    if(partsI.length != 2 || partsJ.length != 2) {
				context.getCounter(ItemToItemCounters.BAD_MAP_INPUT_FORMAT)
				.increment(1);
				return;
			}
		        try {
			    LOGGER.info(partsI[0]+"=>"+partsJ[0]);
			    context.write(new Text(partsI[0]),new Text(partsJ[0]));
			} catch (IOException e) {
			    context.getCounter(ItemToItemCounters.MAP_CONTEXT_WRITE_IO_EXCEPTION)
				.increment(1);
			    e.printStackTrace();
			} catch (InterruptedException e) {
			    context.getCounter(ItemToItemCounters.MAP_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
				.increment(1);
			    e.printStackTrace();
			}
		}
					
					
	}
  }
}

In the reducer, these pairs would show up and we can aggregate the counts of each URI. Again, we're not factoring in a normalized weighting at map time. In other words, from the example above, we're not factoring in the 12 hits of page1 vs the 23 hits of page2 when doing co-occurrences, because, as noted above, we're ignoring counts and focusing on co-occurrences

We could do some calculations on these pairs at reduce time -- for a specfic URI X, we would sum total co-occurrences per unique URI Y and divide by total co-occurrences of all URIs to get a number that represented the probability of the total co-occurrences that X and Y made up. We could then reverse sort the URIs by co-occurrence probability to produce output like this for uriX:

uriX\turiY=P(Y | X)\turiZ=P(Z | X)


a real version would look like:

/foo/bar.html   /foo/baz.html=0.5       /foo/car.html=0.5

As expected, in the example above, the probabilities all sum to 1.0.

The reducer code to aggregate counts is below. Its more complex than the previous GroupViewsByUser code because of the probability calculation and reverse sorting.


public void reduce(Text key, Iterable input, Context context) {

        Map URItoCount = new HashMap();
	Iterator it = input.iterator();
	long totalCount = 0;
	// these are co-occurrences: sum them.
	while (it.hasNext()) {
		Text value = new Text(it.next());

		Integer count = URItoCount.get(value);

		if (count == null) {
			URItoCount.put(value, Integer.valueOf(1));
		} else {
			URItoCount.put(value, Integer.valueOf(count + 1));
		}

		totalCount++;
	}

	// emit tab delimited and from greatest to least
	// uri2=prob(uri2 | uri1) uri3=prob(uri3 | uri1)

	Map> byProb = new HashMap>();
	List sortedSet = new ArrayList();
			
	for (Map.Entry entry : URItoCount.entrySet()) {
		double probability = ((double)entry.getValue())/totalCount;
		List list = byProb.get(probability);
				
		if(list == null) {
			list = new ArrayList();
			byProb.put(probability,list);
			sortedSet.add(probability);
		}
		list.add(entry.getKey());
	}

	// sort from greatest to least.
	Collections.sort(sortedSet);  
	Collections.reverse(sortedSet);
			
	// emit uri:value where value = count / total 
	StringBuffer sbuf = new StringBuffer();
			
	for(double probability : sortedSet){
		List uris = byProb.get(probability);
				
		for(Text uri : uris) {				      
                    sbuf.append(uri.toString()).append(INTERNAL_DELIMITER).append(probability).append(FIELD_DELIMITER);
                }
	}
				
	// convert list to hadoop Text
	String coOccurrenceBuffer = sbuf.toString();
	LOGGER.debug("reduce: key = " + key.toString() + ", value = "
			+ coOccurrenceBuffer);

	Text allViews = new Text(coOccurrenceBuffer);

	try {
		context.write(key, allViews);
	} catch (IOException e) {
		LOGGER.error(e);
		context.getCounter(
				GroupByUserCounters.REDUCE_CONTEXT_WRITE_IO_EXCEPTION)
				.increment(1);
	} catch (InterruptedException e) {
		LOGGER.error(e);
		context.getCounter(
				GroupByUserCounters.REDUCE_CONTEXT_WRITE_INTERRUPTED_EXCEPTION)
				.increment(1);

	}
}


Are we done yet? Well....no. While it's nice to see conditional probabilities, they dont tell us anything about whether any two URIs are correlated. That's next. Sample code can be found at https://github.com/arunxarun/cpsamples.