Thursday, April 15, 2010

Synchronization Requirements

Recently I was lucky enough to pull the short straw and refactor the dreaded FooManager (name changed to protect the innocent). FooManager was a horrendously complex, overwrought class that suffered from at least two God Complexes, and my team had been playing hot potato with it for the last couple of months, since we learned we had inherited it.

My mission was to make FooManager understandable/maintainable without requiring heavy ingestion of psychotropic substances or becoming suicidal. After doing the usual de-Godification and Cleverectomy procedures, i.e. choosing decent abstractions, removing static methods, etc, I was left with a fundamental synchronization dilemma.  It was far from the most annoying part of FooManager, but it was definitely the most interesting to refactor.

FooManager received updates for each specific Foo subclass from different caches. It registered for those updates using a callback interface. The cache notified FooManager whenever it changed. This seemed like an innocent enough pub/sub pattern. But it made managing the datastructures in FooManager -- the ones that held pre-computed indexes of Foos by name, Foos by id, Foos by other id, etc -- tricky.

All access to these data structures were done within read/write locks. I think the motivation behind using a read/write lock instead of a more generic synchronization mechanism was to optimize reads at the expense of writes. This made sense, given that writes were infrequent and driven by periodic cache changes, and reads did not need immediate access to new data. But it meant that any time you wanted to access the structures, you did so within the confines of a read/write lock. Even when all you wanted to do was read the data, you needed to acquire the read lock on it.

The event-callback and the read/write lock had a Design-Pattern-as-a-Hammer smell to it, and as I thought about it some more, I realized why.

(1) The data structures under read/write locks did not have to be immediately consistent with the cache, they had to be eventually consistent with the cache. That meant that a cache change did not require immediate and total synchronization.
(2) The read/write lock was overkill, but necessary because of the asynchronous nature of the cache refill.
(3) The loading performance of the cache was really minimal, since the cache data structures were in memory by the time the notification occurred.
(4) What we really wanted to do was only block when a thread was updating the data structures. We didn't want other threads re-updating the data structures. But they shouldn't be blocked from accessing the current data structures while the update was going on.

The eventual consistency and low reload overhead of the system made it possible to do away with the event callback interface and poll the cache for an update by checking a cache version number via a synchronized method. Heres how it worked:
  1. Every request would check the cache version. 
  2. The first thread that got an updated version number would enter a lock and update a set of temporary data structures. 
  3. All other threads would block on the lock, enter once the original writer method had exited, make the version check, get the same version that they had entered with, and not attempt a reload.

synchronized protected int checkExpired(int currentVersion) throws Exception{

  int newVersion = specialFooCache.getCacheVersion();
  if(newVersion != currentVersion) {
   // update temp data structures.
   // assign temps to member data structures
  return newVersion;


Prior to every request, I checked the cache version and reloaded if necessary:

 // cacheVersion is a member variable of FooManager
 cacheVersion = checkExpired(cacheVersion);
 // now access structure w/o locks.

This was, well, OK. It was definitely more simple, but now all reader threads were blocked while waiting for the first one to update the data structures.

Thinking back to the original conditions, I remembered that as long as the data structures were eventually updated, all non writing threads would be just as happy using the original structures in an unblocked manner. In other words, we only needed to make sure that one thread reloaded the original structures.

This required synchronization at a more granular level: I removed synchronization on the checkExpired() method, and inside of it, blocked the rewrite code using a conditional statement followed by a lock, and allowed all threads that had gotten past the conditional and stuck on the lock to exit once (a) the lock was released and (b) they realized that they didn't need to do the cache reload.

protected void checkExpired() {

 int newVersion = specialFooCache.getCacheVersion();

 // we only want one thread to refresh the cache, and we
 // want the other ones to keep using the old data structures.

 if(newVersion != cacheVersion && inProgress == false) {
  synchronized(lock) {
   // first thread is in b/c cacheVersion has not been updated yet.    

   // All other threads    evaluate to false.
   int checkVersion = specialFooCache.getCacheVersion();
   if(checkVersion != cacheVersion) {

    inProgress = true;
    // reload from cache
    // set data structures so that subsequent threads get
    // the new data
    // set the version.
    cacheVersion = newVersion;
    // we are done here.
    inProgress = false;
   } // end version recheck
  } // end lock.
 } // end version and inProgress check

Note that in order to prevent thread local caching and allow all threads to get immediate access to the data structures after they were reassigned, I declared the data structures that I was updating as volatile.

The simplification of the cache update process made it a whole lot easier to understand,  I think that the conditions -- specifically the eventual consistency -- allowed us some latitude in when reader threads actually got the data. There isn't a specific by the numbers approach to solving synchronization issues, but it always helps to understand what you are synchronizing, and more importantly,  what you don't have to synchronize.

1 comment: