Tuesday, December 30, 2008

Bulk Resource Uploads via ActiveResource

I recently had to reduce the number of over-the-wire trips made by the monitoring app I had hastily thrown together, because the time spent making those trips, serializing and deserializing individual resources, was beginning to affect monitoring performance. The Second Fallacy of Distributed Computing (latency is zero) was beginning to rear its ugly Putinesque head.

I knew that this was coming, but premature optimization has never worked out for me, so I went with the default ActiveResource approach -- everything is a resource, and a CRUD operation on a resource maps to the corresponding HTTP 'verb' -- until smoke started pouring out of my servers.

My basic requirements:
  1. Create a web service that can store data for hundreds of individual datapoints at 5 minute intervals.
  2. Those datapoints can come and go.
  3. The implementor of the statistics gathering code really doesn't need to know the over-the-wire details of how their data is getting to my web service.
Implied in these requirements is the need for efficiency:
  • I shouldn't have to perform individual CRUD ops on each statistic every 5 minutes.
  • I shouldn't have to make an over the wire request for data every time I want to read that data.
From those implications I arrived at the following distilled technical requirements:
  1. I need to bulk upload statistics, and create/update them in one transaction in order to reduce the need for individual CRUD ops. At this point I'm going to choose to 'fail fast', aborting if a single create/update fails, so that I know if something is wrong.
  2. I need to keep a client side cache of those statistics around, only updating them when they've changed (important aside: because this is a monitoring application, it is assumed that each statistic belongs to a single client, so there is no need for out of band updates).
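The 'fail fast' behaviour in the first requirement can be sketched in plain Ruby. This is only an illustration under assumptions: StatisticStore, InvalidStatistic and bulk_upsert are hypothetical names, and an in-memory hash stands in for the database transaction the real service would use.

```ruby
# Hypothetical in-memory store; a real service would wrap this in a DB transaction.
class StatisticStore
  class InvalidStatistic < StandardError; end

  attr_reader :rows

  def initialize
    @rows = {} # statistic name => value
  end

  # Apply every create/update in the batch, or none of them.
  def bulk_upsert(stats)
    staged = @rows.dup
    stats.each do |stat|
      name  = stat["name"]
      value = stat["value"]
      # Fail fast: a single bad datapoint aborts the whole batch.
      if name.to_s.empty? || !value.is_a?(Numeric)
        raise InvalidStatistic, "bad statistic: #{stat.inspect}"
      end
      staged[name] = value
    end
    @rows = staged # commit only after every entry has been accepted
    stats.length
  end
end

store = StatisticStore.new
store.bulk_upsert([{ "name" => "cpu", "value" => 0.42 }])
```

Because the changes are staged on a copy, a batch that raises leaves the previously stored rows untouched.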
The Juicy Bits
I'd love to go into a long digression about how I explored every which way to do this, but I'll summarize by saying that my final solution had the following advantages:
  • Uses the existing ActiveResource custom method infrastructure
  • No custom routes need to be defined
  • Complexity hidden from the user, restricted to client side upload_statistics call and server side POST handler method.
  • The priesthood of High REST will not need to crucify me at the side of the road.
ActiveResource extension:

I needed to extend the default ActiveResource. By default, AR is not aware of data model relationships. For example, invoking the to_xml method on an AR class only shows its attributes, even if you specify other classes to include.


This limitation makes being smart about bulk updates pretty hard. I needed to introduce the notion of a client side cache, initialized and synchronized as needed.

My data model looks roughly like this:

Monitor=>has many=>Statistics

The default AR implementation of this looks like

class Statistic < ActiveResource::Base
end

I've extended it as follows:
  • Implemented an add_statistic method on Monitor that caches Statistic objects locally.
  • Added an upload_statistics method to Monitor that serializes the client-local statistics and then sends them to the server.
  • Modified the default POST handler for Statistic objects to handle bulk creates/updates.
  • Loaded the statistics cache on the client side when a Monitor is initialized.
  • Lazily synced the cache with the server side, updating on find and delete requests.
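The client side of that list can be illustrated without ActiveResource at all. The sketch below is hypothetical: MonitorCache, its dirty flag and the injected uploader block are names invented here to show the caching idea, and the block stands in for the real over-the-wire POST.

```ruby
# Hypothetical stand-in for the Monitor's client side statistics cache.
class MonitorCache
  attr_reader :statistics, :uploads

  def initialize(&uploader)
    @statistics = {}   # statistic name => attributes hash
    @dirty = false     # true when the cache differs from what was last sent
    @uploader = uploader
    @uploads = 0
  end

  # Cache a statistic locally; no network traffic happens here.
  def add_statistic(name, value)
    entry = { "name" => name, "value" => value }
    return if @statistics[name] == entry
    @statistics[name] = entry
    @dirty = true
  end

  # Send every cached statistic in one request, but only if something changed.
  def upload_statistics
    return false unless @dirty
    @uploader.call(@statistics)
    @uploads += 1
    @dirty = false
    true
  end
end

sent = []
cache = MonitorCache.new { |stats| sent << stats.keys.sort }
cache.add_statistic("cpu", 0.5)
cache.add_statistic("mem", 0.7)
cache.upload_statistics # one request carrying both statistics
cache.upload_statistics # nothing changed, so nothing is sent
```

Two statistics cost one upload here rather than two, which is the whole point of the exercise.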

Client and Server code by Operation

I want to point out a couple of things in this code:

(1) Cache loading is done in Monitor.initialize(). That way it gets called whether the client is retrieving or creating a Monitor.

def initialize(attributes = {}, logger = Logger.new(STDOUT))
  # set the class-wide logger only once
  if @@logger.nil?
    @@logger = logger
  end

  # client side cache, keyed by statistic name
  @statistics = {}

  if !attributes["statistics"].nil?
    attributes["statistics"].each do |single_stat_attributes|
      @@logger.debug("loading #{single_stat_attributes["name"]}")
      @statistics[single_stat_attributes["name"]] = Statistic.new(single_stat_attributes)
    end
  end
end


This required the following modification on the Monitor controller (server) side:

def index
  # optionally filter by name; otherwise return every monitor
  if params[:name].nil?
    @monitor_instances = Monitor.find(:all)
  else
    @monitor_instances = Monitor.find_all_by_name(params[:name])
  end

  respond_to do |format|
    format.html # index.html.erb
    # include child statistics so the client can load its cache
    format.xml  { render :xml => @monitor_instances.to_xml(:include => [:statistics]) }
    format.json { render :json => @monitor_instances.to_json(:include => [:statistics]) }
  end
end

I needed to make sure that returned monitor instances included child statistics in order to load the client side cache.
(2) get_statistic and delete_statistic synchronize with the server side.
(3) I've added a new upload_statistics method. I wanted to override save, but what I found at runtime is that the ActiveResource save method calls update, which loads statistics as attributes. This won't work for us because some of those attributes may not exist on the server side, so an 'update' operation is invalid. In upload_statistics, a custom AR method posts the client side cache of statistics to the StatisticsController on the server side:

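def upload_statistics
  # nothing to send if the client side cache is empty
  if @statistics.length > 0
    data = @statistics.to_xml
    # custom AR method: method name, param options, post data
    # (the value given for :bulk is an assumption; the server only checks that it is present)
    post(:statistics, {:bulk => true}, data)
  end
end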

Note that the first parameter is the method name, the second is the param options, and the third is the actual post data (which contains the serialized client side map of the statistics). The actual path that this POST gets sent to is /monitor_instances/:id/statistics.xml
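To make the mapping from those three parameters to the request concrete, here is a simplified imitation of how the path comes together. custom_method_path is a hypothetical helper written for this post, not ActiveResource's own code, and the bulk option value is an assumption; the post data travels in the request body, so it does not appear in the path.

```ruby
require "uri"

# Hypothetical helper: builds /<collection>/<id>/<method>.<ext>?<options>
def custom_method_path(collection, id, method_name, options = {}, extension = "xml")
  path  = "/#{collection}/#{id}/#{method_name}.#{extension}"
  query = URI.encode_www_form(options) # param options become the query string
  query.empty? ? path : "#{path}?#{query}"
end

custom_method_path("monitor_instances", 7, :statistics, :bulk => true)
# => "/monitor_instances/7/statistics.xml?bulk=true"
```

The options end up as ordinary request parameters, which is why the server can tell a bulk upload from a single create without any new routes.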

In the server, I do not have to add/create any new routes, but I do need to make sure that the default POST handler checks for the bulk parameter and handles accordingly.

# POST /statistics
# POST /statistics.xml
def create
  if params[:bulk].nil?
    # handle a single update
  else
    # handle a bulk update
  end
end

Unmarshalling and Saving Stats on the Server Side

In the StatisticsController create handler, I need to unmarshall the XML into statistics. There are instructions for extending ActiveRecord via the standard lib/extensions.rb mechanism, but they won't work for me because I'm serializing a hash, not an array of Statistic objects. So I need to deserialize and create/update objects by 'hand', which actually isn't that hard:

cmd = request.raw_post
monitor_instance = MonitorInstance.find(params[:monitor_instance_id])
hash = Hash.from_xml(cmd)

hash["hash"].values.each do |options|
  # bind values with placeholders rather than interpolating them into the SQL
  stat = Statistic.find(:first,
    :conditions => ["monitor_instance_id = ? and name = ?", params[:monitor_instance_id], options["name"]])

  if stat.nil?
    # create a new Statistic object
  else
    # update existing statistic object
  end
end

respond_to do |format|
  statistics = Statistic.find(:all,
    :conditions => ["monitor_instance_id = ?", params[:monitor_instance_id]])
  format.xml { render :xml => statistics.to_xml, :status => :created, :location => monitor_instance_path(monitor_instance) }
end

In the code above, I deserialize the xml payload using Hash.from_xml, which creates a hash around the hash encoded in the xml data.

To get to the original hash of statistics options, I had to extract them from the encoded hash:

hash = Hash.from_xml(cmd)
hash["hash"].values.each do |options|
  # create / update the stat that corresponds to options["name"] under the monitor
end
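For anyone who wants to poke at the unwrapping step without a Rails app, here is a rough stand-alone sketch. The parsed literal is my assumption about the shape Hash.from_xml hands back for this payload (an outer "hash" key wrapping the statistics keyed by name), apply_bulk is a hypothetical helper, and a plain Ruby hash stands in for the statistics table.

```ruby
# Assumed shape of Hash.from_xml(cmd) for a posted hash of statistics.
parsed = {
  "hash" => {
    "cpu" => { "name" => "cpu", "value" => "0.42" },
    "mem" => { "name" => "mem", "value" => "0.77" }
  }
}

# Hypothetical helper: create or update each statistic by name.
# existing is a name => attributes hash standing in for the database.
def apply_bulk(parsed, existing)
  created = []
  updated = []
  parsed["hash"].values.each do |options|
    name = options["name"]
    if existing.key?(name)
      existing[name].update(options) # update the existing statistic
      updated << name
    else
      existing[name] = options.dup   # create a new statistic
      created << name
    end
  end
  [created, updated]
end

existing = { "cpu" => { "name" => "cpu", "value" => "0.10" } }
created, updated = apply_bulk(parsed, existing)
```

After the call, created holds the names that were new to the store and updated holds the names that were already there.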

This took a lot longer than expected, because I ran into issues with trying to use standard methods, i.e. save, that I still don't understand. However, I know a lot more about AR and how to extend it to do more intelligent sub resource handling.
