Tuesday, September 7, 2010

Data Mining My GPS/HRM Data: Step 1, Formatting the Data

I've been wanting to analyze the data from my Garmin 305 for a while now. I've been a casual runner/biker/hardcore data geek for some time, and last year I started doing triathlons, which means even more data to analyze. While I've always been curious, I just haven't had a great 'need' to analyze it until now...

I'm switching from a primarily heart rate based training program to a 'pace based' training program. The former had me training within specific heart rate zones; the latter has me running at specific pace ranges. I'm a 'measurer', and I'm curious to see how effective (or not!) the pace based training program is. Fortunately I have one device that tracks both heart rate and speed, so I can compare the effect of the pace based training on my overall fitness with the effect of the heart rate based training.

In this case I'm measuring fitness as a combination of heart rate vs terrain covered, i.e. hilly vs flat, vs pace. In order to measure my fitness up to now and going forward, I need to answer the following questions:
  1. how much time did I spend in 'recovery' mode, where my heart rate was < 70% max?
  2. how much time did I spend in 'pain cave' mode, where my heart rate was > 85-90% max?
  3. how much faster (or slower) did I get at the same heart rate over the last year? 
  4. for the new pace based program, how much time am I spending at the different paces, i.e. recovery pace, base pace, marathon pace, 1/2 marathon pace, 10k pace, 5k pace, 1 mile pace?
  5. what is my average heart rate for those paces? 
  6. how much faster (slower) am I getting? 
I'm hoping to answer these questions using several approaches and several technologies that I've been using at work, and others that I've been itching to try. 
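
As a rough illustration of questions 1 and 2, here is a minimal Ruby sketch of how time per heart rate zone could be totaled from timestamped samples. Everything in it is an assumption made for the example: the max heart rate of 190, the 85% 'pain cave' cutoff (the low end of the 85-90% range above), the zone names, and the sample values.

```ruby
# Hypothetical maximum heart rate; substitute a measured value.
MAX_HR = 190.0

# Each sample is [seconds_since_start, heart_rate_bpm], sorted by time.
# The interval between two samples is credited to the zone of the earlier one,
# so the final sample contributes no time of its own.
def seconds_by_zone(samples, max_hr = MAX_HR)
  totals = Hash.new(0.0)
  samples.each_cons(2) do |(t0, hr), (t1, _)|
    pct = hr / max_hr
    zone = if pct < 0.70
             :recovery
           elsif pct > 0.85
             :pain_cave
           else
             :middle
           end
    totals[zone] += t1 - t0
  end
  totals
end

# Made-up samples, five seconds apart.
samples = [[0, 120], [5, 130], [10, 150], [15, 170], [20, 175]]
zones = seconds_by_zone(samples)
```

Crediting each interval to the earlier sample is only one possible choice; averaging the two readings would work just as well at this sampling rate.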

The first thing I needed to do before any analysis was to convert the data into a form that I could easily operate on. The device exports its data in a format called TCX (Training Center XML), which is a schema-validated XML format.

I need the data in CSV format, mainly because the tools I want to process the data with are all Hadoop based, and while I've read that it is not only possible, but easy, to process XML with Hadoop, Hadoop works best with CSV formats. XML is a nice format for nested data, and this is nested data, with the following structure:
  • activities
    • activity
      • laps
        • lap -- contains summary averages from trackpoint data (see below)
          • trackpoints
            • trackpoint -- contains snapshot heart rate, altitude, distance, etc.
XML is especially good when there are optional attributes. CSV tends to suck with optional attributes, because nothing is optional. In this case none of the attributes are optional, so ultimately XML is overkill for storing this data.

I collapsed this structure into two csv lists: summary data and detail data, because I planned to act on summary and detail data separately. 

The summary data contains lap summary data:
  • activity id, lap id, total time, total distance, max speed, max heart rate, average heart rate, calories, number of trackpoints.
The detail data contains trackpoint data:
  • lap id, time, latitude, longitude, altitude, distance, heart rate
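
To make the flattening concrete, here is a small Ruby sketch that collapses one nested activity into the two row lists described above. The hash layout, key names, and all values are invented for illustration; they are not the TCX element names or my real data.

```ruby
# A tiny hand-built activity in the nested shape described above.
# Every value here is made up for illustration.
activity = {
  id: '2010-09-01T06:00:00Z',
  laps: [
    { id: 1, total_time: 10.0, total_distance: 30.0, max_speed: 3.2,
      max_hr: 150, avg_hr: 145, calories: 2,
      trackpoints: [
        { time: '2010-09-01T06:00:00Z', lat: 40.0, lon: -105.0,
          alt: 1600.0, dist: 0.0, hr: 140 },
        { time: '2010-09-01T06:00:10Z', lat: 40.0003, lon: -105.0,
          alt: 1601.0, dist: 30.0, hr: 150 }
      ] }
  ]
}

summary_rows = []
detail_rows = []

activity[:laps].each do |lap|
  # One summary row per lap, ending with the trackpoint count.
  summary_rows << [activity[:id]] +
                  lap.values_at(:id, :total_time, :total_distance, :max_speed,
                                :max_hr, :avg_hr, :calories) +
                  [lap[:trackpoints].size]

  # One detail row per trackpoint, keyed back to its lap.
  lap[:trackpoints].each do |tp|
    detail_rows << [lap[:id]] + tp.values_at(:time, :lat, :lon, :alt, :dist, :hr)
  end
end

# None of these fields contain commas or quotes, so a plain join is enough.
summary_csv = summary_rows.map { |row| row.join(',') }
detail_csv = detail_rows.map { |row| row.join(',') }
```

The lap id in each detail row is what lets the two lists be joined again later.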
To do the conversion, I used the Ruby libxml SAX parser. To use it, I needed to create a callback handler that implemented the methods I wanted to override.

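require 'xml' # libxml-ruby
class PostCallbacks
  include XML::SaxParser::Callbacks

  # Called for each opening tag.
  def on_start_element_ns(element, attributes, prefix, uri, namespaces)
  end

  # Called with the text found inside an element.
  def on_characters(chars)
  end

  # Called for each closing tag.
  def on_end_element_ns(name, prefix, uri)
  end
end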

In the callback handler, I maintained state to track nested XML objects. Typically I would assign state in the on_start_element_ns() method, act on that state in the on_characters() method, and release state in the on_end_element_ns() method. I would also flush my results to disk occasionally to avoid taking up an unreasonable amount of memory.
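
Here is a minimal sketch of that assign/act/release pattern for the trackpoint detail rows. It is not my actual handler: the class name, the running lap counter, the column order, and the flush threshold are all assumptions for the example. It is written as a plain Ruby class that uses the libxml callback names, and the callbacks are invoked by hand so that it runs without the gem installed.

```ruby
# Plain Ruby class using libxml-ruby's SAX callback names. With the gem
# installed it would also `include XML::SaxParser::Callbacks`.
class TrackpointHandler
  FIELDS = %w[Time LatitudeDegrees LongitudeDegrees AltitudeMeters DistanceMeters].freeze
  COLUMNS = (['lap'] + FIELDS + ['HeartRateBpm']).freeze

  def initialize(out, flush_every = 1000)
    @out = out                # anything that responds to <<, e.g. a File
    @flush_every = flush_every
    @buffer = []              # finished rows waiting to be written
    @path = []                # stack of currently open element names
    @lap_id = 0               # hypothetical running lap counter
    @row = nil                # non-nil only while inside a Trackpoint
  end

  # Assign state: remember where we are and start a row per trackpoint.
  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
    @path.push(name)
    @lap_id += 1 if name == 'Lap'
    @row = { 'lap' => @lap_id } if name == 'Trackpoint'
  end

  # Act on state: text may arrive in several chunks, so append.
  def on_characters(chars)
    return if @row.nil? || chars.strip.empty?
    key = current_field
    @row[key] = @row.fetch(key, '') + chars if key
  end

  # Release state: emit the row when its trackpoint closes.
  def on_end_element_ns(name, prefix, uri)
    @path.pop
    return unless name == 'Trackpoint'
    @buffer << COLUMNS.map { |column| @row[column] }.join(',')
    @row = nil
    flush if @buffer.size >= @flush_every
  end

  def on_end_document
    flush
  end

  private

  # Heart rate is nested one level deeper: <HeartRateBpm><Value>.
  def current_field
    return 'HeartRateBpm' if @path.last(2) == %w[HeartRateBpm Value]
    FIELDS.include?(@path.last) ? @path.last : nil
  end

  def flush
    @buffer.each { |line| @out << line << "\n" }
    @buffer.clear
  end
end

# Drive the callbacks by hand with one made-up trackpoint.
out = +''
handler = TrackpointHandler.new(out, 1)
open_tag = ->(name) { handler.on_start_element_ns(name, {}, nil, nil, {}) }
close_tag = ->(name) { handler.on_end_element_ns(name, nil, nil) }
text = lambda do |name, value|
  open_tag.call(name)
  handler.on_characters(value)
  close_tag.call(name)
end

open_tag.call('Lap')
open_tag.call('Trackpoint')
text.call('Time', '2010-09-01T06:00:00Z')
open_tag.call('Position')
text.call('LatitudeDegrees', '40.0')
text.call('LongitudeDegrees', '-105.0')
close_tag.call('Position')
text.call('AltitudeMeters', '1600.0')
text.call('DistanceMeters', '0.0')
open_tag.call('HeartRateBpm')
text.call('Value', '140')
close_tag.call('HeartRateBpm')
close_tag.call('Trackpoint')
close_tag.call('Lap')
handler.on_end_document
```

With libxml-ruby, a handler like this would be attached with something like parser = XML::SaxParser.file('activities.tcx'), then parser.callbacks = handler and parser.parse, where the file name is just a placeholder. Buffering rows and writing them in batches keeps memory use flat regardless of input size.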

I had about 40 MB of data from the previous 3 years, which was parsed into CSV files in approximately 44 seconds. I'm more than happy with that performance right now, because this is essentially a one-off job to get the data.

Next Up: setting up a data processing pipeline using Pig.
