Wednesday, November 17, 2010

Pull Parsing with STAX

Foreword
Due to some events beyond my control, I lost the source code for parsing my Garmin TCX data into CSV format. I had not gotten around to Git-ifying my source, and also had not backed it up via JungleDisk/Mozy, so the original Ruby code that used the LibXML SAX parser is gone.

That might not necessarily be a bad thing. In SAX parsing, events happen without any surrounding context. It is up to the programmer to supply the context, and doing that in a legible, maintainable way with the SAX event driven model is a challenge. When I discovered issues with the parsing code, most of the time spent fixing them went into working out what state the parser was in when the bug occurred. SAX parsing code carries a significant maintenance penalty.

DOM parsing is much more straightforward because you navigate the structure of the XML, and implicitly get the associated context. Unfortunately, it can be prohibitively expensive for large documents because the entire document has to be loaded into memory.

STAX parsing is a reasonable compromise between the two extremes. It streams the file (i.e. only loading into memory what it needs), while allowing the developer to navigate the structure of the XML. In other words, you have context without the memory overhead of a full DOM.

STAX can be used to read or write XML files. My application, which converts .TCX files to a CSV format, only reads XML. For reading files, STAX provides two different APIs: the cursor based API uses XMLStreamReader, and the iterator based API uses XMLEventReader. The key difference between the two is that the iterator based API treats events as first class objects and allows the user to peek ahead at the next event to be fetched. This supplies the user with more context than the cursor based API. That context comes with additional resource consumption, but still much less than loading an entire DOM into memory.
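To make the difference between the two reading APIs concrete, here is a minimal sketch. It is not code from my parser: the class name, the method names and the one-line sample XML are made up for illustration. The first method uses the cursor API, where the reader itself is the cursor; the second uses the iterator API and its peek() method to look at the next event without consuming it.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

public class StaxApiComparison {

    // Hypothetical sample input, not taken from a real TCX file.
    static final String XML = "<Lap><Calories>581</Calories></Lap>";

    /** Cursor API: next() moves the cursor and returns an int event type. */
    static String firstElementWithCursor(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    /** Iterator API: events are objects, and peek() looks ahead without consuming. */
    static String elementFollowingFirst(String xml) throws XMLStreamException {
        XMLEventReader events =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        try {
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()) {
                    XMLEvent upcoming = events.peek(); // still available to nextEvent()
                    if (upcoming != null && upcoming.isStartElement()) {
                        return upcoming.asStartElement().getName().getLocalPart();
                    }
                    return null;
                }
            }
            return null;
        } finally {
            events.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstElementWithCursor(XML));
        System.out.println(elementFollowingFirst(XML));
    }
}
```

With the cursor API, anything you want to remember about an earlier event has to be copied out before calling next() again; with the iterator API the XMLEvent objects can simply be kept around.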

Onward
In my first STAX program, I wanted to see how far I could get with the cursor based API. Specifically, given its better performance characteristics, would it provide enough context for me to write easily maintainable code?

A Moment...
A quick side note on why (again) I'm choosing to convert an XML stream to CSV. XML is great because DTDs and Schemas provide a way to validate document integrity when there are large numbers of optional elements. In the case of TCX, there are few optional elements -- the elements that exist always contain the same kinds of data. With an unchanging format, CSV makes more sense because it is a more compact representation of a stable data set.

Implementation Details
In my re-implementation of TCX->CSV parsing code, I needed to transform a set of nested parameters into two different CSV formats. In order to explain what I needed to do, I need to go into detail about what the Garmin TCX XML looks like, and what I wanted to extract from it.

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd">

  <Activities>
    <Activity Sport="Biking">
      <Id>2010-11-04T22:02:59Z</Id>
      <Lap StartTime="2010-11-04T22:02:59Z">
        <TotalTimeSeconds>1798.6000000</TotalTimeSeconds>
        <DistanceMeters>15412.0742188</DistanceMeters>
        <MaximumSpeed>12.8037510</MaximumSpeed>
        <Calories>581</Calories>
        <AverageHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
          <Value>140</Value>
        </AverageHeartRateBpm>
        <MaximumHeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
          <Value>158</Value>
        </MaximumHeartRateBpm>
        <Intensity>Active</Intensity>
        <Cadence>0</Cadence>
        <TriggerMethod>Manual</TriggerMethod>
        <Track>
          <Trackpoint>
            <Time>2010-11-04T22:03:00Z</Time>
            <Position>
              <LatitudeDegrees>47.5834731</LatitudeDegrees>
              <LongitudeDegrees>-122.2491668</LongitudeDegrees>
            </Position>
            <AltitudeMeters>17.4411621</AltitudeMeters>
            <DistanceMeters>26.8590393</DistanceMeters>
            <HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
              <Value>85</Value>
            </HeartRateBpm>
            <SensorState>Absent</SensorState>
          </Trackpoint>
          ...
        </Track>
      </Lap>
      ...
    </Activity>
    ...
  </Activities>
</TrainingCenterDatabase>


I want to extract two kinds of data out of the XML stream above:
  1. Lap summary data. Lap summary data is good for high level comparisons of effort. Summaries of laps of roughly the same duration can be compared directly against each other.
  2. Trackpoint data. Trackpoint data (elevation, heart rate, lat/long) can be analyzed/transformed across arbitrary intervals to measure input effort and output speed.
Lap Summary Data will look like this:
activity_id,lap_id,total_time,total_distance,max_speed,total_calories,average_heartrate,max_heartrate
Trackpoint detail data will look like this:
lap_id,trackpoint_id,time,latitude,longitude,altitude,distance,heartrate

Both files may end up being used when correlating track points to their parent laps and activities.

Initializing the STAX Parser
I've created a class TCXPullParser to parse Garmin TCX data. In the constructor I initialize the STAX parser:

XMLStreamReader parser;
 .....
 /**
  * ctor, with the CSV writers to write to and the name of the file to read from.
  * 
  * @param lapSummaryWriter
  * @param trackDetailWriter
  * @param fileToParse
  * @throws IOException
  * @throws XMLStreamException
  */
 public TCXPullParser(CSVWriter lapSummaryWriter, CSVWriter trackDetailWriter,
     String fileToParse) throws IOException, XMLStreamException {
  
  FileInputStream fis = new FileInputStream(fileToParse);
  XMLInputFactory factory = XMLInputFactory.newInstance();
  parser = factory.createXMLStreamReader(fis);
  ...
 }

Initialization is simple: the parser is created from the XMLInputFactory and takes the FileInputStream opened on the input file name. From this point on, my primary access to the file is through the parser object. I use it to advance the cursor (parser.next()), inspect the type of the current event (the return value of parser.next()), and grab text (parser.getText()). These three operations, with some additional functionality I've added, give me enough context to parse the XML top-down.
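The following stand-alone sketch shows those three operations in a loop. It is an illustration under assumptions rather than part of TCXPullParser: the class and method names are invented, the input comes from a string instead of a file, and I ask the factory to coalesce text so that each element's text arrives as one CHARACTERS event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CursorWalkSketch {

    /** Walks the whole document and records what the cursor sees. */
    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // deliver adjacent character data as a single CHARACTERS event
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        try {
            while (parser.hasNext()) {
                int type = parser.next(); // advance the cursor, get the event type
                if (type == XMLStreamConstants.START_ELEMENT) {
                    seen.add("start:" + parser.getLocalName());
                } else if (type == XMLStreamConstants.CHARACTERS && !parser.isWhiteSpace()) {
                    seen.add("text:" + parser.getText()); // skip indentation-only text
                } else if (type == XMLStreamConstants.END_ELEMENT) {
                    seen.add("end:" + parser.getLocalName());
                }
                // every other event type (comments, END_DOCUMENT, ...) is ignored
            }
        } finally {
            parser.close();
        }
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        for (String line : describe("<Time>2010-11-04T22:03:00Z</Time>")) {
            System.out.println(line);
        }
    }
}
```

Note that closing the XMLStreamReader does not close the underlying stream, so in the file-based constructor the FileInputStream still needs to be closed separately.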

Pull Parsing
One advantage of using pull parsing is that the code is in charge of when events are fired. This lets us do things like skip processing/move directly to an element that we are interested in by using the parser.next() method and checking the returned element type:

/**
 * skips parser to the start element of the specified element name, while
 * stopElementName has not been encountered.
 * 
 * @param parser
 * @param elementName
 * @param stopElementName
 *          if we get this far, we've gone too far.
 * @return true if element is found
 * @throws Exception
 */
protected boolean skipTo(XMLStreamReader parser, String elementName,
    String stopElementName) throws Exception {
 while (parser.hasNext()) {
  int parseType = parser.next();
  // getLocalName() is only valid on element events, so ignore everything
  // else (CHARACTERS, COMMENT, SPACE, END_DOCUMENT, ...)
  if (parseType != XMLStreamConstants.START_ELEMENT
      && parseType != XMLStreamConstants.END_ELEMENT) {
   continue;
  }
  String elName = parser.getLocalName();
  if (parseType == XMLStreamConstants.START_ELEMENT) {
   if (elName.equals(elementName)) {
    return true;
   } else if (elName.equals(stopElementName)) {
    // in the case where we are looking across parallel elements
    // or into a container element,
    // stop when we find the stop element
    return false;
   }
  } else if (elName.equals(stopElementName)) {
   // in the case where we are looking within a container element, stop
   // when we reach the end of that container element
   return false;
  }
 }
 // ran out of document without finding either element
 return false;
}

I typically use skipTo() to move to the next instance of an element, before its containing element's end tag is reached. For example, when I'm parsing the contents of a Trackpoint tag:

<Trackpoint>
   <Time>2010-11-04T22:03:00Z</Time>
   <Position>
   <LatitudeDegrees>47.5834731</LatitudeDegrees>
   <LongitudeDegrees>-122.2491668</LongitudeDegrees>
   </Position>
   <AltitudeMeters>17.4411621</AltitudeMeters>
   <DistanceMeters>26.8590393</DistanceMeters>
   <HeartRateBpm xsi:type="HeartRateInBeatsPerMinute_t">
   <Value>85</Value>
   </HeartRateBpm>
   <SensorState>Absent</SensorState>
</Trackpoint>


This is the code to parse that data:

/**
 * parses a single trackpoint and writes an output line to the trackDetail CSV
 * writer.
 * 
 * @param parser
 * @throws Exception
 */
protected void parseTrackPoint(XMLStreamReader parser, String lapId,
    String trackPointId) throws Exception {
 trackDetailWriter.writeArg(lapId);
 trackDetailWriter.writeArg(trackPointId);
 skipTo(parser, TIME, TRACKPOINT);
 trackDetailWriter.writeArg(getTimeValue(parser, parser.next()));
 skipTo(parser, LAT, TRACKPOINT);
 trackDetailWriter.writeArg(getValue(parser, parser.next()));
 skipTo(parser, LONG, TRACKPOINT);
 trackDetailWriter.writeArg(getValue(parser, parser.next()));
 skipTo(parser, ALT, TRACKPOINT);
 trackDetailWriter.writeArg(getValue(parser, parser.next()));
 skipTo(parser, DIST, TRACKPOINT);
 trackDetailWriter.writeArg(getValue(parser, parser.next()));
 skipTo(parser, HEARTRATE, TRACKPOINT);
 trackDetailWriter.writeArg(getValue(parser, parser.next()));
 trackDetailWriter.flushArgs();
}
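parseTrackPoint() ignores the boolean that skipTo() returns, which is fine as long as every element is present. The sketch below shows how that return value could be used if an element such as Position were ever missing. It is a hedged illustration: skipTo() is a condensed copy of the method above, while childText(), the class name and the decision to return an empty string are my own assumptions for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SkipToUsageSketch {

    /** Condensed copy of skipTo(): true if elementName starts before the stop element. */
    static boolean skipTo(XMLStreamReader parser, String elementName, String stopElementName)
            throws XMLStreamException {
        while (parser.hasNext()) {
            int type = parser.next();
            if (type == XMLStreamConstants.START_ELEMENT) {
                String name = parser.getLocalName();
                if (name.equals(elementName)) {
                    return true;
                }
                if (name.equals(stopElementName)) {
                    return false;
                }
            } else if (type == XMLStreamConstants.END_ELEMENT
                    && parser.getLocalName().equals(stopElementName)) {
                return false;
            }
        }
        return false;
    }

    /** Hypothetical helper: text of a child element, or "" if the container ends first. */
    static String childText(String xml, String elementName, String containerName)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader parser = factory.createXMLStreamReader(new StringReader(xml));
        try {
            parser.nextTag(); // move onto the root element, assumed to be the container
            if (skipTo(parser, elementName, containerName)
                    && parser.next() == XMLStreamConstants.CHARACTERS) {
                return parser.getText();
            }
            return ""; // element absent: leave the CSV column empty
        } finally {
            parser.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Trackpoint><Time>2010-11-04T22:03:00Z</Time>"
                + "<AltitudeMeters>17.4411621</AltitudeMeters></Trackpoint>";
        System.out.println(childText(xml, "AltitudeMeters", "Trackpoint"));
        System.out.println("[" + childText(xml, "LatitudeDegrees", "Trackpoint") + "]");
    }
}
```

One thing to keep in mind with this approach: once skipTo() has hit the end of the container, the cursor has already left it, so any later skipTo() calls for the same trackpoint would run on into the next one.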

With skipTo in place, I needed to extract the data from the XML. The text inside an element is reported as a CHARACTERS event, which is reached by calling parser.next() after hitting the enclosing tag's START_ELEMENT. The text itself is read via XMLStreamReader.getText():

/**
 * extracts a double value from a character stream.
 * 
 * @param parser
 * @param parseType
 * @return the double, or -1 if the element is not CHARACTERS. will also
 *         throw a runtime exception (NumberFormatException) if parsing fails.
 */
private double getValue(XMLStreamReader parser, int parseType) {
 if (parseType == XMLStreamConstants.CHARACTERS) {
  return Double.parseDouble(parser.getText());
 } else {
  return -1;
 }
}
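A possible alternative to the next()/getText() pair is XMLStreamReader.getElementText(), which is called while the cursor is on a START_ELEMENT and returns the complete text of a text-only element. The StAX API permits a parser to deliver an element's text as more than one CHARACTERS event unless coalescing is requested, and getElementText() sidesteps that. The sketch below is an assumption-laden illustration, not the code I used: the class, the method names and the NaN return value for a missing element are invented for the example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ElementTextSketch {

    /**
     * Reads the text of the current element as a double.
     * Precondition: the cursor is on the START_ELEMENT of a text-only element.
     * Afterwards the cursor is on the matching END_ELEMENT.
     */
    static double readDouble(XMLStreamReader parser) throws XMLStreamException {
        return Double.parseDouble(parser.getElementText().trim());
    }

    /** Hypothetical helper: value of the first element with the given name, or NaN. */
    static double firstValue(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader parser =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(parser.getLocalName())) {
                    return readDouble(parser);
                }
            }
            return Double.NaN; // element not present in the document
        } finally {
            parser.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<Trackpoint><AltitudeMeters>17.4411621</AltitudeMeters></Trackpoint>";
        System.out.println(firstValue(xml, "AltitudeMeters"));
    }
}
```

getElementText() throws an XMLStreamException if the element contains child elements, so it would not work directly on HeartRateBpm, whose number sits inside a nested Value element.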


The combination of skipTo and getValue() allows me to extract Double values from the XML. I'm using Double to validate the format of the value, even though I'm going to persist that value back as a string. When extracting timestamps, I extract the data to a long:
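private long getTimeValue(XMLStreamReader parser, int parseType)
    throws ParseException {
 if (parseType == XMLStreamConstants.CHARACTERS) {
  // TCX timestamps look like 2010-11-04T22:03:00Z; the trailing Z means UTC
  String raw = parser.getText();
  String date = raw.substring(0, raw.indexOf('T'));
  String time = raw.substring(raw.indexOf('T') + 1, raw.indexOf('Z'));
  // the date part is year-month-day, so the pattern has to start with yyyy-MM-dd
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd-HH:mm:ss");
  // interpret the value as UTC rather than the JVM's default time zone
  sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
  Date actual = sdf.parse(date + '-' + time);
  // milliseconds since the epoch
  return actual.getTime();
 } else {
  return -1;
 }
}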
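On a JVM that ships the java.time package (Java 8 or later, which is newer than the code in this post), the same conversion can be sketched without any string slicing, because the TCX timestamps are already in the ISO-8601 form that Instant.parse() accepts. The class and method names here are hypothetical.

```java
import java.time.Instant;

public class TcxTimestampSketch {

    /** Converts a timestamp such as 2010-11-04T22:03:00Z to epoch milliseconds. */
    static long toEpochMillis(String raw) {
        // Instant.parse expects ISO-8601 in UTC and throws
        // DateTimeParseException (a runtime exception) on malformed input
        return Instant.parse(raw.trim()).toEpochMilli();
    }

    public static void main(String[] args) {
        long lapStart = toEpochMillis("2010-11-04T22:02:59Z");
        long firstPoint = toEpochMillis("2010-11-04T22:03:00Z");
        // milliseconds between the lap start and the first trackpoint
        System.out.println(firstPoint - lapStart);
    }
}
```

Because the value stays in UTC from start to finish, the result does not depend on the time zone of the machine doing the conversion.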

Separating Reading XML from Writing CSV
In the code above there are calls to a trackDetailWriter object. I chose to separate the writing of the CSV from the reading of the XML in order to test the XML reading logic more easily. This simplified things a lot: it allowed me to pass in CSVWriter objects, it relieved the parsing code of having to open, manage and close destination files, and it allowed me to write test implementations of CSVWriter that store the data in memory for me to check during unit tests.

public interface CSVWriter {

 /**
  * write a single arg
  * @param arg
  * @throws Exception
  */
 public void writeArg(Object arg) throws Exception;
 
 /**
  * flush all pending args as a single CSV line
  * @throws Exception 
  */
 public void flushArgs() throws Exception;
 
}

The default implementation (used at runtime) looks like this:

/**
 * 
 * @author Arun Jacob
 *
 * push comma separated values to a file. 
 */
public class DefaultCSVWriterImpl implements CSVWriter {

 private FileWriter writer;
 private StringBuffer buffer;
 
 public DefaultCSVWriterImpl(String fileName) throws IOException {
  writer = new FileWriter(fileName);
  buffer = new StringBuffer();
 }
  
 /**
  * close the file: REQUIRED for all file writers.
  * @throws Exception
  */
 public void close() throws Exception {
  writer.flush();
  writer.close();
 }

 @Override
 public void writeArg(Object arg) throws Exception {
  buffer.append(arg);
  buffer.append(",");
 }

 @Override
 public void flushArgs() throws Exception {
  writeToFile(buffer);
 }

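 /**
  * flush the contents of the buffer to file as a single line
  * @param buffer
  * @throws IOException
  */
 private void writeToFile(StringBuffer buffer) throws IOException {
  if(buffer.length() == 0) {
   // nothing pending, nothing to write
   return;
  }
  if(buffer.charAt(buffer.length()-1) == ',') {
   // remove the last comma before writing. 
   writer.write(buffer.substring(0,buffer.length()-1));
  } else {
   writer.write(buffer.toString());
  }
  // terminate the record so that each flush produces its own CSV line
  writer.write(System.getProperty("line.separator"));
  
  resetBuffer(buffer);
 }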

 /**
  * clear out the StringBuffer
  * @param buffer
  */
 private void resetBuffer(StringBuffer buffer) {
  if(buffer.length() > 0) {
   buffer.delete(0,buffer.length());
  }
 }

}

The test implementation (used to verify that values are being pulled from the XML correctly) looks like this:

public class TestTrackPointCSVWriter implements CSVWriter {
 
 static final String TRACKPOINTID = "TrackPointId";

 static final String LAPID = "LapId";

 static final String ACTIVITYID = "activityId";

 List<Object> args;
 Map<String,Object> argsMap;
 public TestTrackPointCSVWriter() {
  args = new ArrayList<Object>();
  argsMap = new HashMap<String,Object>();
 }
 

 @Override
 public void writeArg(Object arg) throws Exception {
  args.add(arg);
  
 }

 @Override
 public void flushArgs() throws Exception {
  argsMap.put(LAPID, args.get(0));
  argsMap.put(TRACKPOINTID, args.get(1));
  argsMap.put(TCXPullParser.TIME, args.get(2));
  argsMap.put(TCXPullParser.LAT,args.get(3));
  argsMap.put(TCXPullParser.LONG, args.get(4));
  argsMap.put(TCXPullParser.ALT, args.get(5));
  argsMap.put(TCXPullParser.DIST, args.get(6));
  argsMap.put(TCXPullParser.HEARTRATE, args.get(7));
  args.clear();
  
 }

 
 /**
  * validation method
  * @param key
  * @return
  */
 public Object get(String key) {
  return argsMap.get(key);
 }

}
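The test writer above keeps only the most recently flushed row, because each flushArgs() call overwrites the map. If a test needs to look at several trackpoints, a variation that records every row is easy to sketch. Everything below is hypothetical: the class names are invented, and the CSVWriter interface is repeated locally only so that the example compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordingCsvWriterSketch {

    /** Local copy of the CSVWriter interface, repeated to keep the sketch self-contained. */
    interface CSVWriter {
        void writeArg(Object arg) throws Exception;
        void flushArgs() throws Exception;
    }

    /** In-memory writer that keeps every flushed row, in order. */
    static class RecordingCSVWriter implements CSVWriter {
        private final List<Object> pending = new ArrayList<>();
        private final List<List<Object>> rows = new ArrayList<>();

        @Override
        public void writeArg(Object arg) {
            pending.add(arg);
        }

        @Override
        public void flushArgs() {
            rows.add(new ArrayList<>(pending)); // copy, then start a new row
            pending.clear();
        }

        List<List<Object>> rows() {
            return rows;
        }
    }

    /** Stands in for the parser: pushes two made-up rows through the interface. */
    static List<List<Object>> demo() throws Exception {
        RecordingCSVWriter recorder = new RecordingCSVWriter();
        CSVWriter writer = recorder; // the parser would only see the interface
        writer.writeArg("lap-1");
        writer.writeArg("tp-1");
        writer.writeArg(17.4411621);
        writer.flushArgs();
        writer.writeArg("lap-1");
        writer.writeArg("tp-2");
        writer.flushArgs();
        return recorder.rows();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo().size());
    }
}
```

A unit test can then assert on row counts and on individual cells by position, at the cost of losing the named lookups that the map-based version provides.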

Conclusion
I'm not sure what I was thinking when I wrote the original SAX parser for TCX data, other than I just like to write in Ruby. The additional context that I get by being able to pull tags instead of getting them pushed at me makes the code much easier to follow and therefore maintain. 
