Tuesday, March 9, 2010

Don't Drive by Dumb Dogmatic Data

My posting has fallen off severely since taking up the new job, mostly because I have had nothing intelligent to say while I'm trying to ramp up on the technology as well as the business drivers behind the technical decisions being made. However, something did come up today that I want to remember, and I've heard more than once that if you don't write it down, you don't remember it.

I've been thinking a lot about numbers. In this brave new world, a lot of companies are 'data driven', meaning they make their decisions based on the data that is around them, and if they can't measure it, they can't manage it.

That is a statement that sounds basically logical, assuming that
(1) what is being measured is clearly understood by everyone, and
(2) changes that occur to the measurement correlate well to overall system state.

In the engineering world, people don't have a lot of patience with numbers that are not explainable. So the mythical 'sales numbers' that drive entire sales teams off of cliffs every quarter are usually sneered at by engineers, who hold themselves up as the high priests and priestesses of logic.

For engineering teams, numbers like TPS (transactions per second) and MTTF (mean time to failure) are not only easy to conceptualize, but changes in them are good indicators of system functionality. More importantly to engineering organizations, you don't have to be a software developer to understand what a decrease in TPS or MTTF means to the business.

So engineering management is always looking for other numbers that encapsulate system health. Again, this is a perfectly reasonable goal, because good metrics serve as a useful abstraction layer around the grimy bits of the sausage factory. However, I think that the quest for engineering metrics is one more piece of evidence of how rational starting points end up being ridiculous the moment logic is abandoned in favor of dogma.

We've all laughed ourselves silly at the old stories of measuring programmer productivity by lines of code written, but what are the programmers of tomorrow going to laugh at? My first candidate would be measuring the quality of unit tests by the unit test code coverage metric -- specifically, what percentage of total lines of code are covered by unit tests.

These days unit test code coverage is easy to get. We get ours from a Cobertura plugin for Maven.
Code coverage is one of those measurements that initially sounds really good. If the test coverage decreases, that's bad, right? If it increases, well, good job to the developers!

Wait, not so fast. If line coverage is supposed to be an indicator of quality, that implies that just because a test causes code to be exercised, the test is good. But I can write lots of tests with zero assertions. With those tests I've verified that in very specific cases there are no NPEs, but that's about it.
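
To make that concrete, here is a minimal sketch. It is written in Python only for brevity; the same thing applies to Java tests measured by Cobertura. The function `apply_discount` and both checks are hypothetical. The first check executes every line of the function without asserting anything, so it passes despite the bug. The second makes a single assertion and catches it.

```python
def apply_discount(price, percent):
    """Hypothetical function with a deliberate bug."""
    discount = price * percent / 100
    return price + discount  # bug: should be price - discount


def exercise_only():
    # Executes every line of apply_discount, so line coverage is complete,
    # but it only proves that the call does not raise.
    apply_discount(200, 10)


def exercise_and_verify():
    # Same lines executed, but this time the result is checked.
    assert apply_discount(200, 10) == 180


def passes(check):
    """Run a check function and report whether it passed."""
    try:
        check()
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("no assertions passes: ", passes(exercise_only))
    print("one assertion passes: ", passes(exercise_and_verify))
```

Both checks produce identical line coverage for `apply_discount`; only one of them says anything about whether the code is right.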

If I take unit test line coverage to represent the quantity of unit tests written without looking at the number of assertions being made per test, I'm only seeing part of the picture. If changes in the unit test coverage number are supposed to indicate whether testing is being done on new code, there could be lots of false positives and negatives. For example, when I add a bunch of code in a finally block, and the function I'm adding that code to is already exercised by a unit test, my line coverage goes up without me actually writing any more tests. Conversely, if I'm adding that finally block to a function that is not covered, my line coverage goes down. In either case, do the corresponding line coverage increases and decreases actually mean anything about the quality of the tests written?
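
The arithmetic behind that is easy to sketch. The numbers below are made up for illustration (a 100-line code base with 60 lines covered, and a 10-line finally block), and the sketch assumes the new block contains no branches, so an existing test that enters the function executes all of it.

```python
def line_coverage(covered_lines, total_lines):
    """Line coverage as a percentage of total lines."""
    return 100.0 * covered_lines / total_lines


# Hypothetical starting point: 60 of 100 lines are covered.
baseline = line_coverage(60, 100)

# Ten new lines in a finally block of a function that tests already call:
# the existing tests execute them, so covered and total both grow by ten.
added_to_covered = line_coverage(60 + 10, 100 + 10)

# The same ten lines added to a function that no test calls:
# only the total grows.
added_to_uncovered = line_coverage(60, 100 + 10)

if __name__ == "__main__":
    print(f"baseline:              {baseline:.1f}%")
    print(f"added to covered fn:   {added_to_covered:.1f}%")
    print(f"added to uncovered fn: {added_to_uncovered:.1f}%")
```

In both cases no test was added, removed, or changed, yet the metric moves by several percentage points.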

What are good indicators of unit test quality if line coverage is misleading? As someone who writes a lot of unit tests, I would venture that test quality has some correlation to assertion density, with some caveats. In other words, what and how much is being checked when a method is tested? Assuming that the tested method returns a value, there is at a minimum one thing to check. If the value is a structure, there is more.

In any case, high assertion density usually means that verification is being taken seriously, and also that any changes to the code have to pass all assertions - or the assertions need to be changed to match the new code. Either case requires explicit validation of the contract put in place by the assertions in the unit test. Note that assertion density is only valid when measuring direct output -- if a test is verifying data that is not a direct output of the method being tested, is it condoning code side effects? Assertion density needs to be normalized by the number of acceptable assertions, i.e. the number of things you can check in the return value, if there is a return value. The assertion density metric should score badly if data that is not explicitly related to the method output is being checked. But maybe that would be conflating the concerns of side-effect-free code and high-quality unit testing.
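
One way such a normalized score could look is sketched below. Everything here is an assumption made for illustration: the function name `assertion_density`, its parameters, and especially the penalty rule (each assertion on a side effect cancels one assertion on direct output). It is not an established metric and no coverage tool reports it.

```python
def assertion_density(direct_assertions, checkable_values, side_effect_assertions=0):
    """Hypothetical normalized assertion density, in the range 0.0 to 1.0.

    direct_assertions:      assertions made on the method's return value
    checkable_values:       number of things that can be checked in that value
    side_effect_assertions: assertions on data that is not a direct output
    """
    if checkable_values <= 0:
        raise ValueError("the method must expose at least one checkable value")

    # Assertions beyond the number of checkable values add nothing.
    useful = min(direct_assertions, checkable_values)

    # Assumed penalty: each side-effect assertion cancels one useful assertion.
    score = (useful - side_effect_assertions) / checkable_values
    return max(0.0, score)


if __name__ == "__main__":
    print(assertion_density(1, 1))     # scalar return value, checked once
    print(assertion_density(2, 4))     # structure with four fields, two checked
    print(assertion_density(4, 4, 2))  # all fields checked, plus two side-effect checks
```

Whether the penalty belongs in this number at all is exactly the conflation question raised above; dropping the `side_effect_assertions` term gives the plain normalized version.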

Another metric in unit testing that correlates to good coverage is conditional branch coverage. If I can assume that every block of code may contain one or more possibly nested conditional statements, then I know that I've at least got decent coverage when a high percentage of conditional branches are covered. I don't think that branch coverage means a lot without assertion density checks, but it does mean a lot more than simple line coverage. Ironically, Cobertura provides branch coverage, but all of the QA managers I've worked with have gravitated towards line coverage as the more meaningful metric.
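
The gap between the two metrics shows up even in a five-line function. The sketch below uses Python's `sys.settrace` as a toy line tracer (this is not how Cobertura works, and the helper `executed_offsets` and the function `clamp_to_zero` are hypothetical). A single call with a negative argument executes every line of the body, so line coverage is complete, yet the path where the `if` is false is never taken.

```python
import sys


def executed_offsets(func, *args):
    """Return the line offsets (relative to the def line) executed by func(*args)."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if frame.f_code is code and event == "line":
            seen.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return seen


def clamp_to_zero(x):   # offset 0
    result = x          # offset 1
    if x < 0:           # offset 2
        result = 0      # offset 3
    return result       # offset 4


BODY = {1, 2, 3, 4}

negative_only = executed_offsets(clamp_to_zero, -1) & BODY
positive_only = executed_offsets(clamp_to_zero, 5) & BODY

if __name__ == "__main__":
    print("lines hit with x = -1:", sorted(negative_only))
    print("lines hit with x = 5: ", sorted(positive_only))
```

A suite containing only the negative case reports every body line as covered while exercising one of the two outcomes of the condition. Branch coverage would flag the missing outcome; line coverage cannot.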

Ideally I would like to see a number based on assertion density and branch coverage. This number would behave well across a wide range of assertion and branch coverage input, sort of like the half-your-age-plus-seven dating metric. That would make it meaningful, and a good measurement to drive test quality 'up and to the right'.

1 comment:

  1. Actually I've been thinking about this a little more, and what makes the most sense is conditional coverage + line coverage, measured together. If both go down, that's bad. If one goes down, that bears investigation. I wouldn't say it's automatically bad.

    If one goes up, that's good, though not as good as both going up; worst case, your coverage has stayed the same. If both go up, that implies that coverage has gotten better.