Quality Assurance Histrionics

Debating the Merits of Test Coverage Metrics
Monday, July 25, 2016

Once upon a time, in a land not distant at all, my friend and I had the privilege of maintaining a legacy code base that neither of us had any previous exposure to. As you can probably appreciate if you have any experience with legacy code bases, some of the stuff wasn’t as clean and tidy as you might want it to be. A lot of people had obviously worked on the project. You’d find it difficult to discover the original intent behind many of the design decisions you’d encounter.

There was a suite of unit tests, but at first glance it didn’t look very comprehensive. To my surprise, however, having dabbled in the source code for a few minutes, my friend exclaimed: “The code coverage is actually pretty good – almost 100%!” He had used a tool that tells you what proportion of your code gets executed by your tests…

What coverage?

Once the initial excitement faded and I regained my composure, I realized that you had to give whoever developed the tests credit for one thing: Efficiency. One test of no more than a dozen lines covered thousands and thousands of lines of production code.

The problem, as quickly became obvious, was that the test didn’t actually verify anything useful at all. It invoked a method that retrieved data from the application’s database, assembled a huge object graph representing the domain model, and returned it – and then it merely asserted that the returned value wasn’t null.
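For illustration, the test amounted to something like this sketch (NUnit syntax; all the names are invented – this is a reconstruction of the idea, not the actual code):

// Hypothetical reconstruction – DomainModelRepository and DomainModel are made-up names.
// The test executes all of the system’s layers against a real database...
[Test]
public void LoadDomainModel_ReturnsSomething()
{
    var repository = new DomainModelRepository();
    DomainModel model = repository.LoadAll(); // assembles the entire object graph

    Assert.IsNotNull(model); // ...and this is the only thing it ever checks
}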

For starters, this obviously wasn’t a unit test. It talked to an actual database and executed all of the system’s layers. It was more of an integration test, maybe in this case actually an automated system test. That in and of itself wouldn’t, in my opinion, represent a problem – I believe that automated integration testing can be immensely valuable.

What did turn out to be highly problematic, though, was that you could easily nuke the entire production code base and replace it with a method that returned any object whatsoever – as long as the return value wasn’t null (it could, for example, just be the string “Hello world!”), the test would still pass. After a few minutes, we had to conclude that while we had very high code coverage, virtually all of the tests were completely pointless. What we utterly and completely lacked was what I called meaningful coverage.

What does ‘meaningful coverage’ really mean?

Most software developers will (hopefully) agree that code coverage alone won’t do the trick. Running a line of code doesn’t verify that it does what it’s supposed to do. You need more than that. People have called it functional coverage or simply test coverage. For the purposes of this article, I call it meaningful coverage, and I’ll show another example before I attempt to define it, at least informally.

In one of my applications, I needed to JSON-serialize data transfer objects. I used Newtonsoft Json.NET, and the following code snippet helped me achieve my goal:

// Requires the Newtonsoft.Json and System.IO namespaces
var stream = new MemoryStream();
var serializer = new JsonSerializer();
serializer.NullValueHandling = NullValueHandling.Ignore; // omit null-valued properties from the output
using (var writer = new StreamWriter(stream))
{
    serializer.Serialize(writer, value);
} // disposing the writer flushes the serialized JSON into the stream
return stream.ToArray();

Simple enough, eh? I wrote a unit test that executed this code to serialize a known value, deserialized the resulting byte array, and asserted that all property values of the deserialized instance equaled those of the original one. So far so good – you definitely can’t claim that this test is pointless. It achieves 100% code coverage, and there’s certainly no obvious way to fool it as there was in the earlier example. It appears to have good, meaningful coverage, right?
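For concreteness, the round-trip test looked roughly like this sketch (NUnit syntax; JsonSerialization.Serialize is a hypothetical wrapper around the snippet above, and Dto is a made-up DTO class):

// Requires Newtonsoft.Json, NUnit.Framework, and System.Text.
// A made-up DTO for illustration:
public class Dto { public int Id { get; set; } public string Name { get; set; } }

[Test]
public void Serialize_RoundTripsAllProperties()
{
    var original = new Dto { Id = 42, Name = "Alice" };

    byte[] bytes = JsonSerialization.Serialize(original); // the snippet above
    var deserialized = JsonConvert.DeserializeObject<Dto>(Encoding.UTF8.GetString(bytes));

    Assert.AreEqual(original.Id, deserialized.Id);
    Assert.AreEqual(original.Name, deserialized.Name);
}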

Close but no cigar. It looks perfect, until you realize that the following snippet also passes the test:

// Identical to the snippet above, except that the NullValueHandling
// line has been commented out:
var stream = new MemoryStream();
var serializer = new JsonSerializer();
//serializer.NullValueHandling = NullValueHandling.Ignore;
using (var writer = new StreamWriter(stream))
{
    serializer.Serialize(writer, value);
}
return stream.ToArray();

That’s right: I can comment out a production line, and my test remains happy. Unfortunately, the line was there for a reason. The application hashes the serialized data, and I need the ability to add new properties to existing classes without changing the hash values of existing serialized instances – that’s why I wanted null property values to be ignored. It turns out I need an additional test to verify this behavior. (I do have one in my suite, by the way.)
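A sketch of what that additional test might look like (same hypothetical names as above): serialize an instance with a null property and assert that the property doesn’t appear in the output at all.

[Test]
public void Serialize_OmitsNullProperties()
{
    var value = new Dto { Id = 42, Name = null };

    byte[] bytes = JsonSerialization.Serialize(value);
    string json = Encoding.UTF8.GetString(bytes);

    // With NullValueHandling.Ignore in place, "Name" must not occur in the JSON.
    // Comment out the production line, and you get "Name":null – and a red test.
    StringAssert.DoesNotContain("Name", json);
}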

Rather than attempting to provide a formal definition of my meaningful coverage concept (which I probably couldn’t do, and even if I could, it would probably be obscure), I shall state two rules of thumb that can help you achieve it (or so I think):

  1. When evaluating test coverage, you have to use your requirements as your starting point, whether high-level domain requirements or local requirements for individual classes and methods – as opposed to starting with the code under scrutiny
  2. Whenever you can find a way to obviously introduce a bug into your production code without breaking any of your tests (the way I did by commenting out the line), you have identified a gap in meaningful coverage

The first point could be used as a convincing argument for test-first or test-driven development: A reasonable way to avoid designing your test suite around the code you’re trying to test is to write the tests before the code even exists. The second point is a bit of a “negative definition” – it describes the circumstances that prove your coverage is incomplete.

Are you saying that code coverage is a completely useless metric?

No. I like to think of code coverage as a necessary but not sufficient condition of good, meaningful coverage: insufficient code coverage implies you’re not achieving meaningful coverage, but the converse doesn’t hold – high code coverage doesn’t mean that you’re doing an excellent job in terms of meaningful coverage.

The problem is that while attaining high code coverage is often easy, making your tests useful and meaningful usually isn’t. In the example from the beginning of this article, a single test of, say, a dozen lines covered a significant proportion of the code base. Actually testing a system that complex could easily demand tens of thousands of lines of test code – three orders of magnitude more.

The thing that makes code coverage attractive and popular is that it’s easily measurable – you can use a tool that gives you a number, and you shoot for 100%. Project managers and other non-developers can understand and “manage” it. So they do.

Could a similar metric be devised for “meaningful coverage”? I very much doubt it. The difficulty is that meaningful coverage should stem from real-world requirements – from intent. We’re getting to the very heart of what we as software developers do. Evaluating meaningful coverage will almost always take a qualitative, interpretivist approach, as it has the ambition to determine whether or not the test suite verifies that the software does what its users want it to do. (Note the word want – it signifies intent.) If you could automate this, you’d be very close to automating software development itself.

The thing that makes meaningful coverage unattractive and unpopular is that it’s not easily measurable – you can’t use a tool that gives you a number, and you shoot for 100%. Project managers and other non-developers can’t understand and “manage” it. So they don’t :-)

Quality assurance theater

You may have heard of security theater. It refers to situations where people make things look like security is being assured when in reality very little of what actually makes a difference takes place. This is how Wikipedia defines the concept, citing a book by Bruce Schneier that I haven’t read:

Security theater is the practice of investing in countermeasures intended to provide the feeling of improved security while doing little or nothing to actually achieve it.

The article mentions the “airport security repercussions due to the September 11 attacks” as a possible example of security theater.

I feel that we as software professionals are sometimes guilty of a fallacy that has a number of parallels with security theater – I call it quality assurance theater. Just take the Wikipedia definition, and substitute quality for security:

Quality assurance theater is the practice of investing in countermeasures intended to provide the feeling of improved quality while doing little or nothing to actually achieve it.

Okay, maybe ‘countermeasures’ isn’t the most suitable term here – but I think you get the idea. The example from the beginning of this article, in my opinion, epitomizes this: The developers managed to create an impression of very good test coverage without actually verifying a goddamn thing :-)

Who or what is to blame?

I’m not saying all of this happens all the time, but I have encountered it repeatedly. I believe the easy measurability of code coverage and the demand for testing (on the part of management, clients, and developers themselves) represent the immediate causes.

You want to test. Who doesn’t?

Perhaps your project manager believes that “if you can’t measure it, you can’t manage it”. It’s a quote from W. Edwards Deming, a management guru, so it must be correct, right? Alas, the full quote actually reads (emphasis mine) “it is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth”. (No offense to your project manager – maybe she’s one of the enlightened ones – I did say ‘perhaps’, okay?)

Anyway, so you want to test, and you want to do so in a measurable way. The problem is that you end up inadvertently creating a perverse incentive.

A serious limitation of testing is that there’s no easy way to measure its actual quality. All the “hard” metrics should be taken with a generous grain of salt.

Upshot

I’ve had a lot of success with quality assurance – whether developer testing, QA specialists on software teams, or even dedicated QA departments. I consider test-first and test-driven development one of the most useful tools in a programmer’s toolbox.

Having said that, I find the hype and peer pressure surrounding it counterproductive. TDD has been such a big thing that, in some circles, admitting to not practicing it all the time, let alone harboring doubts about its universal applicability, is tantamount to self-ostracization.

It would be easy to conclude that you should always bite the bullet and do things “the right way”. I, however, agree with David Heinemeier Hansson that such “fundamentalism is like abstinence-only sex ed: An unrealistic, ineffective morality campaign for self-loathing and shaming.”

Not all software projects are created equal, and not all of us work on life-critical systems where mistakes and bugs are not an option.

I believe it should be legit to say that your budgetary, time, or other constraints don’t allow you to develop a comprehensive suite of automated tests and that, for a given (non-life-or-mission-critical) project, you will have to resort to manual or informal testing only. Hopefully, it would result in less pretentiousness and less quality assurance theater.