Some empirically derived testing principles
Here are some principles I have found useful when testing Hypothesis.
I don’t promise any of these are universally good; all I promise is that
each of them has resulted in my finding bugs that I would otherwise have
missed, and that together they seem to give me a lot more confidence in
my software than I otherwise would have.
1. 100% coverage is mandatory.
This is a slight lie in that I do have the occasional nocover
or ignore pragma sprinkled around the code, but those generally occur in
places that I’ve tried really hard to cover and that are just not
possible to hit reliably.
People seem to think that coverage is a bad metric that doesn’t really
tell you a lot about your testing. There’s a line I use here, which I
believe is stolen from someone on the SQLite project:
100% coverage tells you nothing, but less than 100% coverage tells you something.
Code that is not tested almost certainly contains bugs. Code that
is tested probably still contains bugs, but the probability is
lower. Therefore, all other things being equal, code with 100% coverage
is likely to contain fewer bugs.
This is essentially operating on the principle that metrics make great
constraints and lousy goals. The goal is not to maximize coverage, the
goal is to test our project well. 100% coverage is the starting point,
not the finish line.
(“Coverage” here means “branch coverage”. In reality it means “the
finest-grained coverage metric that you can get out of your tooling and
on which it’s meaningful to claim 100% coverage”.)
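As a rough illustration, here is a minimal sketch of what enforcing this looks like with coverage.py; the pragma and the commands are standard coverage.py usage, but the function is a made-up example, not Hypothesis’s actual code.

```python
# A minimal sketch, assuming coverage.py: run with branch coverage and a
# hard floor, so that anything below 100% fails the build:
#
#   coverage run --branch -m pytest
#   coverage report --fail-under=100

def _die_horribly():  # pragma: no cover
    # Hypothetical example of the escape hatch: a path I've tried hard
    # to cover but that no test can hit reliably.
    raise SystemExit("unreachable in any deterministic test")
```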
2. All bugs result in tests.
The core principle of this is hopefully both self-explanatory and
uncontroversial. Step 1 to fixing a bug is to write a test for it.
Otherwise the bug might reoccur.
But it goes deeper. Every bug is actually two bugs: A bug in your
software, and a bug in your test suite.
The former is obvious. A bug in your software is a bug in your software.
The latter is more subtle: It’s a bug in your test suite because your
test suite didn’t find this bug, and that indicates a failure of your
testing.
So. You’ve found a bug. Why didn’t your test suite catch it?
What general failure of testing does this indicate, and can you fix that
failure in a way that would catch other instances of similar bugs?
As a concrete example: In Hypothesis’s data serialization layer, one of
the invariants is that an attempt to deserialize bad data can only raise
a BadData exception. Any other exception is a bug. This invariant is
used to guarantee that Hypothesis can always read from old databases -
the worst case scenario is that you can’t reuse that data, not that it
crashes your tests.
In preparing for the 1.1 release I found an instance where this wasn’t
the case. This then caused me to write some generic tests that tried
fuzzing the data Hypothesis reads in order to find exceptions that
weren’t BadData. I found 5 more.
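The test looked roughly like the following sketch; deserialize() and BadData here are made-up stand-ins for the real serialization layer, but the shape of the property is the same.

```python
# A minimal sketch of the fuzz test described above. deserialize() and
# BadData are illustrative stand-ins, not Hypothesis's real internals.
import json

from hypothesis import given
from hypothesis import strategies as st


class BadData(Exception):
    """The only exception deserialization is allowed to raise."""


def deserialize(data):
    # Wrap every anticipated failure in BadData so that callers see one
    # well-defined error rather than an arbitrary internal exception.
    try:
        return json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as e:
        raise BadData(str(e))


@given(st.binary())
def test_arbitrary_bytes_only_raise_baddata(data):
    try:
        deserialize(data)
    except BadData:
        pass  # Refusing bad data is fine; any other exception is a bug.
```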
3. Builds should fail fast.
There’s nothing more frustrating than getting to the end of a long build
and having it fail on something trivial that you could have found out
about 30 seconds in.
Linting is a major culprit here. Lint should run at the beginning of
builds, not at the end. I also have a separate entry in the build matrix
which runs only a fast subset of the tests and checks the results for
100% coverage. This means that if I’ve forgotten to test some area of the
code I find out fast rather than at the end.
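As a sketch of what that ordering might look like (flake8, pytest, pytest-cov, and the paths here are illustrative assumptions, not Hypothesis’s actual build):

```python
# A minimal fail-fast build runner: cheap checks first, the expensive
# full suite last, stopping at the first failure.
import subprocess
import sys

STEPS = [
    ["flake8", "src", "tests"],          # lint: seconds, catches the trivial
    ["pytest", "tests/fast",             # fast subset, with a coverage gate
     "--cov=src", "--cov-branch", "--cov-fail-under=100"],
    ["pytest", "tests"],                 # full suite only once the rest pass
]

for step in STEPS:
    if subprocess.run(step).returncode != 0:
        sys.exit(1)  # fail fast: skip everything downstream
```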
4. Principles should never be satisfied by accident.
Suppose your test suite catches a bug in your change with a failing
test. Is that test actually the one that should have caught it?
This particularly comes up with internal APIs, I find. Testing internals
is important, and if a bug in an internal was only caught because of a
bug it caused in the public API, that internal could use a more specific
test.
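For example (with a made-up internal helper, not one from Hypothesis): if _normalize_bounds below had a bug, a direct test pins the failure to it immediately, instead of surfacing as a confusing failure in some public API test.

```python
# A hypothetical sketch: _normalize_bounds is an invented internal helper.

def _normalize_bounds(lo, hi):
    # Internal invariant: callers always get back an ordered pair.
    return (lo, hi) if lo <= hi else (hi, lo)


def test_normalize_bounds_returns_ordered_pair():
    for lo, hi in [(0, 1), (1, 0), (5, 5), (-3, 2)]:
        a, b = _normalize_bounds(lo, hi)
        assert a <= b
        assert {a, b} == {lo, hi}
```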
This also plays well with the faster build step for coverage. By forcing
coverage to happen on simple examples it ensures that code is covered
deliberately rather than as a result of testing other things. This helps
offset the problem that you can often get to 100% coverage without ever
really testing anything.
Flaky tests are often an example of something happening by accident.
Sometimes it’s just that the test is bad, but often it’s that there’s
some hard-to-trigger condition that could use a dedicated test that hits
it reliably.
5. You need both very general and very specific tests.
Tests need both breadth and depth. Broad tests are things like
Hypothesis which ensure that your code is well behaved in every
scenario (or at least as close to every scenario as they can), but they
tend to have very shallow definitions of “well behaved”. Highly specific
tests can assert a lot more about what the code should do, but as a
result they hold in a much narrower set of circumstances.
These tend to catch different classes of bugs.
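As a sketch of the two styles side by side (using sorted() as a neutral example rather than anything Hypothesis-specific):

```python
from hypothesis import given
from hypothesis import strategies as st


# Broad: runs against (nearly) every scenario, but its definition of
# "well behaved" is shallow -- only that the output is ordered.
@given(st.lists(st.integers()))
def test_sorted_output_is_ordered(xs):
    out = sorted(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))


# Specific: a single scenario, but a much stronger claim about the result.
def test_sorted_specific_example():
    assert sorted([3, 1, 2, 1]) == [1, 1, 2, 3]
```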
So far, to the best of my knowledge, nobody has come up with a way of
testing things that achieves both breadth and depth at the same time,
other than “do a massive amount of work manually writing tests”, but you
can get both by writing both kinds of tests, and this is worth
doing.
6. Keep trying new ways of testing things.
It’s been my experience that every time I try to come up with a new way
of testing Hypothesis, I find at least one new bug in it. Continually
coming up with new and creative ways of testing things is a great way to
stay ahead of your users in finding bugs.
7. There is no substitute for hard work.
Unfortunately, quality software is hard. I spend at least as much effort
testing Hypothesis as I do writing it, and there’s probably no way
around that. If you’re taking quality seriously, you need to spend
as much effort on quality as on functionality.