Some empirically derived testing principles
Here are some principles I have found useful when testing Hypothesis.
I don’t promise any of these are universally good; all I promise is that
each of them has resulted in my finding bugs that I would otherwise have
missed, and that together they seem to give me a lot more confidence in
my software than I otherwise would have.
1. 100% coverage is mandatory.
This is a slight lie in that I do have the occasional nocover
or ignore pragma sprinkled around the code, but those generally occur in
places that I’ve tried really hard to cover and that are just not
possible to hit reliably.
People seem to think that coverage is a bad metric that doesn’t really
tell you a lot about your testing. There’s a line I use here, which I
believe is stolen from someone on the SQLite project:
100% coverage tells you nothing, but less than 100% coverage tells you something.
Code that is not tested almost certainly contains bugs. Code that
is tested probably still contains bugs, but the probability is
lower. Therefore, all other things being equal, code with 100% coverage
is likely to contain fewer bugs.
This is essentially operating on the principle that metrics make great
constraints and lousy goals. The goal is not to maximize coverage, the
goal is to test our project well. 100% coverage is the starting point,
not the finish line.
(“Coverage” here means “branch coverage”. In reality it means “the
finest-grained coverage metric that you can get out of your tooling and
on which it’s meaningful to claim 100% coverage”.)
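As a rough illustration, here is a minimal sketch of what enforcing this looks like with coverage.py; the pragma and the commands are standard coverage.py usage, but the function is a made-up example, not Hypothesis’s actual code.

```python
# A minimal sketch, assuming coverage.py: run with branch coverage and a
# hard floor, so that anything below 100% fails the build:
#
#   coverage run --branch -m pytest
#   coverage report --fail-under=100

def _die_horribly():  # pragma: no cover
    # Hypothetical example of the escape hatch: a path I've tried hard
    # to cover but that no test can hit reliably.
    raise SystemExit("unreachable in any deterministic test")
```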
2. All bugs result in tests.
The core principle of this is hopefully both self-explanatory and
uncontroversial. Step 1 to fixing a bug is to write a test for it.
Otherwise the bug might reoccur.
But it goes deeper. Every bug is actually two bugs: A bug in your
software, and a bug in your test suite.
The former is obvious. A bug in your software is a bug in your software.
The latter is more subtle: It’s a bug in your test suite because your
test suite didn’t find this bug, and that indicates a failure of your
testing.
So. You’ve found a bug. Why didn’t your test suite catch it?
What general failure of testing does this indicate, and can you fix that
failure in a way that would catch other instances of similar bugs?
As a concrete example: In Hypothesis’s data serialization layer, one of
the invariants is that an attempt to deserialize bad data can only raise
a BadData exception. Any other exception is a bug. This invariant is
used to guarantee that Hypothesis can always read from old databases -
the worst case scenario is that you can’t reuse that data, not that it
crashes your tests.
In preparing for the 1.1 release I found an instance where this wasn’t
the case. This then caused me to write some generic tests that tried
fuzzing the data Hypothesis reads in order to find exceptions that
weren’t BadData. I found 5 more.
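The test looked roughly like the following sketch; deserialize() and BadData here are made-up stand-ins for the real serialization layer, but the shape of the property is the same.

```python
# A minimal sketch of the fuzz test described above. deserialize() and
# BadData are illustrative stand-ins, not Hypothesis's real internals.
import json

from hypothesis import given
from hypothesis import strategies as st


class BadData(Exception):
    """The only exception deserialization is allowed to raise."""


def deserialize(data):
    # Wrap every anticipated failure in BadData so that callers see one
    # well-defined error rather than an arbitrary internal exception.
    try:
        return json.loads(data.decode("utf-8"))
    except (UnicodeDecodeError, ValueError) as e:
        raise BadData(str(e))


@given(st.binary())
def test_arbitrary_bytes_only_raise_baddata(data):
    try:
        deserialize(data)
    except BadData:
        pass  # Refusing bad data is fine; any other exception is a bug.
```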
3. Builds should fail fast.
There’s nothing more frustrating than getting to the end of a long build
and having it fail on something trivial that you could have found out
about 30 seconds in.
Linting is a major culprit here. Lint should run at the beginning of
builds, not at the end. I also have a separate entry in the build matrix
which runs only a fast subset of the tests and checks the results for
100% coverage. This means that if I’ve forgotten to test some area of the
code I find out fast rather than at the end.
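As a sketch of what that ordering might look like (flake8, pytest, pytest-cov, and the paths here are illustrative assumptions, not Hypothesis’s actual build):

```python
# A minimal fail-fast build runner: cheap checks first, the expensive
# full suite last, stopping at the first failure.
import subprocess
import sys

STEPS = [
    ["flake8", "src", "tests"],          # lint: seconds, catches the trivial
    ["pytest", "tests/fast",             # fast subset, with a coverage gate
     "--cov=src", "--cov-branch", "--cov-fail-under=100"],
    ["pytest", "tests"],                 # full suite only once the rest pass
]

for step in STEPS:
    if subprocess.run(step).returncode != 0:
        sys.exit(1)  # fail fast: skip everything downstream
```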
4. Principles should never be satisfied by accident.
Suppose your test suite catches a bug in your change with a failing
test. Is that test actually the one that should have caught it?
This particularly comes up with internal APIs, I find. Testing internals
is important, and if a bug in an internal was only caught because of a
bug it caused in the public API, that internal could use a more specific
test.
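For example (with a made-up internal helper, not one from Hypothesis): if _normalize_bounds below had a bug, a direct test pins the failure to it immediately, instead of surfacing as a confusing failure in some public API test.

```python
# A hypothetical sketch: _normalize_bounds is an invented internal helper.

def _normalize_bounds(lo, hi):
    # Internal invariant: callers always get back an ordered pair.
    return (lo, hi) if lo <= hi else (hi, lo)


def test_normalize_bounds_returns_ordered_pair():
    for lo, hi in [(0, 1), (1, 0), (5, 5), (-3, 2)]:
        a, b = _normalize_bounds(lo, hi)
        assert a <= b
        assert {a, b} == {lo, hi}
```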
This also plays well with the faster build step for coverage. By forcing
coverage to happen on simple examples it ensures that code is covered
deliberately rather than as a result of testing other things. This helps
offset the problem that you can often get to 100% coverage without ever
really testing anything.
Flaky tests are often an example of something happening by accident.
Sometimes it’s just that the test is bad, but often it’s that there’s
some hard-to-trigger condition that could use a dedicated test that hits
it reliably.
5. You need both very general and very specific tests.
Tests need both breadth and depth. Broad tests are things like
Hypothesis which ensure that your code is well behaved in every
scenario (or at least as close to every scenario as they can), but they
tend to have very shallow definitions of “well behaved”. Highly specific
tests can assert a lot more about what the code should do, but as a
result they hold in a much narrower set of circumstances.
These tend to catch different classes of bugs.
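As a sketch of the two styles side by side (using sorted() as a neutral example rather than anything Hypothesis-specific):

```python
from hypothesis import given
from hypothesis import strategies as st


# Broad: runs against (nearly) every scenario, but its definition of
# "well behaved" is shallow -- only that the output is ordered.
@given(st.lists(st.integers()))
def test_sorted_output_is_ordered(xs):
    out = sorted(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))


# Specific: a single scenario, but a much stronger claim about the result.
def test_sorted_specific_example():
    assert sorted([3, 1, 2, 1]) == [1, 1, 2, 3]
```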
So far, to the best of my knowledge, nobody has come up with a way of
testing things that achieves both breadth and depth at the same time,
other than “do a massive amount of work manually writing tests”, but you
can get both by writing both kinds of tests, and this is worth
doing.
6. Keep trying new ways of testing things.
It’s been my experience that every time I try to come up with a new way
of testing Hypothesis, I find at least one new bug in it. Continually
coming up with new and creative ways of testing things is a great way to
stay ahead of your users in finding bugs.
7. There is no substitute for hard work.
Unfortunately, quality software is hard. I spend at least as much effort
testing Hypothesis as I do writing it, and there’s probably no way
around that. If you’re taking quality seriously, you need to spend
as much effort on quality as on functionality.