David R. MacIver's Blog
Hypothesis short term road map
As some of you might be aware, I authored a Python testing library called hypothesis.
It’s basically QuickCheck for Python. I first wrote it within about a month of my learning Python, so it has some... eccentricities (I still maintain that the fact that it has its own object model with multiple dispatch is totally legit, but I admit that the prototype based inheritance feature in it was probably misguided), but I flatter myself into thinking it’s not only pretty good but is by some small margin possibly the most advanced library of its ilk.
It’s also pretty neglected. I haven’t actually released a version of it in just over a year. Part of this is because I was pretty excited about testmachine and was considering it as the successor to hypothesis, but then I didn’t do any work on testmachine either. Basically, for a variety of reasons 2014 was a pretty rough year and I didn’t get much done on my open source work during it.
Despite this neglect, people seem to have started using hypothesis. I’ve got a bunch of issues opened, a few pull requests, and pypi stats are telling me it’s had more than 500 downloads in the last month.
And so, I appear to have failed to avoid success at all costs, and will instead embrace success and resume development on hypothesis.
So here’s the plan.
In the next few days there will be a point release of hypothesis. It will not be large - it will clear up a few bugs, maybe have one or two minor enhancements, improve Python 3 support, and generally demonstrate that things are moving again.
In the next month or two there are a variety of larger features I want to work on. Some of them are pretty damn exciting, and if I get all of them working then I’m going to remove the weasel words from the above and simply say flat out that hypothesis is the most advanced library of its kind.
In rough order of least to most exciting, I’m going to be working on:
Logging
This is super unsexy and to be honest I will probably not be inspired
enough to work on it immediately, but I very much want it to get done as
it’s both important and something people have actually asked for:
Hypothesis needs to report information on its progress and what it’s
tried. It’s important to know what sort of things hypothesis has tried
on your code - e.g. if it’s only managed to generate a very small number
of examples.
Some sort of top level driver
Hypothesis is right now fundamentally a library. If you want to actually
use it to write tests you need to use it within pytest or similar.
Usage like this will 100% continue to be supported,
encouraged and generally considered a first class citizen, but I would
like there to be some sort of top level hypothesis program as well so
you can get a more tailored view of what’s going on and have a better
mechanism for controlling things like timeouts, etc.
Improved data generation
The data generation code is currently both ropy and not very
intelligent. It has two discrete concepts - flags and size - which
interact to control how things are generated. I want to introduce a more
general and unifying notion of a parameter which gives you much more
fine grained control over the shape of the distribution. This should
also improve the coverage a lot, make this more easily user
configurable, and it may even improve performance because currently data
generation can be a bit of a bottleneck in some cases.
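To make the idea of a parameter a little more concrete, here is a minimal sketch. It is not how hypothesis works or will work: the names `draw_parameter` and `draw_list`, and the particular knobs in the parameter, are all hypothetical. The point is only that a parameter is drawn once per example and then shapes the distribution that example is drawn from, so different examples come from differently shaped distributions.

```python
import random


def draw_parameter(rng):
    """Draw a hypothetical parameter describing the shape of a list of ints."""
    return {
        # Average length of the generated list (mean 10 across parameters).
        "average_length": rng.expovariate(1 / 10.0),
        # Probability that any given element is negated.
        "negative_probability": rng.random(),
        # Upper bound on the absolute value of each element.
        "magnitude": rng.choice([1, 10, 1000, 10 ** 6]),
    }


def draw_list(rng, parameter):
    """Draw a list of ints from the distribution described by `parameter`."""
    average = parameter["average_length"]
    # Keep adding elements with probability p = a / (1 + a), which gives a
    # geometric length distribution whose mean is `average`.
    continue_probability = average / (1 + average)
    result = []
    while rng.random() < continue_probability:
        value = rng.randint(0, parameter["magnitude"])
        if rng.random() < parameter["negative_probability"]:
            value = -value
        result.append(value)
    return result
```

Under these assumptions, one parameter might produce long lists of small positive numbers and the next short lists of huge negative ones, which is the kind of variety that a single fixed distribution tends to miss.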
Merging the ideas from TestMachine
TestMachine is pretty awesome as a concept and I don’t want it to die.
It turns out I have this popularish testing library that shares a lot in
common with it. Let’s merge the two.
One thing that I discovered with TestMachine is that making this sort
of thing work well with mutable data is actually pretty hard, so this
will probably necessitate some improved support around that. I suspect
this will involve a bunch of fixes and improvements to the stateful
testing feature.
Remembering failing test cases
One of the problems with randomized testing is that tests are inherently
flaky - sometimes a passing test will become a failing test, sometimes
vice versa without any changes to the underlying code.
A passing test becoming a failing test is generally fine in the sense that it means that the library has just found an exciting new bug for you to fix and you should be grateful.
A failing test becoming a passing test on the other hand is super annoying because it makes the bug much harder to reproduce and fix.
One way to deal with this is to have your library generate test cases that can be copied and pasted into your non-randomized test suite. This is the approach I took in testmachine and it’s a pretty good one.
Another approach that I’d like to explore instead is the idea of a test database which remembers failing test cases. Whenever a small example produces a failure, that example should be saved and tried first next time. Over time you build up a great database of examples to test your code with.
This also opens the possibility of giving hypothesis two possible run
modes: One in which it just runs for an extended amount of time looking
for bugs and the other in which it runs super quickly and basically only
runs on previously discovered examples. I would be very interested in
such an approach.
Support for glass-box testing
Randomized property based testing is intrinsically black box. It knows
nothing about the code it’s testing except for how to feed it
examples.
But what if it didn’t have to be?
American Fuzzy Lop is a
fuzz tester, mostly designed for testing things that handle binary file
formats (it works for things that are text formats too but the verbosity
tends to work against it). It executes roughly the following
algorithm:
- Take a seed example.
- Mutate it a bit.
- Run the example through the program in question.
- If this produces bad behaviour, output it as a test case.
- If this produces a new interesting state transition in the program, where a state transition is a pair of positions in the code with one immediately following the other in this execution, add it to the list of seed examples.
- Run ad infinitum, outputting bad examples as you go.
This produces a remarkably good set of examples, and it’s 100% something
that hypothesis could be doing. We can detect state transitions using coverage and generate
new data until we stop getting new interesting examples. Then we can
mutate existing data until we stop getting new interesting examples from
that.
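The loop described above can be sketched in pure Python. This is a toy illustration, not American Fuzzy Lop's implementation and not hypothesis code: it assumes examples are `bytes`, treats any exception as bad behaviour, and records transitions between lines with `sys.settrace` rather than a real coverage tool. The names `transitions`, `mutate` and `fuzz` are hypothetical.

```python
import random
import sys


def transitions(function, example):
    """Run function(example); return (set of line transitions, crashed?)."""
    seen = set()
    previous = [None]

    def tracer(frame, event, arg):
        if event == "line":
            current = (frame.f_code.co_name, frame.f_lineno)
            # A transition is a pair of consecutively executed positions.
            seen.add((previous[0], current))
            previous[0] = current
        return tracer

    old_tracer = sys.gettrace()
    sys.settrace(tracer)
    try:
        function(example)
        failed = False
    except Exception:
        failed = True
    finally:
        # Restore whatever tracer was installed before we started.
        sys.settrace(old_tracer)
    return seen, failed


def mutate(rng, example):
    """Return a copy of a bytes example with one random byte-level edit."""
    data = bytearray(example)
    choice = rng.randrange(3)
    if choice == 0 or not data:
        data.insert(rng.randint(0, len(data)), rng.randrange(256))
    elif choice == 1:
        del data[rng.randrange(len(data))]
    else:
        data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def fuzz(function, seeds=(b"",), iterations=2000, seed=0):
    """Mutate seed examples, keeping those that reach new transitions."""
    rng = random.Random(seed)
    corpus = list(seeds)  # must be non-empty
    known = set()
    failures = []
    for example in corpus:
        known |= transitions(function, example)[0]
    for _ in range(iterations):
        candidate = mutate(rng, rng.choice(corpus))
        seen, failed = transitions(function, candidate)
        if failed:
            failures.append(candidate)
        elif not seen <= known:
            # New transition: this example becomes a seed for later mutation.
            corpus.append(candidate)
        known |= seen
    return corpus, failures
```

A real version would run until it stops finding new transitions rather than for a fixed number of iterations, and would need a much cheaper way of collecting coverage than a Python-level trace function.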
This sort of looking inside the box will allow one to have much greater confidence in hypothesis’s ability to find interesting failures. Currently it just executes examples until it has executed a fixed number or runs out of time, which is fine but may mean that it stops too early.
The existing behaviour will remain as an option. Initially it will stay the default, but once the coverage-guided mode is robust enough, that will become the default instead. The old behaviour will remain useful for cases where you want to e.g. test C code and thus can’t get reliable coverage information.
This would also work well with the test database idea, because you
could prepopulate the test database with minimal interesting
examples.
Probably some other stuff too
For example, in the above I keep getting the nagging feeling that
hypothesis needs a more general notion of settings to support some of
these, so I will likely be doing something around that. There’s also
some code cleanup that really could use doing.
It’s currently unclear to me how long all of this is going to take and whether I will get all of it done. Chances are also pretty high that some of these will turn out to be bad ideas.
If any of you are using or are considering using hypothesis, do let me know if any of these seem particularly exciting and you’d like me to work on them. I’m also open to suggestions of other features you’d like to see included.