David R. MacIver's Blog: Generating provably solvable instances for testing solvers

Generating provably solvable instances for testing solvers

Do you wanna test a SAT solver? (It doesn't have to be a SAT solver)

17 June 2026

There’s a technique in property-based testing that almost nobody seems to know about but me. There are a variety of reasons for this, but most of them are that nobody really seems to innovate in what sorts of properties to test, and this one only works painlessly in Hypothesis (and Hegel). It’s useful any time you have something that looks like a “solver” or “optimiser” - something that takes a problem description, and outputs some solution to that problem. For example, scheduling a meditation retreat.

Testing a SAT solver never misses solutions

To take the central example of such problems, suppose that you want to test a SAT solver, a completely normal and reasonable thing that normal and reasonable people do.¹

What are some reasonable property-based tests you might want to write for this?

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import MySolver


@st.composite
def sat_problems(draw):
    n_variables = draw(st.integers(1, 100))
    literals = st.integers(1, n_variables) | st.integers(-n_variables, -1)
    clauses = st.lists(literals, min_size=1, unique_by=abs)
    return draw(st.lists(clauses, unique_by=frozenset, min_size=1))


@settings(verbosity=Verbosity.verbose)
@given(sat_problems())
def test_every_model_solves_the_problem(problem):
    solver = MySolver(problem)
    assume(solver.solve())
    model = solver.get_model()
    assert model is not None
    model = set(model)
    for clause in problem:
        assert not set(clause).isdisjoint(model)

That is, we construct our problem in DIMACS CNF - every clause is represented as a list of integers, with -n meaning that n is assigned false, and n meaning n is assigned true - and pass it to the solver. If the solver says that the SAT problem is solvable, then we get a model out of it. This is a list of each variable, positive if the variable is set true, negative if it is set false.² Our test asserts that every clause is satisfied - that is, one of the literals in it also appears in the model.

This is a perfectly reasonable test, but it should raise some concerns.

Firstly, we’ve got that assume in there which is an assumption based on the results of calling the solver. That’s not intrinsically bad, but whenever we’ve got an assume like that we should worry that there’s a decent chance that Hypothesis is going to yell at us about filtering too much.³

But the real worry I’d like to point out here is this: We’ve tested a weaker property than we’d ideally like to. This property essentially says that when the SAT solver says a problem is solvable, it’s right.⁴ But we’d like to check the other direction too: When the solver says a problem isn’t solvable, it’s right. How might we do that?

One easy answer to this is differential testing. Take some other SAT solver, and whenever our solver claims a problem is unsolvable, check if the other SAT solver agrees. If it doesn’t, there’s a bug in one of the solvers.

If we didn’t have another SAT solver though,⁵ how might we do that?

Well, we just write a more specific generator, one that only generates solvable SAT problems. This sounds hard, until you realise the trick, which results on the following very deep and important insight: Every solvable problem has a solution.

This is useful because you can generate that solution first, and work backwards from there.

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import MySolver

@st.composite
def solvable_sat_problems(draw):
    n_variables = draw(st.integers(1, 100))
    solution = [
        draw(st.sampled_from((i, -i))) for i in range(1, n_variables + 1)
    ]
    literals = st.integers(1, n_variables) | st.integers(-n_variables, -1)

    @st.composite
    def clauses(draw):
        base = draw(
            st.lists(st.sampled_from(solution), min_size=1, unique_by=abs)
        )
        used = set(map(abs, base))
        unused_literals = literals.filter(lambda n: abs(n) not in base)
        extra = draw(st.lists(unused_literals, unique_by=abs))
        clause = base + extra
        return draw(st.permutations(clause))

    return draw(st.lists(clauses(), unique_by=frozenset, min_size=1))


@settings(verbosity=Verbosity.verbose)
@given(solvable_sat_problems())
def test_every_model_solves_the_problem(problem):
    solver = MySolver(problem)
    assert solver.solve()
    model = solver.get_model()
    assert model is not None
    model = set(model)
    for clause in problem:
        assert not set(clause).isdisjoint(model)

The strategy here is a bit more complicated, so let me walk you through it: First, at the start, we generate a solution to the SAT problem. This is just an assignment of the variables.

Then when we generate a clause, we do it in two parts: First we generate the part of the clause that our chosen solution satisfies, then we generate some extra list of literals whose variables are not used in that solution, and then just for good measure we mix up the clauses.⁶ This means that, by construction, every clause is satisfied by our chosen solution, and as a result every problem generated by this is necessarily solvable.

We then change our existing test to just replace the assume with an assert, because now all our problems are solvable by construction.

This test is complete, in the sense that every solvable SAT problem can in theory be generated by this,⁷ because every SAT problem has a satisfying assignment, and if we start by generating some such assignment then we can always generate the target SAT problem. This means that if a SAT solver passes the logically-infinite version of this test then we know that it solves every solvable SAT problem.

It doesn’t however handle the other side of this, that is that whenever a problem is unsolvable the SAT solver says it’s unsolvable. I don’t actually have a very good answer for how to write those tests other than differential testing.⁸ There are a few interesting properties, e.g. if you add clauses to an unsolvable SAT problem it should stay unsolvable,⁹ but nothing very satisfying.

Still, I think the solvable case is often the more interesting side in practice.

Extending this to optimisers

Another problem domain in this class is optimisers, which have to give you not just a solution but the best solution. This technique has a natural extension there: We generate a solution, then we generate a problem to which that is a solution, and then we run the optimiser and see how well it does. If it does worse than our starting problem, that’s a bug.

For example, consider the knapsack packing problem: I have an integer capacity, and a list of integer weights, and I want to fit the maximum amount of stuff into my knapsack.

from hypothesis import assume, given
from hypothesis import strategies as st
from my_solver import KnapsackSolver

@st.composite
def knapsack_problems_with_score(draw):
    solution = draw(st.lists(st.integers(min_value=0), min_size=1))
    extra_capacity = draw(st.integers(min_value=0))
    capacity = sum(solution) + extra_capacity
    rest = draw(st.lists(st.integers(min_value=extra_capacity + 1), min_size=1)))
    weights = draw(st.permutations(solution + rest))
    return weights, capacity, sum(solution)

@given(knapsack_problems_with_score())
def test_always_beats_known_achievable_score(problem):
    weights, capacity, to_beat = problem
    solver = KnapsackSolver(weights, capacity)
    solution = solver.get_solution()
    assert sum(solution) >= to_beat

Unlike the SAT solver, the knapsack solver always succeeds - it can just return the empty list of weights - but the question is how well it does. We don’t know the optimal solution here, but we do know a solution, and the optimal solution has to be at least as good as that.¹⁰

Why this is Hypothesis-model-specific

The big guarantee that the Hypothesis property-based testing model gives you is that you can write generators and everything will more or less just work. They will shrink reasonable well, and you will never be given a shrunk test case that couldn’t have come from the generator.

This sort of test-case generation where you first generate something and then you do dependent generation on it is the classic place where other approaches to deriving shrinkers just straight up doesn’t work, and if you try to do type-based shrinking on the result then you will lose your guarantee that the problem is solvable. For example if you generate [[-1], [2]] and then try to shrink it to [[-1], [1]], now you’ve got an insoluble SAT problem. This will never happen in Hypothesis, but will absolutely happen in QuickCheck.

I think there’s something further than this though, which is that people have got very used to the idea that you are trying to write property-based tests whose generators range over the set of all valid inputs to your program. This sort of technique where you construct more specialised generators that are designed to explore only a subset of the valid inputs (and as a result can guarantee stronger properties hold there) seems underexplored, and I suspect there is a lot of easy low-hanging fruit for new properties lurking in it as a result.

As a reminder, a SAT solver is a tool that takes some boolean formula and tries to find an assignment of its variables that causes it to return true. Essentially all SAT solvers operate in CNF, “conjunctive normal form”, which means that they are expressed as a big and over a set of ors. e.g. a & (b | !c)
↩︎
This is the standard PySAT API, but is also just a very common way of representing this.
↩︎
It doesn’t in this case. This strategy almost always generates solvable SAT problems. That stops being the case if we set the max number of variables smaller, but it’s still always basically fine.
↩︎
Technically it says that when it says the problem is solvable, the model it claims solves it is right, which is a slightly stronger property.
↩︎
Mostly because this is an instance of a general class of problem rather than because we might realistically not have a SAT solver
↩︎
We could equally well sort by variable or something, I just wanted to make sure we avoided putting all the base clauses first.
↩︎
In practice, the distribution of problems could definitely use some work.
↩︎
Which can be against a guaranteed-correct brute force version if you restrict the number of variables to be small.
↩︎
This is a form of metamorphic testing.
↩︎
You can also regard this as a straightforward solver for the decision problem of “Does this problem admit a solution with at least this score?”
↩︎