David R. MacIver's Blog
How I've been using Claude Code
I wrote a comment on lobste.rs about how we’re using Claude Code on Hegel. Various people have asked me to turn it into a blog post. This isn’t exactly that, as I’d like to talk about how I’ve been using Claude Code more broadly, but it should have the content people were asking for.
Claude use in Hegel
To start with, a disclaimer: None of this is official Antithesis policy. We’ve been running a bit ahead of the pack in how we use Claude on Hegel, because it’s open source (and thus we’re somewhat less concerned about the code being leaked, since being leaked is, well, what it’s for) and greenfield, so small enough for the team working on it to keep on top of the code. Everyone else in the company is being much more cautious in their adoption of agentic coding than we are with Hegel.
Anyway, we’ve been using Claude Code a lot. I’d say that something in the region of 90% of code I’ve “written” for Hegel has been written by a Claude, at least in its first draft. Sometimes the second draft is me going in and tearing all the bullshit Claude has done apart and fixing it properly. More often though it’s me telling Claude what to fix, or me going in and doing a targeted rewrite of some particularly egregiously wrong bit.
Although some people seem to have decided that the words mean “anything made with an LLM”, this is not vibecoding and little to none of it is slop. We’ve reviewed the code ourselves, and heavily dictated its design. Hegel was extremely not “Claude, make me a property-based testing library”. I designed the protocol, Liam and I designed the API together, but we got Claude to do a lot of the actual line-by-line writing of the code.
This has gone great. Hegel works really well, and we have been able to develop it much faster than we ever would by hand.
But if we’d just gone down the “Claude, make me a property-based testing library” route, it would have been a disaster. Claudes can probably just about port minithesis, and we’re unironically hoping that a future model will be good enough to port enough of Hypothesis for us to rewrite hegel-core in Rust, but most of the early work on Hegel where we gave a Claude a bit of free rein was a complete mess. e.g. all the original protocol implementation is hand-written by me, because I gave a Claude a detailed spec of the protocol and it decided that the spec was too complicated, that it would take a more “pragmatic” approach, and shoved every message down the control stream.
Instead, we have a standard pull-request workflow, where a human reviews all the code. Actually, two humans, because first the person making the pull request reviews all the code that a Claude wrote on their behalf. We still find places where Claude comically fucked something up and we failed to catch it. Often this is my fault: I’m very used to code reviewing in a high trust environment, where I can trust that the person who wrote the code is basically competent and well-intentioned and was actually trying to succeed. This means that I’m looking for a different set of things than you need when reviewing AI, or even junior human, code - high level misunderstandings and problems with the design rather than “took a shortcut that completely undermines the entire point of the feature”.
Also, prior to the code being written, we’ve decided more or less what we want the code to do. We don’t necessarily know how to achieve it, but we know what we want the API to be, and we know roughly how we want it to achieve that.
A decent recent example is the better output printing pull request I made. It massively improves the quality of hegel-rust’s test output. I probably couldn’t have written this code. I’m OK at Rust, but I’m shit at Rust macros. Claude, in contrast, is pretty good at Rust macros. So I figured out enough to know what I needed to do (define the API and the way the macro rewrites it), told Claude to do it, and then went through all the ways it could go wrong and made sure there were tests for them, spotted a few more edge cases that neither I nor the Claude had thought of, and am genuinely pretty happy with the code (and, more importantly, delighted with the results of running it) despite, if I’m honest, not fully understanding all the macro code.
Speaking of, testing. One of the things that ensures Claude doesn’t go completely off the rails is making sure that the code is actually tested, and we review those tests, if anything, more thoroughly than we review the code.
In order to ensure there’s enough testing, we set minimum coverage to 100%. I basically think there’s no good reason to have untested code in a project with AI working on it.
Unfortunately, Claude disagrees. It’s become a bit of a running joke that I’m the guy who is constantly yelling at Claude to write tests and yes I really mean 100%. On hegel-rust, coverage is currently moderately short of 100% because I discovered that in the early days before we were as careful about reviewing as we are now, Claude had decided that it wasn’t pragmatic to enforce 100% coverage, and lowered the number. Normally it lowers the number to 98% or something. In this case it lowered it to uh… 30%. We’ve not fully fixed it yet and have introduced a ratchet script that forces the number of uncovered lines down to zero.
Originally, rather than the ratchet, I tried to get a Claude to just fix the testing, but there was so much slop in the tests it wrote that I eventually gave up on getting those mergeable, and decided the ratchet was the better option. We’ll gradually fix the coverage over time, because the number can mostly only go down, and as we work on a particular area of code we’ll refactor it towards testability at the same time.
BTW one thing you will notice about the ratchet is that the script carries explicit instructions that Claude is not allowed to increase the ratchet or edit the script unless a human explicitly says so. That works… most of the time. Once you’ve fenced Claude in enough that it actually has to get to 100% coverage, it will still often decide that testing something is too hard and just try to exclude it instead. I’ve not found a better solution than human review yet, but I’m still working on it.
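To make the shape of this concrete, here is a minimal sketch of the kind of coverage ratchet I mean. The file names and the use of coverage.py’s JSON report format are illustrative assumptions on my part, not Hegel’s actual script:

```python
# Sketch of a coverage ratchet: CI fails if the number of uncovered
# lines goes up, and the committed limit is tightened whenever it goes
# down. File names and report format are illustrative assumptions.
import json
from pathlib import Path

RATCHET_FILE = Path("uncovered_lines.ratchet")

def current_uncovered(coverage_json: Path) -> int:
    """Count uncovered lines in a coverage.py JSON report."""
    data = json.loads(coverage_json.read_text())
    return sum(len(f["missing_lines"]) for f in data["files"].values())

def check_ratchet(coverage_json: Path) -> int:
    """Return a nonzero exit code if coverage has regressed."""
    allowed = int(RATCHET_FILE.read_text())
    actual = current_uncovered(coverage_json)
    if actual > allowed:
        print(f"FAIL: {actual} uncovered lines, ratchet allows {allowed}")
        return 1
    if actual < allowed:
        # The limit only ever goes down; nothing here can raise it.
        RATCHET_FILE.write_text(f"{actual}\n")
        print(f"Ratchet tightened: {allowed} -> {actual}")
    return 0
```

The key property is that the script has no code path that raises the limit: loosening the ratchet requires a human editing the committed file, which is exactly the kind of change that stands out in review.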
In general, writing quality code using Claude is this constant battle and balancing act. It enables you to do so much more - every single project I’ve used Claude on has far better setup and infrastructure than almost anything I worked on pre-Claude - but requires a level of constant vigilance to actually get the outcomes you want, and that vigilance will often slip. Sometimes you can automate it, and Claude is actually very good at helping you automate it, but it still requires a great deal of human attention.
I don’t feel like I’m doing less work as a result of using Claude on Hegel - if anything I’m working harder than I otherwise would have - but I’m definitely getting a lot done for that work.
Other projects
I’ve had a variety of other projects I’ve done with Claude, of varying degrees of quality. Most of them I’ve abandoned, because the results were not good enough, or because I lost interest after playing around with it for a while. There have been a few success stories though.
One is this website! I suppose technically it’s too early to call that a success story, but what determines whether I abandon it is really whether I want to keep writing it. The actual software is, at worst, fine, and I definitely wouldn’t have completed the migration or got the new design working without Claude support. I hate doing web development - it’s not my forte, and I don’t want to invest the time and effort required to be good at it. Also, the actual conversion process involved lots of fiddly little details that I could absolutely have done myself but was going to keep procrastinating on indefinitely if I didn’t.
Shrinkray has had some big refactors and a new UI using Claude, and that’s proven pretty great. The UI has improved greatly as a result of doing this.
I’ve also done some of the inevitable “I got a Claude to write me tools for using more Claude” work that comes with getting into AI agents. My current best attempt is pr-manager which is a tool for helping me keep on top of my many open pull requests. I don’t recommend it to anyone else, but it’s worked great for me.
There are also some partial success stories:
- I wrote a Slay the Spire mod. It lets you try out different decks in different fights. I like it conceptually, I found it sort of useful, but it was definitely still buggy when I stopped working on it and now Slay the Spire 2 is out and I’m not very interested in resuming work on it. I think there’s a decent chance that if Megacrit don’t add a similar feature (which I’ve no reason to believe they will) I’ll port this mod to StS2 when there’s a good modding story, but I’m not currently motivated to do this.
- A dynamic random sampler. This is a datatype from a paper that I’ve wanted a solid implementation of for ages. It’s a way of sampling from a discrete random distribution where you can efficiently update the weights. I think the implementation Claude has produced is really solid as far as I can tell - I keep coming up with new ways to test it and it keeps passing them. This required a degree of yelling at Claude and making it fix problems, but I haven’t actually done more than skim the implementation code. This is only a partial success because I haven’t actually had a use case for it since writing it, but I’ll report back when I do.
- I ported redo to rust. I don’t really have a use case for this if I’m honest, I was just curious if it would work, and to the best of my ability to tell it worked great. I’ve always liked the idea of redo, but didn’t really want to use a codebase that hadn’t been maintained in 7 years and only ran on Python 2. Turns out, I still don’t want to use redo even after that, but I’ve given it significant consideration. I’m somewhat tempted to ask Claude to rebuild the build system for this website on top of redo-rs.
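For a flavour of what the dynamic sampler above does (though not how: this is my own illustrative stand-in, not the paper’s data structure or Claude’s implementation), here is a minimal sketch using a Fenwick tree, which supports O(log n) weight updates and O(log n) sampling:

```python
# Illustrative stand-in for a dynamic weighted sampler, NOT the paper's
# data structure: a Fenwick (binary indexed) tree over the weights
# supports O(log n) weight updates and O(log n) sampling.
import random

class DynamicSampler:
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        self.weights = [0.0] * self.n
        self.total = 0.0
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, weight):
        """Set the weight of item i, in O(log n)."""
        delta = weight - self.weights[i]
        self.weights[i] = weight
        self.total += delta
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        r = rng.random() * self.total
        idx, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = idx + bit
            # Descend the implicit tree, skipping past prefixes that sum to <= r.
            if nxt <= self.n and self.tree[nxt] <= r:
                idx = nxt
                r -= self.tree[nxt]
            bit >>= 1
        return idx
```

This is exactly the kind of thing where property-based testing shines: sample a lot, update some weights, and check the empirical frequencies track the weights, including that zero-weight items are never drawn.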
I’ve had a bunch of other side-experiments that ended up not going anywhere. e.g. I wrote a tool for training perfect pitch which worked pretty well, but I mostly lost interest in the project. I was working on a game of exploring a solar system with realistic gravity, but ended up a bit too nerd-sniped by trying to get the details working before I lost interest. These sorts of things are normal for creating software projects, but Claude has definitely accelerated the process.
A fortuitously timed case study in bugfinding
While I was writing this, cfbolz reported a bug in shrinkray without a reproducer, so here is a log of how I used Claude Code to fix it.
I pointed a Claude at the issue and asked it to diagnose it and write me some tests reproducing the problem. Its diagnosis was mostly wrong - it pointed me far enough in the right direction to find the problem, but it very clearly didn’t understand what was actually going on - and its tests were mostly bad, but they did successfully reproduce the problem, which would have been tedious for me to do by hand.
Explaining to it why it was wrong also turned out to be useful: it made the bug much more obvious to me, like a sort of advanced mode reverse rubber duck.
In the end I kept one of the two tests it wrote, after significant rewriting, and didn’t even try to get it to suggest a fix and just wrote one myself once I understood the problem.
I still count this as a win, because I really didn’t want to figure out how to reproduce this bug, and would absolutely have put off working on this without that reproducer.
I then discovered a bunch of human errors:
- The branch protection rule set was not set up correctly to prevent merging when status checks fail. I discovered this because I clicked merge with failing status checks without noticing.
- As a result of this, the build was apparently already failing on main and I hadn’t noticed.
- Also, I’d failed to push because of some history rewriting and didn’t notice, and as a result merged an earlier version of the PR.
Altogether, not my finest hour, but that’s what I get for distractedly working on an issue when I’m meant to be writing a blog post.
I then pointed Claude (using pr-manager) at the pull request and told
it to make it green. Turns out, one of the problems we were seeing was
the result of a part of my fix that I’d tried early on, discarded as not
relevant, and left in (I’d changed a call from is_reduction
to is_interesting because I thought the non-determinism of
the changing test case might be the problem after Claude claimed it was.
It wasn’t). It fixed that, and then discovered that the coverage job was
also failing.
It then did what Claudes always do with problems: Declared it a pre-existing failure unrelated to its changes. In its defence, this was true for this particular instance, and was the right thing for it to investigate. Also the problems turned out to be my fault.
One of the previous failures on main is that a previous Claude had written a bunch of imports inside functions. Claudes fuckin’ love doing this, and it’s baffling to me how much they do it despite being repeatedly told not to, so there’s a lint that catches that. Apparently, and I’m not entirely sure why, there were still some in main, so I’d fixed those. Except apparently somehow in fixing these (and I really can’t blame Claude for this, I must have done some sort of over-eager deletion, but I’m a little baffled as to how) I’d deleted some of the tests in that file which were previously covering those lines.
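For what it’s worth, the lint half of this doesn’t need to be clever. Here is a hedged sketch of how such a check might look (the function name is mine; I haven’t seen the actual lint):

```python
# Hypothetical sketch of a lint that flags imports inside function
# bodies. This is how such a check *might* look, not the actual lint.
import ast

def find_function_level_imports(source: str):
    """Return (enclosing function name, line number) per offending import."""
    tree = ast.parse(source)
    offences = []
    seen = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if isinstance(inner, (ast.Import, ast.ImportFrom)):
                    # Dedupe imports in nested functions, which ast.walk
                    # would otherwise report once per enclosing function.
                    if inner.lineno not in seen:
                        seen.add(inner.lineno)
                        offences.append((node.name, inner.lineno))
    return offences
```

The autofix is the harder half (hence libcst, which preserves formatting when rewriting), but even a check-only version of this, wired into CI, is enough to stop the imports creeping back in.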
At this point I reverted the file to its previous state and did what I should have done before: Asked a Claude to write an autofix script using libcst and ran ‘just format’.
At this point, I had a green branch, and I merged it. Eventually, the build completed on main, and… the autorelease failed, because I’d made the branch ruleset stricter. I needed to change the autorelease script to use a GitHub app so that I could add branch protection rules. Claude walked me through the manual bit, made the changes needed to the release, issued a new pull request, and once that was green, I merged it.
The release failed again, because I’d gotten “app secret” and “private key” confused and configured the wrong one. I figured this out by pointing Claude at the error and it told me what I’d done wrong. I fixed the problem, reran the job, and finally shrinkray released.
Anyway, this story probably doesn’t convince you that I’m a very serious software engineer whose opinions on software quality you should take seriously. That’s OK. I’m mostly not trying to convince you of anything in this post, I’m just telling you what I do.
I would say in my defence that this was unusually bad for me, and that I don’t normally ship production fixes interspersed with writing a blog post and cleaning the house. But it’s certainly representative of me on a bad day. I don’t work on software correctness tools and processes because I’m natively good at making things that work, I work on them to corral the disorganised little chaos monkey that I am when left to my own devices.
I would also say that it was about 4 hours from bug report to bug fix, while I was doing multiple other things, and the project was left in a better state on multiple axes than it was at the start of this process.
And I do think in this whole scenario, Claude was clearly a net positive. It provided me with a good reproducer, acted to some degree as an external source of state about what was going on, and helped me ratchet my processes to be higher quality with the new formatter. I could 100% have done this without Claude, but I probably wouldn’t have done it today without Claude, and I definitely wouldn’t have had the patience or inclination to do the GitHub workflows wrangling and would probably have just turned off branch protections without Claude to do that bit for me.
Some general reflections on using Claude
It’s definitely changed my working habits. Sometimes for good, sometimes for bad.
One of the things I’ve noticed is that it makes my productivity much more robust to environments that disrupt my attention. This is good because I’ve started working full-time in an office again. I still need short bursts of focused time when trying to figure something out or review some code, but because I’m orchestrating Claude work in conversation, I can externalise a lot more of the thinking state, and this is very helpful.
Unfortunately this also means that I often feel the need to work while doing other things because it makes me so much better at multi-tasking. You saw this in the above scenario: All of that was done interspersed with writing this piece and also cleaning the house. It very much wasn’t getting my focused time, and in the beforetimes it just wouldn’t have got my time at all.
Once recently I even caught myself waking up and doing some work from bed, which is somewhat unprecedented for me.
I think there are a couple of factors feeding into this. Some of it is just getting over-excited with a new toy, and that is already normalising. The other is that it does feel a bit like a continuous partial attention game, and as a result it keeps drawing my attention back to it.
Some of it is also just that I’ve got a bit obsessed with my work recently. It happens. It’s mostly a good thing. It will settle down in a few months.
In general, I’m not super worried about this at the moment and expect I’ll figure out healthy habits and ways of working with it as time goes on, but I’m not there yet.
Agent etiquette
A lot of people have had very negative experiences with AI coding. Mostly I think they’ve had very negative experiences with other people’s AI coding. I’m not surprised, but it hasn’t been my experience. Part of why it’s not been my experience is that I have, for the most part, either been using this stuff reasonably carefully (and working with people I trust to do the same) or on solo projects, and another part is that I treat it more like a tool than an author. Certainly it’s written a lot of code for me, and sometimes when it’s written a lot of code for me all at once it’s even gone well, but ultimately I’m responsible for the code it produces.
I think a lot of the negative experiences people have are from coworkers that are not taking that responsibility. I do not have a high opinion of people doing this, and I think they should be held to that responsibility whether they take it or not.
A few months ago, after my latest round of calling Claude a lazy little shit for how it was behaving with regards to coverage, someone said I was being very harsh and asked if I would say these things about it if it were a person. I answered that I would be much harsher with a person who behaved like Claude did, and instead of calling them a lazy little shit, I would use words like “fundamentally untrustworthy” and “when are we going to fire them?”.
And I basically think that if you’re regularly submitting agent-authored code without vetting it yourself, you should probably be held to this standard too.
So, I think the first part of using agentic coding on a team is this: You’ve got to read the code yourself before anyone else does. You should review it at least as thoroughly as if a junior coworker had written it before you hand it off to anyone else. If you’re not doing that, you’re causing problems.
This doesn’t mean that you need to fully understand the code well enough to have written it yourself. There are bits it’s more OK to gloss over. e.g. I’m bad at reading build code. This has absolutely caused problems, and probably will continue to cause problems, but I think it’s still ended up a lot better than the alternative.
In general I think the big thing with code written by agents is that you need to decide your slop tolerance level. One-off scripts, tools for just yourself, things that will never run in production, these can be a lot sloppier than anything you definitely 100% need to work, and you should focus your time and energy more on the latter.
One consequence of this is that yes, you will spend a lot more time reviewing code. First “your own” that the agent produced, then again your coworkers’ code. I don’t think you can really skip this with the current generation of agents. Yes, you can get agents to review other agents’ code. I think it’s even a good idea to do so. I also think it doesn’t decrease the amount of work you need to spend on code review much, it just increases the quality of the end result, because it’s definitely not good enough that you can afford to skip a human check.
I think you should use it anyway
A lot of people are really put off by agentic coding because of their bad experiences with it being used badly, and I do agree that it is currently very easy to use badly and somewhat hard to use well, but I think it’s worth it and will only get more pervasive and, hopefully, better from here.
One of the things I keep finding is that agentic coding, while not (yet?) the miracle software factory that its proponents want it to be, really is transformative in a bunch of key ways. Principally:
- A lot of things that you always knew would be good to do for your project but were just a bit too much of a pain in the ass are things that an agent can more or less one shot.
- It offers specific workflows that would have just seemed somewhat magical if you’d looked at them a few years ago.
Note the absence of a third item: “it’s really good at writing code and you should definitely use it to write all your code”. I’ve found that worthwhile, but it’s far more clearly a trade-off than the more mundane use cases, even if it’s the one that everyone is super excited about.
Here are what I think of as things that are obviously worth everyone’s time today to try:
- Rebase this branch onto main for me
- Fix the build on this PR
- Write me a script to do quality task X in this codebase (e.g. custom linting rules, custom formatting, script for parsing coverage output)
- Sort out my project infra (e.g. write me a justfile for common tasks, set up github actions)
- Port this code from the old deprecated method to the new supported method
- Read the code and look for out of date documentation
- Here’s a bug report, write me a test that reproduces it
- Investigate XYZ problem
Many of these won’t work 100% of the time, but even when they fail they will probably give you a good starting point for succeeding.
I also think that they’re very good for coding tasks that you otherwise just won’t do, and that are better done badly than not done at all. e.g. the website port, but also, for me, anything to do with GitHub Actions is done only under the greatest of protests, and the fact that Claude makes it easy is a genuinely huge upgrade to my quality of life when developing.
A lot of the complaints about generative AI are that they’re taking over the most creative bits of our job. They can do, if you let them, and sometimes it’s worth letting them, but with agentic coding in particular I think they’re actually very good at removing the drudge work from our jobs. Maybe they’ll take our jobs eventually, and there are a whole bunch of things I’m worried about, but I don’t think we fix that by ignoring the genuine improvements that are here today and hoping they’ll go away.
Some of the build infrastructure is definitely closer to slop, but I think that’s OK.
↩︎Mostly by arguing a lot.
↩︎Yes, yes, except the Python thing. We know the Python thing is weird. But also the Python thing is 0% Claude’s fault, that was entirely on me, and I still stand by it as the right call.
↩︎Every time a Claude says it’s being pragmatic, you know shenanigans are about to occur.
↩︎If you’re curious what this means, here’s the protocol reference. But the short answer is that Hegel multiplexes many logical connections across a single actual transport layer, so as to allow lots of really cheap short-lived connections and easily support concurrency. It also has a single central stream which you’re supposed to use sparingly mostly for messages that let you know when a new test has started. That is not how Claude set it up.
↩︎A little too much perhaps. I don’t know how we’re going to get anything nearly as good in the other languages.
↩︎I have read all the macro code, and convinced myself that it looks like it’s doing the right thing, but I certainly couldn’t reproduce it myself without a lot of work. This is a crucial detail of “human review everything” that we’re still finding the right balance on: it doesn’t mean that you necessarily understood all the code. You could require that, but I’m not sure it would be the right trade-off. Not always understanding it has certainly caused us some problems, though. Especially in hegel-go, as I’m much worse at Go than Rust.
↩︎Or, at least, that’s the ideal. There is always a temptation to LGTM on tests. I try to resist it. Liam probably does a better job of resisting it than I do.
↩︎As an aside about this… one of the things I’m constantly surprised by when writing in languages that aren’t Python is how many things in Python I took for granted are just far better than the alternatives, especially in the testing ecosystem.
To me, “100% coverage” has the standard meaning of “Tool reports 100% branch coverage, assertions and other unreachable constructs excluded, nocov comments allowed only in extreme circumstances and subject to justification in a comment and at review time”. This is straightforward in Python. In every other language, I’ve run into at least one of:
- “100% coverage” isn’t actually a thing, because coverage reports all sorts of lines as uncovered that can’t possibly be covered (e.g. structural elements)
- No branch coverage
- No configurable exclusions
- No exclusions at all!
Which means that on basically every non-Python project I’ve wanted to do this, I’ve ended up getting Claude to write a custom script for parsing coverage output, because the built-in tooling was not good enough for asserting the invariant that I want.
↩︎Honesty compels me to admit that I’ve received more bug reports with shrinkray since doing this, but also many of those bug reports are in code that I don’t think Claude has touched, so it’s unclear to me how much this is increased usage, possibly partly because it’s nicer software now, and how much is that shrinkray has genuinely got buggier. It was never the most reliable software in the first place.
↩︎Which was 100% in code that I wrote, and that has been sitting happily there for well over a year without anyone triggering it, which is evidence for the “increased usage” theory.
↩︎Part of the problem here is that GitHub’s new fancy branch protection rules are insane and make it remarkably difficult to say “Yes obviously I don’t want to be able to merge with any failing status checks”: they require you to manually add each status check, without even providing you a list to click on. Fuck sake. Anyway probably what happened is that I set up these rules before the first PR run with the new workflows and as a result couldn’t add them at that time.
↩︎Also, previously, written by Claude
↩︎The first edition violated the repo’s zizmor checks.
↩︎Antithesis is very into this. Also the London office sucks right now. It was fine when we were a third of the size that we are, but now we’ve grown. We’re moving offices in a few weeks and I expect to need this feature of it much less.
↩︎I don’t normally allow my laptop in my bedroom, but sometimes when I can’t sleep well I watch some videos - usually slay the spire streamers - to fall asleep to, and I assume that was what happened the previous night.
↩︎e.g. cookie clicker or, ironically, Universal Paperclips.
↩︎e.g. I more or less buy the argument that taking away the drudge work is quite bad for juniors, though I think it’s More Complicated Than That.