Testing, Without Drudgery · Extreme AI Programming #7

Every approach to testing I have tried, and not one that lasted. A compromise the whole industry made and never quite admitted. Five thousand green tests that prove nothing at all. The getter that returns what the setter set. Behaviour, not method. The oldest drudgery in the craft, lifting all at once.

I have been writing and shipping software for over thirty-five years, and in all that time I have never once solved testing. Not for want of trying. I have worked through rigid test pyramids, test-driven development as a discipline, behaviour-driven development with its Gherkin syntax, end-to-end frameworks of every generation, heavy mocking and minimal mocking, property-based testing, mutation testing, snapshot testing. I adopted each one with real conviction. I have not found a single approach we could sustain, in a working team producing real software under real deadlines, for years on end without the testing infrastructure quietly becoming a second full-time job.

The problem was always economic, and it was always the same. Writing tests takes time. Maintaining them takes more. The tests that protect you most are usually the most expensive to write, and the tests that are cheapest to write usually protect you least. So teams compromise, because they have to. They unit-test the easy parts, skip the hard parts, and hope. The result is a profession that has, for thirty years, treated thorough automated testing as an aspiration rather than a baseline. Great teams test beautifully. Most teams test inconsistently. Plenty of teams barely test at all, and the release pain that follows is borne quietly, distributed unevenly, and rarely traced back to the compromise that caused it.

That compromise has just ended, and I am not sure we have digested what it means.

AI coding agents are unusually, almost embarrassingly, good at writing tests. They write them for code they have just produced. They write them for legacy code the team inherited and has been afraid to touch for years. They write the tedious long tail of edge cases and error paths that human engineers reliably skip because writing them is dull. The marginal cost of one more test, when an agent does the work, is close to nothing. The economic pressure that pushed every team I have ever run toward “enough testing to pass review” rather than “enough testing to actually verify the thing works” has simply gone.

This rearranges more than it first appears to. When the unit cost of a test collapses, a decision teams used to make thousands of times a year flips its sign. “This is too expensive to test properly, so cover the critical path and move on” becomes “there is no real reason not to test this properly”. The economics of the testing pyramid shift, and the confidence floor under continuous deployment rises with them.

There is a catch, and it must be discussed before anyone gets comfortable. None of this is a permission slip to stop thinking.

Five thousand green tests that prove nothing

Ask an agent to “write tests for this code” and it will reliably hand you back a large number of tests, almost all of them green, and the coverage number will climb. I watch this happen in our own work constantly. What the agent has produced, a good deal of the time, is a collection of tautologies dressed up as verification.

A test that asserts a getter returns what the setter set is not a test. It is the code restated in a second syntax. A test that mocks every dependency and then checks that the function called the mocks is not a test either. It is a check that mocks still mock. A test that asserts the code does what the code currently happens to do is a photograph of the present pinned up as an assertion, and it will fail the moment anyone refactors and stay silent the moment anything is genuinely wrong. None of these catch a single bug. They inflate the coverage figure and hand the team a sense of safety that is worse than no tests at all, because it disguises the fact that the thing you believe you are testing is not being tested.
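A small sketch makes the point concrete. The `Account` class and its `ledger` collaborator below are hypothetical, invented purely for illustration, and `withdraw` carries a deliberate bug: it never checks that the funds exist. Both tests are of the kind described above, and both stay green.

```python
from unittest.mock import Mock


class Account:
    """Hypothetical class with a deliberate bug: withdraw() allows overdrafts."""

    def __init__(self):
        self._balance = 0

    def set_balance(self, value):
        self._balance = value

    def get_balance(self):
        return self._balance

    def withdraw(self, amount, ledger):
        self._balance -= amount  # bug: no check that funds are available
        ledger.record(amount)


def test_getter_returns_what_setter_set():
    # Tautology: the code restated in a second syntax.
    account = Account()
    account.set_balance(100)
    assert account.get_balance() == 100


def test_withdraw_calls_ledger():
    # Mock echo: checks only that the mock was called with what we passed in.
    account = Account()
    account.set_balance(100)
    ledger = Mock()
    account.withdraw(500, ledger)
    ledger.record.assert_called_once_with(500)


# Both tests pass, although the second leaves the account overdrawn by 400.
test_getter_returns_what_setter_set()
test_withdraw_calls_ledger()
```

Every line of `Account` is executed, so the coverage figure looks healthy, and the overdraft goes unnoticed.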

Behaviour, not method

Effective tests are a different animal, and the difference stops being subtle once you have seen it. An effective test describes what the system is supposed to do, not how it currently does it. It fails when the system produces the wrong result and only then, which means it survives a rewrite of the implementation because it never knew how the implementation worked in the first place. In a real sense the suite is the specification of the software, written in a form a machine can run.

There is a trap hiding in that sentence. If the agent writes the tests by reading the code, the suite is not the specification of anything. It is a mirror of whatever the code already happens to do, the snapshot tautology from a moment ago, now produced at scale and wearing a coverage badge. For a test to mean anything, the agent has to know what the software is supposed to do before it writes a line, and that intent cannot be recovered from the implementation. It has to arrive from somewhere else: the decisions the team actually made, the outcome they actually agreed, written down and handed to the test-writing agent as its input. This is the whole argument of Chapter 4 surfacing in a new place. The prompter is the prompt. A suite is only ever as good as the specification it was written against, and a specification is only worth anything once every perspective on the team has signed up to it.

Getting an agent to produce tests like that, rather than merely numerous ones, comes down almost entirely to the brief. The instruction that works for me reads roughly like this:

Do not test methods, test behaviour. Describe what the system should do, not what it currently does. Write tests that would still be correct if the implementation were thrown away and rebuilt from scratch. Do not mock what you do not need to mock; use real collaborators wherever you can, because integration is where most production bugs actually live. Test the boundaries: zero, one, the maximum size, the wrong type, concurrent access, partial failure. Test the error paths as carefully as the happy ones, because the error paths are where systems hurt people. Name each test so a reader who sees only the names can reconstruct what the system is meant to do.

Hand those same instructions to a senior engineer and they would nod and call them obvious. Hand them to an agent, with care, and they are the difference between a suite that wastes everyone’s time and one that is genuinely worth having.
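As a rough illustration of what that brief asks for, here is a sketch built around a hypothetical `paginate` function, which is my assumption and not something from a real codebase. It is implemented twice, once by slicing and once as a from-scratch rewrite, and the same behaviour checks run against both. The names are meant to read as a specification on their own. Only some of the boundaries in the brief are covered: zero, one, the page limit, the wrong type. Concurrent access and partial failure are left out to keep the sketch short.

```python
def _validate(page_size):
    if isinstance(page_size, bool) or not isinstance(page_size, int):
        raise TypeError("page_size must be an int")
    if page_size < 1:
        raise ValueError("page_size must be at least 1")


def paginate_by_slicing(items, page_size):
    """First hypothetical implementation: slice the list."""
    _validate(page_size)
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]


def paginate_by_appending(items, page_size):
    """A from-scratch rewrite: build pages one item at a time."""
    _validate(page_size)
    pages = []
    for item in items:
        if not pages or len(pages[-1]) == page_size:
            pages.append([])
        pages[-1].append(item)
    return pages


def raises(expected, fn, *args):
    """Return True if fn(*args) raises the expected exception type."""
    try:
        fn(*args)
    except expected:
        return True
    return False


# Each check takes the implementation as an argument and knows nothing
# about how it works, only what it is supposed to return.
def empty_input_yields_no_pages(paginate):
    assert paginate([], 3) == []


def single_item_yields_one_page(paginate):
    assert paginate(["a"], 3) == [["a"]]


def exactly_full_page_is_not_split(paginate):
    assert paginate([1, 2, 3], 3) == [[1, 2, 3]]


def one_item_over_a_full_page_starts_a_new_page(paginate):
    assert paginate([1, 2, 3, 4], 3) == [[1, 2, 3], [4]]


def zero_page_size_is_rejected(paginate):
    assert raises(ValueError, paginate, [1], 0)


def non_integer_page_size_is_rejected(paginate):
    assert raises(TypeError, paginate, [1], "3")


BEHAVIOURS = [
    empty_input_yields_no_pages,
    single_item_yields_one_page,
    exactly_full_page_is_not_split,
    one_item_over_a_full_page_starts_a_new_page,
    zero_page_size_is_rejected,
    non_integer_page_size_is_rejected,
]


def run_suite(paginate):
    """Run every behaviour check against one implementation."""
    for behaviour in BEHAVIOURS:
        behaviour(paginate)
    return len(BEHAVIOURS)


# The same suite holds for both implementations without modification.
for implementation in (paginate_by_slicing, paginate_by_appending):
    run_suite(implementation)
```

Swapping one implementation for the other changes nothing in the suite, which is the property the brief is asking for.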

Testing the tester

Because the agent now writes the tests rather than the engineer, a question opens up that barely existed before: did it write the right ones? A green suite tells you the code does what the tests say. It does not tell you the tests say what the specification meant. So there are two loops to close now, not one. The inner loop checks the code against the tests, and the agent can run that one all day without help. The outer loop checks the tests against the specification, and the outer loop is where the human judgment lives. An agent that has quietly misread the spec will write a thorough, green, confident suite that verifies the wrong thing, and nothing inside the inner loop will ever tell you.

This is the picture Chapter 6 drew, seen honestly. The loop professional AI engineering runs on is specify, build, verify: you say what you want, the agent builds it, you confirm that what came back is what you asked for. The execution plan is the specify step, the agreed description of intended behaviour written before the agent touches the code. Behaviour-level tests are the verify step made executable. What the double loop adds is the reminder that the verifier is now an agent too, and an agent’s work is checked, not trusted. A spec nobody checks the output against is, as I said then, just a more articulate prompt. A suite nobody checks against the spec is the same mistake one level up.
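The two loops can be seen in a very small sketch. Everything here is hypothetical: the spec line, the `shipping_cents` function, and the 499-cent shipping fee are assumptions made up for the example. The agent has read "over 100.00" as "100.00 or more", written the code that way, and written a test that agrees with the code.

```python
# Hypothetical spec line, agreed by the team: "Orders over 100.00 ship free."
FREE_SHIPPING_THRESHOLD_CENTS = 10_000
STANDARD_SHIPPING_CENTS = 499  # assumed fee, for illustration only


def shipping_cents(order_total_cents):
    # The agent's misreading: treats "over 100.00" as "100.00 or more".
    if order_total_cents >= FREE_SHIPPING_THRESHOLD_CENTS:
        return 0
    return STANDARD_SHIPPING_CENTS


def test_order_of_exactly_100_ships_free():
    # Inner loop: the test agrees with the code, so it is green.
    assert shipping_cents(10_000) == 0


def spec_says_order_of_exactly_100_pays_shipping():
    # Outer loop: what the spec line actually requires at the boundary.
    return shipping_cents(10_000) == STANDARD_SHIPPING_CENTS


test_order_of_exactly_100_ships_free()  # passes
print(spec_says_order_of_exactly_100_pays_shipping())  # prints False
```

Nothing in the inner loop can surface the disagreement, because code and test share the same misreading. It only shows up when someone sets the test against the sentence the team agreed on.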

What the developer does now

There is a quieter shift folded into all of this, and it is the more interesting one. The developer no longer types the tests. The developer specifies what should be tested, judges whether the suite covers the risks that actually matter, and steps in when a test is wrong or a gap is dangerous. The work has moved from writing the assertion to deciding what is worth asserting, from production to judgment.

Reviewing tests well is itself a skill, and it is one most engineers have barely practised, because for thirty years the person who wrote a test and the person who trusted it were the same person. They are not the same person any more. The instinct that lets a reviewer look at a green suite and feel, correctly, that something important is missing is exactly the instinct a long career builds and a coverage report cannot supply.

For the business the implication is blunt and, I think, underappreciated. The confidence floor under every release rises. Tests get dense where they were always thin. The gap between “we think this change is safe” and “we can show this change is safe” closes, and the old trade-off, where teams shipped with less coverage than they wanted because more coverage cost more than they had, simply stops applying. Continuous deployment was always an act of faith resting on the quality of the suite beneath it. The suite beneath it just got better than most teams have ever managed.

Testing has been the most reliable source of drudgery in this profession for as long as the profession has existed. It is the thing that gets skimped when the deadline tightens, pushed to next sprint for years on end, sworn off and returned to and skimped again. Almost every serious incident I have been near in the last decade had, somewhere in the post-mortem, the quiet admission that the coverage in the relevant corner of the system was thinner than anyone had wanted to admit.

That tax is lifting. Not in some future release, but now, for any team willing to change its habits. Agents do not get bored, do not resent the work, do not leave the error paths for a calmer week that never arrives. Tell one what an effective test looks like in your system, point it at the code, and it will build in an afternoon the suite you have been meaning to write for years.

So stop wrestling with it. Hand the mechanics to the agent and keep the judgment for yourself. After thirty-five years of losing this particular fight, I did not expect to be the one writing that sentence down.

The compromise is over.

Barrie

I am co-founder and CEO of Mindset AI, where we are building Memex AI, a decision and knowledge layer for AI-native engineering teams. This series is the thinking that shapes our product. I will flag it explicitly when an article touches something we build. Most of it is simply where the industry is going, with or without us.