Green Doesn't Mean Correct

An LLM wrote five passing tests for a twelve-line function. Every checkmark was green. But the function was broken.

Ask any model to "write unit tests for this function" and you get a wall of green in seconds. It feels like progress, but often it's the opposite. The tests pass because they were written to agree with your code, not to check it against what it should do. Those are two different jobs, and the gap between them is exactly where bugs live.

Let me show you with the most boring function I can think of.

The function

A truncate helper: cut a string down to a maximum length, and add an ellipsis if you cut it.

def truncate(text, max_length):
    if len(text) <= max_length:
        return text
    return text[:max_length] + "…"

Nothing exotic. You've written this function, or a similar one, a million times.

If you want to follow along, drop that in truncate.py and pip install pytest hypothesis. Everything below runs.

Here's the prompt I gave an LLM: "Write pytest tests for this function." The exact output varies by model and by run, but it reliably looks like this:

from truncate import truncate

def test_short_string_returned_unchanged():
    assert truncate("hello", 10) == "hello"

def test_string_exactly_at_max_length():
    assert truncate("hello", 5) == "hello"

def test_long_string_is_truncated():
    assert truncate("hello world", 5) == "hello…"

def test_truncation_adds_ellipsis():
    assert truncate("this is a long sentence", 10).endswith("…")

def test_empty_string():
    assert truncate("", 10) == ""

Run it and you'll see this in your terminal:

$ pytest
5 passed in 0.77s

Five for five. If you were reviewing a pull request and saw green, you'd approve it. I almost did.

The bug the green checkmarks were hiding

Now read the function's name again. truncate. Its entire job is to make sure a string is no longer than max_length. That's the one promise it makes. So let's check the promise:

>>> truncate("hello world", 5)
'hello…'        # length 6

We asked for at most 5 characters. We got 6. The ellipsis is a character too, and the function forgot to count it: text[:max_length] takes 5 characters, then + "…" adds a sixth. The function violates the one rule it exists to enforce, on a completely ordinary input.

And look at which test should have caught this. test_long_string_is_truncated did truncate "hello world" to 5. It asserted the result was "hello…". It passed. The single test best positioned to catch the bug instead certified it as correct.

It gets worse the harder you push:

>>> truncate("hi", 0)
'…'             # asked for 0, got 1
>>> truncate("hello", -1)
'hell…'         # negative length, nonsense output, no error

And it has nothing to say about text that isn't plain ASCII:

>>> truncate("🇧🇷🇺🇸", 3)
'🇧🇷🇺…'          # the US flag, sliced in half

A flag emoji is two code points under the hood. Slice between them and you're left with a lone "half flag" that renders as a stray letter in a box. The function will happily corrupt any emoji, flag, or accented character that straddles the cut, and the generated tests never used a single non-ASCII string.

Why this happens

None of this is because the model is dumb. It's because of how it writes tests, and the pattern is worth understanding, because it's predictable.

When you paste a function and ask for tests, the model reads the implementation and writes assertions that describe what the implementation does. It traces text[:max_length] + "…", sees that truncate("hello world", 5) produces "hello…", and writes assert truncate("hello world", 5) == "hello…". The expected value comes from the code. So the test can only ever confirm that the code agrees with itself. If the code is wrong, the test is wrong in the same way, and they shake hands in the middle. Green forever.

A tester working deliberately goes the other direction. They don't ask "what does this code return?" They ask "what must this function guarantee, and where is that guarantee most likely to break?" Two old, unglamorous techniques drive that:

Boundary value analysis. Bugs cluster at edges. For truncate, the edges are the values around max_length: a string one character shorter, exactly equal, one longer. Plus the degenerate boundaries max_length = 1, 0, and negative. The "hello world" bug lives precisely on the boundary, and the model's one boundary-ish test baked the bug into its expected value instead of catching it.

Equivalence partitioning. Group inputs into classes the code should treat alike, then test one from each: empty string, string shorter than the limit, string at the limit, string over the limit, non-ASCII string, invalid max_length. The model covered the first four, the obvious ones a reading of the code suggests. It never produced the last two, because nothing in the code's text pointed at them.

The model didn't reason about the contract. It reasoned about the code. That's the whole bug.

The fix that actually finds it

So how do you catch a bug that every example-based test will miss? You stop writing examples and start asserting the contract, the property that must hold for every input, not the output of a few hand-picked ones.

The contract for truncate is one line: the result is never longer than max_length. Here's that as a property-based test, using Hypothesis:

from hypothesis import given, strategies as st
from truncate import truncate

@given(text=st.text(), max_length=st.integers(min_value=1, max_value=100))
def test_result_never_exceeds_max_length(text, max_length):
    assert len(truncate(text, max_length)) <= max_length

You're not listing inputs. You're stating a rule and letting Hypothesis hunt for any input that breaks it. Run it:

Falsifying example: test_result_never_exceeds_max_length(
    text='00',
    max_length=1,
)
AssertionError: assert 2 <= 1

It found the bug in seconds and handed you the smallest input that triggers it: a two-character string and a limit of 1. No green-washing possible, the test checks the promise, not the implementation.

The fix is small once you can see it:

def truncate(text, max_length):
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if len(text) <= max_length:
        return text
    return text[: max_length - 1] + "…"   # leave room for the ellipsis

Re-run the property test and it passes for every input Hypothesis can throw at it.

The part most "AI writes your tests" posts skip

The property test is a real improvement, but it only checks the one property you thought to write. Watch:

>>> truncate("🇧🇷🇺🇸", 2)
'🇧…'             # length 2, contract satisfied — flag still sliced in half

The fixed function still corrupts the flag. The length contract is satisfied (the result is two characters), so our property test is perfectly happy. There's a second contract hiding here — "never split a grapheme" — and we never asserted it, so nothing checks it. (Fixing it for real means slicing by grapheme cluster instead of code point — a job for a library like regex with \X, or grapheme. A topic for another day.)

That's the real lesson, and it isn't "LLMs write bad tests." It's this: a test is only as good as the contract behind it, and the model cannot supply the contract. It will generate fifty tidy, well-named tests for the contracts you state. It cannot decide what the function is supposed to guarantee, because by default it infers the intent from the same code that contains the bug.

The thinking is your job. The typing is the part you can hand off.

So how should you actually use it?

That reframes the workflow. The model is excellent at volume. Boilerplate, parametrization, fixtures, covering a list of cases you hand it. It's unreliable at deciding what to cover. So you bring the checklist and it brings the speed.

Concretely: before you ask for tests, force the model to reason about the contract first. This prompt produces dramatically better output than "write tests for this function":

Here is a function. Before writing any tests:
1. State its contract as invariants - what must ALWAYS be true of the
   output, for any input.
2. List the input partitions and boundary values to cover. Include at
   minimum: empty input; the boundary at, just below, and just above
   max_length; max_length = 0 and negative; and non-ASCII / multi-code-
   point input (emoji, flags, combining marks).
Then write one pytest test per case. For every assertion, derive the
expected result from the contract you stated — NOT by tracing the
implementation.

[paste your function here]

The last instruction is the one that matters most. "Derive the expected result from the contract, not by tracing the implementation" is the direct antidote to the failure mode that let truncate("hello world", 5) == "hello…" slip through. You're forbidding the model from grading the code against itself.

Use it, then read what comes back, add the cases it still missed, and keep a property test or two for the invariants that matter. Green will mean a little more than it did this morning.

Next week: the inverse problem. This week we tested ordinary code that returns the same answer every time. But how do you write a test for code that returns a different answer on every run, an LLM-powered feature with no single "correct" output? That's where testing gets downright weird. See you then.