The Bottleneck Moved: Code Review for the AI Era

TL;DR: AI flipped code review economics - writing collapsed, reading didn’t. Five shifts to keep your reviews honest, with a PR template and a reviewer prompt you can adopt this week.

The Asymmetry

Two-line prompt. Eight hundred lines of code. Ten seconds to type, two hours to review.

Writing got cheap. Reading didn’t. The bottleneck moved.

Figure: Pre-AI vs post-AI: writing collapsed, review didn’t. (Two stacked timelines: pre-AI shows a long writing bar and a short review bar; post-AI shows a short writing bar and a long review bar.)

What’s worth keeping, updating, and inventing - inside.

Three forces, up close

Code review was a load-bearing practice for teams that invested in it. AI made it that way for everyone. Three forces now stack in every AI-authored PR queue, each one alone justifying more investment in review:

- Volume - AI scales authoring; review didn’t scale with it.
- Fluency trap - the code reads right even when it isn’t.
- Intent drift - prompt and diff diverge with nothing automatic to close the gap.

Force 1: Volume - the economics flipped

Pre-AI, writing took days; review took minutes to hours. A careless review was cheap relative to the change being reviewed - so LGTM ✅ culture, lightweight nits (the nitpick: prefix from Conventional Comments), and the unwritten rule that big PRs get rubber-stamped after a polite scroll all made sense. Cycle time was dominated by everything upstream¹ of review.

That ratio inverted. A coding agent produces in an afternoon what a team used to produce in a week. Writing collapsed; review didn’t. Stamping a PR that took seconds to author but needs hours to understand isn’t review - it’s approval theatre.

I’ve seen this in every team that’s gone heavy on AI tooling: bigger PRs land faster, but a bigger fraction get rewritten or reverted within a fortnight. That’s not productivity - that’s skipped review surfacing downstream².
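That observation can be checked against a team's own history. The sketch below is one way to compute the fraction of merged PRs that were reverted or rewritten within a fortnight. The `MergedPR` record, its field names, and the sample dates are all hypothetical; how to detect a "rework" in a real repository is left open.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MergedPR:
    """Hypothetical record of one merged PR."""
    number: int
    merged_on: date
    # Date the change was reverted or substantially rewritten, if ever.
    reworked_on: Optional[date] = None


def rework_rate(prs: list[MergedPR], window_days: int = 14) -> float:
    """Fraction of merged PRs reverted or rewritten within `window_days` of merge."""
    if not prs:
        return 0.0
    window = timedelta(days=window_days)
    reworked = sum(
        1
        for pr in prs
        if pr.reworked_on is not None and pr.reworked_on - pr.merged_on <= window
    )
    return reworked / len(prs)


# Made-up records, purely for illustration.
sample = [
    MergedPR(1, date(2024, 1, 2), date(2024, 1, 9)),   # reworked after 7 days
    MergedPR(2, date(2024, 1, 3)),                      # never reworked
    MergedPR(3, date(2024, 1, 5), date(2024, 2, 20)),  # reworked, outside the window
    MergedPR(4, date(2024, 1, 8), date(2024, 1, 22)),  # reworked after exactly 14 days
]
print(rework_rate(sample))  # 0.5
```

Tracked per month, a rising rate alongside rising PR size would be the downstream signal described above.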

Force 2: The fluency trap - looks right, isn’t

AI writes confidently - syntactically perfect, idiomatically clean, well-named, well-typed code. It also:

- imports modules that don’t exist and calls functions that were never defined
- writes tests that match the implementation, not the spec
- catches the wrong exception in the wrong layer - confidently

Language models generate the most plausible continuation of text, not the most correct one. Trained on enough good code, they produce output that reads like good code. Call it the fluency trap: the reviewer’s brain rewards what looks competent, and “looks competent” is the cheapest thing a model can do.

The old signal - “compiles and tests pass” - was meaningful when humans wrote both. With AI authorship, that signal is much weaker.

Force 3: Intent drift

When a human writes code, intent and implementation live in the same head. When an LLM writes code, intent lives in the prompt, implementation in the diff, and nothing automatic closes the loop. The PR description summarises what the author thought they asked for; the code does what the model interpreted the ask as. The gap is invisible to CI, invisible to tests written from the same prompt, visible only to a human asking: did this solve the problem we actually have?

That question is what code review is for. AI just made the answer harder to take for granted.

What survives, what shifts

The principles of code review survived the change. Three still hold:

- Improve overall code health - every PR is a vote on whether the codebase is more maintainable next week.
- Catch defects before merge - the reviewer is the last filter between intent and production.
- Share knowledge - review is how senior judgment transfers across a team.

For the principles themselves, Google’s code review guide remains the cleanest primer I know. What changed is where in the cycle review has to happen. QA learned this lesson long ago and gave it a name: shift-left - move quality activities earlier, where defects are cheapest to catch. AI made code review’s version of that shift unavoidable: the intent now lives upstream of the diff, so review has to follow it there.

The five shifts below are what shift-left looks like when the artifact under review isn’t the code, but the decisions that produced it. They’re updates to the practice, not departures from the principles.

Shift 1: Review the decisions, not the diff

Pre-AI, intent, design, and implementation lived in the same head. The author chose the goal, the architecture, and the code in one continuous thought; the diff was the trace.

Post-AI, those layers split. Intent lives in the prompt. Design lives in the model’s interpretation. Implementation lives in the diff. A reviewer who only reads the diff is reviewing the cheapest layer - the one the model can regenerate with a different prompt.

The expensive questions are upstream: was the right thing asked for, and does the resulting design belong in this codebase? That’s shift-left applied to review: catch defects where they’re cheapest, which now means at the spec, not the diff. The cheap questions are about implementation choices the model already made for free - swap a prompt, you get different ones.

The shift: spend review time on the decisions, not the diff that follows from them. Read the spec before reading the code. Then ask: would I have asked for this? And: would I have made this design decision myself? If both answers are yes, the implementation review is a quick verification, not the headline.

Shift 2: Demand a decision log

A classical PR description tells reviewers what changed. A good one also tells them why. In the AI era, “why” is no longer something the author can carry in their head and leak through the code - because the head and the code are different entities. The reasoning has to live somewhere on disk.

Every non-trivial PR should ship with a short decision log. Three sentences, sometimes one. Problem solved. Approach chosen. Alternatives rejected. If the author can’t answer those, the PR isn’t ready for review - there’s nothing for a reviewer to disagree with.

This is not new advice. ADRs (architecture decision records) have been around for years. What changed is that skipping them is no longer self-correcting: the model writes the code without needing the log, and the reviewer has no shared head to recover the reasoning from. Decision logs went from helpful-but-optional to non-negotiable.

A minimal version for your PR template:

```markdown
## Decision log
**Problem:** <one sentence>
**Approach:** <one sentence>
**Alternatives rejected:** <one line each>

## AI assistance
- [ ] None
- [ ] Scaffolded (AI wrote drafts, human refined)
- [ ] Mostly AI (review with elevated scrutiny)
```

Paste into .github/pull_request_template.md. The cost is one extra section the author fills in; the gain is something for the reviewer to actually evaluate.

The shift: require a written decision log on every non-trivial PR. Problem solved. Approach chosen. Alternatives rejected. Three sentences is usually enough. If the author can’t write them, the PR isn’t reviewable - send it back, not because the diff is wrong, but because there’s nothing to review against.

Shift 3: Use AI to review AI

If AI authoring shifts the bottleneck to review, the obvious move is to put leverage on the reviewer’s side too. This isn’t a rhetorical point. It’s how I get through review queues that would otherwise drown me.

A few concrete uses:

- Summarise the diff before reading it. Pipe the PR into your assistant of choice and ask for a 5-bullet summary plus two lists: “things mentioned in the description but not in the diff” and “things in the diff but not in the description”. The second list is where bugs hide.
- Find hotspots. Ask the assistant to flag the three files with the most logical complexity or the highest risk. Read those first, with full attention.
- Generate review questions. “What assumptions does this code make about its callers? Where is input validated? Which exceptions are silently swallowed?” These are the questions a senior reviewer asks. Get them on the page before you start reading.
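For the hotspot step, a crude non-AI first pass can run before any assistant is involved: rank files by how many lines changed. This is a swapped-in heuristic (churn, not logical complexity or risk), so treat it as a cross-check on the assistant's answer. The sketch assumes the input is `git diff` output with `diff --git` headers; the function names are illustrative.

```python
from collections import Counter


def changed_lines_per_file(diff: str) -> Counter:
    """Count added plus removed lines per file in `git diff` output."""
    counts: Counter = Counter()
    current = None   # path of the file whose hunks we are reading
    in_hunk = False  # True once we are past the file headers
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            current, in_hunk = None, False
        elif not in_hunk and line.startswith(("--- ", "+++ ")):
            path = line[4:].strip()
            if path != "/dev/null":
                # Strip git's "a/" or "b/" prefix; the "+++" line wins on renames.
                current = path[2:] if path[:2] in ("a/", "b/") else path
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and current and line.startswith(("+", "-")):
            counts[current] += 1
    return counts


def top_hotspots(diff: str, n: int = 3) -> list[tuple[str, int]]:
    """The `n` files with the most changed lines, largest first."""
    return changed_lines_per_file(diff).most_common(n)
```

If the assistant's three hotspots and the three highest-churn files have nothing in common, that mismatch is worth a second look before reading anything else.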

A starting prompt that gets ~80% of the way for me:

```
Act as a senior code reviewer skeptical of AI-authored code. Given the
PR description and diff:

1. **5-bullet summary**, one bullet per logical concern.
2. **Mismatch list:**
   (a) claimed in description, not in diff;
   (b) in diff, not in description;
   (c) exceeds the PR title's scope.
3. **Risk hotspots:** three files most likely to contain bugs, each with
   a one-line reason (concurrency, error handling, auth, external API
   contract, hallucinated symbol).
4. **Open questions:** three things a reviewer should ask that the
   description doesn't answer.

Be terse. Cite file:function. No flattery, no filler.
```

Pipe in the diff and the description. Read what comes back. Then read the diff yourself.
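The "pipe in" step can be as small as string assembly. The sketch below builds a single prompt from the instructions, the PR description, and the diff. Sending it to an assistant is left out on purpose, since that depends on the tool you use. The function name, the section labels, and the truncation limit are assumptions for illustration.

```python
def build_review_prompt(
    instructions: str,
    description: str,
    diff: str,
    max_diff_chars: int = 60_000,  # arbitrary illustrative limit
) -> str:
    """Assemble reviewer instructions, PR description, and diff into one prompt."""
    truncated = len(diff) > max_diff_chars
    if truncated:
        diff = diff[:max_diff_chars]

    parts = [
        instructions.strip(),
        "## PR description",
        description.strip() or "(empty)",
        "## Diff",
        diff.rstrip(),
    ]
    if truncated:
        # Make the gap explicit so nobody assumes the whole diff was summarised.
        parts.append("[diff truncated: review the remainder manually]")
    return "\n\n".join(parts)
```

Marking an empty description and a truncated diff explicitly is a design choice: both are things the summary would otherwise hide, and both are things the reviewer needs to know before trusting it.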

Two warnings. First, the AI summary may lie. It can miss a hidden Redis-client rewrite just as confidently as the author did. Use it as a starting checklist, not as a substitute for reading. Second, this is reviewer leverage, not reviewer replacement. A stamp from an AI assistant is not an LGTM. Your name is on the approval.

The shift: put AI on the reviewer’s side too. Summarise the diff before reading, find hotspots, generate the questions a senior reviewer would ask. Then read every line yourself - the AI summary is a starting checklist, not a substitute. A stamp from an assistant is not an LGTM; your name signs the approval.

Shift 4: Make tests disagree with the code

A useful test can fail. That’s the whole property. If a test cannot disagree with the code it covers, it isn’t a test - it’s a transcription of the code into assertion form.

AI-written tests can lose this property in subtle ways. A natural pattern: the test re-uses the same module constant or helper that the code itself uses, then computes the expected value from it. The test passes for any implementation that follows the formula, including a wrong one, as long as it’s consistently wrong. CI is green; the test proved nothing.

Tactics that restore the property:

- Anchor expected values to the spec, not the implementation.
  - discount(100, 20) == 80.0 is a test.
  - discount(100, 20) == 100 * (1 - 20 / 100) is an echo.
- Write tests as given/when/then. The shape forces a separation of setup, action, and assertion.
  ```
  Given: a price of 100 and a 20% discount
  When: the discount is applied
  Then: the result is 80
  ```
  Implementation leakage into the assertion becomes visible at a glance.
- Mutation-test critical paths. If you can’t tell whether a test would catch a bug, run a tool that breaks the code and watches what fails: mutmut for Python, Stryker for TypeScript. A surviving mutant means a test didn’t catch the bug. Fix the test.
- Raise the bar when one agent writes code and tests in the same session. Treat the suite as one artifact under review, not two cross-checking each other.

A concrete example:

```python
# src/billing/constants.py
TAX_RATE = 0.21  # updated by the team last quarter

# src/billing/total.py
from billing.constants import TAX_RATE

def total_with_tax(items: list[dict]) -> float:
    subtotal = sum(item["price"] for item in items)
    return round(subtotal * (1 + TAX_RATE), 2)

# tests/test_total.py  - BAD: imports the same TAX_RATE and recomputes
from billing.constants import TAX_RATE
from billing.total import total_with_tax

def test_total_with_tax():
    items = [{"price": 100}, {"price": 50}]
    expected = round(sum(i["price"] for i in items) * (1 + TAX_RATE), 2)
    assert total_with_tax(items) == expected
```

If someone silently changes TAX_RATE to 0.18, the test still passes, because the expected value is recomputed from the same constant. The bug ships.

```python
# tests/test_total.py  - GOOD: anchored against the spec, not the constant
from billing.total import total_with_tax

def test_total_with_tax():
    # given two items priced 100 and 50
    items = [{"price": 100}, {"price": 50}]
    # when tax is applied at the regulated 21% VAT
    # then the total is 181.50
    assert total_with_tax(items) == 181.50
```

The good version can disagree with the code. That’s the only thing that makes a test worth running.
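The difference can be shown mechanically with a hand-rolled mutation check. This is a toy stand-in for what a mutation-testing tool automates, not how mutmut or Stryker work internally. It rebuilds the example in one file, with `make_total_with_tax` (a hypothetical helper) standing in for the module-level constant so the rate can be "mutated" from 0.21 to 0.18.

```python
def make_total_with_tax(tax_rate: float):
    """Build a total_with_tax bound to `tax_rate` (stands in for TAX_RATE)."""
    def total_with_tax(items: list[dict]) -> float:
        subtotal = sum(item["price"] for item in items)
        return round(subtotal * (1 + tax_rate), 2)
    return total_with_tax


ITEMS = [{"price": 100}, {"price": 50}]


def echo_test_passes(total_with_tax, tax_rate: float) -> bool:
    # Echo: recomputes the expectation from the same rate the code uses.
    expected = round(sum(i["price"] for i in ITEMS) * (1 + tax_rate), 2)
    return total_with_tax(ITEMS) == expected


def anchored_test_passes(total_with_tax) -> bool:
    # Anchored: the expectation comes from the spec (21% VAT on 150).
    return total_with_tax(ITEMS) == 181.50


for label, rate in [("original", 0.21), ("mutant", 0.18)]:
    total = make_total_with_tax(rate)
    print(
        f"{label}: echo passes={echo_test_passes(total, rate)}, "
        f"anchored passes={anchored_test_passes(total)}"
    )
# original: echo passes=True, anchored passes=True
# mutant: echo passes=True, anchored passes=False
```

The echo test passes against the mutant, so the mutant survives. The anchored test fails against it, which is exactly the disagreement a test exists to provide.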

The shift: every test must be able to disagree with the code. Anchor to the spec, not the constants. Given/when/then. Mutation-test critical paths. A test that can’t fail isn’t a test.

Shift 5: Reviewing is the new authoring

Pre-AI, “senior” meant “writes good code.” The PR was the signal: the senior knew which abstractions to reach for, which corners to refuse, which tests to write. The author carried the judgment; the diff was its evidence.

Post-AI, anyone can ship a senior-looking PR. The model carries syntactic seniority - good names, idiomatic structure, well-placed error handling. What it can’t carry is judgment about whether the thing should exist, whether it belongs here, whether the trade-offs fit this codebase, this team, this quarter.

That judgment now lives in the review.

The implication is uncomfortable: senior engineers don’t author less, but they review more - and with the weight they used to put into authoring. Reviewer time stops being the cheap slot. Reviewer authority - the right to send a PR back, ask for a rewrite, refuse a merge - becomes a first-class organisational function.

If your review queue is one person’s after-hours job, you have a staffing problem, not a process one. The most experienced engineer should be reading more PRs than they write.

The shift: treat reviewer time as load-bearing, not as the cheap slot in the schedule. Senior judgment now lives in the review, so senior engineers should read more PRs than they write. Reviewer authority - the right to refuse a merge - becomes a first-class organisational function, not a soft preference.

What it adds up to

The five shifts answer the three forces, with overlap. Force 1 (volume) is met primarily by Shifts 3 + 5; Force 2 (the fluency trap) by Shift 4; Force 3 (intent drift) by Shifts 1 + 2 - though Shifts 1 and 2 also cut volume and catch fluent-but-wrong choices upstream.

The model writes the code. You decide whether it ships.

That’s the whole thing. Every shift in this post serves that one sentence. Read the spec, demand the log, leverage your own AI, anchor the tests, staff the queue - so that when you say LGTM, it still means something.

The bottleneck moved. Move with it.

¹ Upstream = work that happens earlier in the development pipeline: the spec, the prompt, the design call. Borrowed from the flow metaphor - defects caught upstream cost less to fix.
² Downstream = work that happens later: CI, deploy, production, incidents, churn. Defects that escape upstream review surface here, more expensive each step.