CAMERON WESTLAND

How might we incentivize a Proof-of-Work Code Review?

Jul 12, 2025

TL;DR

  • Make PR submitters show proof of work.
  • Build on old rituals like code walk-throughs.
  • Experiment with small PRs, code screencasts, and AI video checks to realign incentives.
  • These could prevent teams from getting buried in review debt.

A Thought Experiment

In "The Externality of AI Velocity," I wondered if the real issue with AI-generated code is the hidden cost dumped on reviewers. Speed feels great. Ownership stays messy.

I'm still chewing on how we balance that. If AI generation keeps gaining traction, we're going to have to solve these problems. Part of the solution is steering generation so it's optimized for review (ask your model to make small changes and small PRs). Part is using AI to review and analyze the code. If we don't experiment with both, I bet review queues will explode as AI floods repos with unowned code.

What we used to do

Back in the 1980s, Microsoft developers hauled dot-matrix printouts into conference rooms. One person would walk through the code line by line while others grilled them. It was clunky, but it forced authors to own their work.

As one early Microsoft engineer recalled, "When I started at Microsoft in the late 80s, I could count on enjoying a code review every couple of months. Everyone on the team would print out what they'd been working on and go into the conference room."

No hiding.

Today, AI can generate 500 lines of TypeScript in seconds. Reviewers are the bottleneck. How do we make sure authors share that load? For teams buried in review debt, this matters. Bad incentives turn code review into a grind, slowing everything down.

Some ideas across time

I have been thinking about this as a progression. Start with what we already know works. Adapt it to the AI world. Then experiment with something new.

Here are three possibilities, framed as past, present, and future. I am curious how they might play out, especially for easing that review burden.

Conventional practice: Limit the size of your pull requests.

Small PRs are already a best practice on many teams. We enforce them to keep diffs digestible. In an AI-generated world, this feels even more important. Big patches can hide sloppy outputs.

For example, I've set up a Claude Code GitHub Action to flag and suggest improvements on any PR larger than 500 lines. My thinking is that it could reinforce good habits automatically. Does this actually cut review time, or does it add overhead?
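The gate itself is simple enough to sketch in plain TypeScript with Octokit (my actual setup uses the Claude Code Action; the 500-line budget, function name, and comment wording below are just illustrative):

```typescript
// Rough sketch of a PR size gate, not the Action I actually run.
// Uses @octokit/rest; owner/repo values and the threshold are illustrative.
import { Octokit } from "@octokit/rest";

const MAX_DIFF_LINES = 500;

async function flagOversizedPR(owner: string, repo: string, pullNumber: number) {
  const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

  // Fetch PR metadata; `additions` + `deletions` give the total diff size.
  const { data: pr } = await octokit.rest.pulls.get({
    owner,
    repo,
    pull_number: pullNumber,
  });

  const diffLines = pr.additions + pr.deletions;
  if (diffLines <= MAX_DIFF_LINES) return;

  // Leave a comment nudging the author to split the change.
  await octokit.rest.issues.createComment({
    owner,
    repo,
    issue_number: pullNumber,
    body:
      `This PR touches ${diffLines} lines (budget: ${MAX_DIFF_LINES}). ` +
      `Consider splitting it into smaller, reviewable chunks.`,
  });
}
```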

Adapting an existing practice: Screencast your changes.

Some developers already attach Loom videos to PRs, usually to demo the product feature. That is a start. What if we normalized walking through the code itself, cursor on screen, explaining what changed, why it matters, and any risks?

One developer on my team records a product demo video with every PR. Adapting that to include a code walkthrough could catch issues early and signal real ownership. Is this too much for tiny fixes? Could it build trust in AI-assisted work?

Futuristic experiment: Automated sanity check on the video.

Feed the screencast to a model like Gemini 2.5 Pro. It transcribes, summarizes, and verifies whether the explanation matches the diff, checking function names and file mentions. (I've been experimenting with this for product demos, and so far it seems far better than just an audio transcript.)

Low confidence? Send it back for a redo. This could triage videos, so humans only watch coherent ones. It feels a bit Black Mirror, but for teams swamped by reviews, it might free up time for the tough stuff. What if the model gets fooled by polished fakes or misses nuances?
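To make the cross-check concrete, here is a rough sketch of just the symbol-matching and triage step, assuming the transcript already exists (Gemini would produce it upstream). The identifier heuristic, the coverage score, and the 0.4 threshold are assumptions for illustration, not a pipeline I actually run:

```typescript
// Hypothetical sketch: score how well a walkthrough transcript covers the diff.
// Assumes the transcript text is already available; identifier extraction is a crude heuristic.

// Pull identifiers (function- and file-ish names) out of the diff's changed lines.
function changedSymbols(diff: string): Set<string> {
  const symbols = new Set<string>();
  for (const line of diff.split("\n")) {
    if (!line.startsWith("+") && !line.startsWith("-")) continue;
    // Crude heuristic: camelCase / snake_case identifiers, optionally with a file extension.
    for (const match of line.matchAll(/[A-Za-z_][A-Za-z0-9_]{3,}(\.\w+)?/g)) {
      symbols.add(match[0].toLowerCase());
    }
  }
  return symbols;
}

// Fraction of changed symbols the author actually mentions while narrating.
function coverageScore(transcript: string, diff: string): number {
  const spoken = transcript.toLowerCase();
  const symbols = [...changedSymbols(diff)];
  if (symbols.length === 0) return 1;
  const mentioned = symbols.filter((s) => spoken.includes(s)).length;
  return mentioned / symbols.length;
}

// Triage: below the (arbitrary) threshold, bounce the video back for a redo.
const REDO_THRESHOLD = 0.4;

export function triageWalkthrough(transcript: string, diff: string): "pass" | "redo" {
  return coverageScore(transcript, diff) >= REDO_THRESHOLD ? "pass" : "redo";
}
```

A real version would need fuzzier matching, since speech-to-text mangles identifiers, but even a crude coverage ratio could separate obvious mismatches from plausible walkthroughs.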

Counter-pressures to watch

No idea is perfect. People might game these: small PRs faked by splitting poorly, scripted videos with synthetic voices, mismatched explanations that slip past the model.

The poor-splitting objection is a common refrain in organizations that ask for smaller PRs, but in practice, I still think smaller PRs are almost always going to be better.

Even imperfect splits encourage incremental thinking over monolithic dumps. And we could build detections for the gaming, like spotting videos with no cursor movement or explanations with symbol mismatches.

The aim is not zero fakes. It is making honest work easier than cheating.

Open questions

Have you adapted practices like this for AI code? Did small PRs or videos help with ownership? What about automated checks? Tried them yet?

Let me know what broke or what scaled.

I do not have all the answers. These are just experiments. If we realign incentives, maybe review stops feeling like a tax.

What are you trying?