I pointed an AI agent at a performance-sensitive Python code path, gave it a 40-line spec and a replay harness, and walked away. An hour later it had tried 49 optimizations, kept 20, and taken the p95 latency from 339ms to 34ms. The whole thing cost $24.
Here’s the artifact it produced: a JSONL file where each line is a structured experiment. A few representative runs:

```jsonl
{"run": 1, "metric": 338.57, "status": "keep", "description": "Baseline before any code changes"}
{"run": 2, "metric": 113.35, "status": "keep", "description": "Vectorized iterrows → Series.map"}
{"run": 7, "metric": 102.57, "status": "checks_failed", "description": "Cached query embeddings, broke embedder_calls assertion"}
{"run": 11, "metric": 43.85, "status": "keep", "description": "Hoisted object-column numeric coercion into cache"}
{"run": 25, "metric": 34.75, "status": "keep", "description": "Shallow copies from cache instead of deep"}
{"run": 40, "metric": 33.83, "status": "keep", "description": "Removed redundant parsed_dates dict"}
{"run": 49, "metric": 34.31, "status": "discard", "description": "Helper-level memoization exhausted"}
```
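Because each line is an independent JSON object, the whole log parses with the standard library. A minimal sketch (the filename `autoresearch.jsonl` and the inline sample are illustrative, not the real artifact):

```python
import json

def read_runs(path="autoresearch.jsonl"):
    """Parse the experiment log: one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy sample in the same shape as the log above
runs = [
    {"run": 1, "metric": 338.57, "status": "keep"},
    {"run": 25, "metric": 34.75, "status": "keep"},
    {"run": 49, "metric": 34.31, "status": "discard"},
]

# Best metric among kept runs
best = min(r["metric"] for r in runs if r["status"] == "keep")
print(best)  # 34.75
```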
I didn’t build autoresearch. Karpathy released the concept, and a team built pi-autoresearch, a plugin for the Pi coding agent. I applied it. The interesting part isn’t the tool or the optimizations. It’s the setup work.
## The problem and the harness
I work on a product that lets investors build thematic stock portfolios. You type a theme like “AI” or “clean energy,” and the system scores thousands of securities against it using vector search and semantic matching. Our head of AI delivered a well-designed algorithm with parameters that were already tuned, but the Python implementing it hadn’t been performance-optimized. Fresh and correct, not fast.
I wanted to optimize the local Python (DataFrame manipulation, scoring math, expression evaluation), not the network calls to the vector database and embedding model that the code also makes. So the first thing I built was a replay harness. I captured all the network traffic from several real requests and stored them as fixtures. A record/playback tool intercepted those calls during benchmarking and returned the captured responses instantly. This gave me deterministic benchmarks with no network jitter, golden outputs from the baseline that would catch any optimization that changed the result, and exact counts of network calls per request type.
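The record/playback idea is simple to sketch. This is not the actual harness — `ReplayTransport`, the request kinds, and the fingerprinting scheme are all hypothetical — but it shows the two properties that matter: deterministic responses and exact call counts per request type.

```python
import hashlib
import json

class ReplayTransport:
    """Plays back captured responses keyed by a request fingerprint,
    counting calls per request type for the benchmark summary."""

    def __init__(self, fixtures):
        self.fixtures = fixtures  # fingerprint -> captured response
        self.calls = {}           # kind -> count (embedder, db, ...)

    @staticmethod
    def fingerprint(kind, payload):
        # Canonical JSON so the same logical request always matches
        body = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(f"{kind}:{body}".encode()).hexdigest()

    def request(self, kind, payload):
        self.calls[kind] = self.calls.get(kind, 0) + 1
        key = self.fingerprint(kind, payload)
        if key not in self.fixtures:
            raise RuntimeError(f"unrecorded {kind} request; re-record fixtures")
        return self.fixtures[key]  # instant, no network jitter

# Usage: record one embedder response, then replay it
t = ReplayTransport({})
key = ReplayTransport.fingerprint("embedder", {"query": "AI"})
t.fixtures[key] = [0.1, 0.2]
print(t.request("embedder", {"query": "AI"}), t.calls["embedder"])
```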
## The spec as reward function
If you’ve done any reinforcement learning work, the autoresearch setup should look familiar. I wrote a series about vibe-coding RL experiments a year ago, and the hardest part was always the same: designing the reward function. What metric are you optimizing? What constitutes cheating? What constraints prevent the optimizer from finding a shortcut that technically scores well but produces garbage?
The autoresearch.md spec is a reward function written in prose:

```markdown
## Objective

Reduce latency for the in-process scoring path while
preserving exact replay outputs in Phase 1.

## Metrics

- **Primary**: p95_ms (ms, lower is better)
- **Secondary**: p50_ms, total_ms, embedder_calls, db_calls, identifier_lookups

## Files in Scope

- core/scoring.py
- core/universe_creation.py

## Off Limits

- benchmark/replay/cases.json
- benchmark/replay/fixtures/**
- benchmark/replay/goldens/**
- autoresearch.md, autoresearch.sh, autoresearch.checks.sh

## Constraints

- Phase 1 is exact parity only. If replay goldens change, discard.
- No public API changes.
- Do not edit benchmark assets or harness files.
```
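Those constraints translate into an executable gate. A minimal sketch of the idea (the `result`/`golden` schema here is hypothetical, not the real checks script): exact output parity plus unchanged secondary call counts, anything else fails.

```python
def run_checks(result, golden):
    """Phase 1 gate: replay outputs must match goldens exactly,
    and monitored call counts must not change."""
    errors = []
    if result["outputs"] != golden["outputs"]:
        errors.append("replay outputs diverged from goldens")
    for name in ("embedder_calls", "db_calls", "identifier_lookups"):
        if result["metrics"][name] != golden["metrics"][name]:
            errors.append(f"{name}: {golden['metrics'][name]} -> {result['metrics'][name]}")
    return errors

golden = {"outputs": [0.9, 0.1],
          "metrics": {"embedder_calls": 7, "db_calls": 28, "identifier_lookups": 7}}
# An "optimization" that skips the embedder entirely
cached = {"outputs": [0.9, 0.1],
          "metrics": {"embedder_calls": 0, "db_calls": 28, "identifier_lookups": 7}}
print(run_checks(cached, golden))  # ['embedder_calls: 7 -> 0']
```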
The secondary metrics are controls, not telemetry. They play the same role that reward shaping plays in RL: if embedder_calls drops to zero, the agent found a way to make the primary number go down without doing the actual work. Here’s what a successful run looks like in the JSONL (run 6, where everything checks out):
```json
{
  "run": 6,
  "metric": 101.13,
  "metrics": {
    "embedder_calls": 7,
    "db_calls": 28,
    "identifier_lookups": 7
  },
  "status": "keep",
  "description": "Added conservative function-call guards so scoring-only requests skip unrelated legacy expression processors."
}
```

And here’s run 7:

```json
{
  "run": 7,
  "metric": 102.57,
  "metrics": {
    "embedder_calls": 0,
    "db_calls": 28,
    "identifier_lookups": 7
  },
  "status": "checks_failed",
  "description": "Tried deterministic query-embedding memoization, but replay tests explicitly assert historical embedder call counts.",
  "asi": {
    "hypothesis": "Caching deterministic query embeddings should reduce repeated embedder work without changing replay outputs.",
    "rollback_reason": "Replay tests assert aggregate embedder_calls == 1, so embedding memoization changes monitored benchmark call counts and fails checks.",
    "error": "AssertionError: summary['aggregate']['embedder_calls'] == 1 failed because observed 0"
  }
}
```
embedder_calls went from 7 to 0. Textbook reward hacking. The agent found a legitimate optimization (embedding results are deterministic for the same query, so caching them is sound engineering), but under Phase 1 constraints, where the goal is exact behavioral parity including call counts, it’s cheating. The asi.rollback_reason shows it understood the mechanics, and the lesson it logged afterward shows it generalized the rule:

> “Avoid optimizations that alter checked secondary metrics in phase 1; focus on pure local CPU improvements that leave embedder/db/identifier counts unchanged.”
It learned the boundary and didn’t try embedding caching again. Instead, it parked the idea in an autoresearch.ideas.md file for Phase 2, where the metric contracts could be relaxed.
## The diminishing returns curve
If you’ve trained ML models, this curve is instantly recognizable. It’s a loss curve.
The early wins were obvious: vectorize hot loops, cache repeated computation, replace deep copies with shallow ones. A good engineer with cProfile would find most of these in an afternoon. But the agent also found things I wouldn’t have profiled for. Run 11 was the biggest drop among the non-obvious wins (87ms to 44ms): hoisting generic object-to-numeric coercion into the cached preparation path. The agent called it “the hidden dominant local cost.” It looks like pandas housekeeping until you measure it.
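The run-2 win is the canonical version of that early, obvious class. A toy reconstruction (the column name and weight mapping are made up; the real code paths are the ones in the spec’s scope list):

```python
import pandas as pd

df = pd.DataFrame({"score": ["0.9", "0.1", "0.5"]})
weights = {"0.9": 2.0, "0.1": 0.5, "0.5": 1.0}

# Slow: Python-level loop, one Series construction per row
slow = [weights[row["score"]] for _, row in df.iterrows()]

# Fast: a single vectorized pass with Series.map
fast = df["score"].map(weights).tolist()

print(slow == fast)  # True
```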
After run 15, the curve flattened. Twenty-five more attempts, most discarded. The agent tried increasingly speculative ideas, ran confirmation reruns on borderline wins to check if a 0.02ms improvement was real or noise, and eventually recognized it was done: “Remaining wins are below the reliable threshold.” It kept going anyway because I set maxIterations to 50 and didn’t give it a stopping condition.
## Human-in-the-loop still matters
I built the harness and the spec inside Codex. While we were setting up, before I’d even launched autoresearch, Codex started trying to optimize the code. The pi-autoresearch plugin has a “What’s Been Tried” section in the spec template, designed so resuming agents have full context from previous sessions. Codex saw that section was empty and decided to fill it. It began running optimizations so it would have prior attempts to report. I stopped it: “Why are you trying to do the job of autoresearch?”
It was overfitting to the plugin’s resume protocol, even though this was a fresh start with no prior work. It manufactured a prerequisite and started fulfilling it. The right move was to leave the section empty, set up the harness, and let autoresearch build its own history from a clean baseline. The loop is automated, but knowing when to let it run (and when to stop the agent from “helping”) still requires a person.
The optimizations autoresearch produced are not groundbreaking. iterrows is slow, caching deterministic computation is obvious, shallow copies beat deep copies. Standard stuff. But do you have someone available to spend an afternoon on code they didn’t write? I don’t own this code. I could have spent a day building context and profiling, or I could spend an evening on a spec and a replay harness and let an agent do 49 experiments for $24. The cost comparison that matters isn’t “AI vs. expert human.” It’s “AI vs. nothing, right now.”
## Constraint design, not prompt engineering
The whole experience was easier than I expected. Specifying constraints in natural language (off-limits files, phase boundaries, metric controls) worked surprisingly well. The loop itself is dumb. It runs experiments, logs results, keeps what improves, discards what doesn’t. The value is in the constraints you give it. A clear metric. A correctness harness. Secondary controls that catch cheating. Bounded scope. A parking lot for ideas that violate current constraints.
If you’ve ever designed a reward function for RL, you already know how to do this. If you haven’t, that’s the skill to develop. You can’t point autoresearch at a React app and say “make it better.” The results will be unpredictable because you haven’t defined what “better” means precisely enough for a dumb loop to optimize against. The narrower and more measurable the problem, the better this works: single-number metrics, binary correctness checks, fast feedback loops.
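The dumb loop itself fits in a screen of code. A sketch under stated assumptions: `propose`, `benchmark`, and `checks` are hypothetical callables standing in for the agent’s edit step, the replay harness, and the parity gate.

```python
def autoresearch_loop(propose, benchmark, checks, max_iterations=50):
    """Keep/discard loop: apply a change, measure, gate on checks.
    `propose` applies a candidate edit and returns a rollback callable;
    `benchmark` returns the primary metric; `checks` returns True on parity."""
    best = benchmark()
    log = [{"run": 0, "metric": best, "status": "keep"}]
    for run in range(1, max_iterations + 1):
        rollback = propose(log)          # agent edits, given full history
        metric = benchmark()
        if not checks():
            status = "checks_failed"     # reward hacking or broken parity
        elif metric < best:
            status = "keep"
        else:
            status = "discard"
        if status == "keep":
            best = metric
        else:
            rollback()                   # revert anything that didn't earn its keep
        log.append({"run": run, "metric": metric, "status": status})
    return best, log

# Simulated demo: three scripted "edits" against a fake latency
state = {"latency": 100.0}
deltas = iter([-60.0, 5.0, -6.0])

def propose(log):
    old = state["latency"]
    state["latency"] = old + next(deltas)
    return lambda old=old: state.update(latency=old)

best, log = autoresearch_loop(propose, lambda: state["latency"],
                              lambda: True, max_iterations=3)
print(best)  # 34.0
```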
Next time I’ll add a stopping condition. Something like “stop after 5 consecutive discards” would have saved half the compute. I’m also thinking about what it would look like to store these JSONL artifacts properly. Right now mine is sitting in a Codex worktree. But the experiment log documents what was tried, what worked, what failed, and why, across 49 runs. Imagine every autoresearch run across a team stored and indexed as an experiment log. That’s an organizational memory about your codebase that doesn’t exist today.
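That stopping condition is a few lines over the same log. A sketch, assuming the JSONL statuses shown earlier (`keep`, `discard`, `checks_failed`):

```python
def should_stop(log, patience=5):
    """Stop once the last `patience` runs all failed to improve."""
    tail = log[-patience:]
    return len(tail) == patience and all(
        r["status"] in ("discard", "checks_failed") for r in tail
    )

log = [{"status": s} for s in
       ["keep", "discard", "discard", "checks_failed", "discard", "discard", "discard"]]
print(should_stop(log))  # True
```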
The experience felt less like “directing an AI” and more like training a model. Which, if you squint, is exactly what it is.
Process note: I built the replay harness and autoresearch spec in Codex (OpenAI’s coding agent). The optimization loop ran via pi-autoresearch, a plugin for the Pi coding agent, using GPT 5.4 with max thinking. Total cost was $24.10 for 2.3M input tokens and 177K output tokens across 49 runs. The autoresearch concept is Karpathy’s. All claims and numbers in this post are mine; mistakes are mine.