Aug 09, 2025
I watched our AI completely melt down during a CTO demo last month.
The model started confidently mixing up "deep research" (an existing feature) with "thematic analysis scans" (the new feature I'd just shipped). Every response made the confusion worse. The CTO's face went from interested to puzzled to concerned. I sat there watching my careful prompt engineering unravel in real-time.
After the demo, the CTO asked the question I dreaded: "How can we make sure this doesn't happen again?"
I couldn't promise anything. I was vibe-checking my system prompts, shipping when they "felt right." That's when it hit me—I already know how to prevent regressions. I do it every day with unit tests.
Think about this: You don't use a third-party platform for unit tests. Why would you need one for LLM evals?
This question haunted me after that demo disaster. I started evaluating all the LLM observability platforms—Helicone, LangSmith, Logfire, LangFuse, Lunary. They're all solving problems I didn't actually have. I already had vitest running my tests. I already had GitHub Actions for CI. I already had everything I needed.
The only difference between testing calculateTotal() and testing generateResponse() is that one returns numbers and the other returns text. Both are just functions that need verification.
Here's what I've learned after hundreds of prompt changes:
I haven't had a prompt regression reach production since April.
My first eval was embarrassingly basic:
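Something along these lines, with vitest and a generateResponse() helper wrapped around the model call (names here are illustrative):

```ts
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

describe("thematic analysis prompt", () => {
  it("explains the feature", async () => {
    const answer = await generateResponse("What does the thematic analysis scan do?");

    // Exact-match assertion: the naive approach.
    expect(answer).toBe(
      "Thematic analysis scans group related feedback into themes automatically."
    );
  });
});
```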
It failed immediately. Of course it did—LLMs aren't deterministic.
Here's what actually works:
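A sketch of the pattern, assuming the same hypothetical generateResponse() helper: assert on behavior instead of exact strings, sample the model a few times, and pass on a success rate rather than perfection.

```ts
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

describe("thematic analysis prompt", () => {
  it("describes the right feature, most of the time", async () => {
    const runs = 5;
    let passes = 0;

    for (let i = 0; i < runs; i++) {
      const answer = await generateResponse("What does the thematic analysis scan do?");

      // Behavioral checks instead of exact strings.
      const mentionsFeature = /thematic analysis/i.test(answer);
      const confusesFeatures = /deep research/i.test(answer);

      if (mentionsFeature && !confusesFeatures) passes++;
    }

    // LLMs aren't deterministic, so assert a success rate, not perfection.
    expect(passes / runs).toBeGreaterThanOrEqual(0.8);
  }, 120_000); // generous timeout for real model calls
});
```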
Run it with npm test. That's it. No special infrastructure needed.
Developers started complaining: "The eval failed but I can't see why without digging through CI logs."
Fair point. So I made tests write simple JSON files:
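Roughly this shape: each suite collects its metrics and an afterAll hook dumps them to a directory CI can pick up (file names and fields are illustrative).

```ts
import { afterAll, describe, it, expect } from "vitest";
import { mkdirSync, writeFileSync } from "node:fs";

// Collected by the tests below, written out once at the end.
const results: { metric: string; score: number }[] = [];

describe("theme generation quality", () => {
  it("groups feedback into sensible themes", async () => {
    // ...same sampling loop as before, producing a 0-1 pass rate...
    const score = 0.92; // placeholder for that computed pass rate
    results.push({ metric: "theme_generation_quality", score });
    expect(score).toBeGreaterThanOrEqual(0.8);
  });
});

afterAll(() => {
  // One JSON file per suite; CI collects the whole directory afterwards.
  mkdirSync("eval-results", { recursive: true });
  writeFileSync(
    "eval-results/theme-generation.json",
    JSON.stringify(
      { suite: "theme-generation", results, commit: process.env.GITHUB_SHA ?? "local" },
      null,
      2
    )
  );
});
```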
Then added a GitHub Action to post them as PR comments. Here's what every PR shows now:
10 test suites. 40+ metrics. Every PR. Those red arrows catch regressions before they ship.
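The posting side is a small script run from the workflow; a sketch using @actions/github, assuming the JSON files land in an eval-results/ directory:

```ts
import { readFileSync, readdirSync } from "node:fs";
import { context, getOctokit } from "@actions/github";

// Runs inside the workflow, with GITHUB_TOKEN exposed as an env var.
const octokit = getOctokit(process.env.GITHUB_TOKEN!);

// Flatten every suite's results into markdown table rows.
const rows = readdirSync("eval-results")
  .filter((f) => f.endsWith(".json"))
  .flatMap((f) => JSON.parse(readFileSync(`eval-results/${f}`, "utf8")).results)
  .map((r: { metric: string; score: number }) => `| ${r.metric} | ${(r.score * 100).toFixed(1)}% |`);

await octokit.rest.issues.createComment({
  ...context.repo,
  issue_number: context.issue.number,
  body: ["### Eval scorecard", "| Metric | Score |", "| --- | --- |", ...rows].join("\n"),
});
```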
"What was the score last week?" someone asked.
I discovered GitHub artifacts are basically a free time-series database with 30-day retention:
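The wiring is a short workflow; a minimal sketch, with the file name and artifact name as placeholders:

```yaml
# .github/workflows/evals.yml (sketch)
name: evals
on:
  push:
    branches: [main]
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test # writes eval-results/*.json
      - name: Keep scorecards as the "time series"
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: eval-scorecards-${{ github.sha }}
          path: eval-results/
          retention-days: 30
```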
Now every PR automatically compares against the main branch. Look at that scorecard above—those red indicators for "Theme Generation Quality" dropping by 5%? Catching that kind of regression is exactly what would have saved me from the CTO demo disaster.
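The diff itself is a few lines of script; a sketch that assumes the latest main-branch scorecards have already been pulled into a baseline/ directory (e.g. with gh run download) and the PR's fresh ones are in eval-results/:

```ts
import { readFileSync, readdirSync } from "node:fs";

type Scorecard = { suite: string; results: { metric: string; score: number }[] };

// Load every suite's metrics from a directory into one metric -> score map.
const load = (dir: string): Map<string, number> => {
  const scores = new Map<string, number>();
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
    const card: Scorecard = JSON.parse(readFileSync(`${dir}/${file}`, "utf8"));
    for (const { metric, score } of card.results) scores.set(metric, score);
  }
  return scores;
};

const baseline = load("baseline");    // latest main run
const current = load("eval-results"); // this PR

for (const [metric, score] of current) {
  const before = baseline.get(metric);
  if (before === undefined) continue;
  const delta = score - before;
  const arrow = delta < -0.01 ? "🔴" : delta > 0.01 ? "🟢" : "⚪";
  console.log(`${arrow} ${metric}: ${(before * 100).toFixed(1)}% -> ${(score * 100).toFixed(1)}%`);
}
```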
After many months of iteration, here's my entire "eval platform": vitest for the tests, JSON scorecards written by those tests, a GitHub Action that posts them on PRs, and GitHub artifacts for the history.
No new dashboards. No new logins. No vendor lock-in.
LLM-as-judge pattern: A judge prompt plus a small helper function in your existing test suite (sketched after this list).
Performance tracking: The JSON scorecards and GitHub artifacts you're already generating.
Dataset management: Test fixtures in your repo. You already do this.
Prompt versioning: Database configs. A few days of work, not a platform purchase.
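For reference, a minimal sketch of that judge helper, assuming an OpenAI-style client; the rubric, model, and threshold are placeholders:

```ts
import OpenAI from "openai";
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a second model to grade the first one against a rubric, returning 0-1.
async function judgeScore(question: string, answer: string): Promise<number> {
  const res = await judge.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\nAnswer: ${answer}\n` +
          `Score the answer from 0 to 1 for accuracy and clarity. Reply with only the number.`,
      },
    ],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}

describe("thematic analysis explanations", () => {
  it("satisfies the judge", async () => {
    const question = "How is a thematic analysis scan different from deep research?";
    const answer = await generateResponse(question);
    expect(await judgeScore(question, answer)).toBeGreaterThanOrEqual(0.7);
  }, 60_000);
});
```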
Pick one LLM behavior that matters to you. Write a test for it—just a regular test in your existing test suite. Run it 10 times. Check the success rate.
That's it. You've started.
The infrastructure will evolve naturally as you need it. You'll add scorecards when you want visibility. You'll add comparisons when you want trends. But start with just a test.
We've been overthinking this. The infrastructure you need for great AI products already exists in your codebase. It's the same infrastructure you use for all your other code.
You don't need to buy a platform to solve a problem you can test your way out of.
P.S. - Yes, there are valid use cases for specialized eval platforms. But you won't know what you actually need until you've run your own evals for a while. Start with tests. When you hit a real limitation, solve that specific problem. You probably won't hit as many as you think.