Aug 09, 2025
I watched our AI completely melt down during a CTO demo last month.
The model started confidently mixing up "deep research" (an existing feature) with "thematic analysis scans" (the new feature I'd just shipped). Every response made the confusion worse. The CTO's face went from interested to puzzled to concerned. I sat there watching my careful prompt engineering unravel in real-time.
After the demo, the CTO asked the question I dreaded: "How can we make sure this doesn't happen again?"
I couldn't promise anything. I was vibe-checking my system prompts, shipping when they "felt right." That's when it hit me—I already know how to prevent regressions. I do it every day with unit tests.
Think about this: You don't use a third-party platform for unit tests. Why would you need one for LLM evals?
This question haunted me after that demo disaster. I started evaluating all the LLM observability platforms—Helicone, LangSmith, Logfire, LangFuse, Lunary. They're all solving problems I didn't actually have. I already had vitest running my tests. I already had GitHub Actions for CI. I already had everything I needed.
The only difference between testing calculateTotal() and testing generateResponse() is that one returns numbers and the other returns text. Both are just functions that need verification.
Here's what I've learned after hundreds of prompt changes:
I haven't had a prompt regression reach production since April.
My first eval was embarrassingly basic:
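Something along these lines, with vitest and a generateResponse() helper wrapped around the model call (names here are illustrative):

```ts
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

describe("thematic analysis prompt", () => {
  it("explains the feature", async () => {
    const answer = await generateResponse("What does the thematic analysis scan do?");

    // Exact-match assertion: the naive approach.
    expect(answer).toBe(
      "Thematic analysis scans group related feedback into themes automatically."
    );
  });
});
```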
It failed immediately. Of course it did—LLMs aren't deterministic.
Here's what actually works:
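A sketch of the pattern, assuming the same hypothetical generateResponse() helper: assert on behavior instead of exact strings, sample the model a few times, and pass on a success rate rather than perfection.

```ts
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

describe("thematic analysis prompt", () => {
  it("describes the right feature, most of the time", async () => {
    const runs = 5;
    let passes = 0;

    for (let i = 0; i < runs; i++) {
      const answer = await generateResponse("What does the thematic analysis scan do?");

      // Behavioral checks instead of exact strings.
      const mentionsFeature = /thematic analysis/i.test(answer);
      const confusesFeatures = /deep research/i.test(answer);

      if (mentionsFeature && !confusesFeatures) passes++;
    }

    // LLMs aren't deterministic, so assert a success rate, not perfection.
    expect(passes / runs).toBeGreaterThanOrEqual(0.8);
  }, 120_000); // generous timeout for real model calls
});
```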
Run it with npm test. That's it. No special infrastructure needed.
Developers started complaining: "The eval failed but I can't see why without digging through CI logs."
Fair point. So I made tests write simple JSON files:
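Roughly this shape: each suite collects its metrics and an afterAll hook dumps them to a directory CI can pick up (file names and fields are illustrative).

```ts
import { afterAll, describe, it, expect } from "vitest";
import { mkdirSync, writeFileSync } from "node:fs";

// Collected by the tests below, written out once at the end.
const results: { metric: string; score: number }[] = [];

describe("theme generation quality", () => {
  it("groups feedback into sensible themes", async () => {
    // ...same sampling loop as before, producing a 0-1 pass rate...
    const score = 0.92; // placeholder for that computed pass rate
    results.push({ metric: "theme_generation_quality", score });
    expect(score).toBeGreaterThanOrEqual(0.8);
  });
});

afterAll(() => {
  // One JSON file per suite; CI collects the whole directory afterwards.
  mkdirSync("eval-results", { recursive: true });
  writeFileSync(
    "eval-results/theme-generation.json",
    JSON.stringify(
      { suite: "theme-generation", results, commit: process.env.GITHUB_SHA ?? "local" },
      null,
      2
    )
  );
});
```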
Then added a GitHub Action to post them as PR comments. Here's what every PR shows now:
10 test suites. 40+ metrics. Every PR. Those red arrows catch regressions before they ship.
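The posting side is a small script run from the workflow; a sketch using @actions/github, assuming the JSON files land in an eval-results/ directory:

```ts
import { readFileSync, readdirSync } from "node:fs";
import { context, getOctokit } from "@actions/github";

// Runs inside the workflow, with GITHUB_TOKEN exposed as an env var.
const octokit = getOctokit(process.env.GITHUB_TOKEN!);

// Flatten every suite's results into markdown table rows.
const rows = readdirSync("eval-results")
  .filter((f) => f.endsWith(".json"))
  .flatMap((f) => JSON.parse(readFileSync(`eval-results/${f}`, "utf8")).results)
  .map((r: { metric: string; score: number }) => `| ${r.metric} | ${(r.score * 100).toFixed(1)}% |`);

await octokit.rest.issues.createComment({
  ...context.repo,
  issue_number: context.issue.number,
  body: ["### Eval scorecard", "| Metric | Score |", "| --- | --- |", ...rows].join("\n"),
});
```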
"What was the score last week?" someone asked.
I discovered GitHub artifacts are basically a free time-series database with 30-day retention:
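The wiring is a short workflow; a minimal sketch, with the file name and artifact name as placeholders:

```yaml
# .github/workflows/evals.yml (sketch)
name: evals
on:
  push:
    branches: [main]
  pull_request:

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test # writes eval-results/*.json
      - name: Keep scorecards as the "time series"
        if: github.ref == 'refs/heads/main'
        uses: actions/upload-artifact@v4
        with:
          name: eval-scorecards-${{ github.sha }}
          path: eval-results/
          retention-days: 30
```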
Now every PR automatically compares against the main branch. Look at that scorecard above—those red indicators for "Theme Generation Quality" dropping by 5%? Catching that kind of regression is exactly what would have saved me from the CTO demo disaster.
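The diff itself is a few lines of script; a sketch that assumes the latest main-branch scorecards have already been pulled into a baseline/ directory (e.g. with gh run download) and the PR's fresh ones are in eval-results/:

```ts
import { readFileSync, readdirSync } from "node:fs";

type Scorecard = { suite: string; results: { metric: string; score: number }[] };

// Load every suite's metrics from a directory into one metric -> score map.
const load = (dir: string): Map<string, number> => {
  const scores = new Map<string, number>();
  for (const file of readdirSync(dir).filter((f) => f.endsWith(".json"))) {
    const card: Scorecard = JSON.parse(readFileSync(`${dir}/${file}`, "utf8"));
    for (const { metric, score } of card.results) scores.set(metric, score);
  }
  return scores;
};

const baseline = load("baseline");    // latest main run
const current = load("eval-results"); // this PR

for (const [metric, score] of current) {
  const before = baseline.get(metric);
  if (before === undefined) continue;
  const delta = score - before;
  const arrow = delta < -0.01 ? "🔴" : delta > 0.01 ? "🟢" : "⚪";
  console.log(`${arrow} ${metric}: ${(before * 100).toFixed(1)}% -> ${(score * 100).toFixed(1)}%`);
}
```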
After many months of iteration, here's my entire "eval platform": vitest for the tests, JSON scorecards written by those tests, a GitHub Action that posts them on PRs, and GitHub artifacts for the history.
No new dashboards. No new logins. No vendor lock-in.
LLM-as-judge pattern: A judge prompt plus a small helper function in your existing test suite (sketched after this list).
Performance tracking: The JSON scorecards and GitHub artifacts you're already generating.
Dataset management: Test fixtures in your repo. You already do this.
Prompt versioning: Database configs. A few days of work, not a platform purchase.
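For reference, a minimal sketch of that judge helper, assuming an OpenAI-style client; the rubric, model, and threshold are placeholders:

```ts
import OpenAI from "openai";
import { describe, it, expect } from "vitest";
import { generateResponse } from "../src/llm"; // hypothetical helper around the model call

const judge = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Ask a second model to grade the first one against a rubric, returning 0-1.
async function judgeScore(question: string, answer: string): Promise<number> {
  const res = await judge.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content:
          `Question: ${question}\nAnswer: ${answer}\n` +
          `Score the answer from 0 to 1 for accuracy and clarity. Reply with only the number.`,
      },
    ],
  });
  return parseFloat(res.choices[0].message.content ?? "0");
}

describe("thematic analysis explanations", () => {
  it("satisfies the judge", async () => {
    const question = "How is a thematic analysis scan different from deep research?";
    const answer = await generateResponse(question);
    expect(await judgeScore(question, answer)).toBeGreaterThanOrEqual(0.7);
  }, 60_000);
});
```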
Pick one LLM behavior that matters to you. Write a test for it—just a regular test in your existing test suite. Run it 10 times. Check the success rate.
That's it. You've started.
The infrastructure will evolve naturally as you need it. You'll add scorecards when you want visibility. You'll add comparisons when you want trends. But start with just a test.
We've been overthinking this. The infrastructure you need for great AI products already exists in your codebase. It's the same infrastructure you use for all your other code.
You don't need to buy a platform to solve a problem you can test your way out of.
P.S. - Yes, there are valid use cases for specialized eval platforms. But you won't know what you actually need until you've run your own evals for a while. Start with tests. When you hit a real limitation, solve that specific problem. You probably won't hit as many as you think.