Evals Are Underrated (And Usually Wrong)

If I had to pick the single most underinvested area in AI engineering right now, it would be evaluation.

Most teams ship a prompt, eyeball a few outputs, and call it done. Then they wonder why quality regresses when they switch model versions, or why the system works great in the demo but fails in prod.

What makes a good eval

A good eval is:

Fast enough to run on every change — if it takes 4 hours, nobody runs it.
Grounded in real failure modes — don't write evals for things that never break.
Honest about what it can't catch — evals give false confidence if you don't understand their blind spots.

The proxy problem

The hardest part isn't building the eval harness — it's picking the right metric. Most metrics are proxies for what you actually care about, and proxies break.

BLEU scores measure n-gram overlap with a reference text, not whether the summary is useful. Exact match marks a correct answer as wrong whenever it is phrased differently from the reference. Even human preference ratings drift based on who's rating and what day it is.

You're always measuring a proxy. The question is how close that proxy is to the thing you care about.
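
To make the gap concrete, here is a minimal sketch that scores two proxies against a human judgment of correctness. The examples, the `normalized_match` variant, and the `agreement` helper are all made up for illustration; they are not taken from any particular eval library.

```python
import re
import string


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality: the cheapest possible proxy."""
    return prediction == reference


def normalized_match(prediction: str, reference: str) -> bool:
    """Compare after lowercasing, stripping punctuation, and collapsing whitespace."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text).strip()

    return normalize(prediction) == normalize(reference)


# Hypothetical examples: (model output, reference, is the output actually correct?)
examples = [
    ("Paris", "Paris", True),
    ("paris.", "Paris", True),
    ("The capital is Paris", "Paris", True),
    ("Lyon", "Paris", False),
]


def agreement(metric) -> float:
    """Fraction of examples where the metric agrees with the human judgment."""
    hits = sum(metric(pred, ref) == correct for pred, ref, correct in examples)
    return hits / len(examples)


print(agreement(exact_match))       # 0.5
print(agreement(normalized_match))  # 0.75
```

Normalizing closes part of the gap, but the third example still slips through. Running a check like this on a small hand-labeled sample is one cheap way to see how far a proxy sits from what you care about.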

What I've found works

Behavioral tests over metrics — "does this always return valid JSON?" is more reliable than "does this score > 0.8 on ROUGE?"
Real user data — build your eval set from actual failures, not synthetic examples.
Separate evals by concern — one eval for correctness, one for format, one for safety. Mixing them makes failures hard to diagnose.
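
Here is a rough sketch of what the first and third points can look like together: one behavioral check for format, one check for correctness, and a pass rate reported per concern. The case data and the `answer` field are hypothetical, and a safety check is left out because it depends too much on the application to sketch generically.

```python
import json


def check_format(output: str) -> bool:
    """Behavioral check: the output parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def check_correctness(output: str, expected: str) -> bool:
    """Correctness check: the (hypothetical) "answer" field equals the expected value."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed.get("answer") == expected


# Hypothetical cases; in practice these would come from logged real failures.
cases = [
    {"output": '{"answer": "42"}', "expected": "42"},
    {"output": '{"answer": "41"}', "expected": "42"},
    {"output": "The answer is 42", "expected": "42"},
]


def run_evals(cases: list[dict]) -> dict[str, float]:
    """Return a separate pass rate for each concern instead of one blended score."""
    results = {
        "format": [check_format(c["output"]) for c in cases],
        "correctness": [check_correctness(c["output"], c["expected"]) for c in cases],
    }
    return {name: sum(passed) / len(passed) for name, passed in results.items()}


scores = run_evals(cases)
print(scores)
```

With the scores kept apart, a drop in "format" points at output structure and a drop in "correctness" points at the answers themselves. A single blended number would hide which one moved.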

I'm still figuring out a lot of this. But I think getting evals right is the difference between AI systems that actually improve over time and ones that just drift.