On Building Things That Ship

There's a version of AI engineering that never ships anything. Endless benchmarking, architecture comparisons, ablations on ablations. It produces knowledge but no product.

There's another version that ships constantly but never learns anything. Move fast, break things, patch the breakage, repeat.

I've been trying to find the middle path.

What "shipping" actually requires

Shipping a model-based system requires things that pure research doesn't care about:

Latency budgets: users notice 2 seconds. They won't forgive 8.
Failure modes at the tail: 99th-percentile behavior matters once you have real users, because with enough traffic the rare case happens every day.
Observability: if you can't see what the model is doing in prod, you're flying blind.
Reversibility: you need to be able to roll back a prompt change or model version without a crisis.

Almost none of these show up in papers.
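To make the first two items concrete, here is a minimal sketch of checking tail latency against a budget. The function names, the 2-second budget, and the sample data are illustrative assumptions, not taken from any real system; the percentile uses the simple nearest-rank method.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def check_latency_budget(latencies_s, budget_s=2.0, pct=99):
    """Compare the pct-th percentile latency (seconds) to a budget.
    budget_s=2.0 is an assumed default, chosen to match the text."""
    observed = percentile(latencies_s, pct)
    return {
        "pct": pct,
        "observed_s": observed,
        "budget_s": budget_s,
        "within_budget": observed <= budget_s,
    }

# Made-up data: 98 fast requests and 2 very slow ones.
latencies = [0.4] * 98 + [8.0] * 2
mean_s = sum(latencies) / len(latencies)
report = check_latency_budget(latencies)

print(f"mean: {mean_s:.3f}s")
print(f"p{report['pct']}: {report['observed_s']}s, "
      f"within budget: {report['within_budget']}")
```

With this made-up data the mean sits well under the budget while the 99th percentile does not, which is the point of looking at the tail rather than the average.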

What research gives you

But pure engineering without research taste produces brittle systems. You need to understand why something works — not just that it does — or you can't generalize when the inputs change.

The teams I've seen do this well treat the model as a component with known properties, not a magic box. They know what the model is good at, what it's bad at, and where its failure modes cluster.

The practice

Concretely: I try to build things fast enough that I can learn from real usage, but slow enough that I understand what I'm building. That usually means:

Ship a rough version early to get signal.
Don't optimize prematurely, but don't ignore systematic failures.
Write the eval before you write the feature.
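One way to read "write the eval before you write the feature" is sketched below. Everything here is hypothetical: the refund-amount feature, the cases, the 0.9 pass-rate threshold, and the regex-based rough version are stand-ins chosen only to show the shape of the workflow, with a plain function in place of a real model call.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable

def run_eval(system, cases, min_pass_rate=0.9):
    """Run every case through `system`; a crash counts as a failure."""
    failures = []
    for case in cases:
        try:
            ok = case.check(system(case.prompt))
        except Exception:
            ok = False
        if not ok:
            failures.append(case.name)
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases)
    return {
        "pass_rate": pass_rate,
        "failures": failures,
        "passed": pass_rate >= min_pass_rate,
    }

# Step 1: the eval exists before the feature does.
CASES = [
    EvalCase("simple_amount", "Please refund $20 for order 7",
             lambda out: out.strip() == "20"),
    EvalCase("no_amount", "Where is my package?",
             lambda out: out.strip() == "none"),
    EvalCase("empty_input", "",
             lambda out: out.strip() == "none"),
]

def not_built_yet(prompt):
    raise NotImplementedError("feature not written yet")

# Step 2: a rough first version, shipped early to get signal.
def rough_version(prompt):
    match = re.search(r"\$(\d+)", prompt)
    return match.group(1) if match else "none"

before = run_eval(not_built_yet, CASES)
after = run_eval(rough_version, CASES)
print("before:", before)
print("after: ", after)
```

The failing run against the unbuilt feature is the useful part: it confirms the eval can fail, and the list of failing case names is where systematic failures would show up as the real cases accumulate.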

It's not a formula. It's a posture.