A field guide to building, running, and trusting evaluations for LLM systems — written from the trenches, not the whitepapers. Tips, traps, and the war stories that taught them.
Most teams I meet treat evals like flossing — they know they should, they sometimes do, they don't really know if it's working. Then a regression hits prod, a customer flags a hallucination, and suddenly the team is staring at a notebook trying to remember what "good" looked like last month.
This course is a collection of specific stories from real systems — RAG pipelines, agent loops, multi-step extraction, re-ranking — and the eval practices that actually held up under load. Not frameworks. Not benchmarks. Stories with numbers, mistakes, and the version that finally shipped.
I've spent the last several years building LLM-powered systems in production, from small teams shipping their first RAG pipeline to platforms running millions of agent calls a day. The stories here are the ones I keep retelling at dinners and offsites; this is the long-form version, with the numbers attached.
The first time I've seen evals taught the way they actually exist in production: messy, iterative, and tied to real on-call pain.
I shipped a regression test suite the same week. Cut a class of silent failures we'd been chasing for a quarter.
Reads like someone telling you war stories at a kitchen table — except every story ends with a working playbook.