NEW — Cohort 02 is open: stories from production

Evals are how
your AI gets better.

A field guide to building, running, and trusting evaluations for LLM systems — written from the trenches, not the whitepapers. Tips, traps, and the war stories that taught them.

// 00. Premise

Most teams I meet treat evals like flossing — they know they should, they sometimes do, and they don't really know if it's working. Then a regression hits prod, a customer flags a hallucination, and suddenly the team is staring at a notebook trying to remember what "good" looked like last month.

This course is a collection of specific stories from real systems — RAG pipelines, agent loops, multi-step extraction, re-ranking — and the eval practices that actually held up under load. Not frameworks. Not benchmarks. Stories with numbers, mistakes, and the version that finally shipped.

// 01. Stories

Six lessons from production LLM systems.

06 entries · ~4h read
// 02. Teacher
Author note

Written by someone who has shipped this, broken this, and fixed it again.

I've spent the last several years building LLM-powered systems in production — from small teams shipping their first RAG pipeline to platforms running millions of agent calls a day. The stories here are the ones I keep retelling at dinners and offsites; this is the long-form version, with the numbers attached.

Placeholder bio: swap this paragraph for your own background — companies, years in the field, what you led, the one weird thing only you know about evals.

// 03. Notes from readers

What students said after the first cohort.

03 selected

The first time I'd seen evals taught the way they actually exist in production — messy, iterative, and tied to real on-call pain.

A. — Senior ML Engineer
[placeholder] · scale-up, fintech

I shipped a regression test suite the same week. Cut a class of silent failures we'd been chasing for a quarter.

B. — Staff Engineer
[placeholder] · consumer AI product

Reads like someone telling you war stories at a kitchen table — except every story ends with a working playbook.

C. — Engineering Lead
[placeholder] · early-stage AI startup