Session 2

Find the failures that matter

Move from pass rate to error taxonomy so the team knows which product problem to fix first.

Case: RAG agent traces
Format: Trace review
Output: Failure clusters

What this session solves

A single score does not tell you what to ship. This session shows how to read traces, group failures, and connect each failure type to a product or engineering decision.

The case is a RAG agent with noisy traces and unclear failure modes. You will separate retrieval problems, reasoning problems, policy misses, and UX gaps.
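As a rough illustration of moving from a single score to a taxonomy, the sketch below tallies reviewed traces by failure category. The trace records, their field names, and the `failure_taxonomy` helper are all hypothetical; the category labels simply mirror the four named above.

```python
from collections import Counter

# Hypothetical reviewed traces: each carries a pass/fail flag and, for
# failures, a reviewer-assigned category. The data is made up for illustration.
traces = [
    {"id": "t1", "passed": True, "category": None},
    {"id": "t2", "passed": False, "category": "retrieval"},
    {"id": "t3", "passed": False, "category": "retrieval"},
    {"id": "t4", "passed": False, "category": "policy"},
    {"id": "t5", "passed": True, "category": None},
    {"id": "t6", "passed": False, "category": "reasoning"},
]


def failure_taxonomy(traces):
    """Count failed traces per category, most frequent first."""
    counts = Counter(t["category"] for t in traces if not t["passed"])
    return counts.most_common()


# The single score: fraction of traces that passed.
pass_rate = sum(t["passed"] for t in traces) / len(traces)

# The taxonomy: which kinds of failure make up the rest.
taxonomy = failure_taxonomy(traces)

print(f"pass rate: {pass_rate:.0%}")
for category, count in taxonomy:
    print(f"{category}: {count}")
```

The pass rate alone says only that most traces failed; the tally shows that retrieval failures are the largest group in this made-up sample, which is the kind of signal that points to a specific fix.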

Agenda

  1. Read traces without getting lost in edge cases.
  2. Cluster errors by user impact, not by model symptom.
  3. Mine examples for prompts, datasets, and retrieval tests.
  4. Decide when a metric is useful and when it is hiding risk.
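
One way an aggregate metric can hide risk is by averaging over segments with very different outcomes. The sketch below is a minimal example under assumed data: the segment names, the counts, and the `pass_rate_by_segment` helper are hypothetical and not taken from the session's case.

```python
from collections import defaultdict

# Hypothetical eval results as (user_segment, passed) pairs.
# 90 general FAQ questions and 10 refund-policy questions, made up for illustration.
results = (
    [("general_faq", True)] * 88
    + [("general_faq", False)] * 2
    + [("refund_policy", True)] * 2
    + [("refund_policy", False)] * 8
)


def pass_rate_by_segment(results):
    """Return the pass rate for each segment as a dict."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [passed, total]
    for segment, passed in results:
        totals[segment][0] += int(passed)
        totals[segment][1] += 1
    return {segment: p / n for segment, (p, n) in totals.items()}


# Aggregate pass rate across all results.
overall = sum(passed for _, passed in results) / len(results)

# The same results, split by user segment.
by_segment = pass_rate_by_segment(results)

print(f"overall: {overall:.0%}")
for segment, rate in by_segment.items():
    print(f"{segment}: {rate:.0%}")
```

In this made-up sample the overall rate looks healthy while the small refund-policy segment fails most of the time, so a team watching only the aggregate would miss the segment where users are most affected.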