Session 4

Evaluate multi-turn chat

Test full conversations, not isolated replies, and measure whether the agent completes the user's task.

Case: Text-to-SQL assistant
Format: Scenario eval
Output: Multi-turn test set

What this session solves

Many chat products pass single-turn tests but fail in real conversations, where each reply depends on the turns before it. This session shows how to evaluate the full path from user intent to final task completion.

The case is a text-to-SQL assistant with access to user-specific data, where incorrect context or a bad follow-up can break the workflow.

Agenda

  1. Define success for an end-to-end conversation.
  2. Simulate users without making the test unrealistic.
  3. Measure clarification, tool use, and final answer quality.
  4. Review transcripts and decide what to fix.
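
As a rough illustration of agenda items 1 and 3, the sketch below scores one recorded conversation on three checks: whether the assistant asked for clarification when the request was ambiguous, whether its SQL ran at all, and whether the final query returned the expected rows. Everything here is an assumption made for illustration: the names `Turn`, `Scenario`, and `score`, the `orders` table, and the heuristic that a clarifying turn ends with a question mark are hypothetical, not part of the session material.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    role: str                  # "user" or "assistant"
    text: str
    sql: Optional[str] = None  # SQL the assistant ran on this turn, if any


@dataclass
class Scenario:
    name: str
    needs_clarification: bool  # is the opening request ambiguous?
    expected_rows: list        # rows a correct final query should return
    transcript: list = field(default_factory=list)


def make_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with data for two users (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id INTEGER, month TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "2024-01", 120.0), (1, "2024-02", 80.0), (2, "2024-01", 999.0)],
    )
    return conn


def run_sql(conn, sql):
    """Return sorted result rows, or None if the query fails."""
    try:
        return sorted(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def score(scenario: Scenario, conn: sqlite3.Connection) -> dict:
    assistant = [t for t in scenario.transcript if t.role == "assistant"]

    # Clarification: did the assistant ask a question before running any SQL?
    asked_first = False
    for turn in assistant:
        if turn.sql is not None:
            break
        if turn.text.rstrip().endswith("?"):  # crude heuristic, an assumption
            asked_first = True

    results = [run_sql(conn, t.sql) for t in assistant if t.sql is not None]
    return {
        "clarification_ok": asked_first == scenario.needs_clarification,
        "tool_use_ok": bool(results) and all(r is not None for r in results),
        "task_completed": bool(results)
        and results[-1] == sorted(scenario.expected_rows),
    }


expected = [("2024-01", 120.0), ("2024-02", 80.0)]

good = Scenario("spending by month", True, expected, [
    Turn("user", "Show my spending."),
    Turn("assistant", "Which months do you want to see?"),
    Turn("user", "January and February 2024."),
    Turn("assistant", "Here are your totals.",
         sql="SELECT month, total FROM orders WHERE user_id = 1 "
             "AND month IN ('2024-01', '2024-02')"),
])

# Same request, but the assistant skips clarification and drops the user filter.
bad = Scenario("spending by month, no user filter", True, expected, [
    Turn("user", "Show my spending."),
    Turn("assistant", "Here are the totals.",
         sql="SELECT month, total FROM orders WHERE month = '2024-01'"),
])

for scenario in (good, bad):
    print(scenario.name, score(scenario, make_db()))
```

The second scenario passes the tool-use check because its query is valid SQL, yet it fails the task because it leaks another user's rows. That gap is the reason to score the end-to-end outcome separately from per-turn behavior, and to read the transcript when the checks disagree.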