Session 4

Evaluate multi-turn chat

Test full conversations, not isolated replies, and measure whether the agent completes the user's task.

Case: Text-to-SQL assistant

Format: Scenario eval

Output: Multi-turn test set

What this session solves

Many chat products pass single-turn tests yet fail in real conversations. This session shows how to evaluate the full path from user intent to final task completion.

The case is a text-to-SQL assistant with access to user-specific data, where incorrect context or a bad follow-up can break the workflow.

Agenda

Define success for an end-to-end conversation.
Simulate users without making the test unrealistic.
Measure clarification, tool use, and final answer quality.
Review transcripts and decide what to fix.
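
To make the agenda concrete, here is a minimal sketch of what one scenario eval could look like. It is not the session's actual harness: the `Scenario` and `Result` types, the `run_scenario` function, the reply format (`{"kind": ..., "text": ...}`), the `toy_agent`, and the `orders` table are all hypothetical, invented for illustration. The sketch assumes success means "the final SQL returns the expected rows", replays scripted user turns in place of a real user simulator, and counts clarification questions along the way.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Message = Tuple[str, str]  # (role, text)


@dataclass
class Scenario:
    """One multi-turn test case (hypothetical structure)."""
    name: str
    user_turns: List[str]        # scripted user messages, in order
    expected_rows: List[tuple]   # rows the final SQL should return


@dataclass
class Result:
    transcript: List[Message] = field(default_factory=list)
    clarifications: int = 0
    final_sql: Optional[str] = None
    task_completed: bool = False


def run_scenario(
    agent: Callable[[List[Message]], Dict[str, str]],
    scenario: Scenario,
    db: sqlite3.Connection,
) -> Result:
    """Replay the scripted turns, then score the whole conversation."""
    result = Result()
    for user_text in scenario.user_turns:
        result.transcript.append(("user", user_text))
        reply = agent(result.transcript)  # agent sees the full history
        result.transcript.append(("assistant", reply["text"]))
        if reply["kind"] == "clarify":
            result.clarifications += 1
        elif reply["kind"] == "sql":
            result.final_sql = reply["text"]

    # Assumed success criterion: the last SQL the agent produced
    # returns exactly the expected rows (order-insensitive).
    if result.final_sql is not None:
        try:
            rows = db.execute(result.final_sql).fetchall()
            result.task_completed = sorted(rows) == sorted(scenario.expected_rows)
        except sqlite3.Error:
            result.task_completed = False
    return result


def toy_agent(history: List[Message]) -> Dict[str, str]:
    """Stand-in for the real assistant: asks once, then answers."""
    user_msgs = [text for role, text in history if role == "user"]
    if len(user_msgs) == 1:
        return {"kind": "clarify", "text": "Which customer do you mean?"}
    return {
        "kind": "sql",
        "text": "SELECT SUM(amount) FROM orders WHERE customer = 'acme'",
    }


def make_db() -> sqlite3.Connection:
    """Tiny in-memory fixture standing in for user-specific data."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    db.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("acme", 10), ("acme", 5), ("zen", 7)],
    )
    return db


scenario = Scenario(
    name="total_for_customer",
    user_turns=["What is the total order amount?", "For acme."],
    expected_rows=[(15,)],
)
result = run_scenario(toy_agent, scenario, make_db())
print(result.task_completed, result.clarifications)
```

Scoring the conversation as a whole, rather than each reply, is the point of the design: a clarification question is neither right nor wrong on its own, but the saved transcript lets a reviewer see whether it was needed and whether the final query used the answer.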
