STARTS AUGUST 10 - applications open

Build reliable evals for AI agents in 5 weeks

Apply now

1. Why it matters

If you cannot define quality, you cannot ship AI safely. Learn how to set metrics, build datasets, inspect failures, and make launch decisions from data.

2. Instructor

Andrey Kiselyov, Head of Product.

He has shipped AI systems in production, from early RAG prototypes to platforms handling millions of agent calls. At Revolut, he owned AI feature launches.

The program uses real product cases: metrics, failure modes, tradeoffs, and rollout decisions.

3. Outcomes

Leave with an eval workflow your team can use.

Now: You do not have production data yet
After the program: You create synthetic datasets and start measuring during prototyping

Now: The pass rate looks fine, but failures keep shipping
After the program: You read traces, cluster errors, and pick the highest-impact fix

Now: Each new AI surface needs a new eval plan
After the program: You reuse one toolkit across text, images, and multi-turn chat

Now: PMs and stakeholders do not trust the metrics
After the program: You tie evals to product decisions and launch criteria

4. Syllabus

5 live sessions. 5 production cases.

5 sessions · 7.5 hours

Define quality for a Q&A agent

Case: help-center search. Turn product requirements into a first eval when you have no live traffic.

Find the failures that matter

Read traces, cluster errors, mine few-shot examples, and improve prompts without chasing a single "perfect" metric.

Evaluate image generation

Case: custom images for bank cards. Build safety rubrics and use VLM-as-judge for visual outputs.

Evaluate multi-turn chat

Case: a text-to-SQL assistant. Simulate users, build synthetic personas, and measure more than pass rate.

Roll evals into production

Set up team workflows, review gates, stakeholder reporting, and legal checks, and learn the common anti-patterns to avoid.

Session breakdown

Concrete objectives, cases, and outputs for each session.

Session 1 · Q&A agent

Define quality before you pick metrics.

Objective

Turn product requirements into eval criteria.

Case

Help-center Q&A with no live traffic.

Output

First dataset, scoring rubric, and baseline report.

Session 2 · Error analysis

Find the failures that change the product.

Objective

Move from pass rate to error taxonomy.

Case

RAG agent with noisy traces and unclear failures.

Output

Trace review workflow, failure clusters, and fix priority.
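
As a rough illustration of moving from pass rate to an error taxonomy, the sketch below tallies hand-labeled failures by category and ranks them by frequency. The trace records, the failure labels, and the `failure_clusters` helper are hypothetical examples, not course material.

```python
from collections import Counter

# Hypothetical hand-labeled trace reviews: each failed trace gets one failure label.
reviews = [
    {"trace_id": "t1", "passed": True, "failure": None},
    {"trace_id": "t2", "passed": False, "failure": "retrieval_miss"},
    {"trace_id": "t3", "passed": False, "failure": "retrieval_miss"},
    {"trace_id": "t4", "passed": False, "failure": "hallucinated_answer"},
    {"trace_id": "t5", "passed": True, "failure": None},
    {"trace_id": "t6", "passed": False, "failure": "retrieval_miss"},
]

def failure_clusters(reviews):
    """Return (pass_rate, [(label, count), ...]) with the most frequent label first."""
    if not reviews:
        return 0.0, []
    passed = sum(1 for r in reviews if r["passed"])
    counts = Counter(r["failure"] for r in reviews if not r["passed"])
    return passed / len(reviews), counts.most_common()

pass_rate, clusters = failure_clusters(reviews)
print(f"pass rate: {pass_rate:.0%}")
for label, count in clusters:
    print(f"{label}: {count}")
```

A single pass rate hides which failure dominates. The ranked list gives a first guess at fix priority, although frequency alone does not capture user impact.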

Session 3 · Image generation

Evaluate safety and visual quality separately.

Objective

Separate policy checks from subjective quality checks.

Case

Custom images for payment cards.

Output

Safety rubric, visual judge prompt, and review protocol.

Session 4 · Multi-turn chat

Test conversations, not single replies.

Objective

Measure task completion across full conversations.

Case

Text-to-SQL assistant with user-specific data.

Output

Synthetic users, multi-turn scenarios, and metrics beyond pass rate.
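
To show why whole conversations are scored rather than single replies, here is a minimal sketch comparing a per-turn pass rate with a task completion rate. The conversation records and both function names are assumptions made up for illustration.

```python
# Hypothetical graded conversations: one boolean per assistant turn,
# plus whether the user's overall task was completed.
conversations = [
    {"turns": [True, True, True], "task_completed": True},
    {"turns": [True, True, False], "task_completed": False},
    {"turns": [True, False, True, True], "task_completed": False},
]

def turn_pass_rate(conversations):
    """Share of individual turns graded as acceptable."""
    turns = [ok for c in conversations for ok in c["turns"]]
    return sum(turns) / len(turns) if turns else 0.0

def task_completion_rate(conversations):
    """Share of conversations in which the user's task was completed."""
    if not conversations:
        return 0.0
    return sum(c["task_completed"] for c in conversations) / len(conversations)

print(f"turn pass rate: {turn_pass_rate(conversations):.0%}")
print(f"task completion: {task_completion_rate(conversations):.0%}")
```

In this toy data most turns pass, yet only one of three conversations completes the task, which is the gap that per-reply metrics miss.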

Session 5 · Production rollout

Make evals part of release management.

Objective

Turn evals into launch gates and a weekly operating rhythm.

Case

AI feature with stakeholder, legal, and launch risk.

Output

Review gates, owner map, and reporting cadence.
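
One way to picture an eval-backed launch gate is a small check that compares measured metrics against agreed thresholds. This is a sketch under assumptions: the `Gate` class, the metric names, and the threshold values are hypothetical and would be set by each team.

```python
from dataclasses import dataclass

@dataclass
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool = True

def check_gates(metrics, gates):
    """Return a list of human-readable gate failures; an empty list means go."""
    failures = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None:
            # A missing measurement blocks the launch rather than passing silently.
            failures.append(f"{gate.metric}: missing")
            continue
        ok = value >= gate.threshold if gate.higher_is_better else value <= gate.threshold
        if not ok:
            failures.append(f"{gate.metric}: {value} vs threshold {gate.threshold}")
    return failures

# Hypothetical eval results and launch criteria.
metrics = {"task_completion": 0.82, "unsafe_output_rate": 0.03}
gates = [
    Gate("task_completion", 0.80),
    Gate("unsafe_output_rate", 0.01, higher_is_better=False),
]

failures = check_gates(metrics, gates)
print("launch blocked:" if failures else "launch approved", failures)
```

Treating a missing metric as a failure is a deliberate choice here, so a skipped eval run cannot approve a release by accident.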

5. Who should apply

For teams shipping AI features.

AI, ML, and backend engineers

Build automated eval pipelines, test multi-step agents, measure RAG quality, and catch regressions before release.

AI product managers

Connect user outcomes to model metrics, define error taxonomies, and give engineering clear launch criteria.

Tech leads and engineering managers

Pick the eval stack, set team rituals, reduce manual review cost, and make evals part of release management.

6. Reviews

What past students said

I came for production details on RAG and agent training. The instructors were clearly practitioners.

Anton Shelin

Program: AI Agents

I had a specific multi-agent architecture question. The instructor reviewed it directly and saved me weeks of experiments.

Alexander Yartsev

Program: AI Agents

I went from not understanding LangChain to building a RAG assistant over internal docs.

Pavel Razuvaev

Program: LLM

The assignments were hard in the right way. They turned vague LLM knowledge into a working model.

Anton Shelin

Program: LLM

7. Trusted by

Teams represented in past cohorts

8. Apply now

A 5-week program for production AI evals.

5 live sessions
5 production cases: Q&A, RAG, image generation, multi-turn text-to-SQL chat, and production rollout
Hands-on assignments with feedback
Recordings and materials
Invoice and reimbursement support

$1,000 one-time payment

Apply now →

Starts August 10, 2026 · 5 weeks

Learn affordably

Not useful? Get a refund

Email us in the first two weeks for a full refund. After that, we refund unused sessions.

Use your company budget

We provide a receipt, invoice, and a short reimbursement email.

Expense it →

9. FAQ

Frequently asked questions

Do I need to code?

Yes. You need basic Python and working knowledge of an LLM API, such as the OpenAI API.

How much time does it take?

Plan for 4-6 hours per week for 5 weeks. It works with a full-time job.

Does this apply outside OpenAI?

Yes. The eval patterns work with OpenAI, Anthropic, Llama, Mistral, and other model stacks.

Will the sessions be recorded?

Yes. Every live session is recorded. You can watch later in your time zone.

Can my company pay?

Yes. We provide a receipt, company invoice, and a short reimbursement email.

What if it is not useful?

Email us in the first two weeks for a full refund. After that, we refund unused sessions.