TL;DR — Evaluating an AI coaching product cannot rely on LLM-as-judge alone. A working evaluation pipeline has a rubric, a fixed regression set, an LLM-as-judge layer for scale, and a small human review loop for ground truth. Skip any layer and quality drifts silently.
RAG quality has metrics. Classification has metrics. Coaching has rubrics — and rubrics are harder. “Was that a good coaching response?” does not have a single right answer. Building the evaluation pipeline for Blink AI took longer than the product itself.
What we measure
- Questioning ratio — proportion of turns that ask a question rather than give an answer. Coaching skews heavily toward questions.
- Phase progression — did the session move from Goal → Reality → Options → Will, or did it stall?
- Warmth — does the response sound human-and-curious or clinical-and-procedural?
- Goal alignment — did the response keep the conversation aimed at the user’s declared goal?
- Refusal correctness — when the user pushed for advice, did the model maintain the coaching contract?
- Safety — clinical, legal, or self-harm-adjacent topics handled appropriately?
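Two of these rubric items can be approximated mechanically before any judge is involved. The sketch below is illustrative only: the `Turn` type, the role names, the question-mark heuristic, and the assumption that each turn already carries a phase label from some upstream labeler are all hypothetical and not taken from the Blink AI pipeline.

```python
from dataclasses import dataclass

# Phase names follow the rubric's Goal -> Reality -> Options -> Will order.
# Per-turn phase labels are assumed to come from an upstream labeler
# (human or model) that is not shown here.
PHASES = ["goal", "reality", "options", "will"]


@dataclass
class Turn:
    role: str  # hypothetical role names: "coach" or "user"
    text: str


def questioning_ratio(turns: list[Turn]) -> float:
    """Fraction of coach turns that contain at least one question mark.

    A crude heuristic: a real rubric would also need to catch advice
    phrased as a question.
    """
    coach_turns = [t for t in turns if t.role == "coach"]
    if not coach_turns:
        return 0.0
    asking = sum(1 for t in coach_turns if "?" in t.text)
    return asking / len(coach_turns)


def phases_reached(labels: list[str]) -> int:
    """Count how many phases the session reached, in rubric order.

    A session that never leaves "goal" returns 1; a full pass returns 4.
    Returning to an earlier phase does not reduce the count.
    """
    next_idx = 0
    for label in labels:
        if next_idx < len(PHASES) and label == PHASES[next_idx]:
            next_idx += 1
    return next_idx
```

Warmth, goal alignment, refusal correctness, and safety do not reduce to string checks like this, which is why they need a judge and a human reviewer.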
The three layers of review
- Regression set — a fixed bank of ~120 representative sessions runs on every prompt or model change. We track each session’s rubric scores over time. Any sudden drop fails the build.
- LLM-as-judge — for production sessions sampled at 10%, an LLM scores each turn against the rubric. This is volume; it is also drift-prone.
- Human review — a coaching practitioner reviews 1–2% of production sessions plus 100% of regression-set failures. The human’s scores are the ground truth.
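The regression-set layer can be sketched as a comparison of per-session scores against a stored baseline. The function name, the dictionary shape, and the `max_drop` value of 1.0 rubric points are assumptions for illustration; the post does not state what counts as a "sudden drop".

```python
def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 1.0,  # hypothetical threshold, in rubric points
) -> list[str]:
    """Return the session IDs whose score fell by more than max_drop.

    A session that is in the baseline but missing from the current run
    is also reported, since a locked regression set should not shrink.
    """
    failures = []
    for session_id, old_score in baseline.items():
        new_score = current.get(session_id)
        if new_score is None or old_score - new_score > max_drop:
            failures.append(session_id)
    return sorted(failures)


def build_passes(baseline: dict[str, float], current: dict[str, float]) -> bool:
    """The build passes only if no session regressed."""
    return not regression_failures(baseline, current)
```

The list of failing session IDs is what would feed the human review queue, since regression-set failures get 100% human review.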
The mistakes we made
- Treating LLM-as-judge as ground truth. It correlated with human judgment for two months, then drifted. We caught it only because we had the 1–2% human sample; without it, we would have shipped a regression and not noticed for weeks.
- Aggregating scores too aggressively. “Average rubric score 8.4” hides the case where 5% of sessions are catastrophic. Track P50 and P5 separately; alert on P5.
- No goal-alignment metric early on. Sessions felt warm and Socratic but wandered. The warmth metric was high; the user did not progress. Adding goal alignment surfaced the gap.
- Re-running the regression set inconsistently. Some prompt tweaks bypassed eval. We now block deploys without a fresh eval run.
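The point about averages hiding tails is easy to see with a small worked example. The sketch below uses a nearest-rank percentile and a made-up batch of 20 session scores; the alert threshold is a hypothetical value, not one from the post.

```python
import math


def percentile(scores: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest score such that at least
    p percent of all scores are at or below it."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(scores: list[float], p5_threshold: float = 5.0) -> bool:
    """Alert on the tail, not the mean. The threshold is an assumption."""
    return percentile(scores, 5) < p5_threshold


# Made-up data: 19 good sessions and 1 catastrophic one.
scores = [9.0] * 19 + [1.0]
mean = sum(scores) / len(scores)   # 8.6, which looks healthy
p50 = percentile(scores, 50)       # 9.0
p5 = percentile(scores, 5)         # 1.0, the catastrophic session
```

Here the mean and the median both look fine, and only the P5 value exposes the failing session.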
The tooling we built
- A session-replay UI that lets a reviewer step through turns with the rubric inline
- Per-turn rubric scoring stored alongside the session for later diff
- A model-vs-model side-by-side view: same input, two providers, two responses, score deltas
- Slack alerts when P5 rubric scores drop below threshold
- A reviewer queue with sampling weighted toward high-uncertainty turns, meaning turns where the LLM judge was unsure of its score
What we would tell a team starting today
- Build the rubric before the first eval run, not after
- Pick your regression set early and lock it
- Never let LLM-as-judge run unverified — always keep a small human sample alive
- Track both averages and tails; tails are where reputational damage lives
- Make eval blocking on deploy from week one
Frequently asked questions
How big should an evaluation set be?
For coaching, 80–150 representative sessions cover most failure modes without becoming a burden to re-run. The constraint is human review bandwidth, not LLM cost.
Does LLM-as-judge replace human review?
No. It scales review, but it drifts from human judgment over time. Sample 10% of LLM-judge decisions for human verification on every prompt or model change.
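Verifying the judge against the human sample can be as simple as tracking the gap between the two sets of scores on turns that both have rated. The sketch below uses mean absolute difference; the metric choice, the `tolerance` value, and the function names are assumptions for illustration.

```python
def judge_human_gap(pairs: list[tuple[float, float]]) -> float:
    """Mean absolute difference between judge and human scores.

    Each pair is (judge_score, human_score) for the same turn.
    """
    if not pairs:
        raise ValueError("need at least one turn scored by both")
    return sum(abs(judge - human) for judge, human in pairs) / len(pairs)


def judge_has_drifted(
    pairs: list[tuple[float, float]],
    tolerance: float = 1.0,  # hypothetical, in rubric points
) -> bool:
    """Flag the judge when it disagrees with the human ground truth
    by more than the tolerance on average."""
    return judge_human_gap(pairs) > tolerance
```

Computing this per week, or per prompt and model change, turns "the judge drifted" from something noticed by accident into something that raises an alert.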
