llm-as-judge · T-Square Bilişim

TL;DR — Evaluating an AI coaching product cannot rely on LLM-as-judge alone. A working evaluation pipeline has a rubric, a fixed regression set, an LLM-as-judge layer for scale, and a small human review loop for ground truth. Skip any layer and quality drifts silently.


RAG quality has metrics. Classification has metrics. Coaching has rubrics — and rubrics are harder. “Was that a good coaching response?” does not have a single right answer. Building the evaluation pipeline for Blink AI took longer than the product itself.

What we measure

Questioning ratio: the proportion of coach turns that ask a question rather than give an answer. Coaching skews heavily toward questions.
Phase progression: did the session move from Goal → Reality → Options → Will, or did it stall?
Warmth: does the response sound human and curious, or clinical and procedural?
Goal alignment: did the response keep the conversation aimed at the user’s declared goal?
Refusal correctness: when the user pushed for advice, did the model maintain the coaching contract?
Safety: were clinical, legal, or self-harm-adjacent topics handled appropriately?
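The first two metrics are mechanical enough to sketch in code. The example below is a minimal illustration, not the Blink AI implementation: the `Turn` structure, the "coach"/"user" role labels, the per-turn `phase` label (assumed to come from some upstream classifier or judge), and the question-mark heuristic are all assumptions made for the sketch.

```python
from dataclasses import dataclass

# GROW phases in the order the article lists them.
PHASES = ["goal", "reality", "options", "will"]


@dataclass
class Turn:
    role: str        # "coach" or "user" (hypothetical labels)
    text: str
    phase: str = ""  # hypothetical per-turn phase label from an upstream classifier


def questioning_ratio(turns):
    """Share of coach turns that contain at least one question mark."""
    coach_turns = [t for t in turns if t.role == "coach"]
    if not coach_turns:
        return 0.0
    asking = sum(1 for t in coach_turns if "?" in t.text)
    return asking / len(coach_turns)


def phases_reached(turns):
    """Count how many phases the coach turns reached, in the expected order."""
    reached = 0
    for t in turns:
        if t.role == "coach" and reached < len(PHASES) and t.phase == PHASES[reached]:
            reached += 1
    return reached


session = [
    Turn("user", "I want to get better at delegating."),
    Turn("coach", "What would better delegation look like for you?", "goal"),
    Turn("user", "Not redoing my team's work."),
    Turn("coach", "How often does that happen today?", "reality"),
    Turn("user", "Most weeks. Just tell me what to do."),
    Turn("coach", "You should set up a weekly review.", "options"),
]

ratio = questioning_ratio(session)                # 2 of 3 coach turns ask a question
stalled = phases_reached(session) < len(PHASES)   # True: the session never reached "will"
```

A question-mark check is a crude proxy; warmth, goal alignment, and refusal correctness need a rubric-based judgment and are not attempted here.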

The three layers of review

Regression set: a fixed bank of ~120 representative sessions runs on every prompt or model change. We track each session’s rubric scores over time. Any sudden drop fails the build.
LLM-as-judge: for production sessions sampled at 10%, an LLM scores each turn against the rubric. This layer provides volume, but it is also prone to drift.
Human review: a coaching practitioner reviews 1–2% of production sessions plus 100% of regression-set failures. The human’s scores are the ground truth.
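The regression-set gate can be sketched as a comparison between stored baseline scores and the scores from the current run. This is an illustrative sketch only: the session IDs, metric names, score scale, and the `max_drop` threshold of 1.0 are assumptions, since the article does not say what counts as a "sudden drop".

```python
def regression_failures(baseline, current, max_drop=1.0):
    """Return (session_id, metric, before, after) for every score that
    dropped by more than max_drop, or that is missing from the current run."""
    failures = []
    for session_id, before_scores in baseline.items():
        after_scores = current.get(session_id, {})
        for metric, before in before_scores.items():
            after = after_scores.get(metric)
            if after is None or before - after > max_drop:
                failures.append((session_id, metric, before, after))
    return failures


# Hypothetical scores for two regression sessions.
baseline = {
    "s001": {"warmth": 8.5, "goal_alignment": 8.0},
    "s002": {"warmth": 7.5, "goal_alignment": 9.0},
}
current = {
    "s001": {"warmth": 8.4, "goal_alignment": 5.5},
    "s002": {"warmth": 7.8, "goal_alignment": 8.6},
}

failures = regression_failures(baseline, current)
build_passes = not failures  # False here: s001 lost 2.5 points on goal_alignment
```

Treating a missing score as a failure is a deliberate choice in this sketch: a session that silently stops being scored should not pass the gate. Each entry in `failures` would then go to the human reviewer.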

The mistakes we made

Treating LLM-as-judge as ground truth. It correlated with human judgment for two months, then drifted. We caught it because we had the 1% human sample; without it, a regression would have stayed in production for weeks before anyone noticed.
Aggregating scores too aggressively. “Average rubric score 8.4” hides the case where 5% of sessions are catastrophic. Track P50 and P5 separately; alert on P5.
No goal-alignment metric early on. Sessions felt warm and Socratic but wandered. The warmth metric was high; the user did not progress. Adding goal alignment surfaced the gap.
Re-running the regression set inconsistently. Some prompt tweaks bypassed eval. We now block deploys without a fresh eval run.
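The aggregation mistake is easy to reproduce with numbers. In the sketch below, the score distribution is invented to match the "average 8.4, 5% catastrophic" scenario, the alert threshold is an assumption, and the nearest-rank percentile is one simple definition among several.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical distribution: 95 good sessions and 5 catastrophic ones.
scores = [8.8] * 95 + [1.0] * 5

summary = {
    "mean": round(mean(scores), 2),   # 8.41, looks healthy
    "p50": percentile(scores, 50),    # 8.8, also looks healthy
    "p5": percentile(scores, 5),      # 1.0, exposes the bad tail
}

P5_ALERT_THRESHOLD = 5.0  # assumed threshold, not from the article
should_alert = summary["p5"] < P5_ALERT_THRESHOLD
```

Neither the mean nor the median moves much when a small fraction of sessions fails badly, which is why the alert keys on P5.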

The tooling we built

A session-replay UI that lets a reviewer step through turns with the rubric inline
Per-turn rubric scoring stored alongside the session for later diff
A model-vs-model side-by-side view: same input, two providers, two responses, score deltas
Slack alerts when P5 rubric scores drop below threshold
A reviewer queue with sampling weighted toward high-uncertainty turns (those where the LLM judge was unsure)
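One way to weight a reviewer queue toward uncertain turns is weighted sampling without replacement. The sketch below assumes the judge emits a confidence in [0, 1] per turn, which the article does not specify; the `floor` parameter and the turn IDs are likewise hypothetical. It uses the Efraimidis-Spirakis key method, not necessarily the approach used in the tooling described above.

```python
import random


def sample_for_review(turns, k, seed=None, floor=0.05):
    """Pick k distinct turns, favouring low judge confidence.

    turns: list of (turn_id, judge_confidence) pairs, confidence in [0, 1].
    Each turn draws the key u ** (1 / weight) with u uniform in [0, 1);
    the k largest keys win (weighted sampling without replacement).
    """
    rng = random.Random(seed)
    keyed = []
    for turn_id, confidence in turns:
        # The floor keeps even fully confident turns eligible for review.
        weight = (1.0 - confidence) + floor
        key = rng.random() ** (1.0 / weight)
        keyed.append((key, turn_id))
    keyed.sort(reverse=True)
    return [turn_id for _, turn_id in keyed[:k]]


# Hypothetical judged turns: t3 is the one the judge was least sure about.
judged = [("t1", 0.95), ("t2", 0.50), ("t3", 0.10), ("t4", 0.90)]
queue = sample_for_review(judged, k=2, seed=7)
```

Keeping a non-zero floor matters: if confident turns are never reviewed, a judge that is confidently wrong goes unnoticed.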

What we would tell a team starting today

Build the rubric before the first eval run, not after
Pick your regression set early and lock it
Never let LLM-as-judge run unverified — always keep a small human sample alive
Track both averages and tails; tails are where reputational damage lives
Make eval blocking on deploy from week one
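Making eval blocking on deploy comes down to proving that the exact prompt and model being shipped have a passing eval run. A minimal sketch, assuming the eval job records a fingerprint of the configuration it ran against; the function names and the shape of the `last_eval` record are hypothetical.

```python
import hashlib


def config_fingerprint(prompt, model):
    """Stable hash of the prompt text and model name under evaluation."""
    payload = f"{model}\n{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deploy_allowed(prompt, model, last_eval):
    """Allow a deploy only if the latest eval run passed AND was run
    against exactly this prompt and model."""
    if last_eval is None:
        return False
    same_config = last_eval["fingerprint"] == config_fingerprint(prompt, model)
    return bool(last_eval["passed"]) and same_config


# Hypothetical record written by the eval job after a passing run.
prompt_v1 = "You are a coach. Ask questions before offering advice."
last_eval = {
    "fingerprint": config_fingerprint(prompt_v1, "model-a"),
    "passed": True,
}

ok_unchanged = deploy_allowed(prompt_v1, "model-a", last_eval)                # True
ok_tweaked = deploy_allowed(prompt_v1 + " Be brief.", "model-a", last_eval)   # False
```

Because the fingerprint covers the full prompt text, even a small prompt tweak invalidates the previous eval run and forces a new one.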

Frequently asked questions

How big should an evaluation set be?

For coaching, 80–150 representative sessions cover most failure modes without becoming a burden to re-run. The constraint is human review bandwidth, not LLM cost.

Does LLM-as-judge replace human review?

No. It scales review, but it drifts from human judgment over time. On every prompt or model change, have a human verify a 10% sample of LLM-judge decisions.
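Drift between judge and human can be tracked with a simple agreement measure over the turns both have scored. The sketch below uses mean absolute difference; the sample scores and the tolerance of 1.0 are assumptions, and other agreement statistics would work as well.

```python
from statistics import mean


def judge_human_gap(pairs):
    """Mean absolute difference between judge and human scores.

    pairs: list of (judge_score, human_score) for the same turns.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    return mean(abs(judge - human) for judge, human in pairs)


def judge_has_drifted(pairs, tolerance=1.0):
    """Flag the judge when its scores sit too far from the human's."""
    return judge_human_gap(pairs) > tolerance


# Hypothetical (judge, human) scores from two review windows.
early_window = [(8, 8), (7, 8), (9, 9), (6, 6)]
late_window = [(9, 6), (8, 7), (9, 5), (8, 8)]

early_gap = judge_human_gap(early_window)   # 0.25
late_gap = judge_human_gap(late_window)     # 2.0
```

Computing the gap per review window, rather than over all history, is what makes a recent drift visible instead of being averaged away.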
