rag · T-Square Bilişim

TL;DR: RAG works in demos. It breaks in production in six predictable ways: bad chunking, embedding drift, retrieval recall collapse, prompt bloat, hallucination on missing context, and evaluation that lies. Each is fixable, but if you wait for them to surface on their own, they can take months to show up.

Each stage of a RAG pipeline has its own failure mode. Build the guardrails before they bite in production.

1. Chunking that breaks meaning

Naive fixed-size chunking (every 1024 characters, no overlap) cuts sentences in half and splits answers across chunks. Retrieval finds one half; the LLM hallucinates the other. Fix: respect sentence and paragraph boundaries, use an overlap of roughly 10%, and use semantic chunkers for structured content such as docs and code.
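A minimal sketch of boundary-aware chunking with overlap, assuming a naive regex sentence splitter and character-based sizes. The function name `chunk_text` and its parameters are illustrative, not taken from any library; a production pipeline would likely use a proper sentence tokenizer and count tokens rather than characters.

```python
import re


def chunk_text(text: str, max_chars: int = 1024, overlap_ratio: float = 0.1) -> list[str]:
    """Pack whole sentences into chunks of up to max_chars characters.

    Trailing sentences that fit within max_chars * overlap_ratio are repeated
    at the start of the next chunk. A single sentence longer than max_chars
    is kept whole rather than cut.
    """
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    overlap_budget = int(max_chars * overlap_ratio)

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit in the overlap budget.
            carried: list[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > overlap_budget:
                    break
                carried.insert(0, prev)
            # Drop the overlap if it would push the new chunk past the limit.
            if carried and len(" ".join(carried + [sentence])) > max_chars:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks are built from whole sentences, no chunk starts or ends mid-sentence, and the repeated trailing sentences give the retriever a chance to find an answer that straddles a chunk boundary.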

2. Embedding drift over time

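You ship with embedding model A. The provider deprecates it. You upgrade to model B. Now queries embedded by B are compared against documents embedded by A, and because the two models produce vectors in different spaces, the similarity scores are meaningless. Fix: record the embedding model version in each vector's metadata, re-embed the whole corpus on upgrade, and never mix versions in a single index.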
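One way to enforce the "never mix versions" rule is to have the index itself reject mismatched vectors. The sketch below uses a toy in-memory index; `VersionedIndex` and the version strings are hypothetical, and a real vector store would hold the version in its own metadata fields.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    """Toy in-memory index pinned to a single embedding model version."""

    model_version: str
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def _check(self, model_version: str) -> None:
        # Refuse anything embedded by a different model than the index was built with.
        if model_version != self.model_version:
            raise ValueError(
                f"index uses {self.model_version!r}, got {model_version!r}; "
                "rebuild the index instead of mixing versions"
            )

    def add(self, doc_id: str, vector: list[float], model_version: str) -> None:
        """Store a document vector, but only if its version matches the index."""
        self._check(model_version)
        self.vectors[doc_id] = vector

    def check_query(self, model_version: str) -> None:
        """Call before searching so a query from a new model fails loudly."""
        self._check(model_version)
```

Failing loudly at write and query time turns a silent quality regression into an error you see on the day of the upgrade.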

3. Retrieval recall silently degrading

Suppose the right chunk is missing from the top-K for ~20% of queries. Nobody notices, because the LLM still produces plausible-sounding answers. Fix: build an evaluation set with ground-truth chunks, measure recall@K weekly, and alert when it drops.
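As a rough sketch, recall@K can be computed as the share of evaluation queries for which at least one ground-truth chunk appears in the top K results. The function names, the chunk-ID data shape, and the 0.05 alert threshold below are assumptions for illustration.

```python
def recall_at_k(
    retrieved: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one ground-truth chunk in the top k.

    retrieved maps query ID -> ranked chunk IDs (best first).
    relevant maps query ID -> set of ground-truth chunk IDs.
    """
    if not relevant:
        raise ValueError("evaluation set is empty")
    hits = 0
    for query_id, truth in relevant.items():
        top_k = retrieved.get(query_id, [])[:k]
        if truth.intersection(top_k):
            hits += 1
    return hits / len(relevant)


def should_alert(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """True when recall has fallen more than max_drop below the baseline."""
    return (baseline - current) > max_drop
```

Running this on a schedule against a fixed evaluation set gives you a number to track over time, so a drop in retrieval quality shows up as a trend rather than as user complaints.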

4. Prompt bloat

You start with 3 retrieved chunks. To handle edge cases you bump to 10. Then 20. Now every query carries 8,000 tokens of context, latency has tripled, cost has quadrupled, and the LLM is confused by irrelevant retrieved content. Fix: re-rank retrieved chunks (cross-encoder or LLM re-ranking) and keep the prompt tight.
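The re-ranking step can be sketched as "retrieve wide, score each candidate against the query, keep only the best few". In the sketch below the scorer is a pluggable function; `token_overlap` is a deliberately crude lexical stand-in so the example runs without dependencies, and in practice you would pass a cross-encoder or LLM-based scorer instead. All names here are illustrative.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    keep: int = 3,
) -> list[str]:
    """Score every retrieved chunk against the query and keep the best few."""
    ranked = sorted(chunks, key=lambda chunk: score(query, chunk), reverse=True)
    return ranked[:keep]


def token_overlap(query: str, chunk: str) -> float:
    """Stand-in scorer: share of query words that also appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


candidates = [
    "billing invoices are monthly",
    "to reset your password open settings",
    "password rules",
]
top = rerank("reset password", candidates, score=token_overlap, keep=2)
```

The design point is that `keep` stays small and fixed: widening retrieval to 20 candidates no longer means sending 20 chunks to the LLM.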

5. Hallucination on missing context

When retrieval finds nothing, LLMs tend to make up a confident answer. Fix: instruct the model to say “I do not have this information” when context is empty, and verify in evaluation that it actually does. Add a guardrail check before returning the answer.
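A guardrail for empty context can be as simple as never calling the model when there is nothing to ground the answer in. This is a minimal sketch: `answer_with_guardrail`, `min_chunks`, and the `generate` callable are hypothetical names, with `generate` standing in for whatever function calls your LLM.

```python
from typing import Callable

FALLBACK = "I do not have this information."


def answer_with_guardrail(
    question: str,
    chunks: list[str],
    generate: Callable[[str, list[str]], str],
    min_chunks: int = 1,
) -> str:
    """Return a fixed fallback instead of calling the LLM on empty context."""
    # Treat whitespace-only chunks as missing context.
    usable = [chunk for chunk in chunks if chunk.strip()]
    if len(usable) < min_chunks:
        return FALLBACK
    return generate(question, usable)
```

This only covers the case where retrieval returns nothing at all; context that is present but irrelevant still needs the prompt instruction and the evaluation check described above.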

6. Evaluation that lies

LLM-as-judge correlates with human judgment for some tasks and not others. Spot-check 10% of judge decisions with humans, especially after model or prompt changes. The day you stop verifying is the day your evaluation starts drifting.
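The spot-check loop can be sketched as two small steps: draw a random sample of judge decisions for human review, then measure how often the human agrees with the judge. The function names, the record shape, and the fixed seed are assumptions for illustration.

```python
import random


def sample_for_review(decisions: list[dict], rate: float = 0.10, seed: int = 0) -> list[dict]:
    """Pick a random share of judge decisions for human review (at least one)."""
    if not decisions:
        return []
    sample_size = max(1, round(len(decisions) * rate))
    # A seeded generator keeps the sample reproducible between runs.
    return random.Random(seed).sample(decisions, sample_size)


def agreement_rate(pairs: list[tuple[bool, bool]]) -> float:
    """Share of (judge_verdict, human_verdict) pairs where both agree."""
    if not pairs:
        raise ValueError("no reviewed decisions to compare")
    return sum(judge == human for judge, human in pairs) / len(pairs)
```

Tracking the agreement rate over time, and re-sampling after every model or prompt change, shows when the judge has stopped matching human judgment for your task.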

The pattern that prevents most of this

A real evaluation set with ~100+ representative queries and known-good answers
Continuous retrieval recall + answer quality measurements
Embedding version metadata on every vector
Re-ranking step between retrieval and generation
A guardrail for empty context
A spot-check loop on LLM-as-judge

None of these are exotic. Each is half a day of work. Skipping them is how RAG demos that wow stakeholders quietly become production systems that disappoint users.

Frequently asked questions

How do I measure if my RAG is actually working?

Build an evaluation set of ~100 representative queries with known-good answers. Track retrieval recall (was the right chunk in the top-K?), faithfulness (does the answer rely on retrieved context?) and answer quality (rubric or LLM-as-judge). Run it on every prompt or model change.

What chunk size should I use?

Start with chunks of 500–800 tokens and a 50–100 token overlap. Tune for your content: structured docs tend to work better with smaller chunks, narrative content with larger ones. Always preserve sentence and paragraph boundaries.
