RAG pipelines fail in recognizable ways. When your RAG system is misbehaving, the symptom tells you which stage is broken — and each stage has a specific fix.
~ match the symptom to the stage that's failing ~
Why this structure works
RAG is a pipeline. When output is bad, it's tempting to change everything — switch models, re-chunk, re-embed, rewrite prompts. That's expensive and often makes it worse. The right move is to localize the failure first. The user-visible symptom tells you which stage is at fault. Fix the stage that's failing, not the ones that are already working.
Symptom 1 — Hallucination / wrong citations
The retrieval stage probably worked (chunks came back), but the FM (foundation model) didn't stay grounded in them. Either grounding wasn't enforced (turn on the contextual grounding check in Guardrails), temperature was too high (drop it to 0.0-0.2 for factual Q&A), or the prompt didn't explicitly anchor the model to the retrieved context.
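The prompt-anchoring fix can be sketched as a request builder. This is a minimal illustration, not a real SDK call: `build_grounded_prompt` and the request field names are hypothetical, and the exact shape depends on which FM API you're calling.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> dict:
    """Build a request that anchors the FM to the retrieved context.

    Hypothetical request shape; adapt the field names to your FM API.
    """
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    system = (
        "Answer ONLY from the context below. Cite sources as [n]. "
        "If the context is insufficient, say \"I don't know.\""
    )
    return {
        "system": system,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "temperature": 0.1,  # low temperature for factual Q&A, per the fix above
    }
```

The system instruction and the numbered citations together make it much harder for the model to drift away from the retrieved chunks.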
Symptom 2 — Off-topic retrievals
The retrieval stage itself failed. Pure vector search missed something, likely an exact keyword match or a nuance the embedding model didn't capture. Hybrid search plus a reranker (see Pattern 2) usually fixes it. Secondary causes: the chunk size is wrong (too big or too small), or your embedding model isn't tuned for your domain.
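One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF). A minimal sketch, assuming each retriever returns an ordered list of document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists from multiple retrievers (e.g. BM25 + vector search).

    Each document scores 1 / (k + rank) in every list it appears in;
    k=60 is the commonly used default from the original RRF paper.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Documents that rank well in both lists float to the top, which is exactly what catches the keyword matches pure vector search misses. The fused list is then what you hand to the reranker.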
Symptom 3 — Cut-off responses
You're hitting the context window limit. The bucket overflowed (see the context window mental model). Cut inputs first — smaller top-k, shorter system prompt, summarized history — before reaching for a bigger-context model, which is expensive.
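The "cut inputs first" move can be sketched as a budget check before the FM call. The 4-characters-per-token estimate and the `trim_retrieval` helper are illustrative; in practice you'd use your model's real tokenizer.

```python
def trim_retrieval(system_prompt: str, chunks: list[str], budget: int,
                   est_tokens=lambda s: max(1, len(s) // 4)) -> list[str]:
    """Keep only the top-ranked chunks that fit in the token budget.

    Assumes chunks are sorted best-first (retrieval order), so we
    effectively shrink top-k instead of reaching for a bigger model.
    Token estimate is a crude chars/4 heuristic; swap in a real tokenizer.
    """
    used = est_tokens(system_prompt)
    kept = []
    for chunk in chunks:
        cost = est_tokens(chunk)
        if used + cost > budget:
            break  # bucket would overflow; drop this and everything below it
        kept.append(chunk)
        used += cost
    return kept
```

Applying the same idea to conversation history (keep the most recent turns, summarize the rest) usually buys back more room than any single lever.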
Symptom 4 — Generic responses
Retrieval is pulling data, but the FM is ignoring it or not leveraging it. Explicit prompt instructions ("use only the provided context," "say 'I don't know' if unsure") plus a couple of few-shot examples usually solve this fast — no infra changes needed.
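Those instructions plus few-shot examples fit in one prompt template. A minimal sketch with made-up example pairs; note the second few-shot pair deliberately demonstrates the "I don't know" behavior when the context doesn't contain the answer:

```python
# Hypothetical few-shot pairs: (question, context, desired answer).
FEW_SHOT = [
    ("What is the refund window?",
     "[1] Refunds are accepted within 30 days of purchase.",
     "Refunds are accepted within 30 days of purchase [1]."),
    ("Who is the CEO?",
     "[1] Pricing tiers: Basic, Pro, Enterprise.",
     "I don't know."),  # teaches the model to refuse when context is off-topic
]

def build_prompt(question: str, context: str) -> str:
    """Assemble instructions + few-shot examples + the live question."""
    lines = ['Use only the provided context. Say "I don\'t know" if unsure.', ""]
    for q, ctx, a in FEW_SHOT:
        lines += [f"Context: {ctx}", f"Q: {q}", f"A: {a}", ""]
    lines += [f"Context: {context}", f"Q: {question}", "A:"]
    return "\n".join(lines)
```

Two or three examples are usually enough; the refusal example matters most, since it's the behavior models skip by default.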
Debug order: output → retrieval → ingestion
Start debugging from what the user sees and work backwards. Check the final output — does it make sense given the retrieved chunks? Then check the retrieved chunks — are they relevant to the query? Then check the ingestion — are the right documents chunked and indexed? Don't re-ingest until you've verified retrieval is broken.
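The debug order above can be written down as a tiny triage function. A sketch, assuming you can answer three yes/no questions about a failing request (the function and its flags are illustrative, not from any library):

```python
def localize(output_grounded: bool, chunks_relevant: bool,
             docs_indexed: bool) -> str:
    """Work backwards from the user-visible output to the broken stage."""
    if output_grounded:
        return "ok"  # output makes sense given the chunks; nothing to fix
    if not chunks_relevant:
        # Retrieval looks broken; verify ingestion before re-indexing,
        # since missing documents can masquerade as bad retrieval.
        return "ingestion" if not docs_indexed else "retrieval"
    # Chunks were relevant but the answer ignored them: a generation problem.
    return "generation"
```

Each return value maps to one symptom section above: "generation" covers Symptoms 1 and 4, "retrieval" covers Symptom 2, and "ingestion" means re-chunking and re-indexing is actually warranted.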