~ a fixed-capacity bucket for every conversation ~

✓ Fits just fine · input + output under the limit
System prompt: ~500 tokens
Retrieved chunks (RAG context): ~3000 tokens
User question + response: room to grow ✓
~3500 / 8000 tokens used · the FM has headroom to generate a full answer

✗ Overflow! · too many chunks, too long a prompt
System prompt: ~2000 tokens (too long!)
Retrieved chunks: ~7000 tokens 😱 (the user retrieved 20 chunks instead of 3)
User question + response: the response gets truncated! 📉
~10000 / 8000 tokens 💥 · truncation · errors · lost info

How to fix it (see the Python sketch below)
1 · Prompt compression · shorten the system prompt
2 · Retrieve fewer chunks · k=3, not k=20
3 · Hierarchical chunking · retrieve smaller chunks
4 · Context pruning · drop old conversation turns
5 · Bigger context model · 200K+ window (costly!)

💡 Key insight for the exam: The context window holds everything — system prompt, RAG chunks, conversation history, user question, AND the response. When a question says "response gets truncated" or "model cuts off mid-sentence" → think bucket overflow.
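To make the budget arithmetic concrete, here's a minimal Python sketch of fixes 2 and 4: reserve headroom for the response, keep retrieved chunks in relevance order until the budget runs out, then prune the oldest conversation turns. The window size, the headroom, and the 4-characters-per-token estimate are illustrative assumptions, not any particular model's numbers.

```python
# Minimal sketch: fit the bucket *before* calling the model.
# All numbers and helper names here are illustrative assumptions,
# not a specific model's limits or a library's API.

CONTEXT_WINDOW = 8_000      # total bucket size, in tokens
RESPONSE_HEADROOM = 1_500   # reserved room for the model's answer


def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)


def build_prompt(system_prompt: str, chunks: list[str],
                 history: list[str], question: str) -> str:
    """Assemble a prompt that fits the context window.

    chunks must be sorted most-relevant-first; history is oldest-first.
    """
    remaining = (CONTEXT_WINDOW - RESPONSE_HEADROOM
                 - estimate_tokens(system_prompt) - estimate_tokens(question))
    if remaining < 0:
        raise ValueError("system prompt + question alone overflow the bucket")

    # Fix 2: keep top-ranked chunks until the budget runs out (k=3, not k=20).
    kept_chunks = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > remaining:
            break
        kept_chunks.append(chunk)
        remaining -= cost

    # Fix 4: context pruning, dropping the *oldest* turns first.
    kept_history = []
    for turn in reversed(history):          # walk newest -> oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        kept_history.append(turn)
        remaining -= cost
    kept_history.reverse()                  # restore chronological order

    return "\n\n".join([system_prompt, *kept_history, *kept_chunks, question])


# Over-retrieve on purpose (20 fat chunks): build_prompt trims to fit.
prompt = build_prompt(
    system_prompt="Answer only from the provided context.",
    chunks=[f"chunk {i}: " + "lorem ipsum " * 200 for i in range(20)],
    history=["user: hi", "assistant: hello, how can I help?"],
    question="Why does my response get cut off mid-sentence?",
)
print(estimate_tokens(prompt), "tokens; budget was",
      CONTEXT_WINDOW - RESPONSE_HEADROOM)
```

In a real system you'd swap `estimate_tokens` for the model's actual tokenizer; the shape of the check, everything in minus headroom out, is what matters for the exam.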

Why this mental model sticks

"Context window is a bucket" isn't technically precise — it's a mental handle. You don't need to remember the exact token count or formula; you need to remember what category of problem overflow is. A bucket fills up, spills over, needs to be managed. That mental image does real work on the exam because it lets you pattern-match fast.

Related

Tree 4: RAG Troubleshooting — symptom #3 (response cut off) maps directly to bucket overflow
Mental Model 1: Embeddings · Mental Model 2: Temperature · Mental Model 3: Prompt Injection · Mental Model 4: Attention