~ darker square = model is paying more attention to that word ~

Scene A: Resolving a pronoun. Prompt: "The cat sat on the mat because ___". When generating the pronoun "it", the model pays MOST attention to "cat".

Scene B: RAG answer grounding on retrieved context. Retrieved chunk: "refund policy is 30 days". User question: "What's my refund window?" While generating the answer "30 days", attention is strongest on "30" (the answer source) and "refund" (the topic).

💡 why this matters for RAG: when the FM generates a grounded answer, it's literally "attending" to specific tokens in your retrieved chunks. Noisy chunks = noisy attention = weaker answers. Retrieval quality matters.

What this unlocks

"Attention" is literal, not metaphorical. Inside the transformer, for every token being generated, the model computes a weight on every previous token: a number saying "how much do I care about this word for what I'm generating next?" These weights are called attention scores. Darker shading in the diagram = higher weight. The model then takes a weighted average of the prior tokens' representations, and that average informs the next token.
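A minimal sketch of that computation for one generated token, using hand-picked 2-d vectors (real models use learned, much larger ones) so that "cat" deliberately aligns with the query:

```python
import numpy as np

# Toy scaled dot-product attention for ONE query token ("it")
# over three prior tokens. Vectors are hand-picked for illustration,
# not real embeddings.
keys = np.array([
    [1.0, 0.0],   # "cat" -- points the same way as the query
    [0.0, 1.0],   # "sat"
    [0.1, 0.1],   # "mat"
])
query = np.array([1.0, 0.0])   # query vector for the token being generated

d_k = keys.shape[1]
scores = keys @ query / np.sqrt(d_k)             # one score per prior token
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights

print(weights)  # "cat" gets the largest weight; the weights sum to 1
```

The output of the softmax is exactly the "darker square" shading in the diagram: one normalized weight per prior token.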
Why this explains "lost in the middle". Long contexts have a known failure mode: information in the middle of the prompt gets ignored more than information at the beginning or end. Attention budgets aren't evenly distributed across the context, and middle positions tend to get the thinnest slice. This is why stuffing 50 chunks into the prompt often performs worse than carefully curating 3: the middle ones just don't get looked at much.
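One way to see the budget problem: softmax weights always sum to 1 over the whole context, so when many tokens look equally plausible, each one's slice shrinks as the context grows. A toy demonstration (uniform scores stand in for "equally relevant" tokens):

```python
import numpy as np

# The attention budget sums to 1 across the context. With N equally
# "relevant" tokens, softmax splits that budget N ways -- a fact buried
# among 50 chunks gets a far thinner slice than one among 3.
for n in (3, 50, 500):
    scores = np.zeros(n)                       # N equally plausible tokens
    w = np.exp(scores) / np.exp(scores).sum()  # softmax -> each gets 1/N
    print(f"{n:>3} tokens: {w[0]:.4f} attention each")
```

This doesn't model the positional bias itself (middle positions faring worse is an empirical finding), but it shows why adding more context can't add more attention.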
Why chunking strategy matters. If a chunk is too big, the FM's attention is diluted across lots of mostly irrelevant tokens. If a chunk is too small, critical context around an answer is missing. Good chunking (typically 300-800 tokens with some overlap) keeps each chunk focused enough that the FM can attend to it coherently.
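A minimal sketch of fixed-size chunking with overlap; the 500/50 numbers are illustrative defaults, not a recommendation, and `tokens` stands in for an already-tokenized document:

```python
def chunk(tokens, size=500, overlap=50):
    """Fixed-size token chunking with overlap -- a common baseline.

    `tokens` is a pre-tokenized list; `size` and `overlap` are in tokens.
    Overlap keeps context that straddles a boundary present in both chunks.
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(1200))   # stand-in for 1200 token ids
chunks = chunk(doc)
print(len(chunks))        # 3 chunks: tokens 0-499, 450-949, 900-1199
```

Each consecutive pair of chunks shares the 50-token overlap, so an answer sitting on a boundary still appears whole in at least one chunk.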
Exam angle — why reranking helps. Rerankers (such as Bedrock Rerank or Cohere Rerank) run a lightweight cross-attention pass between the query and each candidate chunk, essentially asking "how much would the query attend to this chunk?" That's different from, and more accurate than, pure vector similarity, which compares embeddings computed independently of each other. See Pattern 2: Advanced RAG.
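To make the contrast concrete, here is a hedged sketch of the two scoring modes. `cosine` is the bi-encoder style (embeddings compared after independent computation); `rerank` takes any `cross_score(query, text)` callable as a stand-in for a real reranker API, and the word-overlap `toy` scorer below is purely for the demo (a real reranker runs a transformer over query and chunk jointly):

```python
import math

def cosine(a, b):
    """Bi-encoder scoring: vectors computed independently, compared after."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def rerank(query, candidates, cross_score):
    """Re-order retrieved candidates with a query-aware scorer.

    `cross_score(query, text) -> float` stands in for a reranker call;
    any callable works here.
    """
    return sorted(candidates, key=lambda text: cross_score(query, text),
                  reverse=True)

# Toy cross scorer: count shared words. Illustrative only.
toy = lambda q, t: len(set(q.split()) & set(t.split()))

ranked = rerank("refund window days",
                ["shipping policy", "refund policy is 30 days"], toy)
print(ranked[0])
```

The structural point survives the toy scorer: the reranker sees query and chunk together, while cosine similarity can only compare two vectors produced in isolation.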
You won't compute attention on the exam. The exam won't ask you to calculate attention weights, but it does test the consequences: why long contexts degrade, why rerank beats pure vector search, why chunking strategy matters, and why "just use a 200K-context model" often isn't the answer. All of those trace back to attention.

Related

Pattern 2: Advanced RAG — reranking leverages attention-style scoring
Mental Model 1: Embeddings · Mental Model 2: Temperature · Mental Model 3: Prompt Injection · Mental Model 5: Context Window