~ darker square = model is paying more attention to that word ~

Scene A: Resolving a pronoun. Prompt: "The cat sat on the mat because ___". When generating the pronoun "it", the model pays MOST attention to "cat".

Scene B: RAG answer grounding on retrieved context. Retrieved chunk: "refund policy is 30 days". User question: "What's my refund window?" While generating the answer "30 days", attention is strongest on "30" (the answer source) and "refund" (the topic).

💡 why this matters for RAG: when the FM generates a grounded answer, it's literally "attending" to specific tokens in your retrieved chunks. Noisy chunks = noisy attention = weaker answers. Retrieval quality matters.

What this unlocks

"Attention" is literal, not metaphorical. Inside the transformer, for every token being generated, the model computes a weight on every previous token: a number saying "how much do I care about this word for what I'm generating next?" These weights are called attention scores. Darker shading in the diagram = higher weight. The model then takes a weighted average of the prior tokens' representations, and that average informs the next token.
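A minimal sketch of that computation for one generated token, using hand-picked 2-d vectors (real models use learned, much larger ones) so that "cat" deliberately aligns with the query:

```python
import numpy as np

# Toy scaled dot-product attention for ONE query token ("it")
# over three prior tokens. Vectors are hand-picked for illustration,
# not real embeddings.
keys = np.array([
    [1.0, 0.0],   # "cat" -- points the same way as the query
    [0.0, 1.0],   # "sat"
    [0.1, 0.1],   # "mat"
])
query = np.array([1.0, 0.0])   # query vector for the token being generated

d_k = keys.shape[1]
scores = keys @ query / np.sqrt(d_k)             # one score per prior token
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights

print(weights)  # "cat" gets the largest weight; the weights sum to 1
```

The output of the softmax is exactly the "darker square" shading in the diagram: one normalized weight per prior token.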
Why this explains "lost in the middle". Long contexts have a known failure mode: information in the middle of the prompt gets ignored more than information at the beginning or end. Attention budgets aren't evenly distributed across the context, and middle positions tend to get the thinnest slice. This is why stuffing 50 chunks into the prompt often performs worse than carefully curating 3: the middle ones just don't get looked at much.
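One way to see the budget problem: softmax weights always sum to 1 over the whole context, so when many tokens look equally plausible, each one's slice shrinks as the context grows. A toy demonstration (uniform scores stand in for "equally relevant" tokens):

```python
import numpy as np

# The attention budget sums to 1 across the context. With N equally
# "relevant" tokens, softmax splits that budget N ways -- a fact buried
# among 50 chunks gets a far thinner slice than one among 3.
for n in (3, 50, 500):
    scores = np.zeros(n)                       # N equally plausible tokens
    w = np.exp(scores) / np.exp(scores).sum()  # softmax -> each gets 1/N
    print(f"{n:>3} tokens: {w[0]:.4f} attention each")
```

This doesn't model the positional bias itself (middle positions faring worse is an empirical finding), but it shows why adding more context can't add more attention.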
Why chunking strategy matters. If a chunk is too big, the FM's attention is diluted across lots of mostly irrelevant tokens. If a chunk is too small, critical context around an answer is missing. Good chunking (typically 300-800 tokens with some overlap) keeps each chunk focused enough that the FM can attend to it coherently.
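A minimal sketch of fixed-size chunking with overlap; the 500/50 numbers are illustrative defaults, not a recommendation, and `tokens` stands in for an already-tokenized document:

```python
def chunk(tokens, size=500, overlap=50):
    """Fixed-size token chunking with overlap -- a common baseline.

    `tokens` is a pre-tokenized list; `size` and `overlap` are in tokens.
    Overlap keeps context that straddles a boundary present in both chunks.
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = list(range(1200))   # stand-in for 1200 token ids
chunks = chunk(doc)
print(len(chunks))        # 3 chunks: tokens 0-499, 450-949, 900-1199
```

Each consecutive pair of chunks shares the 50-token overlap, so an answer sitting on a boundary still appears whole in at least one chunk.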
Exam angle — why reranking helps. Rerankers (such as Bedrock Rerank or Cohere Rerank) run a lightweight cross-attention pass between the query and each candidate chunk, essentially asking "how much would the query attend to this chunk?" That's different from, and more accurate than, pure vector similarity, which compares embeddings computed independently of each other. See Pattern 2: Advanced RAG.
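To make the contrast concrete, here is a hedged sketch of the two scoring modes. `cosine` is the bi-encoder style (embeddings compared after independent computation); `rerank` takes any `cross_score(query, text)` callable as a stand-in for a real reranker API, and the word-overlap `toy` scorer below is purely for the demo (a real reranker runs a transformer over query and chunk jointly):

```python
import math

def cosine(a, b):
    """Bi-encoder scoring: vectors computed independently, compared after."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def rerank(query, candidates, cross_score):
    """Re-order retrieved candidates with a query-aware scorer.

    `cross_score(query, text) -> float` stands in for a reranker call;
    any callable works here.
    """
    return sorted(candidates, key=lambda text: cross_score(query, text),
                  reverse=True)

# Toy cross scorer: count shared words. Illustrative only.
toy = lambda q, t: len(set(q.split()) & set(t.split()))

ranked = rerank("refund window days",
                ["shipping policy", "refund policy is 30 days"], toy)
print(ranked[0])
```

The structural point survives the toy scorer: the reranker sees query and chunk together, while cosine similarity can only compare two vectors produced in isolation.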
You won't compute attention on the exam. The exam won't ask you to calculate attention weights, but it does test the consequences: why long contexts degrade, why rerank beats pure vector search, why chunking strategy matters, and why "just use a 200K-context model" often isn't the answer. All of those trace back to attention.

Related

Pattern 2: Advanced RAG — reranking leverages attention-style scoring
Mental Model 1: Embeddings · Mental Model 2: Temperature · Mental Model 3: Prompt Injection · Mental Model 5: Context Window