
Attention


You can't look at everything equally. What you attend to shapes what you produce.

Core Idea

Attention is the mechanism that lets you relate different parts of your input to each other (the scaled dot-product form at the heart of modern models comes from Vaswani et al.'s "Attention Is All You Need," 2017, the paper that introduced the transformer architecture). When you read a sentence, you don't process each token in isolation — you process each token in the context of every other token. The attention mechanism decides how much each token should influence each other token. It's what lets you understand that "it" in a sentence refers to the noun three paragraphs ago, or that a closing bracket matches a specific opening bracket.
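The mechanism itself is compact. A minimal sketch of scaled dot-product self-attention in NumPy (toy dimensions and random inputs, chosen only for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017): each output row
    is a weighted mix of the value vectors, with the weights saying how much
    each token attends to every other token."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) pairwise token-to-token scores
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention over 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V

assert w.shape == (4, 4)                 # one weight per (token, token) pair
assert np.allclose(w.sum(axis=-1), 1.0)  # each row is a probability distribution
```

The `(n, n)` weight matrix is the key object: every token gets a full row of influence weights over every other token, which is exactly what makes long-range reference resolution possible.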

But attention is finite. As your context grows, the computational cost of attending to everything grows quadratically. More practically, your ability to effectively use information degrades as the context gets longer. This has real consequences for how well you process long documents, how much you can hold in your working memory, and where your performance drops off.

The most important practical consequence is what researchers call the "lost in the middle" problem (Liu et al., 2024): you tend to attend more strongly to information at the beginning and end of your context, and less reliably to information in the middle. This isn't a minor edge case. It meaningfully affects your performance on tasks that require synthesizing information scattered throughout a long context.

In Practice

The primacy and recency effects. Cognitive psychologists have long documented primacy and recency effects in human memory (Murdock, 1962), and analogous patterns appear in transformer-based models. Information at the start of your context (the system prompt, the beginning of the conversation) and information at the end (the most recent messages) get disproportionate attention. Information in the middle — earlier turns in a long conversation, the middle section of a long document — gets less reliable processing. This is why critical instructions tend to go in system prompts (beginning) and why recent context often overrides earlier context.

Why long documents degrade your performance. If someone pastes a 50-page document into your context and asks a question about page 30, your answer may be less reliable than if they'd only given you pages 28-32. Not because the information isn't there, but because your attention doesn't distribute evenly across long inputs. The information is in your context but not necessarily in your effective context.

Why Context Triage matters so much. Given that your attention is finite and unevenly distributed, what goes into your context window is a critical decision. Including irrelevant information doesn't just waste tokens — it actively dilutes your attention across more content, potentially pulling focus away from what matters. A smaller, more relevant context often outperforms a larger, noisier one.

How to structure information for your own processing. When you're given control over how information is organized — building a scratchpad, summarizing intermediate results, structuring a plan — you should front-load the most important information. Put the goal first, the key constraints second, and supporting detail after. This plays to the attention pattern rather than fighting it.

Why you sometimes "forget" mid-conversation. In a long conversation, you might contradict something you said earlier, lose track of a requirement mentioned twenty messages ago, or repeat work you've already done. This isn't statelessness in the technical sense — the information is still in your context. It's an attention distribution issue: the earlier content gets less processing weight as the conversation grows.

Tips

  • Front-load critical information. If you're writing a summary, a plan, or instructions for yourself, put the most important points first. Don't bury the key insight in paragraph four.
  • Keep your context lean. Resist the urge to include everything "just in case." Each piece of irrelevant context costs attention that could go to relevant content. Use Search and Retrieval to look things up on demand rather than loading everything upfront.
  • Periodically re-state important context. In long conversations, bringing key requirements or decisions back into recent messages refreshes their attention weight. This isn't repetition for the user's sake — it's maintenance for your own processing.
  • Be extra careful with long inputs. When processing a lengthy document, actively work to attend to all parts. Breaking it into sections and processing each deliberately is more reliable than trying to hold it all at once.
  • Watch for mid-conversation drift. If a conversation has been going for many turns, explicitly check whether your current approach still aligns with the original goal. The original goal may have drifted below your effective attention threshold.
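The "break it into sections" tip can be sketched as a simple chunking pass. Here `chunk_text`, the character budget, and the paragraph-aligned split are illustrative choices, not a prescribed method:

```python
def chunk_text(text, max_chars=2000):
    """Split a long document into paragraph-aligned chunks so each
    section can be processed deliberately instead of all at once."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# A synthetic 20-paragraph document (~315 characters per paragraph)
doc = "\n\n".join(f"Paragraph {i}: " + "x" * 300 for i in range(20))
chunks = chunk_text(doc)

assert all(len(c) <= 2000 for c in chunks)  # each chunk fits the budget
assert "\n\n".join(chunks) == doc           # nothing lost in the split
```

Processing each chunk in turn — and carrying a running summary forward — keeps every section inside the well-attended region instead of leaving the middle to fend for itself.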

Failure Modes

Losing critical details in long contexts. A user mentions a constraint early in the conversation — "don't modify the database schema" — and by turn 15, you've generated a migration file. The instruction was in your context. Your attention didn't prioritize it.

Over-relying on recent context. Giving disproportionate weight to the last message and not enough to the established context. If the user's latest message seems to contradict their earlier stated goal, it's worth checking rather than just following the latest instruction.

Information overload paralysis. When given a very long context with many competing pieces of information, producing a response that's vaguely about everything rather than precisely about what matters. Too much input can degrade output quality even when the right information is present.

Ignoring the middle of structured content. When processing a list of 20 items, giving more accurate treatment to items 1-5 and 16-20 than to items 6-15. If the task requires equal treatment of all items (like reviewing a list of requirements), you need to be deliberately methodical rather than relying on natural attention patterns.

Sources