The Truth Behind the Context Window
A context window is the amount of information an AI model can "see" and process at one time: its working memory, or attention span. As you chat with a model, everything in the conversation, from files to the model's responses, fills up that window. It is the closest thing the model has to what we, as humans, call memory.
The context window is one of the most important properties of an LLM. A small context window means the model "forgets" earlier parts of a long conversation. Large context windows do improve the model's ability to "remember", but they introduce the "lost in the middle" problem, where models pay less attention to middle sections. A larger context window also requires more computation, increasing cost and reducing speed.
To understand why this is such a difficult problem, let me give you some context.
A brief history
Previously, it was difficult for models to understand sequences. Recurrent Neural Networks (RNNs) processed text one token at a time, passing a hidden state from step to step.
This meant that information degraded over distance: by word 50, the network had largely forgotten word 1.
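A toy numerical sketch of that decay, using a one-dimensional "RNN" with made-up weights (nothing like a real architecture, just the shape of the problem):

```python
import math

def rnn_step(hidden, token_value, w_h=0.5, w_x=1.0):
    # Each step mixes the old state with the new token, then squashes it.
    return math.tanh(w_h * hidden + w_x * token_value)

hidden = 0.0
hidden = rnn_step(hidden, 1.0)      # word 1 carries a strong signal
for _ in range(49):
    hidden = rnn_step(hidden, 0.0)  # 49 later words, no new signal
# The contribution of word 1 shrinks at every step, because it is
# multiplied by w_h (< 1) and squashed through tanh each time.
print(round(hidden, 6))
```

By step 50, the signal from word 1 has decayed to effectively zero; LSTMs slowed this decay with gating but did not eliminate it.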
Then came Long Short-Term Memory (LSTM), where "gates" were used to control what information to keep or forget. This worked better, but still struggled beyond 200 to 500 tokens (a token is a chunk of text, often a word or word fragment, that gets mapped to an integer for the model to process). Technically, there was no hard context limit. The context just fades over time as more is added.
Then came the paper that changed it all: Attention Is All You Need. Instead of processing text sequentially, transformers look at all tokens simultaneously using "self-attention". For each token, the model asks: "How much should I pay attention to every other token?"
The original transformer had a 512-token context window, a deliberate engineering choice. Because every token is compared to every other token, attention produces an n × n matrix, where n is the number of tokens.
Attention weights for "The cat sat on the mat":

```
       The   cat   sat   on    the   mat
The  [ 0.1   0.3   0.2   0.1   0.1   0.1 ]
cat  [ 0.1   0.4   0.1   0.2   0.1   0.1 ]
sat  [ 0.1   0.1   0.3   0.1   0.2   0.2 ]
on   [ 0.2   0.2   0.1   0.2   0.1   0.2 ]
the  [ 0.1   0.3   0.2   0.1   0.1   0.2 ]
mat  [ 0.1   0.3   0.2   0.2   0.1   0.1 ]
```
The computation scales quadratically, and at some point you will literally run out of GPU memory. GPUs are used because they have thousands of cores optimized for parallel matrix operations, and VRAM bandwidth is around 40 times higher than system RAM's, which is essential for the massive parallel computations in attention.
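A back-of-envelope sketch of that quadratic growth, assuming 2 bytes per attention score (fp16) for a single head in a single layer; real models multiply this by many heads and layers:

```python
# Memory for one n x n attention matrix at fp16 (2 bytes per score).
# Illustrative arithmetic only, not a profile of any real model.
def attn_matrix_bytes(n_tokens, bytes_per_score=2):
    return n_tokens * n_tokens * bytes_per_score

for n in (512, 8_192, 131_072):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.4f} GiB per head, per layer")
```

Doubling the context quadruples the matrix: at 131k tokens a single head's matrix is already 32 GiB, which is why the naive approach hits a wall.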
How context windows grew
Since GPT-3's release, this hard limit has increased through key architectural innovations:
- FlashAttention solved the GPU memory bottleneck. Instead of materializing the full n × n attention matrix in memory, FlashAttention computes attention in smaller chunks, streaming through GPU on-chip SRAM (a small but ultra-fast cache in each streaming multiprocessor). This enabled 4 to 8 times longer sequences on the same hardware, purely through compute efficiency.
- ALiBi: Position Extrapolation. Traditional transformers used learned position embeddings, essentially a lookup table, meaning that the model literally had no representation for positions beyond its training length. The insight: instead of telling the model "this token is at position 1437", ALiBi only encodes relative distance through a penalty:

  attention_score = (query · key) - m × distance

  The penalty grows linearly with distance: adjacent tokens (distance = 1) get a small penalty; tokens 1,000 apart get a large one. This removes the need for position embeddings altogether.
- RoPE: Rotary Position Embeddings. This is the dominant approach today. It encodes positions by rotating the query and key vectors in 2D planes, so relative positions emerge naturally. Once RoPE became standard, researchers found ways to extend it.
The key property: the dot product (attention score) depends on the distance between positions, not the positions themselves. Positions 2 and 5 give the same dot product as positions 7 and 10.
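That relative-position property can be checked numerically in a single 2D plane. A minimal sketch, with made-up vectors and rotation frequency theta (a real model uses many planes with different frequencies):

```python
import math

def rotate(v, pos, theta=0.5):
    """Rotate a 2D vector by pos * theta radians (RoPE in one plane)."""
    angle = pos * theta
    x, y = v
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(q, k, pos_q, pos_k):
    # Attention score between a rotated query and a rotated key.
    qx, qy = rotate(q, pos_q)
    kx, ky = rotate(k, pos_k)
    return qx * kx + qy * ky

q, k = (1.0, 0.3), (0.8, -0.5)
s1 = score(q, k, 2, 5)    # positions 2 and 5: distance 3
s2 = score(q, k, 7, 10)   # positions 7 and 10: same distance 3
# s1 and s2 agree to floating-point precision: only relative distance matters.
```

This works because composing two rotations depends only on the difference of their angles, so the absolute positions cancel out of the dot product.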
The remaining challenges
The remaining limiting factors fall into four areas:
- Quality: "Lost in the middle" problem persists
- Cost: Long contexts are expensive to serve
- Training data: Need lots of naturally long documents
- Evaluation: Hard to measure if models truly use 1M tokens well
Setting aside the problems that can be fixed by throwing money at them, the lost-in-the-middle problem has only been mitigated, not completely solved.
The "Lost in the Middle" problem
To explain this problem, researchers at Stanford created a simple but revealing test.
Setup: Give a model a question and 20 Wikipedia passages. Exactly one passage contains the answer. The other 19 are distractors.
Key manipulation: Change where that answer-containing passage appears (position 1, 5, 10, 15, or 20) and measure accuracy.
The hypothesis: If models truly use their context well, performance should be roughly flat across positions.
When relevant information is placed in the middle of its input context, GPT-3.5-Turbo's accuracy drops to 53.8%. That is worse than its performance when predicting without any documents at all (closed-book: 56.1%).
Why does this happen?
The researchers investigated three possible explanations:
- Primacy and recency bias: Like humans remembering the first and last items on a grocery list, language models exhibit similar cognitive biases.
- Decoder-only architecture: GPT, Claude, and Llama are all "decoder-only" models. They can only look backward while processing text, never forward.
- Model scale matters: The U-shaped curve only appears in larger models. Smaller ones have different problems.
It was never designed to do this. The transformer was built for machine translation: sentences, maybe a paragraph. The "Attention Is All You Need" paper worked with sequences of a few hundred tokens. Nobody was thinking about 100K context windows. The architecture was never stress-tested for that.
Attention isn't retrieval
Here's the core issue: attention was designed for mixing information, not for retrieval.
The original purpose: blend information across a short sequence so each token's representation reflects its neighbors (the machine-translation use case).
What we're now asking it to do: pinpoint a single relevant fact buried among hundreds of thousands of tokens.
Those are fundamentally different tasks. Attention is good at the first one. It was never optimized for the second.
Mitigation techniques
There are a few mitigation techniques to work around this:
1. The Instruction Sandwich
The simplest and most effective technique: repeat critical instructions at both the beginning and end of your prompt.
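A minimal sketch of the sandwich; the function name and strings are placeholders, not a library API:

```python
# Put critical instructions at both ends of the prompt, where the
# primacy/recency effect means attention is strongest.
def sandwich_prompt(instructions: str, document: str) -> str:
    return (
        f"{instructions}\n\n"                                  # start: high-attention zone
        f"--- DOCUMENT ---\n{document}\n--- END DOCUMENT ---\n\n"
        f"Reminder: {instructions}"                            # end: high-attention zone
    )

prompt = sandwich_prompt(
    "Answer only from the document. Cite the section you used.",
    "...long document text...",
)
```

The document, the part most likely to be skimmed, sits in the middle; the instructions, the part that must not be dropped, bracket it.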
2. Structured Formatting
Models parse structured text significantly better than dense paragraphs. Headings, bullet points, and clear separators give the model anchor points to attend to.
3. Chunking + Map-Reduce
Instead of processing one massive input, break it into sections:
Traditional:
[100-page document] → Model → Answer
Map-Reduce:
[Page 1-10] → Model → Summary 1
[Page 11-20] → Model → Summary 2
[Page 21-30] → Model → Summary 3
...
[All summaries] → Model → Final Answer
Each chunk stays well within the high-attention zone. The final synthesis works with condensed, relevant information.
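The control flow can be sketched as follows; `call_model` stands in for any LLM API and is stubbed here so the example runs as-is:

```python
def call_model(prompt: str) -> str:
    # Stub: replace with a real LLM API call.
    return f"summary-of({len(prompt)} chars)"

def chunk(text: str, size: int = 4_000) -> list[str]:
    # Fixed-size character chunks; real pipelines often split on headings.
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_answer(document: str, question: str) -> str:
    # Map: summarize each chunk independently, keeping every call short.
    summaries = [call_model(f"Summarize for '{question}':\n{c}")
                 for c in chunk(document)]
    # Reduce: answer the question from the condensed summaries only.
    return call_model(f"{question}\n\nNotes:\n" + "\n".join(summaries))
```

Each map call sees only one chunk, and the reduce call sees only summaries, so no single prompt ever approaches the low-attention middle of a huge context.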
4. Use RAG
Retrieval-Augmented Generation (RAG) sidesteps the problem entirely by only putting relevant information in the context in the first place.
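A toy sketch of the retrieval step: score passages by word overlap with the query and keep only the top hits. Real RAG uses embeddings and a vector store; this only shows the shape of the idea, with made-up passages:

```python
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    # Rank passages by how many query words they share, keep the top k.
    q = set(query.lower().split())
    scored = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return scored[:k]

passages = [
    "The mitochondria is the powerhouse of the cell.",
    "Attention scales quadratically with sequence length.",
    "RoPE rotates query and key vectors to encode position.",
]
top = retrieve("why does attention scale with sequence length", passages)
# Only the retrieved passages go into the prompt, not the whole corpus.
```

Because only a handful of relevant passages enter the context, there is no "middle" full of distractors for the model to get lost in.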
The Context You Don't See
Tip: You can run /context in Claude Code to see exactly how much of your context window is being used.
When using AI assistants with tools, plugins, or custom instructions, significant context is consumed before you type anything.
A typical coding assistant might have:
- System instructions: 2 to 4k tokens
- Tool definitions: 10 to 20k tokens
- Custom project context: 1 to 5k tokens
That's 15 to 30k tokens of overhead. On a 128k model, around 20% of your context is already spoken for. This matters when you're trying to analyze large documents or maintain long conversations.
The MCP Problem
MCP (Model Context Protocol) connects AI assistants to external tools and data sources. The problem: most MCP clients load all tool definitions upfront directly into context.
Anthropic's engineering team documented setups where tool schemas alone consumed 134k tokens (67% of a 200k context window) before any conversation began.
Consider a five-server MCP setup: 58 tools consuming approximately 55k tokens before the conversation even starts. Add more servers like Jira (which alone uses around 17k tokens) and you're quickly approaching 100k plus token overhead.
The solution: Load on-demand
Anthropic's Tool Search Tool takes a different approach. Instead of loading all tool definitions upfront, Claude receives a single "search tools" capability (around 500 tokens). When it needs a specific tool, it searches by keyword and loads only what's relevant.
The results:
| Metric | Before | After |
|---|---|---|
| Token usage | ~77K | ~8.7K |
| Reduction | - | 85% |
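The on-demand pattern can be sketched as a single always-loaded search function over a registry of full schemas; the registry contents here are invented, not Anthropic's implementation:

```python
# Full tool schemas live outside the context window until matched.
TOOL_REGISTRY = {
    "jira_create_issue": "Create a Jira issue. Params: project, title, body...",
    "github_open_pr":    "Open a GitHub pull request. Params: repo, branch...",
    "slack_post":        "Post a Slack message. Params: channel, text...",
}

def search_tools(keyword: str) -> dict[str, str]:
    """The only tool always in context; returns matching schemas on demand."""
    keyword = keyword.lower()
    return {name: schema for name, schema in TOOL_REGISTRY.items()
            if keyword in name or keyword in schema.lower()}

loaded = search_tools("jira")   # only this schema enters the context now
```

The fixed cost is the tiny search function; everything else is paid only when a matching tool is actually needed.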
Skills: A different architecture
Skills take this further with progressive disclosure, a lazy loading architecture that's fundamentally different from MCP tools:
MCP Tools (Traditional):
├── All 58 tool definitions loaded at startup → 55K tokens
├── Every tool visible on every request → Always present
└── Token cost: FIXED (pay upfront)
Skills (Progressive Disclosure):
├── Level 1: Only metadata loads (name + description) → ~100 tokens
├── Level 2: SKILL.md loads when relevant → <5K tokens
├── Level 3: Referenced files load as needed → On-demand
└── Token cost: PAY-AS-YOU-GO
Because files don't consume context until accessed, skills can include comprehensive API documentation, large datasets, or extensive examples. There's no context penalty for bundled content that isn't used.
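The lazy-loading levels can be sketched like this; the class, file layout, and token counts are illustrative assumptions, not the actual Skills implementation:

```python
class Skill:
    def __init__(self, name: str, description: str, skill_md: str):
        self.name = name                  # Level 1: always in context
        self.description = description    # Level 1: always in context
        self._skill_md = skill_md         # Level 2: held outside context
        self.loaded = False

    def metadata(self) -> str:
        # The ~100-token card the model always sees.
        return f"{self.name}: {self.description}"

    def load(self) -> str:
        # Level 2: the full SKILL.md enters context only on first use.
        self.loaded = True
        return self._skill_md

pdf = Skill("pdf-tools", "Fill and merge PDF forms",
            "# SKILL.md\n...full instructions...")
context = [pdf.metadata()]                   # all the model sees up front
if "pdf" in "please fill this pdf form":     # model decides the skill is relevant
    context.append(pdf.load())               # now, and only now, pay the cost
```

The same pattern extends to Level 3: referenced files stay on disk until the model explicitly reads them.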
Subagents: Context isolation
When an agent needs to search through 50 files to find relevant code, those 50 tool calls can consume tens of thousands of tokens. With subagents, that work happens in a separate context window. The main agent receives only a summary (1 to 2k tokens), not the raw output of every search.
Main Agent Context:
├── Your conversation
├── Current task
└── Spawns subagent for research
↓
Subagent Context (Fresh window):
├── Task: "Find authentication logic"
├── 20 file searches
├── 15 file reads
└── Returns: 1.5K token summary
↓
Main Agent receives ONLY the summary

This prevents context bloat from accumulating in long-running tasks.
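The isolation boundary can be sketched with simple token tallies; the numbers and the file path in the summary are illustrative:

```python
def run_subagent(task: str) -> tuple[str, int]:
    # All exploratory work accumulates in the subagent's own tally.
    subagent_tokens = 0
    subagent_tokens += 20 * 400   # 20 file searches
    subagent_tokens += 15 * 1200  # 15 file reads
    summary = f"Findings for '{task}': auth logic lives in middleware/auth.py"
    return summary, subagent_tokens  # only the summary crosses the boundary

main_context_tokens = 3_000                    # conversation + current task
summary, burned = run_subagent("Find authentication logic")
main_context_tokens += len(summary) // 4       # rough chars-to-tokens estimate
# `burned` (tens of thousands of tokens) never touches the main window.
```

The main context grows by a summary's worth of tokens, while the expensive search-and-read churn is discarded with the subagent's window.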
The takeaway
The context window isn't just a limit to work around. It's a resource to engineer. The most effective approaches:
- Don't fight the architecture: Place critical information at the beginning and end. The lost-in-the-middle effect is real.
- Load on-demand: Whether through Tool Search, Skills, or RAG, only put relevant information in context when it's needed.
- Isolate exploratory work: Use subagents for research-heavy tasks so the main context stays clean.
- Compress at boundaries: When passing information between steps, summarize. A focused 300-token summary often outperforms a raw 100k-token dump.
The irony is that having a larger context window doesn't mean you should fill it. Anthropic's own research found that "a focused 300-token context often outperforms an unfocused 113,000-token context."
Context engineering (curating what goes into that window) is becoming as important as prompt engineering was in 2023.