
A Bigger Context Window Doesn't Mean Better Quality

Published
5 min read

When a company says their model has 1 million tokens of context, everyone assumes bigger is better. I thought the same thing. But after running LLMs in production for a while, I found out it's not that simple.

Having 1M tokens of context and using them well are two different things. And the difference can cost you thousands of dollars a month.


What context window means in practice

It's the model's working memory for a single call. Everything it can see at once: system prompt, conversation history, documents you sent, and space for the output.

200K tokens is roughly 150,000 words. In a normal conversation that's about 100 to 120 exchanges before you hit the limit. But if you're pasting code, logs, and docs, it drops fast. A 500-line Python file can eat 3,000 to 5,000 tokens by itself.
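Those rough numbers are easy to sanity-check. Here's a minimal sketch of the ~4-characters-per-token heuristic for English text and code (a real tokenizer gives exact counts; this is only for budgeting, and `estimate_tokens` is a made-up helper, not a library function):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough budget estimate: English text averages ~4 characters per token.
    A real tokenizer gives exact counts; this is back-of-envelope math."""
    return int(len(text) / chars_per_token)

# A ~500-line Python file lands in the 3K-5K token range mentioned above.
source = "x = compute(y)  # typical line\n" * 500
print(estimate_tokens(source))  # → 3875
```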

The main models ship with very different numbers here, ranging from roughly 128K tokens up to 1M. But quality doesn't grow with context size. Not the way you'd expect.


Lost in the Middle

There's a known problem in LLM research called "Lost in the Middle". The original paper by Liu et al. (2023) showed that models pay more attention to the beginning and end of the context. Information in the middle gets ignored more often.

If you send 50 documents and ask a question answered in document 27, the model is more likely to miss it than if the answer was in document 1 or document 50.

This isn't a bug. The model learns during training that the start (system prompt, first instructions) and the end (latest question) matter most. This pattern holds regardless of window size.

Does 1M tokens fix this?

No. I expected it would, but it doesn't.

Think of it like a bigger warehouse with the same bad organization. More space, same problem finding things. With 1M tokens the middle is just a bigger middle where things get lost.

What actually helps are specific training techniques and better attention architectures. Qwen3.5 uses an architecture called Gated DeltaNet (Gated Delta Networks), which handles this better than standard attention. Each new model generation improves, but none of them fully solves it.

The best fix is still on your side: put the important stuff at the beginning and end of your prompt. Anthropic's own prompt engineering docs cover some of these patterns.
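One way to apply that fix mechanically, if your chunks already carry relevance scores: sort them and fan the strongest out to the edges of the context, leaving the weakest in the middle. A sketch — `order_for_attention` is a hypothetical helper of mine, not something from Anthropic's docs:

```python
def order_for_attention(chunks: list[tuple[str, float]]) -> list[str]:
    """Place high-relevance chunks at the start and end of the context,
    where 'Lost in the Middle' shows attention is strongest, and bury
    the weakest chunks in the middle."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        # Alternate placements: best chunk to the front, runner-up to the back.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]

docs = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7), ("e", 0.1)]
print(order_for_attention(docs))  # → ['b', 'c', 'e', 'a', 'd']
```

The top chunk goes first, the runner-up goes last, and the low scorers end up in the middle, where a miss hurts least.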


Three scenarios

Short context, all relevant. You send 2K tokens of code and ask for a fix. The model sees everything clearly. Best quality you'll get.

Long context, information everywhere. You send 100K tokens of an entire codebase and ask about a bug. The model needs to connect a problem in file A with a dependency in file B and a config in file C. Attention gets thin. Quality drops.

Long context but organized. Same 100K tokens but with clear headers, relevant info at the top and bottom, and specific instructions. Quality goes back up.

The third scenario shows what matters. It's not the size, it's how you organize what goes in.
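The organized version can be as simple as a template: clear headers, the critical facts up top, the bulk material in between, and the task restated at the end. A sketch of that scenario — the section names are my own, not a standard:

```python
def build_prompt(task: str, key_facts: list[str], background: str) -> str:
    """Assemble a prompt so the pieces the model must not miss sit at the
    edges of the context, with the bulk material in the middle."""
    return "\n".join([
        "## Task",
        task,
        "## Key facts",
        *[f"- {fact}" for fact in key_facts],
        "## Background material",
        background,
        "## Reminder",
        f"Answer only the task above: {task}",
    ])

print(build_prompt("Find the bug in auth.py",
                   ["Tokens expire after 15 minutes"],
                   "<100K tokens of code and logs>"))
```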


Cost comparison with real numbers

Here's a concrete example. An internal AI assistant that answers questions using company docs.

Strategy A: send everything. You put 3 full documents in the context, 50K tokens total. With Claude Sonnet at $3 per million input tokens, each call costs $0.15. Takes 5 to 8 seconds. Quality is OK, but the model has to filter a lot of noise.

Strategy B: use RAG. You use a vector database (like PostgreSQL with pgvector) to pull only the 5 most relevant chunks. 5K tokens total. Same call costs $0.015. Takes 1 to 2 seconds. Quality is often better because the model only gets the useful parts.
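The heart of Strategy B is just nearest-neighbor search over embeddings. In production, pgvector does this ranking in SQL (`ORDER BY embedding <=> query` for cosine distance); here's a toy in-memory version, assuming you already have embedding vectors from some model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k_chunks(query_vec, chunks, k=5):
    """chunks: (text, embedding) pairs. Return the k texts closest to the
    query -- the only part of the corpus that reaches the model's context."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

chunks = [("irrelevant", [0.0, 1.0]), ("exact", [1.0, 0.0]), ("close", [0.9, 0.1])]
print(top_k_chunks([1.0, 0.0], chunks, k=2))  # → ['exact', 'close']
```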

Strategy B costs 10x less, responds 3 to 4x faster, and usually gives better answers.

Scale it to 1,000 calls per day:

  • Strategy A: $150/day, $4,500/month

  • Strategy B: $15/day, $450/month

Same quality or better. One tenth of the cost.
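The arithmetic behind those monthly figures, input tokens only at Sonnet-class pricing (output tokens and prompt caching would shift the exact numbers):

```python
def monthly_cost(tokens_per_call: int, calls_per_day: int,
                 price_per_mtok: float = 3.0, days: int = 30) -> float:
    """Input-token cost only, at $3 per million tokens."""
    return tokens_per_call / 1_000_000 * price_per_mtok * calls_per_day * days

print(round(monthly_cost(50_000, 1_000), 2))  # Strategy A
print(round(monthly_cost(5_000, 1_000), 2))   # Strategy B
```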


The practical rule

Big context is insurance, not strategy. You want the ability to process 200K tokens when you need it. A full codebase review, a long regulatory document. But for normal operations, 5 to 15K tokens of well-picked information will give you better results, faster, and cheaper.

This is also why a Qwen3.5-9B with 262K context running locally can be really useful for focused tasks. If you send it clean, relevant context, the quality gap between it and a top model gets much smaller. Good context selection levels the field.

In the next article I'll connect all of this to structured prompts and model routing, and show how they save money in practice.


Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.