A Bigger Context Window Doesn't Mean Better Quality
When a company says their model has 1 million tokens of context, everyone assumes bigger is better. I thought the same thing. But after running LLMs in production for a while, I found out it's not that simple.
Having 1M tokens of context and using them well are two different things. And the difference can cost you thousands of dollars a month.
What context window means in practice
It's the model's working memory for a single call. Everything it can see at once: system prompt, conversation history, documents you sent, and space for the output.
200K tokens is roughly 150,000 words. In a normal conversation that's about 100 to 120 exchanges before you hit the limit. But if you're pasting code, logs, and docs, it drops fast. A 500-line Python file can eat 3,000 to 5,000 tokens by itself.
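A rough budgeting helper makes those numbers concrete. This is only a sketch: the 4-characters-per-token heuristic and the function names are mine, not a real tokenizer (use tiktoken or your vendor's token-count endpoint for exact figures):

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose.
    # Real tokenizers give exact counts; this is only for quick budgeting.
    return max(1, len(text) // 4)

def fits_in_context(text: str, window: int = 200_000,
                    output_reserve: int = 4_000) -> bool:
    # The window must hold the prompt AND leave room for the answer.
    return estimate_tokens(text) + output_reserve <= window
```

By this estimate, a 500-line file at ~80 characters per line lands right in that 3,000 to 5,000 token range.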
Here's how the main models compare:
Claude Opus 4.6: 1M tokens
Qwen3.5-397B: 262K, can extend to 1M
Qwen3.5-9B: 262K
MiniMax M2.5: around 200K
Different numbers. But quality doesn't grow with context size. Not the way you'd expect.
Lost in the Middle
There's a known problem in LLM research called "Lost in the Middle". The original paper by Liu et al. (2023) showed that models pay more attention to the beginning and end of the context. Information in the middle gets ignored more often.
If you send 50 documents and ask a question answered in document 27, the model is more likely to miss it than if the answer was in document 1 or document 50.
This isn't a bug. The model learns during training that the start (system prompt, first instructions) and the end (latest question) matter most. This pattern holds regardless of window size.
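You can probe this effect against your own model with a needle-in-a-haystack prompt. A minimal sketch (the helper is hypothetical; feed the result to whatever API you use and score the answers):

```python
def build_needle_prompt(filler_docs: list[str], needle: str,
                        position: int, question: str) -> str:
    # Insert the document containing the answer at a chosen index,
    # then ask about it. Sweep `position` from first to last and
    # measure accuracy: the dip in the middle is "Lost in the Middle".
    docs = list(filler_docs)
    docs.insert(position, needle)
    numbered = "\n\n".join(f"Document {i + 1}:\n{d}"
                           for i, d in enumerate(docs))
    return f"{numbered}\n\nQuestion: {question}"
```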
Does 1M tokens fix this?
No. I expected it would, but it doesn't.
Think of it like a bigger warehouse with the same bad organization. More space, same problem finding things. With 1M tokens the middle is just a bigger middle where things get lost.
What actually helps are specific training techniques and better attention architectures. Qwen3.5 uses something called Gated Delta Networks, which handles this better than standard attention. Each new model generation improves, but none of them fully solves it.
The best fix is still on your side: put the important stuff at the beginning and end of your prompt. Anthropic's own prompt engineering docs cover some of these patterns.
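In practice that means assembling the prompt yourself so the edges carry the signal. One way to do it, as a sketch (the structure and function name are mine, not a pattern copied from Anthropic's docs):

```python
def assemble_prompt(instructions: str, documents: list[str],
                    question: str) -> str:
    # Instructions up front, the question (plus a restated instruction)
    # at the end, and the bulk documents in the middle, where the
    # attention dip hurts least if they only carry supporting detail.
    body = "\n\n".join(documents)
    return (
        f"{instructions}\n\n"
        f"{body}\n\n"
        f"Reminder: {instructions}\n"
        f"Question: {question}"
    )
```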
Three scenarios
Short context, all relevant. You send 2K tokens of code and ask for a fix. The model sees everything clearly. Best quality you'll get.
Long context, information everywhere. You send 100K tokens of an entire codebase and ask about a bug. The model needs to connect a problem in file A with a dependency in file B and a config in file C. Attention gets thin. Quality drops.
Long context but organized. Same 100K tokens but with clear headers, relevant info at the top and bottom, and specific instructions. Quality goes back up.
The third scenario shows what matters. It's not the size, it's how you organize what goes in.
Cost comparison with real numbers
Here's a concrete example. An internal AI assistant that answers questions using company docs.
Strategy A: send everything. You put 3 full documents in the context, 50K tokens total. With Claude Sonnet at $3 per million input tokens, each call costs $0.15. Takes 5 to 8 seconds. Quality is OK but the model has to filter a lot of noise.
Strategy B: use RAG. You use a vector database (like PostgreSQL with pgvector) to pull only the 5 most relevant chunks. 5K tokens total. Same call costs $0.015. Takes 1 to 2 seconds. Quality is often better because the model only gets the useful parts.
Strategy B costs 10x less, responds 3 to 4x faster, and usually gives better answers.
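Conceptually, the retrieval step in Strategy B is just a top-k similarity search. In production, pgvector does this server-side in SQL (for example with its cosine distance operator `<=>` in an ORDER BY ... LIMIT query), but a toy in-memory version shows the idea:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_chunks(query_vec: list[float],
                 chunks: list[tuple[str, list[float]]],
                 k: int = 5) -> list[str]:
    # chunks: (text, embedding) pairs. Return the k most similar texts;
    # only these go into the prompt, not the whole document set.
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```

The embeddings themselves come from an embedding model; this sketch assumes they already exist.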
Scale it to 1,000 calls per day:
Strategy A: $150/day, $4,500/month
Strategy B: $15/day, $450/month
Same quality or better. One tenth of the cost.
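The arithmetic behind those totals, as a quick sketch (the helper is mine; it counts input tokens only, and output tokens are billed separately on top):

```python
def monthly_input_cost(tokens_per_call: int, calls_per_day: int,
                       price_per_mtok: float = 3.0, days: int = 30) -> float:
    # Cost = tokens per call x price per million x calls per day x days.
    return tokens_per_call * price_per_mtok * calls_per_day * days / 1_000_000

# Strategy A: 50K tokens/call -> monthly_input_cost(50_000, 1_000) == 4500.0
# Strategy B:  5K tokens/call -> monthly_input_cost(5_000, 1_000) == 450.0
```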
The practical rule
Big context is insurance, not strategy. You want the ability to process 200K tokens when you need it. A full codebase review, a long regulatory document. But for normal operations, 5 to 15K tokens of well-picked information will give you better results, faster, and cheaper.
This is also why a Qwen3.5-9B with 262K context running locally can be really useful for focused tasks. If you send it clean, relevant context, the quality gap between it and a top model gets much smaller. Good context selection levels the field.
In the next article I'll connect all of this to structured prompts and model routing, and show how they save money in practice.
Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.
