<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Guil Silva — AI Security Infrastructure]]></title><description><![CDATA[Guil Silva — AI Security Infrastructure]]></description><link>https://guilsilva.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>Guil Silva — AI Security Infrastructure</title><link>https://guilsilva.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 03:43:42 GMT</lastBuildDate><atom:link href="https://guilsilva.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[A Bigger Context Window Doesn't Mean Better Quality]]></title><description><![CDATA[When a company says their model has 1 million tokens of context, everyone assumes bigger is better. I thought the same thing. But after running LLMs in production for a while, I found out it's not tha]]></description><link>https://guilsilva.dev/a-bigger-context-window-doesn-t-mean-better-quality</link><guid isPermaLink="true">https://guilsilva.dev/a-bigger-context-window-doesn-t-mean-better-quality</guid><dc:creator><![CDATA[guirgsilva]]></dc:creator><pubDate>Thu, 09 Apr 2026 23:00:00 GMT</pubDate><content:encoded><![CDATA[<p>When a company says their model has 1 million tokens of context, everyone assumes bigger is better. I thought the same thing. But after running LLMs in production for a while, I found out it's not that simple.</p>
<p>Having 1M tokens of context and using them well are two different things. And the difference can cost you thousands of dollars a month.</p>
<hr />
<h2>What a context window means in practice</h2>
<p>It's the model's working memory for a single call. Everything it can see at once: system prompt, conversation history, documents you sent, and space for the output.</p>
<p>200K tokens is roughly 150,000 words. In a normal conversation that's about 100 to 120 exchanges before you hit the limit. But if you're pasting code, logs, and docs, it drops fast. A 500-line Python file can eat 3,000 to 5,000 tokens by itself.</p>
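<p>The 500-line estimate comes from the common rule of thumb of roughly 4 characters per token for English text and code. A quick way to ballpark it (the snippet below only illustrates the heuristic, it's not a real tokenizer):</p>
<pre><code class="language-bash"># Rough token estimate: ~4 characters per token (heuristic, not exact)
text='def handler(event):
    return {"ok": true}'
echo "~$(( ${#text} / 4 )) tokens"   # prints "~10 tokens" for this 43-char snippet
</code></pre>
<p>For a real count, use the tokenizer of the model you're calling; the 4-chars rule is only good for order-of-magnitude planning.</p>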
<p>Here's how the main models compare:</p>
<ul>
<li><p><a href="https://docs.anthropic.com/en/docs/about-claude/models">Claude Opus 4.6</a>: 1M tokens</p>
</li>
<li><p><a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B">Qwen3.5-397B</a>: 262K, can extend to 1M</p>
</li>
<li><p><a href="https://ollama.com/library/qwen3.5">Qwen3.5-9B</a>: 262K</p>
</li>
<li><p><a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">MiniMax M2.5</a>: around 200K</p>
</li>
</ul>
<p>Different numbers. But quality doesn't grow with context size. Not the way you'd expect.</p>
<hr />
<h2>Lost in the Middle</h2>
<p>There's a known problem in LLM research called "Lost in the Middle". The original paper by <a href="https://arxiv.org/abs/2307.03172">Liu et al. (2023)</a> showed that models pay more attention to the beginning and end of the context. Information in the middle gets ignored more often.</p>
<p>If you send 50 documents and ask a question answered in document 27, the model is more likely to miss it than if the answer was in document 1 or document 50.</p>
<p>This isn't a bug. The model learns during training that the start (system prompt, first instructions) and the end (latest question) matter most. This pattern holds regardless of window size.</p>
<h3>Does 1M tokens fix this?</h3>
<p>No. I expected it would, but it doesn't.</p>
<p>Think of it like a bigger warehouse with the same bad organization. More space, same problem finding things. With 1M tokens the middle is just a bigger middle where things get lost.</p>
<p>What actually helps are specific training techniques and better attention architectures. Qwen3.5 uses something called Gated Delta Networks, which handles this better than standard attention. Each new model generation improves, but none of them fully solve it.</p>
<p>The best fix is still on your side: put the important stuff at the beginning and end of your prompt. Anthropic's own <a href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering">prompt engineering docs</a> cover some of these patterns.</p>
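<p>In practice that means sandwiching the bulk material between your instructions. A minimal sketch (the task text and file name are made up for illustration):</p>
<pre><code class="language-bash"># State the task first, dump bulk context in the middle, repeat the ask at the end
prompt="Task: find the bug in the retry logic of billing.py.

[bulk context: docs, logs, related files pasted here]

Reminder: focus only on the retry logic in billing.py."
echo "$prompt" | head -1   # the instruction the model sees first
echo "$prompt" | tail -1   # ...and the one it sees last
</code></pre>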
<hr />
<h2>Three scenarios</h2>
<p><strong>Short context, all relevant.</strong> You send 2K tokens of code and ask for a fix. The model sees everything clearly. Best quality you'll get.</p>
<p><strong>Long context, information everywhere.</strong> You send 100K tokens of an entire codebase and ask about a bug. The model needs to connect a problem in file A with a dependency in file B and a config in file C. Attention gets thin. Quality drops.</p>
<p><strong>Long context but organized.</strong> Same 100K tokens but with clear headers, relevant info at the top and bottom, and specific instructions. Quality goes back up.</p>
<p>The third scenario shows what matters. It's not the size, it's how you organize what goes in.</p>
<hr />
<h2>Cost comparison with real numbers</h2>
<p>Here's a concrete example. An internal AI assistant that answers questions using company docs.</p>
<p><strong>Strategy A: send everything.</strong> You put 3 full documents in the context, 50K tokens total. With Claude Sonnet at $3/M input tokens (<a href="https://docs.anthropic.com/en/docs/about-claude/pricing">pricing</a>), each call costs $0.15. Takes 5 to 8 seconds. Quality is ok but the model has to filter a lot of noise.</p>
<p><strong>Strategy B: use RAG.</strong> You use a vector database (like <a href="https://github.com/pgvector/pgvector">PostgreSQL with pgvector</a>) to pull only the 5 most relevant chunks. 5K tokens total. Same call costs $0.015. Takes 1 to 2 seconds. Quality is often better because the model only gets the useful parts.</p>
<p>Strategy B costs 10x less, responds 3 to 4x faster, and usually gives better answers.</p>
<p>Scale it to 1,000 calls per day:</p>
<ul>
<li><p>Strategy A: $150/day, $4,500/month</p>
</li>
<li><p>Strategy B: $15/day, $450/month</p>
</li>
</ul>
<p>Same quality or better. One tenth of the cost.</p>
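<p>The arithmetic behind those numbers, using the $3 per 1M input tokens price from above (output tokens are left out to keep the sketch simple):</p>
<pre><code class="language-bash">calls=1000         # calls per day
price=3            # USD per 1M input tokens (Claude Sonnet, from the pricing page)
tokens_a=50000     # Strategy A: three full documents per call
tokens_b=5000      # Strategy B: top-5 RAG chunks per call

echo "A: \$$(( calls * tokens_a * price / 1000000 ))/day"   # A: $150/day
echo "B: \$$(( calls * tokens_b * price / 1000000 ))/day"   # B: $15/day
</code></pre>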
<hr />
<h2>The practical rule</h2>
<p>Big context is insurance, not strategy. You want the ability to process 200K tokens when you need it. A full codebase review, a long regulatory document. But for normal operations, 5 to 15K tokens of well-picked information will give you better results, faster, and cheaper.</p>
<p>This is also why a Qwen3.5-9B with 262K context running locally can be really useful for focused tasks. If you send it clean, relevant context, the quality gap between it and a top model gets much smaller. Good context selection levels the field.</p>
<p>Next article I'll connect all of this to structured prompts and model routing, and show how they save money in practice.</p>
<hr />
<p><em>Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.</em></p>
]]></content:encoded></item><item><title><![CDATA[MiniMax M2.5 and Qwen3.5: The Open-Weight Models Worth Knowing About]]></title><description><![CDATA[In the last article I showed the SWE-Bench numbers. Open-weight models are basically tied with the proprietary ones now. Two models stood out to me: MiniMax M2.5 and Qwen3.5.
Here's what I found out a]]></description><link>https://guilsilva.dev/minimax-m2-5-and-qwen3-5-the-open-weight-models-worth-knowing-about</link><guid isPermaLink="true">https://guilsilva.dev/minimax-m2-5-and-qwen3-5-the-open-weight-models-worth-knowing-about</guid><dc:creator><![CDATA[guirgsilva]]></dc:creator><pubDate>Mon, 06 Apr 2026 23:00:00 GMT</pubDate><content:encoded><![CDATA[<p>In the <a href="https://guilsilva.dev/open-weight-vs-proprietary-swe-bench-2026">last article</a> I showed the SWE-Bench numbers. Open-weight models are basically tied with the proprietary ones now. Two models stood out to me: MiniMax M2.5 and Qwen3.5.</p>
<p>Here's what I found out about them.</p>
<hr />
<h2>MiniMax M2.5</h2>
<p><a href="https://www.minimax.io/">MiniMax</a> is a Chinese startup, founded in 2022. Not part of a big tech company like Alibaba or Google. They built models for text, audio, video, and music. You might know their <a href="https://hailuoai.video/">Hailuo Video</a> product.</p>
<p>The <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">M2.5</a> is a Mixture-of-Experts model. 230 billion parameters total but only 10 billion active per call. That's why the price works. You get the intelligence of a big model but pay for a small one. Input costs $0.30 per million tokens, output $1.20. You can check their <a href="https://www.minimax.io/models/text">API pricing here</a>.</p>
<h3>What's different about it</h3>
<p>Two things I noticed.</p>
<p>First, the training. M2.5 was trained with reinforcement learning in over 200,000 real environments. Not static data. Actual code repos, browsers, office apps. The model learned by doing things, not just reading about them. One behavior that came out of this is what MiniMax calls "Architect Mindset". Before writing any code, the model breaks down the problem and plans the structure. It thinks about design before it starts coding. This wasn't programmed in, it just appeared during training. You can read more about it in their <a href="https://www.minimax.io/news/minimax-m25">release blog</a>.</p>
<p>Second, speed. M2.5 finished the SWE-Bench evaluation 37% faster than the previous version and matched Claude Opus 4.6 in speed. They also offer two API versions, regular and high-speed: same quality, lower latency.</p>
<h3>The catch</h3>
<p>Independent tests from <a href="https://openhands.dev/blog/minimax-m2-5-open-weights-models-catch-up-to-claude">OpenHands</a> show M2.5 is strong at building apps from scratch and fixing issues. But it sometimes forgets to follow formatting instructions. In one test it pushed to the wrong branch. Good at coding, not as precise as Claude at following complex instructions.</p>
<p>The weights are on <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">HuggingFace</a>, MIT license. You can deploy it privately and fine-tune it.</p>
<hr />
<h2>Qwen3.5</h2>
<p><a href="https://qwen.readthedocs.io/">Qwen</a> is maintained by the Qwen team at <a href="https://www.alibabacloud.com/">Alibaba Cloud</a>. All models are <a href="https://github.com/QwenLM/Qwen3.5/blob/main/LICENSE">Apache 2.0 licensed</a>, which means free for commercial use. The family goes from 0.8B to 397B parameters. The full model list is on their <a href="https://github.com/QwenLM/Qwen3.5">GitHub repo</a>.</p>
<p>One thing to note: Lin Junyang, the technical lead behind Qwen3.5 development, left Alibaba in early 2026. Alibaba says they'll keep investing in open source, but it's worth watching.</p>
<h3>Model sizes</h3>
<p>The lineup covers different use cases:</p>
<ul>
<li><p><strong>0.8B and 2B</strong> for phones and edge devices</p>
</li>
<li><p><strong>4B</strong> for lightweight agents, 262K context</p>
</li>
<li><p><strong>9B</strong> is the sweet spot for laptops</p>
</li>
<li><p><strong>27B</strong> scored 0.724 on SWE-Bench, needs an A100 or a Mac with lots of RAM (<a href="https://huggingface.co/Qwen/Qwen3.5-27B">model card</a>)</p>
</li>
<li><p><strong>35B-A3B</strong> is MoE with only 3B active, very efficient (<a href="https://huggingface.co/Qwen/Qwen3.5-35B-A3B">model card</a>)</p>
</li>
<li><p><strong>397B-A17B</strong> is the flagship, 262K context that can extend to 1M</p>
</li>
</ul>
<p>The architecture is different from standard Transformers. Qwen3.5 combines Gated Delta Networks with MoE, which makes it faster and uses less memory. The models are also natively multimodal, trained on text, images, and video from the start.</p>
<h3>Running it on your laptop</h3>
<p><a href="https://ollama.com/">Ollama</a> is the easiest way. The <a href="https://ollama.com/library/qwen3.5">9B model</a> is the right size for consumer hardware:</p>
<pre><code class="language-bash"># Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download and run (~5.4GB)
ollama run qwen3.5:9b
</code></pre>
<p>Works on laptops with 16GB of RAM. The quantized version loses less than 1% quality compared to full precision.</p>
<p>Smaller options:</p>
<pre><code class="language-bash">ollama run qwen3.5:4b   # ~2.5GB
ollama run qwen3.5:2b   # ~1.5GB
</code></pre>
<p>Once running, Ollama serves a local HTTP API at <code>localhost:11434</code>. The native endpoint is <code>/api/chat</code>; there's also an OpenAI-compatible one under <code>/v1</code>:</p>
<pre><code class="language-bash">curl http://localhost:11434/api/chat \
  -d '{"model": "qwen3.5:9b", "stream": false, "messages": [{"role": "user", "content": "Hello!"}]}'
</code></pre>
<p>You can also use <a href="https://lmstudio.ai/">LM Studio</a> if you prefer a GUI, or <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a> for full control.</p>
<hr />
<h2>Why this matters</h2>
<p><strong>Cost.</strong> Coding tasks can go to M2.5 at $0.30/$1.20 instead of Claude at $5/$25. The price difference absorbs the small quality gap.</p>
<p><strong>Compliance.</strong> Open-weight models inside your VPC means data never leaves your infrastructure. If you deal with PCI, SOC2, or HIPAA, this solves a real problem.</p>
<p><strong>Speed of testing.</strong> Two commands and you have a working model locally. No account, no API key, no cost. You can test ideas before committing to anything.</p>
<p>Next article I'll talk about context windows and why bigger doesn't mean better. This one surprised me when I first looked into it.</p>
<hr />
<p><em>Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.</em></p>
]]></content:encoded></item><item><title><![CDATA[Open-Weight vs Proprietary: What SWE-Bench Verified Is Telling Us in 2026]]></title><description><![CDATA[I was looking at the SWE-Bench Verified leaderboard last week and the numbers surprised me. The gap between proprietary and open-weight models is almost gone. Not in some academic test. In actual bug ]]></description><link>https://guilsilva.dev/open-weight-vs-proprietary-what-swe-bench-verified-is-telling-us-in-2026</link><guid isPermaLink="true">https://guilsilva.dev/open-weight-vs-proprietary-what-swe-bench-verified-is-telling-us-in-2026</guid><dc:creator><![CDATA[guirgsilva]]></dc:creator><pubDate>Sat, 04 Apr 2026 19:31:16 GMT</pubDate><content:encoded><![CDATA[<p>I was looking at the <a href="https://llm-stats.com/benchmarks/swe-bench-verified">SWE-Bench Verified leaderboard</a> last week and the numbers surprised me. The gap between proprietary and open-weight models is almost gone. Not in some academic test. In actual bug fixing on real GitHub repos.</p>
<p>I want to break down what I saw and why it matters if you're running AI in production.</p>
<hr />
<h2>What SWE-Bench Verified Actually Tests</h2>
<p>500 real problems from GitHub issues. Django, scikit-learn, sympy. Real bugs, real repos. The model gets the source code and has to generate a patch. Then automated tests check if the fix works.</p>
<p>No multiple choice. No tricks. Either the patch passes or it doesn't. Score goes from 0 to 1. A score of 0.80 means 400 out of 500 problems solved.</p>
<p>You can read the original paper <a href="https://arxiv.org/abs/2310.06770">here</a>. This is the closest thing we have to measuring what tools like <a href="https://docs.anthropic.com/en/docs/claude-code">Claude Code</a> and Cursor actually do at work.</p>
<hr />
<h2>The Numbers</h2>
<p>Here's what the <a href="https://llm-stats.com/benchmarks/swe-bench-verified">leaderboard</a> looks like in April 2026:</p>
<ul>
<li><p><strong>Claude Opus 4.5</strong> (<a href="https://www.anthropic.com/">Anthropic</a>): 0.809, costs $5.00 / $25.00 per 1M tokens</p>
</li>
<li><p><strong>Claude Opus 4.6</strong> (<a href="https://www.anthropic.com/">Anthropic</a>): 0.808, costs $5.00 / $25.00</p>
</li>
<li><p><strong>Gemini 3.1 Pro</strong> (<a href="https://deepmind.google/technologies/gemini/">Google</a>): 0.806, costs $2.50 / $15.00</p>
</li>
<li><p><strong>MiniMax M2.5</strong> (<a href="https://www.minimax.io/">MiniMax</a>): 0.802, costs $0.30 / $1.20</p>
</li>
<li><p><strong>GPT-5.2</strong> (<a href="https://openai.com/">OpenAI</a>): 0.800, costs $1.75 / $14.00</p>
</li>
<li><p><strong>Claude Sonnet 4.6</strong> (<a href="https://www.anthropic.com/">Anthropic</a>): 0.796, costs $3.00 / $15.00</p>
</li>
<li><p><strong>Qwen3.6 Plus</strong> (<a href="https://qwen.readthedocs.io/">Alibaba Cloud</a>): 0.788</p>
</li>
<li><p><strong>Qwen3.5-27B</strong> (<a href="https://qwen.readthedocs.io/">Alibaba Cloud</a>): 0.724, runs locally</p>
</li>
</ul>
<p>Average across all 80 models is 0.627. The top ones are well above that. But look at the gap between first and fourth place.</p>
<hr />
<h2>0.7 Points and 17x Price Difference</h2>
<p>Claude Opus 4.5 at 80.9%. MiniMax M2.5 at 80.2%. That's 0.7 points. Almost nothing.</p>
<p>But the price? 17x cheaper on input. 21x cheaper on output. If you're making thousands of API calls per day, that's the difference between $4,500 and $450 per month. Same ballpark quality.</p>
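<p>Those multiples are just the listed prices divided out, rounded to whole numbers:</p>
<pre><code class="language-bash"># Opus 4.5: $5.00 in / $25.00 out; M2.5: $0.30 in / $1.20 out
echo "input:  $(echo "scale=1; 5.00/0.30" | bc)x"    # input:  16.6x
echo "output: $(echo "scale=1; 25.00/1.20" | bc)x"   # output: 20.8x
</code></pre>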
<p>And then there's <a href="https://huggingface.co/Qwen/Qwen3.5-27B">Qwen3.5-27B</a> at 0.724. A 27 billion parameter model solving 72% of real software problems. Running on a laptop. Two years ago this didn't exist.</p>
<hr />
<h2>One Thing to Keep in Mind</h2>
<p>All 80 results on the leaderboard are self-reported. The companies that built the models published their own scores. Nobody verified them independently. Each company can use different scaffolding and agent frameworks, so the comparison isn't perfectly fair.</p>
<p>Also SWE-Bench is Python only. Doesn't test Terraform, CloudFormation, PowerShell, Go, or anything in the infrastructure world.</p>
<p>But it's still the best proxy we have for real coding ability. And the numbers are hard to ignore.</p>
<hr />
<h2>What I Take From This</h2>
<p>The time when proprietary models had a clear edge in coding is over. The edge is tiny now, and the cost difference is huge.</p>
<p>For anyone building AI systems in production, the question isn't "which is the best model" anymore. It's about using different models for different tasks and optimizing cost without losing quality.</p>
<p>In the next article I'll go deeper into the two open-weight models that caught my attention: <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">MiniMax M2.5</a> and <a href="https://github.com/QwenLM/Qwen3.5">Qwen3.5</a>.</p>
<hr />
<p><em>Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.</em></p>
]]></content:encoded></item></channel></rss>