Open-Weight vs Proprietary: What SWE-Bench Verified Is Telling Us in 2026
I was looking at the SWE-Bench Verified leaderboard last week and the numbers surprised me. The gap between proprietary and open-weight models is almost gone. Not in some academic test. In actual bug fixing on real GitHub repos.
I want to break down what I saw and why it matters if you're running AI in production.
What SWE-Bench Verified Actually Tests
500 human-validated problems drawn from real GitHub issues in repos like Django, scikit-learn, and SymPy. Real bugs, real codebases. The model gets the source code and has to generate a patch. Then the repo's own automated tests check whether the fix works.
No multiple choice. No tricks. Either the patch passes or it doesn't. The score runs from 0 to 1: a score of 0.80 means 400 of the 500 problems solved.
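The scoring really is that simple: a pass rate. A minimal sketch, with a hypothetical resolved count:

```python
# SWE-Bench Verified scoring is a plain pass rate: resolved / total.
# The resolved count below is hypothetical, for illustration only.
total_problems = 500
resolved = 400  # a model that fixes 400 of the 500 issues

score = resolved / total_problems
print(f"{score:.2f}")  # 0.80
```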
The original SWE-Bench paper lays out the methodology in detail. This is the closest thing we have to measuring what tools like Claude Code and Cursor actually do at work.
The Numbers
Here's what the leaderboard looks like in April 2026:
Claude Opus 4.5 (Anthropic): 0.809, costs $5.00 / $25.00 per 1M tokens (input / output)
Claude Opus 4.6 (Anthropic): 0.808, costs $5.00 / $25.00
Gemini 3.1 Pro (Google): 0.806, costs $2.50 / $15.00
MiniMax M2.5 (MiniMax): 0.802, costs $0.30 / $1.20
GPT-5.2 (OpenAI): 0.800, costs $1.75 / $14.00
Claude Sonnet 4.6 (Anthropic): 0.796, costs $3.00 / $15.00
Qwen3.6 Plus (Alibaba Cloud): 0.788
Qwen3.5-27B (Alibaba Cloud): 0.724, runs locally
Average across all 80 models is 0.627. The top ones are well above that. But look at the gap between first and fourth place.
0.7 Points and 17x Price Difference
Claude Opus 4.5 at 80.9%. MiniMax M2.5 at 80.2%. That's 0.7 points. Almost nothing.
But the price? 17x cheaper on input. 21x cheaper on output. If you're pushing hundreds of millions of tokens per month through these APIs, that's roughly the difference between $4,500 and under $300 a month. Same ballpark quality.
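A quick back-of-the-envelope sketch of that cost gap, using the leaderboard prices above. The token volumes are assumptions for illustration, not measured usage:

```python
# Rough monthly API cost comparison between two models.
# Prices are per 1M tokens, taken from the leaderboard above.
PRICES = {
    "claude-opus-4.5": (5.00, 25.00),  # (input, output) in $/1M tokens
    "minimax-m2.5": (0.30, 1.20),
}

def monthly_cost(model: str, input_tokens_m: float, output_tokens_m: float) -> float:
    """Cost in dollars for a month's traffic, in millions of tokens."""
    in_price, out_price = PRICES[model]
    return input_tokens_m * in_price + output_tokens_m * out_price

# Assumed workload: 800M input and 20M output tokens per month.
for model in PRICES:
    print(model, round(monthly_cost(model, 800, 20), 2))
```

With that assumed workload, the frontier model lands around $4,500 a month and the open-weight one in the mid-$200s.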
And then there's Qwen3.5-27B at 0.724. A 27 billion parameter model solving 72% of real software problems. Running on a laptop. Two years ago this didn't exist.
One Thing to Keep in Mind
All 80 results on the leaderboard are self-reported. The companies that built the models published their own scores. Nobody verified them independently. Each company can use different scaffolding and agent frameworks, so the comparison isn't perfectly fair.
Also, SWE-Bench is Python-only. It doesn't test Terraform, CloudFormation, PowerShell, Go, or anything else in the infrastructure world.
But it's still the best proxy we have for real coding ability. And the numbers are hard to ignore.
What I Take From This
The time when proprietary models had a clear edge in coding is over. The edge is tiny now, and the cost difference is huge.
For anyone building AI systems in production, the question is no longer "which model is best". It's which model for which task, and how to cut cost without losing quality.
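That routing idea can be sketched in a few lines. The model names, complexity labels, and fallback choice here are my own illustrative assumptions, not a recommendation:

```python
# Minimal sketch of cost-aware model routing. The tiers and the
# default fallback are hypothetical choices for illustration.
def pick_model(task_complexity: str) -> str:
    routes = {
        "trivial": "qwen3.5-27b",     # local model, no API cost
        "normal": "minimax-m2.5",     # cheap API, near-frontier score
        "hard": "claude-opus-4.5",    # frontier model for the hard tail
    }
    # Unknown labels fall back to the cheap API tier.
    return routes.get(task_complexity, "minimax-m2.5")

print(pick_model("trivial"))  # qwen3.5-27b
print(pick_model("unknown"))  # minimax-m2.5
```

In practice the complexity signal might come from issue labels, file counts in the diff, or a cheap classifier pass; the point is that the routing logic itself is trivial once you accept that the quality gap is small.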
In the next article I'll go deeper into the two open-weight models that caught my attention: MiniMax M2.5 and Qwen3.5.
Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.
