
CRIT Framework and Model Routing: Spending Less on AI Without Losing Quality


In the previous articles I covered the SWE-Bench numbers, the open-weight models worth knowing, and why bigger context doesn't mean better output. Now I want to connect everything with two things I use in production: the CRIT framework for prompts and model routing for cost.

The idea is simple. What matters now isn't which model you pick. It's how you use it.


The CRIT Framework

CRIT stands for Context, Role, Instructions, Task. It's a way to structure prompts so the model knows exactly what you want. If you want to learn more about prompt structuring, Anthropic's prompt engineering guide is a good starting point.

  • Context is the environment. Stack, constraints, background info.

  • Role is the perspective you want the model to take.

  • Instructions are the rules. What to do, what not to do.

  • Task is the actual output you need.
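As a rough illustration, the four sections can be assembled programmatically. A minimal sketch — the helper and its parameter names are mine, not part of any standard tooling:

```python
# Minimal sketch of assembling a CRIT-structured prompt.
# The section order follows the framework; the helper itself is illustrative.

def build_crit_prompt(context: str, role: str, instructions: list[str], task: str) -> str:
    """Join the four CRIT sections into one prompt string."""
    rules = "\n".join(f"- {rule}" for rule in instructions)
    return (
        f"## Context\n{context}\n\n"
        f"## Role\n{role}\n\n"
        f"## Instructions\n{rules}\n\n"
        f"## Task\n{task}"
    )

prompt = build_crit_prompt(
    context="AWS Lambda processing SQS messages, calling Bedrock API.",
    role="Senior DevOps Engineer with AWS serverless experience.",
    instructions=["Don't expose secrets", "Include a rollback strategy"],
    task="Find the bug causing timeouts and write the fix.",
)
```

The point isn't the helper; it's that once the sections are explicit, you can't accidentally smear context into the instructions.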

Why it works

When you send something like "look at this code, there's a bug, help me, it's on ECS, needs PCI compliance, uses Terraform...", the model spends time figuring out what's context, what's a rule, and what's the actual task. That's wasted attention.

With CRIT each part is clear. The model doesn't have to guess. It goes straight to the work.

It also saves money

This is the part people miss. CRIT isn't just about better prompts. When you're forced to split things into sections, you cut the repetition. Information that would show up three times in a messy prompt shows up once, in the right place.

Fewer tokens in, lower cost. Less noise, better answers. Organized input, more consistent output.

A good CRIT prompt on a $0.30/M model can give you results close to a messy prompt on a $5/M model. The structure matters more than the price tag.

Example

Without CRIT:

Look at this code, there's a bug causing timeouts in the Lambda.
The function processes SQS messages and calls Bedrock.
Can't expose secrets and needs rollback. Production environment.
[500 lines of code]

With CRIT:

## Context
AWS Lambda processing SQS messages, calling Bedrock API.
Environment: production, PCI-compliant.
Stack: Python 3.12, boto3, Aurora PostgreSQL.

## Role
Senior DevOps Engineer with AWS serverless experience.

## Instructions
- Don't expose secrets or credentials in the code
- Include a rollback strategy
- Keep compatibility with the existing handler
- Explain the root cause before the fix

## Task
Find the bug causing timeouts and write the fix.

[500 lines of code]

The second one is a few tokens longer. But the response is better and more consistent. The model doesn't have to figure out what you want.


Model Routing

With open-weight models performing this well, using one model for everything doesn't make sense anymore. The smart move is routing each task to the model with the best cost-to-quality ratio.

How I think about it

Tier 1, simple tasks. Formatting, summaries, data extraction. Send these to Qwen3.5-9B locally or MiniMax M2.5 at $0.30/$1.20 per million input/output tokens. If it's running locally the cost is basically zero.

Tier 2, coding tasks. Bug fixes, feature implementation, code review. MiniMax M2.5 at $0.30/$1.20 handles these well. 80.2% on SWE-Bench Verified, close to the top.

Tier 3, hard tasks. Multi-step reasoning, system architecture, things that need the model to follow complex instructions exactly. Claude Sonnet or Opus at $3-5/$15-25. Best instruction following, most reliable on long tasks.
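In its simplest form the router is a lookup table. A minimal sketch — the tier labels and model names just mirror the breakdown above, and classifying the incoming task is the hard part in practice:

```python
# Minimal sketch of tier-based model routing.
# Model identifiers are illustrative, mirroring the tiers above.

ROUTES = {
    "simple": "qwen3.5-9b-local",  # Tier 1: formatting, summaries, extraction
    "coding": "minimax-m2.5",      # Tier 2: bug fixes, features, code review
    "hard": "claude-sonnet",       # Tier 3: multi-step reasoning, architecture
}

def route(task_tier: str) -> str:
    """Return the model for a tier; fall back to the strongest model."""
    return ROUTES.get(task_tier, "claude-sonnet")

model = route("coding")  # a Tier 2 task goes to the cheap coding model
```

Falling back to the strongest model on an unknown tier is a deliberate choice: misrouting a hard task to a cheap model costs more in rework than the API call saves.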

If you're using AWS, Bedrock makes routing between models easier since it supports Claude, Llama, and other providers through a single API.
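With Bedrock's Converse API, switching models is just a different modelId on the same call. A sketch of the request shape — I'm only building the kwargs here, not sending them, and the model ID is an example, not a recommendation:

```python
# Sketch: shape the kwargs for bedrock-runtime's converse() call.
# Builds the request locally; no AWS call is made. Model ID is an example.

def bedrock_request(model_id: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Return the kwargs a Bedrock Converse call would take."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }

request = bedrock_request(
    "anthropic.claude-3-5-sonnet-20240620-v1:0", "Summarize this deploy log."
)
# To send: boto3.client("bedrock-runtime").converse(**request)
```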

The math

1,000 calls per day:

  • 60% are Tier 1 (600 calls on local model = ~$0)

  • 30% are Tier 2 (300 calls on M2.5 = ~$2/day)

  • 10% are Tier 3 (100 calls on Claude = ~$15/day)

Total with routing: ~$17/day, $510/month.

Same 1,000 calls all on Claude Sonnet: ~$150/day, $4,500/month.

That's almost 90% less. And the quality is the same or better, because each model is doing what it's best at.
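The same arithmetic, written out, using the per-day estimates above:

```python
# Back-of-envelope routing cost, from the per-day estimates above.

tier_cost_per_day = {
    "tier1_local": 0.0,    # 600 calls on a local model
    "tier2_m2.5": 2.0,     # 300 calls on MiniMax M2.5
    "tier3_claude": 15.0,  # 100 calls on Claude
}

routed_daily = sum(tier_cost_per_day.values())       # 17.0
routed_monthly = routed_daily * 30                   # 510.0

all_claude_monthly = 150.0 * 30                      # 4500.0

savings = 1 - routed_monthly / all_claude_monthly    # ~0.89
```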


Self-hosting for compliance

If you work in a regulated environment (PCI, SOC2, HIPAA), open-weight models inside your VPC solve the data problem. Nothing leaves your infrastructure.

Qwen3.5-27B and MiniMax M2.5 (10B active parameters) can run on reasonable hardware. For production serving, vLLM and SGLang are the two main inference frameworks. For local testing, Qwen3.5-9B runs on a laptop with Ollama.

The setup that makes sense for regulated environments:

  • Tier 1 and 2 on a self-hosted model in the VPC. Zero data leakage.

  • Tier 3 on Claude via API with compliance contracts. Only for tasks that need it.
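The compliance split folds naturally into the routing table. A sketch — endpoint URLs and model names here are placeholders for whatever you actually run in the VPC:

```python
# Sketch: routing table with a data-residency flag per tier.
# Endpoints and model names are placeholders, not real services.

VPC_ROUTES = {
    "tier1": {"model": "qwen3.5-27b", "endpoint": "http://vllm.internal:8000", "leaves_vpc": False},
    "tier2": {"model": "minimax-m2.5", "endpoint": "http://vllm.internal:8000", "leaves_vpc": False},
    "tier3": {"model": "claude-sonnet", "endpoint": "bedrock", "leaves_vpc": True},
}

def pick(tier: str, sensitive: bool) -> dict:
    """Never send sensitive data to a route that leaves the VPC."""
    route = VPC_ROUTES[tier]
    if sensitive and route["leaves_vpc"]:
        route = VPC_ROUTES["tier2"]  # downgrade to the strongest in-VPC model
    return route
```

Making the residency rule part of the router, instead of a convention people have to remember, is what keeps the setup auditable.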


Putting it all together

This is what I took from looking at all of this:

Open-weight models are 0.7 points from the leader on real coding problems. Big context windows are good to have but they don't replace good context selection. Structured prompts make a bigger difference than expensive models. And routing tasks to the right model can cut costs by 90%.

The gap between expensive and cheap AI has never been smaller. The advantage now is in how you put the pieces together. RAG, structured prompts, routing. These are the things that actually move the needle.


Guilherme is a Senior Cloud/DevOps Engineer focused on AI infrastructure, building production pipelines in regulated environments.
