Every reasoning step inside an Agntic agent picks its own model — cheap
for trivial work, expensive for hard work, and your own GPU when the data
can't leave the building. Switching models is a config change, not a
refactor. That intention runs through every line of the design.
Used forQuery rewritingSource classificationAnswer gradingTrigger detection
Tier 02 · defaultBalanced
AClaude Sonnet 4.6claude-sonnet-4-6
Used forThe main reasoning loopTool-calling turnsRouter-picked defaultMost user-facing replies
Tier 03Powerful
AClaude Opus 4claude-opus-4
Used forMulti-step planningSubagent synthesisHigh-stakes reasoningWhen quality has to win
Per callEvery call inside the agent declares which tier it wants — config.configurable.modelTier — and the pool routes it. The agent doesn't know which model is bound to each tier. That's the point.
Three tiers, one shared interface. The agent talks to modelTier: 'fast' | 'balanced' | 'powerful'; the pool resolves the actual model behind each label. Swap the bindings; the agent never notices.
Most AI products hardcode one model and let cost run away with usage.
Switching models means a rewrite. We built the version where the agent
and the model are decoupled by design — every reasoning step asks for a
tier, the pool resolves the model, and changing what's behind a
tier is a constructor argument. Cloud, mixed, or fully local: same
agent, same prompts, same workflows.
01 · The tier abstractionThe agent asks for a tier, not a model
Three tiers sit in front of every model in the system: fast,
balanced, powerful. Every call inside
the agent declares which it wants via
config.configurable.modelTier. The pool — a Runnable that
implements the same shape as any single model — resolves the tier and
invokes the underlying call. Calls without a tier get the default
(balanced). The routing is set at the call site, by the step that knows
its own cost/quality tradeoff. The agent code doesn't care which actual
model is bound to each tier, and the prompts don't reference a model
name. That's the point.
Built-in routing6 distinct call-sites · 1 router
Query rewritingRAG-Fusion · 3 variants per turn
Doesn't need a frontier model — paraphrase is a fast-tier task.
Watches every turn — has to be cheap to live in the hot path.
Fast
The main reasoning loopagent-graph · per turn
Router-picked per turn. Tool-calling, response generation, the user-facing voice of the agent.
Balanced
Multi-step planningplanning-agent
A bad plan derails the whole turn. Quality has to win here.
Powerful
Subagent synthesiscombining results across parallel subagents
High-context reasoning over heterogeneous inputs. Powerful tier earns its rate.
Powerful
Every step picks its lane. A typical AI product runs every turn through the most capable model "to be safe." With tiered routing, the cheap model handles the parts that don't need cleverness — and the architecture makes that visible at every call-site.
02 · Observable costTracked, not estimated
Token usage and USD cost are recorded at every call, tagged with
the tier, the underlying model, the session, and a
cost context label — rewrite,
planning, agent_loop, grading,
synthesis. USD rates are baked in for
8 model families across Anthropic and OpenAI — Opus,
Sonnet, Haiku, GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini, GPT-4.1-nano.
Anthropic's server-side web_search tool fee
($10 per 1,000 requests) is tracked separately on top
of token cost. Streaming responses preserve usage tracking via a
snapshot/delta fallback when the SDK doesn't emit usage in the final
chunk. Clients see exactly what every conversation cost and where the
money went.
Built-in rate sheet8 model families · USD per 1M tokens
Modelinputoutput
claude-opus$15.00$75.00
claude-sonnet$3.00$15.00
claude-haiku$0.80$4.00
gpt-4.1$2.00$8.00
gpt-4o$2.50$10.00
gpt-4.1-mini$0.40$1.60
gpt-4o-mini$0.15$0.60
gpt-4.1-nano$0.10$0.40
Anthropic web_search tool
$10.00 / 1K requests
Cost is a fact, not an estimate. Per-call records carry tier, model, context label, tokens in/out, and dollars. The session ledger is the source of truth — and the tool fee for Anthropic's server-side web_search sits on its own line, on top of token spend.
03 · Swap the binding, not the agentModel changes are one line
Because the pool is a Runnable — the same composable
interface every other piece of the framework speaks — swapping the model
behind any tier is a configuration change. Want to move balanced from
Sonnet to a cheaper option for a particular client? Change the binding.
Want to run the powerful tier on Opus for one customer and GPT-4.1 for
another? Different bindings, same agent. Want to run sensitive workloads
entirely on the client's own hardware? Bind a local GGUF model — the
on-device runtime implements the same Runnable interface, so the exact
same agent code runs against it. No fork. No parallel codebase. No
"cloud version vs. on-prem version."
Cloud · AnthropicDefault
Fast→Claude HaikuAnthropic
Balanced→Claude Sonnet 4.6Anthropic
Powerful→Claude Opus 4Anthropic
All three tiers on Anthropic.
Mixed · cost-tunedPer-client
Fast→GPT-4.1 nanoOpenAI
Balanced→GPT-4.1 miniOpenAI
Powerful→Claude Sonnet 4.6Anthropic
Cheap tiers on OpenAI; powerful stays on Claude.
Local · on-premGGUF
Fast→Qwen3 1.7BLocal · GGUF Q6
Balanced→Llama 3.1 8BLocal · GGUF Q5
Powerful→Llama 3.1 70BLocal · GGUF Q4
Same agent. Same prompts. Data never leaves the box.
Three bindings, one agent. Same prompts, same tools, same workflows behind every column. The migration is the binding diagram — not a refactor. This is the part that costs other teams months when their LLM provider raises prices or a customer demands their data never leave their network.
04 · Native web access where it's saferCitations, not scrapers
When the agent needs to search the web, we use Anthropic's native
server-side web_search tool with citation tracking — no
scraping pipeline, no third-party search vendor, sources traceable back
to the exact model call that surfaced them. Buyers who care about data
lineage notice this. The same architecture means a deployment that
can't use any cloud tool can simply not bind one — the local pool runs
without it, and the agent degrades gracefully to vault-only retrieval.
05 · Why we built it this wayThe small piece that makes the promise deliverable
AI consulting is full of vendor-locked, expensive-by-default agent
platforms. We built the version that lets a client start on Anthropic
for speed, move workloads to OpenAI when a contract changes, and run
sensitive workloads on their own hardware — all from the same product.
The model pool is the small piece of infrastructure that makes that
promise actually deliverable.
Tiers, per-call routing, observable cost, Runnable interface, local
GGUF support, native web search. None of the pieces are exotic on their
own. The work, and the value, is in stacking them so that the model
becomes configuration — and the agent stops caring which one it's
talking to.
Worried about model lock-in or cost runaway?
We build AI systems on this pattern for clients who can't afford to be tied to a single provider — and who need on-prem and cloud paths from the same codebase.