Runnable Model Pool — tiered routing, observable cost, cloud or local

Tier 01 Fast

A Claude Haikuclaude-haiku-4-5

Used for Query rewriting Source classification Answer grading Trigger detection

Tier 02 · default Balanced

A Claude Sonnet 4.6claude-sonnet-4-6

Used for The main reasoning loop Tool-calling turns Router-picked default Most user-facing replies

Tier 03 Powerful

A Claude Opus 4claude-opus-4

Used for Multi-step planning Subagent synthesis High-stakes reasoning When quality has to win

Per call Every call inside the agent declares which tier it wants — config.configurable.modelTier — and the pool routes it. The agent doesn't know which model is bound to each tier. That's the point.

Three tiers, one shared interface. The agent talks to modelTier: 'fast' | 'balanced' | 'powerful'; the pool resolves the actual model behind each label. Swap the bindings; the agent never notices.

Most AI products hardcode one model and let cost run away with usage. Switching models means a rewrite. We built the version where the agent and the model are decoupled by design — every reasoning step asks for a tier, the pool resolves the model, and changing what's behind a tier is a constructor argument. Cloud, mixed, or fully local: same agent, same prompts, same workflows.

01 · The tier abstractionThe agent asks for a tier, not a model

Three tiers sit in front of every model in the system: fast, balanced, powerful. Every call inside the agent declares which it wants via config.configurable.modelTier. The pool — a Runnable that implements the same shape as any single model — resolves the tier and invokes the underlying call. Calls without a tier get the default (balanced). The routing is set at the call site, by the step that knows its own cost/quality tradeoff. The agent code doesn't care which actual model is bound to each tier, and the prompts don't reference a model name. That's the point.

Built-in routing 6 distinct call-sites · 1 router

Query rewritingRAG-Fusion · 3 variants per turn

Doesn't need a frontier model — paraphrase is a fast-tier task.

Fast

Source classificationcite-vs-no-cite, vault-vs-web

A simple per-result label. Fast-tier accuracy is sufficient.

Fast

Answer gradingSelf-RAG confidence tagging

Binary-ish judgments at scale. Fast tier so grading isn't its own cost line.

Fast

Trigger detectionmemory commitments, intent classification

Watches every turn — has to be cheap to live in the hot path.

Fast

The main reasoning loopagent-graph · per turn

Router-picked per turn. Tool-calling, response generation, the user-facing voice of the agent.

Balanced

Multi-step planningplanning-agent

A bad plan derails the whole turn. Quality has to win here.

Powerful

Subagent synthesiscombining results across parallel subagents

High-context reasoning over heterogeneous inputs. Powerful tier earns its rate.

Powerful

Every step picks its lane. A typical AI product runs every turn through the most capable model "to be safe." With tiered routing, the cheap model handles the parts that don't need cleverness — and the architecture makes that visible at every call-site.

02 · Observable costTracked, not estimated

Token usage and USD cost are recorded at every call, tagged with the tier, the underlying model, the session, and a cost context label — rewrite, planning, agent_loop, grading, synthesis. USD rates are baked in for 8 model families across Anthropic and OpenAI — Opus, Sonnet, Haiku, GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini, GPT-4.1-nano. Anthropic's server-side web_search tool fee ($10 per 1,000 requests) is tracked separately on top of token cost. Streaming responses preserve usage tracking via a snapshot/delta fallback when the SDK doesn't emit usage in the final chunk. Clients see exactly what every conversation cost and where the money went.

Session ledger sess 5d8e · 1 turn · 9 calls

rewriteRAG-Fusion · 3 variants claude-haiku-4-5 240 → 110 $0.0006

source_classifyper-result labels claude-haiku-4-5 320 → 60 $0.0005

planningmulti-step plan claude-opus-4 1,240 → 320 $0.0426

agent_looptool-calling turn claude-sonnet-4-6 3,820 → 420 $0.0178

agent_loopweb_search · 2 requests anthropic web_search tool fee $0.0200

synthesissubagent merge claude-opus-4 2,100 → 480 $0.0675

gradingSelf-RAG confidence claude-haiku-4-5 680 → 80 $0.0008

tokens 8,400 in · 1,470 out $0.1498

Built-in rate sheet 8 model families · USD per 1M tokens

Modelinputoutput

claude-opus$15.00$75.00

claude-sonnet$3.00$15.00

claude-haiku$0.80$4.00

gpt-4.1$2.00$8.00

gpt-4o$2.50$10.00

gpt-4.1-mini$0.40$1.60

gpt-4o-mini$0.15$0.60

gpt-4.1-nano$0.10$0.40

Anthropic web_search tool $10.00 / 1K requests

Cost is a fact, not an estimate. Per-call records carry tier, model, context label, tokens in/out, and dollars. The session ledger is the source of truth — and the tool fee for Anthropic's server-side web_search sits on its own line, on top of token spend.

03 · Swap the binding, not the agentModel changes are one line

Because the pool is a Runnable — the same composable interface every other piece of the framework speaks — swapping the model behind any tier is a configuration change. Want to move balanced from Sonnet to a cheaper option for a particular client? Change the binding. Want to run the powerful tier on Opus for one customer and GPT-4.1 for another? Different bindings, same agent. Want to run sensitive workloads entirely on the client's own hardware? Bind a local GGUF model — the on-device runtime implements the same Runnable interface, so the exact same agent code runs against it. No fork. No parallel codebase. No "cloud version vs. on-prem version."

Cloud · Anthropic Default

Fast → Claude HaikuAnthropic

Balanced → Claude Sonnet 4.6Anthropic

Powerful → Claude Opus 4Anthropic

All three tiers on Anthropic.

Mixed · cost-tuned Per-client

Fast → GPT-4.1 nanoOpenAI

Balanced → GPT-4.1 miniOpenAI

Powerful → Claude Sonnet 4.6Anthropic

Cheap tiers on OpenAI; powerful stays on Claude.

Local · on-prem GGUF

Fast → Qwen3 1.7BLocal · GGUF Q6

Balanced → Llama 3.1 8BLocal · GGUF Q5

Powerful → Llama 3.1 70BLocal · GGUF Q4

Same agent. Same prompts. Data never leaves the box.

Three bindings, one agent. Same prompts, same tools, same workflows behind every column. The migration is the binding diagram — not a refactor. This is the part that costs other teams months when their LLM provider raises prices or a customer demands their data never leave their network.

04 · Native web access where it's saferCitations, not scrapers

When the agent needs to search the web, we use Anthropic's native server-side web_search tool with citation tracking — no scraping pipeline, no third-party search vendor, sources traceable back to the exact model call that surfaced them. Buyers who care about data lineage notice this. The same architecture means a deployment that can't use any cloud tool can simply not bind one — the local pool runs without it, and the agent degrades gracefully to vault-only retrieval.

05 · Why we built it this wayThe small piece that makes the promise deliverable

AI consulting is full of vendor-locked, expensive-by-default agent platforms. We built the version that lets a client start on Anthropic for speed, move workloads to OpenAI when a contract changes, and run sensitive workloads on their own hardware — all from the same product. The model pool is the small piece of infrastructure that makes that promise actually deliverable.

Tiers, per-call routing, observable cost, Runnable interface, local GGUF support, native web search. None of the pieces are exotic on their own. The work, and the value, is in stacking them so that the model becomes configuration — and the agent stops caring which one it's talking to.

Worried about model lock-in or cost runaway?

We build AI systems on this pattern for clients who can't afford to be tied to a single provider — and who need on-prem and cloud paths from the same codebase.

Book a discovery call →