Our work · Runnable Model Pool

The model and the agent are decoupled by design.

Every reasoning step inside an Agntic agent picks its own model — cheap for trivial work, expensive for hard work, and your own GPU when the data can't leave the building. Switching models is a config change, not a refactor. That intention runs through every line of the design.

Technology showcase Agntic Consulting Platform architecture
Tier 01 Fast
A Claude Haikuclaude-haiku-4-5
Used for Query rewriting Source classification Answer grading Trigger detection
Tier 02 · default Balanced
A Claude Sonnet 4.6claude-sonnet-4-6
Used for The main reasoning loop Tool-calling turns Router-picked default Most user-facing replies
Tier 03 Powerful
A Claude Opus 4claude-opus-4
Used for Multi-step planning Subagent synthesis High-stakes reasoning When quality has to win
Per call Every call inside the agent declares which tier it wants — config.configurable.modelTier — and the pool routes it. The agent doesn't know which model is bound to each tier. That's the point.

Three tiers, one shared interface. The agent talks to modelTier: 'fast' | 'balanced' | 'powerful'; the pool resolves the actual model behind each label. Swap the bindings; the agent never notices.

Most AI products hardcode one model and let cost run away with usage. Switching models means a rewrite. We built the version where the agent and the model are decoupled by design — every reasoning step asks for a tier, the pool resolves the model, and changing what's behind a tier is a constructor argument. Cloud, mixed, or fully local: same agent, same prompts, same workflows.

01 · The tier abstractionThe agent asks for a tier, not a model

Three tiers sit in front of every model in the system: fast, balanced, powerful. Every call inside the agent declares which it wants via config.configurable.modelTier. The pool — a Runnable that implements the same shape as any single model — resolves the tier and invokes the underlying call. Calls without a tier get the default (balanced). The routing is set at the call site, by the step that knows its own cost/quality tradeoff. The agent code doesn't care which actual model is bound to each tier, and the prompts don't reference a model name. That's the point.

Built-in routing 6 distinct call-sites · 1 router
Query rewritingRAG-Fusion · 3 variants per turn
Doesn't need a frontier model — paraphrase is a fast-tier task.
Fast
Source classificationcite-vs-no-cite, vault-vs-web
A simple per-result label. Fast-tier accuracy is sufficient.
Fast
Answer gradingSelf-RAG confidence tagging
Binary-ish judgments at scale. Fast tier so grading isn't its own cost line.
Fast
Trigger detectionmemory commitments, intent classification
Watches every turn — has to be cheap to live in the hot path.
Fast
The main reasoning loopagent-graph · per turn
Router-picked per turn. Tool-calling, response generation, the user-facing voice of the agent.
Balanced
Multi-step planningplanning-agent
A bad plan derails the whole turn. Quality has to win here.
Powerful
Subagent synthesiscombining results across parallel subagents
High-context reasoning over heterogeneous inputs. Powerful tier earns its rate.
Powerful

Every step picks its lane. A typical AI product runs every turn through the most capable model "to be safe." With tiered routing, the cheap model handles the parts that don't need cleverness — and the architecture makes that visible at every call-site.

02 · Observable costTracked, not estimated

Token usage and USD cost are recorded at every call, tagged with the tier, the underlying model, the session, and a cost context label — rewrite, planning, agent_loop, grading, synthesis. USD rates are baked in for 8 model families across Anthropic and OpenAI — Opus, Sonnet, Haiku, GPT-4.1, GPT-4o, GPT-4o-mini, GPT-4.1-mini, GPT-4.1-nano. Anthropic's server-side web_search tool fee ($10 per 1,000 requests) is tracked separately on top of token cost. Streaming responses preserve usage tracking via a snapshot/delta fallback when the SDK doesn't emit usage in the final chunk. Clients see exactly what every conversation cost and where the money went.

Session ledger sess 5d8e · 1 turn · 9 calls
rewriteRAG-Fusion · 3 variants claude-haiku-4-5 240 → 110 $0.0006
source_classifyper-result labels claude-haiku-4-5 320 → 60 $0.0005
planningmulti-step plan claude-opus-4 1,240 → 320 $0.0426
agent_looptool-calling turn claude-sonnet-4-6 3,820 → 420 $0.0178
agent_loopweb_search · 2 requests anthropic web_search tool fee $0.0200
synthesissubagent merge claude-opus-4 2,100 → 480 $0.0675
gradingSelf-RAG confidence claude-haiku-4-5 680 → 80 $0.0008
tokens 8,400 in · 1,470 out $0.1498
Built-in rate sheet 8 model families · USD per 1M tokens
Modelinputoutput
claude-opus$15.00$75.00
claude-sonnet$3.00$15.00
claude-haiku$0.80$4.00
gpt-4.1$2.00$8.00
gpt-4o$2.50$10.00
gpt-4.1-mini$0.40$1.60
gpt-4o-mini$0.15$0.60
gpt-4.1-nano$0.10$0.40
Anthropic web_search tool $10.00 / 1K requests

Cost is a fact, not an estimate. Per-call records carry tier, model, context label, tokens in/out, and dollars. The session ledger is the source of truth — and the tool fee for Anthropic's server-side web_search sits on its own line, on top of token spend.

03 · Swap the binding, not the agentModel changes are one line

Because the pool is a Runnable — the same composable interface every other piece of the framework speaks — swapping the model behind any tier is a configuration change. Want to move balanced from Sonnet to a cheaper option for a particular client? Change the binding. Want to run the powerful tier on Opus for one customer and GPT-4.1 for another? Different bindings, same agent. Want to run sensitive workloads entirely on the client's own hardware? Bind a local GGUF model — the on-device runtime implements the same Runnable interface, so the exact same agent code runs against it. No fork. No parallel codebase. No "cloud version vs. on-prem version."

Cloud · Anthropic Default
Fast Claude HaikuAnthropic
Balanced Claude Sonnet 4.6Anthropic
Powerful Claude Opus 4Anthropic
All three tiers on Anthropic.
Mixed · cost-tuned Per-client
Fast GPT-4.1 nanoOpenAI
Balanced GPT-4.1 miniOpenAI
Powerful Claude Sonnet 4.6Anthropic
Cheap tiers on OpenAI; powerful stays on Claude.
Local · on-prem GGUF
Fast Qwen3 1.7BLocal · GGUF Q6
Balanced Llama 3.1 8BLocal · GGUF Q5
Powerful Llama 3.1 70BLocal · GGUF Q4
Same agent. Same prompts. Data never leaves the box.

Three bindings, one agent. Same prompts, same tools, same workflows behind every column. The migration is the binding diagram — not a refactor. This is the part that costs other teams months when their LLM provider raises prices or a customer demands their data never leave their network.

04 · Native web access where it's saferCitations, not scrapers

When the agent needs to search the web, we use Anthropic's native server-side web_search tool with citation tracking — no scraping pipeline, no third-party search vendor, sources traceable back to the exact model call that surfaced them. Buyers who care about data lineage notice this. The same architecture means a deployment that can't use any cloud tool can simply not bind one — the local pool runs without it, and the agent degrades gracefully to vault-only retrieval.

05 · Why we built it this wayThe small piece that makes the promise deliverable

AI consulting is full of vendor-locked, expensive-by-default agent platforms. We built the version that lets a client start on Anthropic for speed, move workloads to OpenAI when a contract changes, and run sensitive workloads on their own hardware — all from the same product. The model pool is the small piece of infrastructure that makes that promise actually deliverable.

Tiers, per-call routing, observable cost, Runnable interface, local GGUF support, native web search. None of the pieces are exotic on their own. The work, and the value, is in stacking them so that the model becomes configuration — and the agent stops caring which one it's talking to.

Worried about model lock-in or cost runaway?

We build AI systems on this pattern for clients who can't afford to be tied to a single provider — and who need on-prem and cloud paths from the same codebase.

Book a discovery call