Technical Report · Charles Crepps, Founder

Context and Executability:
Building the Layer Between
AI Models and Real Work

A technical report from Agntic on the two unsolved problems in applied AI, the systems we've built to address them, and where we believe the industry is headed.

Agntic LLC — Applied AI Research

I. Thirty Minutes on a Monday Morning

There's a task that lives on every knowledge worker's calendar. It's not the hard task — the one that demands creativity or judgment. It's the other one. The one that takes thirty minutes because you have to open four tabs, copy numbers from a spreadsheet, cross-reference a thread from last Tuesday, remember what someone said in a meeting you half-attended, and then write something that sounds like you were paying attention the whole time.

That was weekly reporting. Every Monday morning. Thirty minutes of context assembly disguised as work.

Agntic didn't begin with a business plan. It began with a person staring at that thirty minutes and thinking: this is a retrieval problem, not a writing problem. The writing was easy. Any language model could draft a summary. The hard part was getting the right information in front of the model so it had something worth summarizing.

The tools that were supposed to help

By late 2025, the market was drowning in AI-adjacent productivity tools. Obsidian for notes. Notion for project management. Claude for reasoning. MCP servers that were supposed to connect them all. Each one was impressive in isolation — polished, capable, built by smart people solving real problems.

None of them fit.

Not in a dramatic, broken way. In the quiet way that a key almost fits a lock. Notion couldn't talk to the LLM. The LLM couldn't read the vault. The MCP connectors were brittle enough that maintaining them became its own workstream. And every time a new tool entered the stack, the integration tax grew — more time managing the system than doing the work the system was supposed to accelerate.

It's a peculiar kind of frustration. You can see the potential. You know the model is capable. But the plumbing between your knowledge and the model's reasoning is so leaky that the output is never quite yours. It's generic. It's close but not grounded. It's a hallucination wearing a suit.

The vault experiment

So the tools came out. All of them. And a simpler question took their place: what if the model could just read the vault directly?

No SaaS middleware. No integration layer. Just an Obsidian vault full of the context that actually mattered — meeting notes, SOPs, project timelines, decision logs — and a local language model pointed straight at it.

The Monday morning report went from thirty minutes to five. Not because the model wrote faster. Because it finally knew something. It could cite last week's numbers because it had read the notes. It could reference the right project status because the status lived in the vault. The writing wasn't better — the context was.

The model had always been capable. What it lacked was context — and the ability to do something with it.

That five-minute Monday morning was the founding moment. Not because the time savings were dramatic, but because the shape of the problem suddenly became clear. Every failed SaaS integration, every brittle MCP connection, every hallucinated paragraph — they were all symptoms of the same two missing pieces.

Context and executability

Context. The model needs to know what your business knows. Not the internet's version of it. Yours. Your documents, your decisions, your history, your terminology.

Executability. The model needs to do more than talk. It needs to search, calculate, draft, edit, and produce artifacts — but never autonomously. The human stays in the loop. The model proposes. You decide.

There's an old idea in computing: garbage in, garbage out. We think about it differently. Relevance in, relevance out. The quality of what an AI system produces is a direct function of the quality of context it receives. A model with bad context will give you a confident, fluent, beautifully written wrong answer. A model with good context will give you something you can actually use.

What started as a fix for a Monday morning reporting task became the founding thesis of an entire company.


II. The State of Things

There's a scene that plays out in every company adopting AI right now. Someone on the team discovers ChatGPT, or Claude, or Gemini. They're amazed. They show it to their manager. The manager is amazed. Pilot project gets approved. A chat widget gets embedded somewhere. Three months later, usage is down to the same two people who were already power users, and everyone else has gone back to the way things were.

The models aren't the problem. The models are extraordinary — capable of reasoning that would have been science fiction three years ago. The problem is that a chat window connected to a general-purpose model is, fundamentally, just a fancier search bar. It doesn't know your business. It can't do anything with what it finds. It's a brain in a jar.

Meanwhile, the models themselves are commoditizing at a speed that has caught even the industry off guard. In 2024, there were two frontier-class model families. By early 2026, there are dozens. Open-weight models run on consumer laptops and match what was state-of-the-art eighteen months ago. API prices drop every quarter. The inference layer is becoming what CPU cycles became in the 2000s — commodity infrastructure you buy by the unit.

This has an implication that most of the industry hasn't fully absorbed yet: the sustainable value in AI is not the model. It's what you build around the model.

The first wave of AI products consisted of thin wrappers — a chat window, an API key, and a marketing page. Those companies competed on prompt engineering and UI polish. That competition is effectively over. The model providers themselves now ship better chat interfaces than most startups can build. If your entire product is "we put a nice UI on GPT," you're already dead — you just haven't checked your pulse.

The companies that survive are solving harder problems. Retrieval. Tool orchestration. Knowledge management. Document workflows. Human-in-the-loop approval systems. Deployment infrastructure. The ugly, unglamorous, deeply domain-specific engineering that connects a model to a real business and lets it do real work.

We think of it as the difference between an engine and a car. Everyone's building better engines. Almost nobody is building the car — the chassis, the steering, the brakes. The thing that makes the engine go somewhere useful without killing anyone.


III. Building the Stack

The vault experiment proved the thesis. But an Obsidian vault plugged into a local model is a prototype, not a platform. To make context and executability work at the scale of a real business — hundreds of documents, multiple users, live data — we had to build our own stack from the ground up.

We chose to build rather than wrap. There are established AI frameworks out there. We evaluated them and decided that for our mission, we needed to own every layer. Not out of ego — out of necessity. When the thing you're researching is the plumbing between model and workflow, you can't outsource the plumbing. The framework is open-source, built entirely in JavaScript (ESM, Node.js 22, no build step), and ships two ways: as a CLI agent for terminal-native workflows and as a 3-pane desktop application (React 18 + Tauri 2) for collaborative document work.

Every component implements a composable pattern we call a Runnable — any unit of work that takes an input, produces an output, and can be chained with other Runnables via .pipe(). Prompts, models, parsers, tools, retrievers, and agents are all Runnables. This isn't novel — it's inspired by patterns in LangChain — but owning the implementation means we control every seam.
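The pattern can be sketched in a few lines of JavaScript. This is a minimal illustration with hypothetical names, not the framework's actual implementation — the production Runnable also handles streaming, batching, and error propagation:

```javascript
// Minimal Runnable sketch: any unit of work exposing invoke(input) -> output,
// chainable via .pipe(). Names here are illustrative.
class Runnable {
  constructor(fn) {
    this.fn = fn;
  }
  async invoke(input) {
    return this.fn(input);
  }
  pipe(next) {
    // The composition is itself a Runnable, so chains nest freely.
    return new Runnable(async (input) => next.invoke(await this.invoke(input)));
  }
}

// A "prompt", a stub "model", and a "parser" — all the same shape.
const prompt = new Runnable((q) => `Answer briefly: ${q}`);
const model = new Runnable((p) => `ANSWER(${p})`);
const parser = new Runnable((t) => t.replace(/^ANSWER\(|\)$/g, ""));

const chain = prompt.pipe(model).pipe(parser);
const out = await chain.invoke("What is ConvergentRank?");
console.log(out); // "Answer briefly: What is ConvergentRank?"
```

Because every component shares one interface, swapping a retriever for a mock or a local model for a cloud one is a one-line change at the call site.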

What follows is the dense part. This is what we've built, how it performs, and what we've learned.


IV. Convergent Retrieval

The first thing that broke was search. The naive approach — embed everything into vectors, search by cosine similarity — works beautifully for conceptual questions. "What's our policy on remote work?" returns the remote work policy. Elegant.

But ask "what are the liability terms in the Whitmore engagement?" and the system falls apart. Vector similarity doesn't understand proper nouns. It doesn't know that "Whitmore" is a client name, not a concept. It returns documents that are semantically similar to liability terms — but not the Whitmore engagement specifically.

Knowledge workers don't ask conceptual questions. They ask specific questions. They want clause 8.4, not "something about liability." They want the Q3 numbers for the Anderson account, not "financial summaries."

We solved this with what we call Convergent Retrieval — a multi-signal search architecture that runs two parallel paths on every query and fuses them into a single ranked result:

  • Dense path — semantic embedding similarity. Handles meaning, paraphrases, conceptual relationships.
  • Sparse path — keyword-level scoring. Handles exact names, identifiers, clause numbers, dates.

The two result sets converge through a rank fusion layer that produces a single normalized relevance score — no manual weight tuning required. The fused results are then diversity-reranked to ensure the model sees breadth rather than five variations of the same paragraph. We call this the ConvergentRank pipeline.
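The fusion step can be illustrated with reciprocal rank fusion (RRF), a standard technique that combines ranked lists using only rank positions — no score normalization or weight tuning. This is a sketch of the general technique, not ConvergentRank's actual fusion or diversity reranker:

```javascript
// Reciprocal Rank Fusion: score each document by the sum of 1/(k + rank + 1)
// across the lists it appears in. k damps the dominance of top ranks.
function rrfFuse(denseIds, sparseIds, k = 60) {
  const score = new Map();
  for (const list of [denseIds, sparseIds]) {
    list.forEach((id, rank) => {
      score.set(id, (score.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...score.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "whitmore-msa" is only second in the dense (semantic) path, but the sparse
// (keyword) path ranks it first, so the fused ranking surfaces it on top.
const dense = ["liability-memo", "whitmore-msa", "indemnity-faq"];
const sparse = ["whitmore-msa", "whitmore-sow", "liability-memo"];
console.log(rrfFuse(dense, sparse)[0]); // "whitmore-msa"
```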

On top of ConvergentRank, the system includes a corrective retry loop: if the initial retrieval is graded as insufficient, the system rewrites the query and retrieves again. It doesn't guess when it can look harder.
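A corrective loop of that shape can be sketched as follows, with a toy grader and rewriter standing in for the LLM-backed steps of the real pipeline:

```javascript
// Corrective retrieval sketch: if the best result falls below a relevance
// floor, rewrite the query and retrieve again rather than letting the model
// guess from weak context.
const MIN_RELEVANCE = 0.25; // mirrors the configured minimum threshold

async function retrieveWithCorrection(query, retrieve, rewrite, maxAttempts = 2) {
  let q = query;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const results = await retrieve(q);
    const best = Math.max(0, ...results.map((r) => r.score));
    if (best >= MIN_RELEVANCE) return results; // sufficient: stop looking
    q = rewrite(q); // insufficient: reformulate and look harder
  }
  return [];
}

// Toy index: the vague query misses; the rewritten query hits.
const index = { "whitmore liability": [{ id: "whitmore-msa", score: 0.8 }] };
const retrieve = async (q) => index[q] ?? [{ id: "noise", score: 0.1 }];
const rewrite = () => "whitmore liability";

const hits = await retrieveWithCorrection("that clause we discussed", retrieve, rewrite);
console.log(hits[0].id); // "whitmore-msa"
```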

Terminology. Convergent Retrieval — our multi-signal search architecture. ConvergentRank — the rank fusion + diversity reranking pipeline. These are the proprietary retrieval layers that differentiate our system from single-signal vector search.

ConvergentRank: the numbers

We tested ConvergentRank against 510 commercial contracts from the CUAD (Contract Understanding Atticus Dataset) — a standard evaluation corpus for legal AI. Contracts are the worst-case scenario for retrieval: dense language, inconsistent formatting, and exact terms that matter enormously.

Parameter | Value
Chunks Retrieved / Query | 12
Min Relevance Threshold | 0.25
Chunk Size (chars) | 1,000
Chunk Overlap (chars) | 200
Figure 1 — ConvergentRank retrieval configuration for the CUAD evaluation.

Embedding is handled locally via a dedicated embedding model (nomic-embed-text-v1.5, quantized to Q4) — no cloud API dependency for indexing. This means the entire knowledge base can be indexed on-premises with zero data leaving the machine.
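The chunking configuration above corresponds to a simple sliding window. A sketch — the real indexer also tracks source offsets and document metadata:

```javascript
// Sliding-window chunker: 1,000-character chunks with 200 characters of
// overlap, so a clause that straddles a boundary still appears intact in
// at least one chunk.
function chunk(text, size = 1000, overlap = 200) {
  const step = size - overlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final window reached the end
  }
  return chunks;
}

const doc = "x".repeat(2500);
const parts = chunk(doc);
console.log(parts.length); // 3 — windows [0,1000), [800,1800), [1600,2500)
```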


V. Inference Architecture

We treat inference as a swappable commodity. The same context layer, tool definitions, and retrieval pipeline work identically whether the underlying model is a locally-hosted open-weight GGUF model on Apple Silicon or a frontier-class API from Anthropic or OpenAI.

Local inference: the hardware reality

Our development and benchmarking environment is an Apple M-series machine with 36 GB unified memory. This is representative of the upper range of what a small business deployment looks like — a single machine, no GPU cluster, no cloud infrastructure.

Model | Parameters | VRAM | Throughput | Quality Target
Qwen3-1.7B | 1.7B | ~1.5 GB | ~120 tok/s | 0.45
Qwen3-8B Q4 | 8B | ~5.5 GB | ~35 tok/s | 0.72
Qwen3-32B Q4 | 32B | ~19 GB | ~7 tok/s | 0.91
Figure 2 — Local model tiers tested on 36 GB Apple Silicon. Quality targets are normalized 0–1 scores from our benchmark suite. Flash attention enabled, 16K context window.

The performance curve is steep. The 8B model generates tokens 5x faster than the 32B but produces measurably lower quality on complex reasoning tasks. The 1.7B is blazing fast but unsuitable for anything beyond classification and simple Q&A. This creates the central question for local deployment: which model do you send each query to?

Pareto Routing

Most AI systems throw the largest available model at every query. This is safe and wasteful. A date lookup doesn't need the same reasoning engine as a multi-step contract analysis.

We've built a routing system we call Pareto Routing — named for the efficiency frontier. Each incoming query is scored along three dimensions:

Dimension | Range | Low Example | High Example
Complexity | 0.0 – 1.0 | Greeting, date lookup | Cross-document analysis, planning
Quality Floor | 0.3 – 0.9 | Internal classification | Client deliverable, domain-expert task
Latency Budget | 2s – 120s | Simple answer | Multi-task planning
Figure 3 — Pareto Routing query scoring dimensions. The system selects the smallest model that satisfies all three constraints.

The system selects the smallest model whose quality meets the floor and whose speed fits the latency budget. On a 36 GB machine, this means the 8B model handles ~60% of queries (the simple majority), the 32B handles complex reasoning, and swaps between them are managed to minimize latency overhead (~10s per model change on a single GPU).
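The selection rule can be sketched directly. Tier numbers mirror Figure 2; the latency estimate here is a crude tokens-over-throughput proxy, not the production scorer:

```javascript
// Pareto Routing sketch: walk the tiers smallest-first and return the first
// model that clears the quality floor and fits the latency budget.
const TIERS = [
  { name: "qwen3-1.7b", quality: 0.45, tokPerSec: 120 },
  { name: "qwen3-8b", quality: 0.72, tokPerSec: 35 },
  { name: "qwen3-32b", quality: 0.91, tokPerSec: 7 },
];

function route({ qualityFloor, latencyBudgetSec, expectedTokens }) {
  for (const tier of TIERS) {
    const estSec = expectedTokens / tier.tokPerSec; // rough latency estimate
    if (tier.quality >= qualityFloor && estSec <= latencyBudgetSec) {
      return tier.name; // smallest tier satisfying every constraint
    }
  }
  return TIERS.at(-1).name; // nothing fits: fall back to highest quality
}

console.log(route({ qualityFloor: 0.3, latencyBudgetSec: 5, expectedTokens: 50 }));
// "qwen3-1.7b" — a simple query stays on the fast tier
console.log(route({ qualityFloor: 0.9, latencyBudgetSec: 120, expectedTokens: 600 }));
// "qwen3-32b" — a high quality floor forces the large tier
```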

Pareto Routing benchmark

Configuration | Queries | Pass Rate | Quality | Total Time
8B only | 10 | 100% | 100% | 217.7s
32B only | 10 | 100% | 100% | 199.7s
Pareto Routing (mixed) | 10 | 100% | 100% | 183.4s
Figure 4 — Pareto Routing benchmark. Optimal routing achieves equivalent quality 8% faster by sending simple queries to the smaller model.

Query | Complexity | 8B Latency | 32B Latency | Routed To
Simple greeting | 0.1 | 1.0s | 1.0s | 8B
General knowledge | 0.3 | 5.8s | 5.6s | 32B
Vault search | 0.5 | 39.3s | 28.9s | 32B
Tool call (calculator) | 0.5 | 9.9s | 10.8s | 8B
Document creation | 0.7 | 51.0s | 34.3s | 32B
Data analysis | 0.7 | 66.0s | 78.0s | 8B
Web research | 0.7 | 11.5s | 14.5s | 8B
Figure 5 — Per-query routing decisions. The optimal tier isn't always the larger model — on 3 of 7 complex queries, the 8B is faster with equivalent quality.

Cloud inference

For deployments where latency and reasoning quality outweigh privacy constraints, the same system runs against cloud APIs. We've validated against Claude Sonnet 4.6 (Anthropic) and GPT-4o (OpenAI). The context layer, tool registry, and approval flow are identical — only the inference call changes. In practice, cloud inference eliminates the latency-quality tradeoff entirely: frontier models produce higher quality at lower latency than any local option. The cost is per-token billing and data leaving the premises.


VI. Full-System Benchmarks: CUAD

Retrieval and inference are components. The real test is whether the entire system works end-to-end: query rewriting, retrieval, tool calling, document creation, collaborative editing, and multi-step planning — all coordinated by the agent graph.

We tested against the CUAD corpus — the same 510 real commercial contracts used in the retrieval benchmark, chosen because contracts demand both precision and reasoning. If the system works on contracts, it works on everything easier.

CUAD v1: known queries

Nineteen tests across every capability in the platform.

19/19 Tests Passed | 100% Avg Quality | 322K Input Tokens | 18.8K Output Tokens

Quality by Category — CUAD v1 (510 contracts, 621s wall time): Direct Q&A 100% · Structured Output 100% · Tool Calling 100% · Vault Retrieval 100% · Document Creation 100% · Document Editing 100% · Planning 100% · Query Rewrite 100%

Figure 6 — CUAD v1 results. 19 tests, 100% pass rate, 322K input tokens consumed across 510 indexed contracts.

Latency profile

Not every query takes the same time. Structured output returns in under 4 seconds. Document creation takes 42 seconds. Planning takes over 2.5 minutes. Understanding where time goes is critical for user experience design.

Category | Avg Latency | Avg Input Tokens | Avg Output Tokens
Structured Output | 3.7s | 1,179 | 47
Tool Calling (7 tests) | 18.4s | ~11K | ~280
Direct Q&A | 23.0s | ~8.8K | ~570
Vault Retrieval (5 tests) | 23.7s | ~139K | ~560
Document Creation | 42.3s | 147 | 1,224
Document Editing | 46.9s | 11,121 | 1,047
Query Rewrite | 49.4s | 12,825 | 681
Planning | 156.3s | 136,271 | 2,982
Figure 7 — Latency by category. Planning consumes 136K input tokens because it synthesizes across multiple documents and task decompositions.

The vault retrieval average of 139K input tokens is notable — this includes cases where ConvergentRank injects large context windows from multiple matching documents. The system is deliberately generous with context: we'd rather give the model too much relevant material than too little.

CUAD v2: generalization

A benchmark that uses the same queries you developed against proves tuning, not generalization. We wrote 18 entirely novel queries — none seen during development — targeting edge cases: unusual contract structures, ambiguous references, cross-document comparisons, and tool chains we'd never tested together.

11/18 Tests Passed | 77% Avg Quality | 13.2K Input Tokens | 379s Wall Time

Quality by Category — CUAD v2 Generalization (18 novel queries): Direct Q&A 100% · Document Creation 100% · Planning 100% · Vault Retrieval 70% · Tool Calling 67% · Structured Output error · Document Editing error

Figure 8 — CUAD v2 generalization. "Error" categories are infrastructure failures (timeouts, dropped connections), not reasoning failures.

The 59% → 100% story

The 100% v1 result didn't start there. Our first full-system benchmark, run 48 hours earlier on the same queries, scored 59% pass rate with 69% average quality across 232K input tokens in 453 seconds.

The retrieval layer wasn't finding documents. Tool calls were failing due to build-system issues. The planning module was generating structurally incomplete task decompositions.

System Quality Progression (same model throughout): Mar 28 — v0: 59% · Mar 30 — v1: 100% · Mar 31 — v2 (novel queries): 77%
Figure 9 — The model was identical across all three runs. Every improvement came from the context and executability layers.

We didn't change the model. The model was identical. Every percentage point of improvement came from ConvergentRank tuning, resolved tool infrastructure, and cleaner prompt architecture. The same brain, with better eyes and better hands, went from failing to perfect.

This is the entire thesis, demonstrated in 48 hours: the model is not the bottleneck. The wiring is.


VII. Document Processing

Real-world knowledge bases are ugly. Scanned PDFs from 2014. Word documents with track changes from three authors. Excel spreadsheets where critical data lives in merged cells. If you can't ingest the ugly stuff, your knowledge base has holes — and holes in context produce hallucinations in output.

Document | Format | Quality | Content | Sections | Parse Time
Professional resume | PDF | 75% | 2,101 chars | 8 | 11.2s
Engineering spec | PDF | 75% | 3,411 chars | 64 | 9.7s
Financial summary | PDF | 61% | 10,015 chars | 9 | 13.7s
Debt analysis | PDF | 62% | 5,880 chars | — | 13.5s
Complex report | PDF | 47% | 7,544 chars | — | 16.6s
Figure 10 — PDF extraction quality. Well-structured documents with clear headings parse at 75%. Complex layouts with mixed content drop to 47%. Active area of improvement.

The pipeline handles PDF, Word, Excel, PowerPoint, images (via OCR), and plain text. Ingestion is incremental — only changed files are reprocessed. A file watcher triggers re-indexing within seconds of a new document landing in the vault.
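Incremental ingestion reduces to change detection. A sketch using content hashes — the production pipeline pairs this with a file watcher so re-indexing starts within seconds of a save:

```javascript
// Re-process a file only when its content hash changes, so a vault of
// hundreds of documents re-indexes incrementally instead of from scratch.
import { createHash } from "node:crypto";

const indexed = new Map(); // path -> last indexed content hash

function filesToReindex(files) {
  const stale = [];
  for (const { path, content } of files) {
    const hash = createHash("sha256").update(content).digest("hex");
    if (indexed.get(path) !== hash) {
      stale.push(path); // new or changed: queue for re-indexing
      indexed.set(path, hash);
    }
  }
  return stale;
}

console.log(filesToReindex([{ path: "notes.md", content: "v1" }])); // ["notes.md"]
console.log(filesToReindex([{ path: "notes.md", content: "v1" }])); // [] — unchanged
console.log(filesToReindex([{ path: "notes.md", content: "v2" }])); // ["notes.md"]
```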


VIII. The Executability Layer

Context makes the model smarter. But smarter isn't the same as useful. A model that can tell you the answer but can't act on it is still just a conversation.

The model has access to a curated tool registry. Each tool is scoped to a specific capability — mathematical computation, web search, spreadsheet analysis, PDF operations, vault search, and collaborative document editing. Tools execute in parallel when multiple are called in a single turn.

The most important constraint in the system: the model never modifies a document directly. It proposes a change as a structured diff. The change surfaces in the user's workspace. The user accepts or rejects. Nothing changes without approval. We call this the Proposal Gate — and it is not optional, not configurable, and not something we will compromise on.
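The gate can be sketched as two functions: one that produces a pending structured diff, one that applies it only on explicit approval. Field names are illustrative, not the framework's actual schema:

```javascript
// Proposal Gate sketch: the model proposes a diff; nothing touches the
// document until a human approves it.
function proposeEdit(doc, { find, replace, rationale }) {
  if (!doc.text.includes(find)) throw new Error("proposed span not found");
  return { docId: doc.id, find, replace, rationale, status: "pending" };
}

function applyIfApproved(doc, proposal, approved) {
  if (!approved) return doc; // rejected: the document stays untouched
  return { ...doc, text: doc.text.replace(proposal.find, proposal.replace) };
}

const doc = { id: "sow-7", text: "Payment due in 30 days." };
const p = proposeEdit(doc, {
  find: "30 days",
  replace: "45 days",
  rationale: "Matches the amended MSA terms.",
});

console.log(applyIfApproved(doc, p, false).text); // "Payment due in 30 days."
console.log(applyIfApproved(doc, p, true).text);  // "Payment due in 45 days."
```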

Query routing is handled by deterministic heuristics — pattern matching, state inspection, keyword analysis. The model is never asked "what kind of query is this?" because that question costs tokens and adds latency. We call this Zero-Cost Routing. The system already knows.
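A deterministic classifier of that kind is just ordered pattern and state checks. The patterns below are illustrative stand-ins, not the production rule set:

```javascript
// Zero-Cost Routing sketch: classify a query with regex and state checks —
// no model call, so classification costs zero tokens and no round-trip.
function classifyQuery(q, state = {}) {
  const text = q.trim().toLowerCase();
  if (/^(hi|hello|hey|thanks)\b/.test(text)) return "chitchat";
  if (state.activeDocument && /\b(edit|rewrite|rephrase)\b/.test(text)) return "document_edit";
  if (/\b(calculate|sum|average|convert)\b/.test(text)) return "tool_call";
  if (/\b(find|search|where|clause|contract)\b/.test(text)) return "vault_search";
  return "direct_qa";
}

console.log(classifyQuery("hey there")); // "chitchat"
console.log(classifyQuery("find the liability clause in the Whitmore contract")); // "vault_search"
console.log(classifyQuery("rewrite this paragraph", { activeDocument: "sow-7.md" })); // "document_edit"
```

The ordering matters: an edit verb on an open document wins over a search keyword, which keeps routing stable without any learned component.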


IX. Where We're Headed

Agntic is a research company first. A consultancy second.

That ordering matters because it defines how we make decisions. A consultancy optimizes for client delivery speed. A research company optimizes for understanding. When those two priorities conflict — and they will — we choose understanding. We'd rather take longer to deploy something we've validated than ship fast and discover the retrieval layer hallucinates on edge cases we didn't test.

The research agenda is the mission itself: how do we give AI models context and executability? That question doesn't have a finish line. It has layers. And we've only built the first ones.

What we're researching next

ConvergentRank works. But it works on a static knowledge base. The next research problem is temporal context — how does the system handle knowledge that changes, contradicts itself across document versions, or becomes stale? A contract amendment from last week should override the original clause. A revised SOP should supersede the one from 2023. The retrieval layer needs to understand not just what is relevant, but when it was relevant.

Pareto Routing works on a two-tier model pool. The next research problem is dynamic tier management — can the system learn from its own routing decisions over time? If a query class consistently exceeds the small model's quality threshold, should the router adapt without manual tuning?

The Proposal Gate works for single-document edits. The next research problem is multi-artifact orchestration — what does human-in-the-loop approval look like when the model is proposing changes across five documents simultaneously as part of a planning task?

How the research ships

Today, our research ships as bespoke deployments. A client's knowledge base is ingested. ConvergentRank handles retrieval. Pareto Routing matches queries to models. The Proposal Gate ensures the human stays in the loop. It ships as a CLI for technical users, or as a white-labeled desktop application under the client's brand.

Whether the patterns we're discovering — Convergent Retrieval, Pareto Routing, Proposal Gate, Zero-Cost Routing — should remain embedded in bespoke systems or become a platform that other developers build on is a question we're deliberately not answering yet. The companies that declared themselves platforms on day one built infrastructure nobody used. The ones that built real systems and noticed the platform emerging from the patterns — those are the ones that became real platforms.

We're building real systems. Publishing our research openly. And paying attention.


X. Monday Mornings

We still think about that first Monday morning. Not because the thirty-minute task was important — it wasn't. Because of what it revealed.

Every knowledge worker has a version of that task. Something that takes too long, not because the work is hard, but because the information is scattered and the tools don't talk to each other. The model could always do the thinking. What it couldn't do was know your context and act on your behalf.

Everything in this paper is real. The benchmarks, the failures, the progression from 59% to 100% in 48 hours. We publish the failures alongside the successes because that's how research works — and because the failures are more interesting than the wins. The 100% tells you the system can work. The 77% tells you where to look next. The 59% tells you where we came from.

This company started as one person solving a Monday morning problem. It became a question we couldn't stop asking: how do we give AI the context to understand and the tools to act?

We're still asking. We'll keep showing our work.

About Agntic. Agntic is an applied AI research company based in Birmingham, Alabama. We build the technology that gives AI models context and the ability to act. Our research ships as bespoke deployments for knowledge workers. The framework is open-source. For inquiries, contact hello@agntic.com.