A vector database is where retrieval starts. This is what enough actually looks like.
Most teams building AI on their documents stop at a vector database. We
build a retrieval architecture that searches over them with four
complementary signals at the same time — hybrid retrieval, query
rewriting, diversity reranking, and confidence tagging — so the agent
stops being confidently wrong.
One query, four lanes, one ranked answer. Each signal catches something the others miss — and the confidence tag is what tells the agent whether it can speak from the vault or needs to look elsewhere.
Any team building AI on top of their documents starts the same way: pick an
embedding model, drop their files into a vector database, and watch the demo
work. Then real users show up. They paste in an error code and get back
something tangentially related. They ask the same question two different
ways and get two different answers. The model speaks with full confidence
about a document that doesn't actually exist. The vector database isn't
wrong — it's just incomplete.
01 · The ceiling
Why a vector database alone keeps falling short
Dense embeddings are extraordinary at meaning. They are mediocre at strings.
Ask a vector index about "ERR_2847" or invoice number "INV-2024-08831"
or the citation "§4.2(c)" and it will happily return passages about
error handling, invoicing, or contract clauses in general. The exact match was right there in the
corpus; the geometry just smoothed it out. Keyword search has the opposite
problem: it nails the literal token and misses the same idea phrased
differently. Most production systems pick one and pay the cost of the other.
On top of that, there's no answer to the question every business actually
cares about: how sure is this?
Vector-only retrieval: "What does ERR_2847 mean?"
Returns "Common error handling patterns", "Debugging guide overview", "Logging best practices." The exact string is in the docs. None of the top hits contain it.
Multi-signal retrieval: "What does ERR_2847 mean?"
The BM25 lane snaps to the literal token, the dense lane confirms the semantic neighborhood, and RRF fuses the two. The exact-match runbook entry surfaces at position one, tagged HIGH.
02 · The four signals
Four ways of asking, fused into one answer
The architecture isn't an ensemble of models stacked on each other. It's
four orthogonal questions asked of the same vault in parallel, then merged
with techniques that already have years of literature behind them. The
point is the combination — each signal catches a class of failure the
others can't.
Signal 01
Hybrid retrieval: meaning and keyword
Two retrievers run side by side: a dense-vector search over embeddings and
a BM25 keyword search over the raw text. Their result lists are fused
using Reciprocal Rank Fusion with a constant k=60, a
default from the IR literature that rarely needs tuning per corpus.
Vectors handle paraphrase; BM25 handles names, error codes, citation
numbers, and any string a customer actually typed. Neither covers both.
Lane A · Dense vectors
Semantic neighbors: cosine-similar passages by meaning.
  01  Procurement · vendor terms (score 0.82)
  02  Negotiation tactics overview (score 0.74)
  03  Quarterly contract review (score 0.69)
Lane B · BM25 keyword
Literal-token hits: exact strings, codes, and citations.
  01  MSA §4.2 "renegotiation" (score 14.2)
  02  "90-day" clause (score 11.9)
  03  Vendor agreement appendix (score 9.6)
Fused with RRF, k = 60
Reciprocal Rank Fusion (k=60). Both lists vote, and only rank position counts; raw scores are discarded. An exact-match clause that BM25 ranks first and the dense lane also returns, even lower down, outranks a soft semantic hit that only the dense lane found.
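For readers who want to see the fusion step concretely, here is a minimal sketch of Reciprocal Rank Fusion with k=60. The document ids and the two input rankings are made up for illustration (in particular, the dense lane is assumed to return the MSA clause at rank 3), and the function name is ours, not a library API.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids. Only rank positions are used."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings: ids are illustrative, not real vault entries.
dense = ["vendor_terms", "negotiation_overview", "msa_4_2"]
bm25 = ["msa_4_2", "ninety_day_clause", "vendor_appendix"]

fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

Because "msa_4_2" collects a vote from both lanes, it ends up ahead of "vendor_terms", which only the dense lane returned at rank 1.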
Signal 02
RAG-Fusion: ask the question three different ways
Before retrieval runs, an LLM rewrites the user's question into three
additional variants — different vocabulary, different framing, same
intent. Hybrid retrieval runs on all four queries (the
original plus three rewrites), and the four result sets are fused
together. The win: documents written in vocabulary the user didn't use
still surface. Someone asks about "firing a vendor"; the docs say
"termination for cause"; the rewrite bridges the gap.
User query: How do we get out of a vendor contract early?
  v1  What are the termination clauses in our vendor agreements?
  v2  Conditions under which a vendor MSA can be cancelled before term end.
  v3  Early exit, breach, and termination-for-cause provisions for suppliers.
Per-variant hybrid retrieval → fused
  Original · 12 hits
  v1 · 12 hits
  v2 · 12 hits
  v3 · 12 hits
Fused ranked list: 14 unique, vocabulary-bridged
One question, four fan-outs. Catches documents whose vocabulary doesn't overlap with how the user phrased their question — the failure mode no amount of embedding tuning can fix.
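The fan-out can be sketched in a few lines. This is an illustration under stated assumptions, not the production pipeline: `toy_rewrite` stands in for the LLM rewrite step with hard-coded variants, `toy_retrieve` stands in for hybrid retrieval with a crude word-overlap ranking, and the three-document corpus is invented. Only the shape of the flow (original plus three rewrites, one retrieval per query, rank-based fusion) comes from the description above.

```python
from collections import defaultdict

# Invented mini-corpus; none of these documents mention "firing".
CORPUS = {
    "termination_clause": "termination for cause provisions in the msa",
    "supplier_exit": "early exit and breach terms for suppliers",
    "onboarding": "vendor onboarding checklist",
}

def toy_rewrite(query):
    # Stand-in for the LLM rewrite step: three hard-coded variants.
    return [
        "termination clauses in vendor agreements",
        "conditions to cancel a vendor msa before term end",
        "early exit breach termination for cause suppliers",
    ]

def toy_retrieve(query):
    # Stand-in for hybrid retrieval: rank documents by shared-word count.
    words = set(query.lower().split())
    hits = [(len(words & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted(hits, key=lambda h: (-h[0], h[1]))
    return [doc_id for overlap, doc_id in ranked if overlap > 0]

def rag_fusion(query, rewrite, retrieve, k=60):
    """Retrieve for the original query plus its rewrites, then fuse by rank."""
    scores = defaultdict(float)
    for q in [query] + rewrite(query):
        for rank, doc_id in enumerate(retrieve(q), start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

question = "firing a vendor"
print(toy_retrieve(question))                        # original query alone
fused = rag_fusion(question, toy_rewrite, toy_retrieve)
print(fused)                                         # after fan-out and fusion
```

In this toy setup the original query never reaches the termination clause, because the two share no words. The rewrites do reach it, which is the vocabulary bridge the section describes.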
Signal 03
MMR reranking: the top five shouldn't repeat themselves
A common failure of fused retrieval is that the top hits are all close
paraphrases of the same paragraph from the same file. The model gets the
same evidence five times and confidently calls it consensus. Maximal
Marginal Relevance rebalances the final list to trade a little
relevance for diversity, so the answer is grounded in multiple
distinct passages instead of one document quoted back to itself.
Pre-MMR · raw fused list (redundant)
  01  MSA §4.2: renegotiation triggers [Source A]
  02  MSA §4.2: renegotiation triggers (¶b) [Source A]
  03  MSA §4.2: renegotiation triggers (¶c) [Source A]
  04  MSA §4.2: renegotiation triggers (¶d) [Source A]
  05  Procurement playbook: check-ins [Source B]
Post-MMR · rebalanced (diverse)
  01  MSA §4.2: renegotiation triggers [Source A]
  02  Procurement playbook: check-ins [Source B]
  03  Finance memo: exception process [Source C]
  04  Legal brief: auto-renewal [Source D]
  05  MSA §4.2: renegotiation triggers (¶b) [Source A]
Diversity beats echo. Five copies of the same paragraph at the top of the list is the same as one paragraph — and worse, because the model treats the repetition as agreement.
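A minimal sketch of greedy Maximal Marginal Relevance shows how the rebalancing works. The relevance scores, the source-based similarity function, and the trade-off weight `lam=0.7` are all assumptions chosen for illustration; a real system would typically use embedding similarity between passages.

```python
def mmr(candidates, relevance, similarity, top_k=5, lam=0.7):
    """Greedy MMR: repeatedly pick the candidate with the best
    lam * relevance - (1 - lam) * (max similarity to anything already picked)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:
        def mmr_score(doc):
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical fused results: three passages from source A, one each from B and C.
SOURCE = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
RELEVANCE = {"a1": 0.90, "a2": 0.88, "a3": 0.86, "b1": 0.80, "c1": 0.75}

def same_source_similarity(x, y):
    # Toy similarity: passages from the same source count as near-duplicates.
    return 0.95 if SOURCE[x] == SOURCE[y] else 0.10

reranked = mmr(list(RELEVANCE), RELEVANCE, same_source_similarity, top_k=3)
print(reranked)
```

Sorted by relevance alone, the top three would all come from source A. With the redundancy penalty, the second and third slots go to sources B and C instead.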
Signal 04
Self-RAG: how sure is this answer?
Every result that survives the funnel is tagged with a confidence label —
HIGH, MEDIUM, or LOW —
derived from the fused score, source agreement, and grounding strength.
The agent uses the tag to decide what happens next: answer directly from
the vault, fall back to a live web search, or ask the user a clarifying
question. This is the signal that prevents the confidently-wrong answer.
Without it, every result looks equally trustworthy.
Topic absent or contradicted in vault · fused = 0.31 · agreement = 0/3 → Ask user to clarify
The confidence tag is the dispatcher. The same architecture that finds the answer also says "I don't know" — out loud, with reasons. That's the difference between an assistant that's useful and one that's dangerous.
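The dispatch logic might look roughly like the sketch below. The thresholds, the assumption that the fused score is normalized to the 0..1 range, and the mapping of MEDIUM to a web-search fallback are all ours, picked only to make the example concrete. Grounding strength is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum

class Confidence(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

@dataclass
class Evidence:
    fused_score: float      # assumed normalized to 0..1
    agreeing_sources: int   # sources that support the top result
    sources_checked: int    # sources consulted in total

def tag(e: Evidence) -> Confidence:
    """Assign a label from fused score and source agreement.
    Threshold values are illustrative assumptions, not tuned numbers."""
    agreement = e.agreeing_sources / e.sources_checked if e.sources_checked else 0.0
    if e.fused_score >= 0.75 and agreement >= 2 / 3:
        return Confidence.HIGH
    if e.fused_score >= 0.50 and agreement >= 1 / 3:
        return Confidence.MEDIUM
    return Confidence.LOW

# Hypothetical mapping from label to the agent's next step.
NEXT_ACTION = {
    Confidence.HIGH: "answer_from_vault",
    Confidence.MEDIUM: "fall_back_to_web_search",
    Confidence.LOW: "ask_user_to_clarify",
}

weak = Evidence(fused_score=0.31, agreeing_sources=0, sources_checked=3)
print(tag(weak).value, "->", NEXT_ACTION[tag(weak)])
```

With the figures from the example above (fused = 0.31, agreement = 0/3), this sketch lands on LOW and routes to a clarifying question.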
03 · The vault itself
Incremental ingestion, no batch re-runs
A retrieval architecture is only as good as the index underneath it, and
in production the index is never finished. Documents arrive, get edited,
get deleted. We don't re-embed the corpus on every run — that's wasteful
and slow. Ingestion is manifest-driven: a record at
./memory/vectors/ingestion-manifest.json tracks the hash and
timestamp of every file we've already processed. Each run diffs the
filesystem against the manifest and only touches NEW,
CHANGED, and DELETED entries. A file
watcher rides alongside it and re-indexes documents the moment they
change on disk — no batch jobs, no overnight rebuilds, no stale results
sitting in the vault while someone waits for the next sync.
Manifest-driven, watcher-fed. The vault is never stale and never re-embedded for nothing — only the rows that actually changed cost compute.
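The diff step can be sketched as follows. The manifest layout used here (a mapping from relative path to a record with `hash` and `timestamp` keys) is an assumption, as is the choice of SHA-256; the text above only says that a hash and a timestamp are tracked per file. The file watcher is not shown.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path: Path) -> str:
    # Content hash; SHA-256 is our choice for the sketch.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def diff_against_manifest(root: Path, manifest: dict) -> dict:
    """Compare files under root with the manifest and bucket the differences."""
    current = {
        p.relative_to(root).as_posix(): file_hash(p)
        for p in root.rglob("*") if p.is_file()
    }
    return {
        "NEW": sorted(set(current) - set(manifest)),
        "CHANGED": sorted(
            path for path in current
            if path in manifest and manifest[path]["hash"] != current[path]
        ),
        "DELETED": sorted(set(manifest) - set(current)),
    }

# Small demo in a temporary directory with a hypothetical manifest.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "kept.md").write_text("same")
    (root / "edited.md").write_text("new text")
    (root / "added.md").write_text("hello")
    manifest = {
        "kept.md": {"hash": file_hash(root / "kept.md"), "timestamp": 0},
        "edited.md": {"hash": "stale-hash", "timestamp": 0},
        "removed.md": {"hash": "stale-hash", "timestamp": 0},
    }
    changes = diff_against_manifest(root, manifest)

print(changes)
```

Only the paths in the three buckets would need embedding work or index cleanup. The unchanged file costs one hash and nothing more.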
04 · Why we built it this way
What "enough" looks like in production
Every team building AI on top of their own documents discovers the same
thing in roughly the same order. The vector demo is exhilarating, the
first real users are humbling, and somewhere around month three the
backlog fills up with the same handful of complaints: it can't find
things by name, it gives me the same paragraph five times, it sounds
sure about things it doesn't know. We've built this architecture
for clients enough times that we now ship it as the default. Hybrid
retrieval, RAG-Fusion rewriting, MMR diversity, Self-RAG confidence,
manifest-driven ingestion — none of these are novel on their own. The
work, and the value, is in stacking them so they cover each other's
gaps and ship as one coherent system.
That's what "enough" looks like in production. Not a vector database with
a chat wrapper — a retrieval architecture that knows what it knows, knows
what it doesn't, and keeps the vault honest as the documents underneath
change.
Building AI on top of your documents?
We build this architecture for clients in regulated industries, internal-tools teams, and anywhere a confidently-wrong answer carries a real cost.