Most public "hallucination prevention" advice for LLMs is prompt-engineering folklore: tell the model to "only answer based on the provided context" and hope. Grow2.ai has run that approach in production and measured it. It works roughly 92% of the time, which is another way of saying it fails in about 8% of cases. We needed a hallucination rate below 0.4%, so we built a retrieval pipeline that makes hallucination structurally hard, not just discouraged.
The five stages
- Source-typed retrieval. Every retrieved chunk carries a structured source descriptor: { doc_id, section_id, version, last_modified, source_type }. The agent never sees naked text; it sees text plus provenance. (A TypeScript sketch of this shape follows the list.)
- Quote-or-defer rule. The agent is instructed (in code, not just prompt) that any factual claim must either quote a chunk or explicitly say "I don't have a source for that — let me check with the team." The escalation path is real: those messages route to a human, who answers and writes a new chunk. (Sketched below, together with the citation marker and the gap log.)
- Citation rendering. Every reply that contains a quoted claim renders an inline citation marker (see the combined sketch after this list). To the user, this can be styled invisibly, but the marker is in the message thread, and it becomes evidence if a dispute arises later.
- Post-hoc validation. A second, smaller model reads the agent's reply and the chunks it cited. If the reply contains a factual claim not covered by the chunks, the validator flags the reply (a sketch of this gate also follows the list). We sample 100 conversations a week through this gate; any flag becomes an eval-set entry.
- Knowledge-gap log. Every "I don't have a source for that" deferral writes a row to a gaps table. Ops triages weekly. New chunks get added. The agent gets quietly smarter, on a per-client basis, without retraining anything.
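For concreteness, here is a minimal TypeScript sketch of the stage-1 source descriptor and the chunk wrapper the agent sees. The five descriptor fields come from the list above; the SourceType union members and the RetrievedChunk name are illustrative assumptions, not the production schema.

```ts
// Stage 1: provenance attached to every retrieved chunk.
// The union members below are illustrative; real deployments define their own.
type SourceType = "help_center" | "internal_wiki" | "policy_doc" | "human_answer";

interface SourceDescriptor {
  doc_id: string;
  section_id: string;
  version: string;
  last_modified: string; // ISO-8601 timestamp
  source_type: SourceType;
}

// The agent never receives naked text: retrieval always returns
// the text paired with its descriptor.
interface RetrievedChunk {
  text: string;
  source: SourceDescriptor;
}
```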
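Next, a hedged sketch of the quote-or-defer rule (stage 2), the inline citation marker (stage 3), and the knowledge-gap write (stage 5). The function name, the verbatim-substring test for "did the draft quote a chunk", the citation marker format, and the gaps-table shape are all assumptions standing in for richer production rules.

```ts
// Stages 2, 3 and 5 in one pass. Names and checks here are illustrative.
// Uses the RetrievedChunk type from the source-descriptor sketch.
const DEFERRAL =
  "I don't have a source for that — let me check with the team.";

interface GapRow {
  client_id: string;
  question: string;
  logged_at: string;
}

// Decide what to send. If the draft quotes a retrieved chunk verbatim,
// append an inline citation marker (stage 3) and send it. Otherwise defer
// (stage 2) and log a knowledge gap (stage 5).
async function quoteOrDefer(
  clientId: string,
  question: string,
  draft: string,
  chunks: RetrievedChunk[],
  logGap: (row: GapRow) => Promise<void>,
): Promise<string> {
  // Crude stand-in for the real "does the reply quote a chunk" check.
  const quoted = chunks.find((c) => draft.includes(c.text));

  if (quoted) {
    const s = quoted.source;
    // Inline citation marker; can be styled invisibly in the UI,
    // but it stays in the message thread as evidence.
    return `${draft} [${s.doc_id}/${s.section_id}@${s.version}]`;
  }

  await logGap({
    client_id: clientId,
    question,
    logged_at: new Date().toISOString(),
  });
  return DEFERRAL; // routes to a human, whose answer becomes a new chunk
}
```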
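And a sketch of the post-hoc validation gate (stage 4). The validator prompt, the NONE convention, and the flag shape are assumptions; the only contract taken from the text above is that a second, smaller model reads the reply plus its cited chunks and flags any unsupported claim.

```ts
// Stage 4: a second, smaller model audits the reply against its cited chunks.
// `validatorModel` is an assumed abstraction over whichever model is used.
// Uses the RetrievedChunk type from the source-descriptor sketch.
interface ValidatorFlag {
  conversation_id: string;
  unsupported_claim: string;
}

async function validateReply(
  conversationId: string,
  reply: string,
  citedChunks: RetrievedChunk[],
  validatorModel: (prompt: string) => Promise<string>,
): Promise<ValidatorFlag | null> {
  const prompt = [
    "Below is an agent reply and the source chunks it cited.",
    "Quote one factual claim in the reply that is NOT supported by the chunks, or answer NONE.",
    `REPLY:\n${reply}`,
    `CHUNKS:\n${citedChunks.map((c) => c.text).join("\n---\n")}`,
  ].join("\n\n");

  const verdict = (await validatorModel(prompt)).trim();
  if (verdict === "NONE") return null;

  // Any flag becomes an eval-set entry; that routing lives outside this sketch.
  return { conversation_id: conversationId, unsupported_claim: verdict };
}
```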
The numbers across 14 pilots
- Hallucination rate (validator-flagged): 0.31% across 71,200 sampled conversations.
- Deferral rate ("let me check with the team"): 6.4% of conversations contain at least one deferral; 88% of those resolve without further escalation once the human supplies an answer.
- Knowledge-gap closure: median time from gap-logged to chunk-added is 4 days, with weekly triage.
What it doesn't solve
Citation-first does not stop the model from getting tone wrong. It does not stop the model from selecting the wrong chunk when two chunks contradict (we have a separate "chunk-conflict" flag for that). It does not stop the model from answering a question the user didn't actually ask — the classic LLM failure where the reply is technically correct but addresses a different problem.
Hallucination is only a small part of agent quality. Grow2.ai treats it as a solved problem at the architecture level so we can spend prompt-engineering time on the harder problems: tone, escalation timing, and conversational repair when the user is upset.
Why we don't open-source the code
We get asked. The honest answer: the pipeline is ~600 lines of TypeScript and SQL, and it's not the interesting part. The interesting parts are the per-client chunk schemas, the deferral rules, and the eval set. None of those are extractable into a standalone library — they live and breathe with the workflow they serve. Grow2.ai will walk a serious technical buyer through the pipeline on a call. We will not ship a generic version.