Citation-first answers — the workflow, not the prompt

By Andrew Maryasov · Grow2.ai · 2025-05-22 · 9 min
Entry FN-007
Category Engineering

Most public "hallucination prevention" advice for LLMs is prompt-engineering folklore: tell the model to "only answer based on the provided context" and hope. Grow2.ai has run that approach in production and measured it. It works ~92% of the time, which means it fails roughly 8% of the time. We needed a hallucination rate below 0.4%, so we built a retrieval pipeline that makes hallucination structurally hard, not just discouraged.

The five stages

  1. Source-typed retrieval. Every retrieved chunk carries a structured source descriptor: { doc_id, section_id, version, last_modified, source_type }. The agent never sees naked text — it sees text plus provenance.
  2. Quote-or-defer rule. The agent is instructed (in code, not just prompt) that any factual claim must either quote a chunk or explicitly say "I don't have a source for that — let me check with the team." The escalation path is real: those messages route to a human, who answers and writes a new chunk.
  3. Citation rendering. Every reply that contains a quoted claim renders an inline citation marker. To the user, this can be styled invisibly — but the marker is in the message thread, and it becomes evidence if a dispute arises later.
  4. Post-hoc validation. A second, smaller model reads the agent's reply and the chunks it cited. If the reply contains a factual claim not covered by those chunks, the validator flags the reply. We sample 100 conversations a week through this gate; any flag becomes an eval-set entry.
  5. Knowledge-gap log. Every "I don't have a source for that" deferral writes a row to a gaps table. Ops triages weekly. New chunks get added. The agent gets quietly smarter, on a per-client basis, without retraining anything.
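
A minimal sketch of how stages 1 to 3 could fit together in code. It is not the production pipeline. Only the five descriptor fields come from the list above; the `Chunk`, `Claim` and `GateResult` types, the `quoteOrDefer` and `citationMarker` functions, the marker format, and the sample refund chunk are all hypothetical. The rule enforced here is deliberately crude: a claim is rendered only if its quote appears in the chunk it points at, otherwise the reply becomes a deferral.

```typescript
// Field names inside SourceDescriptor come from the note above.
// Every other name in this sketch is a hypothetical placeholder.
interface SourceDescriptor {
  doc_id: string;
  section_id: string;
  version: string;
  last_modified: string; // assumed to be an ISO 8601 date string
  source_type: string;   // assumed free-form label, e.g. "policy"
}

interface Chunk {
  text: string;
  source: SourceDescriptor;
}

interface Claim {
  statement: string;   // what the agent wants to say
  quote?: string;      // verbatim span it claims to be quoting
  chunkIndex?: number; // index of the retrieved chunk holding that span
}

type GateResult =
  | { kind: "cited"; text: string }
  | { kind: "deferred"; text: string; gap: string };

// Deferral wording adapted from the note.
const DEFERRAL = "I don't have a source for that. Let me check with the team.";

// Collapse whitespace and case so trivial formatting differences do not fail a quote.
function normalize(s: string): string {
  return s.replace(/\s+/g, " ").trim().toLowerCase();
}

// Inline marker format is an assumption; any stable, parseable format would do.
function citationMarker(src: SourceDescriptor): string {
  return "[" + src.doc_id + "#" + src.section_id + "@" + src.version + "]";
}

// A claim passes only if its quote appears (after normalization) in the chunk it points at.
function quoteOrDefer(claim: Claim, chunks: Chunk[]): GateResult {
  const chunk = claim.chunkIndex === undefined ? undefined : chunks[claim.chunkIndex];
  const quote = claim.quote === undefined ? "" : normalize(claim.quote);
  if (chunk !== undefined && quote.length > 0 && normalize(chunk.text).indexOf(quote) !== -1) {
    return { kind: "cited", text: claim.statement + " " + citationMarker(chunk.source) };
  }
  // No verifiable quote: defer, and hand the unanswered statement to the gap log.
  return { kind: "deferred", text: DEFERRAL, gap: claim.statement };
}

// Made-up example data.
const retrieved: Chunk[] = [{
  text: "Refunds are issued within 14 days of the return being received.",
  source: {
    doc_id: "refund-policy", section_id: "s2", version: "v3",
    last_modified: "2025-04-01", source_type: "policy",
  },
}];

const cited = quoteOrDefer(
  { statement: "Refunds take up to 14 days.", quote: "within 14 days", chunkIndex: 0 },
  retrieved,
);
const deferred = quoteOrDefer({ statement: "Refunds are paid in store credit." }, retrieved);

console.log(cited.text);    // Refunds take up to 14 days. [refund-policy#s2@v3]
console.log(deferred.text); // I don't have a source for that. Let me check with the team.
```

Returning a tagged result instead of a bare string keeps the deferral path explicit: the caller has to handle the `deferred` case, which is where routing to a human and writing a gap row would plug in.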

The numbers across 14 pilots

  • Hallucination rate (validator-flagged): 0.31% across 71,200 sampled conversations.
  • Deferral rate ("let me check with the team"): 6.4% of conversations contain at least one deferral; 88% of those resolve without further escalation once the human supplies an answer.
  • Knowledge-gap closure: median time from gap-logged to chunk-added is 4 days, with weekly triage.
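
For readers who want to reproduce metrics of this kind, here is a hedged sketch of how a gap-closure median and an implied flag count could be computed. The `GapRow` shape and its column names are assumptions, not the real gaps table, and the sample rows are made up. The flag count assumes the rate is simply flagged conversations divided by sampled conversations.

```typescript
// Hypothetical row shape for a gaps table; column names are assumptions.
interface GapRow {
  question: string;
  loggedAt: string;      // ISO date the deferral was logged
  chunkAddedAt?: string; // ISO date a new chunk closed the gap, if it has been closed
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Days between logging a gap and adding the chunk; undefined while the gap is open.
function daysToClose(row: GapRow): number | undefined {
  if (row.chunkAddedAt === undefined) return undefined;
  return (Date.parse(row.chunkAddedAt) - Date.parse(row.loggedAt)) / MS_PER_DAY;
}

function median(values: number[]): number | undefined {
  if (values.length === 0) return undefined;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median closure time over closed gaps only; open gaps are ignored.
function medianDaysToClose(rows: GapRow[]): number | undefined {
  const closed: number[] = [];
  for (const row of rows) {
    const d = daysToClose(row);
    if (d !== undefined) closed.push(d);
  }
  return median(closed);
}

// Whole-conversation count implied by a percentage rate over a sample.
function impliedCount(ratePercent: number, sampled: number): number {
  return Math.round((ratePercent / 100) * sampled);
}

// Made-up rows for illustration.
const gapRows: GapRow[] = [
  { question: "Do you ship to Norway?", loggedAt: "2025-05-01", chunkAddedAt: "2025-05-03" },
  { question: "Is there a student discount?", loggedAt: "2025-05-01", chunkAddedAt: "2025-05-05" },
  { question: "Can I pay by invoice?", loggedAt: "2025-05-02", chunkAddedAt: "2025-05-09" },
  { question: "Do you support SSO?", loggedAt: "2025-05-06" }, // still open
];

console.log(medianDaysToClose(gapRows)); // 4
console.log(impliedCount(0.31, 71200));  // 221
```

Under that assumption, 0.31% of 71,200 sampled conversations works out to roughly 221 flagged conversations.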

What it doesn't solve

Citation-first does not stop the model from getting tone wrong. It does not stop the model from selecting the wrong chunk when two chunks contradict (we have a separate "chunk-conflict" flag for that). It does not stop the model from answering a question the user didn't actually ask — the classic LLM failure where the reply is technically correct but addresses a different problem.

Hallucination is a small part of agent quality. Grow2.ai treats it as a solved problem at the architecture level so we can spend prompt-engineering time on the harder problems: tone, escalation timing, conversational repair when the user is upset.

Why we don't open-source the code

We get asked. The honest answer: the pipeline is ~600 lines of TypeScript and SQL, and it's not the interesting part. The interesting parts are the per-client chunk schemas, the deferral rules, and the eval set. None of those are extractable into a standalone library — they live and breathe with the workflow they serve. Grow2.ai will walk a serious technical buyer through the pipeline on a call. We will not ship a generic version.
