3 protection layers of an AI agent: what happens when it slips up

Nine times out of ten, clients ask this question before launch: "What if your AI tells a customer something off the rails?" The honest answer: with no protection, it will. With three protection layers, it almost never does. Here is how it works technically and what we guarantee.

Layer 1 — Prompt rules and white-list scope

The first layer isn't a "setting" — it's an architectural constraint. Any company's AI agent gets a system prompt along the lines of: "You are a sales assistant. Your job is to qualify the inbound request and book a meeting." Then come the hard prohibitions — the agent is NOT allowed to:

quote a specific price without pulling the price from the price list via API;
promise delivery timelines without pulling stock status via API;
confirm a discount above 5% without escalating to a manager;
answer questions outside the sales scope (claims, technical support, legal questions) — escalate.

If a question is outside the scope, the agent answers "I'll pass this question to a colleague" and creates a task in the CRM.

What this gives you: 70-80% of potential mistakes never happen, because the agent refuses to answer without confirmation from the system. It doesn't invent a price; it asks the API for the real one. It doesn't invent a date; it asks the calendar. This works because current LLMs (Claude Opus 4.7, GPT-5) follow instructions well when the constraints are spelled out clearly.
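The Layer 1 prohibitions can also be enforced in code, not only in the prompt. Below is a minimal sketch under stated assumptions: every name (`PRICE_LIST`, `quote_price`, `check_discount`, `route_topic`) is hypothetical, the in-memory dictionary stands in for the real price-list API, and the topic white-list is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the price-list API; a real system would call the service.
PRICE_LIST = {"basic": 100.0, "pro": 250.0}

MAX_AGENT_DISCOUNT = 0.05  # the 5% limit from the rules above
IN_SCOPE_TOPICS = {"pricing", "availability", "booking"}  # assumed white-list


@dataclass
class Decision:
    action: str  # "answer" or "escalate"
    detail: str
    price: Optional[float] = None


def quote_price(sku: str) -> Decision:
    """Quote only a price that exists in the price list; never invent one."""
    price = PRICE_LIST.get(sku)
    if price is None:
        return Decision("escalate", f"no price-list entry for {sku!r}")
    return Decision("answer", f"price for {sku}", price)


def check_discount(requested: float) -> Decision:
    """Discounts above the agent's limit go to a manager."""
    if requested > MAX_AGENT_DISCOUNT:
        return Decision("escalate", "discount above 5% needs a manager")
    return Decision("answer", f"discount of {requested:.0%} confirmed")


def route_topic(topic: str) -> Decision:
    """Anything outside the white-listed sales scope is handed to a colleague."""
    if topic not in IN_SCOPE_TOPICS:
        return Decision("escalate", "I'll pass this question to a colleague.")
    return Decision("answer", f"topic {topic!r} is in scope")
```

The point of the design is that the model never gets a code path on which it can state a price or approve a large discount by itself: the number either comes from the system of record or the request escalates.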

Layer 2 — LLM supervisor (a second model checks the first)

The second layer is a smaller, faster model that checks the first one's answer before it goes to the customer. At Grow2.ai the architecture looks like this:

The Agent (Claude Opus 4.7 or GPT-5) generates a draft answer.
The Supervisor (Claude Haiku 4.5 or GPT-5-mini) receives the original request + the draft + the rules.
The Supervisor returns a JSON approve/reject with a reason.
If approve=false, the draft is discarded and the agent regenerates or escalates.
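The four steps above can be sketched as a small loop. This is an illustration, not the production code: `generate` and `supervise` are hypothetical callables standing in for the two model calls, the retry budget of 2 is an assumption, and the JSON shape (`approve`, `reason`) follows the description above.

```python
import json
from typing import Callable

MAX_ATTEMPTS = 2  # assumed retry budget before escalating


def parse_verdict(raw: str) -> dict:
    """Parse the supervisor's JSON; treat anything malformed as a rejection."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"approve": False, "reason": "unparseable supervisor output"}
    if not isinstance(verdict, dict) or not isinstance(verdict.get("approve"), bool):
        return {"approve": False, "reason": "missing or invalid 'approve' field"}
    return {"approve": verdict["approve"], "reason": str(verdict.get("reason", ""))}


def answer_with_supervision(
    request: str,
    rules: str,
    generate: Callable[[str], str],
    supervise: Callable[[str, str, str], str],
) -> dict:
    """Generate a draft, have it checked, and regenerate or escalate on reject."""
    reason = ""
    for _ in range(MAX_ATTEMPTS):
        draft = generate(request)
        verdict = parse_verdict(supervise(request, draft, rules))
        if verdict["approve"]:
            return {"status": "sent", "answer": draft}
        reason = verdict["reason"]  # the rejected draft is discarded
    return {"status": "escalated", "reason": reason}
```

Failing closed matters here: if the supervisor returns something that is not valid JSON, the sketch treats it as a rejection and does not let the draft through.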

What the supervisor checks: numbers (price against the price list), dates (a realistic meeting date), tone (brand voice), and promises (the agent didn't promise something the company can't deliver). Cost: the supervisor is a smaller model, adding ~$0.001-0.005 per request. At 10K requests/month that's an extra $10-50. That is far cheaper than one bad incident with a VIP customer.
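The cost estimate is simple multiplication, and it is easy to rerun for your own volume. A small sketch, where the function name is hypothetical and the per-request prices are the ones quoted above (regenerated drafts would add extra supervisor calls, which this ignores):

```python
def monthly_supervisor_cost(requests_per_month: int, cost_per_request: float) -> float:
    """Extra monthly spend on the supervisor, one supervisor call per request."""
    return requests_per_month * cost_per_request


low = monthly_supervisor_cost(10_000, 0.001)   # cheapest case from the text
high = monthly_supervisor_cost(10_000, 0.005)  # most expensive case from the text
print(f"${low:.0f}-${high:.0f} per month")
```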

Layer 3 — Human-in-the-loop (escalation + audit)

The third layer is a guaranteed human control point in two scenarios.

Scenario A: the AI escalates itself. If the confidence score is below the threshold (usually 0.7) or the supervisor returned approve=false, the agent creates a task in the CRM tagged "manual review needed" and hands it to a manager with the context ready.
Scenario B: VIP segment and critical fields. Predefined segments always go through a human. The agent prepares a draft answer, the manager reviews it for 30 seconds, then sends or edits it.
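The two scenarios boil down to one routing decision per draft. A minimal sketch, assuming a confidence score is available for each draft; the `Draft` fields, the `route` function, and the segment names are hypothetical, while the 0.7 threshold is the one mentioned above.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # the usual threshold mentioned above
VIP_SEGMENTS = {"vip", "enterprise"}  # assumed segment names


@dataclass
class Draft:
    text: str
    confidence: float
    supervisor_approved: bool
    customer_segment: str


def route(draft: Draft) -> str:
    """Return 'human_review' or 'send' for a checked draft."""
    if draft.customer_segment in VIP_SEGMENTS:
        # Scenario B: predefined segments always go through a human.
        return "human_review"
    if draft.confidence < CONFIDENCE_THRESHOLD or not draft.supervisor_approved:
        # Scenario A: the agent escalates itself.
        return "human_review"
    return "send"
```

Checking the segment first means a VIP request goes to a manager even when the agent is confident and the supervisor approved the draft.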

Audit: every agent answer is stored with a full log — the original request, the system prompt, the supervisor response, the final decision, and who confirmed it and how. If a customer writes "your bot told me 50%, where's the discount?", we find the full trail in 30 seconds.

What happens when it slips up anyway

Honestly: the agent handles 2-5% of requests suboptimally. Not "making up a price" — that's blocked by Layers 1-2 — but giving a templated answer where the customer expected personalization, or stalling on an unusual request. This isn't an "error" in the engineering sense — it's a drop in quality compared to your best manager. What we do about it: a weekly review for the first two months, a customer feedback loop, A/B testing on contentious fields. This isn't "set it and forget it" — it's an ongoing process.

What the protection does NOT give you

The anti-hype part. None of the three protection layers guarantees:

empathy on an emotional request ("my father died today, I can't come in for the viewing" — the AI will understand the context and escalate, but it isn't a human response);
flexibility on a non-standard offer ("let me pay 6 months upfront for a 30% discount" — that isn't in the prompt, so it escalates);
intuition on "hot" signals (when a customer writes with nuances a human salesperson reads instantly and the AI misses).

An AI agent with three protection levels is a safety net, not magic. It gives you confidence that the basic mistakes are blocked. The hard part is still your team's work.

Frequently asked questions

What happens if the AI agent gives a customer the wrong price?

Technically this shouldn't happen: Layer 1 forbids the agent from inventing a price, so it fetches the price from the pricing API via a function call. The Layer 2 supervisor then checks the quoted price against the price list before the answer is sent. If it happens anyway (an integration bug), you have a full audit log: when the request came in, which price was current, what price the agent quoted, which manager was online. From that you decide whether to honor the quoted price for the customer or to explain the error and point to the correct price.

How often does the AI agent make mistakes, and how do you measure that?

We track three metrics:

Metric 1: Error rate, the % of answers blocked by the supervisor. Normal range: 3-8%.
Metric 2: Escalation rate, the % of requests the agent hands to a human itself. Normal range: 10-20%.
Metric 3: Customer feedback, the number of complaints that "the bot said the wrong thing". Normal range: under 0.5% of all requests.

At Grow2.ai we monitor all three metrics in real time and run a weekly review for the first two months.
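For illustration, here is how the three metrics could be computed from raw counts and compared with the normal ranges quoted above. The names (`Counts`, `rates`, `out_of_range`) are hypothetical, and using total requests as the denominator for all three rates is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    total_requests: int
    supervisor_rejections: int  # answers blocked by the supervisor
    escalations: int            # requests the agent handed to a human itself
    complaints: int             # "the bot said the wrong thing"


# Normal ranges from the text, as fractions (complaints: under 0.5%).
NORMAL_RANGES = {
    "error_rate": (0.03, 0.08),
    "escalation_rate": (0.10, 0.20),
    "complaint_rate": (0.0, 0.005),
}


def rates(c: Counts) -> dict:
    """Turn raw counts into the three rates, all relative to total requests."""
    if c.total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return {
        "error_rate": c.supervisor_rejections / c.total_requests,
        "escalation_rate": c.escalations / c.total_requests,
        "complaint_rate": c.complaints / c.total_requests,
    }


def out_of_range(r: dict) -> list:
    """Names of metrics that fall outside their normal range."""
    return sorted(
        name for name, value in r.items()
        if not NORMAL_RANGES[name][0] <= value <= NORMAL_RANGES[name][1]
    )
```

A value below the range deserves a look too: an error rate near zero may mean the supervisor is approving everything.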

Can every answer from the AI agent be audited?

Yes — it's a mandatory part of the setup. Every answer is stored with: the timestamp, the customer's original request, the system prompt active at the time, the agent response (draft), the supervisor response (approve/reject + reason), the final outcome (sent/escalated), and which manager touched it. The audit log is kept for 12+ months and exports to JSON or CSV.
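To make the audit record concrete, here is a sketch of one log entry with the fields listed above, plus export to JSON and CSV using only the Python standard library. The class and field names are hypothetical; the real schema may differ.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class AuditRecord:
    timestamp: str            # ISO 8601 string
    customer_request: str
    system_prompt: str        # the prompt active at the time
    agent_draft: str
    supervisor_approved: bool
    supervisor_reason: str
    outcome: str              # "sent" or "escalated"
    manager: str              # empty if no manager touched it


def to_json(records: List[AuditRecord]) -> str:
    """Serialize the records as a JSON array."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


def to_csv(records: List[AuditRecord]) -> str:
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AuditRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Storing the system prompt with each record is what makes a later review meaningful: you can see which rules the agent was operating under when it answered.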

What happens during an LLM provider's downtime (OpenAI, Anthropic)?

At Grow2.ai the agent has multi-provider failover: if the primary (Anthropic Claude) doesn't respond within 10 seconds, the agent automatically switches to a secondary (OpenAI GPT-5) with the same prompt. The Anthropic and OpenAI SLAs are 99.9% each. If their outages were fully independent, the combined availability would be about 99.9999% (1 - 0.001 × 0.001); correlated outages make the real figure lower. If both go down at once, the agent enters graceful degradation: all requests land in a queue tagged "manual response required", and managers handle them by hand.
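The failover logic can be sketched in a few lines. This is an illustration under assumptions: each provider is a hypothetical callable that takes the prompt and a timeout in seconds and raises an exception on timeout or error, and the manual queue is a plain list standing in for the real task queue.

```python
from typing import Callable, List, Sequence

# Hypothetical provider signature: (prompt, timeout in seconds) -> answer text.
Provider = Callable[[str, float], str]


def answer_with_failover(
    prompt: str,
    providers: Sequence[Provider],
    manual_queue: List[str],
    timeout_s: float = 10.0,
) -> dict:
    """Try providers in order; if all fail, queue the request for a human."""
    for index, provider in enumerate(providers):
        try:
            answer = provider(prompt, timeout_s)
        except Exception:
            # Timeout or provider error: move on to the next provider.
            continue
        return {"status": "answered", "provider": index, "answer": answer}
    # Graceful degradation: nothing responded, so a manager answers by hand.
    manual_queue.append(prompt)
    return {"status": "queued", "tag": "manual response required"}
```

Both providers receive the same prompt, as described above. In practice the secondary model's answers should be tested against the same rules, since two models can read one prompt differently.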