What it does
The agent shortens the time from alert trigger to the first meaningful action and, with it, MTTM (Mean Time To Mitigate), the metric that determines how long customers actually suffer from an incident. It combines monitoring, runbook orchestration, and on-call communications, turning scattered signals into a single managed process.
What the agent does step by step
- Receives raw signals from observability systems — metrics, logs, traces, health-checks, alertmanager — and merges duplicates into a single incident using correlation keys.
- Classifies the incident by severity (SEV1-SEV4) and domain (DB, API, network, deploy, external vendor) based on historical patterns and pre-defined rules.
- Collects context: recent deploys, feature flag changes, similar past incidents, the list of component owners, and SLO/SLA for the service.
- Routes the alert to the correct communication channel — one, not five. The on-call engineer receives a compact briefing in Slack or PagerDuty instead of a dozen identical pages.
- Selects the appropriate runbook from the library and proposes its execution with a risk assessment for each step.
- On the on-call engineer's command, executes runbook steps with intermediate receipt confirmations — before each mutating action, it shows exactly what will be done and what consequences are expected.
- Documents the incident timeline: who did what, when, and what effect it had. Prepares a postmortem draft with facts, not guesses.
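The first step in the list above, merging duplicate signals by correlation key, can be sketched roughly as follows. This is a minimal illustration, not the agent's actual implementation: the `Incident` class, the `merge_signals` function, and the choice of `(service, alertname)` as the correlation key are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: tuple
    signals: list = field(default_factory=list)


def correlation_key(signal: dict) -> tuple:
    # Hypothetical key: signals that share a service and an alert name
    # are assumed to describe the same incident.
    return (signal.get("service", "unknown"), signal.get("alertname", "unknown"))


def merge_signals(signals: list) -> list:
    """Group raw signals into incidents, one incident per correlation key."""
    incidents = {}
    for signal in signals:
        key = correlation_key(signal)
        if key not in incidents:
            incidents[key] = Incident(key=key)
        incidents[key].signals.append(signal)
    # Dicts preserve insertion order, so incidents come back in first-seen order.
    return list(incidents.values())
```

A real correlation key would likely include more dimensions (cluster, region, time window); the point is only that several pages about the same problem collapse into one incident record.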
What the agent does not do
- Does not make decisions on rollback, failover, or drain without explicit confirmation from the on-call engineer — every irreversible action requires a receipt, which is why the pilot recorded zero erroneous rollbacks.
- Does not replace the on-call rotation or remove responsibility from the team: it speeds the engineer up rather than replacing them.
- Does not guess the causes of incidents for which there is no data in the runbook library or historical records. New classes of failures are escalated to humans, and gaps in runbooks are highlighted in the post-incident report.
How it works
Under the hood, the system is an orchestrator agent built on an agent framework, with an LLM as the reasoning layer, connected to the observability stack, the communications system, and the runbook library. The key principle is that every action with side effects goes through the receipt mechanism: the agent formulates its intent, shows it to a human, and waits for confirmation.
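The receipt mechanism can be pictured as a small gate in front of every side effect. The sketch below is illustrative only: `Intent`, `execute_with_receipt`, and the `confirm` and `perform` callbacks are hypothetical names, and in a real deployment `confirm` would be a Slack or PagerDuty dialog rather than a function call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Intent:
    action: str           # what the agent wants to do
    target: str           # what it wants to do it to
    expected_effect: str  # consequences shown to the human


def execute_with_receipt(
    intent: Intent,
    confirm: Callable[[Intent], bool],
    perform: Callable[[Intent], None],
) -> str:
    """Run `perform` only after the human has confirmed the stated intent."""
    if not confirm(intent):
        # No receipt, no side effect.
        return "rejected"
    perform(intent)
    return "executed"
```

Keeping the intent as an immutable value means that what the human approved and what gets executed are the same object.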
Incident processing flow
An alert enters the agent queue via a webhook from alertmanager, PagerDuty, or DataDog. The agent normalizes the format, checks against open incidents (to detect duplicates), and enriches the context from the monitoring API and CMDB. Next, the LLM layer classifies the incident and selects a runbook — this is a separate structured-output call with validation against a JSON schema. The orchestrator runs the runbook as a graph of steps: each step is either read-only (metrics query, log search) or mutating (restart pod, flip feature-flag, rollback deploy). Mutating steps require a receipt from the on-call engineer.
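To make the validation step concrete, here is a minimal sketch of checking the classifier's structured output before the orchestrator acts on it. It uses plain Python checks as a stand-in for a real JSON Schema validator, and the field names (`severity`, `domain`, `runbook_id`) are assumptions for the example; only the SEV1-SEV4 levels and the domain list come from the description above.

```python
import json

SEVERITIES = {"SEV1", "SEV2", "SEV3", "SEV4"}
DOMAINS = {"DB", "API", "network", "deploy", "external vendor"}


def _require_choice(data: dict, name: str, allowed: set) -> None:
    value = data.get(name)
    if not isinstance(value, str) or value not in allowed:
        raise ValueError(f"invalid {name}: {value!r}")


def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and reject anything outside the allowed values."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("classification must be a JSON object")
    _require_choice(data, "severity", SEVERITIES)
    _require_choice(data, "domain", DOMAINS)
    runbook_id = data.get("runbook_id")  # hypothetical field name
    if not isinstance(runbook_id, str) or not runbook_id:
        raise ValueError("runbook_id must be a non-empty string")
    return data
```

Because every failure surfaces as `ValueError`, the caller can treat a malformed or out-of-range reply the same way: retry the call or escalate to a human instead of running a runbook on bad input.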
Implementation steps
- Inventory — collect a list of runbooks (even if they live in Confluence, in a senior engineer's head, or in gists), and catalog them by component and severity.
- Runbook normalization — convert to a machine-readable format: YAML, Markdown with frontmatter, or DSL. Each step is tagged as read-only or mutating, with an explicit rollback action.
- Connecting observability — configure outgoing webhooks from alertmanager/PagerDuty/DataDog to the agent, and map alert labels to domain classification.
- Communications integration — a Slack bot for briefings and receipt dialogs, threading by incident ID, channel routing by the responsible team.
- LLM pipeline setup — classifier, runbook selector, briefing generator. Each call uses structured output with a strict JSON schema.
- Pilot on 1-2 services — first in shadow mode (the agent suggests but does not act), then with manual approval for everything, then with auto-approve on read-only steps.
- Expand to other teams — as MTTM metrics stabilize and on-call trust grows.
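Two of the steps above, tagging runbook steps as read-only or mutating and rolling out through shadow, manual-approval, and auto-approve stages, combine into a simple approval policy. The sketch below is one possible shape for it; the `Mode` and `Step` types and the `decide` function are hypothetical names, not part of any specific framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    SHADOW = "shadow"                  # agent suggests but does not act
    MANUAL = "manual"                  # every step needs approval
    AUTO_READ_ONLY = "auto_read_only"  # read-only steps run without approval


@dataclass(frozen=True)
class Step:
    name: str
    mutating: bool
    rollback: Optional[str] = None  # explicit rollback action for mutating steps


def decide(step: Step, mode: Mode) -> str:
    """Return how the orchestrator should treat a step in the given rollout mode."""
    if mode is Mode.SHADOW:
        return "suggest_only"
    if step.mutating or mode is Mode.MANUAL:
        return "needs_receipt"
    return "auto_run"
```

Note that a mutating step never reaches `auto_run` in any mode, which matches the rule that mutating actions always require a receipt.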
System components
| Component | Role |
|---|---|
| Alert ingester | Normalization of webhooks from monitoring, deduplication by correlation keys |
| Classifier | LLM classification of severity and domain with structured output |
| Runbook store | Runbook library in YAML/Markdown with versioning |
| Orchestrator | Step-by-step runbook execution, receipt mechanics on mutating steps |
| Communications adapter | Briefings, receipt dialogs, threading in Slack |
| Audit log | Timeline of all agent and human actions, input to postmortem |
The runbook store is a critical element: if runbooks are missing or outdated, the agent has nothing to execute. The first weeks of implementation are spent specifically on building team discipline around writing them. The audit log is the second critical element: without it, the receipt mechanism loses its meaning, because it becomes impossible to reconstruct who confirmed what.
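As a rough illustration of why the audit log matters for receipts, the sketch below records each action with its actor and timestamp so that "who confirmed what" can be answered later. The `AuditLog` and `AuditEntry` classes and the action labels (`proposed`, `confirmed`) are assumptions for the example; a production log would be append-only and persisted, not held in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    at: datetime
    actor: str   # "agent" or a human's name
    action: str  # hypothetical labels such as "proposed", "confirmed"
    detail: str


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, actor, action, detail, at=None):
        """Append one entry; the timestamp defaults to the current UTC time."""
        entry = AuditEntry(at or datetime.now(timezone.utc), actor, action, detail)
        self.entries.append(entry)
        return entry

    def confirmations_by(self, actor):
        """Answer the question: what did this person confirm?"""
        return [e for e in self.entries if e.actor == actor and e.action == "confirmed"]

    def timeline(self):
        """Render entries in chronological order, as input for a postmortem draft."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return [f"{e.at.isoformat()} {e.actor} {e.action}: {e.detail}" for e in ordered]
```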
The agent runs in a reasoning → action → receipt → observation loop until either a resolved state is reached (metrics return to normal) or escalation occurs (a human takes control, the agent shifts to an assistant role and documents the on-call engineer's actions).
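The loop can be sketched in a few lines. This is a simplified illustration under several assumptions: the reasoning phase is represented by an already-selected list of steps, steps are plain dicts with `name` and `mutating` keys, and a declined receipt is treated as escalation. The function name and the `max_steps` safety cap are hypothetical.

```python
from typing import Callable, Iterable


def run_incident_loop(
    steps: Iterable[dict],
    confirm: Callable[[dict], bool],
    execute: Callable[[dict], None],
    is_resolved: Callable[[], bool],
    max_steps: int = 20,
) -> str:
    """Walk the runbook until the incident resolves or a human has to take over."""
    for count, step in enumerate(steps):
        if count >= max_steps:
            break  # safety cap so the loop cannot run unbounded
        # Receipt: mutating steps need explicit confirmation first.
        if step["mutating"] and not confirm(step):
            return "escalated"  # assumption: a declined receipt hands control to the human
        execute(step)           # action
        if is_resolved():       # observation: have metrics returned to normal?
            return "resolved"
    # Runbook exhausted without recovery: escalate rather than guess.
    return "escalated"
```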
Prerequisites
Implementation requires a baseline level of process maturity — without it, the agent has nothing to rely on.
Data and access
- An observability stack with webhook-based alert delivery (Prometheus + alertmanager, DataDog, New Relic, Grafana, PagerDuty — any modern one).
- At least 5-10 written runbooks for the most common incident classes. They can be in Confluence, Notion, or git — the main thing is that they exist.
- API access to infrastructure systems for mutating actions (kubectl, Terraform Cloud, feature-flag platform, CI/CD).
- An incident communications channel (Slack or Teams) with bot permissions to post, read threads, and create channels.
- A 3-6 month history of past incidents for classifier calibration.
Team readiness
- A designated owner from SRE/DevOps who is responsible for the runbook library and keeping it current.
- A blameless postmortem culture — otherwise an agent that documents everything will meet resistance.
- On-call staff prepared for the new workflow, with receipt confirmations instead of direct console actions.
- An understanding that for the first 2-4 weeks the agent will operate in shadow mode without real actions — this is not a failure, but calibration of the classifier and runbook selector.
Timeline
An average project takes 6-10 weeks from kick-off to productive use across several services. The first two weeks go to inventory and normalization of runbooks. Weeks three through five cover integrations with observability and communications, plus a pilot in shadow mode. Weeks six through ten are for scope expansion and configuring auto-approve for safe read-only steps.
Pain points
- Knowledge in heads, not in documents
- Constant context switching
- Slow customer response
FAQ
How long does implementation take?
6–10 weeks for a typical SRE team. The first 2 weeks go to runbook inventory and normalization; weeks 3–5 cover observability and communications integrations plus a pilot in shadow mode. Weeks 6–10 expand scope to additional services and gradually enable auto-approve on read-only steps. Pace depends heavily on whether the team has written runbooks at the start or has to build them from scratch.
What to do if we have no written runbooks?
This is the most common obstacle for SMB teams. The first 2–3 weeks of implementation turn into disciplined runbook writing together with senior engineers — during this time the agent helps extract procedures from their heads through structured interviews and incident history analysis. Without this work, moving forward is pointless: the agent has nothing to rely on, the classifier operates blind, and ROI does not materialize.
What are the risks and what can break?
The main risk is false positives from the classifier on rare incident classes. The mitigation is the receipt mechanism: mutating actions require on-call confirmation, and irreversible operations (rollback, drain, failover) always require explicit approval. Zero erroneous rollbacks were recorded in the pilot. The second risk is degradation of the runbook library over time. An SRE-side owner is essential to keep runbooks from going stale and misleading the agent.
Is the solution right for our industry?
The solution fits best in SaaS/Tech companies with an observability stack and an on-call rotation. It also works in any company with production services, on-call engineers, and alerts, regardless of industry. For teams with fewer than 5 services and infrequent incidents (fewer than 10 per month), the ROI is weaker than in companies with a regular incident load, where MTTM directly impacts SLA and revenue.
Can it be implemented without replacing the current PagerDuty or alertmanager?
Yes. The agent connects on top of the existing stack via webhooks and API. It does not replace monitoring and alerting, but extends them with a layer of classification, context enrichment, and runbook orchestration. PagerDuty continues to escalate along the on-call rotation, alertmanager continues to deduplicate at the source level, and the agent takes on triage, on-call briefing, and runbook execution on command.
What happens to incidents the agent cannot handle?
In such cases the agent escalates to the on-call engineer and shifts to the role of assistant: it gathers context, documents the human's actions, searches for similar incidents in history, and suggests steps by analogy. New failure classes are material for expanding the runbook library; the agent itself highlights such gaps to the owner in the post-incident report, and they become the next candidates for automation.