What it does
The AI agent works alongside the on-call engineer: it reads alerts from Slack and the observability stack, collects diagnostic context, and prepares a pull request with a fix. It does not replace the on-call engineer. It responds to an incident first, so that by the time of escalation the context is already gathered and, in known cases, a fix is already proposed. According to reference deployment data, in production mode this saves the team 675 hours per month, and the agent prepares 28 PRs per month without human involvement; each one still goes through code review before it is merged.
What the agent does
- Listens to the on-call channel and monitoring webhooks — catches a new alert in seconds, not after the engineer opens the notification.
- Extracts the stack trace, metrics, links to related dashboards, and recent deploys to build the full picture.
- Searches for similar incidents in the history of Slack threads and runbooks — surfacing knowledge that typically lives only in the heads of experienced engineers.
- Formulates a hypothesis about the incident cause and posts it to the thread as the first message, with a confidence level indicated.
- If the incident matches a known pattern, opens a pull request with a fix and assigns reviewers.
- Attaches evidence to the PR: logs, trace, links to similar cases, diff against previous fixes.
- Stays in the thread and responds to the on-call engineer's follow-up questions until the incident is closed — a single source of truth instead of manually copying context.
- After resolution, writes a short postmortem draft and records the new pattern for future incidents — the knowledge base is updated automatically.
The on-call engineer switches context less often: instead of following the chain "alert → metrics → code → Slack → repository", they read a ready-made summary and make a decision. According to reference deployment data, 66% of the agent's suggestions receive positive feedback, and one interaction costs $0.30.
What the agent does NOT do
- Does not merge a pull request without human approval — all changes go through standard code review and CI.
- Does not handle incidents for which there is no documented runbook or similar previous case — escalates to the on-call engineer with context already gathered.
- Does not make architectural decisions, does not refactor components, and does not touch code outside the permitted services — only targeted fixes for known patterns.
How it works
The agent is built on a multi-step orchestration pattern: an LLM drives the cycle "observe → hypothesize → act → verify" until it finds a solution or decides to escalate. The core is a language model with tool use via an agent framework.
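The cycle can be sketched in a few lines. This is a minimal illustration rather than the production implementation: `run_incident_loop`, the four callables, the step budget, and the 0.7 confidence threshold are all assumptions made for the example, and the stubs at the bottom stand in for real LLM and tool calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Hypothesis:
    cause: str
    confidence: float  # 0.0 to 1.0, as estimated by the model


@dataclass
class LoopResult:
    outcome: str  # "resolved" or "escalated"
    steps: int
    history: List[Hypothesis] = field(default_factory=list)


def run_incident_loop(
    observe: Callable[[List[Hypothesis]], Any],
    hypothesize: Callable[[Any], Hypothesis],
    act: Callable[[Hypothesis], Any],
    verify: Callable[[Any], bool],
    max_steps: int = 5,            # assumed step budget
    min_confidence: float = 0.7,   # assumed threshold
) -> LoopResult:
    """Repeat observe -> hypothesize -> act -> verify until resolved or out of budget."""
    history: List[Hypothesis] = []
    for step in range(1, max_steps + 1):
        evidence = observe(history)
        hypothesis = hypothesize(evidence)
        history.append(hypothesis)
        if hypothesis.confidence < min_confidence:
            continue  # not confident enough to act: gather more evidence
        action = act(hypothesis)
        if verify(action):
            return LoopResult("resolved", step, history)
    # Budget exhausted without a verified fix: hand over to the on-call engineer.
    return LoopResult("escalated", max_steps, history)


# Hypothetical stubs standing in for LLM and tool calls.
def observe(history):
    return {"round": len(history) + 1}

def hypothesize(evidence):
    confidence = 0.4 if evidence["round"] == 1 else 0.9
    return Hypothesis("release Y raised latency on service X", confidence)

def act(hypothesis):
    return "rollback release Y"

def verify(action):
    return action.startswith("rollback")

result = run_incident_loop(observe, hypothesize, act, verify)
print(result.outcome, result.steps)  # prints: resolved 2
```

The step budget is what turns "decides to escalate" into a guaranteed exit: the loop cannot keep retrying indefinitely.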
Architecture
The agent operates across three integration layers, each with its own tool calls:
| Layer | What it gives the agent | Operation examples |
|---|---|---|
| Observability / monitoring | Signal and metrics | Reading alerts, pulling metrics by instance/service, exporting stack traces |
| Code repository | Code and change history | Finding a file by error, viewing recent commits, creating a branch and PR |
| Communications | Team context | Reading Slack threads on the incident, posting a response, mentioning the on-call engineer |
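One way to picture the three layers is as a tool registry that the model calls by layer and name. The sketch below is illustrative only: the `tool` decorator, `call_tool`, and the three example tools are hypothetical names, and the tool bodies return canned data instead of calling a real monitoring, Git, or Slack API.

```python
from typing import Callable, Dict

# One namespace per integration layer from the table above.
TOOLS: Dict[str, Dict[str, Callable[..., object]]] = {
    "observability": {},
    "repository": {},
    "communications": {},
}


def tool(layer: str, name: str):
    """Register a function as a tool in the given layer."""
    def register(fn):
        TOOLS[layer][name] = fn
        return fn
    return register


@tool("observability", "read_alerts")
def read_alerts(service: str) -> list:
    # Canned data; a real tool would query the monitoring system.
    return [{"service": service, "severity": "high", "metric": "latency_p99"}]


@tool("repository", "recent_commits")
def recent_commits(service: str, limit: int = 3) -> list:
    # Canned data; a real tool would read the commit history.
    return [f"{service}: commit {i}" for i in range(1, limit + 1)]


@tool("communications", "post_to_thread")
def post_to_thread(thread_id: str, text: str) -> str:
    # Canned data; a real tool would post to the incident thread.
    return f"[{thread_id}] {text}"


def call_tool(layer: str, name: str, **kwargs):
    """Dispatch a tool call, rejecting anything that is not registered."""
    try:
        fn = TOOLS[layer][name]
    except KeyError:
        raise ValueError(f"unknown tool: {layer}.{name}") from None
    return fn(**kwargs)


print(call_tool("repository", "recent_commits", service="checkout", limit=2))
```

Routing every call through one dispatcher gives a single place to log calls and to refuse tools that were never registered.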
Incident handling flow
- Triggering event. An alert from the observability system lands in the on-call Slack channel. The webhook passes the event to the agent with a payload: severity, service, metric.
- Context gathering. The agent makes a series of tool calls: reads the latest log lines, the metric chart for the past 24 hours, and deploy history for the last 6 hours.
- Pattern search. The agent uses vector search across the Slack incident history and runbooks to find similar cases with their resolutions.
- Hypothesis. The LLM formulates a hypothesis of the form "elevated latency on service X is caused by release Y: rollback or hotfix Z" with a confidence estimate.
- Diagnosis post. The agent posts the first message to the thread: summary, hypothesis, links to evidence. The on-call engineer sees a summary, not raw logs.
- Remediation path. If the pattern is known and confidence is high, the agent creates a branch, applies a fix from the template, opens a PR with a description, and assigns reviewers. If not, it stops and asks the on-call engineer to confirm the direction.
- Human-in-the-loop. The on-call engineer reviews the PR, approves it or requests changes. The agent responds to comments: adds logs, revises the fix, explains the choice.
- Post-mortem draft. After the incident, the agent compiles a timeline — what happened, what was done, how long it took — and posts the draft to the channel for editing.
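The first and sixth steps of this flow (receiving the webhook payload and choosing the remediation path) can be sketched as follows. The field names come from the payload described above; the JSON encoding, the pattern identifiers, and the 0.8 threshold for "high confidence" are assumptions made for this example.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("severity", "service", "metric")

# Assumed identifiers for patterns the team has approved for auto-remediation.
KNOWN_PATTERNS = {"rollback_deploy", "feature_flag_toggle", "limit_bump"}
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for "confidence is high"


@dataclass(frozen=True)
class Alert:
    severity: str
    service: str
    metric: str


def parse_alert(raw: str) -> Alert:
    """Validate the webhook payload and turn it into a typed alert."""
    data = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"alert payload is missing fields: {missing}")
    return Alert(data["severity"], data["service"], data["metric"])


def choose_path(pattern: str, confidence: float) -> str:
    """Open a PR only for a known pattern with high confidence; otherwise ask."""
    if pattern in KNOWN_PATTERNS and confidence >= CONFIDENCE_THRESHOLD:
        return "open_pr"
    return "ask_on_call"


alert = parse_alert('{"severity": "high", "service": "checkout", "metric": "latency_p99"}')
print(alert.service, choose_path("rollback_deploy", 0.92))  # prints: checkout open_pr
print(choose_path("rollback_deploy", 0.55))                 # prints: ask_on_call
print(choose_path("schema_migration", 0.95))                # prints: ask_on_call
```

Both conditions must hold before the agent writes anything: a confident hypothesis about an unknown pattern still goes to the on-call engineer.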
How it is deployed on a project
- Connecting observability: a webhook from Datadog, Grafana, New Relic, Sentry, or Prometheus Alertmanager to the agent service.
- Repository integration: a GitHub App or GitLab access token with permissions to create branches, open PRs, and read commit history.
- Installing the Slack bot in the on-call channel: reading events, posting responses, threading.
- Importing historical incidents: parsing Slack threads and existing runbooks into a vector index — the core knowledge base of the agent.
- Defining auto-remediation patterns: a list of incident types where the agent is permitted to open a PR (rollback deploy, changing a feature flag, bumping limits).
- Guardrails: a list of services and repositories where the agent only reads, and a separate list where it can write.
- Pilot: one week in "agent writes diagnostics only, no PRs" mode. The team evaluates hypothesis quality.
- Expansion: after stable positive feedback, auto-remediation patterns are enabled one by one.
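The guardrails, pilot, and expansion steps above amount to a small permission model. The sketch below is one possible shape for it, not the actual configuration format: the `Guardrails` class, its field names, and the repository and pattern names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Guardrails:
    read_only: frozenset = frozenset()         # repositories the agent may only read
    writable: frozenset = frozenset()          # repositories where it may open PRs
    enabled_patterns: frozenset = frozenset()  # empty set = diagnostics-only pilot

    def can_read(self, repo: str) -> bool:
        return repo in self.read_only or repo in self.writable

    def can_open_pr(self, repo: str, pattern: str) -> bool:
        # Writing requires both an approved repository and an enabled pattern.
        return repo in self.writable and pattern in self.enabled_patterns

    def enable(self, pattern: str) -> "Guardrails":
        """Return a copy with one more auto-remediation pattern switched on."""
        return replace(self, enabled_patterns=self.enabled_patterns | {pattern})


# Pilot: the agent can read both repositories but cannot open any PR.
pilot = Guardrails(read_only=frozenset({"billing"}), writable=frozenset({"checkout"}))
print(pilot.can_open_pr("checkout", "rollback_deploy"))     # prints: False

# Expansion: patterns are enabled one by one.
expanded = pilot.enable("rollback_deploy")
print(expanded.can_open_pr("checkout", "rollback_deploy"))  # prints: True
print(expanded.can_open_pr("billing", "rollback_deploy"))   # prints: False
```

Keeping the object immutable means each expansion step produces a new, reviewable configuration instead of mutating the live one.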
Where the value lies
The agent turns three pairs of hands into a single first responder who is always online. According to reference deployment data, the agent prepares 28 PRs per month without human involvement. These are low-risk fixes that previously consumed senior engineers' time and pulled them away from their current work.
Prerequisites
To launch an on-call AI agent, a team needs to be ready in three areas: access, historical data, and operational process. Without them, the pilot turns into debugging integrations instead of working on real incidents.
Access and integrations
- Observability stack with webhooks: Datadog, Grafana, New Relic, Sentry, or Prometheus Alertmanager.
- Git repository with configured CI and code review (GitHub, GitLab, Bitbucket).
- Slack or equivalent with an on-call channel and bot installation rights.
- Technical agreement: read-only for most repositories, write (create branch + open PR) for the approved list.
Historical data
- Slack incident threads for the past 6–12 months: the more history there is, the more accurate the pattern matching.
- Runbooks in any format (Confluence, Notion, markdown in the repository).
- A list of known auto-remediation patterns: which incident types the team is ready to delegate to the agent (rollback, feature-flag toggle, limit bump).
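To show why the volume of history matters, here is a toy version of pattern matching over past threads. It is a sketch under loud assumptions: a bag-of-words vector stands in for the embedding model a real vector index would use, and the thread IDs and texts are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real index would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar(query: str, incidents: dict, top_k: int = 2) -> list:
    """Return up to top_k (incident_id, score) pairs, most similar first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), incident_id) for incident_id, text in incidents.items()]
    scored.sort(reverse=True)
    return [(incident_id, round(score, 2)) for score, incident_id in scored[:top_k] if score > 0]


# Invented history, standing in for indexed Slack threads and runbooks.
HISTORY = {
    "thread-1": "checkout latency spike after deploy, fixed by rollback",
    "thread-2": "disk full on logging node, fixed by raising limits",
    "thread-3": "payment errors after feature flag change",
}

matches = find_similar("latency spike on checkout after latest deploy", HISTORY)
print(matches[0][0])  # prints: thread-1
```

With only a handful of threads, weak matches such as a shared "after" still surface in the results; a larger history gives the ranking more genuinely similar cases to choose from.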
Team readiness
- On-call rotation is set up: an on-call engineer and an escalation process are in place.
- Code review is required for all PRs — the agent does not merge on its own.
- An owner is assigned: a senior SRE or tech lead who validates patterns and reviews false positives in the first weeks.
Implementation timeline
Complexity: medium. Full launch from contract to production takes 6–10 weeks:
- Weeks 1–2: integrations, access setup, incident history indexing.
- Weeks 3–5: pilot in diagnostic mode, pattern configuration.
- Weeks 6–8: enabling auto-remediation for one pattern, calibration.
- Weeks 9–10: handover to the team and owner playbook.
Pain points
- Knowledge in heads, not in documents
- Constant context switching
- Slow customer response
FAQ
How long does implementation take?
Full launch takes 6–10 weeks. The first 2 weeks go to integrations with observability, the repository, and Slack, plus indexing the incident history. Weeks 3–5 are a pilot in "diagnostics-only" mode, where the team calibrates hypothesis quality. Weeks 6–8 cover enabling auto-remediation for one pattern, and weeks 9–10 cover handover to the owner. The diagnostics part can be launched faster if incidents are well documented in Slack threads.
We don't have up-to-date runbooks — will the agent work?
Partially. The agent compensates for the absence of runbooks with Slack thread history: if the team discusses incidents in channels, that data is sufficient for pattern matching. In the first weeks, the agent escalates more often instead of auto-remediating, but it builds up the knowledge base as it goes. After 1–2 months of operation, a structured incident index emerges, and the conversation history automatically becomes a runbook equivalent.
What are the risks and what can go wrong?
The main risk is false hypotheses that lead the on-call engineer in the wrong direction. That is why the agent shows a confidence level and evidence, and auto-remediation is only enabled for patterns with a success history. The second risk is a PR with an incorrect fix, but code review and CI stop such changes. The agent does not merge on its own and does not touch code outside permitted services.
Is automation suitable for our industry?
The primary profile is SaaS and Tech, where an observability stack and on-call rotation are in place. It also fits e-commerce, fintech, gaming — anywhere production requires on-call coverage. Not suitable for teams without monitoring or without a code review process. Industry-specific requirements are built into auto-remediation patterns: compliance checks matter for fintech, rollback speed for gaming.
Will the agent replace the on-call engineer?
No. The agent is a first responder, not a replacement. It gathers context, proposes a hypothesis, and in simple cases opens a PR, but decisions remain with the human. The reference implementation shows 66% positive feedback and 28 PRs per month prepared by the agent without human intervention. These are low-risk fixes that previously took up senior engineers' time. Complex incidents are escalated by the agent with context already gathered.
Is it possible to run only the diagnostics part without auto-remediation?
Yes, this is the standard starting point. In diagnostics mode, the agent writes a summary, a hypothesis, and links to evidence, but does not open a PR. This addresses the main pain point — context switching and searching for similar incidents — without the risk of interference with code. Auto-remediation is enabled as a separate step, after 1–2 months of piloting, once the team sees stable hypothesis quality.
What model does the agent run on?
The core is an LLM with tool use via an agent framework. The model manages the "observe → hypothesize → act → verify" cycle and makes calls to observability, the repository, and Slack. The choice of model is driven by code-reasoning quality and long-context stability, because stack traces, logs, and diffs have to fit within a single context window. Grow2.ai is responsible for prompt engineering, tool patterns, and agent behavior monitoring.