#56 · IT / DevOps

On-call AI agent: diagnostics + auto-remediation PR

On-call AI agent: diagnostics + auto-remediation PR automates incident response in the IT / DevOps / SRE department and saves 675 engineering hours per month. The AI agent connects to the observability stack, the codebase, and on-call Slack channels, collects context when an alert fires, and takes the incident from hypothesis to a pull request with a fix. For a team of 60 engineers and 30 channels, the system processes 4,200 successful flows per month, receives 66% positive feedback, and closes 28 PRs without human involvement. The cost of a single diagnostic is $0.30. Automation addresses three common pain points for DevOps teams: knowledge is scattered across on-call engineers' heads, engineers constantly context-switch between alerts, logs, and code, and customers are slow to learn incident status. Grow2.ai deploys the agent on an AI model with integrations into the repository, monitoring, and Slack; full launch takes 6–10 weeks.

Expected effect
675 h/month · Engineering time saved
Complexity
Month (2-4 weeks)
Tool type
Agent framework
ROI
Time saved
Industries
SaaS / Tech, Other / Horizontal
Integrations
Observability / monitoring, Code repository, Communications
Patterns
Multi-Step Orchestration, Monitoring and Alerting, Extraction from Unstructured

What it does

The AI agent works alongside the on-call engineer: reads alerts from Slack and the observability stack, collects diagnostic context, and prepares a pull request with a fix. It does not replace the on-call engineer — it responds to an incident first, so that by the time of escalation the context is already gathered and, in known cases, a fix is already proposed. In production mode, this saves the team 675 hours per month and closes 28 PRs without human involvement.

What the agent does

  1. Listens to the on-call channel and monitoring webhooks — catches a new alert in seconds, not after the engineer opens the notification.
  2. Extracts the stack trace, metrics, links to related dashboards, and recent deploys to build the full picture.
  3. Searches for similar incidents in the history of Slack threads and runbooks — surfacing knowledge that typically lives only in the heads of experienced engineers.
  4. Formulates a hypothesis about the incident cause and posts it to the thread as the first message, with a confidence level indicated.
  5. If the incident matches a known pattern — opens a pull request with a fix and assigns reviewers.
  6. Attaches evidence to the PR: logs, trace, links to similar cases, diff against previous fixes.
  7. Stays in the thread and responds to the on-call engineer's follow-up questions until the incident is closed — a single source of truth instead of manually copying context.
  8. After resolution, writes a short postmortem draft and records the new pattern for future incidents — the knowledge base is updated automatically.

The on-call engineer switches context less often: instead of working through the chain "alert → metrics → code → Slack → repository", they read a ready-made summary and make a decision. According to reference deployment data, 66% of the agent's suggestions receive positive feedback, and the cost of one interaction is $0.30.

What the agent does NOT do

  • Does not merge a pull request without human approval — all changes go through standard code review and CI.
  • Does not handle incidents for which there is no documented runbook or similar previous case — escalates to the on-call engineer with context already gathered.
  • Does not make architectural decisions, does not refactor components, and does not touch code outside the permitted services — only targeted fixes for known patterns.

How it works

The agent is built on a multi-step orchestration pattern: an LLM drives the "observe → hypothesize → act → verify" cycle until it finds a solution or decides to escalate. The core is a language model with tool use via an agent framework.
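The cycle can be pictured as a small control loop. The following is a minimal sketch only, not the actual agent framework: the function names, the `LoopResult` type, and the `max_iterations` cap are assumptions made for illustration, and the four callables stand in for LLM and tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    outcome: str  # "resolved" or "escalated"
    steps: List[str] = field(default_factory=list)


def run_incident_loop(
    observe: Callable[[], dict],
    hypothesize: Callable[[dict], Optional[str]],
    act: Callable[[str], None],
    verify: Callable[[], bool],
    max_iterations: int = 3,  # hypothetical cap, not a documented setting
) -> LoopResult:
    """Repeat observe -> hypothesize -> act -> verify until verified or escalated."""
    steps: List[str] = []
    for _ in range(max_iterations):
        context = observe()
        steps.append("observe")
        hypothesis = hypothesize(context)
        steps.append("hypothesize")
        if hypothesis is None:
            # The model has no usable hypothesis: hand over to the on-call engineer.
            return LoopResult("escalated", steps)
        act(hypothesis)
        steps.append("act")
        steps.append("verify")
        if verify():
            return LoopResult("resolved", steps)
    # Iteration budget exhausted without a verified fix.
    return LoopResult("escalated", steps)


if __name__ == "__main__":
    result = run_incident_loop(
        observe=lambda: {"service": "checkout"},
        hypothesize=lambda ctx: f"recent release broke {ctx['service']}",
        act=lambda hypothesis: None,  # e.g. open a PR; stubbed here
        verify=lambda: True,
    )
    print(result.outcome, result.steps)
```

In this sketch, "act" covers only actions the page describes, such as posting a diagnosis or opening a pull request; merging stays with the human reviewer.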

Architecture

The agent operates across three integration layers, each with its own tool calls:

| Layer | What it gives the agent | Operation examples |
| --- | --- | --- |
| Observability / monitoring | Signal and metrics | Reading alerts, pulling metrics by instance/service, exporting stack traces |
| Code repository | Code and change history | Finding a file by error, viewing recent commits, creating a branch and PR |
| Communications | Team context | Reading Slack threads on the incident, posting a response, mentioning the on-call engineer |

Incident handling flow

  1. Triggering event. An alert from the observability system lands in the on-call Slack channel. The webhook passes the event to the agent with a payload: severity, service, metric.
  2. Context gathering. The agent makes a series of tool calls: reads the latest log lines, the metric chart for the past 24 hours, and deploy history for the last 6 hours.
  3. Pattern search. The agent uses vector search across the Slack incident history and runbooks to find similar cases with their resolutions.
  4. Hypothesis. The LLM formulates a hypothesis of the form "elevated latency on service X is caused by release Y; rollback or hotfix Z" with a confidence estimate.
  5. Diagnosis post. The agent posts the first message to the thread: summary, hypothesis, links to evidence. The on-call engineer sees a summary, not raw logs.
  6. Remediation path. If the pattern is known and confidence is high — the agent creates a branch, applies a fix from the template, opens a PR with a description, and assigns reviewers. If not — it stops and asks the on-call engineer to confirm the direction.
  7. Human-in-the-loop. The on-call engineer reviews the PR, approves it or requests changes. The agent responds to comments: adds logs, revises the fix, explains the choice.
  8. Post-mortem draft. After the incident, the agent compiles a timeline — what happened, what was done, how long it took — and posts the draft to the channel for editing.
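Steps 1 and 6 of the flow above can be sketched as follows. This is an illustration under stated assumptions: the payload fields (severity, service, metric) and the pattern types come from the text, while the identifiers, the `CONFIDENCE_THRESHOLD` value, and the return labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Pattern types named on this page; the identifiers themselves are hypothetical.
KNOWN_PATTERNS = {"rollback_deploy", "feature_flag_toggle", "limit_bump"}
CONFIDENCE_THRESHOLD = 0.8  # assumed value; "high confidence" is not quantified in the text


@dataclass(frozen=True)
class Alert:
    severity: str
    service: str
    metric: str


def parse_alert(payload: dict) -> Alert:
    """Validate the webhook payload fields listed in step 1."""
    missing = [key for key in ("severity", "service", "metric") if key not in payload]
    if missing:
        raise ValueError(f"alert payload is missing fields: {missing}")
    return Alert(payload["severity"], payload["service"], payload["metric"])


def choose_path(pattern: Optional[str], confidence: float) -> str:
    """Step 6: open a PR only for a known pattern with high confidence."""
    if pattern in KNOWN_PATTERNS and confidence >= CONFIDENCE_THRESHOLD:
        return "open_pr"
    return "ask_on_call"


if __name__ == "__main__":
    alert = parse_alert({"severity": "critical", "service": "checkout", "metric": "p99_latency"})
    print(alert, choose_path("rollback_deploy", 0.92))
```

Keeping the gate as a plain function, separate from the model's output, means the decision to write to a repository can be audited and tested without calling the LLM.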

How it is deployed on a project

  1. Connecting observability: a webhook from Datadog, Grafana, New Relic, Sentry, or Prometheus Alertmanager to the agent service.
  2. Repository integration: a GitHub App or GitLab access token with permissions to create branch, open PR, read commit history.
  3. Installing the Slack bot in the on-call channel: reading events, posting responses, threading.
  4. Importing historical incidents: parsing Slack threads and existing runbooks into a vector index — the core knowledge base of the agent.
  5. Defining auto-remediation patterns: a list of incident types where the agent is permitted to open a PR (rollback deploy, changing a feature flag, bumping limits).
  6. Guardrails: a list of services and repositories where the agent only reads, and a separate list where it can write.
  7. Pilot: one week in "agent writes diagnostics only, no PRs" mode. The team evaluates hypothesis quality.
  8. Expansion: after stable positive feedback, auto-remediation patterns are enabled one by one.
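The guardrails, pilot mode, and one-by-one expansion from steps 5 to 8 could be modeled roughly like this. It is a sketch under assumptions: the `Guardrails` class, its field names, and the repository names are all hypothetical and not part of any real configuration format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Guardrails:
    read_only_repos: frozenset
    writable_repos: frozenset
    allowed_patterns: frozenset
    diagnostics_only: bool = True  # pilot mode: diagnostics only, no PRs

    def may_read(self, repo: str) -> bool:
        return repo in self.read_only_repos or repo in self.writable_repos

    def may_open_pr(self, repo: str, pattern: str) -> bool:
        if self.diagnostics_only:
            return False
        return repo in self.writable_repos and pattern in self.allowed_patterns


# Pilot week: the agent reads and diagnoses but never opens a PR.
pilot = Guardrails(
    read_only_repos=frozenset({"billing-core"}),       # hypothetical repository
    writable_repos=frozenset({"checkout-service"}),    # hypothetical repository
    allowed_patterns=frozenset(),
)

# Expansion: enable a single auto-remediation pattern after positive feedback.
expanded = replace(
    pilot,
    diagnostics_only=False,
    allowed_patterns=frozenset({"rollback_deploy"}),
)

if __name__ == "__main__":
    print(pilot.may_open_pr("checkout-service", "rollback_deploy"))     # False
    print(expanded.may_open_pr("checkout-service", "rollback_deploy"))  # True
    print(expanded.may_open_pr("billing-core", "rollback_deploy"))      # False
```

An immutable config with an explicit allow-list means that enabling a new pattern is a reviewed change rather than a side effect of agent behavior.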

Where the value lies

The agent takes over work that otherwise needs three pairs of hands and acts as a single first responder who is always online. According to reference deployment data, 28 PRs per month are closed without human involvement. These are low-risk fixes that previously consumed senior engineers' time and pulled them away from their current work.
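The reference figures quoted on this page can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that each of the 4,200 monthly flows costs one $0.30 diagnostic and that the saved hours are spread evenly across flows and engineers; neither assumption is stated in the reference data.

```python
# Figures quoted on this page for the reference deployment.
FLOWS_PER_MONTH = 4_200
COST_PER_DIAGNOSTIC_CENTS = 30   # $0.30, kept in cents to avoid float rounding
HOURS_SAVED_PER_MONTH = 675
ENGINEERS = 60

# Assumption: one diagnostic per flow.
monthly_cost_usd = FLOWS_PER_MONTH * COST_PER_DIAGNOSTIC_CENTS / 100
minutes_saved_per_flow = HOURS_SAVED_PER_MONTH * 60 / FLOWS_PER_MONTH
hours_saved_per_engineer = HOURS_SAVED_PER_MONTH / ENGINEERS
cost_per_hour_saved_usd = monthly_cost_usd / HOURS_SAVED_PER_MONTH

print(f"Monthly diagnostic cost: ${monthly_cost_usd:,.2f}")          # $1,260.00
print(f"Minutes saved per flow:  {minutes_saved_per_flow:.1f}")       # 9.6
print(f"Hours saved per engineer: {hours_saved_per_engineer:.2f}")    # 11.25
print(f"Cost per hour saved:     ${cost_per_hour_saved_usd:.2f}")     # $1.87
```

These numbers cover model usage only; integration, hosting, and owner time are not included in the $0.30 figure as quoted.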

Prerequisites

To launch an On-call AI agent, a team needs three readiness groups: access, historical data, and operational process. Without them, the pilot shifts to debugging integrations instead of real incident work.

Access and integrations

  • Observability stack with webhooks: Datadog, Grafana, New Relic, Sentry, or Prometheus Alertmanager.
  • Git repository with configured CI and code review (GitHub, GitLab, Bitbucket).
  • Slack or equivalent with an on-call channel and bot installation rights.
  • Technical agreement: read-only for most repositories, write (create branch + open PR) for the approved list.

Historical data

  • Slack incident threads for the past 6–12 months — the more, the more accurate the pattern matching.
  • Runbooks in any format (Confluence, Notion, markdown in the repository).
  • A list of known auto-remediation patterns: which incident types the team is ready to delegate to the agent (rollback, feature-flag toggle, limit bump).
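To show why more incident history tends to improve pattern matching, here is a toy similarity search over past incident summaries. It is a deliberately simplified stand-in: word counts and cosine similarity replace a real embedding model and vector index, and the incident IDs and texts are invented for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_similar(query: str, incidents: Dict[str, str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Return up to top_k past incidents ranked by similarity to the query."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(text)), incident_id) for incident_id, text in incidents.items()),
        reverse=True,
    )
    return [(incident_id, score) for score, incident_id in scored[:top_k] if score > 0]


# Hypothetical history, as it might be extracted from Slack threads and runbooks.
HISTORY = {
    "INC-1": "checkout latency spike after deploy, fixed by rollback",
    "INC-2": "disk full on logging node, cleaned old indices",
    "INC-3": "payment api rate limit hit, bumped limits",
}

if __name__ == "__main__":
    for incident_id, score in find_similar("latency spike on checkout after new deploy", HISTORY):
        print(incident_id, round(score, 2))
```

With only a handful of indexed incidents, a new alert often has no close match and the agent escalates; a longer history raises the chance that a resolved case with a similar signature is found.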

Team readiness

  • On-call rotation is set up: duty engineer and escalation process in place.
  • Code review is required for all PRs — the agent does not merge on its own.
  • An owner is assigned: a senior SRE or tech lead who validates patterns and reviews false positives in the first weeks.

Implementation timeline

Complexity: medium. Full launch from contract to production takes 6–10 weeks:

  1. Weeks 1–2: integrations, access setup, incident history indexing.
  2. Weeks 3–5: pilot in diagnostic mode, pattern configuration.
  3. Weeks 6–8: enabling auto-remediation for one pattern, calibration.
  4. Weeks 9–10: handover to the team and owner playbook.

Pain points

  • Knowledge in heads, not in documents
  • Constant context switching
  • Slow customer response

FAQ

How long does implementation take?

Full launch takes 6–10 weeks. The first 2 weeks go to integrations with observability, the repository, and Slack. The next 3–4 weeks are a pilot in "diagnostics-only" mode, where the team calibrates hypothesis quality. The final 2–4 weeks cover enabling auto-remediation for one pattern and handover to the owner. The diagnostics part can be launched faster if incidents are well documented in Slack threads.

We don't have up-to-date runbooks — will the agent work?

Partially. The agent compensates for the absence of runbooks with Slack thread history: if the team discusses incidents in channels, that data is sufficient for pattern matching. In the first weeks, the agent escalates more often instead of auto-remediating, but it builds up the knowledge base as it goes. After 1–2 months of operation, a structured incident index emerges, and the conversation history automatically becomes a runbook equivalent.

What are the risks and what can go wrong?

The main risk is false hypotheses that lead the on-call engineer in the wrong direction. That is why the agent shows a confidence level and evidence, and auto-remediation is only enabled for patterns with a success history. The second risk is a PR with an incorrect fix, but code review and CI stop such changes. The agent does not merge on its own and does not touch code outside permitted services.

Is automation suitable for our industry?

The primary profile is SaaS and Tech, where an observability stack and on-call rotation are in place. It also fits e-commerce, fintech, gaming — anywhere production requires on-call coverage. Not suitable for teams without monitoring or without a code review process. Industry-specific requirements are built into auto-remediation patterns: compliance checks matter for fintech, rollback speed for gaming.

Will the agent replace the on-call engineer?

No. The agent is a first responder, not a replacement. It gathers context, proposes a hypothesis, and in simple cases opens a PR, but decisions remain with the human. The reference implementation shows 66% positive feedback and 28 PRs per month without human intervention — these are low-risk fixes that previously took up senior engineers' time. Complex incidents are escalated by the agent with context already gathered.

Is it possible to run only the diagnostics part without auto-remediation?

Yes, this is the standard starting point. In diagnostics mode, the agent writes a summary, a hypothesis, and links to evidence, but does not open a PR. This addresses the main pain point — context switching and searching for similar incidents — without the risk of interference with code. Auto-remediation is enabled as a separate step, after 1–2 months of piloting, once the team sees stable hypothesis quality.

What model does the agent run on?

The core is an LLM with tool use via an agent framework. The model manages the "observation → hypothesis → action" cycle and makes calls to observability, the repository, and Slack. The choice is driven by code-reasoning quality and long-context stability — stack traces, logs, and diffs fit within a single window. Grow2.ai is responsible for prompt engineering, tool patterns, and agent behavior monitoring.

Related automations

#57 · IT / DevOps / SRE

Postmortem Draft from Slack + Telemetry

The Grow2.ai AI agent compiles a postmortem draft by pulling context from incident Slack threads, observability system alerts, and issue tracker tickets. The engineer gets the first draft in minutes — with an event timeline, affected services, team actions, and findings in blameless format — and edits it rather than writing from scratch. The solution fits SaaS teams, DevOps and SRE departments that lose incident knowledge in chats and don't have time to document. Automation addresses three pain points: loss of context from meetings and discussions, hours of manual work on the report, and knowledge that stays in a few people's heads and never makes it into team documents. Basic setup takes about a week: connecting data sources, configuring the prompt template with blameless rules, and testing on real incidents from the team's history. The result is reduced postmortem time: the draft is ready in minutes instead of hours of manually gathering artifacts and writing prose. The blameless format is encoded in the prompt rather than requiring discipline from each individual engineer, and document quality becomes predictable.

The engineer gets the postmortem draft in minutes, edits it — doesn't write from scratch. Blameless format encoded in the prompt.

Week (1-5 days) · Agent framework · Time saved

#58 · IT / DevOps / SRE

AI incident triage + runbook executor

AI incident triage + runbook executor automates initial incident handling and the execution of standard runbooks in the IT / DevOps / SRE department, cutting mean time to mitigate (MTTM) from 22 to 11 minutes (-50%). The AI agent receives signals from monitoring systems, classifies the incident by severity and domain, collects context from logs and metrics, presents the on-call engineer with a ready runbook, and executes its steps on command, with explicit receipt confirmations. As a result, the number of duplicate alerts decreases (-38% per incident), rollback errors disappear (every action requires a receipt confirmation), and SRE team satisfaction grows from 3.2 to 4.4/5. The solution fits SaaS/Tech and universal horizontal scenarios where system knowledge is fragmented across people and on-call engineers switch context dozens of times per shift. The agent does not make irreversible decisions on its own: it prepares the ground for the engineer and documents every step.

50% · Mean time to mitigate
Month (2-4 weeks) · Agent framework · Risk reduced

#59 · IT / DevOps / SRE

Natural language query across the entire observability stack

Natural language query across the observability stack — the AI agent answers the team's questions about logs, metrics, traces, and alerts in plain language. Instead of switching between Grafana, Datadog, Sentry, and Kubernetes dashboards, an engineer types: "why did checkout latency increase after the deploy at 14:07?" — the agent returns a coherent answer with links to specific sources. Automation addresses three pain points of IT teams: too many disparate tools, constant context switching, slow incident response. Time-to-insight drops from minutes or hours of hunt-and-peck to a single query. New engineers onboard faster because there is no need to learn each console separately. Suitable for IT / DevOps / SRE teams in SaaS and tech companies of 5–50 people, and also horizontally — anywhere with an observability stack of two or more tools. Build in a weekend: RAG + MCP connectors + AI model as the conversation engine.

Time-to-insight drops from minutes/hours of hunt-and-peck to a single NL query. New engineers onboard faster.

Weekend (1-2 days) · Vertical SaaS · Time saved

#60 · IT / DevOps / SRE

Cloud cost anomaly detection

Cloud cost anomaly detection automates monitoring of cloud infrastructure spend in the IT / DevOps / SRE department and detects anomalous spikes on the day they occur rather than during the monthly reconciliation. The automation suits SaaS product teams and any company with non-trivial cloud resource consumption, where manual cost tracking takes up engineers' time and leads to missed budget leaks. Grow2.ai sets up a pipeline that pulls billing data from the cloud provider daily, runs it through a statistical anomaly detection model, and sends structured alerts to the team's work channel. The responsible person receives context directly in Slack or email: service, region, deviation from baseline, causes of the spike. The solution does not replace financial planning, but removes hours of manual billing report analysis and reduces response time to configuration errors. Typical scenarios: Terraform errors, forgotten dev instances, autoscaling without an upper limit, unplanned traffic.

Unexpected cost spikes are caught on the same day, not at the end of the month during reconciliation.

Week (1-5 days) · Custom code · Cost saved