#59 · IT / DevOps

Natural language query across the entire observability stack

The AI agent answers the team's questions about logs, metrics, traces, and alerts in plain language. Instead of switching between Grafana, Datadog, Sentry, and Kubernetes dashboards, an engineer types "why did checkout latency increase after the deploy at 14:07?" and the agent returns a coherent answer with links to specific sources. The automation addresses three pain points of IT teams: too many disparate tools, constant context switching, and slow incident response. Time-to-insight drops from minutes or hours of hunt-and-peck to a single query. New engineers onboard faster because they do not need to learn each console separately. It suits IT / DevOps / SRE teams in SaaS and tech companies of 5–50 people, and applies horizontally wherever the observability stack has two or more tools. A minimal version can be built in a weekend from RAG, MCP connectors, and an LLM as the conversation engine.

Expected effect

Time-to-insight drops from minutes/hours of hunt-and-peck to a single NL query. New engineers onboard faster.

Complexity: Weekend (1-2 days)
Tool type: Vertical SaaS
ROI: Time saved
Industries: SaaS / Tech, Other / Horizontal
Integrations: Observability / monitoring, Communications
Patterns: Search / RAG Q&A, Content Generation (drafts)

What it does

What automation does

The AI agent serves as a single entry point into the entire observability stack. An engineer writes a question in Slack, web chat, or via CLI — the agent parses the intent, reaches the required sources through MCP connectors, collects data, and returns an answer with direct links to dashboards and log lines.

Specific scenarios

  1. Incident diagnosis. "What changed in the last 30 minutes?" — the agent collects deploy events, alert history, anomalous metrics, returns a chronological answer.
  2. Root cause search. "Why did 500s increase on checkout-service?" — the agent searches error logs, correlates with recent commits, shows the trace with the longest span.
  3. Context for on-call. "What's on fire right now?" — the agent aggregates open incidents, SLO burn rate, recent releases.
  4. Reports for stakeholders. "Compile a status for the weekly sync" — the agent compiles a draft on SLO, uptime, incidents, key metrics.
  5. Engineer onboarding. "How does the payment-flow work?" — the agent explains using code, docs, and traces.

Key feature: the agent does not just aggregate data; it links it by context. A query about checkout automatically pulls in metrics, logs, and recent commits for the relevant service.

What automation does NOT do

The boundaries are worth stating up front:

  • Does not replace on-call engineers. The agent provides hypotheses; decisions are made by a person.
  • Does not fix incidents. The agent does not run runbooks or deploy — it only reads.
  • Does not replace alerting. PagerDuty, Opsgenie, Sentry continue to work as before.
  • Does not provide 100% accurate answers. Hallucinations are possible — the agent always returns sources for verification.
  • Does not migrate the stack. All existing tools remain in place; the agent works on top of them.

Typical configuration options

Solo / team of 1–5 people. Minimal setup: connect 2–3 key sources, typically logs, monitoring, and git. The agent responds in a single Slack channel or via CLI. For a small team, it pays off through time saved for the founder-engineer who handles everything at once. Setup takes one day and needs no complex role model. A language model via API and a simple RAG index are sufficient. Limitation: without structured routing, hallucinations become more likely once there are more than five sources. At this level, it is easier to keep the focus on one cluster and one incident type.

SMB / team of 6–30 people. Full observability agent: 5–8 sources (logs, metrics, traces, errors, git, CI/CD, incident management, docs). An agent router dispatches by request type, with a separate MCP server for each source. Responses go to Slack, tagged by team (backend, frontend, data). This tier adds an audit log and role-based access for prod vs staging. At this level, tuning prompt templates for the specific stack starts to make sense. Typical savings are hours per week per team, from eliminating hunt-and-peck between consoles.

Enterprise / team of 30+ people. Multi-agent architecture: specialized agents for Kubernetes, database, security, networking. A central router decides which agent receives each request. The setup integrates with an internal service catalog, applies compliance filters (PII, secrets), and gives each team a separate tenant. It requires a dedicated support group (2–3 engineers) and a meaningful budget for LLM tokens. ROI here is less about time savings and more about reducing MTTR on critical incidents and accelerating onboarding of new teams in a large organization.

How it works

Technically, the automation is composed of four layers.

  1. Interface: a Slack bot, web chat, or CLI. The engineer types a question in natural language.
  2. Request router: the LLM orchestrator determines which sources are needed for the answer, which filters to apply, and whether a follow-up is needed.
  3. MCP connectors to data sources: a separate connector for each tool (logs, metrics, traces, errors, git, docs).
  4. Response synthesizer: the LLM aggregates data from sources, explains the connections, and returns an answer with references to the source data.
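
The four layers above can be sketched as plain functions to show how they hand data to each other. This is a minimal illustration under stated assumptions, not the actual implementation: the names `Evidence`, `route`, `synthesize`, and `answer` are hypothetical, the connector stubs return canned data with placeholder URLs, and a keyword table stands in for the LLM orchestrator and synthesizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    source: str   # connector name, e.g. "metrics"
    summary: str  # short excerpt handed to the synthesizer
    url: str      # link back to the dashboard or log line


# Layer 3: one read-only callable per source (stubs with illustrative data).
Connector = Callable[[str], list[Evidence]]


def metrics_stub(question: str) -> list[Evidence]:
    return [Evidence("metrics", "p95 latency rose after 14:07",
                     "https://grafana.example/d/checkout")]


def logs_stub(question: str) -> list[Evidence]:
    return [Evidence("logs", "DB timeout errors in checkout-service",
                     "https://logs.example/q/123")]


CONNECTORS: dict[str, Connector] = {"metrics": metrics_stub, "logs": logs_stub}

# Layer 2: a keyword table standing in for the LLM orchestrator (assumption).
ROUTES: dict[str, list[str]] = {
    "latency": ["metrics", "logs"],
    "error": ["logs"],
    "500": ["logs"],
}


def route(question: str) -> list[str]:
    """Pick connector names for a question, without duplicates."""
    lowered = question.lower()
    picked: list[str] = []
    for keyword, sources in ROUTES.items():
        if keyword in lowered:
            for name in sources:
                if name not in picked:
                    picked.append(name)
    return picked


def synthesize(question: str, evidence: list[Evidence]) -> str:
    """Layer 4: format an answer in which every line carries its source link."""
    lines = [f"Q: {question}"]
    lines += [f"- [{e.source}] {e.summary} ({e.url})" for e in evidence]
    return "\n".join(lines)


def answer(question: str) -> str:
    """Layer 1 entry point: what a Slack bot or CLI handler would call."""
    evidence = [e for name in route(question) for e in CONNECTORS[name](question)]
    return synthesize(question, evidence)
```

In a real build, `route` and `synthesize` would be LLM calls with tool definitions; the point of the sketch is only that evidence keeps its source and link from connector to final answer.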

Step-by-step flow

For a typical question like "why did checkout latency increase after 14:07?" the agent does the following:

  1. Parses the intent: this is a root cause analysis for a specific service and time window.
  2. Identifies the required sources: metrics (latency), logs (error patterns), deploy history (what changed), traces (which span is slow).
  3. Issues parallel requests via MCP connectors to each source.
  4. Aggregates results and finds intersections, for example: deploy at 14:06 → p95 latency increase on checkout-service → DB timeout errors in logs → trace shows a slow query behind a new feature flag.
  5. Generates a coherent answer: hypothesis + evidence + references + next step suggestion.

The entire cycle takes seconds instead of minutes of manual hunt-and-peck.
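
Steps 3 and 4 of the flow above (parallel requests, then aggregation) might look roughly like the sketch below. It assumes each connector exposes an async fetch function; the `fetch_*` stubs, the `Event` type, and `collect_timeline` are hypothetical names, the returned data is canned, and the calendar date is arbitrary. One source fails on purpose to show that a single unavailable backend does not have to sink the whole answer.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    at: datetime
    source: str
    detail: str


# Hypothetical connector calls returning canned data (the date is arbitrary).
async def fetch_deploys(service: str) -> list[Event]:
    await asyncio.sleep(0)  # stands in for a network round trip
    return [Event(datetime(2024, 1, 1, 14, 6), "deploys", f"deploy of {service}")]


async def fetch_metrics(service: str) -> list[Event]:
    return [Event(datetime(2024, 1, 1, 14, 8), "metrics", f"p95 latency up on {service}")]


async def fetch_logs(service: str) -> list[Event]:
    return [Event(datetime(2024, 1, 1, 14, 9), "logs", "DB timeout errors")]


async def fetch_traces(service: str) -> list[Event]:
    raise ConnectionError("trace backend unavailable")  # simulated outage


FETCHERS = {
    "logs": fetch_logs,
    "traces": fetch_traces,
    "deploys": fetch_deploys,
    "metrics": fetch_metrics,
}


async def collect_timeline(
    service: str, start: datetime, end: datetime, timeout_s: float = 5.0
) -> tuple[list[Event], list[str]]:
    """Query all sources concurrently; return events in time order plus failed sources."""
    names = list(FETCHERS)
    calls = [asyncio.wait_for(FETCHERS[n](service), timeout_s) for n in names]
    # return_exceptions=True keeps one failing source from cancelling the others.
    results = await asyncio.gather(*calls, return_exceptions=True)

    events: list[Event] = []
    failed: list[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed.append(name)
            continue
        events.extend(e for e in result if start <= e.at <= end)
    events.sort(key=lambda e: e.at)
    return events, failed
```

The sorted timeline and the list of failed sources would then go to the synthesizer, which can say explicitly which sources it could not consult.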

The role of the LLM

The language model is a key component. Its strengths for this task:

  • Long context allows simultaneous analysis of excerpts from 5+ sources without overload.
  • Strong tool use: MCP connectors are invoked via structured tool calls, with no free-text parsing.
  • Multi-step (chain-of-thought) reasoning for complex correlations between metrics and events.
  • Careful handling of references: when instructed to cite, the agent returns the sources it retrieved rather than fabricating them.

MCP connectors

Model Context Protocol (MCP) is an open standard for connecting LLMs to external tools and data sources. For an observability scenario, a typical set of connectors looks like this:

  • logs-mcp — reads the log aggregator (Loki, Elastic, CloudWatch Logs).
  • metrics-mcp — PromQL/Prometheus and/or Grafana API.
  • traces-mcp — Tempo, Jaeger, or an OpenTelemetry-compatible backend.
  • errors-mcp — Sentry, Rollbar, Honeybadger.
  • git-mcp — commit history and deploy events.
  • docs-mcp — internal runbooks, Notion, Confluence.

Each connector is a separate process with read-only access and its own rate limit.
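
The read-only and rate-limit properties can be enforced inside the connector itself, in addition to token scopes. The sketch below is one possible shape, not a real MCP SDK API: `TokenBucket`, `ReadOnlyConnector`, and the `send` callable are hypothetical, and the GET/HEAD allowlist is a simplifying assumption.

```python
import time
from typing import Callable


class RateLimitExceeded(Exception):
    """Raised when the connector's own request budget is exhausted."""


class TokenBucket:
    """Classic token bucket: refills at rate_per_s up to capacity."""

    def __init__(self, rate_per_s: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


READ_ONLY_METHODS = frozenset({"GET", "HEAD"})  # assumption: reads use these verbs


class ReadOnlyConnector:
    """Wraps a transport function and refuses writes and over-budget calls."""

    def __init__(self, name: str, send: Callable[[str, str], str],
                 bucket: TokenBucket) -> None:
        self.name = name
        self.send = send
        self.bucket = bucket

    def request(self, method: str, path: str) -> str:
        verb = method.upper()
        # Check permissions first so rejected writes do not consume tokens.
        if verb not in READ_ONLY_METHODS:
            raise PermissionError(f"{self.name}: {verb} is not allowed (read-only)")
        if not self.bucket.try_acquire():
            raise RateLimitExceeded(f"{self.name}: rate limit reached")
        return self.send(verb, path)
```

Some backends expose read queries over POST, so a verb allowlist alone is too coarse for them; in that case the allowlist would need to be per endpoint, and the minimal-scope API token remains the actual security boundary.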

Alternative approaches

| Approach | Strengths | Limitations |
| --- | --- | --- |
| Manual analysis | Full control, zero implementation cost | Slow, requires knowledge of each tool, does not scale |
| Vendor no-code aggregator (Datadog Bits AI, New Relic Grok, Grafana Assistant) | Ready-made solution, vendor support | Works only within a single vendor's ecosystem, often expensive, inflexible |
| AI automation on MCP (this approach) | Works on top of any stack, adapts to the team, responds contextually | Requires initial setup, requires response quality control, risk of hallucinations |

Manual analysis works while the team is small and the stack is simple. When there are more than three sources and more than five engineers, every incident turns into a context hunt. Vendor no-code solutions work well within their own ecosystem but connect different sources poorly — and a real observability stack is usually assembled from multiple tools from different vendors. AI automation on MCP works on top of what already exists and does not require migrating to a single vendor stack. The downside: internal expertise is needed for setup and response quality monitoring.

Security and compliance

Observability data often contains sensitive information: PII in logs, tokens, internal URLs. Three basic requirements for the setup:

  1. Read-only access. The agent must not be able to modify data or run runbooks; it connects through API tokens with minimal, read-only scopes.
  2. Connector-level filtering. PII redaction and secrets masking before data enters the LLM context.
  3. Audit log. All requests and responses are logged — for incident analysis and for compliance (SOC 2, GDPR, HIPAA as needed).

If the team works with personal user data, use an LLM provider with a zero retention policy (the Anthropic API supports this) or self-hosted inference for sensitive environments.
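
Connector-level filtering, as described above, can start as a small set of regex rules applied to every log line before it reaches the LLM context. The following is a minimal sketch: the rule set and the `redact` helper are illustrative assumptions, far from exhaustive, and a real deployment would extend the patterns to its own log formats and test them against sampled data.

```python
import re

# Illustrative rules only; order matters because later rules see earlier replacements.
REDACTION_RULES: list[tuple[str, re.Pattern[str]]] = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("BEARER_TOKEN", re.compile(r"\bbearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE)),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    ("AWS_ACCESS_KEY_ID", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("SECRET_VALUE", re.compile(
        r"\b(?:password|passwd|secret|api[_-]?key|token)\s*[=:]\s*\S+",
        re.IGNORECASE,
    )),
]


def redact(line: str) -> tuple[str, dict[str, int]]:
    """Mask sensitive substrings; return the cleaned line and per-rule hit counts."""
    counts: dict[str, int] = {}
    for label, pattern in REDACTION_RULES:
        line, hits = pattern.subn(f"[REDACTED:{label}]", line)
        if hits:
            counts[label] = hits
    return line, counts


if __name__ == "__main__":
    raw = "user=alice@example.com password=hunter2 Authorization: Bearer abc.def-123"
    cleaned, hits = redact(raw)
    print(cleaned)
    print(hits)
```

The hit counts are worth writing to the audit log: a sudden rise in redactions for one service is a signal that it has started logging secrets.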

Prerequisites

What you need in advance

Before launching the agent, set up the infrastructure and team.

Technical prerequisites

  1. List of observability sources. Which tools are in use: Grafana, Datadog, Sentry, CloudWatch, Prometheus, or anything else. At least two sources — otherwise the agent becomes a wrapper over a single API.
  2. API tokens with read-only scope. A separate token for each source. No write permissions, no admin privileges.
  3. MCP connectors. Either ready-made ones (they exist for popular tools) or build your own — a day or two of work per tool.
  4. LLM provider. An LLM via the Anthropic API is a working default, thanks to long context and strong tool use.
  5. Access channel. Slack, Microsoft Teams, web chat, or CLI — wherever the team will ask questions. Slack is the typical choice.
  6. Sandbox on pre-prod. Test queries and responses for two weeks before giving the whole team access.

Roles and responsibilities

  • DevOps / SRE lead — sets up MCP connectors, validates access.
  • Tech lead — defines question types, collects feedback on response quality.
  • Security — reviews compliance settings, PII filters, audit log.
  • Product engineer — adapts prompt templates to the team's specifics.

Potential pitfalls

  • Hallucinations without sources. If mandatory citation of source data is not configured, the agent can start fabricating numbers and events. Fix: require in the system prompt that every fact carry its source, and reject responses without citations.
  • Context overload. If you pull the entire log from the past hour into the LLM, the context fills up and responses degrade. Fix: filtering at the connector level, only relevant fragments reach the LLM.
  • Phantom correlations. The agent may find a "connection" between two random events. Fix: explicitly request hypotheses rather than assertions, add a confidence score, validate against a regression query set.
  • Secrets in context. API keys, tokens, passwords from logs end up in the LLM prompt. Fix: regex filters on the connector + zero retention policy at the LLM provider + rotation of compromised keys in the event of a leak.
  • Quality drift. After a month, sources change, log schemas evolve — responses degrade with no visible errors. Fix: weekly sample review of 10–20 queries, regression tests for typical scenarios, alert on confidence drop.
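
The "reject responses without citations" fix from the first pitfall can be checked mechanically if the agent returns its answer as structured claims. The sketch below assumes such a format; `Claim`, `uncited_claims`, and `accept` are hypothetical names, and the check only verifies that each cited source was actually returned by a connector, not that the claim follows from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    sources: tuple[str, ...]  # URLs or IDs the model cited for this claim


def uncited_claims(claims: list[Claim], retrieved: set[str]) -> list[Claim]:
    """Claims with no citation, or citing something the connectors never returned."""
    return [
        c for c in claims
        if not c.sources or not set(c.sources) <= retrieved
    ]


def accept(claims: list[Claim], retrieved: set[str]) -> bool:
    """Accept an answer only if it has claims and every claim is grounded."""
    return bool(claims) and not uncited_claims(claims, retrieved)


if __name__ == "__main__":
    retrieved = {"deploy:1842", "logs:q/123"}  # illustrative source IDs
    claims = [
        Claim("Deploy at 14:06 preceded the latency rise", ("deploy:1842",)),
        Claim("Error rate tripled", ()),  # no citation, should be flagged
    ]
    print(accept(claims, retrieved))
    print([c.text for c in uncited_claims(claims, retrieved)])
```

A rejected answer can be retried with the flagged claims listed in the prompt, or returned to the engineer with those claims marked as unverified.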

Pain points

  • Too many tools without integration
  • Constant context switching
  • Slow customer response

FAQ

How long does implementation take?

Minimal build — one weekend: connect 2–3 sources, configure a Slack bot, test on 10–20 typical questions. Full setup for a team of 6–30 people — 2–3 weeks: integration of 5–8 sources, role model, audit log, sample review. Enterprise scenario with multi-agent architecture — 2–3 months with dedicated engineers.

What if we don't have a centralized observability stack?

The agent requires at least two sources — for example, logs and metrics. If everything is currently in one tool, the value is lower — it is simpler to use the vendor's built-in AI assistant. If there are more sources — the agent adds connectivity. If there is almost no stack, first set up basic observability (logs + metrics + error tracking), then come back to this automation.

What are the risks and what can break?

Three main risks: hallucinations (the agent makes up facts — fix via mandatory sources), context overload on large samples (fix via pre-LLM filtering), secrets leaking into the prompt (fix via regex filters and a zero retention policy). Observability itself does not break: the agent operates read-only and does not modify data in sources.

Is it suitable for our industry?

Works best in SaaS and Tech, where the observability stack is typically assembled from multiple tools from different vendors. Horizontally applicable to any company with an engineering team of five or more people and two or more observability tools. Less useful for teams on a single vendor stack: the built-in AI assistant covers most scenarios there.

How to handle sensitive data in logs?

Three protective layers: regex filters for PII and secrets at the MCP connector level, a zero retention policy at the LLM provider, audit log of all requests and responses. For handling medical or financial data — self-hosted inference on an on-prem model. Grow2.ai helps configure the security perimeter for specific compliance requirements (SOC 2, GDPR, HIPAA).

Can the agent fix incidents or only respond?

In the base configuration the agent is read-only: it answers questions, finds correlations, and forms hypotheses. Decisions and actions remain with the human. Extending to actions (running runbooks, restarting services) is possible, but requires a separate role model and human-in-the-loop approval of each step. The recommended path is to start read-only, verify the quality of responses, and only then expand capabilities.

How accurate are the answers?

On simple factual queries ("what is on fire right now", "what is the p95 latency") — high accuracy with correct source configuration. On complex correlation questions ("why did latency increase after the deploy") — accuracy depends on data quality and prompt engineering. Always require the agent to return references to source data so an engineer can verify the output.

Related automations

#56 · IT / DevOps / SRE

On-call AI agent: diagnostics + auto-remediation PR

On-call AI agent: diagnostics + auto-remediation PR automates the incident response process in the IT / DevOps / SRE department and achieves savings of 675 engineering hours per month. The AI agent connects to the observability stack, codebase, and on-call Slack channels, collects context when an alert fires, and proposes a fix — from hypothesis to pull request with the fix. For a team of 60 engineers and 30 channels, the system processes 4,200 successful flows per month, receives 66% positive feedback, and closes 28 PRs without human involvement. The cost of a single diagnostic is $0.30. Automation addresses three common pain points for DevOps teams: knowledge is scattered across on-call engineers' heads, engineers constantly context-switch between alerts, logs, and code, customers are slow to learn incident status. Grow2.ai deploys the agent on an AI model with integration into the repository, monitoring, and Slack — full launch takes 6–10 weeks.

675 h/month · Engineering time saved
Month (2-4 weeks) · Agent framework · Time saved
#57 · IT / DevOps / SRE

Postmortem Draft from Slack + Telemetry

The Grow2.ai AI agent compiles a postmortem draft by pulling context from incident Slack threads, observability system alerts, and issue tracker tickets. The engineer gets the first draft in minutes — with an event timeline, affected services, team actions, and findings in blameless format — and edits it rather than writing from scratch. The solution fits SaaS teams, DevOps and SRE departments that lose incident knowledge in chats and don't have time to document. Automation addresses three pain points: loss of context from meetings and discussions, hours of manual work on the report, and knowledge that stays in a few people's heads and never makes it into team documents. Basic setup takes about a week: connecting data sources, configuring the prompt template with blameless rules, and testing on real incidents from the team's history. The result is reduced postmortem time: the draft is ready in minutes instead of hours of manually gathering artifacts and writing prose. The blameless format is encoded in the prompt rather than requiring discipline from each individual engineer, and document quality becomes predictable.

The engineer gets the postmortem draft in minutes, edits it — doesn't write from scratch. Blameless format encoded in the prompt.

Week (1-5 days) · Agent framework · Time saved
#58 · IT / DevOps / SRE

AI incident triage + runbook executor

AI incident triage + runbook executor automates initial incident handling and execution of standard runbooks in the IT / DevOps / SRE department and achieves MTTM reduction from 22 to 11 minutes (-50%). The AI agent receives signals from monitoring systems, classifies the incident by severity and domain, collects context from logs and metrics, presents the on-call engineer with a ready runbook, and executes its steps on command, with explicit receipt confirmations. As a result, the number of duplicate alerts decreases (-38% per incident), rollback errors disappear (all actions go through receipt), and SRE team satisfaction grows from 3.2 to 4.4/5. The solution fits SaaS/Tech and universal horizontal scenarios where system knowledge is fragmented across people and on-call engineers switch context dozens of times per shift. The agent does not make irreversible decisions on its own — it prepares the ground for the engineer and documents every step.

50% · Mean time to mitigate
Month (2-4 weeks) · Agent framework · Risk reduced
#60 · IT / DevOps / SRE

Cloud cost anomaly detection

Cloud cost anomaly detection automates the process of monitoring cloud infrastructure spend in the IT / DevOps / SRE department and achieves the effect of detecting anomalous spikes on the day they occur, not at the monthly reconcile stage. Automation suits SaaS product teams and any companies with non-trivial cloud resource consumption, where manual cost tracking takes up engineers' time and leads to missed budget leaks. Grow2.ai sets up a pipeline that pulls billing data from the cloud provider daily, runs it through a statistical anomaly detection model, and sends structured alerts to the team's work channel. The responsible person receives context directly in Slack or email: service, region, deviation from baseline, causes of the spike. The solution does not replace financial planning, but removes hours of manual billing report analysis and reduces response time to configuration errors. Typical scenarios: Terraform errors, forgotten dev instances, autoscaling without an upper limit, unplanned traffic.

Unexpected cost spikes are caught on the same day, not at the end of the month during reconcile.

Week (1-5 days) · Custom code · Cost saved