Time-to-insight drops from minutes or hours of hunt-and-peck to a single natural-language query. New engineers onboard faster.
What it does
What automation does
The AI agent serves as a single entry point into the entire observability stack. An engineer writes a question in Slack, web chat, or via CLI — the agent parses the intent, reaches the required sources through MCP connectors, collects data, and returns an answer with direct links to dashboards and log lines.
Specific scenarios
- Incident diagnosis. "What changed in the last 30 minutes?" — the agent collects deploy events, alert history, anomalous metrics, returns a chronological answer.
- Root cause search. "Why did 500s increase on checkout-service?" — the agent searches error logs, correlates with recent commits, shows the trace with the longest span.
- Context for on-call. "What's on fire right now?" — the agent aggregates open incidents, SLO burn rate, recent releases.
- Reports for stakeholders. "Compile a status for the weekly sync" — the agent compiles a draft on SLO, uptime, incidents, key metrics.
- Engineer onboarding. "How does the payment-flow work?" — the agent explains using code, docs, and traces.
Key feature: the agent does not simply aggregate data; it links it by context. A query about checkout automatically pulls in metrics, logs, and recent commits for the relevant service.
What automation does NOT do
It is important to draw the lines right away:
- Does not replace on-call engineers. The agent provides hypotheses; decisions are made by a person.
- Does not fix incidents. The agent does not run runbooks or deploy — it only reads.
- Does not replace alerting. PagerDuty, Opsgenie, Sentry continue to work as before.
- Does not provide 100% accurate answers. Hallucinations are possible — the agent always returns sources for verification.
- Does not migrate the stack. All existing tools remain in place; the agent works on top of them.
Typical configuration options
Solo / team of 1–5 people. Minimal setup: connect 2–3 key sources — typically logs, monitoring, and git. The agent responds in a single Slack channel or via CLI. For a small team, it pays off through time savings for the founder-engineer who handles everything at once. Setup takes one day, no complex role model required. A language model via API and a simple RAG index are sufficient. Limitation: with more than five sources, hallucinations begin without structured routing. At this level, it is easier to keep focus on one cluster and one incident type.
SMB / team of 6–30 people. Full observability agent: 5–8 sources (logs, metrics, traces, errors, git, CI/CD, incident management, docs). Agent router by request type, separate MCP servers for each source. Responses in Slack with markup by team (backend, frontend, data). Adds audit log and role-based access for prod vs staging. At this level, fine-tuning prompt templates for the specific stack starts to make sense. Typical savings — hours per week per team, by eliminating hunt-and-peck between consoles.
Enterprise / team of 30+ people. Multi-agent architecture: specialized agents for Kubernetes, database, security, networking. A central router determines which agent to route a request to. Integration with an internal service catalog, compliance filters (PII, secrets), a separate tenant per team. Requires a dedicated support group (2–3 engineers) and a meaningful budget for LLM tokens. ROI — less about time savings and more about reducing MTTR on critical incidents and accelerating onboarding of new teams in a large organization.
How it works
Technically, automation is a composition of four layers.
- Interface — A Slack bot, web chat, or CLI. The engineer types a question in natural language.
- Request router — The LLM orchestrator determines which sources are needed for the answer, which filters to apply, and whether a follow-up is needed.
- MCP connectors to data sources — a separate connector for each tool: logs, metrics, traces, errors, git, docs.
- Response synthesizer — The LLM aggregates data from sources, explains the connections, and returns an answer with references to the source data.
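The request router layer can be illustrated with a minimal sketch. It assumes a simple keyword table as a stand-in for the LLM orchestrator; the intent names, the keyword lists, and the `Plan` and `route` names are hypothetical and only show the shape of the decision (which sources, and whether a follow-up is needed).

```python
from dataclasses import dataclass

# Hypothetical mapping from intent to the sources worth querying.
SOURCES_BY_INTENT = {
    "root_cause": ["metrics", "logs", "git", "traces"],
    "status": ["errors", "metrics", "git"],
    "explain": ["docs", "git", "traces"],
}

# Keyword rules stand in for the LLM orchestrator's intent detection.
INTENT_KEYWORDS = {
    "root_cause": ("why", "increase", "spike", "changed"),
    "status": ("on fire", "status", "right now"),
    "explain": ("how does", "explain"),
}


@dataclass
class Plan:
    intent: str
    sources: list[str]
    needs_follow_up: bool = False


def route(question: str) -> Plan:
    """Pick an intent and the sources to query for a question."""
    text = question.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return Plan(intent, SOURCES_BY_INTENT[intent])
    # No rule matched: ask the engineer to narrow the question.
    return Plan("unknown", [], needs_follow_up=True)


print(route("Why did 500s increase on checkout-service?"))
```

In a real setup the keyword table would be replaced by a model call that returns the same kind of plan; the downstream layers only need the list of sources.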
Step-by-step flow
For a typical question like "why did checkout latency increase after 14:07?" the agent does the following:
- Parses the intent: this is a root cause analysis for a specific service and time window.
- Identifies the required sources: metrics (latency), logs (error patterns), deploy history (what changed), traces (which span is slow).
- Issues parallel requests via MCP connectors to each source.
- Aggregates results, finds intersections — for example, deploy at 14:06 → latency increase on p95 checkout-service → DB timeout error in logs → trace shows a slow query in a new feature flag.
- Generates a coherent answer: hypothesis + evidence + references + next step suggestion.
The entire cycle takes seconds instead of minutes of manual hunt-and-peck.
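The fan-out and aggregation steps above can be sketched with `asyncio`. This is a minimal illustration, not the actual agent: `query_source` returns canned events instead of calling MCP connectors, and all timestamps and summaries other than the 14:06 deploy are made-up sample values.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: str      # "HH:MM", zero-padded so string order equals time order
    source: str
    summary: str


# Canned results stand in for real MCP connector calls (values are illustrative).
FAKE_RESULTS = {
    "git": [Event("14:06", "git", "deploy of checkout-service")],
    "metrics": [Event("14:07", "metrics", "p95 latency increase")],
    "logs": [Event("14:08", "logs", "DB timeout errors")],
    "traces": [Event("14:08", "traces", "slow query behind new feature flag")],
}


async def query_source(name: str) -> list[Event]:
    await asyncio.sleep(0.01)  # simulates network latency
    return FAKE_RESULTS.get(name, [])


async def collect(sources: list[str]) -> list[Event]:
    # Fan out to every source concurrently, then merge into one timeline.
    batches = await asyncio.gather(*(query_source(s) for s in sources))
    events = [event for batch in batches for event in batch]
    return sorted(events, key=lambda event: event.time)


timeline = asyncio.run(collect(["metrics", "logs", "git", "traces"]))
for event in timeline:
    print(f"{event.time} [{event.source}] {event.summary}")
```

Sorting the merged events by time is what turns four separate result sets into the chronological chain the synthesizer explains.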
The role of the LLM
The language model is a key component. Its strengths for this task:
- Long context allows simultaneous analysis of excerpts from 5+ sources without overload.
- Strong tool use — MCP connectors are invoked via structured tool calls without parsing free text.
- Chain-of-thought reasoning for complex correlations between metrics and events.
- Careful handling of references — the agent returns sources rather than fabricating them.
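To make "structured tool calls without parsing free text" concrete, here is a small sketch of a tool described in JSON Schema style and a check of the arguments a model might emit. The tool name, the field names (`input_schema`, `promql`, `minutes`), and the `validate_call` helper are assumptions for illustration, not a specific vendor's or connector's API.

```python
import json

# Hypothetical tool description in a JSON Schema style.
QUERY_METRICS_TOOL = {
    "name": "query_metrics",
    "description": "Run a read-only PromQL query over a time window.",
    "input_schema": {
        "type": "object",
        "properties": {
            "promql": {"type": "string"},
            "minutes": {"type": "integer"},
        },
        "required": ["promql", "minutes"],
    },
}

PYTHON_TYPES = {"string": str, "integer": int}


def validate_call(tool: dict, arguments: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = tool["input_schema"]
    problems = [f"missing: {key}" for key in schema["required"] if key not in arguments]
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected: {key}")
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            problems.append(f"wrong type: {key}")
    return problems


# The model emits arguments as JSON, so no free-text parsing is needed.
call = json.loads(
    '{"promql": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",'
    ' "minutes": 30}'
)
print(validate_call(QUERY_METRICS_TOOL, call))
```

Validating arguments before the connector runs keeps malformed calls away from the data sources and gives the orchestrator a clear error to retry on.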
MCP connectors
Model Context Protocol (MCP) is the standard for connecting LLMs to external sources. For an observability scenario, a typical set of connectors:
- logs-mcp: reads the log aggregator (Loki, Elastic, CloudWatch Logs).
- metrics-mcp: PromQL/Prometheus and/or Grafana API.
- traces-mcp: Tempo, Jaeger, or an OpenTelemetry-compatible backend.
- errors-mcp: Sentry, Rollbar, Honeybadger.
- git-mcp: commit history and deploy events.
- docs-mcp: internal runbooks, Notion, Confluence.
Each connector is a separate process with read-only access and its own rate limit.
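One way to picture "read-only access and its own rate limit" is a wrapper that exposes a single query method guarded by a token bucket. This is a sketch under assumptions: `TokenBucket` and `ReadOnlyConnector` are hypothetical names, the `fetch` callable stands in for the real backend client, and the limits are arbitrary.

```python
import time


class TokenBucket:
    """Allows `rate` calls per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ReadOnlyConnector:
    """Hypothetical connector wrapper: exposes only a query method, no writes."""

    def __init__(self, name: str, fetch, limiter: TokenBucket):
        self.name = name
        self._fetch = fetch
        self._limiter = limiter

    def query(self, expression: str):
        if not self._limiter.allow():
            raise RuntimeError(f"{self.name}: rate limit exceeded")
        return self._fetch(expression)


# Usage: two calls per second at most, with a stub backend.
logs = ReadOnlyConnector("logs-mcp", lambda q: [f"match for {q}"], TokenBucket(2.0, 2))
print(logs.query("level=error"))
```

Giving each connector its own bucket means a burst of log queries cannot exhaust the budget of the metrics or traces backends.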
Alternative approaches
| Approach | Strengths | Limitations |
|---|---|---|
| Manual analysis | Full control, zero implementation cost | Slow, requires knowledge of each tool, does not scale |
| Vendor no-code aggregator (Datadog Bits AI, New Relic Grok, Grafana Assistant) | Ready-made solution, vendor support | Works only within a single vendor's ecosystem, often expensive, inflexible |
| AI automation on MCP (this approach) | Works on top of any stack, adapts to the team, responds contextually | Requires initial setup, requires response quality control, risk of hallucinations |
Manual analysis works while the team is small and the stack is simple. When there are more than three sources and more than five engineers, every incident turns into a context hunt. Vendor no-code solutions work well within their own ecosystem but connect different sources poorly — and a real observability stack is usually assembled from multiple tools from different vendors. AI automation on MCP works on top of what already exists and does not require migrating to a single vendor stack. The downside: internal expertise is needed for setup and response quality monitoring.
Security and compliance
Observability data often contains sensitive information: PII in logs, tokens, internal URLs. Three basic requirements for the setup:
- Read-only access. The agent must not have rights to modify data or to run runbooks. Grant access only through API tokens with minimal, read-only scopes.
- Connector-level filtering. PII redaction and secrets masking before data enters the LLM context.
- Audit log. All requests and responses are logged — for incident analysis and for compliance (SOC 2, GDPR, HIPAA as needed).
If the team works with personal user data, use an LLM provider with a zero retention policy (the Anthropic API supports this) or self-hosted inference for sensitive environments.
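Connector-level filtering can be sketched as a redaction pass applied to every log line before it reaches the LLM context. The patterns below are illustrative assumptions only (emails, bearer tokens, and `key=value` style secrets); a real deployment would need a broader, vetted rule set.

```python
import re

# Illustrative patterns only; not a complete PII or secrets rule set.
REDACTION_RULES = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]+"), "Bearer [TOKEN]"),
    (re.compile(r"(?i)\b(password|api[_-]?key|secret)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
]


def redact(line: str) -> str:
    """Apply every redaction rule in order and return the masked line."""
    for pattern, replacement in REDACTION_RULES:
        line = pattern.sub(replacement, line)
    return line


print(redact("login failed user=alice@example.com password=hunter2"))
```

Running the filter inside the connector, rather than in the orchestrator, means unmasked values never leave the process that holds the source credentials.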
Prerequisites
What you need in advance
Before launching the agent, set up the infrastructure and team.
Technical prerequisites
- List of observability sources. Which tools are in use: Grafana, Datadog, Sentry, CloudWatch, Prometheus, or anything else. At least two sources — otherwise the agent becomes a wrapper over a single API.
- API tokens with read-only scope. A separate token for each source. No write permissions, no admin privileges.
- MCP connectors. Either ready-made ones (they exist for popular tools) or build your own — a day or two of work per tool.
- LLM provider. A model accessed through the Anthropic API is a workable default thanks to long context and reliable tool use.
- Access channel. Slack, Microsoft Teams, web chat, or CLI — wherever the team will ask questions. Slack is the typical choice.
- Sandbox on pre-prod. Test queries and responses for two weeks before giving the whole team access.
Roles and responsibilities
- DevOps / SRE lead — sets up MCP connectors, validates access.
- Tech lead — defines question types, collects feedback on response quality.
- Security — reviews compliance settings, PII filters, audit log.
- Product engineer — adapts prompt templates to the team's specifics.
Potential pitfalls
- Hallucinations without sources. If mandatory citation of source data is not configured, the agent starts fabricating numbers and events. Fix: require in the system prompt that the agent cite the source of each fact, and reject responses without citations.
- Context overload. If you pull the entire log from the past hour into the LLM, the context fills up and responses degrade. Fix: filtering at the connector level, only relevant fragments reach the LLM.
- Phantom correlations. The agent may find a "connection" between two random events. Fix: explicitly request hypotheses rather than assertions, add a confidence score, validate against a regression query set.
- Secrets in context. API keys, tokens, passwords from logs end up in the LLM prompt. Fix: regex filters on the connector + zero retention policy at the LLM provider + rotation of compromised keys in the event of a leak.
- Quality drift. After a month, sources change, log schemas evolve — responses degrade with no visible errors. Fix: weekly sample review of 10–20 queries, regression tests for typical scenarios, alert on confidence drop.
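The "reject responses without citations" fix from the first pitfall can be sketched as a post-check on the agent's answer. It assumes a made-up citation marker of the form `[src:<id>]` and treats each non-empty line as one claim; both conventions, and the `check_citations` name, are illustrative.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[src:([\w-]+)\]")  # assumed marker format, e.g. [src:logs-1]


@dataclass
class Verdict:
    accepted: bool
    problems: list[str]


def check_citations(answer: str, known_sources: set[str]) -> Verdict:
    """Reject answers with uncited claims or citations to unknown sources."""
    problems = []
    # Treat each non-empty line as one claim that needs its own citation.
    for claim in filter(None, (line.strip() for line in answer.splitlines())):
        cited = CITATION.findall(claim)
        if not cited:
            problems.append(f"no citation: {claim}")
        for source_id in cited:
            if source_id not in known_sources:
                problems.append(f"unknown source {source_id}: {claim}")
    return Verdict(accepted=not problems, problems=problems)


answer = "Deploy at 14:06 [src:git-1]\nDB was overloaded"
print(check_citations(answer, {"git-1"}))
```

Checking cited identifiers against the set of sources actually queried also catches the case where the model invents a plausible-looking reference.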
Pain points
- Too many tools without integration
- Constant context switching
- Slow customer response
FAQ
How long does implementation take?
Minimal build — one weekend: connect 2–3 sources, configure a Slack bot, test on 10–20 typical questions. Full setup for a team of 6–30 people — 2–3 weeks: integration of 5–8 sources, role model, audit log, sample review. Enterprise scenario with multi-agent architecture — 2–3 months with dedicated engineers.
What if we don't have a centralized observability stack?
The agent requires at least two sources — for example, logs and metrics. If everything is currently in one tool, the value is lower — it is simpler to use the vendor's built-in AI assistant. If there are more sources — the agent adds connectivity. If there is almost no stack, first set up basic observability (logs + metrics + error tracking), then come back to this automation.
What are the risks and what can break?
Three main risks: hallucinations (the agent makes up facts — fix via mandatory sources), context overload on large samples (fix via pre-LLM filtering), secrets leaking into the prompt (fix via regex filters and a zero retention policy). Observability itself does not break: the agent operates read-only and does not modify data in sources.
Is it suitable for our industry?
Works best in SaaS and Tech, where the observability stack is typically assembled from multiple tools from different vendors. Horizontally applicable to any company with an engineering team of five or more people and two or more observability tools. Less useful for teams on a single vendor stack: the built-in AI assistant covers most scenarios there.
How to handle sensitive data in logs?
Three protective layers: regex filters for PII and secrets at the MCP connector level, a zero retention policy at the LLM provider, audit log of all requests and responses. For handling medical or financial data — self-hosted inference on an on-prem model. Grow2.ai helps configure the security perimeter for specific compliance requirements (SOC 2, GDPR, HIPAA).
Can the agent fix incidents or only respond?
In the base configuration the agent is read-only: it answers questions, finds correlations, and forms hypotheses. Decisions and actions remain with the human. Extending to actions (running runbooks, restarting services) is possible, but requires a separate role model and human-in-the-loop approval of each step. It is recommended to start with read-only, verify the quality of responses, then expand capabilities.
How accurate are the answers?
On simple factual queries ("what is on fire right now", "what is the p95 latency") — high accuracy with correct source configuration. On complex correlation questions ("why did latency increase after the deploy") — accuracy depends on data quality and prompt engineering. Always require the agent to return references to source data so an engineer can verify the output.