Negative trends surface on the day they appear, not after a monthly review.
What it does
An anomaly detector is a service that scans business metrics daily (or more frequently) and raises a flag when a metric behaves unusually. The logic is simple: the model learns from historical data, establishes a normal range that accounts for seasonality and trend, and flags points outside that range. The team learns about a revenue dip, a spike in churn rate, or an unusual conversion rate when the signal appears, not two weeks later at a retrospective.
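As a rough illustration of that logic, the sketch below builds a baseline per weekday from history and flags a point that falls outside the expected range. It is a deliberately simplified stand-in for the models named later in this page: it uses the median and the median absolute deviation (MAD), handles weekly seasonality only, and ignores trend. The function names and the cutoff of 3.5 are assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import median


def expected_range(history, target_day, threshold=3.5):
    """Return (low, high) for target_day from values on the same weekday.

    history: list of (date, value) pairs; threshold: robust z-score cutoff.
    """
    same_weekday = [v for d, v in history if d.weekday() == target_day.weekday()]
    if len(same_weekday) < 4:
        raise ValueError("not enough history for this weekday")
    center = median(same_weekday)
    mad = median(abs(v - center) for v in same_weekday)
    # 1.4826 makes MAD comparable to a standard deviation for normal data;
    # fall back to a tiny spread when every historical value is identical.
    spread = 1.4826 * mad or 1e-9
    return center - threshold * spread, center + threshold * spread


def check_point(history, target_day, actual, threshold=3.5):
    """Compare the actual value with the expected range for that day."""
    low, high = expected_range(history, target_day, threshold)
    direction = "down" if actual < low else "up" if actual > high else None
    return {"anomaly": direction is not None, "direction": direction,
            "low": low, "high": high}


# Synthetic example data: eight weeks, weekends lower than weekdays.
start = date(2024, 1, 1)  # a Monday
history = []
for i in range(56):
    day = start + timedelta(days=i)
    base = 60 if day.weekday() >= 5 else 100
    history.append((day, base + i % 3))

result = check_point(history, start + timedelta(days=56), actual=70)
print(result["anomaly"], result["direction"])
```

Comparing a Monday only with earlier Mondays is the simplest way to respect weekly seasonality; a production model would also need to account for trend and holidays.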
What a typical setup includes:
- Connection to a data source — a data warehouse or a direct query to a BI tool.
- Definition of the metric set: revenue by channel, MRR, active users, funnel stage conversion, churn, average order value, order size, inventory levels, runway.
- Training baseline models on historical data for each metric — daily and weekly seasonality, holidays, and trend are taken into account.
- Regular execution of checks (cron or event-driven) with calculation of the deviation from the expected value.
- Publishing an alert in Slack or Teams with context: metric, current value, expected range, deviation magnitude, link to the dashboard.
- Logging confirmed anomalies and false positives, so that thresholds can be recalibrated.
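To make the alert context from the list above concrete, here is a minimal sketch that formats a message and prepares a POST request for a Slack incoming webhook, which accepts a JSON body with a `text` field. The helper names, the message layout, and the example URLs are assumptions; Teams expects a different payload and is not covered here.

```python
import json
import urllib.request


def build_alert_text(metric, value, low, high, dashboard_url):
    """Format one alert; intended for values that are outside [low, high]."""
    nearest = low if value < low else high
    # Deviation is measured from the nearest boundary of the expected range.
    deviation_pct = (value - nearest) / abs(nearest) * 100 if nearest else float("nan")
    return (
        f"*{metric}* looks unusual\n"
        f"Current value: {value:,.2f}\n"
        f"Expected range: {low:,.2f} to {high:,.2f}\n"
        f"Deviation from range: {deviation_pct:+.1f}%\n"
        f"Dashboard: {dashboard_url}"
    )


def build_request(webhook_url, text):
    """Build, but do not send, the POST request for the webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
text = build_alert_text("Revenue (paid search)", 8000, 9500, 11000,
                        "https://bi.example.com/d/revenue")
request = build_request("https://hooks.example.com/placeholder", text)
# Sending would be: urllib.request.urlopen(request, timeout=10)
```

Keeping message construction separate from sending makes the format easy to check on historical data before anything reaches a channel.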
What the detector will NOT do:
- It does not explain the cause of the anomaly. The signal says 'something is off here,' but root cause analysis remains with the human.
- It does not work without clean historical data. If the revenue data mart breaks once a week or the metric's formula changed recently, the model will produce noise.
- It is not responsible for business decisions. An alert in Slack is an input trigger for investigation, not an instruction to stop a campaign or raise prices.
How it works
Architecturally, the service consists of four layers: data source, calculation engine, alert engine, delivery channel. The custom-code approach is chosen when off-the-shelf SaaS platforms for anomaly detection are excessively priced or do not fit well with the specifics of the client's metrics.
Technical flow
- The scheduler (Airflow, Prefect, Dagster, or a Kubernetes CronJob) runs a batch job on a schedule, once per hour or once per day depending on the metric.
- The job runs an SQL query against the data warehouse and retrieves the time series for the required metric with a history of 90-365 days.
- The detection module applies one of the models: STL decomposition with a z-score on the residuals for most metrics, Prophet or ARIMA for series with pronounced seasonality, isolation forest for multivariate cases.
- The job calculates the expected range for the current point. If the actual value falls outside the confidence interval, an anomaly is recorded with its direction and magnitude.
- Post-processing: duplicate filtering (one anomaly does not alert twice), aggregation of related signals, classification by severity.
- Composing a message in Slack or Teams via webhook — metric, value, expectation, delta, time window, link to the BI dashboard for drill-down.
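The post-processing step in this flow can be sketched as follows: a cooldown-based duplicate filter and a severity classifier driven by deviation magnitude. The class and function names, the 24-hour cooldown, and the P1 cutoff of 5.0 are hypothetical values chosen for illustration; in practice they would come out of calibration.

```python
from datetime import datetime, timedelta


def classify_severity(z_score, p1_cutoff=5.0):
    """Map deviation magnitude to a severity label (cutoff is an assumption)."""
    return "P1" if abs(z_score) >= p1_cutoff else "P2"


class Deduplicator:
    """Suppress repeat alerts for the same metric and direction within a cooldown."""

    def __init__(self, cooldown=timedelta(hours=24)):
        self.cooldown = cooldown
        self._last_sent = {}  # (metric, direction) -> time of last alert

    def should_send(self, metric, direction, now):
        key = (metric, direction)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # same anomaly is still within the cooldown window
        self._last_sent[key] = now
        return True


dedup = Deduplicator()
first_seen = datetime(2024, 3, 1, 9, 0)
if dedup.should_send("mrr", "down", first_seen):
    print("send", classify_severity(-6.2))
```

The in-memory dictionary is enough for a single batch process; a job that restarts between runs would need to persist the last-sent times, for example in a table in the warehouse.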
Implementation steps
- Metric audit and prioritization: a list of 10-20 critical KPIs worth monitoring (with more than that, you will drown in alerts).
- Data preparation: quality checks, a unified metrics table in the DWH or materialized view, documenting the SLA for data freshness.
- Stack selection: Python with libraries for time-series analysis, an orchestrator, secrets for connecting to the DWH and messenger.
- Prototype on 2-3 metrics, manual threshold calibration, a run on historical data to verify accuracy.
- Coverage expansion, adding severity levels, channel separation (P1 alerts go to the on-call channel, P2 — to the digest).
- Two-week shadow mode: alerts are written to the log but not sent to Slack — false positive frequency is verified.
- Launch to production, monthly review of thresholds and effectiveness.
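The channel separation and shadow mode steps above could look roughly like this. The channel names, the `route_alert` function, and the shape of the alert dictionary are assumptions for illustration; `send` stands for whatever delivery function the service actually uses.

```python
import logging

logger = logging.getLogger("anomaly_detector")


def route_alert(alert, shadow_mode, send):
    """Deliver or log one alert and return the destination that was used.

    alert: dict with "severity" and "text"; send: callable(channel, text).
    """
    channel = "#metrics-oncall" if alert["severity"] == "P1" else "#metrics-digest"
    if shadow_mode:
        # Shadow mode: record what would have been sent, deliver nothing.
        logger.info("SHADOW %s -> %s: %s", alert["severity"], channel, alert["text"])
        return "log"
    send(channel, alert["text"])
    return channel


delivered = []
route_alert({"severity": "P1", "text": "MRR below expected range"},
            shadow_mode=True,
            send=lambda channel, text: delivered.append((channel, text)))
```

Because shadow mode is a single flag on the same code path, the two-week log shows exactly what production would have sent, which is what makes the false positive count meaningful.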
System components
| Layer | What it does | Typical tool |
|---|---|---|
| Storage | Time series source | Data warehouse (Snowflake, BigQuery, Postgres) |
| Orchestrator | Scheduled job execution | Airflow, Prefect, cron |
| Calculation | Anomaly detection models | Python + time-series libraries |
| Delivery | Alert channels | Slack, Teams, email |
Infrastructure costs are low: calculation takes minutes, the load on the DWH is small. The main resource is the time of a data engineer or ML engineer for model calibration and working with metric owners.
Prerequisites
What should be in place on the client side before the project starts:
Data and access:
- Data warehouse or a centralized analytics database with at least 6 months of metric history (a year is better).
- Documented SQL queries or dbt models for key metrics. If each metric is calculated ad hoc in different ways — we establish order first.
- A service account for reading data and a webhook URL in Slack or Teams for sending alerts.
- An understanding of seasonality: the team knows that revenue drops on Saturdays and average order value grows in December, and the model is trained with this in mind.
Team and owners:
- A metrics on-call person — someone who responds to an alert. Without an owner, the service turns into a noise channel.
- An analyst or data engineer who owns the metric logic and assists with calibration.
- A DevOps or platform engineer for deployment (Docker, secrets, access to DWH from the infrastructure).
Technology stack:
- Python 3.10+, Docker, an orchestrator (if one is not yet in place, we set up a simple Prefect deployment or a CronJob in the existing Kubernetes cluster).
- Access to Slack or Microsoft Teams via incoming webhook.
Timeline:
- Prototype with 2-3 metrics: 2-3 weeks.
- Full set of 10-20 metrics with calibration and shadow mode: 4-6 weeks.
- If the data warehouse is not set up or the data is dirty — add 2-4 weeks for preparation.
Pain points
- We don't see customer churn signals
- Poor forecasting (cash flow/sales/stock)
FAQ
How long does implementation take?
A basic launch with 2-3 key metrics takes 2-3 weeks. A full scope of 10-20 metrics with model calibration and a two-week shadow mode takes 4-6 weeks. Timelines grow if prior data warehouse preparation or unification of SQL queries per metric is required — that adds 2-4 weeks.
What if we don't have a data warehouse?
The minimum requirement is one analytics database (Postgres replica, ClickHouse) with metric history. If data currently lives in Google Sheets or a product database — we add a step to export it into a separate analytics data mart. This extends the project by 2-4 weeks but provides a foundation for other tasks, not just the detector.
The main risk is false positives. What can be done about it?
Alert fatigue kills the service: if 20 notifications per day flood into Slack, the team stops reading them. We address this in three ways: shadow mode before launch for threshold calibration, severity levels (P1 — call, P2 — digest), and feedback from on-call staff (marking 'not an anomaly' refines the model). After 4-6 weeks, the noise level reaches a workable baseline.
Is this suitable for our industry?
The solution is universal for businesses with structured metrics over time. SaaS companies use it for MRR, churn, and active users. Retail uses it for inventory and average order value. Fintech uses it for cash flow and transaction anomalies. The key requirement is a data warehouse or analytics database with at least 6 months of history.
Why custom code when there are ready-made SaaS platforms?
Ready-made platforms are good when you need to cover hundreds of metrics and have a substantial budget for an annual subscription. A custom-code approach is more cost-effective for 10-30 key metrics: it gives full control over model logic, does not tie you to a vendor, and runs on your own infrastructure. For most SMBs, this is the more rational choice in terms of price-to-result ratio.
Which metrics are best suited for the detector?
Metrics with a regular frequency (daily or hourly values), a stable calculation formula, and at least 90 days of history. Works well: revenue by channel, MRR, active users, funnel conversion, churn rate, inventory levels, average order value. Does not work well: metrics with abrupt formula changes, rare events, indicators without seasonality or trend.
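A quick pre-check of the first two criteria (regular frequency and enough history) might look like the sketch below for a daily metric. The function name and the 5% tolerance for missing days are assumptions; formula stability cannot be checked from dates alone and still needs the metric owner.

```python
from datetime import date, timedelta


def is_suitable(days, min_history_days=90, max_missing_share=0.05):
    """Check a daily metric for enough history and regular frequency.

    days: iterable of dates on which the metric has a value.
    Returns (suitable, reason).
    """
    unique_days = sorted(set(days))
    if not unique_days:
        return False, "no data"
    span = (unique_days[-1] - unique_days[0]).days + 1
    if span < min_history_days:
        return False, f"only {span} days of history"
    missing_share = 1 - len(unique_days) / span
    if missing_share > max_missing_share:
        return False, f"{missing_share:.0%} of days are missing"
    return True, "ok"


# Synthetic example: 120 consecutive days, then the same series with gaps.
first_day = date(2024, 1, 1)
full_series = [first_day + timedelta(days=i) for i in range(120)]
print(is_suitable(full_series))
print(is_suitable(full_series[::2]))
```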
Want this in your business?
Book a free audit — we'll show how this automation will work for you.