Unplanned downtime decreases. Spare parts ordering becomes proactive. MTBF (mean time between failures) grows.
What it does
Predictive maintenance alerts shift equipment maintenance from a reactive mode ("broken, fix it") to a proactive one. Automation continuously analyzes telemetry, finds early signs of wear, and alerts the team before a failure. The goal is to eliminate unplanned downtime and move from emergency repairs to scheduled ones.
The process step by step:
- Telemetry collection. Data from sensors (vibration, temperature, pressure, energy consumption) and equipment logs flows into the observability stack — Prometheus, InfluxDB, or an industry-specific SCADA/MES.
- Normalization and storage. Metrics are brought to a unified format, aggregated into time series, and stored with 6-24 months of retention for model training.
- Baseline model. A statistical profile of normal operation is built for each piece of equipment: metric ranges, seasonality, correlations between parameters.
- Anomaly detector. Detectors (Isolation Forest, an LSTM autoencoder, or rule-based checks) compare current readings against the baseline and calculate an anomaly score.
- Tier classification. Alerts are divided by severity: watch (monitor), warning (schedule an inspection), critical (stop and check now).
- Team notification. The alert is sent to Slack, email, or SMS with context — which node, which metric deviated, a recommended action, and a predicted time to failure.
- Closing the loop. The engineer confirms the cause (true positive / false positive / planned maintenance) — the data is returned to the model for retraining.
- Parts and scheduling. On warning alerts, the system automatically creates a spare parts request in the ERP and a task in the maintenance calendar.
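The tier classification step in the list above can be sketched as a mapping from anomaly score to severity. This is a minimal illustration, assuming the score is normalized to the 0..1 range. The threshold values and the names `classify_tier` and `TIER_ACTIONS` are hypothetical and would be tuned per equipment unit.

```python
from typing import Optional

# Hypothetical thresholds on an anomaly score normalized to the 0..1 range.
# Real values are tuned per equipment unit during the pilot.
WATCH_THRESHOLD = 0.5
WARNING_THRESHOLD = 0.7
CRITICAL_THRESHOLD = 0.9

# Recommended action per tier, taken from the tier descriptions in the text.
TIER_ACTIONS = {
    "watch": "monitor",
    "warning": "schedule an inspection",
    "critical": "stop and check now",
}


def classify_tier(anomaly_score: float) -> Optional[str]:
    """Map an anomaly score to an alert tier, or None if no alert is needed."""
    if anomaly_score >= CRITICAL_THRESHOLD:
        return "critical"
    if anomaly_score >= WARNING_THRESHOLD:
        return "warning"
    if anomaly_score >= WATCH_THRESHOLD:
        return "watch"
    return None
```

Checking the highest tier first keeps the function correct even if the thresholds are later changed, as long as they stay in ascending order.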
What automation does NOT do:
- Does not replace a diagnostics engineer. An alert is a "look here" signal, not a ready-made diagnosis of the failure cause. Root cause is determined by a person.
- Does not work without a failure history. At least 3-6 months of normal operation data and several documented failures are needed for the model to distinguish noise from real anomalies.
- Does not cover equipment without sensors. If a press has no vibration sensor, vibration-based predictive maintenance is not possible — IoT retrofitting will first be required as a separate project.
How it works
The technical data pipeline is divided into three layers: ingest (collection), analytics (models), and delivery (alerts). Each layer is handled by a separate set of tools and implemented in custom code, because no off-the-shelf end-to-end product fits a specific equipment fleet.
Ingest layer. Sources — PLC, SCADA, individual IoT sensors, industrial software logs. Data is collected via OPC UA, MQTT, Modbus, or the equipment manufacturer's API. The collector (Telegraf, Node-RED, custom Python) normalizes the format and writes to a time-series database (Prometheus, InfluxDB, TimescaleDB).
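The normalization step of a custom Python collector might look like the sketch below. It assumes raw messages arrive as dictionaries with the fields `equipment_id`, `metric`, `unit`, `value`, and `ts_epoch`. These field names, the `Reading` record, and the conversion table are hypothetical and not tied to any particular protocol or vendor API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Unified telemetry record, ready to be written to a time-series database."""
    equipment_id: str
    metric: str
    value: float
    unit: str
    timestamp: datetime  # always timezone-aware UTC


# Hypothetical conversions to one canonical unit per metric.
_CONVERTERS = {
    ("temperature", "F"): ("C", lambda v: (v - 32.0) * 5.0 / 9.0),
    ("temperature", "C"): ("C", lambda v: v),
    ("pressure", "kPa"): ("bar", lambda v: v / 100.0),
    ("pressure", "bar"): ("bar", lambda v: v),
}


def normalize(raw: dict) -> Reading:
    """Convert one raw collector message into the unified format."""
    metric = raw["metric"].lower()
    key = (metric, raw["unit"])
    if key not in _CONVERTERS:
        raise ValueError(f"unsupported metric/unit combination: {key}")
    unit, convert = _CONVERTERS[key]
    timestamp = datetime.fromtimestamp(raw["ts_epoch"], tz=timezone.utc)
    return Reading(raw["equipment_id"], metric, convert(float(raw["value"])), unit, timestamp)
```

Rejecting unknown metric and unit combinations, instead of passing them through, keeps mixed units out of the stored time series.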
Analytics layer. Three types of models are used here:
- Threshold-based rules. Simple rules: "if vibration > X for Y minutes — alert". Work immediately, without training, but generate many false positives.
- Statistical models. Z-score, EWMA, ARIMA on time series. Detect deviations from the seasonal baseline without a heavy ML stack.
- ML models. Isolation Forest for anomaly detection, LSTM-autoencoder for multivariate signals, XGBoost for failure type classification. Trained on historical data, require a retraining pipeline.
Model outputs — anomaly score and failure probability estimate over a horizon (24 hours, 7 days, 30 days).
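The first two model types can be sketched with the standard library alone. The example below is a minimal illustration, not a production detector. It assumes readings arrive at a fixed sampling interval, so "Y minutes" becomes a number of samples. The function names and the default `alpha` are hypothetical.

```python
import math
from typing import List, Sequence


def sustained_threshold(values: Sequence[float], limit: float, min_samples: int) -> bool:
    """Threshold rule: True if the latest `min_samples` readings all exceed `limit`."""
    if len(values) < min_samples:
        return False
    return all(v > limit for v in values[-min_samples:])


def ewma_zscores(values: Sequence[float], alpha: float = 0.2) -> List[float]:
    """Score each reading against an exponentially weighted mean and variance.

    Each score uses only the statistics of earlier readings, so a spike
    does not inflate the baseline it is compared against.
    """
    if not values:
        return []
    scores: List[float] = [0.0]  # the first reading has no history to compare with
    mean = values[0]
    var = 0.0
    for x in values[1:]:
        std = math.sqrt(var)
        scores.append(abs(x - mean) / std if std > 0 else 0.0)
        # Update the running statistics only after scoring the reading.
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return scores
```

The first few scores are noisy because the variance estimate has seen very little data, so a real deployment would likely ignore a warm-up period before alerting on them.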
Delivery layer. The alert router (Alertmanager, custom code, or a workflow engine) filters duplicates, applies escalation rules, and sends notifications to Slack/Teams, email, SMS, or a voice call for critical alerts.
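A custom-code router with duplicate filtering could be sketched as follows. This is an illustration under stated assumptions: the 30-minute deduplication window, the `CHANNELS_BY_TIER` mapping, and the `AlertRouter` class are hypothetical, and the actual delivery to each channel is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical channel mapping per tier; watch-tier alerts stay on the dashboard.
CHANNELS_BY_TIER = {
    "watch": ["dashboard"],
    "warning": ["slack", "email"],
    "critical": ["slack", "sms", "voice_call"],
}


class AlertRouter:
    """Suppress repeated alerts for the same equipment, metric, and tier."""

    def __init__(self, dedup_window: timedelta = timedelta(minutes=30)):
        self.dedup_window = dedup_window
        self._last_sent: Dict[Tuple[str, str, str], datetime] = {}

    def route(self, equipment_id: str, metric: str, tier: str, now: datetime) -> List[str]:
        """Return the channels to notify, or an empty list for a duplicate."""
        key = (equipment_id, metric, tier)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return []  # same alert was sent recently, so drop it
        self._last_sent[key] = now
        return list(CHANNELS_BY_TIER[tier])
```

Because the tier is part of the deduplication key, an escalation from warning to critical on the same metric is never suppressed by an earlier warning.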
Example components:
| Component | Purpose | Example tool |
|---|---|---|
| Data collection | Equipment telemetry | Telegraf, Node-RED, OPC UA client |
| Storage | Time-series metrics | Prometheus, InfluxDB, TimescaleDB |
| Visualization | Dashboards, manual analysis | Grafana |
| Models | Anomaly detection | Python (scikit-learn, PyTorch), MLflow |
| Alert routing | Filtering and escalation | Alertmanager, orchestrator, custom |
| Channels | Notification delivery | Slack, email, SMS (Twilio) |
Implementation stages:
- Discovery (1-2 weeks). Equipment inventory, data sources, failure history. Formulating hypotheses about predictor signals for key nodes.
- Data pipeline (2-3 weeks). Connecting sources, configuring collectors, backfilling historical data for 6-12 months.
- Baseline and models (2-3 weeks). Exploratory analysis, selection of model architecture, training on historical data, validation on a held-out dataset.
- Alert logic (1-2 weeks). Configuring tiers, deduplication rules, notification templates, escalation chains.
- Pilot (2-4 weeks). Launch on 3-5 units of equipment. Engineers evaluate each alert, and model precision is tuned to a level the team considers acceptable for the critical tier.
- Rollout (2-4 weeks). Expansion to the full fleet, team training, documentation of runbooks for typical alerts.
The feedback loop is critical: every closed alert is labeled as true positive, false positive, or planned maintenance. These labels feed into model retraining every 1-3 months. Without this loop, accuracy degrades — new equipment, changes in operating modes, and seasonal fluctuations throw off the baseline.
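One way to use these labels between retraining runs is to track precision per tier. The sketch below assumes closed alerts are available as `(tier, label)` pairs. The label strings and the function name `precision_by_tier` are hypothetical. Excluding planned-maintenance alerts from the calculation is also an assumption made here, since they say nothing about whether the detector was right.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

VALID_LABELS = {"true_positive", "false_positive", "planned_maintenance"}


def precision_by_tier(closed_alerts: Iterable[Tuple[str, str]]) -> Dict[str, Optional[float]]:
    """Compute precision per tier from (tier, label) pairs of closed alerts.

    A tier with no true or false positives gets None instead of a number.
    """
    true_pos: Counter = Counter()
    false_pos: Counter = Counter()
    tiers = set()
    for tier, label in closed_alerts:
        if label not in VALID_LABELS:
            raise ValueError(f"unknown label: {label}")
        tiers.add(tier)
        if label == "true_positive":
            true_pos[tier] += 1
        elif label == "false_positive":
            false_pos[tier] += 1
        # planned_maintenance alerts are counted in neither bucket
    result: Dict[str, Optional[float]] = {}
    for tier in tiers:
        total = true_pos[tier] + false_pos[tier]
        result[tier] = true_pos[tier] / total if total else None
    return result
```

A falling precision on the critical tier between two retraining cycles is an early hint that the baseline has drifted.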
Prerequisites
To launch predictive maintenance, three groups of prerequisites are required: data, access, and team. Without any one of them, the project stretches out or hits an accuracy ceiling.
Data and equipment:
- Sensors on critical nodes — vibration, temperature, pressure, current. If there are no sensors, the first step is IoT retrofitting (separate budget and timeline).
- Historical data for a minimum of 3-6 months, preferably 12+ months.
- A failure log for the same period with annotations: failure type, time, repair costs.
- Equipment technical documentation — normative metric ranges, maintenance regulations.
Access and integrations:
- Access to PLC/SCADA/MES via OPC UA, Modbus, MQTT, or the manufacturer's API.
- Storage for a time-series database — on-premise server or cloud (Prometheus, InfluxDB Cloud, AWS Timestream).
- Notification channels with the ability to create a bot or webhook — Slack, Teams, Twilio for SMS.
- ERP or a maintenance system with an API, if an automatic spare parts request is needed.
Team and processes:
- Chief engineer or maintenance lead — owner of the alert business logic and tier classification.
- OT/IoT engineer — for connecting equipment and working with industrial protocols.
- Data engineer or ML engineer — for the data pipeline and models.
- An agreed SLA for alert response: who receives warning, who receives critical, and at what time.
Timeline: 6-10 weeks for a full launch with sensors and history in place. If starting with IoT retrofitting — add 4-8 weeks. A pilot on 3-5 units of equipment fits within 4-6 weeks and provides data for a scaling decision.
Pain points
- Poor Forecasting (cashflow/sales/stock)
- Errors in Manual Operations
FAQ
How long does implementation take?
The baseline timeline is 6-10 weeks with sensors in place and 3-6 months of historical data. A pilot on 3-5 units of equipment is separated into a distinct phase of 4-6 weeks to test hypotheses about failure predictors and fine-tune model precision. Rollout to the full fleet adds another 2-4 weeks depending on the number of nodes and the readiness of integrations with the ERP and maintenance system.
What to do if we have no failure history?
Two paths. The first is to start with threshold rules based on manufacturer specifications, while accumulating 3-6 months of history for ML models in parallel. The second is to connect external datasets on similar equipment for transfer learning. Both approaches yield lower accuracy at the start but allow you to avoid waiting six months for the first alert. As data accumulates, the model retrains and reaches target accuracy.
What are the risks and what can go wrong?
Three main risks. The first is alert fatigue: if false positives drown out real ones, engineers stop responding to notifications. The second is a missed failure (false negative) due to an unaccounted operating mode. The third is data drift: an old model degrades after a line upgrade or product changeover. All three are mitigated by a feedback loop and regular model retraining every 1-3 months.
Is this suitable for a manufacturing company of our size (5-50 employees)?
Yes. For a small production facility, the focus shifts to 5-15 critical units of equipment where downtime is most costly. A simplified stack (Prometheus + Grafana + Python scripts + Slack) works without Enterprise licenses. ROI analysis is built on the cost of one hour of downtime for a specific line and the historical frequency of unplanned stoppages. The team usually knows these numbers or can recover them from the maintenance log.
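The ROI arithmetic can be sketched as below. This is only an illustration: the average stoppage duration and the `avoided_fraction` parameter are assumptions that do not come from the text above, and every input has to come from the team's own maintenance log and budget.

```python
def annual_downtime_cost(cost_per_hour: float,
                         stoppages_per_year: float,
                         avg_hours_per_stoppage: float) -> float:
    """Baseline yearly cost of unplanned downtime for one line."""
    if min(cost_per_hour, stoppages_per_year, avg_hours_per_stoppage) < 0:
        raise ValueError("inputs must be non-negative")
    return cost_per_hour * stoppages_per_year * avg_hours_per_stoppage


def payback_months(implementation_cost: float,
                   baseline_annual_cost: float,
                   avoided_fraction: float) -> float:
    """Months until avoided downtime covers the implementation cost.

    `avoided_fraction` is the assumed share of unplanned downtime that the
    alerts help avoid (0..1). It is an input to estimate, not a known value.
    """
    monthly_saving = baseline_annual_cost * avoided_fraction / 12.0
    if monthly_saving <= 0:
        return float("inf")
    return implementation_cost / monthly_saving
```

Running the calculation for a cautious and an optimistic `avoided_fraction` gives a payback range instead of a single figure.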
How to reduce the number of false positives?
Three levers. Tier classification: watch/warning/critical with different thresholds — some alerts go to the dashboard rather than Slack. Model consensus: an alert fires only if two independent detectors agree. Feedback loop: each false positive is flagged by an engineer and fed into retraining. The goal is for the critical tier to have high precision, while warning can be somewhat less strict by default.
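The model consensus lever reduces to a small voting check. The sketch below assumes each detector has already produced a boolean flag for the current reading. The function name and the detector names in the usage example are hypothetical.

```python
from typing import Mapping


def consensus_alert(detector_flags: Mapping[str, bool], min_agreeing: int = 2) -> bool:
    """Fire only if at least `min_agreeing` independent detectors flag an anomaly."""
    agreeing = sum(1 for flagged in detector_flags.values() if flagged)
    return agreeing >= min_agreeing
```

For example, `consensus_alert({"ewma": True, "isolation_forest": False})` returns `False`, so a single detector firing on its own would not reach the team.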
Can this integrate with our CMMS or ERP?
Yes, if the system has a REST API or webhook. A typical scenario: on a warning alert, a work order is automatically created in the CMMS linked to the equipment, metric type, and predicted time to failure. On critical, a spare parts request is simultaneously created in the ERP. Integration adds 1-2 weeks to the baseline timeline and requires API access and an agreed-upon equipment reference schema.