📖 Understanding Incidents
When something goes wrong with a monitored service, Monitron creates an incident. This page explains how incidents work and how to manage them.
🔄 Incident Lifecycle
Monitor Check Fails
│
▼
Fail Threshold Met? ──── No ──→ Wait for next check
│
Yes
▼
┌─────────────────┐
│ INVESTIGATING │ ← Incident created, notifications sent
└────────┬────────┘
▼
┌─────────────────┐
│ IDENTIFIED │ ← Team acknowledges and identifies cause
└────────┬────────┘
▼
┌─────────────────┐
│ MONITORING │ ← Fix applied, watching for stability
└────────┬────────┘
▼
┌─────────────────┐
│ RESOLVED │ ← Confirmed fixed, incident closed
└─────────────────┘
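The flow above can be read as a small state machine. The following is a minimal Python sketch of it, not Monitron's actual implementation: the names `Status`, `LIFECYCLE`, `should_open_incident`, and `next_status` are hypothetical, and it assumes the fail threshold counts consecutive failed checks.

```python
from enum import Enum


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Forward order of the stages, copied from the lifecycle diagram.
LIFECYCLE = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def should_open_incident(consecutive_failures: int, fail_threshold: int) -> bool:
    """Return True once failed checks reach the threshold.

    Assumption: the threshold counts consecutive failures.
    """
    return consecutive_failures >= fail_threshold


def next_status(current: Status) -> Status:
    """Return the stage that follows `current`; RESOLVED is terminal."""
    index = LIFECYCLE.index(current)
    if index == len(LIFECYCLE) - 1:
        raise ValueError("a resolved incident has no next status")
    return LIFECYCLE[index + 1]
```

For example, `should_open_incident(2, 3)` is false, so the monitor waits for the next check, while `next_status(Status.IDENTIFIED)` yields `Status.MONITORING`. The sketch models only the step-by-step path in the diagram; an incident that is resolved directly, such as when a monitor recovers, would skip the intermediate stages.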
📊 Incident Statuses
| Status | Meaning | Color |
|---|---|---|
| 🔴 Investigating | Just detected, team is looking into it | Red |
| 🟡 Identified | Root cause found, working on a fix | Yellow |
| 🔵 Monitoring | Fix applied, monitoring for stability | Blue |
| 🟢 Resolved | Confirmed fixed, all clear | Green |
⚡ Severity Levels
| Severity | When to use | Color |
|---|---|---|
| ℹ️ Info | Minor issues, degraded performance | Blue |
| ⚠️ Warning | Partial outage, potential impact | Yellow |
| 🔴 Critical | Full service outage | Red |
| 🔥 Emergency | Multiple services affected, major impact | Red (flashing) |
🤖 Automatic Incidents
Monitron automatically creates incidents when:
- A monitor goes down: After the fail threshold is met (configurable), an incident is created with:
  - Title: "{Monitor Name} is down"
  - Severity: Critical
  - Status: Investigating
  - The error message from the check
- A heartbeat is missed: If no ping is received within interval + grace period
- A monitor recovers: The incident is automatically resolved
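To make these rules concrete, here is a hedged Python sketch of the first two. The `Incident` dataclass and both function names are hypothetical and do not describe Monitron's internal API. The field values come from the list above, and the heartbeat check assumes "within interval + grace period" is measured from the last received ping.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    title: str
    severity: str
    status: str
    error_message: str


def incident_for_down_monitor(monitor_name: str, error_message: str) -> Incident:
    """Build an incident with the defaults listed for a monitor that went down."""
    return Incident(
        title=f"{monitor_name} is down",
        severity="Critical",
        status="Investigating",
        error_message=error_message,
    )


def heartbeat_missed(
    last_ping: datetime,
    interval: timedelta,
    grace: timedelta,
    now: datetime,
) -> bool:
    """Return True when no ping has arrived within interval + grace period.

    Assumption: the window starts at the last received ping.
    """
    return now - last_ping > interval + grace
```

With a 5-minute interval and a 1-minute grace period, a heartbeat whose last ping was 6 minutes ago is still on time under this sketch; it counts as missed only once more than 6 minutes have passed.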
✋ Manual Incidents
You can also create incidents manually from the Incidents page:
- Click "New Incident"
- Fill in the title, severity, and description
- Optionally link to a monitor
- The incident appears on your dashboard and status pages
👆 Managing Incidents
Acknowledge
Click the "Acknowledge" button to signal that someone is looking at the issue. This records who acknowledged it and when.
Resolve
Click "Resolve" to close the incident. This records the resolution time and calculates the total duration.
Duration
Monitron automatically calculates how long each incident lasted:
- Start: When the incident was created
- End: When it was resolved (or current time if still open)
- Duration: Human-readable format (e.g., "2 hours 15 minutes")
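The calculation above can be sketched in a few lines of Python. This illustrates the rule and is not Monitron's code: `incident_duration` and `humanize` are hypothetical names, all timestamps are assumed to be timezone-aware UTC, and the output omits zero-valued units and seconds, which is a guess at the display format based on the "2 hours 15 minutes" example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def incident_duration(
    created_at: datetime,
    resolved_at: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> timedelta:
    """End is the resolution time, or the current time if still open."""
    end = resolved_at or now or datetime.now(timezone.utc)
    return end - created_at


def humanize(duration: timedelta) -> str:
    """Format a duration as days, hours, and minutes, skipping zero units."""
    total_minutes = int(duration.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            suffix = "" if value == 1 else "s"
            parts.append(f"{value} {unit}{suffix}")
    return " ".join(parts) if parts else "less than a minute"
```

An incident created at 10:00 UTC and resolved at 12:15 UTC gives `humanize(incident_duration(created, resolved))` equal to `"2 hours 15 minutes"`. For an open incident, omit `resolved_at` and the duration is measured up to the current time.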
🤖 AI-Powered Incident Management
If AI features are enabled, you get extra superpowers:
| Feature | What It Does |
|---|---|
| AI Root Cause | Click to get an AI analysis of why the incident happened |
| AI Postmortem | Auto-generate a blameless postmortem report (for resolved incidents) |
| AI Status Draft | Get a public-facing status update drafted by AI |
See the AI Features section for details.
💡 Tips
- Keep incidents updated — Add incident updates as you learn more. Your team and status page subscribers see these.
- Use severity correctly — Reserve "Emergency" for true emergencies. Alert fatigue is real!
- Review resolved incidents — Use AI Postmortem or manual review to learn from incidents and prevent recurrence.