Understanding Incidents
When something goes wrong with a monitored service, Monitron creates an incident. This page explains how incidents work and how to manage them.
Incident Lifecycle
Monitor Check Fails
         │
         ▼
Fail Threshold Met? ──── No ──→ Wait for next check
         │
        Yes
         ▼
┌─────────────────┐
│  INVESTIGATING  │ ← Incident created, notifications sent
└────────┬────────┘
         ▼
┌─────────────────┐
│   IDENTIFIED    │ ← Team acknowledges and identifies cause
└────────┬────────┘
         ▼
┌─────────────────┐
│   MONITORING    │ ← Fix applied, watching for stability
└────────┬────────┘
         ▼
┌─────────────────┐
│    RESOLVED     │ ← Confirmed fixed, incident closed
└─────────────────┘
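The lifecycle above can be read as a small forward-only state machine. The following is a minimal sketch under stated assumptions, not Monitron's actual implementation: the `Status` enum and the `next_status` and `can_transition` helpers are hypothetical names, and the rule that any forward move is allowed (for example, jumping straight to Resolved when a monitor recovers) is an assumption based on the auto-resolve behavior described on this page.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    """Hypothetical enum mirroring the four statuses in the diagram."""
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Order taken from the lifecycle diagram, top to bottom.
LIFECYCLE = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def next_status(current: Status) -> Optional[Status]:
    """Return the status that follows `current`, or None once resolved."""
    index = LIFECYCLE.index(current)
    return LIFECYCLE[index + 1] if index + 1 < len(LIFECYCLE) else None


def can_transition(current: Status, target: Status) -> bool:
    """Assumption: forward moves (including skips) are allowed, backward moves are not."""
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)
```

With this model, `can_transition(Status.INVESTIGATING, Status.RESOLVED)` is allowed, while moving from Monitoring back to Identified is rejected.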
Incident Statuses
| Status | Meaning | Color |
|---|---|---|
| 🔴 Investigating | Just detected, team is looking into it | Red |
| 🟡 Identified | Root cause found, working on a fix | Yellow |
| 🔵 Monitoring | Fix applied, monitoring for stability | Blue |
| 🟢 Resolved | Confirmed fixed, all clear | Green |
Severity Levels
| Severity | When to Use | Color |
|---|---|---|
| ℹ️ Info | Minor issues, degraded performance | Blue |
| ⚠️ Warning | Partial outage, potential impact | Yellow |
| 🔴 Critical | Full service outage | Red |
| 🔥 Emergency | Multiple services affected, major impact | Red (flashing) |
Automatic Incidents
Monitron creates and resolves incidents automatically:
- A monitor goes down: After the fail threshold is met (configurable), an incident is created with:
  - Title: "{Monitor Name} is down"
  - Severity: Critical
  - Status: Investigating
  - The error message from the check
- A heartbeat is missed: An incident is created if no ping is received within interval + grace period
- A monitor recovers: The incident is automatically resolved
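The rules above can be sketched in a few lines. This is an illustration only, not Monitron's code: `MonitorState`, `record_check`, and `heartbeat_missed` are hypothetical names, the incident is modeled as a plain dictionary, and counting consecutive failures against the threshold is an assumption (the page only says the threshold is configurable).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitorState:
    """Hypothetical in-memory state for one monitor."""
    name: str
    fail_threshold: int                  # assumed setting name
    consecutive_failures: int = 0
    open_incident: Optional[dict] = None


def record_check(monitor: MonitorState, ok: bool, error: str = "") -> Optional[dict]:
    """Apply one check result and return the open incident, if any."""
    if ok:
        # Recovery: reset the counter and auto-resolve any open incident.
        monitor.consecutive_failures = 0
        if monitor.open_incident is not None:
            monitor.open_incident["status"] = "Resolved"
            monitor.open_incident = None
        return None

    monitor.consecutive_failures += 1
    threshold_met = monitor.consecutive_failures >= monitor.fail_threshold
    if monitor.open_incident is None and threshold_met:
        # Fields follow the list above: title, severity, status, error message.
        monitor.open_incident = {
            "title": f"{monitor.name} is down",
            "severity": "Critical",
            "status": "Investigating",
            "error": error,
        }
    return monitor.open_incident


def heartbeat_missed(last_ping: datetime, interval: timedelta,
                     grace: timedelta, now: datetime) -> bool:
    """True once more than interval + grace period has passed since the last ping."""
    return now - last_ping > interval + grace
```

For example, with `fail_threshold=3` the first two failed checks create nothing, the third opens a Critical incident titled "API is down" for a monitor named "API", and the next successful check marks it Resolved.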
Manual Incidents
You can also create incidents manually from the Incidents page:
- Click "New Incident"
- Fill in the title, severity, and description
- Optionally link to a monitor
- The incident appears on your dashboard and status pages
Managing Incidents
Acknowledge
Click the "Acknowledge" button to signal that someone is looking at the issue. This records who acknowledged it and when.
Resolve
Click "Resolve" to close the incident. This records the resolution time and calculates the total duration.
Duration
Monitron automatically calculates how long each incident lasted:
- Start: When the incident was created
- End: When it was resolved (or current time if still open)
- Duration: Human-readable format (e.g., "2 hours 15 minutes")
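The calculation above can be approximated as follows. This is a sketch under assumptions rather than Monitron's actual formatter: `incident_duration` and `humanize` are hypothetical helpers, and the choice to drop seconds and omit zero-valued units is an assumption inferred from the "2 hours 15 minutes" example.

```python
from datetime import datetime, timedelta
from typing import Optional


def incident_duration(created_at: datetime,
                      resolved_at: Optional[datetime],
                      now: datetime) -> timedelta:
    """End is the resolution time, or the current time if the incident is still open."""
    end = resolved_at if resolved_at is not None else now
    return end - created_at


def humanize(duration: timedelta) -> str:
    """Format a duration as days, hours, and minutes, skipping zero-valued units."""
    total_minutes = int(duration.total_seconds()) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)

    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            suffix = "" if value == 1 else "s"
            parts.append(f"{value} {unit}{suffix}")
    # Assumption: anything under a minute is reported with a fixed phrase.
    return " ".join(parts) if parts else "less than a minute"


created = datetime(2024, 1, 1, 9, 0)
resolved = datetime(2024, 1, 1, 11, 15)
print(humanize(incident_duration(created, resolved, now=resolved)))  # 2 hours 15 minutes
```

Passing `resolved_at=None` makes the same function report the running duration of an open incident, measured up to `now`.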
AI-Powered Incident Management
If AI features are enabled, you get extra superpowers:
| Feature | What It Does |
|---|---|
| AI Root Cause | Click to get an AI analysis of why the incident happened |
| AI Postmortem | Auto-generate a blameless postmortem report (for resolved incidents) |
| AI Status Draft | Get a public-facing status update drafted by AI |
See the AI Features section for details.
Tips
- Keep incidents updated: Add incident updates as you learn more. Your team and status page subscribers see these.
- Use severity correctly: Reserve "Emergency" for true emergencies. Alert fatigue is real!
- Review resolved incidents: Use AI Postmortem or manual review to learn from incidents and prevent recurrence.