📖 Understanding Incidents

When something goes wrong with a monitored service, Monitron creates an incident. This page explains how incidents work and how to manage them.

🔄 Incident Lifecycle

Monitor Check Fails
        │
        ▼
   Fail Threshold Met? ──── No ──→ Wait for next check
        │
       Yes
        ▼
   ┌─────────────────┐
   │  INVESTIGATING   │ ← Incident created, notifications sent
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │   IDENTIFIED     │ ← Team acknowledges and identifies cause
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │   MONITORING     │ ← Fix applied, watching for stability
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │    RESOLVED      │ ← Confirmed fixed, incident closed
   └─────────────────┘
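
To make the diagram concrete, here is a minimal sketch of the nominal path as an ordered list of stages. The `Status` enum and the `next_status` helper are hypothetical names chosen for illustration, not part of Monitron's API, and the sketch only models the stage-by-stage path drawn above.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Forward order taken from the lifecycle diagram.
LIFECYCLE = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def next_status(current: Status) -> Optional[Status]:
    """Return the next stage in the diagram, or None once resolved."""
    index = LIFECYCLE.index(current)
    if index + 1 < len(LIFECYCLE):
        return LIFECYCLE[index + 1]
    return None
```

An incident may also be closed early (for example, when a monitor recovers on its own), so treat this as the default progression and not as a strict rule.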

📊 Incident Statuses

Status	Meaning	Color
🔴 Investigating	Just detected, team is looking into it	Red
🟡 Identified	Root cause found, working on a fix	Yellow
🔵 Monitoring	Fix applied, monitoring for stability	Blue
🟢 Resolved	Confirmed fixed, all clear	Green

⚡ Severity Levels

Severity	When to use	Color
ℹ️ Info	Minor issues, degraded performance	Blue
⚠️ Warning	Partial outage, potential impact	Yellow
🔴 Critical	Full service outage	Red
🔥 Emergency	Multiple services affected, major impact	Red (flashing)

🤖 Automatic Incidents

Monitron automatically creates or resolves incidents in these cases:

- A monitor goes down: after the configurable fail threshold is met, an incident is created with:
  - Title: "{Monitor Name} is down"
  - Severity: Critical
  - Status: Investigating
  - The error message from the check
- A heartbeat is missed: no ping is received within the expected interval plus the grace period.
- A monitor recovers: the open incident is automatically resolved.
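
The rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not Monitron's actual implementation: it assumes the fail threshold counts consecutive failed checks, and `MonitorState`, `record_check`, and `heartbeat_missed` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitorState:
    name: str
    fail_threshold: int = 3  # hypothetical value; the real default is not stated here
    consecutive_failures: int = 0
    open_incident: Optional[dict] = None


def record_check(state: MonitorState, ok: bool, error: str = "") -> Optional[dict]:
    """Update the monitor after one check and return the open incident, if any."""
    if ok:
        state.consecutive_failures = 0
        if state.open_incident is not None:
            # Recovery: the open incident is resolved automatically.
            state.open_incident["status"] = "Resolved"
            state.open_incident = None
        return None
    state.consecutive_failures += 1
    threshold_met = state.consecutive_failures >= state.fail_threshold
    if state.open_incident is None and threshold_met:
        state.open_incident = {
            "title": f"{state.name} is down",
            "severity": "Critical",
            "status": "Investigating",
            "error": error,
        }
    return state.open_incident


def heartbeat_missed(
    last_ping: datetime, interval: timedelta, grace: timedelta, now: datetime
) -> bool:
    """True when no ping has arrived within interval + grace period."""
    return now - last_ping > interval + grace
```

With `fail_threshold=2`, the first failed check only increments the counter, the second one opens the incident, and the next successful check resolves it.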

✋ Manual Incidents

You can also create incidents manually from the Incidents page:

1. Click "New Incident".
2. Fill in the title, severity, and description.
3. Optionally link the incident to a monitor.

The incident then appears on your dashboard and status pages.

👆 Managing Incidents

Acknowledge

Click the "Acknowledge" button to signal that someone is looking at the issue. This records who acknowledged it and when.

Resolve

Click "Resolve" to close the incident. This records the resolution time and calculates the total duration.
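
As a rough mental model of what these two buttons record, consider the sketch below. The `Incident` class and its field names are hypothetical, and keeping only the first acknowledgement is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    title: str
    created_at: datetime
    acknowledged_by: Optional[str] = None
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None

    def acknowledge(self, user: str, now: Optional[datetime] = None) -> None:
        """Record who acknowledged the incident and when."""
        # Assumption: only the first acknowledgement is kept.
        if self.acknowledged_at is None:
            self.acknowledged_by = user
            self.acknowledged_at = now or datetime.now(timezone.utc)

    def resolve(self, now: Optional[datetime] = None) -> float:
        """Record the resolution time and return the duration in seconds."""
        self.resolved_at = now or datetime.now(timezone.utc)
        return (self.resolved_at - self.created_at).total_seconds()
```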

Duration

Monitron automatically calculates how long each incident lasted:

Start: When the incident was created
End: When it was resolved (or current time if still open)
Duration: Human-readable format (e.g., "2 hours 15 minutes")
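
The calculation can be sketched like this. `format_duration` is a hypothetical helper, and the exact wording Monitron produces (for example, whether it shows seconds or days) may differ; the sketch only covers hours and minutes.

```python
from datetime import datetime, timezone
from typing import Optional


def format_duration(start: datetime, end: Optional[datetime] = None) -> str:
    """Render an incident duration such as '2 hours 15 minutes'."""
    # A still-open incident is measured up to the current time.
    end = end or datetime.now(timezone.utc)
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour{'s' if hours != 1 else ''}")
    if minutes or not parts:
        parts.append(f"{minutes} minute{'s' if minutes != 1 else ''}")
    return " ".join(parts)
```

Pass timezone-aware datetimes for both arguments; when `end` is omitted the function falls back to the current UTC time, which matches the "current time if still open" rule.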

🤖 AI-Powered Incident Management

If AI features are enabled, you get extra superpowers:

Feature	What It Does
AI Root Cause	Click to get an AI analysis of why the incident happened
AI Postmortem	Auto-generate a blameless postmortem report (for resolved incidents)
AI Status Draft	Get a public-facing status update drafted by AI

See the AI Features section for details.

💡 Tips

Keep incidents updated — Add incident updates as you learn more. Your team and status page subscribers see these.
Use severity correctly — Reserve "Emergency" for true emergencies. Alert fatigue is real!
Review resolved incidents — Use AI Postmortem or manual review to learn from incidents and prevent recurrence.
