
📖 Understanding Incidents

When something goes wrong with a monitored service, Monitron creates an incident. This page explains how incidents work and how to manage them.


🔄 Incident Lifecycle

Monitor Check Fails
         │
         ▼
Fail Threshold Met? ──── No ──→ Wait for next check
         │
        Yes
         │
         ▼
┌─────────────────┐
│  INVESTIGATING  │ ← Incident created, notifications sent
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   IDENTIFIED    │ ← Team acknowledges and identifies cause
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   MONITORING    │ ← Fix applied, watching for stability
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    RESOLVED     │ ← Confirmed fixed, incident closed
└─────────────────┘
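
The lifecycle above can be read as a small forward-only state machine. The sketch below is illustrative only: the `Status` enum, `next_status`, and `can_transition` are hypothetical names, not part of Monitron, and the rule that an incident never moves backwards is an assumption. Skipping ahead is allowed here because automatic recovery resolves an incident directly.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "investigating"
    IDENTIFIED = "identified"
    MONITORING = "monitoring"
    RESOLVED = "resolved"


# Forward order taken from the lifecycle diagram.
LIFECYCLE = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def next_status(current: Status) -> Optional[Status]:
    """Return the status that follows `current`, or None once resolved."""
    index = LIFECYCLE.index(current)
    if index + 1 < len(LIFECYCLE):
        return LIFECYCLE[index + 1]
    return None


def can_transition(current: Status, target: Status) -> bool:
    """Assumption: an incident may skip ahead (for example, straight to
    RESOLVED on automatic recovery) but never moves backwards."""
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)
```

For example, `can_transition(Status.INVESTIGATING, Status.RESOLVED)` is true under these assumptions, while moving from `MONITORING` back to `IDENTIFIED` is rejected.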

📊 Incident Statuses

Status | Meaning | Color
🔴 Investigating | Just detected, team is looking into it | Red
🟡 Identified | Root cause found, working on a fix | Yellow
🔵 Monitoring | Fix applied, monitoring for stability | Blue
🟢 Resolved | Confirmed fixed, all clear | Green

⚡ Severity Levels

Severity | When to Use | Color
ℹ️ Info | Minor issues, degraded performance | Blue
⚠️ Warning | Partial outage, potential impact | Yellow
🔴 Critical | Full service outage | Red
🔥 Emergency | Multiple services affected, major impact | Red (flashing)

🤖 Automatic Incidents

Monitron automatically creates incidents when:

  1. A monitor goes down — After the fail threshold is met (configurable), an incident is created with:

    • Title: "{Monitor Name} is down"
    • Severity: Critical
    • Status: Investigating
    • The error message from the check
  2. A heartbeat is missed — If no ping is received within interval + grace period

  3. A monitor recovers — The incident is automatically resolved
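
The three rules above could be sketched roughly as follows. This is not Monitron's implementation: `MonitorTracker`, `Incident`, and `heartbeat_missed` are hypothetical names, the default threshold of 3 is made up, and counting *consecutive* failures is an assumption about how the fail threshold works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    title: str
    severity: str
    status: str
    error: str


class MonitorTracker:
    """Hypothetical helper that tracks failed checks for one monitor."""

    def __init__(self, name: str, fail_threshold: int = 3):
        self.name = name
        self.fail_threshold = fail_threshold  # assumed default, configurable
        self.consecutive_failures = 0
        self.open_incident: Optional[Incident] = None

    def record_check(self, ok: bool, error: str = "") -> Optional[Incident]:
        """Return the incident that was just created or resolved, else None."""
        if ok:
            self.consecutive_failures = 0
            if self.open_incident is not None:
                # Rule 3: recovery resolves the open incident automatically.
                self.open_incident.status = "Resolved"
                resolved, self.open_incident = self.open_incident, None
                return resolved
            return None

        self.consecutive_failures += 1
        threshold_met = self.consecutive_failures >= self.fail_threshold
        if threshold_met and self.open_incident is None:
            # Rule 1: create the incident with the documented defaults.
            self.open_incident = Incident(
                title=f"{self.name} is down",
                severity="Critical",
                status="Investigating",
                error=error,
            )
            return self.open_incident
        return None


def heartbeat_missed(last_ping: datetime, interval: timedelta,
                     grace: timedelta, now: datetime) -> bool:
    """Rule 2: missed if no ping arrived within interval + grace period."""
    return now - last_ping > interval + grace
```

With `fail_threshold=2`, the first failed check returns nothing, the second returns a new "Investigating" incident, and the next successful check returns that same incident with its status set to "Resolved".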


✋ Manual Incidents

You can also create incidents manually from the Incidents page:

  1. Click "New Incident"
  2. Fill in the title, severity, and description
  3. Optionally link to a monitor
  4. The incident appears on your dashboard and status pages

👆 Managing Incidents

Acknowledge

Click the "Acknowledge" button to signal that someone is looking at the issue. This records who acknowledged it and when.

Resolve

Click "Resolve" to close the incident. This records the resolution time and calculates the total duration.

Duration

Monitron automatically calculates how long each incident lasted:

  • Start: When the incident was created
  • End: When it was resolved (or current time if still open)
  • Duration: Human-readable format (e.g., "2 hours 15 minutes")
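
As a rough illustration of the calculation described above, the sketch below takes the end time as the resolution time, or the current time for an open incident, and formats the difference. The function names and the exact wording of edge cases (such as "less than a minute") are assumptions, not Monitron's actual output.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def incident_duration(started_at: datetime,
                      resolved_at: Optional[datetime] = None,
                      now: Optional[datetime] = None) -> timedelta:
    """End is the resolution time, or the current time if still open.

    Timestamps are assumed to be timezone-aware (UTC)."""
    end = resolved_at or now or datetime.now(timezone.utc)
    return end - started_at


def humanize(duration: timedelta) -> str:
    """Format a duration as e.g. '2 hours 15 minutes' (seconds are dropped)."""
    total_minutes = int(duration.total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    parts = []
    for value, unit in ((days, "day"), (hours, "hour"), (minutes, "minute")):
        if value:
            parts.append(f"{value} {unit}{'' if value == 1 else 's'}")
    # Assumed wording for very short incidents.
    return " ".join(parts) or "less than a minute"
```

An incident created at 10:00 and resolved at 12:15 the same day yields "2 hours 15 minutes", matching the example format above.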

🤖 AI-Powered Incident Management

If AI features are enabled, you get extra superpowers:

Feature | What It Does
AI Root Cause | Click to get an AI analysis of why the incident happened
AI Postmortem | Auto-generate a blameless postmortem report (for resolved incidents)
AI Status Draft | Get a public-facing status update drafted by AI

See the AI Features section for details.


💡 Tips

  • Keep incidents updated — Add incident updates as you learn more. Your team and status page subscribers see these.
  • Use severity correctly — Reserve "Emergency" for true emergencies. Alert fatigue is real!
  • Review resolved incidents — Use AI Postmortem or manual review to learn from incidents and prevent recurrence.