The Incidents page groups related problems — anomalies, error spikes, latency regressions — into a single timeline so you can investigate them as a unit. When Zespan detects that multiple signals are happening at the same time and likely have a common cause, it opens an incident automatically. You can also open incidents manually for any issue you want to track to resolution.
[Figure: Zespan Incidents page showing active and resolved incidents with a timeline]
Incident detection and management require the Team or Scale plan.

How incidents are created

Zespan creates incidents automatically when:
  • An anomaly is detected (cost spike, error rate jump, latency surge) and a related alert rule fires within 15 minutes
  • Three or more traces with the same error code occur within a 10-minute window
  • An AI analysis detects a pattern across multiple traces that it classifies as a systemic issue
You can also open an incident manually from the New incident button on the Incidents page, or by clicking Open incident on any anomaly card in AI Features.
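To make the second trigger concrete, here is a minimal sketch of a sliding-window check for "three or more traces with the same error code within a 10-minute window". It is an illustration only, not Zespan's implementation: the `(timestamp, error_code)` trace shape, the function name, and the choice to treat a gap of exactly 10 minutes as inside the window are all assumptions.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # window length from the rule above
THRESHOLD = 3                   # "three or more traces"


def find_repeated_errors(traces):
    """Return the error codes that reach THRESHOLD traces within WINDOW.

    `traces` is an iterable of (timestamp, error_code) pairs. This shape is
    hypothetical and is not Zespan's trace schema.
    """
    recent = defaultdict(deque)  # error_code -> timestamps still inside the window
    triggered = set()
    for ts, code in sorted(traces):
        window = recent[code]
        window.append(ts)
        # Drop timestamps older than the 10-minute window that ends at ts.
        while ts - window[0] > WINDOW:
            window.popleft()
        if len(window) >= THRESHOLD:
            triggered.add(code)
    return triggered


# Made-up sample data for illustration.
t0 = datetime(2024, 1, 1, 12, 0)
traces = [
    (t0, "rate_limit"),
    (t0 + timedelta(minutes=4), "rate_limit"),
    (t0 + timedelta(minutes=9), "rate_limit"),
    (t0, "timeout"),
    (t0 + timedelta(minutes=30), "timeout"),
]
print(find_repeated_errors(traces))  # {'rate_limit'}
```

Only `rate_limit` qualifies here: its three traces fall within nine minutes, while the two `timeout` traces are 30 minutes apart.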

Incident states

Each incident moves through three states:
State | Meaning
Detecting | The platform is still gathering signals; the issue may still be developing
Active | Confirmed ongoing issue requiring attention
Resolved | The issue is no longer occurring
Zespan auto-resolves an incident when its driving metric returns to baseline for 30 consecutive minutes. You can also resolve an incident manually.
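The auto-resolve rule can be pictured as a clock that restarts whenever the metric leaves baseline. The sketch below is one way to express that, not Zespan's actual logic: the sample format, the `auto_resolve_time` name, and the definition of "at baseline" as `value <= baseline` are assumptions.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(minutes=30)  # from the auto-resolve rule above


def auto_resolve_time(samples, baseline):
    """Return the timestamp at which the incident would auto-resolve, or None.

    `samples` is a time-ordered list of (timestamp, value) pairs for the
    driving metric. A sample counts as "at baseline" when value <= baseline,
    which is an assumption made for this sketch.
    """
    quiet_since = None
    for ts, value in samples:
        if value > baseline:
            quiet_since = None  # metric left baseline, so restart the clock
            continue
        if quiet_since is None:
            quiet_since = ts
        if ts - quiet_since >= QUIET_PERIOD:
            return ts
    return None


# Made-up error-rate samples, one every 10 minutes.
t0 = datetime(2024, 1, 1, 12, 0)
values = [9, 2, 2, 8, 2, 2, 2, 2]
samples = [(t0 + timedelta(minutes=10 * i), v) for i, v in enumerate(values)]
print(auto_resolve_time(samples, baseline=5))  # 2024-01-01 13:10:00
```

The spike at 12:30 resets the clock, so the 30 consecutive quiet minutes run from 12:40 to 13:10.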

The incidents list

The main Incidents page shows a table of all incidents, most recent first. Each row shows:
  • Severity: high, medium, or low, based on impact to cost or error rate
  • Title: a one-sentence summary of the issue
  • Affected metric: which measurement is out of range
  • State: detecting, active, or resolved
  • Duration: how long the incident has been open
  • Models affected: which model(s) are involved
Use the State filter to see only active incidents, or the Severity filter to focus on high-severity issues.

Incident detail

Click any incident to open its detail view. The detail view shows:

Timeline

A chronological feed of all signals related to this incident:
  • Anomaly detections with their explanation and severity
  • Alert rule triggers with threshold and actual value
  • Related error traces (grouped by error code)
  • Configuration changes that may have contributed (from the audit log)
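Conceptually, the timeline is a merge of several time-ordered feeds into one. The sketch below shows that idea with `heapq.merge` from the Python standard library. The `(timestamp, kind, detail)` event shape and the sample events are invented for illustration and do not reflect Zespan's data model.

```python
import heapq
from datetime import datetime


def build_timeline(*feeds):
    """Merge several time-ordered feeds into one chronological feed.

    Each feed is a list of (timestamp, kind, detail) tuples that is already
    sorted by timestamp. This event shape is hypothetical.
    """
    return list(heapq.merge(*feeds, key=lambda event: event[0]))


# Made-up events, one feed per signal type.
anomalies = [(datetime(2024, 1, 1, 12, 5), "anomaly", "error rate jump")]
alerts = [(datetime(2024, 1, 1, 12, 9), "alert", "error rate above threshold")]
config_changes = [(datetime(2024, 1, 1, 11, 58), "config", "model setting changed")]

for ts, kind, detail in build_timeline(anomalies, alerts, config_changes):
    print(ts.isoformat(), kind, detail)
```

Reading the merged feed in order puts the configuration change ahead of the anomaly and the alert, which is the kind of sequence the timeline is meant to surface.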

Root cause analysis

When an incident is created, Zespan automatically runs AI root cause analysis across the correlated traces. The results appear at the top of the incident detail view under AI Root Cause Analysis.
Field | Description
Root cause | Single-sentence diagnosis of what caused the failure
Contributing factors | Up to 5 conditions that made the problem worse
Suggested fix | The single highest-impact action to resolve the issue
Prevention tips | Up to 5 steps to avoid the same failure class in future
Confidence | high, medium, or low
The analyzer uses a fast path for common patterns (rate limits, timeouts, context length exceeded, provider 5xx errors) and falls back to AI analysis for novel cases. Trace content is sanitized before analysis — only error details, model metadata, span structure, and latency are used. If analysis is still running, the section shows “Analysis in progress…” and updates automatically when complete.
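The following sketch illustrates the two ideas in the paragraph above: stripping a trace down to the permitted fields, then trying rule-based matching before deferring to AI analysis. It is not the analyzer's real code. The field names, labels, and regular expressions are assumptions chosen for the example.

```python
import re

# Hypothetical field names for the data the analyzer is allowed to see.
ALLOWED_FIELDS = {"error", "model", "spans", "latency_ms"}

# Hypothetical patterns for the four fast-path categories named above.
FAST_PATH = [
    ("rate_limit", re.compile(r"\b429\b|rate.?limit", re.I)),
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.I)),
    ("context_length_exceeded", re.compile(r"context.?length|maximum context", re.I)),
    ("provider_5xx", re.compile(r"\b5\d\d\b")),
]


def sanitize(trace):
    """Keep only the allowed fields, dropping prompt and response content."""
    return {key: value for key, value in trace.items() if key in ALLOWED_FIELDS}


def classify(error_message):
    """Return a fast-path label, or None when the case needs AI analysis."""
    for label, pattern in FAST_PATH:
        if pattern.search(error_message):
            return label
    return None


trace = {"error": "HTTP 429 Too Many Requests", "prompt": "secret text", "latency_ms": 120}
clean = sanitize(trace)
print(clean)                     # {'error': 'HTTP 429 Too Many Requests', 'latency_ms': 120}
print(classify(clean["error"]))  # rate_limit
```

A message that matches none of the patterns returns `None`, which stands in for the hand-off to AI analysis.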

Affected traces

A filtered list of the specific traces associated with this incident, with error codes, latency, and cost. Click any trace to open it in the flame graph view.

Resolution notes

A free-text field where you can record what you found and how you fixed it. Resolution notes are preserved after the incident closes and appear in the incident history. Use them to build a runbook for recurring issues.

Resolving an incident

Click Mark resolved on any active incident. You’ll be prompted to add a brief resolution note. Once resolved:
  • The incident state changes to “Resolved”
  • The resolution timestamp and note are saved
  • If the same underlying issue recurs, Zespan opens a new incident automatically rather than reopening the closed one
Always add a resolution note before closing an incident. Future incidents of the same type will surface the previous resolution notes so your on-call engineer can see what worked before.

Auto-remediation

On the Scale plan, you can configure auto-remediation rules that ZespanPilot applies automatically when an incident of a specific type opens. For example:
  • “When a GPT-4o error spike incident opens, switch to GPT-4o-mini”
  • “When a rate-limit incident opens, reduce sample rate to 50%”
Auto-remediation rules are configured in Settings → Incidents and require explicit approval from an Owner-level user to activate.
Auto-remediation applies SDK config changes without human confirmation. Enable it only for actions you have validated as safe to apply automatically. All auto-remediation actions are logged in the audit trail.
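As a mental model, an auto-remediation rule pairs an incident type with an action, runs only once an Owner has approved it, and leaves an audit entry each time it fires. The sketch below captures that flow under stated assumptions: the class names, the incident type strings, and the action strings are hypothetical and are not the format used in Settings → Incidents.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationRule:
    incident_type: str               # hypothetical type label
    action: str                      # hypothetical action label
    approved_by_owner: bool = False  # rules stay inactive until approved


@dataclass
class Remediator:
    rules: list
    audit_log: list = field(default_factory=list)

    def on_incident_opened(self, incident_type):
        """Apply every approved rule for this incident type and log each one."""
        applied = []
        for rule in self.rules:
            if rule.incident_type == incident_type and rule.approved_by_owner:
                applied.append(rule.action)
                self.audit_log.append((incident_type, rule.action))
        return applied


remediator = Remediator(rules=[
    RemediationRule("gpt-4o-error-spike", "switch model to gpt-4o-mini", approved_by_owner=True),
    RemediationRule("rate-limit", "reduce sample rate to 50%"),  # not yet approved
])
print(remediator.on_incident_opened("gpt-4o-error-spike"))  # ['switch model to gpt-4o-mini']
print(remediator.on_incident_opened("rate-limit"))          # []
print(remediator.audit_log)
```

The unapproved rate-limit rule does nothing, and only the applied action appears in the audit log.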

Notifications

Incidents use the same notification channels as alert rules: email for Pro/Team, and webhooks for Scale. If an incident opens while an alert for the same metric is active, Zespan deduplicates the two so you receive a single notification.
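The deduplication behavior can be sketched as a check against the set of metrics that currently have an active alert. This is an assumed model of the behavior described above, not Zespan's notification code. The `Notifier` class and its method names are hypothetical.

```python
class Notifier:
    """Toy notifier that suppresses incident notifications already covered by an alert."""

    def __init__(self):
        self.active_alert_metrics = set()
        self.sent = []  # (source, metric) pairs, in send order

    def alert_fired(self, metric):
        self.active_alert_metrics.add(metric)
        self.sent.append(("alert", metric))

    def alert_cleared(self, metric):
        self.active_alert_metrics.discard(metric)

    def incident_opened(self, metric):
        """Send an incident notification unless an alert for the metric is active."""
        if metric in self.active_alert_metrics:
            return False  # the alert notification already covers this metric
        self.sent.append(("incident", metric))
        return True


notifier = Notifier()
notifier.alert_fired("error_rate")
print(notifier.incident_opened("error_rate"))  # False
print(notifier.incident_opened("latency"))     # True
print(notifier.sent)  # [('alert', 'error_rate'), ('incident', 'latency')]
```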