Incidents — detect, correlate, and resolve LLM issues

The Incidents page groups related problems (anomalies, error spikes, latency regressions) into a single timeline so you can investigate them as a unit. When Zespan detects several signals that occur together and likely share a common cause, it opens an incident automatically. You can also open an incident manually for any issue you want to track to resolution.

Figure: Zespan Incidents page showing active and resolved incidents with a timeline.

Incident detection and management require the Team or Scale plan.

How incidents are created

Zespan creates incidents automatically when:

An anomaly is detected (cost spike, error rate jump, latency surge) and a related alert rule fires within 15 minutes
Three or more traces with the same error code occur within a 10-minute window
An AI analysis detects a pattern across multiple traces that it classifies as a systemic issue
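
The two time-based rules above can be sketched as follows. This is only an illustration of the windowing logic, not Zespan's implementation: the function names and the `(timestamp, error_code)` input shape are hypothetical, and the sketch assumes that both windows are inclusive and that the alert may fire on either side of the anomaly.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ERROR_WINDOW = timedelta(minutes=10)        # "within a 10-minute window"
ERROR_THRESHOLD = 3                         # "three or more traces"
CORRELATION_WINDOW = timedelta(minutes=15)  # anomaly + alert rule

def error_codes_to_open(traces):
    """Return error codes seen in >= 3 traces inside any 10-minute window.

    traces: iterable of (timestamp, error_code) pairs (hypothetical shape).
    """
    by_code = defaultdict(list)
    for ts, code in traces:
        by_code[code].append(ts)

    hits = set()
    for code, stamps in by_code.items():
        stamps.sort()
        # Compare each timestamp with the one ERROR_THRESHOLD - 1 places later.
        for i in range(len(stamps) - ERROR_THRESHOLD + 1):
            if stamps[i + ERROR_THRESHOLD - 1] - stamps[i] <= ERROR_WINDOW:
                hits.add(code)
                break
    return hits

def anomaly_correlates(anomaly_at, alert_fired_at):
    """True if the alert fired within 15 minutes of the anomaly (assumed: either side)."""
    return abs(alert_fired_at - anomaly_at) <= CORRELATION_WINDOW
```

For example, three `429` traces at minutes 0, 4, and 9 would qualify, while three `500` traces spread over 40 minutes would not.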

You can also open an incident manually from the New incident button on the Incidents page, or by clicking Open incident on any anomaly card in AI Features.

Incident states

Each incident moves through three states:

State	Meaning
Detecting	The platform is still gathering signals — the issue may still be developing
Active	Confirmed ongoing issue requiring attention
Resolved	The issue is no longer occurring

Zespan auto-resolves an incident when its driving metric returns to baseline for 30 consecutive minutes. You can also resolve an incident manually.
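
As a rough sketch of the auto-resolve rule, the check below scans metric samples for a 30-minute unbroken run at baseline. The names are hypothetical, and the sketch assumes that "baseline" can be expressed as an upper bound (`baseline_max`) and that samples arrive sorted by time; Zespan's actual baseline model is not described here.

```python
from datetime import datetime, timedelta

RESOLVE_AFTER = timedelta(minutes=30)  # "30 consecutive minutes"

def auto_resolve_time(samples, baseline_max):
    """Return the timestamp at which the incident would auto-resolve, or None.

    samples: list of (timestamp, value) pairs sorted by timestamp.
    baseline_max: assumed upper bound of the metric's normal range.
    """
    streak_start = None
    for ts, value in samples:
        if value <= baseline_max:
            if streak_start is None:
                streak_start = ts          # first in-baseline sample of a new run
            if ts - streak_start >= RESOLVE_AFTER:
                return ts                  # run has lasted 30 minutes
        else:
            streak_start = None            # any excursion resets the run
    return None
```

A single out-of-range sample restarts the 30-minute count, which is why a metric that briefly dips back to normal does not close the incident.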

The incidents list

The main Incidents page shows a table of all incidents, ordered by most recent. Each row shows:

Severity — high, medium, or low, based on impact to cost or error rate
Title — a one-sentence summary of the issue
Affected metric — which measurement is out of range
State — detecting, active, or resolved
Duration — how long the incident has been open
Models affected — which model(s) are involved

Use the State filter to see only active incidents, or the Severity filter to focus on high-severity issues.

Incident detail

Click any incident to open its detail view. The detail view shows:

Timeline

A chronological feed of all signals related to this incident:

Anomaly detections with their explanation and severity
Alert rule triggers with threshold and actual value
Related error traces (grouped by error code)
Configuration changes that may have contributed (from the audit log)

Root cause analysis

When an incident is created, Zespan automatically runs AI root cause analysis across the correlated traces. The results appear at the top of the incident detail view under AI Root Cause Analysis.

Field	Description
Root cause	Single-sentence diagnosis of what caused the failure
Contributing factors	Up to 5 conditions that made the problem worse
Suggested fix	The single highest-impact action to resolve the issue
Prevention tips	Up to 5 steps to avoid the same failure class in future
Confidence	`high`, `medium`, or `low`

The analyzer uses a fast path for common patterns (rate limits, timeouts, context length exceeded, provider 5xx errors) and falls back to AI analysis for novel cases. Trace content is sanitized before analysis — only error details, model metadata, span structure, and latency are used. If analysis is still running, the section shows “Analysis in progress…” and updates automatically when complete.
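
The fast-path-then-fallback flow described above might look like the sketch below. Everything in it is illustrative: the `RootCause` class mirrors the fields in the table, but the regular expressions, the trace keys, the suggested-fix wording, and the choice to mark fast-path matches as `high` confidence are all assumptions rather than Zespan behavior.

```python
import re
from dataclasses import dataclass, field

@dataclass
class RootCause:
    root_cause: str
    contributing_factors: list = field(default_factory=list)  # up to 5
    suggested_fix: str = ""
    prevention_tips: list = field(default_factory=list)       # up to 5
    confidence: str = "low"                                   # high | medium | low

# Hypothetical patterns for the four common cases; order matters.
FAST_PATH = [
    (re.compile(r"rate.?limit|\b429\b", re.I),
     "Provider rate limit exceeded", "Add backoff or lower request concurrency"),
    (re.compile(r"time(d)?.?out", re.I),
     "Request timed out", "Raise the timeout or shorten the request"),
    (re.compile(r"context.?length|maximum context", re.I),
     "Context length exceeded", "Trim or summarize the prompt"),
    (re.compile(r"\b5\d\d\b"),
     "Provider 5xx error", "Retry with backoff or fail over to another model"),
]

# Assumed trace keys standing in for error details, model metadata,
# span structure, and latency. Anything else (such as prompts) is dropped.
ALLOWED_KEYS = {"error", "model", "spans", "latency_ms"}

def sanitize(trace):
    return {k: v for k, v in trace.items() if k in ALLOWED_KEYS}

def analyze(trace, fallback):
    """Try the fast path first; hand the sanitized trace to `fallback` otherwise."""
    clean = sanitize(trace)
    message = str(clean.get("error", ""))
    for pattern, cause, fix in FAST_PATH:
        if pattern.search(message):
            return RootCause(cause, suggested_fix=fix, confidence="high")
    return fallback(clean)
```

Here `fallback` stands in for the AI analysis step; it only ever receives the sanitized trace.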

Affected traces

A filtered list of the specific traces associated with this incident, with error codes, latency, and cost. Click any trace to open it in the flame graph view.

Resolution notes

A free-text field where you can record what you found and how you fixed it. Resolution notes are preserved after the incident closes and appear in the incident history. Use them to build a runbook for recurring issues.

Resolving an incident

Click Mark resolved on any active incident. You’ll be prompted to add a brief resolution note. Once resolved:

The incident state changes to “Resolved”
The resolution timestamp and note are saved
If the same underlying issue recurs, a new incident opens automatically — it does not reopen the closed one

Always add a resolution note before closing an incident. Future incidents of the same type will surface the previous resolution notes so your on-call engineer can see what worked before.
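
The recurrence behavior can be modeled with a small in-memory sketch. The `Incident` and `IncidentLog` classes are hypothetical, and matching "the same type" by a plain `kind` string is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    id: int
    kind: str                                # assumed stand-in for "incident type"
    state: str = "active"
    resolution_note: Optional[str] = None
    previous_notes: list = field(default_factory=list)

class IncidentLog:
    def __init__(self):
        self.incidents = []

    def open(self, kind):
        # A recurrence always gets a new record; resolved incidents stay closed.
        history = [
            i.resolution_note
            for i in self.incidents
            if i.kind == kind and i.state == "resolved" and i.resolution_note
        ]
        incident = Incident(id=len(self.incidents) + 1, kind=kind,
                            previous_notes=history)
        self.incidents.append(incident)
        return incident

    def resolve(self, incident, note):
        incident.state = "resolved"
        incident.resolution_note = note
```

Usage: resolving a `rate-limit` incident with a note and then opening another `rate-limit` incident yields a second record whose `previous_notes` contains that note, while the first record remains resolved.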

Auto-remediation

On the Scale plan, you can configure auto-remediation rules that ZespanPilot applies automatically when an incident of a specific type opens. For example:

“When a GPT-4o error spike incident opens, switch to GPT-4o-mini”
“When a rate-limit incident opens, reduce sample rate to 50%”

Auto-remediation rules are configured in Settings → Incidents and require explicit approval from an Owner-level user to activate.

Once a rule is active, auto-remediation applies SDK config changes without human confirmation. Only enable it for actions you have validated as safe to apply automatically. All auto-remediation actions are logged in the audit trail.
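
To make the approval and audit requirements concrete, here is a minimal sketch of how such rules could be represented and applied. It does not reflect the real rule format in Settings → Incidents: the `RemediationRule` class, the incident type strings, and the action keys (`set_model`, `set_sample_rate`) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RemediationRule:
    incident_type: str
    action: dict
    approved_by_owner: Optional[str] = None  # stays None until an Owner approves

# The two examples from the text, expressed as hypothetical rule records.
RULES = [
    RemediationRule("gpt-4o-error-spike", {"set_model": "gpt-4o-mini"},
                    approved_by_owner="owner@example.com"),
    RemediationRule("rate-limit", {"set_sample_rate": 0.5}),  # not yet approved
]

def remediate(incident_type, rules, audit_log):
    """Return the actions to apply for a newly opened incident.

    Unapproved rules are skipped; every applied action is appended to audit_log.
    """
    applied = []
    for rule in rules:
        if rule.incident_type != incident_type:
            continue
        if rule.approved_by_owner is None:
            continue  # inactive until an Owner-level user approves it
        applied.append(rule.action)
        audit_log.append({
            "incident_type": incident_type,
            "action": rule.action,
            "approved_by": rule.approved_by_owner,
        })
    return applied
```

In this sketch the unapproved `rate-limit` rule never fires, which mirrors the requirement that an Owner-level user activate each rule before it can act.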

Notifications

Incidents trigger the same notification channels as alert rules: email for Pro/Team, and webhooks for Scale. If an incident opens while an alert for the same metric is active, Zespan deduplicates the two so you receive a single notification.
