Pulse

Is anything wrong right now?

The live feed

Every finished run lands on the activity feed within a second. Failures show with their error signature; latency shows next to each run. Click through for the full trace (tool calls, LLM calls, logs, parent and child runs).

P95 latency rolls up per agent across 24h, 7d, and 30d windows. A sudden shift usually traces to a model swap, a longer prompt, or a provider slowing down.
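As a rough illustration of the rollup, the sketch below computes a nearest-rank P95 for one agent over the three windows. The function names, the `(finished_at, latency_ms)` tuple shape, and the nearest-rank method are assumptions made for the example; the percentile method Pulse actually uses is not documented here.

```python
from datetime import datetime, timedelta, timezone

# Window names mirror the ones above; the dict itself is illustrative.
WINDOWS = {
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}

def p95(values):
    """Nearest-rank 95th percentile; returns None for an empty list."""
    if not values:
        return None
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

def p95_by_window(runs, now):
    """runs: list of (finished_at, latency_ms) tuples for a single agent."""
    return {
        name: p95([latency for finished_at, latency in runs
                   if now - finished_at <= span])
        for name, span in WINDOWS.items()
    }

now = datetime(2024, 1, 31, tzinfo=timezone.utc)
runs = [
    (now - timedelta(hours=1), 100),
    (now - timedelta(days=3), 900),
    (now - timedelta(days=20), 5000),
]
print(p95_by_window(runs, now))
```

Comparing the 24h figure against the 7d and 30d figures is one simple way to spot the kind of sudden shift described above.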

Schedule monitoring

Set the agent's cron schedule and tolerance window (default: 5 minutes past the expected time) on its settings page. A scheduler worker checks every minute, and a missed window fires an alert. Catching the silent failure of a daily summary or nightly batch is the single most common reason teams sign up.
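The per-minute check can be pictured roughly as follows. This is a minimal sketch under stated assumptions: `is_missed` and its parameters are hypothetical names, the expected time is assumed to have been derived from the cron expression elsewhere, and a run counts only if it finished at or after the expected time.

```python
from datetime import datetime, timedelta, timezone

# Default tolerance from the text above: 5 minutes past the expected time.
DEFAULT_TOLERANCE = timedelta(minutes=5)

def is_missed(expected_at, last_run_at, now, tolerance=DEFAULT_TOLERANCE):
    """Return True once the tolerance window has closed with no qualifying run.

    expected_at: when the cron schedule says the run should have happened
    last_run_at: finish time of the most recent run, or None if there is none
    """
    if now <= expected_at + tolerance:
        return False  # window still open, too early to alert
    return last_run_at is None or last_run_at < expected_at

expected = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)

# Four minutes in: still inside the window, no alert.
print(is_missed(expected, None, expected + timedelta(minutes=4)))
# Six minutes in with no run: the window was missed.
print(is_missed(expected, None, expected + timedelta(minutes=6)))
```

A worker that calls a check like this once a minute would notice a missed window at most about a minute after it closes.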

Failure clustering

Failed runs are fingerprinted by error type plus the first 100 characters of the error message. Identical fingerprints cluster, so a provider outage that took down 400 runs surfaces as one row. Cluster IDs are stable across time.
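A fingerprint of this kind could be sketched as below. Only the inputs (error type plus the first 100 characters of the message) come from the description above; the SHA-256 hash, the 16-character ID, and the `error_type`, `message`, and `id` field names are assumptions for illustration.

```python
import hashlib

def fingerprint(error_type, message, prefix_len=100):
    """Derive a deterministic cluster ID from the error type and message prefix."""
    key = f"{error_type}\n{message[:prefix_len]}"
    # Hashing is an assumption; any deterministic function of the key would do.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

def cluster(failed_runs):
    """Group run IDs by fingerprint. Each run is a dict with hypothetical keys."""
    clusters = {}
    for run in failed_runs:
        fp = fingerprint(run["error_type"], run["message"])
        clusters.setdefault(fp, []).append(run["id"])
    return clusters

# 400 failures from one provider outage collapse into a single cluster.
outage = [
    {"id": i, "error_type": "ProviderTimeout", "message": "upstream timed out"}
    for i in range(400)
]
print(len(cluster(outage)))
```

Because the ID depends only on the error type and message prefix, the same failure recurring later maps to the same cluster, which is consistent with IDs staying stable over time.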

Alert routes

Routes are configurable per team and overridable per agent:

Route                                 Tier
Slack (incoming webhook)              All
Email                                 All
Generic webhook                       All
PagerDuty (Events API v2)             Team and up
Microsoft Teams (incoming webhook)    Team and up
Linear (creates an issue)             Team and up

Typical pattern: critical agents to PagerDuty, everything else to Slack.
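One way to picture how team defaults, agent overrides, and tier limits combine is the sketch below. The route keys, the `team_or_higher` flag, and the choice to have an agent override replace (rather than merge with) the team routes are all assumptions for the example, not documented behavior.

```python
# Route keys are hypothetical short names for the routes in the table above.
ALL_TIERS = {"slack", "email", "webhook"}
TEAM_AND_UP = {"pagerduty", "msteams", "linear"}

def resolve_routes(team_routes, agent_override=None, team_or_higher=False):
    """Pick the routes for an agent's alerts.

    An agent override, when present, is assumed to replace the team routes.
    Routes the account's tier does not include are dropped.
    """
    chosen = agent_override if agent_override is not None else team_routes
    allowed = ALL_TIERS | (TEAM_AND_UP if team_or_higher else set())
    return [route for route in chosen if route in allowed]

# Typical pattern: the team default is Slack, a critical agent goes to PagerDuty.
print(resolve_routes(["slack"], team_or_higher=True))
print(resolve_routes(["slack"], agent_override=["pagerduty"], team_or_higher=True))
```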