Pulse
Is anything wrong right now?
The live feed
Every finished run lands on the activity feed within a second. Failures show with their error signature; latency shows next to each run. Click through for the full trace (tool calls, LLM calls, logs, parent and child runs).
P95 latency rolls up per agent across 24h, 7d, and 30d windows. Sudden moves are usually a model swap, a longer prompt, or a provider slowing down.
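As a rough illustration of the rollup, the sketch below computes a nearest-rank P95 per agent for each of the three windows. The `(agent, finished_at, latency_ms)` tuple shape, the function names, and the choice of the nearest-rank method are all assumptions made for the example; the actual percentile method Pulse uses is not stated here.

```python
from datetime import datetime, timedelta, timezone

# Window labels match the ones named above; the dict itself is illustrative.
WINDOWS = {
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def p95(values):
    """Nearest-rank 95th percentile; returns None for an empty input."""
    if not values:
        return None
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_window(runs, agent, now):
    """Roll up P95 latency for one agent across each window.

    runs: iterable of (agent_name, finished_at, latency_ms) tuples
          (a hypothetical record shape, not a documented schema).
    """
    result = {}
    for label, span in WINDOWS.items():
        cutoff = now - span
        latencies = [
            ms
            for name, finished_at, ms in runs
            if name == agent and cutoff <= finished_at <= now
        ]
        result[label] = p95(latencies)
    return result
```

Comparing the 24h value against the 7d and 30d values is one simple way to spot the sudden moves described above.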
Schedule monitoring
Set the agent's cron schedule and tolerance window (default: 5 minutes past the expected time) on its settings page. A scheduler worker checks every minute; a missed window fires an alert. Silent failure of a daily summary or nightly batch is the single most common reason teams sign up.
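The check the scheduler worker performs can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the expected time is passed in directly rather than derived from the cron expression, and a run counts toward the window only if it landed between the expected time and the end of the tolerance window. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the text above: 5 minutes past the expected time.
DEFAULT_TOLERANCE = timedelta(minutes=5)


def missed_window(expected_at, last_run_at, now, tolerance=DEFAULT_TOLERANCE):
    """Return True when the tolerance window has closed without a run.

    expected_at: when the schedule says the run should have happened
    last_run_at: timestamp of the agent's most recent run, or None
    now:         the time of this check (the worker runs it every minute)
    """
    deadline = expected_at + tolerance
    if now <= deadline:
        # Window still open: too early to call it missed.
        return False
    # Assumption: only a run inside [expected_at, deadline] satisfies the window.
    ran_in_window = (
        last_run_at is not None and expected_at <= last_run_at <= deadline
    )
    return not ran_in_window
```

A worker loop would call `missed_window` once per minute for each scheduled agent and fire an alert the first time it returns `True`.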
Failure clustering
Failed runs are fingerprinted by error type plus the first 100 characters of the error message. Runs with identical fingerprints cluster together, so a provider outage that took down 400 runs surfaces as one row. Cluster IDs are stable over time.
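A minimal sketch of this fingerprinting, assuming the cluster ID is a truncated SHA-256 hash of the error type and message prefix. The hashing step, the separator, the ID length, and the `(run_id, error_type, message)` tuple shape are assumptions for illustration; only the "error type plus first 100 characters" rule comes from the description above.

```python
import hashlib
from collections import defaultdict

PREFIX_LENGTH = 100  # first 100 characters of the error message


def fingerprint(error_type, message):
    """Derive a deterministic ID from the error type and message prefix."""
    key = f"{error_type}:{message[:PREFIX_LENGTH]}"
    # Hypothetical choice: a 16-hex-character slice of SHA-256.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def cluster(failed_runs):
    """Group run IDs by fingerprint.

    failed_runs: iterable of (run_id, error_type, message) tuples.
    Returns a dict mapping fingerprint -> list of run IDs.
    """
    clusters = defaultdict(list)
    for run_id, error_type, message in failed_runs:
        clusters[fingerprint(error_type, message)].append(run_id)
    return dict(clusters)
```

Because the ID depends only on the error type and message prefix, the same failure produces the same ID whenever it recurs, which is one way to get IDs that stay stable over time.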
Alert routes
Configurable per team, overridable per agent:
| Route | Tier |
|---|---|
| Slack (incoming webhook) | All |
| | All |
| Generic webhook | All |
| PagerDuty (Events API v2) | Team and up |
| Microsoft Teams (incoming webhook) | Team and up |
| Linear (creates an issue) | Team and up |
Typical pattern: critical agents to PagerDuty, everything else to Slack.
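The team-default-with-agent-override behavior could be modeled along these lines. This is a sketch only: the route keys, the agent names, the `team_or_higher` flag, and the decision to silently drop routes that a plan does not include are all assumptions, not documented behavior.

```python
# Route keys are hypothetical short names for the routes in the table above.
ALL_TIERS = {"slack", "webhook"}
TEAM_AND_UP = {"pagerduty", "teams", "linear"}


def resolve_routes(team_default, agent_overrides, agent, team_or_higher):
    """Pick the alert routes for one agent.

    team_default:    list of route keys configured for the whole team
    agent_overrides: dict mapping agent name -> list of route keys
    team_or_higher:  True if the account is on the Team tier or above
    """
    # A per-agent override replaces the team default entirely (assumption).
    routes = agent_overrides.get(agent, team_default)
    allowed = ALL_TIERS | TEAM_AND_UP if team_or_higher else ALL_TIERS
    # Assumption: routes not included in the plan are dropped, not rejected.
    return [route for route in routes if route in allowed]


# The typical pattern: critical agents to PagerDuty, everything else to Slack.
team_default = ["slack"]
agent_overrides = {"billing-agent": ["pagerduty"]}
```

With this configuration, `resolve_routes(team_default, agent_overrides, "billing-agent", True)` selects the PagerDuty override, while any agent without an override falls back to the team's Slack route.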