000 pulse · symptom

Your AI agent stopped running. Nobody got paged.

Scheduled agents fail differently. They go quiet. The cron was removed during a refactor, the upstream API changed shape, the prompt slot started returning empty strings. No exception, no 500, no paging signal. The team usually finds out from a customer asking why their dashboard is stale, eleven days in.

001 why nothing fires

The signal is the absence of a signal. APM watches presence.

Sentry needs an exception. Datadog needs a latency spike. Both watch the system while it's running and report when something goes wrong. Neither watches the gap when a scheduled run simply doesn't happen. There's no exception to catch, no request to time out, and no log line to alert on.

Common patterns

  • The deploy that quietly removed the schedule. A refactor moved the cron definition into a config file that wasn't included in the deploy. The agent process is healthy; nothing's wrong with it. It just isn't being called.
  • The upstream that changed shape. The agent depends on an internal API for customer metadata. That API now returns empty strings for unknown fields. The agent runs, produces plausible-looking output, and routes everything to "unknown".
  • The queue that backed up. The schedule fires, the worker picks it up, the worker is stuck on a single long-running job from yesterday. The next ten runs are queued and not running, but the metric for "runs in progress" is 1.
002 what catches it within the grace window

Schedule freshness, with the agent's context attached.

The minimal instrumentation is a cron expression on each scheduled agent, plus a tolerance window. A scheduler worker checks every minute; when an expected run is past the window, the agent's alert route fires with the last-successful-run context so the on-call engineer can act without opening a browser.

A real alert payload

POST · your-slack-webhook missed schedule alert
{
  "alert": "schedule_missed",
  "agent": "daily-summary",
  "team": "production",
  "expected_at": "2026-05-16T09:00:00Z",
  "tolerance_minutes": 5,
  "now": "2026-05-16T09:06:14Z",
  "last_successful_run": {
    "id": "run_018f3a2b9c1d7e8fa4b9c2d7e8f1a3b6",
    "finished_at": "2026-05-15T09:00:42Z",
    "duration_ms": 41892,
    "cost_gbp": 0.149
  },
  "dashboard_url": "https://app.agentping.io/agents/daily-summary"
}

Same payload across the generic webhook route. PagerDuty gets a protocol-shaped variant; Slack and Teams get formatted message blocks; email gets a digest. The on-call engineer sees the missed run plus the last good run's context in one notification.

The three things to instrument

  • Schedule freshness. A cron expression and a tolerance window per scheduled agent. Missed runs page on the agent's alert route within the grace period.
  • Run status. Every run lands as success, failed, timeout, or cancelled, with its error captured, so a bad deploy is obvious at a glance.
  • Run-level traces. Tool calls, LLM calls, parent and child runs, plus the actual output. When the alert fires, the on-call can step through the last successful run, not just read a notification that something went wrong.
003 frequently asked
My monitoring tools watch latency and exceptions. Why didn't they catch this?
APM tools (Datadog, New Relic, Sentry) watch what happens when something runs. If a scheduled agent simply stops running, those tools have nothing to watch. The signal is the absence of a signal, and you need a tool that knows the agent was supposed to run to detect it.
How is schedule monitoring different from a generic cron monitor?
A generic cron monitor (Cronitor, Healthchecks.io) checks heartbeats. That covers the "did the process run" question. Agent observability adds the agent's context: cost, latency, output, and the full run trace. When a heartbeat misses, you don't just know it missed; you have the last successful run attached.
What's a tolerance window and why does it matter?
Scheduled jobs don't fire to the millisecond. A tolerance window (default 5 minutes past expected) prevents flapping alerts when a queue is slightly backed up. The alert fires only if the window elapses without a run landing, so on-call gets paged for real misses, not for normal jitter.
Can I see the run that came right before the missed one?
Yes. The alert payload includes the last successful run's ID, finished_at, cost, and duration, plus a link to the dashboard. The on-call engineer can step through the previous run's trace (tool calls, LLM calls, output) to see whether anything was already trending toward failure before the silence started.
What does this look like for user-triggered agents that don't run on a schedule?
Schedule monitoring doesn't apply, but everything else does. Every user-triggered run still lands with its status, cost, latency, and full trace, and a failed run can route to your alert channel. You lose the "did it run on time" signal; you keep the rest.
004 read next

How AgentPing implements schedule monitoring.

Pulse is the live-monitoring half of AgentPing. The features page walks through the live activity feed, schedule freshness, run traces, and the alert routes. The docs go one level deeper.

Pulse features Pulse docs What is AI agent observability?