How to catch a silent AI agent before your customer does

Scheduled agents fail differently. They go quiet. No exception, no 500, no paging signal. Heartbeats and expected cadences close the gap between "stopped running" and "we know about it".

Matt King

May 5, 2026

A team we worked with last quarter had a nightly summariser. It ran at 02:00, pulled the day's customer events, produced a digest, and emailed it to the operations channel by 02:30. It had been doing this for eight months.

On the 14th, they shipped a deploy that touched a config file the summariser depended on. The job did not crash. It did not throw. It simply stopped firing, because the cron entry that triggered it had silently become invalid.

The operations channel got no digest on the 15th. Nobody noticed; weekends were always quieter. By the 18th, the channel had been empty for four nights, which still looked normal because the team was used to glancing past the digest. On the 25th, eleven days in, a support engineer asked in the channel where the nightly summary had gone. That was the first time anyone realised.

Eleven days. No exception, no 500, no alert. As far as cron was concerned, the job had simply stopped existing, and the team's monitoring had no way to notice the absence of work. The pattern is common enough that we wrote a dedicated landing page on it: "Your AI agent stopped running, nobody got paged".

Why traditional uptime monitoring misses this

Uptime tools watch endpoints. They send a GET, they expect a 200, they alert if they get anything else. That model assumes the thing you are monitoring is reachable on a URL, and that reachability is what you care about. For a web service, this is fine.

For a scheduled agent, it is the wrong shape entirely. There is no endpoint to hit. The agent is something a scheduler invokes; there is nothing to ping from the outside. You can monitor the scheduler itself, but the scheduler reporting healthy tells you nothing about whether the job inside it ran. The scheduler can be perfectly healthy with no jobs registered.

The signal you actually need is the inverse. You want to be told when an expected event did not happen. That is the problem heartbeats solve.

Heartbeats: a one-line health check

A heartbeat is the smallest possible operational event. It says: "this agent ran, at this time, with this outcome". One line of curl can produce one:

curl -X POST https://api.agentping.io/v1/ping \
  -H "Authorization: Bearer ping_eu_018f4c2a..." \
  -d '{"status":"ok"}'

That is the entire integration. The endpoint accepts the ping, records the timestamp as received_at, stores the status, and returns 200. From the agent's side it is a sub-second HTTP call that adds nothing meaningful to the job's runtime.
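In a real job, the ping should never be able to stall or fail the work it reports on. The sketch below wraps the same curl call in a shell function with a per-attempt timeout and a few retries. The function names and the AGENTPING_TOKEN environment variable are assumptions made for illustration; the endpoint and payload are the ones shown above.

```shell
#!/bin/sh
# Sketch: a heartbeat helper that cannot stall the job it reports on.
# AGENTPING_TOKEN is a hypothetical environment variable holding the ping token.

# Build the JSON body for a given status ("ok" or "fail").
heartbeat_payload() {
  printf '{"status":"%s"}' "$1"
}

# Send one heartbeat. --max-time bounds each attempt, --retry re-attempts
# transient failures, and -f turns HTTP errors into a non-zero exit code.
send_heartbeat() {
  curl -fsS --max-time 10 --retry 3 \
    -X POST https://api.agentping.io/v1/ping \
    -H "Authorization: Bearer ${AGENTPING_TOKEN}" \
    -d "$(heartbeat_payload "$1")"
}
```

A job would typically end with send_heartbeat ok || true, so that a monitoring outage does not change the job's own exit code.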

What makes the heartbeat useful is not the ping itself but the schedule attached to the agent. When you register the agent in AgentPing, you declare its expected cadence: "every five minutes", "every Monday 09:00", a cron expression, an ISO interval. The system then watches for pings against that schedule and pages you when one is missing.

The heartbeat tells the system the job ran. The schedule tells the system when it should have. The difference between the two is what you actually want to know.

Setting an expected cadence

The cadence lives on the agent, not on the ping. You set it once, in the dashboard or via the API, and every subsequent ping is evaluated against it. A typical setup looks like this:

nightly-summariser: expected 0 2 * * * (cron, 02:00 daily). Grace period 30 minutes.
lead-enrichment: expected */5 * * * * (every 5 minutes). Grace period 60 seconds.
weekly-report: expected 0 9 * * 1 (Mondays 09:00). Grace period 2 hours.

Grace, also called a tolerance window, is the slack you give the job before declaring it missing. A five-minute job whose ping is 60 seconds late is probably fine. A five-minute job that has been silent for 11 minutes has missed two windows and is not. The default grace is 10% of the interval, capped at one hour: 30 seconds for a five-minute job, and the full hour for a daily one. For jobs where lateness is unusual (a daily summariser that always finishes inside 30 minutes), you tighten it.
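The default rule is simple enough to check by hand. The following is a minimal sketch of the arithmetic, assuming intervals are expressed in seconds and that fractions round down; the function name is hypothetical and the rounding behaviour is an assumption, not something the product documents here.

```shell
#!/bin/sh
# Sketch: default grace = 10% of the interval, capped at one hour.
# Takes the interval in seconds and prints the grace in seconds.
# Integer division (rounding down) is an assumption.
default_grace_seconds() {
  grace=$(( $1 / 10 ))
  if [ "$grace" -gt 3600 ]; then
    grace=3600
  fi
  echo "$grace"
}

default_grace_seconds 300      # five-minute job: prints 30
default_grace_seconds 86400    # daily job: 8640 is capped, prints 3600
default_grace_seconds 604800   # weekly job: 60480 is capped, prints 3600
```

Against those defaults, the 30-minute grace on nightly-summariser is a tightening, while the 60-second grace on lead-enrichment and the 2-hour grace on weekly-report are deliberate loosenings.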

When a window closes without a ping, the schedule checker fires an alert. The alert carries the agent id, the missed window, the timestamp of the last successful run, and a link to the agent's page. The on-call engineer opens the page, sees the gap in the run timeline, and either re-triggers the job or escalates.
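To make the "window closes without a ping" decision concrete, here is a rough sketch of the comparison, with every time expressed as Unix epoch seconds. It illustrates the idea only and is not AgentPing's actual checker; the function name and the three state labels are hypothetical.

```shell
#!/bin/sh
# Sketch of the missed-window decision, with all times as Unix epoch seconds.
# Arguments: expected run time, grace in seconds, last ping's received_at, now.
# Prints "ok", "missed", or "pending".
window_state() {
  expected=$1; grace=$2; last_ping=$3; now=$4
  deadline=$(( expected + grace ))
  if [ "$last_ping" -ge "$expected" ] && [ "$last_ping" -le "$deadline" ]; then
    echo ok          # a ping arrived inside the window
  elif [ "$now" -gt "$deadline" ]; then
    echo missed      # the window closed with no ping
  else
    echo pending     # still inside the grace period
  fi
}
```

With an expected time of 1000000 and a 1800-second grace, a checker looking at a day-old ping reports pending until 1001800 and missed from 1001801 onwards.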

Ping tokens vs API keys

There are two credential types for getting data into AgentPing. The full team API key (apk_...) authenticates the SDK and any code that legitimately writes for multiple agents. The per-agent ping token (ping_...) is scoped to a single agent and carries no other permissions.

For cron heartbeats, always use a ping token. The example above used one:

Authorization: Bearer ping_eu_018f4c2a...

The reason is operational hygiene. A ping token in a crontab or a curl line in a CI script will end up in shell history, in log files, in screenshots passed around for debugging. If it leaks, the blast radius is one agent. An apk_ key in the same place would let an attacker write fake heartbeats and runs for every agent on the team.

The rule of thumb: anything that ends up in a URL, a shell command, or a third-party tool's config (n8n, GitHub Actions, Make, Zapier) is a ping token. Anything that runs inside your own application code, behind your own secrets management, can be the team API key.
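One way to enforce the rule of thumb is a small guard in any script that sends heartbeats. The sketch below relies only on the ping_ and apk_ prefixes described above; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: accept only a per-agent ping token in a shell context.
# Relies only on the credential prefixes described in the text.
is_ping_token() {
  case "$1" in
    ping_*) return 0 ;;   # per-agent token: acceptable in crontabs and CI
    *)      return 1 ;;   # team API key (apk_...) or anything else: reject
  esac
}
```

A script could call is_ping_token "$TOKEN" before its curl line and exit with an error if the check fails, so that a team API key pasted into a crontab by mistake is caught before it is ever sent.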

Alert routes

When a missed run fires, you want it to go where on-call lives. AgentPing supports five destinations per agent:

Slack: an incoming webhook, channel of your choice. Best for non-urgent or business-hours alerts.
PagerDuty: integration key on a service. Best for customer-facing agents where someone needs to wake up.
Microsoft Teams: a Teams incoming webhook. Same role as Slack for organisations on Teams.
Email: one or more addresses. Useful as a redundant secondary route.
Webhook: a generic JSON POST to a URL you control. Useful when you want to drive an internal tool, a status page, or a custom Opsgenie or VictorOps integration.

Routes are set per agent, per alert type. A typical setup splits by severity: missed-run alerts on customer-facing agents go to PagerDuty; missed-run alerts on internal jobs go to Slack; spend baseline alerts go to a finance Slack channel; quality drift alerts go to engineering. You can wire all of them in five minutes and adjust as you learn which alerts actually wake people up usefully.

A worked example: cron with the two-line pattern

The cleanest pattern for a cron heartbeat is the success-or-failure pair. You wrap the real job, ping with status=ok on success, ping with status=fail on failure. Both branches end up in the same agent's timeline.

0 2 * * * if /usr/local/bin/nightly-summariser.sh; then curl -fsS -X POST https://api.agentping.io/v1/ping -H "Authorization: Bearer ping_eu_018f4c2a..." -d '{"status":"ok"}'; else curl -fsS -X POST https://api.agentping.io/v1/ping -H "Authorization: Bearer ping_eu_018f4c2a..." -d '{"status":"fail"}'; fi

That single crontab entry covers three outcomes. If the job exits non-zero, the else branch sends fail; the dashboard shows a run with status=fail and the configured alert route fires. If the job runs cleanly, the then branch sends ok and the run shows green. If the cron does not fire at all (the case from the opening story), no ping arrives, the schedule checker notices the missed window, and the missed-run alert fires. Two details matter here. The entry sits on one physical line because cron does not join backslash-continued lines. And it uses if/else rather than chaining the job with && and ||, because with the chained form a failed ok ping would also send fail for a job that succeeded.
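As the number of scheduled jobs grows, the same logic can live in a small wrapper script instead of being repeated in every crontab entry. The following is a sketch under stated assumptions: the function names and the AGENTPING_TOKEN variable are hypothetical, and the dry-run behaviour when the token is unset is an addition for local testing, not a product feature.

```shell
#!/bin/sh
# Sketch: run a command, report ok/fail, and preserve the command's exit code.
# AGENTPING_TOKEN is a hypothetical variable; when it is unset the payload is
# printed instead of sent, which makes the wrapper easy to try locally.
send_heartbeat() {
  payload=$(printf '{"status":"%s"}' "$1")
  if [ -z "${AGENTPING_TOKEN:-}" ]; then
    echo "dry-run: $payload"
    return 0
  fi
  curl -fsS --max-time 10 -X POST https://api.agentping.io/v1/ping \
    -H "Authorization: Bearer ${AGENTPING_TOKEN}" \
    -d "$payload"
}

run_with_heartbeat() {
  job_status=0
  "$@" || job_status=$?           # run the real job and remember its exit code
  if [ "$job_status" -eq 0 ]; then
    send_heartbeat ok || true     # a failed ping must not fail the job
  else
    send_heartbeat fail || true
  fi
  return "$job_status"
}
```

Saved as a script that ends with run_with_heartbeat "$@", the crontab entry shrinks to the schedule, the wrapper path, and the job path, and the job's real exit code is still visible to cron.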

The team in the opening story now has this exact pattern on every scheduled job. Their last missed-run incident lasted 4 minutes from "cron did not fire" to "PagerDuty paged on-call". Eleven days has become four minutes. The mechanism is one line of curl and an expected cadence on the agent's page. See Pulse features for the full implementation.

If you have scheduled agents in production and no heartbeat coverage, nothing in your code may be broken yet, but the missing monitoring is already the bug. Get started and wire up heartbeat coverage before the next missed run becomes a support ticket.

What is the difference between a heartbeat and a full run?

A heartbeat is a minimal "I ran, here is my status" signal: an agent id, a timestamp, a status of ok or fail, and nothing else. A full run carries inputs, outputs, tokens, cost, and an event timeline. Heartbeats are what you wire into a cron job with one line of curl; runs are what the SDK emits when it wraps an agent. Both land in the same table and share the same scheduling logic, so a team can start with curl heartbeats on day one and upgrade to full SDK runs later without losing history.

How do I monitor a scheduled job that runs less than once a day?

Set the expected cadence on the agent itself. AgentPing accepts a cron expression or a plain interval per agent, so a weekly job declares "expected every Monday 09:00" and you get paged as soon as a Monday window, plus its grace period, closes without a ping. The cadence is independent of how often the agent runs; a job that runs every five minutes and a job that runs once a quarter use the same primitive. The page fires after one missed window, and the grace period is configurable.

Can I have different alert routes per agent?

Yes. Each agent can route to its own destination: a Slack channel, a PagerDuty service key, a Microsoft Teams webhook, an email address, or a generic webhook for anything else. Most teams route customer-facing agents to PagerDuty and internal jobs to Slack. Routes are evaluated per agent per alert type, so the same agent can send spend alerts to finance and missed-run alerts to engineering.

What credential should I use for cron heartbeats?

Use a ping token, not the team API key. Ping tokens are scoped to a single agent and carry no other permissions; if one leaks out of a crontab or a CI log, the blast radius is one agent. The full team API key (the apk_ credential) is for the SDK and any code that legitimately writes for multiple agents. As a rule of thumb, anything that ends up in a URL or a shell command should be a ping_ token.

How quickly does AgentPing alert on a missed run?

The schedule checker runs every minute and compares each agent's expected cadence against the most recent received_at. For a job declared as "every five minutes", a miss is detected within 60 seconds of the expected window closing. For a job with a wider cadence, the grace period is configurable per agent so a five-minute network blip on a daily job does not page you at 03:00. The default grace is 10% of the cadence, capped at one hour.