Cron monitoring is one of those disciplines that feels solved until the night it is not. The job has run cleanly for months, nobody thinks about it, and then it quietly stops, produces no error because it did not run at all, and the absence goes unnoticed until something downstream is visibly broken. Scheduled AI agents inherit every bit of that, and add two wrinkles of their own. This is the full practice.
For the worked incident behind it, see how to catch a silent AI agent before your customer does; here the focus is the operating discipline, job by job.
The core problem: you cannot poll for absence
Uptime monitoring works by hitting an endpoint and expecting a 200. That model assumes the thing you watch is reachable and that reachability is what you care about. A scheduled job breaks both assumptions. There is no endpoint. And the failure you fear is not "it returned an error", it is "it did not run", which produces no signal at all.
You cannot detect nothing by looking for something. The job has to report in when it runs, and your monitor has to know when it was supposed to, so it can notice the report that never came. That inversion, "alert me when an expected event did not happen", is the whole of cron monitoring, and everything below is mechanism for it.
Practice 1: heartbeat every scheduled job
A heartbeat is the smallest operational event: this agent, this time, this status. One line at the end of the job emits one.
curl -fsS -X POST https://api.agentping.io/v1/ping \
-H "Authorization: Bearer ping_eu_018f4c2a..." \
-d '{"status":"ok"}'
The heartbeat on its own proves nothing useful; a ping that arrives tells you the job ran, but a ping that does not arrive could mean the job failed or could mean nothing. What makes it a monitor is the schedule attached to the agent, covered next. Heartbeat first, schedule second, and the intersection is the signal.
Practice 2: declare the expected cadence
The cadence lives on the agent, set once, and every ping is evaluated against it. Use a cron expression or a plain interval:
nightly-digest: expected0 2 * * *, daily at 02:00.lead-enrichment: expected*/5 * * * *, every five minutes.weekly-report: expected0 9 * * 1, Mondays at 09:00.
Now the monitor knows when each job should report, and a window that closes with no ping is a missed run. The cadence is independent of volume; a job that runs every minute and a job that runs once a quarter use the exact same primitive, which is what makes the practice scale across a fleet of mixed schedules.
Practice 3: set the grace window deliberately
Grace, or the tolerance window, is how late a job may be before you call it missing. The mistake is setting one global number. Grace should track each job's normal variance.
A five-minute job that always finishes in seconds has near-zero legitimate variance, so a tight grace of about a minute is right; if it is two minutes late, something is wrong. A daily summariser that sometimes runs long because the input volume varies needs thirty minutes of grace, or you will page on a normal heavy day. A sensible default is 10% of the interval, capped at one hour, then tightened wherever lateness is genuinely abnormal.
Too loose and a real failure sits undetected for hours. Too tight and you train the on-call to ignore the alert, which is worse than not having it. The grace window is where a cron monitor earns or loses the team's trust.
Practice 4: report failure, not just success
The cleanest pattern is the success-or-failure pair. Wrap the real job, ping ok on success, ping fail on failure:
0 2 * * * /usr/local/bin/nightly-digest.sh \
&& curl -fsS -X POST https://api.agentping.io/v1/ping \
-H "Authorization: Bearer ping_eu_018f4c2a..." -d '{"status":"ok"}' \
|| curl -fsS -X POST https://api.agentping.io/v1/ping \
-H "Authorization: Bearer ping_eu_018f4c2a..." -d '{"status":"fail"}'
This single line covers three failure modes at once. If the job crashes, the && short-circuits and the || branch reports fail. If it runs clean, the ok branch reports green. If the cron never fires at all, no ping arrives and the missed-window check catches it. One line, three modes, which is about the best ratio in operations.
Practice 5: use scoped tokens, never the master key
Cron heartbeats leak. Not maybe; eventually. They live in crontabs, CI logs, shell history, and the screenshots people paste into chat while debugging. Plan for the leak by scoping the credential.
Use a ping token, which is bound to a single agent and carries no other permission. If it leaks, the worst anyone can do is write fake heartbeats for that one agent. The full team API key in the same place would let an attacker forge runs and heartbeats for every agent you have. The rule of thumb is simple: anything that ends up in a URL, a shell command, or a third-party tool's config field is a scoped token; only your own application code behind real secrets management gets the master key.
Practice 6: route alerts by severity
A missed run should land where on-call actually lives, and that differs by job. Route per agent, per alert type. Customer-facing jobs go to PagerDuty so someone wakes up; internal jobs go to Slack so someone notices in the morning; a generic webhook drives a status page or an internal tool. The split is by blast radius: who needs to know, and how fast. A nightly internal report missing is a Slack message; a customer-facing enrichment pipeline going dark is a page.
The two AI-specific wrinkles
Everything above applies to any cron job. Agents add two things.
First, absence is not the only failure. A traditional cron job that runs has usually done its work. An agent can fire perfectly on schedule, return valid-looking output, and still be producing garbage, because an upstream change fed it bad inputs and it did what models do and produced plausible output anyway. Schedule monitoring catches the job that stopped; it is completely blind to the job that kept running but stopped working. That second mode needs output checks and sampled scoring on top of the freshness check. The two together cover both halves of silent failure.
Second, a missed run has a tail you may want to recover. When a non-AI cron misses, you usually just wait for the next window. When an agent misses, the work it would have done (the digest it would have sent, the leads it would have enriched) may still need doing. Good tooling carries the inputs the missed run would have taken into the alert, so the on-call can replay or amend rather than just acknowledge. The signal is not only "it failed", it is "here is the work that did not happen, do you want to run it".
How AgentPing implements the practice
In AgentPing every agent has an expected cadence, the schedule checker runs every minute against the most recent received_at, and a missed window fires an alert carrying the agent id, the missed window, the last successful run, and the inputs the next run would have taken, with a link to amend or replay. Schedule freshness is one pillar (Pulse); output scoring is another (Verify), so both halves of silent failure are covered from the same per-run event. See Pulse features for the detail.
If you have scheduled agents in production and no heartbeat coverage, the failure is not in your code yet, but the absence of monitoring already is. Get started and wire up freshness checks before the next missed run becomes a support ticket.