Cron monitoring best practices for scheduled AI agents

Scheduled jobs fail by going quiet, and a quiet job produces no signal at all. The discipline that catches it is older than AI agents, but agents add two wrinkles. Here is the full practice, from heartbeat to grace window to alert route.

Matt King

June 16, 2026 9 min read

Cron monitoring is one of those disciplines that feels solved until the night it is not. The job has run cleanly for months, nobody thinks about it, and then it quietly stops, produces no error because it did not run at all, and the absence goes unnoticed until something downstream is visibly broken. Scheduled AI agents inherit every bit of that, and add two wrinkles of their own. This is the full practice.

For the worked incident behind it, see "How to catch a silent AI agent before your customer does"; here the focus is the operating discipline, job by job.

The core problem: you cannot poll for absence

Uptime monitoring works by hitting an endpoint and expecting a 200. That model assumes the thing you watch is reachable and that reachability is what you care about. A scheduled job breaks both assumptions. There is no endpoint. And the failure you fear is not "it returned an error", it is "it did not run", which produces no signal at all.

You cannot detect nothing by looking for something. The job has to report in when it runs, and your monitor has to know when it was supposed to, so it can notice the report that never came. That inversion, "alert me when an expected event did not happen", is the whole of cron monitoring, and everything below is mechanism for it.

Practice 1: heartbeat every scheduled job

A heartbeat is the smallest operational event: this agent, this time, this status. One line at the end of the job emits one.

curl -fsS -X POST https://api.agentping.io/v1/ping \
  -H "Authorization: Bearer ping_eu_018f4c2a..." \
  -d '{"status":"ok"}'

On its own the heartbeat proves little. A ping that arrives tells you the job ran, but a ping that does not arrive is ambiguous: the job may have failed, or it may simply not have been due. What makes it a monitor is the schedule attached to the agent, covered next. Heartbeat first, schedule second, and the intersection is the signal.
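A slightly more defensive version of the one-liner above can be sketched as a pair of shell functions. This is an illustration under assumptions, not AgentPing's client: heartbeat_payload, send_heartbeat, and the AGENTPING_TOKEN environment variable are hypothetical names, while the endpoint and body shape are the ones shown above. The --max-time flag is there so a slow monitoring call cannot hold up the job itself.

```shell
# Build the JSON body for a heartbeat; only "ok" and "fail" are accepted.
heartbeat_payload() {
  case "$1" in
    ok|fail) printf '{"status":"%s"}' "$1" ;;
    *) echo "heartbeat_payload: unknown status '$1'" >&2; return 1 ;;
  esac
}

# Send one heartbeat. AGENTPING_TOKEN is an assumed environment variable
# holding the per-agent ping token; it is not a name defined by the service.
send_heartbeat() {
  body=$(heartbeat_payload "$1") || return 1
  curl -fsS --max-time 10 -X POST https://api.agentping.io/v1/ping \
    -H "Authorization: Bearer ${AGENTPING_TOKEN}" \
    -d "$body"
}
```

The last line of the job would then be send_heartbeat ok. Rejecting unknown statuses early means a typo in the script shows up as an error in the cron mail rather than as a malformed request.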

Practice 2: declare the expected cadence

The cadence lives on the agent, set once, and every ping is evaluated against it. Use a cron expression or a plain interval:

nightly-digest: expected 0 2 * * *, daily at 02:00.
lead-enrichment: expected */5 * * * *, every five minutes.
weekly-report: expected 0 9 * * 1, Mondays at 09:00.

Now the monitor knows when each job should report, and a window that closes with no ping is a missed run. The cadence is independent of volume; a job that runs every minute and a job that runs once a quarter use the exact same primitive, which is what makes the practice scale across a fleet of mixed schedules.
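For a plain-interval cadence, the missed-window decision reduces to one comparison. The sketch below is a simplification with hypothetical names (is_missed and its arguments); it is not how any particular monitor implements the check, and a cron-expression cadence would need the next scheduled time computed instead of last ping plus interval.

```shell
# is_missed LAST_PING INTERVAL GRACE NOW
# LAST_PING and NOW are Unix epoch seconds; INTERVAL and GRACE are seconds.
# Succeeds (exit 0) when the next expected ping is overdue by more than GRACE.
is_missed() {
  last=$1 interval=$2 grace=$3 now=$4
  deadline=$((last + interval + grace))
  [ "$now" -gt "$deadline" ]
}
```

With a last ping at 1000, a 300-second interval, and 60 seconds of grace, the deadline is 1360, so is_missed 1000 300 60 1361 succeeds and is_missed 1000 300 60 1360 does not. The same function serves a one-minute job and a quarterly one; only the numbers change.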

Practice 3: set the grace window deliberately

Grace, or the tolerance window, is how late a job may be before you call it missing. The mistake is setting one global number. Grace should track each job's normal variance.

A five-minute job that always finishes in seconds has near-zero legitimate variance, so a tight grace of about a minute is right; if it is two minutes late, something is wrong. A daily summariser that sometimes runs long because the input volume varies needs thirty minutes of grace, or you will page on a normal heavy day. A sensible default is 10% of the interval, capped at one hour, then tightened wherever lateness is genuinely abnormal.

Too loose and a real failure sits undetected for hours. Too tight and you train the on-call to ignore the alert, which is worse than not having it. The grace window is where a cron monitor earns or loses the team's trust.
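The default described above (10% of the interval, capped at one hour) is simple enough to write down. A minimal sketch, with default_grace as a hypothetical helper name and all values in whole seconds:

```shell
# default_grace INTERVAL_SECONDS
# Prints 10% of the interval (integer division), capped at 3600 seconds.
default_grace() {
  grace=$(( $1 / 10 ))
  if [ "$grace" -gt 3600 ]; then
    grace=3600
  fi
  echo "$grace"
}
```

Here default_grace 300 prints 30 and default_grace 86400 prints 3600. The per-job figures in the text (about a minute for the five-minute job, thirty minutes for the daily summariser) are deliberate adjustments away from this starting point, which is the reason the default should be overridable per job.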

Practice 4: report failure, not just success

The cleanest pattern is the success-or-failure pair. Wrap the real job, ping ok on success, ping fail on failure:

0 2 * * * if /usr/local/bin/nightly-digest.sh; then s=ok; else s=fail; fi; curl -fsS -X POST https://api.agentping.io/v1/ping -H "Authorization: Bearer ping_eu_018f4c2a..." -d "{\"status\":\"$s\"}"

This single line covers three failure modes at once. If the job exits non-zero, the else branch sets the status to fail. If it runs clean, the then branch sets it to ok. If the cron never fires at all, no ping arrives and the missed-window check catches it. Two details matter. An explicit if/else is safer than chaining job && ping-ok || ping-fail, because in that chain a failed ok ping falls through to the || branch and reports a successful run as fail. And the entry stays on one physical line, because classic cron implementations do not support backslash continuation in a crontab. One line, three modes, which is about the best ratio in operations.
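When the crontab entry gets too long to read, the same pattern can move into a small wrapper. The following is a sketch under assumptions: run_and_report and report_status are hypothetical names, and report_status only prints the status here so the control flow can be followed without a network call. In practice it would be replaced by the curl heartbeat.

```shell
# Placeholder for the real heartbeat call; prints instead of sending.
report_status() {
  echo "status=$1"
}

# run_and_report CMD [ARGS...]
# Runs the job, reports ok or fail from its exit code, then returns that
# same exit code so cron and any caller still see the real result.
run_and_report() {
  if "$@"; then
    rc=0
    report_status ok
  else
    rc=$?
    report_status fail
  fi
  return "$rc"
}
```

The crontab entry then shrinks to the schedule plus one call to the wrapper script with the job as its argument. Returning the job's own exit code keeps cron mail and any surrounding tooling behaving as before.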

Practice 5: use scoped tokens, never the master key

Cron heartbeats leak. Not maybe; eventually. They live in crontabs, CI logs, shell history, and the screenshots people paste into chat while debugging. Plan for the leak by scoping the credential.

Use a ping token, which is bound to a single agent and carries no other permission. If it leaks, the worst anyone can do is write fake heartbeats for that one agent. The full team API key in the same place would let an attacker forge runs and heartbeats for every agent you have. The rule of thumb is simple: anything that ends up in a URL, a shell command, or a third-party tool's config field is a scoped token; only your own application code behind real secrets management gets the master key.
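One way to enforce the rule mechanically is a guard that refuses any credential that does not look like a ping token before the script uses it. This sketch rests on a loud assumption: that ping tokens start with ping_, which is inferred only from the example token shown earlier and is not a documented format. is_ping_token is a hypothetical helper; check the real token format before relying on anything like it.

```shell
# is_ping_token TOKEN
# Succeeds only if TOKEN starts with "ping_" followed by at least one
# character. The "ping_" prefix is an assumption, not a documented format.
is_ping_token() {
  case "$1" in
    ping_?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A heartbeat script could call this first and exit with an error if it fails, so that a team key pasted into the wrong variable never reaches a crontab or a CI log.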

Practice 6: route alerts by severity

A missed run should land where on-call actually lives, and that differs by job. Route per agent, per alert type. Customer-facing jobs go to PagerDuty so someone wakes up; internal jobs go to Slack so someone notices in the morning; a generic webhook drives a status page or an internal tool. The split is by blast radius: who needs to know, and how fast. A nightly internal report missing is a Slack message; a customer-facing enrichment pipeline going dark is a page.
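The routing rule above amounts to a lookup from blast radius to destination. A minimal sketch, where route_for and the class labels customer-facing and internal are illustrative names rather than values from any real configuration schema:

```shell
# route_for BLAST_RADIUS
# Prints the alert destination for a job class; anything unrecognised
# falls back to the generic webhook.
route_for() {
  case "$1" in
    customer-facing) echo "pagerduty" ;;
    internal)        echo "slack" ;;
    *)               echo "webhook" ;;
  esac
}
```

Keeping the mapping in one place makes the split reviewable: adding a job means choosing its class, not inventing a new destination.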

The two AI-specific wrinkles

Everything above applies to any cron job. Agents add two things.

First, absence is not the only failure. A traditional cron job that runs has usually done its work. An agent can fire perfectly on schedule, return valid-looking output, and still be producing garbage, because an upstream change fed it bad inputs and it did what models do and produced plausible output anyway. Schedule monitoring catches the job that stopped; it is completely blind to the job that kept running but stopped working. That second mode needs output checks and sampled scoring on top of the freshness check. The two together cover both halves of silent failure.
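The crudest possible output check is a floor on how much the run produced, applied before the job reports ok. The sketch below assumes the agent writes its output to a file with one item per line; output_status and its arguments are hypothetical. It catches empty or truncated output only, and is no substitute for sampled scoring of what the output actually says.

```shell
# output_status FILE MIN_LINES
# Prints "ok" if FILE exists and has at least MIN_LINES non-empty lines,
# otherwise prints "fail".
output_status() {
  if [ ! -f "$1" ]; then
    echo fail
    return
  fi
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  lines=$(grep -c . "$1" || true)
  if [ "$lines" -ge "$2" ]; then
    echo ok
  else
    echo fail
  fi
}
```

The printed value can be used directly as the status in the heartbeat body, so a run that fired on time but produced nothing reports fail instead of green.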

Second, a missed run has a tail you may want to recover. When a non-AI cron misses, you usually just wait for the next window. When an agent misses, the work it would have done (the digest it would have sent, the leads it would have enriched) may still need doing. Good tooling carries the inputs the missed run would have taken into the alert, so the on-call can replay or amend rather than just acknowledge. The signal is not only "it failed", it is "here is the work that did not happen, do you want to run it".

How AgentPing implements the practice

In AgentPing every agent has an expected cadence, the schedule checker runs every minute against the most recent received_at, and a missed window fires an alert carrying the agent id, the missed window, the last successful run, and the inputs the next run would have taken, with a link to amend or replay. Schedule freshness is one pillar (Pulse); output scoring is another (Verify), so both halves of silent failure are covered from the same per-run event. See Pulse features for the detail.

If you have scheduled agents in production and no heartbeat coverage, nothing may have failed yet, but the monitoring gap is already there. Get started and wire up freshness checks before the next missed run becomes a support ticket.

What is the difference between monitoring a cron job and monitoring a web service?

A web service is monitored from the outside by hitting an endpoint and expecting a 200. A cron job has no endpoint to hit, and its failure mode is absence: it simply does not run. You cannot detect absence by polling, because there is nothing to poll. The job has to report in when it runs (a heartbeat), and your monitor has to know when it should have reported so it can notice when one is missing. The signal you need is the inverse of uptime: alert me when an expected event did not happen.

What is a heartbeat and how do I send one?

A heartbeat is a minimal "I ran, here is my status" ping: an agent id, a timestamp, and a status of ok or fail. You send it with one line of curl at the end of your job, or the SDK sends a richer version automatically when it wraps an agent run. The endpoint records the timestamp as received_at and stores the status. The heartbeat itself is trivial; the value comes from pairing it with an expected schedule so the monitor can flag a missed window.

How do I set the grace period for a scheduled job?

The grace period, or tolerance window, is how late a job can be before you call it missing. Set it relative to the job's normal variance, not a fixed number. A five-minute job that always finishes in seconds gets a tight grace of about a minute; a daily summariser that sometimes runs long gets thirty minutes. A sensible default is 10% of the interval capped at one hour, then tighten it for jobs where lateness is genuinely abnormal. Too loose and you detect failures hours late; too tight and you page on normal jitter.

What credential should a cron job use to report in?

A per-agent scoped token, not your full API key. Cron heartbeats end up in crontabs, CI scripts, shell history, and screenshots, so the credential will leak eventually. A scoped ping token limits the blast radius of that leak to a single agent; a full team key would let anyone who finds it write fake heartbeats and runs for every agent you have. The rule of thumb: anything that lands in a URL or a shell command is a scoped token.

What are the two failure modes I need to catch for scheduled agents?

The job that stopped running, and the job that keeps running but stopped working. The first is caught by a missed-heartbeat alert against the expected schedule. The second is caught by reporting status and sampling output quality, because an agent can fire on time, return valid-looking output, and still be producing garbage. Schedule monitoring alone catches the first and is blind to the second, which is why mature setups pair freshness checks with output checks.