Find queue latency and dead-letter spikes

See which background jobs are backing up, retrying, or landing in dead-letter queues before customers notice stale state.


Difficulty: intermediate · Time to value: 5 minutes

Start from this

Show queue health for the last 24 hours: job latency, retry count, dead-letter count, and the job types causing the backlog.

Why this matters

Queues fail quietly. The UI still loads, but emails arrive late, imports stall, and webhooks lag. This recipe makes queue health visible from the same agent loop as errors and deploys.

What you get

  • p95 job latency by queue and job type
  • Retry and dead-letter spikes
  • First deploy or cron window correlated with the spike
  • The job type to inspect first

Walk through it

You

Are background jobs backing up?

Agent

I’ll compare completed and failed job events by queue and job type over the last 24 hours.

POST /v1/projects/:project_id/analytics/query
{
  "project_id": "default",
  "hogql": "SELECT event, properties.queue AS queue, properties.job_type AS job_type, count() AS count, quantile(0.95)(toFloat(properties.latency_ms)) AS p95_latency_ms, max(toInt(properties.attempt)) AS max_attempt FROM events WHERE event IN ('job_completed','job_failed') AND timestamp > now() - INTERVAL 24 HOUR GROUP BY event, queue, job_type ORDER BY count DESC LIMIT 50"
}

The output

The agent returns a table ranked by event count and identifies whether the problem is latency, retries, or terminal failures. If failures map to cases, it should fetch the top case before suggesting a fix.
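The triage step can be sketched in a few lines of Python. This is a rough illustration, not the agent's actual logic: it assumes the rows come back as dicts with the column names used in the query above, the three thresholds are made-up defaults to tune for your workload, and it treats every job_failed event as a terminal failure, which may not hold if your worker also emits that event for attempts it later retries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed thresholds (not from the recipe); tune them for your workload.
LATENCY_MS_LIMIT = 5 * 60 * 1000  # p95 above 5 minutes counts as a latency problem
RETRY_LIMIT = 3                   # max_attempt above this counts as a retry problem
FAILURE_RATE_LIMIT = 0.05         # more than 5% failed counts as terminal failures


def classify(rows: List[dict]) -> Dict[Tuple[str, str], List[str]]:
    """Label each (queue, job_type) pair with the problems its rows suggest."""
    grouped = defaultdict(
        lambda: {"completed": 0, "failed": 0, "p95": 0.0, "attempt": 0}
    )
    for row in rows:
        group = grouped[(row["queue"], row["job_type"])]
        bucket = "failed" if row["event"] == "job_failed" else "completed"
        group[bucket] += row["count"]
        # Keep the worst latency and highest attempt seen across both event types.
        group["p95"] = max(group["p95"], row["p95_latency_ms"])
        group["attempt"] = max(group["attempt"], row["max_attempt"])

    labels: Dict[Tuple[str, str], List[str]] = {}
    for key, group in grouped.items():
        total = group["completed"] + group["failed"]
        found = []
        if group["p95"] > LATENCY_MS_LIMIT:
            found.append("latency")
        if group["attempt"] > RETRY_LIMIT:
            found.append("retries")
        if total and group["failed"] / total > FAILURE_RATE_LIMIT:
            found.append("terminal_failures")
        labels[key] = found
    return labels
```

A pair with an empty label list looks healthy under these thresholds; a pair with several labels is a reasonable first place to look.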

Setting it up

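Emit job lifecycle events (job_completed and job_failed) from the queue worker, each carrying the queue, job_type, latency_ms, and attempt properties that the query reads. Use stable job type and queue names so results group consistently across deploys.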
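A minimal sketch of that instrumentation, assuming a Python worker. The run_job wrapper and the emit callback are hypothetical names, not part of any SDK: emit stands in for whatever client sends events to your analytics backend. Measuring latency from enqueue time to finish is also an assumption; measure from job start instead if you only care about execution time.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical sink signature: (event_name, properties) -> None.
EventSink = Callable[[str, Dict[str, Any]], None]


def run_job(
    queue: str,
    job_type: str,
    attempt: int,
    enqueued_at: float,
    handler: Callable[[], None],
    emit: EventSink,
    clock: Callable[[], float] = time.time,
) -> bool:
    """Run one job and emit job_completed or job_failed with the queried properties."""
    props: Dict[str, Any] = {"queue": queue, "job_type": job_type, "attempt": attempt}
    try:
        handler()
    except Exception:
        # Latency here is enqueue-to-finish in milliseconds (an assumption).
        props["latency_ms"] = int((clock() - enqueued_at) * 1000)
        emit("job_failed", props)
        return False
    props["latency_ms"] = int((clock() - enqueued_at) * 1000)
    emit("job_completed", props)
    return True
```

Passing the queue and job type as fixed strings from the worker's registration code, rather than deriving them from class names or payloads, helps keep them stable when code is refactored.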

Variations

  • “Only show jobs with p95 latency over 5 minutes.”
  • “Compare the queue after the last deploy.”
  • “Group dead letters by exception fingerprint.”
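The first variation can also be applied client-side to rows the agent already fetched. This is a small sketch under that assumption; slow_jobs is a hypothetical helper, and it expects each row to be a dict with the p95_latency_ms column from the query above.

```python
from typing import List

FIVE_MINUTES_MS = 5 * 60 * 1000


def slow_jobs(rows: List[dict], limit_ms: float = FIVE_MINUTES_MS) -> List[dict]:
    """Keep rows whose p95 latency exceeds limit_ms, slowest first."""
    slow = [row for row in rows if row["p95_latency_ms"] > limit_ms]
    return sorted(slow, key=lambda row: row["p95_latency_ms"], reverse=True)
```

Filtering in the query itself avoids transferring rows you will discard, so prefer that when the result set is large.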

Try this recipe in your own agent.

Ask your agent to adapt the starter prompt to your saved signal map and live events, then run it against your data.

Install agentry.sh/install.md for me
The agent will onboard itself, then your app.