Why this matters
Queues fail quietly. The UI still loads, but emails arrive late, imports stall, and webhooks lag. This recipe makes queue health visible from the same agent loop as errors and deploys.
What you get
- p95 job latency by queue and job type
- Retry and dead-letter spikes
- First deploy or cron window correlated with the spike
- The job type to inspect first
Walk through it
Are background jobs backing up?
I’ll compare completed and failed job events by queue.
POST /v1/projects/:project_id/analytics/query {
project_id: "default",
hogql: "SELECT event, properties.queue AS queue, properties.job_type AS job_type, count() AS count, quantile(0.95)(toFloat(properties.latency_ms)) AS p95_latency_ms, max(toInt(properties.attempt)) AS max_attempt FROM events WHERE event IN ('job_completed','job_failed') AND timestamp > now() - INTERVAL 24 HOUR GROUP BY event, queue, job_type ORDER BY count DESC LIMIT 50"
}
The output
The agent returns a ranked table and identifies whether the issue is latency, retries, or terminal failures. If failures map to cases, it should fetch the top case before suggesting a fix.
Setting it up
Emit job lifecycle events from the queue worker. Use stable job type names and queue names.
Variations
- “Only show jobs with p95 latency over 5 minutes.”
- “Compare the queue after the last deploy.”
- “Group dead letters by exception fingerprint.”