Detect a 429-spike before customers tweet about it

Why this matters

429s are the most underwatched error class. They feel “non-critical” — the request just failed, retry later — so they sit at the bottom of every oncall queue. Then a customer’s integration goes wild and DoSes your Stripe webhook, or a third-party tightens limits without warning, and suddenly half your jobs are silently bouncing.

Datadog can graph 429s if you remember to set it up. Sentry filters them out by default (they’re “not errors”). Neither tells you which upstream is throttling you or which downstream caller is the offender.

Agentry treats them as first-class signal. One prompt compares this week’s 429 rate against last week’s per-endpoint and per-upstream, and surfaces what changed.

What you get

Week-over-week delta in 429 count, by endpoint and by upstream service
The specific customer / API key whose calls are responsible (if it’s one)
A “this just started” timestamp so you can correlate with a third-party change

Walk through it

You

Are we hitting any rate limits unusually often this week? Look for 429 errors trending up vs the prior week.

Agent

Let me compare 429 counts this week vs last week, bucketed by endpoint.

agentry_analytics_query {
  project_id: "default",
  hogql: "SELECT properties.endpoint AS endpoint, properties.upstream AS upstream, countIf(timestamp >= now() - INTERVAL 7 DAY) AS this_week, countIf(timestamp >= now() - INTERVAL 14 DAY AND timestamp < now() - INTERVAL 7 DAY) AS last_week FROM events WHERE event = '$exception' AND properties.status_code = 429 AND timestamp >= now() - INTERVAL 14 DAY GROUP BY endpoint, upstream ORDER BY this_week DESC"
}

{
  "rows": [
    { "endpoint": "POST /api/stripe/webhook", "upstream": "stripe", "this_week": 4218, "last_week": 142 },
    { "endpoint": "GET /api/openai/completion", "upstream": "openai", "this_week": 312, "last_week": 280 },
    { "endpoint": "POST /api/sendgrid/send", "upstream": "sendgrid", "this_week": 88, "last_week": 80 }
  ]
}

Agent

Stripe webhook 429s jumped 30x — 142 last week, 4218 this week. Almost certainly one customer or a regression. Let me check who’s calling.

agentry_analytics_query {
  project_id: "default",
  hogql: "SELECT distinct_id, count() AS calls FROM events WHERE event = '$exception' AND properties.status_code = 429 AND properties.endpoint = 'POST /api/stripe/webhook' AND timestamp >= now() - INTERVAL 7 DAY GROUP BY distinct_id ORDER BY calls DESC LIMIT 5"
}

{
  "rows": [
    { "distinct_id": "acct_megacorp_prod", "calls": 4011 },
    { "distinct_id": "acct_acme_prod", "calls": 102 },
    { "distinct_id": "acct_globex_prod", "calls": 41 }
  ]
}

Agent

95% of the spike is one account: acct_megacorp_prod. They went from ~20 webhook calls a day to ~570, so either they enabled a new integration on Monday or their retry logic has gone exponential. Stripe itself isn’t the problem; this one account is driving calls to the endpoint too often.

Recommended: ping that customer’s CSM, check whether they shipped something Monday, and consider adding a per-account circuit breaker on your side. Want me to draft the customer message?
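
For illustration, here is a minimal sketch of what a per-account circuit breaker could look like. It is hand-rolled and dependency-free; `AccountBreaker`, its option names, and the default thresholds are all hypothetical choices for this example, not part of Agentry or of any library. A production setup would more likely lean on an existing breaker library.

```javascript
// Hypothetical per-account circuit breaker (not an Agentry or library API).
class AccountBreaker {
  constructor({ maxFailures = 5, windowMs = 60_000, cooldownMs = 30_000, now = Date.now } = {}) {
    this.maxFailures = maxFailures; // 429s tolerated inside the window
    this.windowMs = windowMs;       // sliding window length
    this.cooldownMs = cooldownMs;   // how long the breaker stays open
    this.now = now;                 // injectable clock, handy for tests
    this.state = new Map();         // accountId -> { failures: number[], openedAt: number | null }
  }

  entry(accountId) {
    if (!this.state.has(accountId)) {
      this.state.set(accountId, { failures: [], openedAt: null });
    }
    return this.state.get(accountId);
  }

  // True when calls for this account should be short-circuited.
  isOpen(accountId) {
    const e = this.entry(accountId);
    if (e.openedAt === null) return false;
    if (this.now() - e.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and start counting afresh.
      e.openedAt = null;
      e.failures = [];
      return false;
    }
    return true;
  }

  // Record a 429 for this account; open the breaker once the window fills up.
  recordThrottle(accountId) {
    const e = this.entry(accountId);
    const t = this.now();
    e.failures = e.failures.filter((ts) => t - ts < this.windowMs);
    e.failures.push(t);
    if (e.failures.length >= this.maxFailures) e.openedAt = t;
  }
}
```

The idea is to check `isOpen(currentAccount.id)` before making the upstream call and to call `recordThrottle(currentAccount.id)` whenever the response status is 429, so one noisy account stops consuming the shared rate limit while every other account keeps working.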

The output

429-spike report (last 7d vs prior 7d)

ENDPOINT                          UPSTREAM    LAST   THIS    DELTA
POST /api/stripe/webhook          stripe       142  4,218  +2,870%  CRITICAL
GET  /api/openai/completion       openai       280    312     +11%  normal
POST /api/sendgrid/send           sendgrid      80     88     +10%  normal

Root cause for /api/stripe/webhook:
  acct_megacorp_prod (4,011 calls — 95% of spike)
  → Started Mon May 12 ~14:00 UTC
  → Their normal baseline: ~20 calls/day
  → Now: ~570 calls/day, mostly retry-bursts

Recommended actions:
  1. Ping megacorp's CSM today — they likely shipped something Monday
  2. Add per-account circuit-breaker (`opossum` or similar)
  3. Suppress this fingerprint for the remaining accounts if it's downstream noise (see `suppress-noise`)
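
The arithmetic behind a report like this is simple enough to sketch. The functions below derive the week-over-week delta and the top caller's share from rows shaped like the two query results in the walkthrough. The function names and the `SPIKE_THRESHOLD_PCT` cutoff are assumptions made for this example; the recipe does not state how the agent decides what counts as CRITICAL.

```javascript
// Percentage change from last week to this week; null when there is no baseline.
function deltaPercent(lastWeek, thisWeek) {
  if (lastWeek === 0) return null;
  return ((thisWeek - lastWeek) / lastWeek) * 100;
}

// Assumed cutoff for flagging a row, not a documented Agentry value.
const SPIKE_THRESHOLD_PCT = 100;

// Annotate each { endpoint, upstream, this_week, last_week } row with delta and severity.
function classify(rows) {
  return rows.map((r) => {
    const delta = deltaPercent(r.last_week, r.this_week);
    // With no baseline, any 429 at all is treated as a new problem.
    const critical = delta === null ? r.this_week > 0 : delta >= SPIKE_THRESHOLD_PCT;
    return { ...r, delta, severity: critical ? "CRITICAL" : "normal" };
  });
}

// Share of an endpoint's 429s attributable to its single noisiest caller.
function topCallerShare(callerRows, endpointTotal) {
  if (endpointTotal === 0 || callerRows.length === 0) return null;
  const top = callerRows.reduce((a, b) => (b.calls > a.calls ? b : a));
  return { distinct_id: top.distinct_id, share: top.calls / endpointTotal };
}
```

Passing the first query's rows to `classify` and the second query's rows, together with the endpoint's `this_week` total, to `topCallerShare` yields the numbers the report needs.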

Setting it up

The recipe needs status codes on your error or analytics events. Two patterns work:

Pattern A — log 429s as errors with the status code:

const res = await fetch(stripeWebhookUrl, { ... });
if (res.status === 429) {
  await fetch(`https://api.agentry.sh/v1/logs/${PROJECT_ID}/`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.AGENTRY_DSN}`,
      "Content-Type": "application/json",
      "User-Agent": "myapp/1.0",  // REQUIRED — Cloudflare 403s default UAs
    },
    body: JSON.stringify({
      message: "Upstream 429",
      level: "warning",
      user: { id: currentAccount.id },
      tags: {
        status_code: 429,
        endpoint: "POST /api/stripe/webhook",
        upstream: "stripe",
      },
    }),
  });
}

Pattern B — track every outbound call as an analytics event (heavier but richer):

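// Time the outbound call so status and latency can both be reported.
const started = Date.now();
const res = await fetch(stripeWebhookUrl, { ... });
const latency = Date.now() - started;

await fetch(`https://api.agentry.sh/v1/analytics/${PROJECT_ID}/`, {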
  method: "POST",
  headers: { /* same */ "User-Agent": "myapp/1.0" },
  body: JSON.stringify({
    event: "upstream_call",
    distinct_id: currentAccount.id,
    properties: {
      endpoint: "POST /api/stripe/webhook",
      upstream: "stripe",
      status_code: res.status,
      latency_ms: latency,
    },
  }),
});

Either way, surface three keys: status_code, endpoint, and upstream (under tags in Pattern A, under properties in Pattern B). The agent slices on those.
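
To keep those three keys consistent across call sites, one option is a small wrapper around each outbound call. The sketch below is only an illustration: `trackedCall` and its parameters are hypothetical names, and both the HTTP call (`doFetch`) and the reporting function (`report`) are injected so the example assumes nothing about your HTTP client or about how events are sent to Agentry.

```javascript
// Hypothetical wrapper: times an outbound call and reports it with the three
// keys the recipe slices on (status_code, endpoint, upstream).
async function trackedCall({ endpoint, upstream, accountId }, doFetch, report) {
  const started = Date.now();
  const res = await doFetch();
  const event = {
    event: "upstream_call",
    distinct_id: accountId,
    properties: {
      endpoint,
      upstream,
      status_code: res.status,
      latency_ms: Date.now() - started,
    },
  };
  // A reporting failure must never break the real request path.
  try {
    await report(event);
  } catch (err) {
    console.warn("analytics report failed:", err.message);
  }
  return res;
}
```

Here `report` would be whatever function performs the Pattern B POST, and `doFetch` would be a closure such as `() => fetch(stripeWebhookUrl, options)`. The caller gets the original response back unchanged.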

Variations

“Show 429s by upstream service over the last 90 days. Anything trending up monthly?”
“Which of our customers are the top 429-generators against OpenAI this week?”
“Are we hitting our OWN rate limits (per-API-key) for customer X? They opened a support ticket.”
“Daily cron at 9am: alert me if any upstream’s 429 rate doubled overnight.” (uses a Routine)
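
For the last variation, "rate doubled" needs a denominator, which in practice means the Pattern B data (every call tracked, not just the failures). The following sketch shows one way the comparison could work. The input shape, the function name `doubledUpstreams`, and the `MIN_CALLS` noise floor are assumptions for this example and say nothing about how a Routine evaluates the prompt internally.

```javascript
// Assumed noise floor so tiny samples do not trigger alerts.
const MIN_CALLS = 50;

// Each argument maps upstream -> { total, throttled } for one day.
function doubledUpstreams(previous, current) {
  const alerts = [];
  for (const [upstream, now] of Object.entries(current)) {
    const before = previous[upstream];
    if (!before || before.total < MIN_CALLS || now.total < MIN_CALLS) continue;
    const prevRate = before.throttled / before.total;
    const currRate = now.throttled / now.total;
    // A zero baseline cannot "double"; flag any throttling that appears from nothing.
    const doubled = prevRate === 0 ? currRate > 0 : currRate >= 2 * prevRate;
    if (doubled) alerts.push({ upstream, prevRate, currRate });
  }
  return alerts;
}
```

Comparing rates rather than raw counts keeps a plain traffic increase, where calls and 429s grow together, from being reported as a throttling problem.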

Related recipes

Find what broke after the last npm update

Find connection-pool / timeout patterns in errors

Watch staging — page only on genuinely new errors
