Why this matters
429s are among the most under-watched error classes. They feel “non-critical” (the request just failed, retry later), so they sink to the bottom of the on-call queue. Then a customer’s integration goes wild and DoSes your Stripe webhook, or a third party tightens its limits without warning, and suddenly half your jobs are silently bouncing.
Datadog can graph 429s if you remember to set it up. Sentry filters them out by default (they’re “not errors”). Neither tells you which upstream is throttling you or which downstream caller is the offender.
Agentry treats them as first-class signal. One prompt compares this week’s 429 rate against last week’s per-endpoint and per-upstream, and surfaces what changed.
What you get
- Week-over-week delta in 429 count, by endpoint and by upstream service
- The specific customer / API key whose calls are responsible (if a single caller accounts for the change)
- A “this just started” timestamp so you can correlate with a third-party change
Walk through it
Are we hitting any rate limits unusually often this week? Look for 429 errors trending up vs the prior week.
Let me compare 429 counts this week vs last week, bucketed by endpoint.
agentry_analytics_query {
project_id: "default",
hogql: "SELECT properties.endpoint AS endpoint, properties.upstream AS upstream, countIf(timestamp >= now() - INTERVAL 7 DAY) AS this_week, countIf(timestamp >= now() - INTERVAL 14 DAY AND timestamp < now() - INTERVAL 7 DAY) AS last_week FROM events WHERE event = '$exception' AND properties.status_code = 429 AND timestamp >= now() - INTERVAL 14 DAY GROUP BY endpoint, upstream ORDER BY this_week DESC"
}
{
"rows": [
{ "endpoint": "POST /api/stripe/webhook", "upstream": "stripe", "this_week": 4218, "last_week": 142 },
{ "endpoint": "GET /api/openai/completion", "upstream": "openai", "this_week": 312, "last_week": 280 },
{ "endpoint": "POST /api/sendgrid/send", "upstream": "sendgrid", "this_week": 88, "last_week": 80 }
]
}
Stripe webhook 429s jumped 30x — 142 last week, 4218 this week. Almost certainly one customer or a regression. Let me check who’s calling.
agentry_analytics_query {
project_id: "default",
hogql: "SELECT distinct_id, count() AS calls FROM events WHERE event = '$exception' AND properties.status_code = 429 AND properties.endpoint = 'POST /api/stripe/webhook' AND timestamp >= now() - INTERVAL 7 DAY GROUP BY distinct_id ORDER BY calls DESC LIMIT 5"
}
{
"rows": [
{ "distinct_id": "acct_megacorp_prod", "calls": 4011 },
{ "distinct_id": "acct_acme_prod", "calls": 102 },
{ "distinct_id": "acct_globex_prod", "calls": 41 }
]
}
95% of the spike is one account: acct_megacorp_prod. They went from ~20 webhook calls a day to ~570, so either they enabled a new integration on Monday or their retry logic has turned into a retry storm. Stripe itself isn’t the problem; the calls made for this one account are hitting it too often.
Recommended: ping that customer’s CSM, check whether they shipped something Monday, and consider adding a circuit breaker on your side. Want me to draft the customer message?
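The arithmetic behind the agent's two observations (the roughly 30x jump and the 95% share) can be reproduced from the query results. The sketch below is illustrative only: the function names and the 2x spike threshold are assumptions, not part of Agentry, and the rows are copied from the results shown in the walkthrough.

```javascript
// Rows copied from the two query results in the walkthrough.
const endpointRows = [
  { endpoint: "POST /api/stripe/webhook", upstream: "stripe", this_week: 4218, last_week: 142 },
  { endpoint: "GET /api/openai/completion", upstream: "openai", this_week: 312, last_week: 280 },
  { endpoint: "POST /api/sendgrid/send", upstream: "sendgrid", this_week: 88, last_week: 80 },
];
const callerRows = [
  { distinct_id: "acct_megacorp_prod", calls: 4011 },
  { distinct_id: "acct_acme_prod", calls: 102 },
  { distinct_id: "acct_globex_prod", calls: 41 },
];

// Hypothetical threshold: flag anything at or above 2x the prior week.
const SPIKE_RATIO = 2;

function weekOverWeek(row) {
  let ratio;
  if (row.last_week > 0) {
    ratio = row.this_week / row.last_week;
  } else {
    // No 429s last week: any 429 now counts as an unbounded increase.
    ratio = row.this_week > 0 ? Infinity : 1;
  }
  return {
    endpoint: row.endpoint,
    ratio,
    deltaPct: Math.round((ratio - 1) * 100),
    spike: ratio >= SPIKE_RATIO,
  };
}

// Share of one endpoint's 429s this week that belongs to its top caller.
function topCallerShare(callers, endpointTotal) {
  const top = callers.reduce((a, b) => (b.calls > a.calls ? b : a));
  return { distinct_id: top.distinct_id, share: top.calls / endpointTotal };
}

const report = endpointRows.map(weekOverWeek);
const top = topCallerShare(callerRows, endpointRows[0].this_week);
console.log(report, top);
```

Only the Stripe row crosses the assumed threshold; the other two stay near a ratio of 1.1, which is why the agent ignores them.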
The output
429-spike report (last 7d vs prior 7d)
ENDPOINT                     UPSTREAM   LAST   THIS    DELTA
POST /api/stripe/webhook     stripe     142    4,218   +2,870%  CRITICAL
GET /api/openai/completion   openai     280    312     +11%     normal
POST /api/sendgrid/send      sendgrid   80     88      +10%     normal
Root cause for /api/stripe/webhook:
acct_megacorp_prod (4,011 calls — 95% of spike)
→ Started Mon May 12 ~14:00 UTC
→ Their normal baseline: ~20 calls/day
→ Now: ~570 calls/day, mostly retry-bursts
Recommended actions:
1. Ping megacorp's CSM today — they likely shipped something Monday
2. Add a per-account circuit breaker (`opossum` or similar)
3. Suppress this fingerprint for the remaining accounts if it's downstream noise (see `suppress-noise`)
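To make action 2 concrete, here is a minimal hand-rolled sketch of a per-account breaker. It does not use `opossum` or reproduce its API; the class name, method names, and default thresholds are all hypothetical, and it is deliberately simplified (there is no half-open probing state).

```javascript
// Hypothetical per-account breaker: opens after `maxFailures` 429s inside
// `windowMs`, then blocks calls for that account until `cooldownMs` has passed.
class AccountBreaker {
  constructor({ maxFailures = 5, windowMs = 60000, cooldownMs = 30000 } = {}) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.state = new Map(); // accountId -> { failures: number[], openedAt: number | null }
  }

  _entry(accountId) {
    if (!this.state.has(accountId)) {
      this.state.set(accountId, { failures: [], openedAt: null });
    }
    return this.state.get(accountId);
  }

  // True if a call for this account may go out right now.
  allow(accountId, now = Date.now()) {
    const entry = this._entry(accountId);
    if (entry.openedAt === null) return true;
    if (now - entry.openedAt >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and start counting afresh.
      entry.openedAt = null;
      entry.failures = [];
      return true;
    }
    return false;
  }

  // Record a 429 for this account; opens the breaker once the threshold is hit.
  record429(accountId, now = Date.now()) {
    const entry = this._entry(accountId);
    // Keep only failures that are still inside the sliding window.
    entry.failures = entry.failures.filter((t) => now - t < this.windowMs);
    entry.failures.push(now);
    if (entry.failures.length >= this.maxFailures) entry.openedAt = now;
  }
}
```

In use, you would check `allow(currentAccount.id)` before each outbound call and call `record429(currentAccount.id)` whenever the upstream answers 429. Keying the state by account means one noisy customer trips only their own breaker and leaves other accounts unaffected.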
Setting it up
The recipe needs status codes on your error or analytics events. Two patterns work:
Pattern A — log 429s as errors with the status code:
const res = await fetch(stripeWebhookUrl, { ... });
if (res.status === 429) {
  await fetch(`https://api.agentry.sh/v1/logs/${PROJECT_ID}/`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.AGENTRY_DSN}`,
      "Content-Type": "application/json",
      "User-Agent": "myapp/1.0", // REQUIRED: Cloudflare 403s default UAs
    },
    body: JSON.stringify({
      message: "Upstream 429",
      level: "warning",
      user: { id: currentAccount.id },
      tags: {
        status_code: 429,
        endpoint: "POST /api/stripe/webhook",
        upstream: "stripe",
      },
    }),
  });
}
Pattern B — track every outbound call as an analytics event (heavier but richer):
// `res` is the upstream response and `latency` its measured duration in ms,
// both taken from your own outbound call.
await fetch(`https://api.agentry.sh/v1/analytics/${PROJECT_ID}/`, {
  method: "POST",
  headers: { /* same */ "User-Agent": "myapp/1.0" },
  body: JSON.stringify({
    event: "upstream_call",
    distinct_id: currentAccount.id,
    properties: {
      endpoint: "POST /api/stripe/webhook",
      upstream: "stripe",
      status_code: res.status,
      latency_ms: latency,
    },
  }),
});
Either way, the fields to surface (as tags in Pattern A, as properties in Pattern B) are status_code, endpoint, and upstream. The agent can slice on those.
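Because the agent slices on those three fields, it helps to build them in one place so every call site spells them the same way. The helper below is a hypothetical sketch (its name and parameters are not part of any Agentry SDK); it keeps `status_code` numeric to mirror the numeric comparison used in the walkthrough queries.

```javascript
// Hypothetical helper: one place that decides the shape of the sliceable fields.
function upstreamCallFields({ method, path, upstream, status, latencyMs }) {
  const fields = {
    // Same "METHOD /path" format as the endpoint values in the walkthrough.
    endpoint: `${method.toUpperCase()} ${path}`,
    upstream,
    // Coerce to a number so "429" and 429 end up as the same value.
    status_code: Number(status),
  };
  // latency_ms is optional: only Pattern B records it.
  if (latencyMs !== undefined) fields.latency_ms = Math.round(latencyMs);
  return fields;
}
```

The returned object can be passed as `tags` in Pattern A or as `properties` in Pattern B.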
Variations
- “Show 429s by upstream service over the last 90 days. Anything trending up monthly?”
- “Which of our customers are the top 429-generators against OpenAI this week?”
- “Are we hitting our OWN rate limits (per-API-key) for customer X? They opened a support ticket.”
- “Daily cron at 9am: alert me if any upstream’s 429 rate doubled overnight.” (uses a Routine)
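For the last variation, the "doubled overnight" check itself is simple arithmetic. The sketch below shows one way to express it, assuming you already have per-upstream call and 429 counts for two consecutive windows (for example from Pattern B events). The function names, input shape, and the `minThrottled` noise floor are all assumptions; how a Routine actually evaluates the prompt is not described here.

```javascript
// 429 rate for one window: throttled calls divided by all calls.
function throttleRate(window) {
  return window.calls === 0 ? 0 : window.throttled / window.calls;
}

// snapshot: { [upstream]: { previous: { calls, throttled }, current: { calls, throttled } } }
// Returns the upstreams whose 429 rate grew by at least `factor`.
// `minThrottled` (hypothetical) ignores upstreams with only a handful of 429s.
function upstreamsThatDoubled(snapshot, { factor = 2, minThrottled = 10 } = {}) {
  return Object.entries(snapshot)
    .filter(([, w]) =>
      w.current.throttled >= minThrottled &&
      throttleRate(w.current) >= factor * throttleRate(w.previous))
    .map(([upstream]) => upstream);
}
```

Comparing rates instead of raw counts keeps a plain traffic increase from looking like a throttling problem, and the noise floor avoids alerting when an upstream goes from one 429 to three.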