Why this matters
Pool exhaustion fans out. One slow query holds a connection for 30 seconds; the next 50 requests queue waiting; the queue times out; 50 unrelated endpoints throw “could not acquire connection from pool” simultaneously. Your error tracker shows 50 different fingerprints all in the same minute — none of which is the actual cause.
The actual cause is one slow query in one endpoint, often new and innocent-looking. Finding it is forensic work that takes engineers an hour because they have to:
- Notice the burst of pool errors
- Find the slow-query log (different system)
- Correlate timestamps (often clock-skew issues)
- Identify the offending query
Agentry collapses this. The agent groups all the pool/timeout/deadlock-shaped messages and looks for the earliest, slowest, and most-concentrated trigger query — the root cause, not the symptom.
What you get
- A grouped count of DB error patterns:
pool_exhausted,statement_timeout,deadlock,lock_wait_timeout - The query / route most likely causing the cascade (longest-held lock or slowest in the spike window)
- A timeline showing the cascade origin vs the downstream symptom errors
Walk through it
Are there any database-related error patterns trending up? Look for connection timeouts, pool exhaustion, deadlock retries.
Let me grep for DB error fingerprints in the last 24h.
agentry_list_cases {
project_id: "default",
status: "open",
since: "24h",
message_matches: "pool|timeout|deadlock|lock_wait|connection refused",
limit: 20
}
{
"cases": [
{ "id": "c_8e21", "message": "could not acquire connection from pool (timeout 5000ms)",
"event_count": 412, "users_affected": 88, "first_seen_at": 1747350400 },
{ "id": "c_1c92", "message": "statement_timeout: query exceeded 30000ms",
"event_count": 14, "users_affected": 4, "first_seen_at": 1747349200 },
{ "id": "c_4a01", "message": "deadlock detected: process X waits for ShareLock on Y",
"event_count": 8, "users_affected": 3, "first_seen_at": 1747350100 }
]
}
Three DB-shaped fingerprints in the last hour. Note the timing — statement_timeout (c_1c92) appeared 20 minutes BEFORE the pool exhaustion (c_8e21). That’s the cause. Let me get its query.
agentry_get_case { case_id: "c_1c92" }
{
"id": "c_1c92",
"error_type": "QueryError",
"message": "statement_timeout: query exceeded 30000ms",
"stack_top": "src/api/admin/reports.ts:88",
"tags": {
"query_id": "admin_monthly_report",
"route": "GET /api/admin/reports/monthly"
}
}
Confirmed cause. admin_monthly_report — a GET on /api/admin/reports/monthly — has been hitting the 30s statement timeout. Each call holds a pool connection for the full 30 seconds. Your pool is probably size 10. After 10 admins click “view monthly report” the pool is full, and every subsequent request from anywhere in the app fails with c_8e21.
The 412 pool-exhaustion errors hitting 88 users are all collateral damage — actual users trying to use the app while admin queries hog connections.
Two-part fix:
- Today: isolate admin queries to a separate read replica or smaller pool
- This week: optimize
admin_monthly_report— it’s probably scanning a year of events without proper indexing
Want me to pull the actual SQL by checking agentry_get_distinct_id_summary for one of the admin users?
The output
Database cascade analysis (24h)
ROOT CAUSE (earliest, query-level)
c_1c92 statement_timeout: query exceeded 30000ms
src/api/admin/reports.ts:88 (route: GET /api/admin/reports/monthly)
14 events / 4 admin users — started 1747349200 (1h ago)
CASCADE (downstream collateral)
c_8e21 could not acquire connection from pool (timeout 5000ms)
412 events / 88 USERS — started 20min after c_1c92
↗ All paths in the app — random endpoint sample:
/api/projects/list, /api/auth/me, /api/integrations/slack/status
c_4a01 deadlock detected: process X waits for ShareLock on Y
8 events / 3 users — concurrent with the cascade
Recommended actions:
[now] Add ?statement_timeout=5s to admin pool connection string
[this w] Move admin/reporting queries to a read replica (separate pool)
[next] Add btree index on events(account_id, timestamp DESC)
(will let admin_monthly_report finish in <1s)
Setting it up
This recipe gets sharper when DB errors carry query identifiers. Most ORMs let you tag a logical name on a query — propagate it into the error event:
// Wherever you catch DB errors (could be inside a per-route middleware)
try {
const result = await db.query.adminMonthlyReport(...);
} catch (err) {
await fetch(`https://api.agentry.sh/v1/logs/${PROJECT_ID}/`, {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.AGENTRY_DSN}`,
"Content-Type": "application/json",
"User-Agent": "myapp-api/1.0", // REQUIRED — Cloudflare 403s default UAs
},
body: JSON.stringify({
message: err.message,
stack: err.stack,
tags: {
query_id: "admin_monthly_report", // ← LOGICAL name, not full SQL
route: req.route.path,
db_pool: "primary",
},
}),
});
throw err;
}
If you’re on Postgres, also set application_name and statement_timeout per logical pool, and the pg driver’s error messages will carry them. The agent reads them out of the message verbatim.
For deeper visibility, add a periodic sample of pool stats as analytics events:
setInterval(() => {
fetch(`https://api.agentry.sh/v1/analytics/${PROJECT_ID}/`, {
method: "POST",
headers: { /* same */ "User-Agent": "myapp-api/1.0" },
body: JSON.stringify({
event: "db_pool_stats",
distinct_id: process.env.HOSTNAME,
properties: {
pool: "primary",
in_use: pool.totalCount - pool.idleCount,
waiting: pool.waitingCount,
max: 10,
},
}),
});
}, 30_000);
Variations
- “Show me the longest-running queries by route in the last 24h, regardless of whether they errored.”
- “Are deadlocks correlated with any specific deploy? List deploys followed within 1h by deadlock spikes.”
- “What’s the p99 wait-time for a connection in our primary pool? Trend it weekly.”
- “Alert if any pool exceeds 80% utilization for 5+ minutes.” (uses a Routine +
agentry_create_alert)