Why this matters
Flaky tests are the worst kind of bug because they masquerade as not-bugs. You re-run, it passes, you merge, you forget. Two weeks later it’s failing in production for a real reason and you can’t tell, because everyone learned to ignore that test.
The fix is to treat each CI failure as a data point and look for patterns across runs. If all 4 failures trace back to the same useEffect ordering issue, it’s a race condition. If they happen at random points across the test body, it’s flake (network, timing, GC pauses). You can usually tell within a minute, provided the failure data is structured.
Most CI runners throw failure logs in a tar.gz no one reads. Agentry pulls them into a searchable event store so the agent can group them.
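The grouping idea above can be sketched in a few lines. This is a minimal illustration, not how Agentry itself classifies failures: the `location` field, the `minSamples` threshold, and the verdict names are all assumptions made for the example.

```javascript
// Sketch only: `location` is a hypothetical field naming where in the test
// the failure surfaced (for example, the failing assertion or command).
function classifyFailures(failures, minSamples = 3) {
  if (failures.length < minSamples) {
    return { verdict: "insufficient-data", byLocation: {} };
  }

  // Count how many failures share each location.
  const counts = new Map();
  for (const failure of failures) {
    counts.set(failure.location, (counts.get(failure.location) || 0) + 1);
  }

  // One shared location points at a single cause worth investigating
  // (a race, a tight timeout); many locations point at environmental flake.
  const verdict = counts.size === 1 ? "same-location" : "scattered";
  return { verdict, byLocation: Object.fromEntries(counts) };
}

// Hypothetical failure records from four CI runs.
const sameSpot = [
  { run: "r1", location: "cy.get(#checkout-success)" },
  { run: "r2", location: "cy.get(#checkout-success)" },
  { run: "r3", location: "cy.get(#checkout-success)" },
  { run: "r4", location: "cy.get(#checkout-success)" },
];

console.log(classifyFailures(sameSpot).verdict); // "same-location"
```

A "same-location" result only says the failures share a cause; deciding whether that cause is a race or a tight timeout still needs the per-run details.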
What you get
- All recent failures of one test (or test pattern) grouped together
- A verdict: real bug (with root cause) vs flake (with suggested mitigation)
- An option to mark known-flaky tests so they don’t page on each future failure
Walk through it
The Cypress test "checkout flow > submits valid card" failed 4 times this week in CI. Pull the failure payloads, find what’s common, and tell me if it’s a real race or just timing.
Let me find those failures.
agentry_list_cases {
project_id: "default",
environment: "ci",
query: "checkout flow > submits valid card",
status: "open",
since: "7d"
}
{
"cases": [
{ "id": "c_ci0", "fingerprint": "f_ci0",
"message": "AssertionError: expected #checkout-success to be visible",
"event_count": 4,
"last_seen_at": 1747320000,
"first_seen_at": 1746748800 }
]
}
Same fingerprint all 4 times — that’s a clean signal, all failures hit the same assertion. Let me pull the individual events to compare timing.
agentry_get_case { case_id: "c_ci0", include_recent_events: 4 }
{
"events": [
{ "received_at": 1746748800,
"stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
"extra": { "ci_run_id": "abc1", "duration_ms": 4012, "retry": 0 } },
{ "received_at": 1747001200,
"stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
"extra": { "ci_run_id": "abc2", "duration_ms": 4008, "retry": 0 } },
{ "received_at": 1747186400,
"stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
"extra": { "ci_run_id": "abc3", "duration_ms": 4015, "retry": 0 } },
{ "received_at": 1747320000,
"stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
"extra": { "ci_run_id": "abc4", "duration_ms": 4011, "retry": 0 } }
]
}
Verdict: likely timing, not a bug.
Evidence:
- All 4 failures hit the same assertion (#checkout-success not visible)
- All 4 hit the 4000ms timeout exactly (4008–4015ms; that’s the test’s defaultCommandTimeout)
- Stripe’s test-mode latency is typically 200-3000ms but spikes occasionally to 5-8s
The success page renders after Stripe returns. Your timeout is just a bit too tight for the long tail of Stripe responses.
Two fixes, depending on intent:
- Increase timeout (5 min effort, no risk):
  cy.get('#checkout-success', { timeout: 10000 }).should('be.visible')
- Stub Stripe in CI (1h effort, removes flake entirely + makes test 5x faster):
  Use cy.intercept('POST', '**/charges', { fixture: 'stripe-success.json' })
Want me to draft the PR for option 1?
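The timing evidence in the walk-through comes down to one check: did every failed run end just after the command timeout? Here is a minimal sketch of that check, using the `duration_ms` values from the events above. The function name and the 50ms tolerance are assumptions chosen for illustration.

```javascript
// Events copied from the agentry_get_case response above (only the fields used here).
const events = [
  { extra: { ci_run_id: "abc1", duration_ms: 4012 } },
  { extra: { ci_run_id: "abc2", duration_ms: 4008 } },
  { extra: { ci_run_id: "abc3", duration_ms: 4015 } },
  { extra: { ci_run_id: "abc4", duration_ms: 4011 } },
];

// Returns true when every failure ended within `toleranceMs` after the timeout,
// which suggests the command waited the full timeout rather than failing fast.
// The 50ms default tolerance is an arbitrary illustrative value.
function allHitTimeout(events, timeoutMs, toleranceMs = 50) {
  if (events.length === 0) return false;
  return events.every((event) => {
    const overshoot = event.extra.duration_ms - timeoutMs;
    return overshoot >= 0 && overshoot <= toleranceMs;
  });
}

console.log(allHitTimeout(events, 4000)); // true
```

If some runs had failed well before 4000ms, the element would have been rejected for another reason, and a longer timeout would not help.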
The output
🧪 Flaky test analysis: "checkout flow > submits valid card"
4 failures in 7 days · Same fingerprint each time · Same assertion
VERDICT: Timing (Stripe latency spike), not a code bug
Evidence: all failures hit defaultCommandTimeout (4000ms) exactly
Stripe test-mode p99 latency ≈ 6s
FIXES:
Quick: cy.get('#checkout-success', { timeout: 10000 })…
Right: cy.intercept('POST', '**/charges', { fixture: ... })
NEXT: ask your agent to open a PR ("increase checkout test timeout")
OR ask your agent to open a PR ("stub Stripe in cypress checkout tests")
OR agentry_record_suppression (mark as known-flaky, stop paging on it)
Setting it up
CI runners need to post test failures to Agentry. For GitHub Actions, that is one extra workflow step:
- name: Report test failures to Agentry
  if: failure()
  run: |
    cat test-results.json | jq -c '.failures[] | {
      message: .title,
      stack: .error.stack,
      environment: "ci",
      tags: { ci_run_id: env.GITHUB_RUN_ID, test_file: .file }
    }' | while read -r payload; do
      curl -X POST https://api.agentry.sh/v1/logs/$AGENTRY_PROJECT_ID/ \
        -H "Authorization: Bearer $AGENTRY_DSN" \
        -H "Content-Type: application/json" \
        -d "$payload"
    done
The environment: "ci" tag is what makes Agentry treat these as separate from production errors. Use any structured test output your runner produces (Jest, Vitest, Cypress, Playwright all support JSON output).
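If your runner's reporter is easier to post-process in Node than with jq, the same mapping can be sketched as a small function. The report shape (`failures[]` with `title`, `file`, and `error.stack`) is taken from the jq filter above and may not match your reporter's actual output. The function name and the sample report are hypothetical.

```javascript
// Mirrors the jq filter above: one payload per entry in `failures`.
// The input shape is an assumption copied from that filter.
function toAgentryPayloads(report, ciRunId) {
  return (report.failures || []).map((failure) => ({
    message: failure.title,
    // jq yields null when .error is missing; do the same here.
    stack: failure.error?.stack ?? null,
    environment: "ci",
    tags: { ci_run_id: ciRunId, test_file: failure.file },
  }));
}

// Hypothetical report with a single failure.
const sampleReport = {
  failures: [
    {
      title: "checkout flow > submits valid card",
      file: "cypress/e2e/checkout.cy.js",
      error: { stack: "AssertionError at cy.get(#checkout-success) timeout 4000ms" },
    },
  ],
};

for (const payload of toAgentryPayloads(sampleReport, "abc1")) {
  // Each line is the JSON body that the curl command above would send.
  console.log(JSON.stringify(payload));
}
```

In a workflow step you would pass `process.env.GITHUB_RUN_ID` as the run id and POST each payload with the same headers as the curl call.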
Variations
- “Across all my CI failures this week, which test is the most flaky? Rank by failure count.”
- “Compare CI failures on main vs feature/* branches: is something broken on main that we’re masking with re-runs?”
- “Mark checkout flow > submits valid card as known-flaky for the next 7 days. If it’s still failing then, escalate.”
- “Find tests that are slow but not flaky, taking 30s+ on average.”
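The first variation ("rank by failure count") amounts to sorting cases by `event_count`. Here is a minimal sketch, assuming input shaped like the `cases` array returned by agentry_list_cases in the walk-through. The second and third entries are invented sample data.

```javascript
// Sorts a copy of the cases, most failures first, keeping only the fields
// needed for a ranking.
function rankByFailureCount(cases) {
  return [...cases]
    .sort((a, b) => b.event_count - a.event_count)
    .map(({ id, message, event_count }) => ({ id, message, event_count }));
}

// First entry is from the walk-through; the other two are hypothetical.
const sampleCases = [
  { id: "c_ci0", message: "AssertionError: expected #checkout-success to be visible", event_count: 4 },
  { id: "c_example1", message: "Hypothetical failure A", event_count: 9 },
  { id: "c_example2", message: "Hypothetical failure B", event_count: 1 },
];

for (const entry of rankByFailureCount(sampleCases)) {
  console.log(`${entry.event_count}\t${entry.id}\t${entry.message}`);
}
```

A raw count favors tests that run often; dividing by total runs per test would give a fairer failure rate where that number is available.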