Diagnose a flaky CI test from its error payloads

Why this matters

Flaky tests are the worst kind of bug because they masquerade as not-bugs. You re-run, it passes, you merge, you forget. Two weeks later the same test fails for a real reason, and nobody notices because everyone has learned to ignore it.

The fix is to treat each CI failure as a data point and look for patterns across runs. If 4 failures all trace back to the same useEffect ordering issue, it’s a race condition. If they happen at random points across the test body, it’s flake (network, timing, GC pauses). You can usually tell within a minute, provided the failure data is structured.
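
As a rough illustration of that grouping rule, here is a minimal TypeScript sketch. The FailureRecord shape, the classifyFailures name, the 3-run minimum, and the 75% threshold are all assumptions made for this example; none of them come from Agentry.

```typescript
// Hypothetical record: one entry per CI failure of the same test.
interface FailureRecord {
  runId: string;
  // Where the failure surfaced, e.g. an assertion message or top stack frame.
  failedAt: string;
}

type FailurePattern = "consistent" | "scattered" | "insufficient-data";

// Classifies failures by how many of them share the same failure location.
// "consistent" suggests one underlying cause (such as a race condition);
// "scattered" suggests environmental flake.
function classifyFailures(
  failures: FailureRecord[],
  minRuns = 3,
  threshold = 0.75,
): FailurePattern {
  if (failures.length < minRuns) return "insufficient-data";

  const counts: Record<string, number> = {};
  let largestGroup = 0;
  for (const failure of failures) {
    const next = (counts[failure.failedAt] ?? 0) + 1;
    counts[failure.failedAt] = next;
    if (next > largestGroup) largestGroup = next;
  }

  return largestGroup / failures.length >= threshold ? "consistent" : "scattered";
}

// Four failures at the same location form one group covering 100% of runs.
const exampleVerdict = classifyFailures([
  { runId: "r1", failedAt: "useEffect ordering" },
  { runId: "r2", failedAt: "useEffect ordering" },
  { runId: "r3", failedAt: "useEffect ordering" },
  { runId: "r4", failedAt: "useEffect ordering" },
]); // "consistent"
```

The threshold is a judgment call: a lower value tolerates more noise before calling a pattern, a higher one demands near-unanimous failures.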

Most CI runners throw failure logs in a tar.gz no one reads. Agentry pulls them into a searchable event store so the agent can group them.

What you get

All recent failures of one test (or test pattern) grouped together
A verdict: real bug (with root cause) vs flake (with suggested mitigation)
An option to mark known-flaky tests so they don’t page on each future failure

Walk through it

You

The Cypress test “checkout flow > submits valid card” failed 4 times this week in CI. Pull the failure payloads, find what’s common, and tell me if it’s a real race or just timing.

Agent

Let me find those failures.

agentry_list_cases {
  project_id: "default",
  environment: "ci",
  query: "checkout flow > submits valid card",
  status: "open",
  since: "7d"
}

{
  "cases": [
    { "id": "c_ci0", "fingerprint": "f_ci0",
      "message": "AssertionError: expected #checkout-success to be visible",
      "event_count": 4,
      "last_seen_at": 1747320000,
      "first_seen_at": 1746748800 }
  ]
}

Agent

Same fingerprint all 4 times. That’s a clean signal: every failure hit the same assertion. Let me pull the individual events to compare timing.

agentry_get_case { case_id: "c_ci0", include_recent_events: 4 }

{
  "events": [
    { "received_at": 1746748800,
      "stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
      "extra": { "ci_run_id": "abc1", "duration_ms": 4012, "retry": 0 } },
    { "received_at": 1747001200,
      "stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
      "extra": { "ci_run_id": "abc2", "duration_ms": 4008, "retry": 0 } },
    { "received_at": 1747186400,
      "stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
      "extra": { "ci_run_id": "abc3", "duration_ms": 4015, "retry": 0 } },
    { "received_at": 1747320000,
      "stack": "AssertionError at cy.get(#checkout-success) timeout 4000ms",
      "extra": { "ci_run_id": "abc4", "duration_ms": 4011, "retry": 0 } }
  ]
}

Agent

Verdict: likely timing, not a bug.

Evidence:

All 4 failures hit the same assertion (#checkout-success not visible)
All 4 failed after 4008–4015ms, right at the test’s 4000ms defaultCommandTimeout
Stripe’s test-mode latency is typically 200-3000ms but spikes occasionally to 5-8s

The success page renders after Stripe returns. Your timeout is just a bit too tight for the long tail of Stripe responses.

Two fixes, depending on intent:

Increase timeout (5 min effort, no risk): cy.get('#checkout-success', { timeout: 10000 }).should('be.visible')
Stub Stripe in CI (1h effort, removes flake entirely + makes test 5x faster): Use cy.intercept('POST', '**/charges', { fixture: 'stripe-success.json' })

Want me to draft the PR for option 1?
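
The timing evidence in this walkthrough can also be checked mechanically. The sketch below uses the four duration_ms values from the events above; the CiEvent shape, the allHitTimeout name, and the 50ms tolerance are assumptions chosen for illustration.

```typescript
// Hypothetical, trimmed-down view of the event fields used in the walkthrough.
interface CiEvent {
  ciRunId: string;
  durationMs: number;
}

// Returns true when every failure ended just past the timeout, which points
// at "the command waited the full timeout" rather than an early, real failure.
function allHitTimeout(
  events: CiEvent[],
  timeoutMs: number,
  toleranceMs = 50,
): boolean {
  if (events.length === 0) return false;
  return events.every((event) => {
    const overshoot = event.durationMs - timeoutMs;
    return overshoot >= 0 && overshoot <= toleranceMs;
  });
}

// Durations copied from the four events in the walkthrough.
const walkthroughEvents: CiEvent[] = [
  { ciRunId: "abc1", durationMs: 4012 },
  { ciRunId: "abc2", durationMs: 4008 },
  { ciRunId: "abc3", durationMs: 4015 },
  { ciRunId: "abc4", durationMs: 4011 },
];

const timingSuspected = allHitTimeout(walkthroughEvents, 4000); // true
```

If some failures had ended well before the timeout, for example at around 1 second, the check would return false and the timeout explanation would no longer cover every run.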

The output

🧪 Flaky test analysis: "checkout flow > submits valid card"

4 failures in 7 days · Same fingerprint each time · Same assertion

VERDICT: Timing (Stripe latency spike), not a code bug
  Evidence: all failures hit defaultCommandTimeout (4000ms) exactly
            Stripe test-mode p99 latency ≈ 6s

FIXES:
  Quick:  cy.get('#checkout-success', { timeout: 10000 })…
  Right:  cy.intercept('POST', '**/charges', { fixture: ... })

NEXT: ask your agent to open a PR ("increase checkout test timeout")
   OR ask your agent to open a PR ("stub Stripe in cypress checkout tests")
   OR agentry_record_suppression  (mark as known-flaky, stop paging on it)

Setting it up

CI runners need to post test failures to Agentry. For GitHub Actions, that is a single workflow step:

- name: Report test failures to Agentry
  if: failure()
  # Assumes AGENTRY_PROJECT_ID and AGENTRY_DSN are available in this step's environment.
  run: |
    # Emit one compact JSON payload per failed test, then POST each one.
    jq -c '.failures[] | {
      message: .title,
      stack: .error.stack,
      environment: "ci",
      tags: { ci_run_id: env.GITHUB_RUN_ID, test_file: .file }
    }' test-results.json | while read -r payload; do
      curl -X POST "https://api.agentry.sh/v1/logs/$AGENTRY_PROJECT_ID/" \
        -H "Authorization: Bearer $AGENTRY_DSN" \
        -H "Content-Type: application/json" \
        -d "$payload"
    done

The environment: "ci" field is what makes Agentry treat these as separate from production errors. Use any structured test output your runner produces (Jest, Vitest, Cypress, Playwright all support JSON output).
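
If a script is easier to maintain than a jq filter, the same mapping can be sketched in TypeScript. The TestResults shape below simply mirrors the fields the jq filter reads (title, file, error.stack). The type and function names are hypothetical, the sample values are made up, and this is not an official Agentry client.

```typescript
// Hypothetical report shape, mirroring the fields the jq filter reads.
interface TestFailure {
  title: string;
  file: string;
  error: { stack: string };
}

interface TestResults {
  failures: TestFailure[];
}

// Payload fields copied from the jq filter; not an official Agentry type.
interface CiFailurePayload {
  message: string;
  stack: string;
  environment: "ci";
  tags: { ci_run_id: string; test_file: string };
}

// Builds one payload per failed test, tagged with the CI run that produced it.
function toCiPayloads(results: TestResults, ciRunId: string): CiFailurePayload[] {
  return results.failures.map((failure): CiFailurePayload => ({
    message: failure.title,
    stack: failure.error.stack,
    environment: "ci",
    tags: { ci_run_id: ciRunId, test_file: failure.file },
  }));
}

// Example with made-up values.
const samplePayloads = toCiPayloads(
  {
    failures: [
      {
        title: "checkout flow > submits valid card",
        file: "cypress/e2e/checkout.cy.ts",
        error: { stack: "AssertionError: expected #checkout-success to be visible" },
      },
    ],
  },
  "abc1",
);
```

Each payload can then be serialized with JSON.stringify and posted with the same headers the curl command uses.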

Variations

“Across all my CI failures this week, which test is the most flaky? Rank by failure count.”
“Compare CI failures on main vs feature/* branches — is something broken on main that we’re masking with re-runs?”
“Mark ‘checkout flow > submits valid card’ as known-flaky for the next 7 days. If it’s still failing then, escalate.”
“Find tests that are slow but not flaky — taking 30s+ on average.”

Related recipes

Find the bugs 5 people hit but no one reported

Find what broke after your last deploy

Stop noisy errors from paging you — in plain English
