Mobile background-sync failure monitoring

Why this matters

Background failures are the worst kind of bug: invisible until the user notices the symptom. The sync didn’t run. The image prefetch failed. The token expired because its refresh silently failed. The user opens the app, sees stale data, and uninstalls. No crash, no error report, just a slow drip of churn.

These bugs cluster around two dimensions:

task_name: which job is failing (sync_messages, prefetch_images, refresh_token, upload_pending)
network_type: wifi-vs-cellular reveals timeout-tuning bugs and DNS-fallback misses

Grouping by both is the diagnostic. “All sync_messages on cellular” → cellular timeout is too aggressive. “All prefetch_images everywhere” → CDN regression.
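
As a rough sketch of that reading, the pattern can be derived mechanically from the matrix rows. The type names, the classifyTask function, and the 15% cutoff below are assumptions made for illustration; the sample rates are the ones from the walk-through on this page.

```typescript
// One cell of the task_name × network_type matrix (failRate is a fraction, 0..1).
type MatrixRow = { task: string; net: "wifi" | "cellular"; failRate: number };

type Pattern = "healthy" | "cellular_only" | "wifi_only" | "all_networks";

// Hypothetical cutoff: treat a cell as unhealthy above 15% failures.
const UNHEALTHY = 0.15;

function classifyTask(rows: MatrixRow[], task: string): Pattern {
  // Collect the networks on which this task is above the cutoff.
  const bad = new Set(
    rows.filter((r) => r.task === task && r.failRate > UNHEALTHY).map((r) => r.net),
  );
  if (bad.size === 0) return "healthy";
  // Failing everywhere points away from the network (CDN, backend).
  if (bad.size === 2) return "all_networks";
  // Failing on one network only points at network-specific tuning (timeouts, DNS).
  return bad.has("cellular") ? "cellular_only" : "wifi_only";
}

const rows: MatrixRow[] = [
  { task: "sync_messages", net: "cellular", failRate: 0.287 },
  { task: "sync_messages", net: "wifi", failRate: 0.025 },
  { task: "prefetch_images", net: "cellular", failRate: 0.29 },
  { task: "prefetch_images", net: "wifi", failRate: 0.2 },
];

console.log(classifyTask(rows, "sync_messages"));   // cellular_only
console.log(classifyTask(rows, "prefetch_images")); // all_networks
```

The classification is only a first hint: it narrows where to look, and the open cases confirm or refute it.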

Agentry runs the cross-tab in one prompt. The signal usually surfaces in 30 seconds.

What you get

A failure-rate matrix: task_name × network_type
Week-over-week delta so you can spot regressions vs chronic noise
The top open cases tied to the worst-performing task
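
To make the week-over-week delta concrete, here is a minimal sketch of how a row could be labeled as a regression or as chronic noise. The WeeklyRate type, the function names, and the 5-percentage-point cutoff are assumptions, not part of Agentry; the two sample rows reuse rates from the walk-through on this page.

```typescript
// One row of a week-over-week result (rates are fractions, 0..1).
type WeeklyRate = { task: string; net: string; thisWeek: number; lastWeek: number };

// Hypothetical cutoff: a rise of 5 percentage points or more counts as a regression.
const REGRESSION_PP = 5;

// Week-over-week change in percentage points, rounded to one decimal place.
function deltaPp(r: WeeklyRate): number {
  return Math.round((r.thisWeek - r.lastWeek) * 1000) / 10;
}

// Intended for rows that are already elevated in at least one of the two weeks:
// "regression" = jumped this week, "chronic" = elevated but roughly flat.
function classify(r: WeeklyRate): "regression" | "chronic" {
  return deltaPp(r) >= REGRESSION_PP ? "regression" : "chronic";
}

const syncCellular: WeeklyRate = { task: "sync_messages", net: "cellular", thisWeek: 0.287, lastWeek: 0.04 };
const tokenCellular: WeeklyRate = { task: "refresh_token", net: "cellular", thisWeek: 0.061, lastWeek: 0.058 };

console.log(deltaPp(syncCellular), classify(syncCellular));   // 24.7 regression
console.log(deltaPp(tokenCellular), classify(tokenCellular)); // 0.3 chronic
```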

Walk through it

You

Are background tasks failing more this week? Group failures by task_name and network_type.

Agent

First the matrix.

agentry_analytics_query {
  project_id: "default",
  query: "
    select
      properties.task_name as task,
      properties.network_type as net,
      countIf(event = 'background_task_started') as started,
      countIf(event = 'background_task_failed') as failed,
      countIf(event = 'background_task_failed') / countIf(event = 'background_task_started') as fail_rate
    from events
    where event in ('background_task_started', 'background_task_failed')
      and timestamp > now() - interval 7 day
    group by task, net
    having started > 100
    order by fail_rate desc
  "
}

{
  "rows": [
    { "task": "prefetch_images", "net": "cellular", "started":  21430, "failed":  6210, "fail_rate": 0.290 },
    { "task": "sync_messages",   "net": "cellular", "started":  84210, "failed": 24180, "fail_rate": 0.287 },
    { "task": "prefetch_images", "net": "wifi",     "started":  42100, "failed":  8420, "fail_rate": 0.200 },
    { "task": "refresh_token",   "net": "cellular", "started":  62800, "failed":  3820, "fail_rate": 0.061 },
    { "task": "sync_messages",   "net": "wifi",     "started": 168900, "failed":  4180, "fail_rate": 0.025 },
    { "task": "refresh_token",   "net": "wifi",     "started":  82400, "failed":   820, "fail_rate": 0.010 },
    { "task": "upload_pending",  "net": "wifi",     "started":  18400, "failed":   180, "fail_rate": 0.010 }
  ]
}

Agent

Now the week-over-week.

agentry_analytics_query {
  project_id: "default",
  query: "
    select
      properties.task_name as task,
      properties.network_type as net,
      countIf(event = 'background_task_failed' and timestamp > now() - interval 7 day) /
        nullIf(countIf(event = 'background_task_started' and timestamp > now() - interval 7 day), 0) as this_week,
      countIf(event = 'background_task_failed' and timestamp between now() - interval 14 day and now() - interval 7 day) /
        nullIf(countIf(event = 'background_task_started' and timestamp between now() - interval 14 day and now() - interval 7 day), 0) as last_week
    from events
    where event in ('background_task_started', 'background_task_failed')
      and timestamp > now() - interval 14 day
    group by task, net
    having this_week > 0.05 or last_week > 0.05
    order by this_week - last_week desc
  "
}

{
  "rows": [
    { "task": "prefetch_images", "net": "cellular", "this_week": 0.290, "last_week": 0.022 },
    { "task": "sync_messages",   "net": "cellular", "this_week": 0.287, "last_week": 0.040 },
    { "task": "prefetch_images", "net": "wifi",     "this_week": 0.200, "last_week": 0.020 },
    { "task": "refresh_token",   "net": "cellular", "this_week": 0.061, "last_week": 0.058 }
  ]
}

Agent

Two regressions, both this week:

prefetch_images (wifi AND cellular): jumped from about 2% to 20-29%. Same on both networks → not a network issue → likely a CDN regression.
sync_messages (cellular only): jumped from 4% to 28.7%. Wifi is fine. Cellular-specific → likely a timeout-tuning issue. Either a timeout was tightened in the last release or the upstream got slower.

Let me check the open cases.

agentry_list_cases {
  project_id: "default",
  status: "open",
  matching: "prefetch|image|sync_message|timeout",
  window_hours: 168
}

{
  "cases": [
    {
      "case_id": "c_bg_1",
      "fingerprint": "f_img_a",
      "error_type": "HttpError",
      "message": "GET https://cdn.example.com/images/* returned 503",
      "events_count": 14630,
      "users_affected": 8200,
      "first_seen_at": "2026-05-13T11:00:00Z"
    },
    {
      "case_id": "c_bg_2",
      "fingerprint": "f_sync_a",
      "error_type": "SocketTimeoutException",
      "message": "Failed to connect to sync.api.example.com:443 after 5000ms",
      "events_count": 24180,
      "users_affected": 11420,
      "first_seen_at": "2026-05-13T15:30:00Z"
    }
  ]
}

Agent

Both cases first appeared on May 13, the same window in which the v4.2.0 prod deploy went out.

CDN 503s: 14,630 events across 8,200 users. The cause is either the CDN itself or a config push to it. Talk to whoever owns the CDN config and check whether a rate-limit or auth rule was added.
Cellular sync timeouts: 24,180 events across 11,420 users. The 5000ms connect timeout in sync_messages is too aggressive for cellular. Either raise the timeout on cellular (30s is a safer ceiling), or retry with exponential backoff.

Both are fixable in a 4.2.1 patch. Want me to open a triage note in Agentry memory so it’s pinned to these cases?

The output

Background task failures (7d) — cross-tab task × network

                       WIFI fail %   CELLULAR fail %   WoW Δ (cellular)
sync_messages              2.5%        28.7% ★★★         +24.7 pp
prefetch_images           20.0% ★      29.0% ★★          +26.8 pp
refresh_token              1.0%         6.1%              +0.3 pp
upload_pending             1.0%         (no data)         —

Two regressions detected (both started May 13, deploy 4.2.0):

  1. prefetch_images: 2% → 20-29% (BOTH networks)
     Root: CDN 503s on https://cdn.example.com/images/*
     Case: c_bg_1 (14,630 events, 8,200 users)

  2. sync_messages: cellular 4% → 28.7% (wifi unaffected)
     Root: SocketTimeoutException at 5000ms — too aggressive for cellular
     Case: c_bg_2 (24,180 events, 11,420 users)

Recommended actions:
  - CDN config audit (rate limit / auth rule pushed recently?)
  - Bump cellular timeout to 30s OR add retry with exponential backoff
  - Ship 4.2.1 with both fixes
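
As a rough illustration of the retry option in the recommended actions, a minimal TypeScript sketch of exponential backoff follows. The names backoffDelayMs and withRetry, the 1s base delay, the 30s cap, and the 4-attempt limit are all assumptions chosen for the example, not values taken from Agentry or from the app.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs `op` up to maxAttempts times, sleeping with exponential backoff between failures.
// `sleep` is injectable so the helper can be exercised without real waiting.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt: fail fast and rethrow below.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

With these defaults the waits are 1s, 2s, and 4s before the fourth and last attempt. On Android, WorkManager can already reschedule a worker that returns Result.retry() using an exponential backoff policy, so a helper like this would mainly matter for retries inside a single run.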

Setting it up

Two events per background task: started and failed. The network_type property is the diagnostic axis.

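// Android WorkManager wrapper
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters
import kotlinx.coroutines.CancellationException

// postAnalytics, postLog, and currentNetworkType are your app's own helpers.
abstract class TrackedWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
  // Each concrete worker names its task and supplies the real work.
  abstract val taskName: String
  abstract suspend fun doActualWork()

  override suspend fun doWork(): Result {
    postAnalytics("background_task_started", mapOf(
      "task_name"    to taskName,
      "network_type" to currentNetworkType(),
      "app_version"  to BuildConfig.VERSION_NAME
    ))
    return try {
      doActualWork()
      Result.success()
    } catch (err: CancellationException) {
      // the worker was stopped: rethrow so cancellation is not counted as a failure
      throw err
    } catch (err: Throwable) {
      // analytics: count the failure
      postAnalytics("background_task_failed", mapOf(
        "task_name"    to taskName,
        "network_type" to currentNetworkType(),
        "error_type"   to err.javaClass.simpleName,
        "app_version"  to BuildConfig.VERSION_NAME
      ))
      // logs: capture the stack
      postLog(mapOf(
        "message"      to err.message,
        "stack"        to err.stackTraceToString(),
        "error_type"   to err.javaClass.simpleName,
        "properties"   to mapOf("task_name" to taskName, "network_type" to currentNetworkType())
      ))
      Result.retry()
    }
  }
}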

Where postLog is your /v1/logs/ helper:

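// PROJECT_ID and DSN are your project's identifiers, defined elsewhere in your app.
async function postLog(payload: Record<string, unknown>) {
  const res = await fetch(`https://api.agentry.sh/v1/logs/${PROJECT_ID}/`, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${DSN}`,
      "Content-Type": "application/json",
      "User-Agent": "myapp-android/4.2.0",  // REQUIRED: Cloudflare 403s default UAs
    },
    body: JSON.stringify(payload),
  });
  // Surface rejected payloads (for example the 403 above) instead of dropping them silently
  if (!res.ok) console.warn(`postLog failed: HTTP ${res.status}`);
}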

Variations

“Same cross-tab but per app_version — is the regression isolated to 4.2.0?”
“Median retry attempts before success per task — long tails indicate flaky infrastructure.”
“Per-region cellular failure rate — sometimes a single carrier’s gateway is the issue.”
“Set up a Routine that alerts #mobile-oncall when any task × network fail-rate exceeds 15% over 1h.”
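
For the alerting variation, the check behind such a rule could look roughly like the sketch below. It is not how Agentry Routines are implemented: the CellCounts type and cellsToAlert function are hypothetical, the 15% threshold comes from the prompt above, the minimum of 100 starts mirrors the `having started > 100` filter in the walk-through query, and the hourly counts are made up.

```typescript
// Started/failed counts for one task × network cell over the alert window (for example 1h).
type CellCounts = { task: string; net: string; started: number; failed: number };

// Returns the cells whose fail rate exceeds `threshold`, ignoring cells with too few
// starts to be meaningful.
function cellsToAlert(cells: CellCounts[], threshold = 0.15, minStarted = 100): CellCounts[] {
  return cells.filter((c) => c.started > minStarted && c.failed / c.started > threshold);
}

// Made-up hourly counts for illustration only.
const lastHour: CellCounts[] = [
  { task: "sync_messages", net: "cellular", started: 1200, failed: 300 }, // 25%: alert
  { task: "refresh_token", net: "wifi", started: 900, failed: 9 },        // 1%: fine
  { task: "upload_pending", net: "cellular", started: 40, failed: 20 },   // 50%, but too few starts
];

console.log(cellsToAlert(lastHour).map((c) => `${c.task}/${c.net}`)); // [ 'sync_messages/cellular' ]
```

The minimum-volume guard keeps a handful of failures in a quiet hour from paging the on-call channel.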

Related recipes

Track crash-free user rate per app version

Detect a 429-spike before customers tweet about it

Find connection-pool / timeout patterns in errors
