Five Bugs Hiding as One: How We Unravelled a "Dashboard Won't Load" Incident
2025-02-16T00:00:00.000Z
I came back from vacation to a broken dashboard and a familiar message: “Refreshing data…” forever. No one flagged it earlier. The feature had shipped while I was out—no review, no supervision, just a string of Cursor-driven commits.
What looked like a single issue turned out to be five independent bugs colliding at the worst possible layer: coordination between services.
This is the story of how it unraveled—and what it exposes about “vibe coding” in production systems.
The Setup
The system is straightforward on paper:
- FastAPI backend on GKE (3 replicas)
- React frontend using polling (202 → poll → result)
- Redis for coordination
- Intercom export API (slow, async, rate-limited)
Because exports take minutes and Cloudflare times out at 100s, everything runs on an accept-then-poll model:
POST /api/overview → 202 { job_id }
GET /api/jobs/{job_id} → pending → complete
That model only works if one assumption holds:
A job marked “pending” must actually be running somewhere.
That assumption was broken in five different ways.
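
To make the assumption concrete, here is a minimal sketch of the polling side of the contract. It is not the real frontend code: the function name, the `{"status": ...}` response shape, and the default intervals are all assumptions for illustration. The point is that the client treats "pending" as a promise that work is happening, so a job that stays pending forever can only end in a timeout.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch: Callable[[str], Dict],
    job_id: str,
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a job until it completes, fails, or the poll budget runs out.

    `fetch` is a hypothetical callable standing in for GET /api/jobs/{job_id}.
    """
    for _ in range(max_polls):
        response = fetch(job_id)
        status = response["status"]
        if status == "complete":
            return response
        if status != "pending":
            # Anything other than pending/complete is treated as a failure.
            raise RuntimeError(f"job {job_id} ended with status {status!r}")
        sleep(interval_s)
    # "pending" was a promise that nobody kept.
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")


# Demo: a canned sequence of answers instead of a real HTTP call.
answers = iter(
    [{"status": "pending"}, {"status": "pending"}, {"status": "complete", "result": 42}]
)
result = poll_until_done(lambda _id: next(answers), "job-1", sleep=lambda s: None)
```
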
The Symptom
- Every page stuck on “Refreshing data…”
- Jobs accepted (202), but never completed
- Polling responses flipping between “pending” and “job not found”
That flip should be impossible in a healthy system.
It means different pods are telling different stories.
The Investigation
I ignored the UI and hit the API directly. Tracked a single job_id over time.
The key signal:
- Same job ID
- Different answers
- Within seconds
That only happens when:
- state is not shared correctly, or
- pods are dying mid-work
Both were true.
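
The check I was doing by eye can be written down. A minimal sketch, assuming you have already collected the sequence of statuses returned for one job_id (the function name and the status strings are illustrative, not from the real codebase): a status that goes away and then comes back is the signature of pods disagreeing.

```python
from typing import List, Sequence


def is_flapping(statuses: Sequence[str]) -> bool:
    """Return True if any status reappears after a different one.

    With a single shared source of truth a job moves forward
    (pending, then a terminal state). It never returns to a state
    it has already left.
    """
    # Collapse consecutive duplicates so only transitions remain.
    transitions: List[str] = [
        s for i, s in enumerate(statuses) if i == 0 or s != statuses[i - 1]
    ]
    # A repeated entry means the answer went away and came back.
    return len(transitions) != len(set(transitions))
```

For example, `["pending", "not_found", "pending"]` is flagged, while `["pending", "pending", "complete"]` is not.
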
The Five Bugs
- New pods were never starting (Binary Authorization)
Images were deployed using :latest.
Cluster required immutable digests.
Result:
- New pods rejected
- Old pods dying (OOM)
- No replacements
System slowly lost capacity while still “looking alive”.
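
A cheap guard is to reject tag-based image references before they reach the cluster. This is a sketch of such a check, not our actual pipeline; the function name and the example image paths are made up.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the image reference is pinned to an immutable digest."""
    return bool(_DIGEST_RE.search(image_ref))


# Hypothetical references, for illustration only.
pinned = is_digest_pinned("gcr.io/example/api@sha256:" + "a" * 64)
floating = is_digest_pinned("gcr.io/example/api:latest")
```

Run against the rendered manifests in CI, a check like this fails the deploy loudly instead of letting the admission controller reject pods silently later.
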
- Job state lived in memory (per pod)
Errors were stored in a local dict.
If a job failed on Pod A:
- Pod B had no idea
- Returned 404
Frontend retried → new job → same failure loop
Classic distributed system mistake: state wasn’t shared.
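
The difference is easy to reproduce in a few lines. In this sketch the `Pod` class is a hypothetical stand-in for one API replica, and a plain dict stands in for Redis so the example runs on its own.

```python
from typing import MutableMapping, Optional


class Pod:
    """Hypothetical stand-in for one API replica."""

    def __init__(self, name: str, store: MutableMapping[str, str]) -> None:
        self.name = name
        self.store = store  # where this pod records job status

    def record_failure(self, job_id: str, error: str) -> None:
        self.store[job_id] = f"failed: {error}"

    def get_status(self, job_id: str) -> Optional[str]:
        # None corresponds to the 404 "job not found" response.
        return self.store.get(job_id)


# Broken layout: every pod keeps its own private dict.
pod_a = Pod("a", store={})
pod_b = Pod("b", store={})
pod_a.record_failure("job-1", "rate limited")
local_view = pod_b.get_status("job-1")  # pod B never hears about it

# Fixed layout: both pods use the same store (Redis in production;
# a dict here only to keep the example self-contained).
shared: MutableMapping[str, str] = {}
pod_a = Pod("a", store=shared)
pod_b = Pod("b", store=shared)
pod_a.record_failure("job-1", "rate limited")
shared_view = pod_b.get_status("job-1")
```

With private dicts, `local_view` is `None` and the frontend sees a 404. With the shared store, pod B reports the failure pod A recorded.
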
- Startup logic was DDoS-ing our own dependency
Every pod boot triggered a heavy Intercom export.
3 pods + cron + user traffic = rate limit exceeded.
Result:
- All real jobs failed
- System stuck in retry loops
The code was correct once. It just never evolved.
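
Our fix was to remove the startup export entirely. If a warm-up job truly has to stay, one common alternative is to let only one replica run it per time window, using an atomic set-if-absent with an expiry. The sketch below is that alternative, not what we shipped. `FakeRedis` is an in-memory stand-in that mimics only the `set(key, value, nx=True, ex=...)` call, and the key name is hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FakeRedis:
    """In-memory stand-in that mimics the subset of SET used below."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}
        self._clock = clock

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None):
        now = self._clock()
        current = self._data.get(key)
        if current is not None and current[1] <= now:
            current = None  # the previous holder's claim has expired
        if nx and current is not None:
            return None  # key already held; mirrors Redis returning nil
        expires = now + ex if ex is not None else float("inf")
        self._data[key] = (value, expires)
        return True


def run_startup_export_once(
    redis, pod_name: str, export: Callable[[], None], ttl_s: int = 3600
) -> bool:
    """Run export() only if no other pod claimed the slot within ttl_s seconds."""
    if redis.set("startup-export-lock", pod_name, nx=True, ex=ttl_s):
        export()
        return True
    return False


# Demo: three pods boot against the same store; only the first one exports.
calls = []
r = FakeRedis()
results = [
    run_startup_export_once(r, f"pod-{i}", lambda: calls.append(1)) for i in range(3)
]
```
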
- Redis failover never recovered
One transient Redis outage → pod switched to file cache.
It never switched back.
Now:
- Some pods used Redis
- One didn’t
Coordination layer split in half.
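
The missing piece was a way back. A minimal sketch of a fallback that also recovers, assuming a primary and a fallback backend that both expose `get`. The class name, the cool-off value, and the fake backends are illustrative. The built-in `ConnectionError` is used to keep the example standard-library only; with redis-py you would catch `redis.exceptions.ConnectionError` instead.

```python
import time
from typing import Callable, Optional


class FailoverCache:
    """Prefers the primary, falls back on error, and retries the primary later."""

    def __init__(self, primary, fallback, retry_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.primary = primary
        self.fallback = fallback
        self.retry_after_s = retry_after_s
        self._clock = clock
        self._degraded_since: Optional[float] = None  # None means primary is healthy

    def _use_primary(self) -> bool:
        if self._degraded_since is None:
            return True
        # The step the incident lacked: after a cool-off, try the primary again.
        return self._clock() - self._degraded_since >= self.retry_after_s

    def get(self, key: str):
        if self._use_primary():
            try:
                value = self.primary.get(key)
                self._degraded_since = None  # primary answered, so we have recovered
                return value
            except ConnectionError:
                self._degraded_since = self._clock()
        return self.fallback.get(key)


class FlakyPrimary:
    """Fake primary that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.up = True

    def get(self, key: str) -> str:
        if not self.up:
            raise ConnectionError("redis down")
        return "from-redis"


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
primary = FlakyPrimary()
cache = FailoverCache(primary, fallback={"k": "from-file"}, clock=lambda: now[0])

seen = [cache.get("k")]      # healthy
primary.up = False
seen.append(cache.get("k"))  # outage: served from the fallback
primary.up = True
now[0] = 10.0
seen.append(cache.get("k"))  # still inside the cool-off window
now[0] = 31.0
seen.append(cache.get("k"))  # cool-off elapsed: back on the primary
```

This only covers reads. Anything written to the fallback during the outage still has to be reconciled or discarded, which the sketch does not attempt.
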
- Pods were killed mid-job
Kubernetes shutdown killed in-flight async tasks.
No cleanup. No state reconciliation.
Result:
- Jobs marked “pending”
- But no process actually running them
Zombie jobs everywhere.
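
Kubernetes sends SIGTERM and then waits for the pod's termination grace period before killing it, so there is a window to clean up. The sketch below shows one way to use it with asyncio: track the tasks the pod owns, give them a grace period, cancel the rest, and record the outcome. The names are illustrative and a dict stands in for the shared job table. In a FastAPI app, something like `drain` would be called from the shutdown hook.

```python
import asyncio
from typing import Dict, Set

job_status: Dict[str, str] = {}       # stand-in for the shared Redis job table
in_flight: Set[asyncio.Task] = set()  # tasks this pod currently owns


async def run_job(job_id: str, seconds: float) -> None:
    job_status[job_id] = "pending"
    try:
        await asyncio.sleep(seconds)  # placeholder for the slow export
        job_status[job_id] = "complete"
    except asyncio.CancelledError:
        # Reconcile state before dying so nobody polls a zombie.
        job_status[job_id] = "failed: interrupted by shutdown"
        raise


def submit(job_id: str, seconds: float) -> None:
    task = asyncio.create_task(run_job(job_id, seconds))
    in_flight.add(task)
    task.add_done_callback(in_flight.discard)


async def drain(grace_s: float) -> None:
    """Give in-flight jobs grace_s seconds, then cancel whatever is left."""
    if not in_flight:
        return
    _, still_running = await asyncio.wait(list(in_flight), timeout=grace_s)
    for task in still_running:
        task.cancel()
    # Wait for the cancelled tasks to finish writing their final state.
    await asyncio.gather(*still_running, return_exceptions=True)


async def demo() -> Dict[str, str]:
    submit("fast", 0.01)
    submit("slow", 5.0)
    await drain(grace_s=0.1)
    return dict(job_status)


final_status = asyncio.run(demo())
```

After the drain, no job is left as "pending": the fast one completed and the slow one is explicitly marked as interrupted.
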
What Actually Broke
Not FastAPI. Not React. Not Kubernetes.
The contract between them.
The system depended on a simple invariant:
If Redis says a job is pending, exactly one pod must be working on it.
That invariant failed repeatedly.
Everything else was just noise.
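
An invariant like this can be checked directly instead of inferred from symptoms. The sketch below assumes each worker periodically writes a heartbeat timestamp for the jobs it owns. That heartbeat is an assumption for illustration, not something the system had at the time. It only tests the "at least one pod" half of the invariant, not the "exactly one" half.

```python
from typing import Dict, List


def find_zombie_jobs(
    statuses: Dict[str, str],
    heartbeats: Dict[str, float],
    now: float,
    max_silence_s: float = 60.0,
) -> List[str]:
    """Return IDs of jobs marked 'pending' whose owner has gone quiet."""
    zombies = []
    for job_id, status in statuses.items():
        if status != "pending":
            continue
        last_seen = heartbeats.get(job_id)
        # No heartbeat at all, or one that is too old, means nobody owns the job.
        if last_seen is None or now - last_seen > max_silence_s:
            zombies.append(job_id)
    return sorted(zombies)


# Hypothetical snapshot: "a" is alive, "b" went silent, "d" never had an owner.
statuses = {"a": "pending", "b": "pending", "c": "complete", "d": "pending"}
heartbeats = {"a": 100.0, "b": 10.0, "c": 5.0}
zombies = find_zombie_jobs(statuses, heartbeats, now=120.0)
```
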
Why This Happened
Because most of the code wasn’t designed—it was generated.
Cursor suggested. Engineer accepted. No one asked:
- What happens across replicas?
- Where does state live?
- What happens on restart?
- What happens under rate limits?
That’s vibe coding:
Local correctness, global blindness.
It works until coordination matters. Then it fails spectacularly.
What Fixed It
- Enforced digest-based deployments (pods actually start)
- Moved job/error state to Redis (shared truth)
- Removed harmful startup jobs
- Added Redis recovery path
- Drained tasks on shutdown
- Added real observability (job lifecycle logs)
Not fancy. Just correct.
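
For the observability item, the useful unit is one log line per state transition, tagged with the pod that made it. A minimal sketch of that idea; the field names and logger name are assumptions, not our actual log schema.

```python
import json
import logging
import time

logger = logging.getLogger("jobs")


def log_transition(job_id: str, old: str, new: str, pod: str) -> str:
    """Emit one JSON line per job state change and return it."""
    line = json.dumps(
        {"ts": time.time(), "job_id": job_id, "from": old, "to": new, "pod": pod},
        sort_keys=True,
    )
    logger.info(line)
    return line


record = json.loads(log_transition("job-1", "pending", "complete", "pod-a"))
```

With lines like these, "which pod last touched this job, and when" becomes a log query rather than an investigation.
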
What You Get From This
If you’re building anything distributed, this pattern will hit you.
This incident gives you shortcuts:
- If a status flips between states → think multi-node inconsistency
- If jobs never complete → check lifecycle, not logic
- If retries make things worse → suspect rate limits
- If one pod behaves differently → suspect hidden local state
Most importantly:
Debug invariants, not symptoms.
Lessons (for people who don’t want this happening to them)
- “It works” is meaningless without considering replicas. Single-node thinking doesn’t scale.
- All coordination state must be shared. Memory is not a cache. It’s a liability.
- Startup code is production code. If it runs on every pod, it must be safe at scale.
- Failover must include recovery. Otherwise you create permanent degradation.
- Async jobs need lifecycle ownership. Start, track, recover, terminate, all explicitly.
- Rate limits are part of your architecture. Not an edge case.
- Generated code is not designed code. Tools don’t think in systems. You have to.
The Real Takeaway
This wasn’t one bug. It was what happens when a system has no single source of truth.
And that’s the uncomfortable part:
None of these issues are hard individually. They only exist when no one is thinking end-to-end.
That’s the gap between writing code and building systems.