#1: The Guru Paradox: My Learnings from a $20M Fire
Welcome to payments!
It all started with a Slack notification so casual it’s comical in retrospect:
‘hey man, some providers seeing failed payments. can u look 🙏’
I’d recently expanded my scope to take over the payments platform and was still digging through onboarding docs. No one had hinted at any smoldering fires, so I wasn’t too worried. Looped in my engineering lead, checked our dashboard, and… saw nothing. Crickets. No noticeable spikes. Failure rates were flat. The dollar amount that failed over the past week wasn’t zero, but it wasn’t astronomical either (some tens of thousands in a system that processed tens of millions daily).
I opened a ticket, slotted it for standup, and resumed reading. Business as usual. I had no idea I was about to learn one of the most expensive lessons of my career.
By the afternoon, alarm bells were starting to sound.
We discovered that for months, we’d been silently accumulating a backlog of payments that were never sent. Not retried. Not monitored. They were just… sitting there. I scramble to pull together queries that accurately capture the backlog and stare at the result: a jaw-dropping $5M.
By evening, it got worse. I learned about business logic in our system that made my stomach drop: a single failed payment blocks all future payments for that user. Meaning cascading failure by design. Meaning cases like a $10 failure blocking over $2M from going out the door. Meaning failures weren’t just adding up; they were multiplying.
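To make that concrete, here’s a stripped-down sketch of the kind of dispatch logic we were dealing with. It’s illustrative Python, not our actual code, and every name in it is made up; the part that matters is the first branch.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List

# Illustrative sketch only, not our actual code. The key property of the design:
# one failed payment for a user halts every later payment queued for that user.

@dataclass
class Payment:
    user_id: str
    amount_cents: int
    created_at: datetime

def dispatch_for_user(
    user_id: str,
    queued: List[Payment],
    has_prior_failure: Callable[[str], bool],  # e.g. "any failed payment on record?"
    send: Callable[[Payment], None],
) -> int:
    """Send a user's queued payments oldest-first. Returns how many went out."""
    if has_prior_failure(user_id):
        # A single $10 failure trips this branch, silently holding every payment
        # queued behind it, however large, until a human steps in.
        return 0
    for payment in sorted(queued, key=lambda p: p.created_at):
        send(payment)
    return len(queued)
```

One guard clause, written with good intentions (don’t keep paying out when something looks wrong), quietly turned every small failure into a growing freeze.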
The day ends with $7M blocked, and a MUST-ATTEND 9am cross-functional sync scheduled by yours truly.
& so it begins…
My 8am alarm goes off. I groggily grab my phone, refresh the dashboard, and am instantly, brutally awake.
$14M. Completely Frozen.
This wasn’t an abstract metric. For our users, this was payroll. Rent. Funds they need to keep their businesses alive.
Polite emails turned into a flood of panicked ALL-CAPS demands. Slack avatars of C-suite execs started lighting up our DMs.
We were officially in full-blown crisis mode.
The Hydra
What made this fire particularly nasty was that it wasn't a single bug but rather a hydra-headed monster, with each head blocking millions:
- Some payments had invalid addresses that needed manual fixes.
- Some had failed once, were now fine, but weren’t being retried.
- Some had been sent to the wrong recipients and never resent to the right ones (yes, some of the wrong recipients kept the money. Big yikes).
- Some were blocked for unknown reasons needing deep, bespoke debugging.
But the real killer, the monster’s heart, was that 70% of the volume was being intentionally held due to previously failed payments. We had a digital traffic jam of our own making, with no alternate routes, and the pile-up was increasing by the hour.
On the surface, this is a story about a multi-million dollar fire. But really, it’s about what happens when your product’s complexity scales faster than your team’s understanding of it.
Untangling it involved a lot of whiteboarding, even more caffeine, and the uncomfortable realization that the single most critical piece of infrastructure in our entire payments system wasn’t a server or a database.
It was one person’s brain.
The Guru Paradox
Every complex product platform, especially in a scaling tech company, has a Guru. Payments always does.
They’re usually a senior engineer who’s been there forever. They wrote the original code, fought the early fires, and can outline the architecture in their sleep.
They’re brilliant. Revered. Indispensable.
And their existence is a dangerous liability for your product.
Our Guru, let’s call him S, was a walking encyclopedia. While the rest of us threw theories at a whiteboard, S navigated our payment maze like he built it. Because he did. S touched 80% of our critical path decisions. When he was heads down or in meetings, our velocity halved. Every missed edge case was one we hadn’t run by him. His code just…worked.
This is the Guru Paradox: your most effective person is also your biggest bottleneck.
Imagine if S had been on vacation. Imagine if he’d quit.
I was SO damn glad to have him in the room, but it was painfully clear that our over-reliance on him had allowed this crisis to fester and was now slowing its resolution.
And here’s the insidious part no one tells you about the Guru Paradox: it's not just a technical debt problem; it's also an innovation killer.
- Want to optimize routing to shave off minutes? "Better run it by S first."
- Want to ship a third party integration? "S needs to review."
- Want to test a faster flow? "Let's wait until S can assess the risk."
Teams start avoiding change out of fear of breaking the black box only one person can fix. They stop proposing improvements and start waiting for permission. Innovation dies in the queue of "check with the Guru first" & your competitive advantage slowly calcifies around one person's bandwidth and comfort zone.
And meanwhile, your Guru slowly burns out.
The Slog
There was no magic fix. No single line of code to save us. Just a brutal 2-week slog fueled by process, caffeine, and a shared sense of urgency.
We spun up a cross-functional war room with two critical, distinct roles:
- The Owner (Me): To deeply understand and prioritize the problems, coordinate the fix, and shield the team by managing all comms with leadership and partners.
- The Executor (S + team): To fix things, period. Shielded from all meetings and distractions, with me as their single point of contact.
We didn’t chase all the hydra’s heads at once. We isolated cascading root causes (such as the broken retry logic) and knocked down the biggest dominoes first.
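For a flavor of what “broken retry logic” means in practice: payments that failed for transient reasons (a provider timeout, a temporary validation hiccup) were never re-attempted, so they just sat. The fix amounted to adding something like the sketch below. It’s hypothetical Python with made-up names and limits, not our production code.

```python
import random
import time

# Hypothetical sketch of the missing behavior: re-attempt transient failures with
# capped, jittered exponential backoff instead of parking the payment forever.
MAX_ATTEMPTS = 5
BASE_DELAY_SECONDS = 60

def send_with_retries(attempt_payment, payment, is_transient_error):
    """Try a payment up to MAX_ATTEMPTS times; re-raise anything non-transient."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return attempt_payment(payment)
        except Exception as exc:
            if not is_transient_error(exc) or attempt == MAX_ATTEMPTS:
                # Permanent failures (e.g. an invalid address) go to a human,
                # visibly, instead of vanishing into an unmonitored backlog.
                raise
            delay = BASE_DELAY_SECONDS * (2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, BASE_DELAY_SECONDS))
```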
At its peak, we had a staggering $20M blocked. Slowly, methodically, we unwound the mess. A bug fix here, a manual data fix there, & money started flowing again.
Three days later, $14M. A week later, $5M. Within 10 days, we were clearing the last million.
Building Scar Tissue
To me, our most important work wasn’t getting the queue to ‘normal’ (zero issues in payments is a mirage!). It was ensuring we’d never get that bad again.
I organized a retro where we all agreed that S’s knowledge needed to be extracted and distributed widely.
Over the next few weeks, I partnered with S and an Ops lead and documented not just the happy paths but every single failure mode, in a format roughly like the sketch after this list. For every error, we
1) Defined the root cause,
2) Outlined an owner, and
3) Noted resolution SLAs and escalation pathways.
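Concretely, each entry in the catalog looked roughly like this. It’s an illustrative sketch with made-up values, not our real runbook:

```python
from dataclasses import dataclass

# Illustrative shape of one failure-mode entry; the real catalog lived in our docs,
# but every error state captured the same fields.

@dataclass
class FailureMode:
    error_code: str        # what the system reports
    plain_english: str     # what it actually means
    root_cause: str        # why it happens
    owner: str             # the team (not the person) on the hook
    resolution_sla: str    # how quickly it must be cleared
    escalation_path: str   # who gets pulled in if the SLA slips

# Hypothetical example entry:
INVALID_ADDRESS = FailureMode(
    error_code="RECIPIENT_ADDRESS_INVALID",
    plain_english="The payout bounced because the recipient's address failed validation.",
    root_cause="Stale or malformed recipient data collected at onboarding.",
    owner="Payments Ops",
    resolution_sla="Corrected and resent within 2 business days",
    escalation_path="Payments on-call, then the Ops lead",
)
```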
We invested in monitoring that would have alerted us when this was a $20K problem, not a $20M crisis. We updated runbooks so the on-call engineer would know exactly what to do, or who to wake up.
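In spirit, the early-warning check is embarrassingly simple. The thresholds below are made up, but the point is that they sit orders of magnitude below crisis territory:

```python
# Hypothetical pain thresholds: alert long before the backlog is a headline number.
BLOCKED_DOLLARS_WARN = 20_000       # open a ticket and ping the channel
BLOCKED_DOLLARS_CRITICAL = 100_000  # page the on-call engineer

def backlog_alert_level(total_blocked_dollars: float, oldest_blocked_age_hours: float) -> str:
    """Classify the backlog of unsent payments by dollars blocked and by age."""
    if total_blocked_dollars >= BLOCKED_DOLLARS_CRITICAL or oldest_blocked_age_hours >= 24:
        return "critical"
    if total_blocked_dollars >= BLOCKED_DOLLARS_WARN or oldest_blocked_age_hours >= 4:
        return "warn"
    return "ok"
```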
It was a brutal few weeks, but I’m pretty damn proud of how we turned a near-catastrophe into organizational scar tissue: tougher, smarter, and more resilient than the skin it replaced.
Your Playbook For Avoiding A Meltdown
The thing is, multi-million–dollar failures like this are rarely just technical. They come from people, process, and assumptions breaking down under scaled complexity.
This is what happens when you scale 10x but your operational rigor only scales 2x.
It could have easily been a $200M crisis if we'd grown 10x larger with the same fundamentals. Payments systems don't fail gracefully; they fail catastrophically.
Here’s the playbook I wish I had:
- Identify Your Guru and Clone Their Brain: Who's the person you pull into every single incident? That's your bottleneck. Extract their knowledge into diagrams, docs, and other people's heads via training sessions.
- Set Pain Thresholds, Not Crisis Thresholds: Build proactive alerts which fire on internal thresholds that are far stricter than your external SLAs.
- Write a Field Guide to Failure: For every error state, document: what it means in plain English, its owner, and fix steps. Ambiguity during a crisis is accelerant on the fire.
- Crisis Culture Matters: When a crisis hits, focus is everything. Assign clear roles. Owner coordinates, Executors fix.
- Bonus tips for the owner: Pointing fingers will only slow you down and erode trust. Save the analysis for the retro. In the moment, your only job as a leader is to create a calm, focused space for the executors. Shield your team from executive panic, keep them caffeinated & remember, calm execution is fast execution.
My trial by fire with the ‘Guru Paradox’ isn’t unique. It’s a story that repeats in every scaling org and it’s a brutal reminder that your systems are only as strong as your team’s shared understanding of them.
If your team is scaling, don’t wait for a $20M wake-up call to expose the cracks.
Start today:
- Identify your bottlenecks
- Document your failure modes
- Build resilience into your teams, not just your systems.
And when things inevitably break? Skip the blame. Fix it, learn from it, and future-proof your product, together.
Next week we'll switch gears from fixing broken systems to building new ones!
A question I've been asked a bunch (and tbh want to personally get crystal clear on) is: “How the heck do I actually start accepting payments?”
We'll break this down for a solo founder just starting out and for an established business, and we'll also look at how the big guys do it.
Subscribe to get it delivered straight to your inbox next Thursday!
Till then, may your APIs never time out and your Ledgers always balance ✌️