A reader asked: what’s more important — preventing failures or finding the cause?
Hey —
A reader emailed me this week:
“If you could only implement one thing first in a distributed system — circuit breakers or root cause analysis — which would you choose?”
My answer surprised him:
Circuit breakers.
Because finding the root cause of an outage is useless if the outage is still spreading.
Most engineers think about reliability as a debugging problem.
It isn’t.
It’s a blast-radius problem.
When PostgreSQL slows down, Kafka producers start backing up.
When producers back up, queues grow.
When queues grow, consumer latency spikes.
When latency spikes, retries increase, piling more load onto the dependency that was already struggling.
Before long, one unhealthy dependency has become a system-wide incident.
Circuit breakers exist to stop that chain reaction.
They don’t fix the downstream service.
They protect everything upstream from getting dragged down with it.
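Here is roughly what that looks like in code. This is a minimal sketch, not a production library: the class name, the thresholds, and the `query_orders` stand-in are all hypothetical. The idea is that after a few consecutive failures the breaker stops calling the dependency and fails fast, then lets a single trial call through once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    # Hypothetical defaults: open after 3 consecutive failures, retry after 30s.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast so callers do not queue up behind a sick dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed (half-open): let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def query_orders():
    # Stand-in for a call to a slow dependency (assumption for the demo).
    raise TimeoutError("PostgreSQL did not respond")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
for attempt in range(5):
    try:
        breaker.call(query_orders)
    except CircuitOpenError:
        print("attempt", attempt, "rejected without touching the database")
    except TimeoutError:
        print("attempt", attempt, "reached the database and timed out")
```

The first three attempts hit the database and time out. After that the breaker is open, so the remaining attempts are rejected immediately and the struggling database gets room to recover. A real setup would usually pair this with a fallback response and separate breakers per dependency.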
But once you’ve contained the failure, you still have a second problem:
What actually broke?
That’s where root cause analysis becomes valuable.
The API gateway shows errors.
The consumer service shows timeouts.
Kafka lag starts climbing.
None of those are necessarily the cause.
They’re often just the symptoms.
The hardest production incidents aren’t the ones where something fails.
They’re the ones where five different systems look broken and only one of them actually is.
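One simple way to separate symptoms from causes is to look at the dependency graph. The sketch below is an illustrative heuristic, not a full RCA engine, and the service names and topology are made up for the example: among everything that is alerting, the likely root causes are the components whose own dependencies all look healthy.

```python
def likely_root_causes(depends_on, unhealthy):
    """Return unhealthy components none of whose direct dependencies are unhealthy."""
    unhealthy = set(unhealthy)
    return sorted(
        component
        for component in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(component, ()))
    )


# Hypothetical topology: each component maps to the things that can make it look bad.
depends_on = {
    "api-gateway": ["consumer-service"],
    "kafka-consumer-lag": ["consumer-service"],
    "consumer-service": ["postgresql"],
    "postgresql": [],
}

# Everything is alerting at once, but only one component has no unhealthy dependency.
alerting = ["api-gateway", "kafka-consumer-lag", "consumer-service", "postgresql"]
print(likely_root_causes(depends_on, alerting))
```

With this topology, the gateway, the lag alert, and the consumer all trace back to something else that is unhealthy, so only `postgresql` is left as a candidate. The heuristic only checks direct dependencies, so it can mislead when an intermediate component is not monitored. That is one reason real RCA tooling also correlates logs and event timing.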
Three things worth your time:
• Day 61: Circuit Breakers for Handling Component Failures
Learn how Netflix-style circuit breakers, fallback strategies, bulkheads, and retry composition prevent a single failing dependency from taking down your entire system.
• Day 166: Intelligent Root Cause Analysis (RCA)
Build an automated RCA engine that correlates logs, traverses dependency graphs, analyzes event timing, and identifies likely root causes in seconds instead of hours.
• The real production lesson:
Reliability is a two-step process.
First contain the failure.
Then find the cause.
Most teams focus heavily on the second part and not enough on the first.
Question:
What’s the longest incident you’ve seen where engineers spent hours investigating symptoms before discovering the real root cause?
Reply with your story — I’m collecting real production incidents for future lessons.
P.S. The free Distributed Systems Interview Pack covers resilience patterns, observability, fault tolerance, root cause analysis, and the production reliability concepts that show up repeatedly in senior-level system design interviews.
— Sumedh
