“Why” is too vague a question. Read my post on Medium for the true-in-life half of this post.
John Allspaw and the SNAFU catchers are studying incident retrospectives. People ask, “Why did this fail?” and you can learn a lot from that. In this talk my favorite bit is: ask “What made this not be worse?” Focus on the relative successes so that you can sustain those controls/tools/people in the future.
For a system, the useful form of “why” is “How does it stay true?”
We can look at a system from that perspective and ask, “It’s working. How does it stay working?” The answer is: people. The people above the line are monitoring and adjusting as the external environment fluctuates and changes.
We can look at each part of the system and learn something.
“The database is up. How does it stay up?” One answer is, “There’s a controller that starts new replicas when needed and shuts down ones that aren’t responding.” In another case, a controller watches and switches to failover when necessary, and another piece keeps the failover database in sync with the live one. Software below the line automates controls that people used to spend energy on.
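To make the first answer concrete, here is a minimal sketch of one pass of such a controller. Everything in it is made up for illustration: the `Replica` class, the `reconcile` function, and the `db-N` naming are assumptions, not any real orchestrator's API. A real controller would run this on a schedule and check health over the network.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical stand-in for a database replica.
    name: str
    responding: bool = True


def reconcile(replicas, desired_count, next_id):
    """One pass of a hypothetical control loop.

    Drops replicas that are not responding, then starts new ones
    until the desired count is met. Returns the new replica list
    and the next unused id.
    """
    # Shut down the ones that aren't responding.
    healthy = [r for r in replicas if r.responding]
    # Start new replicas when needed.
    while len(healthy) < desired_count:
        healthy.append(Replica(name=f"db-{next_id}"))
        next_id += 1
    return healthy, next_id


replicas = [Replica("db-0"), Replica("db-1", responding=False), Replica("db-2")]
replicas, next_id = reconcile(replicas, desired_count=3, next_id=3)
print([r.name for r in replicas])  # ['db-0', 'db-2', 'db-3']
```

The sketch only covers the two actions named above. It does not scale down when there are too many healthy replicas, and it says nothing about failover or keeping a standby in sync.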
Ask this question about various parts of your system, and you’ll learn about it. That knowledge will be useful in your next incident.
P.S.
Mike Nygard made a post along these lines a few days before I did. When the world is ready for an idea, it doesn’t come to just one person 😉