RCAs = Stable products & happy teams
RCA (root cause analysis) or post-mortem meetings for production issues are super powerful!
A friend called to ask what I do to reduce outages and production downtime for my teams. I advised him to do deeper RCAs for issues. This got me thinking about RCAs and how follow-up actions from thorough RCAs contribute immensely to the stability of systems.
I first 'properly' learned about RCAs in 2005 after a 'memorable' outage. Here is my take on RCAs today: if a startup is at any decent scale and doesn't do RCAs or post-mortem discussions for production issues, it is doing a disservice to its customers.
Most of us learn about the famous Five Whys method early on, and I highly recommend it. But startups need to do more to ensure the issues don't appear again. A thorough RCA should cover the following:
- 🔍 Detection - In the rush of building feature after feature, many startups miss basic monitoring and alerting. Tying detection to stability makes it a first-class citizen for most tech teams.
- 💥 Impact - Knowing the immediate and final impact of an outage lets teams assess the priority of an issue. Many seemingly hard decisions become easy once the impact is understood. I encourage teams to document impact so that they build empathy for customers; this has often led engineers to think about product analytics like never before.
- 🛠️ Time to mitigate/fix - When I first ask people to document this, they often assume it will be tied to team or individual performance. It's true that every team has a superstar old-timer or a bright newcomer who is really fast at debugging, but as leaders we should be thinking about enabling the larger team to be as effective as the superstars. Playbooks, feature flags, and logs are some proven ways to mitigate or fix issues quickly.
- ⌛ Timeline - It takes a lot of discipline to document a timeline. But a running timeline can be important both as a live document for debugging and as a historical document for improving the process.
- 5️⃣ 5 whys and action items - We take all the above steps so that we can act to ensure the issues don't happen again. A 5 whys analysis helps us find many action items and understand the proportional impact (and hence effort) of each one.
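
To make the list above concrete, here is a rough sketch of how an RCA could be captured as a small record. It is only an illustration under my own assumptions: the class and field names (`RcaRecord`, `TimelineEvent`, `time_to_detect`, and so on) are hypothetical, time to mitigate is measured from detection rather than from the start of the outage, and the incident data is made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class TimelineEvent:
    at: datetime
    note: str


@dataclass
class RcaRecord:
    """Hypothetical RCA record covering the five elements listed above."""
    title: str
    started_at: datetime    # when customers were first affected
    detected_at: datetime   # when an alert or a person noticed the issue
    mitigated_at: datetime  # when customer impact stopped
    impact: str             # plain-language description of customer impact
    timeline: List[TimelineEvent] = field(default_factory=list)
    whys: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from detection, not from the start of the outage.
        return self.mitigated_at - self.detected_at

    def log(self, at: datetime, note: str) -> None:
        """Add an entry to the running timeline, kept in chronological order."""
        self.timeline.append(TimelineEvent(at, note))
        self.timeline.sort(key=lambda event: event.at)


# Made-up incident, for illustration only.
rca = RcaRecord(
    title="Checkout errors",
    started_at=datetime(2024, 1, 10, 9, 0),
    detected_at=datetime(2024, 1, 10, 9, 20),
    mitigated_at=datetime(2024, 1, 10, 10, 5),
    impact="Some customers could not complete checkout",
)
rca.log(datetime(2024, 1, 10, 9, 40), "Feature flag turned off")
rca.log(datetime(2024, 1, 10, 9, 20), "Support ticket raised the issue")
rca.whys.append("Why did checkout fail? A new code path was enabled for all users.")
rca.action_items.append("Add an alert on checkout error rate")

print("Time to detect:", rca.time_to_detect())
print("Time to mitigate:", rca.time_to_mitigate())
```

Keeping detection and mitigation as separate timestamps means the two durations come straight from the timeline, so nobody has to reconstruct them from memory later.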
Production issues are the biggest productivity killers for teams and, in many cases, are directly responsible for work-life imbalance. Hence thorough RCAs can improve your productivity as well as team morale.