This post is a collaboration between Mike Fisher and Jon Williams, Fractional CTO and Technology Consultant.
John Allspaw's 2012 blog post about blameless postmortems really blew the lid off traditional ways of thinking about incident management and even today remains one of the most popular posts on Etsy Engineering’s blog Code as Craft. Instead of the typical blame game, Etsy decided to take a more empathetic, constructive approach. They recognized that mistakes and system failures were more often the result of systemic issues than of personal failures, which led to this concept of blameless postmortems. What this means is that when an incident occurs, they conduct an investigation that's more about learning than finger-pointing. The goal is to understand how the incident happened in order to prevent it in the future, rather than to punish the person or team responsible for it. It's all about fostering an environment where people feel safe to learn from their mistakes and grow.
The idea has been so influential that it's been adopted by other major tech companies like Atlassian, PagerDuty, and even Google. Atlassian has fully embraced this concept, embedding it into their incident management strategies. Their stance is that blameless postmortems are not only more effective in problem-solving but also essential for maintaining team morale. They believe it encourages more honest dialogue and deeper exploration of the root causes of an incident, leading to more robust and resilient systems in the future. PagerDuty has a similar approach for how to respond when failure occurs. They've built their entire incident response system around this concept, using it as a tool to increase system resiliency and performance.
Google, renowned for its Site Reliability Engineering (SRE) culture, is another tech giant that's adopted blameless postmortems. In fact, they've even written about it in their SRE book. Google uses postmortems to learn from failures and improve their systems, focusing on the contributing factors to an incident rather than blaming individuals. They reckon that a blameless culture is crucial to creating a safe environment where engineers feel comfortable discussing mistakes, allowing them to learn, adapt, and prevent similar issues in the future. It's fair to say that Etsy's influence in this area has been far-reaching and enduring, changing the way tech companies handle incidents and system failures. It is also important to note that while Etsy was an early advocate, the concept of blameless culture can be traced back to safety science and industries like aviation and healthcare that focused heavily on learning from accidents.
A decade on, the blameless postmortem and similar incident learning methodologies are widespread in the industry. The challenge that confronts us now is to pre-empt failures: to envisage potential failure scenarios ahead of time and devise plans to eliminate or respond to them effectively. Recognizing failure as a possibility can be a daunting task. Jon Williams, co-author of this post and a Fractional CTO and Technology Consultant, has observed many failures in his years in the industry. Despite his positive demeanor, Jon admits that he can be slow to recognize the signs of impending failure. He recalls a situation at iVillage where, six weeks before a major project launch, his supervisor was replaced. Rather than investing time in establishing a rapport with his new boss, Jon focused on the project, which he later acknowledged was a huge blunder. Although the project launched successfully, Jon left iVillage six months later.
Jon was under the impression that as long as he delivered, failure was out of the question. His career had been punctuated with successful project deliveries, each one rewarded accordingly. When faced with conflicting priorities (new boss vs major project), Jon defaulted to his usual successful approach and did not dwell on the problem. But why didn't Jon tackle the problem head-on? Because it's unsettling and stress-inducing, and, like many of us, Jon steers clear of anxiety wherever possible. Jon had a plan, and because of confirmation bias he didn't solicit advice from peers to check whether he was on the right path. As technologists, we're inclined to find solutions because staying in the problem can be uncomfortable and our instinct is to resolve it swiftly. So, the question arises, how does one manage to stay in the problem?
One strategy to familiarize team members with failure is to conduct a Failure Workshop. Think of it as a tabletop exercise on failure in a safe environment. The workshop's objective is to "stay in the failure" while fostering a supportive space for peer interaction. This is similar to a pre-mortem but it keeps the participants thinking about possible failure scenarios instead of brainstorming solutions.
Discussing failure in a workshop will likely be new and foreign to your teams, so start with a pre-workshop exercise to get them thinking about failure. In this exercise, give each person the following scenario (credit to Igor Shindel who came up with this exercise for tech leaders):
It’s exactly one year in the future. Your biggest project has become a huge failure/disaster. It doesn't matter whether you think it is successful, or whether it did in fact meet or miss goals. Your stakeholders are not happy! As your future self, write an email to your current self describing:
What went wrong with your project/stakeholders?
Why (if known)?
Who could you have asked for help that you didn't?
Be as honest and straightforward as possible; don’t sugarcoat it.
After the pre-workshop exercise, arrange a series of moderated workshops in which each person presents the outcomes of their exercise. Presenters normally need about 45 minutes to prepare three slides: Situation, Approach, and Possible Failure Scenarios. During the workshop, presenters are allotted 10 minutes to present these slides, followed by another 10 minutes for answering any clarifying questions. At the 20-minute mark, presenters go on mute and turn off their video. The rest of the participants spend the next 30 minutes discussing potential failure scenarios without any input from the presenters. This stage is often referred to as being in a "Fishbowl".
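For moderators who like to keep a timer handy, the session format described above can be sketched as a simple agenda. This is only an illustration: the phase durations come from the description above, while the `Phase` class and the `build_agenda` and `format_agenda` helpers are hypothetical names invented for this example, not part of any tool we use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One segment of a single presenter's workshop session (hypothetical helper)."""
    name: str
    minutes: int
    presenter_muted: bool


# Durations follow the format described in the post; the labels are illustrative.
PHASES = [
    Phase("Presentation (Situation, Approach, Possible Failure Scenarios)", 10, False),
    Phase("Clarifying questions", 10, False),
    Phase("Fishbowl discussion of failure scenarios", 30, True),
]


def build_agenda(phases):
    """Return (start_minute, end_minute, phase) tuples in running order."""
    agenda = []
    start = 0
    for phase in phases:
        end = start + phase.minutes
        agenda.append((start, end, phase))
        start = end
    return agenda


def format_agenda(agenda):
    """Render the agenda as human-readable lines for the moderator."""
    lines = []
    for start, end, phase in agenda:
        status = "presenter muted, video off" if phase.presenter_muted else "presenter active"
        lines.append(f"{start:02d}-{end:02d} min  {phase.name} ({status})")
    return lines


if __name__ == "__main__":
    for line in format_agenda(build_agenda(PHASES)):
        print(line)
```

Under these assumptions, one presenter's session runs 50 minutes, with the Fishbowl starting at the 20-minute mark.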
The role of the moderator is to ensure participants don't delve into solutions. The objective is to "stay in the failure". The moderator encourages out-of-the-box thinking; imagining extreme failure scenarios (e.g., we all get fired) is acceptable, given that the failure has not yet occurred!
Having participated in numerous such workshops, we find the feedback from presenters consistently intriguing. Being on mute and unable to respond allows them to truly absorb diverse viewpoints. Without the ability to refute or sway the conversation, presenters can't divert the discussion away from the tougher failure modes. Presenters also realize that it is "okay" to discuss failure, and they gain insights about their projects that they had not considered. As for the attendees, they invariably find the process enjoyable! Brainstorming potential failure states for someone else's project truly stimulates one's imagination.
This culture of embracing and learning from failure in a safe and supportive environment is an evolution of the blameless postmortem concept initiated by Etsy. It underscores the importance of anticipating and understanding potential failures rather than merely reacting to them. By fostering this mindset, organizations can facilitate the growth of a culture that not only deals effectively with failures when they occur, but also takes proactive measures to prevent them. This leads to more robust and resilient systems, and also promotes a work environment where individuals feel secure to take risks, innovate, and grow.
A blameless postmortem culture combined with practices like Failure Workshops can create a powerful shift in the way organizations deal with incidents and failures. Embracing failure, dissecting it, and learning from it not only builds stronger systems but also fosters an environment of psychological safety, creativity, and continuous improvement. These practices have proven their worth in some of the world's leading tech companies, and continue to shape how we approach problem-solving and innovation in the tech industry today. Failure, when examined and understood, can become a valuable teacher even before it happens.
If you are interested in having Jon facilitate a Failure Workshop, reach out to him.