Lead your team through an incident

When everything goes wrong, your leadership can make all the difference. Learn how to turn incidents into opportunities for growth.

During an incident

Incidents are unpleasant teachers. Once your infrastructure is down, the only way out is through. You can make the process bearable for everyone involved by doing the following:

Keep calm - Incidents are high-pressure situations, and your role as a leader is to prevent panic. Take a deep breath and bring calm to the conversation. Speak in a steady tone and at a measured pace.
Communicate impact - Communicating the impact to the customer is critical to avoiding escalations. Ensure that the impact is well understood by the customer and non-technical stakeholders. Avoid technical jargon. For example, instead of saying, “The analytics system is lagging by 100K Kafka messages,” say, “The analytics system is lagging by 2 hours; there’s no data loss.”
Establish an update frequency - Stakeholders need visibility into the incident mitigation process. Your first communication should be prompt and establish the frequency of updates, e.g., “We will update you in the next 30 minutes.” Adhere to the committed time.
No place for assumptions in the investigation - It’s easy to miss obvious things in high-pressure situations. Calmly question every assumption. For example, “It cannot be the Kafka consumer” is an assumption; you should question it until you can say, “We have verified with logs and metrics that the Kafka consumer is operating normally.”
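The jargon-free phrasing in "Communicate impact" comes down to simple arithmetic: divide the message backlog by the rate at which messages are produced. Here is a minimal sketch of that conversion, assuming a roughly steady production rate. The function names and the 50,000 messages per hour figure are hypothetical, chosen only so that the numbers line up with the example above; in practice you would read the real rate from your own metrics.

```python
def format_duration(total_minutes: int) -> str:
    """Render a whole number of minutes as plain words, e.g. '1 hour 30 minutes'."""
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour{'s' if hours != 1 else ''}")
    if minutes or not parts:
        parts.append(f"{minutes} minute{'s' if minutes != 1 else ''}")
    return " ".join(parts)


def describe_lag(system: str, lag_messages: int, messages_per_hour: float) -> str:
    """Translate a message backlog into a time-based statement for stakeholders.

    Assumption: messages arrive at a roughly constant rate (messages_per_hour),
    so backlog / rate approximates how far behind the system is.
    """
    if messages_per_hour <= 0:
        raise ValueError("messages_per_hour must be positive")
    lag_minutes = round(lag_messages / messages_per_hour * 60)
    return f"{system} is lagging by about {format_duration(lag_minutes)}."


if __name__ == "__main__":
    # Hypothetical figures: 100K messages behind, ~50K messages produced per hour.
    print(describe_lag("Analytics system", 100_000, 50_000))
```

The sketch only covers the delay. Whether there is data loss is a separate fact that you have to verify before you state it.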

Post-incident activities

After you have mitigated the incident, do the following the morning after:

Root cause analysis (RCA) - Once your team has mitigated the incident, wait a few hours and then run a root cause analysis. I recommend using the 5 Whys technique for the RCA.
No blame games - You must not blame an individual or a team in the root cause analysis. Focus on improving testing, monitoring, and the overall architecture as a team.
Plan the fixes - The last thing you need is to face the same incident a second time. Often, you use a short-term fix to mitigate the incident. Depending on the severity of the incident and the nature of the short-term fix, schedule a long-term fix. Also schedule fixes for any gaps in monitoring, logs, or tests that the RCA has revealed.