Why production incidents are helpful even if you hate them
date
Nov 16, 2024
outer_link
slug
prod-incidents-help
status
Published
tags
SWE
summary
Everyone hates prod incidents! But they can make you a better engineer if you use them as learning opportunities 🤖
type
Post
Everyone hates prod incidents
If you work on a team whose product or service has a decent number of users, chances are you have at least heard about a production incident happening somewhere in your company. These dreaded incidents can have huge financial and operational implications for the company if they are severe enough.
Because the stakes are high when a critical service is not functioning as intended, the pressure during an incident can be intense. The person or team on call will be in a tense situation, trying to at least mitigate the issue so that users are not impacted. If the incident is severe enough, every passing minute could cost the company thousands.
How can this be desirable?
No incident is desirable, especially not a Sev 1. However, having some experience resolving a Sev 2 or lower can be very beneficial when something more severe happens. If you have never been in an incident before, trying to resolve one can be quite stressful.
I have been pulled into several incident resolution calls myself, and I must say that those are when I have learned the most: quick log analysis, monitoring dashboards, performance metrics, latencies, CPU and memory usage, you name it. These are things you rarely touch if you only do development, especially at junior levels.
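To give a feel for the kind of quick log analysis I mean, here is a minimal sketch in Python. The log format, the service names, and the `summarize` helper are all made up for illustration; real logs will look different, and on a real call you would more likely reach for your logging platform's query tools. The idea is simply to count errors and spot the worst latency per service.

```python
import re
from collections import Counter

# Hypothetical log lines, assumed for illustration only:
# "<timestamp> <level> <service> latency_ms=<number>"
SAMPLE_LOGS = """\
2024-11-16T10:00:01Z INFO checkout latency_ms=120
2024-11-16T10:00:02Z ERROR checkout latency_ms=2300
2024-11-16T10:00:03Z INFO search latency_ms=80
2024-11-16T10:00:04Z ERROR checkout latency_ms=1900
2024-11-16T10:00:05Z INFO checkout latency_ms=150
"""

LINE_RE = re.compile(r"^(\S+) (\w+) (\w+) latency_ms=(\d+)$")


def summarize(log_text):
    """Return request count, error count and max latency per service."""
    errors = Counter()
    latencies = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        _timestamp, level, service, latency = match.groups()
        latencies.setdefault(service, []).append(int(latency))
        if level == "ERROR":
            errors[service] += 1
    return {
        service: {
            "requests": len(values),
            "errors": errors[service],  # Counter gives 0 when no errors were seen
            "max_latency_ms": max(values),
        }
        for service, values in latencies.items()
    }


summary = summarize(SAMPLE_LOGS)
for service, stats in summary.items():
    print(service, stats)
```

Even a rough summary like this can point you to the service that is misbehaving, which is usually the first question asked on an incident call.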
Understanding the production engineer's perspective on these services has given me a greater appreciation for writing efficient software and good test cases. After an incident, I usually come away with better insight into edge cases that I might have previously missed.
At scale, every bug in the codebase will eventually surface. You cannot hide from it. Good test cases that cover the edge cases you can anticipate will keep many of those bugs from shipping and help prevent incidents.
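As a small illustration of what covering edge cases can look like, here is a sketch in Python. The `error_rate` helper and its rules are hypothetical, not taken from any real service; the point is that the tests exercise the awkward inputs (zero traffic, impossible counts) and not only the happy path.

```python
def error_rate(errors, requests):
    """Return the fraction of failed requests (hypothetical helper)."""
    if errors < 0 or requests < 0:
        raise ValueError("counts cannot be negative")
    if errors > requests:
        raise ValueError("errors cannot exceed requests")
    if requests == 0:
        # Edge case: no traffic. Avoid a ZeroDivisionError and report 0.0.
        return 0.0
    return errors / requests


def test_error_rate():
    # Happy path
    assert error_rate(1, 4) == 0.25
    # Edge cases that tend to show up only in production
    assert error_rate(0, 0) == 0.0
    assert error_rate(5, 5) == 1.0
    for bad_args in [(-1, 10), (1, -10), (11, 10)]:
        try:
            error_rate(*bad_args)
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {bad_args}")


test_error_rate()
```

The zero-traffic case is the kind of thing that is easy to miss during development and tends to surface only once real users hit the service.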
Postmortems
Many tech companies have something called a “blameless postmortem”. This is usually a write-up explaining the incident, how it happened, and why it happened, while highlighting potential improvements and key learnings. It is meant to be blameless: instead of focusing on who caused the incident, we care only about the learnings so that we can avoid such situations in the future.
I have written a few of these myself, but I have learned the most from reading many other postmortems (PMs), even ones outside my area of work. Reading them helps me understand the bigger picture of how all our systems work together.
Conclusion
So, the next time you come across an incident, hopefully one less severe than a Sev 2, use it as an opportunity to learn. Learning how others resolved the problem, or how they identified the root cause, can help you immensely when you have to resolve your own incidents.