When service degradation and outages go from bad to worse, it’s an awful feeling. It’s bad enough that something broke in the first place. But P4 issues that turn into P1 issues cause unnecessary escalation, increase stress and deepen the long-term impact on the entire organization. Well-tuned incident response slows severity progression and often prevents it altogether.
Two steps for preventing incident severity escalation
When you’re on-call, you must have a finger ready to press that acknowledge button and take ownership of an issue. Just like with human life, the time it takes to address an issue is critical to your app’s life. So, the amount of time it takes for you to hear the alert, assess your involvement and acknowledge the notification will directly impact how the issue’s severity progresses. We call this timeframe mean time to acknowledge (MTTA). But it’s not all about seconds and minutes. It’s also about the accuracy of alert routing, notification and context. Acknowledgment by on-call staff doesn’t necessarily mean the initially notified person is the one best equipped to handle the issue. Thus, acknowledgment time must include the time it takes to get the right people and the right resources into the firefight.
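To make the difference between a fast acknowledgment and a useful one concrete, here is a minimal Python sketch. It assumes each incident record carries three timestamps; the `Incident` class, the `right_owner_at` field and the sample times are hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    triggered_at: datetime     # when the alert fired
    acknowledged_at: datetime  # when someone first pressed acknowledge
    right_owner_at: datetime   # when the right responder was engaged (hypothetical field)


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mtta(incidents):
    """Mean time from alert to first acknowledgment, in minutes."""
    return mean_minutes([i.acknowledged_at - i.triggered_at for i in incidents])


def effective_mtta(incidents):
    """Mean time from alert to the right responder being engaged, in minutes."""
    return mean_minutes([i.right_owner_at - i.triggered_at for i in incidents])


# Illustrative data only: the second alert was acknowledged quickly,
# but it took another ten minutes to reach the right person.
t0 = datetime(2024, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=2)),
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=14)),
]

print(f"MTTA: {mtta(incidents):.1f} min")
print(f"Effective MTTA: {effective_mtta(incidents):.1f} min")
```

In this made-up data set, the raw MTTA looks healthy while the effective figure is more than twice as long, which is the gap that better routing is meant to close.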
Getting the right people means more than always calling one person in your company who has lots of historical knowledge and can solve almost any problem. Effective incident acknowledgment is about bringing in the right people based on subject matter expertise. Getting the right resources involved means that, as part of the alert, there’s enough context and related information from the monitoring tool that, no matter who acknowledges the alert, they can take some type of action towards resolution. Or, at the very least, the team can quickly identify whether they’re reaching the right person after all.
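As a rough sketch of what routing by subject matter expertise could look like, the Python below maps a service name to an on-call team and attaches monitoring context to the notification. The team names, the `EXPERTS` table, the `Alert` fields and the runbook path are all hypothetical; a real incident response tool will have its own routing rules and payload format.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from a service to the team with the most expertise.
EXPERTS = {
    "payments": "payments-oncall",
    "database": "dba-oncall",
}
FALLBACK = "platform-oncall"  # used when no expert team is registered


@dataclass
class Alert:
    service: str
    summary: str
    context: dict = field(default_factory=dict)  # details from the monitoring tool


def route(alert):
    """Pick the on-call team for the alert's service, or the fallback team."""
    return EXPERTS.get(alert.service, FALLBACK)


def notification(alert):
    """Build a message with enough context for whoever acknowledges it to act."""
    lines = [f"[{alert.service}] {alert.summary}", f"routed to: {route(alert)}"]
    lines += [f"{key}: {value}" for key, value in sorted(alert.context.items())]
    return "\n".join(lines)


alert = Alert(
    service="payments",
    summary="error rate above threshold",
    context={"runbook": "runbooks/payments-errors", "recent_deploy": "yes"},
)
print(notification(alert))
```

Because the routing decision is printed in the message itself, whoever picks it up can tell right away whether the alert reached the right team.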
A lower mean time to acknowledge (MTTA) means that issues have less time to cascade. Incidents can easily get out of hand, and they often do. If the issue is related to a single service, the services that depend on it will start to throw errors as well. And a P1 outage often happens when one ancillary service brings down a primary one.
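One way to picture that cascade is to walk a dependency map outward from the failing service. The sketch below is illustrative only: the service names and the `DEPENDS_ON` map are invented, and the breadth-first walk is simply one reasonable way to list everything downstream of a failure.

```python
from collections import deque

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["database"],
    "cart": ["cache"],
    "search": ["cache"],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can look up who depends on a given service.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    affected.discard(failed)  # guard against cycles leading back to the start
    return affected


print(sorted(blast_radius("cache", DEPENDS_ON)))
```

In this toy map, a problem in the ancillary cache service reaches cart, search and eventually checkout, which is how a minor alert ends up as a P1.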
This doesn’t mean that, just because you’re fast on the acknowledge button, you’ll prevent a P1. Depending on the application or service, these incidents and errors often unfold in real time. The other critical way to look at incident severity is by measuring the amount of time it takes to recover.
There are a lot of ways to look at the resolution of an application issue. But I like to think of recovery, and overall mean time to recover (MTTR), in terms of getting the application running as expected from the end-user’s perspective, not internally. And I think of resolution as the quick, working solution to the problem. But resolution doesn’t necessarily mean the entire application or service is fully back to normal. The reason I draw this distinction is that when something is fixed in one service, it takes time for the other services to catch up.
Let’s take a simple example. Let’s say all the power goes out for a region and the region goes down. The solution is easy: get power. But once the power is back on, the entire region and all services in that region have to come back online before the app can be considered functional for the end-user, meaning the application has fully “recovered.” So, the resolution was as easy as restoring power. But the effective recovery time would be roughly 30 minutes longer, give or take.
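The same example can be put into numbers. The timestamps below are made up for illustration (the 45-minute power outage is an assumption; only the roughly 30-minute recovery lag comes from the example above), and `minutes_between` is a hypothetical helper.

```python
from datetime import datetime

# Illustrative timeline, not real data.
detected = datetime(2024, 1, 1, 9, 0)     # region loses power
resolved = datetime(2024, 1, 1, 9, 45)    # power is restored (the fix)
recovered = datetime(2024, 1, 1, 10, 15)  # app works again for end-users


def minutes_between(start, end):
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


time_to_resolve = minutes_between(detected, resolved)
time_to_recover = minutes_between(detected, recovered)
recovery_lag = time_to_recover - time_to_resolve

print(f"Time to resolve: {time_to_resolve:.0f} min")
print(f"Time to recover: {time_to_recover:.0f} min")
print(f"Lag after the fix: {recovery_lag:.0f} min")
```

Tracking both numbers separately shows how much of the end-user impact happens after the fix is already in place.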
Thus, faster acknowledgment and faster remediation through highly effective user interfaces for on-call teams, alongside quick access to relevant context and resources, mean recovery takes less time in most situations. A P1 issue will still happen from time to time, but the overall severity of that P1 incident comes down to how long it persists for the end-user. If the P1 started as a P3 service degradation that began to cascade, but you addressed that initial alert quickly, the dependent services will boot up faster or clear their queues sooner, decreasing the delay between acknowledgment, resolution and recovery.
The primary job of your incident response tool is to automate the incident-to-human interaction, surface resources based on the context of the issue, and track key metrics from incident detection to resolution. The benefit of effective incident response is that minor issues have less chance of becoming critical. And better collaboration means less time spent twiddling your thumbs, anxiously waiting for a fix that gets everything back up and running.