The last few years have seen significant shifts in how engineering organizations address operations problems. Fifteen or so years ago, the function of IT Operations was generally the sole purview of an operations team that managed fleets of servers housed in data centers. That team was responsible for all aspects of the production environment – from server installation, configuration, networking and software deployment, to incident response when those servers went down.
With the more recent adoption of Site Reliability Engineering (SRE), the paradigm has shifted to treating operations as a software problem. Functions that were once performed manually are being reimagined as code and automated. Now, development teams are better able to share ownership of the production environment with operations and infrastructure teams, and are reimagining reliability as a developer-driven domain.
The Incident Response Lifecycle
One piece of the production operations puzzle is incident response. Simply stated, incident response is the action taken to detect, triage, analyze and remediate incidents. There are also follow-up steps for resuming normal operations, along with a review of the incident to identify future improvements.
Incidents are ideally detected by automated solutions (such as monitoring software that notifies a human via an alerting system), but they can also be detected by people who observe unusual circumstances and manually trigger an alert.
Once someone has been notified, the cycle proceeds to the triage phase — where the severity of the incident is determined. The incident is then prioritized accordingly.
An open incident is analyzed for root cause, contained or mitigated, and then remediated. Assuming that the business then returns to normal, a follow-up (such as a postmortem) is conducted to further analyze the incident and find ways to prevent or mitigate similar incidents in the future.
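As a rough illustration, the phases described above can be modeled as a small state machine. This is a minimal sketch and not taken from any particular incident response product: the phase names, the three severity levels, and the user-impact thresholds used for triage are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum, IntEnum
from typing import List, Optional


class Phase(Enum):
    DETECTED = "detected"
    TRIAGED = "triaged"
    ANALYZED = "analyzed"
    REMEDIATED = "remediated"
    REVIEWED = "reviewed"  # follow-up, such as a postmortem


# Each phase may only move forward to the next one in the lifecycle.
NEXT_PHASE = {
    Phase.DETECTED: Phase.TRIAGED,
    Phase.TRIAGED: Phase.ANALYZED,
    Phase.ANALYZED: Phase.REMEDIATED,
    Phase.REMEDIATED: Phase.REVIEWED,
}


class Severity(IntEnum):
    # Hypothetical levels; a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Incident:
    title: str
    affected_users_pct: float  # assumed triage signal for this sketch
    phase: Phase = Phase.DETECTED
    severity: Optional[Severity] = None
    history: List[Phase] = field(default_factory=lambda: [Phase.DETECTED])

    def advance(self) -> Phase:
        """Move the incident to the next lifecycle phase."""
        if self.phase not in NEXT_PHASE:
            raise ValueError("incident has already been reviewed")
        self.phase = NEXT_PHASE[self.phase]
        self.history.append(self.phase)
        return self.phase


def triage(incident: Incident) -> Severity:
    """Assign a severity from user impact (thresholds are assumptions)."""
    if incident.phase is not Phase.DETECTED:
        raise ValueError("only newly detected incidents can be triaged")
    if incident.affected_users_pct >= 50:
        incident.severity = Severity.SEV1
    elif incident.affected_users_pct >= 5:
        incident.severity = Severity.SEV2
    else:
        incident.severity = Severity.SEV3
    incident.advance()
    return incident.severity


def prioritize(incidents: List[Incident]) -> List[Incident]:
    """Order triaged incidents so the most severe come first."""
    return sorted(incidents, key=lambda i: i.severity)
```

In this sketch, an incident affecting 60 percent of users would be triaged as SEV1 and sorted ahead of a SEV3 incident; a real organization would base severity on its own policies rather than a single number.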
Traditional Incident Response
Traditional incident response software lacked the ability to triage, analyze and automatically remediate incidents. It was also missing contextual awareness of potential problems and their nature, as well as automated workflows that assist responders in diagnosing and mitigating incidents. Moreover, it wasn’t possible to readily implement additional automation as code with the older generation of software.
In the past, members of the operations team would be paged and they would then respond to outages. However, the engineers who responded might have lacked insight into the software that was running on the servers, while the developers might have lacked insight into how the software ran in the production environment.
Indeed, developers might not even have been aware that there was a problem with their software. This often led to an arduous troubleshooting process, in which an operations engineer would locate the developer responsible for the software and relay information about the issue, and the developer would then formulate a solution to mitigate the issue with the help of the operations engineer.
Modern Incident Response
Incident response software has come a long way in the last couple of decades. Production environments aren’t as simple as they used to be, so incident response software has evolved to address the problems that come with more complex and dynamic environments. Automation is key — and this is where development teams come into play.
Reliability platforms seek to address this complexity by empowering software engineers to take control of the incident response lifecycle. Developers can now integrate all of the tools used in that lifecycle into a cohesive workflow, from incident detection to remediation.
Organizations can prepare predefined playbooks that automatically enact specific workflows when incidents occur. These playbooks are defined as code, so anyone can see how a given incident is addressed and make updates via revision control systems such as Git. If the playbook wasn’t adequate for a given incident, it can be reviewed as a part of the postmortem process and updated accordingly.
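To make the idea of a playbook defined as code more concrete, here is a minimal sketch. It assumes a playbook is an alert name plus an ordered list of steps. The step functions, the HighErrorRate alert name, and the alert fields are all hypothetical, and real platforms often express playbooks in their own configuration formats instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A step takes the alert payload and returns a log line describing what it did.
Step = Callable[[Dict[str, str]], str]


# Hypothetical steps; real ones would call paging, chat, or dashboard tools.
def page_owning_team(alert: Dict[str, str]) -> str:
    return f"paged {alert['team']}"


def open_incident_channel(alert: Dict[str, str]) -> str:
    return f"opened channel #inc-{alert['service']}"


def attach_dashboard(alert: Dict[str, str]) -> str:
    return f"attached dashboard for {alert['service']}"


@dataclass(frozen=True)
class Playbook:
    name: str
    matches_alert: str  # alert name that triggers this playbook
    steps: List[Step]


# Because playbooks live in source files, changes to them can be
# reviewed and versioned like any other code.
PLAYBOOKS = [
    Playbook(
        name="high-error-rate",
        matches_alert="HighErrorRate",
        steps=[page_owning_team, open_incident_channel, attach_dashboard],
    ),
]


def run_playbooks(alert: Dict[str, str],
                  playbooks: List[Playbook] = PLAYBOOKS) -> List[str]:
    """Run every playbook whose trigger matches the alert, in step order."""
    log: List[str] = []
    for playbook in playbooks:
        if playbook.matches_alert == alert["name"]:
            for step in playbook.steps:
                log.append(step(alert))
    return log
```

If a postmortem finds that a playbook fell short, the fix in this model is an ordinary change to the list of steps, which can be proposed and reviewed through revision control.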
Developer-Driven Incident Response
With these changes to the ways in which IT operations are performed – and who performs them – it’s a good time to reimagine incident response as the domain of the software developer.
Centralized, coordinated management of the incident response lifecycle will probably always be necessary — especially when it comes to handling large-scale incidents, creating incident response policies and procedures, and training teams to effectively respond to incidents.
On the other hand, it may not always be necessary to maintain a centralized incident response team that responds to every incident that occurs in an organization. Instead, incident response could be decentralized and automated, so that individual software development teams are notified and can respond when something goes wrong.
Developers understand how their software works at a code level. They understand how it interacts with other services and software running on the company platform. Since they also tend to take a software-driven approach, they’re more likely to automate parts of the lifecycle that were once manual.
Having developers respond to incidents also encourages them to take more ownership over their software when it runs in the production environment. When they encounter recurring incidents, they have an incentive either to fix their code or, when that’s not possible, to improve the incident response lifecycle via automation.
The Role of Operations
Standardizing and automating incident response also helps operations teams. Development teams may discover that an incident was due to external factors, or to a problem with the underlying infrastructure that was out of their control. At that point, they can loop in operations or any other stakeholders, provide full context for the issue, and explain which steps have already been taken to diagnose and mitigate it.
Operations teams can create standardized playbooks-as-code that take the infrastructure and platform into account, and then share them with all teams in an organization. With these playbooks, alerts that are specific to the platform or infrastructure components can be automatically routed to operations teams, and development teams can be notified if their software is impacted.
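The routing described here could look something like the following sketch. The component names, team names, and ownership table are invented for illustration, and the choice to send alerts for unknown components to operations is an assumption rather than an established rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical inventory of platform and infrastructure components.
INFRA_COMPONENTS = {"load-balancer", "kubernetes-node", "database-cluster"}

# Hypothetical mapping from a service to the development team that owns it.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "search-team"}


@dataclass
class Alert:
    component: str  # the component or service that raised the alert
    impacted_services: List[str] = field(default_factory=list)


def route(alert: Alert) -> Dict[str, List[str]]:
    """Decide which team responds and which teams are only notified."""
    if alert.component in INFRA_COMPONENTS:
        # Infrastructure alerts go to operations; owners of impacted
        # services are notified so they know their software is affected.
        respond = ["operations"]
        notify = sorted(
            {SERVICE_OWNERS[s] for s in alert.impacted_services
             if s in SERVICE_OWNERS}
        )
    else:
        # Service alerts go to the owning team. Falling back to operations
        # for unknown components is an assumption made for this sketch.
        owner = SERVICE_OWNERS.get(alert.component)
        respond = [owner] if owner else ["operations"]
        notify = []
    return {"respond": respond, "notify": notify}
```

With this table, an alert on the database cluster that impacts checkout and search would be handled by operations, while the payments and search teams would each receive a notification.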
In conclusion, by involving developers early in the incident response lifecycle, applying the core practices of SRE, and taking advantage of recent improvements in incident response software, organizations can reduce alert fatigue. These changes can also lower mean time to detect (MTTD) and mean time to recovery (MTTR), which improves business operations and the experience of the end user.