You’ve heard of playbooks. But what about playbooks-as-code? How can playbooks be managed as code, and what does that mean for SREs and incident response teams?
The short answer is that by automating the processes that are defined in conventional playbooks, playbooks-as-code take incident response and reliability engineering to the next level.
For the longer answer, keep reading. This article offers an overview of playbooks-as-code, including how they work and the benefits they offer. (And for an even longer answer, take a look at our eBook, “Reducing the Organizational Costs of Incident Response via Playbooks-as-Code,” which dives deeper into the concepts outlined in this blog post.)
What is an Incident Response Playbook?
In incident response, a playbook is a set of rules or procedures that define how a team should operate when reacting to a given type of incident or event.
The purpose of playbooks is to provide consistency in the incident response process. Playbooks help ensure that all of the engineers on your team will take the same approach to resolving a given type of problem.
In addition, playbooks help remove some of the guesswork and manual troubleshooting that would otherwise occur during the incident response process. They bring greater foresight and planning to incident response by laying out the most likely root causes of various types of events and explaining exactly how to resolve them. In this way, they enable faster and more reliable resolutions.
The Problems with Traditional Playbooks
Conventional playbooks come in the form of documents, or (at best) a set of procedures that is integrated into an incident response platform. In other words, they’re basically just lists.
Having this type of playbook on hand is better than nothing when you’re troubleshooting a problem. But traditional playbooks are subject to several major drawbacks:
- Manual operations: There is no way to trigger or advance a traditional playbook automatically. Engineers have to open it and follow the steps one by one.
- Lack of collaboration: Most playbooks are designed to help IT engineers or SREs solve specific problems. They are not designed to bring in other stakeholders (like developers) who may need to collaborate on a given issue. If that collaboration becomes necessary, it must be coordinated manually.
- Difficulty of interpretation: Playbooks are designed to be easy to follow, but in practice they often aren’t. They can be hard to interpret, and they sometimes assume background knowledge or domain expertise that the on-call engineer following them lacks.
Playbooks vs. Playbooks-As-Code
Playbooks-as-code solve these issues by using software to define and enforce the processes within incident response workflows.
In other words, instead of just using words to describe what engineers should do, playbooks-as-code use code and an enforcement engine to automate the process as much as possible. They can trigger software tools to perform certain actions on their own, rather than waiting for engineers to do so. They can also collect data and monitor the status of the incident response process to ensure that the playbook is being followed properly.
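To make the idea concrete, here is a minimal Python sketch of a playbook expressed as code, with a tiny engine that runs each step in order, records its status, and gathers data along the way. Everything in it is an assumption made for illustration: the `Step` and `Playbook` classes, the step functions, and the `checkout` service are hypothetical and do not represent StackPulse's or any other product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not the API of any real product.
Context = Dict[str, object]


@dataclass
class Step:
    name: str
    action: Callable[[Context], None]  # does the work and may add data to the context


@dataclass
class Playbook:
    name: str
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, context: Context) -> bool:
        """Run every step in order; stop and report at the first failure."""
        for step in self.steps:
            try:
                step.action(context)
            except Exception as exc:
                self.log.append(f"{step.name}: failed ({exc})")
                return False
            self.log.append(f"{step.name}: done")
        return True


def collect_pod_status(context: Context) -> None:
    # Stand-in for a real query against a cluster or monitoring tool.
    context["pod_status"] = "CrashLoopBackOff"


def restart_service(context: Context) -> None:
    # Stand-in for a real remediation call.
    context["restarted"] = True


playbook = Playbook(
    name="pod-crash",
    steps=[
        Step("collect pod status", collect_pod_status),
        Step("restart service", restart_service),
    ],
)
context: Context = {"service": "checkout"}
succeeded = playbook.run(context)
print(succeeded, playbook.log, context)
```

Because the steps are ordinary functions, the same playbook can be reviewed, versioned, and tested like any other code, and the log gives a record of which steps actually ran.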
The benefits of automating playbooks through code are numerous:
- Faster mean time to resolve (MTTR): Less time spent guessing how to proceed and manually managing processes translates to faster incident resolution.
- Improved software quality: Playbooks-as-code automatically collect data from across software environments and make that data available to developers as well as IT engineers, which helps teams optimize software quality based on real-world reliability feedback.
- Higher return on investment: The automation and speed provided by playbooks-as-code mean businesses can do more with less. Incident response stops being reactive and time-consuming and instead grows more and more efficient over time.
- Automated data collection and visibility: Playbooks-as-code collect data and track the status of the response as it runs, so teams spend fewer expensive engineering hours gathering context by hand, and customers get more reliable digital services.
- Collective ownership for incident response: Perhaps most important of all, playbooks-as-code enable all stakeholders to collaborate on incident response. Incident response stops being the job of IT engineers alone; developers and SREs can participate more readily, especially because the tool they know best – code – becomes the basis for operations.
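For readers who want to track the first of these benefits, MTTR is simply the average time between an incident being opened and being resolved. The short Python sketch below computes it from a list of timestamps; the incident data is made up for the example.

```python
from datetime import datetime, timedelta

# Made-up incidents for illustration: (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 90 minutes
    (datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 3, 23, 0)),   # 60 minutes
]


def mean_time_to_resolve(pairs):
    """Average of (resolved - opened) across all incidents."""
    total = sum((resolved - opened for opened, resolved in pairs), timedelta())
    return total / len(pairs)


mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```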
How to Build Playbooks-As-Code
The idea of automating incident response via playbooks-as-code may sound intimidating. Which coding framework do you use? What does a playbook-as-code file actually look like?
To answer these questions, browse our repository of playbooks-as-code and read about using GitHub Actions to apply playbooks via StackPulse.
Or, download our eBook, “Reducing the Organizational Costs of Incident Response via Playbooks-as-Code,” which provides links to specific playbooks-as-code for automating workflows like setting up incident war rooms or troubleshooting Kubernetes pod issues.
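The repository and eBook hold the real examples. As a rough illustration only, the Python sketch below shows one way a playbook file could be structured and how an engine might decide whether to run it for a given alert. The file format, the field names (`trigger`, `min_severity`, `steps`), the action names, and the convention that a higher severity number means a more serious alert are all invented for this sketch and are not StackPulse's actual format.

```python
import json

# Hypothetical playbook file contents; every field name here is an assumption.
PLAYBOOK_FILE = """
{
  "name": "open-war-room",
  "trigger": {"alert_type": "service_down", "min_severity": 2},
  "steps": [
    {"action": "create_chat_channel", "params": {"prefix": "incident"}},
    {"action": "page_on_call", "params": {"team": "sre"}},
    {"action": "post_summary", "params": {}}
  ]
}
"""

REQUIRED_KEYS = {"name", "trigger", "steps"}


def load_playbook(text):
    """Parse the file and reject it if a required top-level key is missing."""
    playbook = json.loads(text)
    missing = REQUIRED_KEYS - playbook.keys()
    if missing:
        raise ValueError(f"playbook is missing keys: {sorted(missing)}")
    return playbook


def matches(playbook, alert):
    """Return True when the alert satisfies the playbook's trigger."""
    trigger = playbook["trigger"]
    return (
        alert["type"] == trigger["alert_type"]
        and alert["severity"] >= trigger["min_severity"]
    )


playbook = load_playbook(PLAYBOOK_FILE)
alert = {"type": "service_down", "severity": 3}
should_run = matches(playbook, alert)
planned = [step["action"] for step in playbook["steps"]]
print(should_run, planned)
```

Keeping the trigger and steps in a plain file like this means a change to the response process goes through the same review and version history as any other code change.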
Ordinary playbooks can provide some level of consistency to incident response operations, but they fall short of delivering the fast and seamless experience that teams need to minimize MTTR and get the best return on the resources the business invests in incident response. By automating incident response, playbooks-as-code address these shortcomings, enabling teams to work faster, more collaboratively and with better results than ever.