Incident Management 2020 – What’s Changed?

Cloud has given way to cloud-native computing over the past decade. The sweeping changes at every layer of the application stack have had a bearing on incident management. In this post we compare traditional incident management with modern incident management, and highlight what’s changed. We also look at ways DevOps teams can evolve with these changes.

Incident management follows the broader trends in the DevOps world. Fundamental shifts such as the move to microservices change how incident management is implemented.

Accept the Distributed and Sometimes Fragmented Environment

Modern incident management needs to take into account the distributed nature of infrastructure, applications, and teams. Microservices running on container instances that span multiple public cloud platforms are the new normal. Teams are independent, each with its own favorite programming language, open source and DevOps tools, and processes.

Deployments are declarative, inching closer to the GitOps model that treats environments as code. There is automation at every step so that changes can be triggered in real time. The minutiae are abstracted away, giving developers the ease of running apps without worrying about the infrastructure, and giving Ops teams the ability to run infrastructure without manually configuring memory, networking, and storage. With all this power, when things do go wrong, they can go badly wrong.

What this means is that both the attack surface and the likelihood of failure increase sharply. As teams try to keep up with all this complexity and change, incident management becomes fragmented. It’s not possible to keep up with every incident manually. It takes an incident management solution that spans every cloud, every microservice, and every team.

Embrace an Open Toolbox

The arsenal of tools in the cloud has changed drastically over the past decade. Organizations are not willing to cede control to the biggest vendors like AWS and VMware, and instead want the freedom to move their workloads to whichever cloud suits their needs best. Kubernetes has become the de facto ‘operating system’ of the cloud. It is augmented by a suite of open source cloud-native tooling such as Istio, Jaeger, and Helm.

Monitoring has similarly gone from being done with a single tool to integrating multiple open source tools. The ELK stack, Fluentd, and Prometheus are prominent solutions. There are also cloud-based vendor tools for log analysis, SIEM, and APM. Just like the infrastructure they monitor, these tools cannot stand alone, and need to be integrated with each other via APIs.

This monitoring toolchain is incomplete without an incident management solution. An incident management solution completes the monitoring loop, and helps make the leap from insight to action.

Align with the Modern Culture

Culture is central to the DevOps methodology. The best architecture and tooling will fail without a change in the way teams are structured – the common language they speak, the expectations laid on them, and the spirit of collaboration. This is especially true in today’s scenario of remote DevOps teams. Communication can easily fall through the cracks.

The role of the on-call engineer is vital to the practice of incident management. In most cases, this isn’t a person hired for this specific role, but rather the entire DevOps team taking turns at on-call duty. For this setup to work well, it requires careful, up-to-date documentation and training in the form of runbooks. These are clear step-by-step instructions that tell the on-call engineer what to do in an emergency. Communication should be clear and simple so that the on-call engineer isn’t caught unprepared.
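One way to keep runbooks clear and current is to store them as structured data rather than free-form notes. The sketch below is only an illustration of that idea: the `Step` and `Runbook` types, the `render` helper, and the sample steps are all hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # what the on-call engineer should do
    verify: str  # how to confirm the action worked


@dataclass(frozen=True)
class Runbook:
    title: str
    last_reviewed: str  # ISO date, so stale runbooks are easy to spot
    steps: tuple


def render(runbook):
    """Render a runbook as a numbered plain-text checklist."""
    lines = [f"{runbook.title} (last reviewed {runbook.last_reviewed})"]
    for number, step in enumerate(runbook.steps, start=1):
        lines.append(f"{number}. {step.action}")
        lines.append(f"   Check: {step.verify}")
    return "\n".join(lines)


# Hypothetical example content, for illustration only.
example = Runbook(
    title="Checkout service: high latency",
    last_reviewed="2020-09-01",
    steps=(
        Step("Check the latency dashboard for the affected region",
             "The spike is visible and limited to one region"),
        Step("Roll back the most recent deployment",
             "Latency returns to baseline within a few minutes"),
    ),
)

print(render(example))
```

Keeping a review date next to the steps makes it easy to flag runbooks that have not been revisited in a while.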

Once an incident is resolved, it is important to avoid the blame game, and instead look to glean learnings so that the same incident isn’t repeated. Even if a team member causes a big failure, not penalizing or shaming them sends a strong message that failures are part of progress, and they need to be managed well. This does wonders to transform the spirit of a DevOps team from apprehension and hesitation to one of confidence, experimentation, and innovation.

Welcome Proactive Incident Response

Despite the challenges that modern cloud-native computing presents, incident management is only becoming faster and more responsive. With the right setup, incidents can be detected within seconds and resolved in minutes.

Slack, with its ChatOps approach to incident resolution, is a key enabler of incident management at this speed. It lets DevOps teams collaborate in a single shared space and ensures everyone has the most up-to-date view of the incident in progress.

StackPulse supercharges this collaboration by enriching and correlating alerts before sending them to Slack, ensuring that DevOps teams have the complete picture of an incident. StackPulse also provides a powerful playbook engine that teams can use to remediate incidents or perform additional maintenance if needed, all from Slack.
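To make the idea of correlating and enriching alerts concrete, here is a minimal sketch in Python. It is not StackPulse’s implementation or API. The `Alert` type, the `SERVICE_OWNERS` lookup, and the grouping rule (same service and same alert name) are assumptions chosen for illustration. The final step only builds a JSON-style payload with a `text` field, the shape Slack incoming webhooks accept, and does not send anything.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    name: str
    severity: str  # "info", "warning", or "critical"
    source: str    # which monitoring tool raised it


# Hypothetical ownership data used to enrich incidents.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
SEVERITY_ORDER = {"info": 0, "warning": 1, "critical": 2}


def correlate(alerts):
    """Group alerts that share a service and alert name."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.name)].append(alert)
    return dict(groups)


def enrich(key, group):
    """Summarize one group of alerts and attach the owning team."""
    service, name = key
    worst = max(group, key=lambda a: SEVERITY_ORDER.get(a.severity, 0))
    return {
        "service": service,
        "alert": name,
        "severity": worst.severity,
        "count": len(group),
        "sources": sorted({a.source for a in group}),
        "owner": SERVICE_OWNERS.get(service, "unassigned"),
    }


def to_chat_message(incident):
    """Build a payload with a single 'text' field for a chat webhook."""
    sources = ", ".join(incident["sources"])
    text = (
        f"[{incident['severity'].upper()}] {incident['alert']} on "
        f"{incident['service']} ({incident['count']} alerts from {sources}), "
        f"owner: {incident['owner']}"
    )
    return {"text": text}


alerts = [
    Alert("checkout", "HighLatency", "warning", "prometheus"),
    Alert("checkout", "HighLatency", "critical", "elk"),
    Alert("search", "DiskFull", "warning", "prometheus"),
]

for key, group in correlate(alerts).items():
    print(to_chat_message(enrich(key, group))["text"])
```

With this grouping, the two checkout alerts collapse into a single message, so the on-call engineer sees one incident with both sources listed rather than two separate notifications.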


Twain began his career at Google, where, among other things, he was involved in technical support for the AdWords team. Today, as a technology journalist, he helps IT magazines and startups change the way teams build and ship applications. Twain is a regular contributor at Fixate IO.

