Evolving an Incident Response Strategy as Teams and Services Grow



As a software development organization grows, it typically offers a greater number of applications and services for customers to consume. An important consideration as an organization grows in this manner is the set of processes for supporting these services.

How can an organization ensure its on-call tactics and incident response strategy are as effective as possible? Below, I’ll discuss the challenges of scaling an incident response strategy to support growing services, along with ways to manage those challenges.

The challenges in supporting more software

Supporting a greater number of applications and services creates several organizational challenges. Team structure changes, on-call teams have more to handle, and applications and infrastructure become more complex. So, you’ll need to consider the following:

More applications and services, more problems

Simply put, the more software you put out there for people to use, the more issues you can expect to encounter. In other words, you can expect your incident and call volume to increase as the organization grows its services. The sheer volume of incidents being reported will require structural changes to on-call, release management and development processes. These changes should keep incident acknowledgment and resolution times reasonable, which in turn keeps end users happy.

Supporting code written by someone else

When a development organization is small, a dynamic can exist where developers take on-call responsibilities to deal with issues that arise from the code they’ve written. If a critical production bug is reported for a certain feature, it’s common practice to track down the developer (or developers) most familiar with that portion of the application and have them resolve this issue.

But as a software organization grows in both products and personnel, this strategy becomes unsustainable. Developers take on a wider range of responsibilities that span multiple applications and projects, and these responsibilities no longer allow for the intimate familiarity with a specific system they may once have had. As a result, an organization must adapt to a new reality in which on-call personnel have to support code they didn’t write.

Automated alert routing, incident runbooks, and useful alert context in a collaborative environment can create scalable on-call incident management processes for larger teams.

Evolving your processes to support incident response

An effective on-call incident response strategy is one in which issues are acknowledged and resolved in a timely manner. As services grow, both the development and on-call processes must evolve in order for the development team to efficiently support the increased workload. Below are several steps an organization can take in an effort to provide this level of support:

Documentation is your best friend

We’re all well aware that documentation makes incident response easier. But, it’s also no secret that developers often fall short in this area. That being said, as an organization expands its applications and services, documentation becomes even more vital to efficient application support. Make sure that features are documented properly so when failures occur, on-call developers have somewhere to turn when beginning their evaluation of the issue at hand. In addition, make sure that all reported issues and resolutions are documented effectively to provide the rest of the team with a playbook for dealing with similar issues in the future.

Use Agile development techniques to your advantage

Newly-released code is more likely to fail than code released long ago. And, as discussed earlier, growing applications and services means that on-call developers will likely find themselves supporting features backed by code they didn’t write themselves. Keeping the entire development team up to date with what other team members are working on will be critical when dealing with this challenge. This means keeping the team informed of modifications that will be deployed.

Using Agile practices such as daily stand-ups (Scrum) and sprint reviews will help developers stay informed of modifications. Sprint reviews should involve demos of new features. These demos will provide context to developers who may be serving as the on-call point of contact when bugs are discovered in these new features.

Employ incident management tooling that allows for customized alerting

In addition to effective documentation and Agile development practices, on-call incident management software will help to efficiently manage higher incident volume by providing functionality for prompt notification of issues at hand. Our solution, VictorOps, is one example of such software for incident remediation. Part of managing a high volume of incidents is reducing MTTA (mean time to acknowledgment).

MTTA is the average time between an issue being reported and on-call personnel acknowledging it. VictorOps not only provides such alerting functionality, but does so in a manner that can be thoroughly customized based on easily identifiable aspects. These aspects include the time of day, the day of the week, the criticality of the issue being reported and whatever other context your organization needs.

Is an issue of such high importance that all necessary personnel need to be notified immediately regardless of the time of day? This can be configured. But, maybe a situation involves a problem discovered on the weekend that can wait until Monday. In this case, less invasive alerting can be set up to prevent a low-impact bug from rousing on-call developers out of their beds on a Saturday night.
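To make the idea concrete, here is a minimal sketch of that kind of routing decision in plain Python. It is not VictorOps configuration or its API. The severity levels, the channel names and the 08:00 to 20:00 working window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    summary: str
    severity: str        # hypothetical levels: "critical", "warning", "info"
    raised_at: datetime


def choose_notification(alert: Alert) -> str:
    """Return a hypothetical notification channel for an alert."""
    is_weekend = alert.raised_at.weekday() >= 5      # Saturday=5, Sunday=6
    is_night = alert.raised_at.hour < 8 or alert.raised_at.hour >= 20

    # Critical issues page someone immediately, whatever the time.
    if alert.severity == "critical":
        return "page_now"
    # Anything less urgent waits if it arrives outside working hours.
    if is_weekend or is_night:
        return "queue_until_next_business_day"
    return "chat_message"


if __name__ == "__main__":
    saturday_night = datetime(2024, 6, 1, 23, 30)    # 1 June 2024 was a Saturday
    print(choose_notification(Alert("Checkout is down", "critical", saturday_night)))
    print(choose_notification(Alert("Typo on settings page", "info", saturday_night)))
```

In this sketch the outage pages someone right away, while the typo waits for the next business day. A real policy would likely also account for time zones and on-call rotations.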

Simply put, VictorOps enables your organization to empower on-call personnel to acknowledge incidents faster, when the situation calls for it, improving incident response while mitigating alert fatigue. This is the first step in reducing the time it takes to resolve issues strategically and efficiently, thus increasing the effectiveness of your incident management strategy.

Leverage monitoring to provide the team with the information they need, as they need it

As applications and services grow, it becomes even more critical to monitor them appropriately, collecting metrics that can be aggregated and leveraged to provide actionable insights. With so much going on as an organization continues to grow, this cannot be a manual process. Utilize application performance management and server monitoring software to allow the organization to effectively monitor applications and services for performance inconsistencies and other issues that may require prompt remediation.

For instance, VictorOps provides many useful integrations with such products and with an organization’s own systems, allowing prompt notifications to be sent to on-call developers when problems arise.

Continually refine incident response processes as time goes on

Incident management processes, like any other process associated with software development, will need to evolve over time. Use the data collected by the incident management process itself to help decrease the amount of time it takes to acknowledge and resolve issues within applications. When done properly, contextualized monitoring data can provide what’s needed to draw actionable insights and improve processes.
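As a rough illustration of the kind of measurement involved, the sketch below computes average acknowledgment and resolution times from a handful of made-up incident records. The record fields (`triggered`, `acknowledged`, `resolved`) are hypothetical, and it assumes both averages are measured from the moment the alert fires. Your own tooling may define and export these differently.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    durations = [
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    ]
    return mean(durations)


# Made-up sample data; real records would come from your incident tooling.
t0 = datetime(2024, 6, 3, 9, 0)
incidents = [
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=4),
     "resolved": t0 + timedelta(minutes=50)},
    {"triggered": t0,
     "acknowledged": t0 + timedelta(minutes=10),
     "resolved": t0 + timedelta(minutes=70)},
]

mtta = mean_minutes(incidents, "triggered", "acknowledged")
time_to_resolve = mean_minutes(incidents, "triggered", "resolved")
print(f"Mean time to acknowledge: {mtta:.1f} min")
print(f"Mean time to resolve: {time_to_resolve:.1f} min")
```

Tracking these two numbers over time is one simple way to see whether process changes are actually helping.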

Then, build in an adequate amount of time to analyze this data on a regular basis. In doing so, call volume can be reduced by recognizing cases in which permanent fixes can be applied to commonly reported issues. Such insight improves application quality, lowers the workload for on-call staff (keeping them happier) and allows development personnel to spend more time innovating.
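One simple way to spot candidates for a permanent fix is to count how often the same issue recurs. The sketch below does this over a made-up incident log. The service names, the cause labels and the threshold of two occurrences are all assumptions for illustration, and real incident data will rarely be labeled this cleanly.

```python
from collections import Counter

# Made-up incident log: (service, short cause label)
incident_log = [
    ("checkout", "db connection pool exhausted"),
    ("search", "index rebuild timeout"),
    ("checkout", "db connection pool exhausted"),
    ("checkout", "db connection pool exhausted"),
    ("search", "index rebuild timeout"),
    ("billing", "expired API credential"),
]


def recurring_issues(log, min_count=2):
    """Return (issue, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(log)
    return [(issue, n) for issue, n in counts.most_common() if n >= min_count]


for (service, cause), n in recurring_issues(incident_log):
    print(f"{n}x {service}: {cause}")
```

Here the checkout connection pool problem surfaces at the top of the list, which makes it a reasonable first candidate for a permanent fix.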

Scaling incident response and expanding on-call operations as services grow

Strategies for incident response differ at every organization, but one thing remains certain: as organizations grow their applications and services, both the development and incident management strategies used within the organization must evolve. Developers must be kept as informed as possible about the changing state of their applications, and processes must allow incident information to travel quickly throughout the organization.
Properly documenting systems and processes, leveraging the benefits of Agile development practices, and taking advantage of the functionality provided by incident management software will allow any organization to maintain and improve issue response times as applications and services grow.

Scott Fitzpatrick has over 5 years of experience as a software developer. He has worked with many languages, including Java, ColdFusion, HTML/CSS, JavaScript and SQL. Scott is a regular contributor at Fixate IO.

