Leveraging Incident Response for Application Quality



Incident response tools are most often used for production applications. But their benefits extend well beyond production, across the entire application development and delivery lifecycle, especially when it comes to application quality.

Generally, we think of incidents as issues that show up during application delivery or in production environments. This outlook doesn’t do justice to the tooling and processes set up around incidents. It neglects what breaks during software development and what happens after incidents have come and gone. As it relates to application quality, there are several critical points where incident response processes and tools are especially beneficial.

Automated testing

Quality engineering (QE) teams spend a lot of time creating testing strategies and writing test automation with common test automation frameworks such as Selenium or Appium to perform functional tests on the application. These tools are powerful ways to test applications at the speed of computers and collect real-time feedback.

QEs aren’t often on-call. But it can hurt uptime and your engineers’ general quality of life when they return to the office the day after starting an automated test run and find that the run crashed before it could complete. That means a re-run, and re-runs often mean delayed releases.

QEs can be on-call, and they shouldn’t be afraid to run full test automation on a Friday, or continuously for that matter. Creating alerts from failed test runs helps QEs respond quickly to issues and fix or re-run tests. Test failures can stem from infrastructure issues, just as application failures can. You can build a platform that alerts on failures in the test stack, such as Selenium or Java, or on a broken test suite or test case. Any of these can trigger an alert, and the alert can carry context about what broke. The same can be done with automated testing services like Sauce Labs and BrowserStack. QEs can then be notified about test issues, acknowledge and resolve these incidents, and quickly get tests up and running again, reducing mean time to acknowledge and mean time to resolve (MTTA/MTTR) to mere minutes.
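As a minimal sketch of turning a failed test run into an alert that carries context, consider the following. The `TestRunResult` record, the `build_alert` function and the payload field names are all hypothetical and chosen for illustration. They are not the API of any particular incident response tool or testing service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TestRunResult:
    """Hypothetical summary of one automated test run."""
    suite: str
    passed: int
    failed: int
    crashed: bool                 # True if the run aborted before finishing
    error: Optional[str] = None   # e.g. a driver or infrastructure error message


def build_alert(result: TestRunResult) -> Optional[dict]:
    """Return an alert payload for a broken run, or None if the run is healthy."""
    if result.crashed:
        # The run itself broke, so someone needs to fix and re-run it.
        summary = f"Test suite '{result.suite}' crashed before completing"
        severity = "error"
    elif result.failed > 0:
        summary = f"Test suite '{result.suite}' finished with {result.failed} failed test(s)"
        severity = "warning"
    else:
        return None

    # Context travels with the alert so the responder can see what broke.
    return {
        "summary": summary,
        "severity": severity,
        "details": {
            "passed": result.passed,
            "failed": result.failed,
            "error": result.error,
        },
    }
```

In practice the returned payload would be handed to whatever alerting integration the team already uses. Separating a crashed run from a run with failing tests is one possible design choice, since the first calls for a re-run and the second for a look at the application.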

While the negative impact of a test run failure isn’t directly tied to end-users, it does affect them. Addressing test run failures faster makes test results more timely. Timely test results, in turn, keep the deployment process smooth and ensure issues aren’t overlooked for the sake of time. Overlooked issues are common in modern applications and directly hurt application quality, meaning those who are on-call for production issues get called more.

Vulnerability scanning

In automated continuous integration and continuous delivery (CI/CD) pipelines there are often (or there should be) vulnerability scanning steps. DevOps and IT practitioners need to check code and artifacts for known vulnerabilities, with tools from vendors like Sonatype and JFrog, before they go out the door. Incident response tools can (and should) be set up to alert the team to critical vulnerabilities. It’s likely these incidents won’t be set to a P1 severity, as they’re not yet in production, but when the team gets back in the office, these issues should be at the top of the list. Releasing code with a known critical vulnerability is a recipe for disaster that can easily be avoided with simple visibility and proactive alerting.
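The prioritization described above could be sketched roughly as follows. The `Finding` record, the `alert_priority` function and the "P2" label are assumptions made for illustration. The only level named in this article is P1, and real scanners and incident response tools define their own severity and priority schemes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    """Hypothetical vulnerability finding reported by a scanner."""
    component: str
    severity: str  # assumed values: "low", "medium", "high", "critical"


def alert_priority(finding: Finding, in_production: bool) -> Optional[str]:
    """Map a finding to an alert priority, or None if no alert is raised."""
    if finding.severity != "critical":
        # Only critical findings raise an alert in this sketch; the rest
        # stay in the scan report.
        return None
    if in_production:
        # A critical vulnerability in running code needs immediate response.
        return "P1"
    # Not yet released: no page, but top of the list when the team is back.
    return "P2"
```

For example, `alert_priority(Finding("logging-lib", "critical"), in_production=False)` returns `"P2"` under these assumptions, while the same finding in production returns `"P1"`.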

Continuous testing – extra credit

For extra credit, in both functional testing and vulnerability scanning, there’s a big movement to embrace continuous testing, where functional test scripts and vulnerability scans are run on a regular basis against production instances of the application. Continuous testing is highly useful because modern applications are made up of services with varied lifecycles. So, it’s hard to get a picture of the quality of the entire application without regular observation.

Also, CI/CD environments often don’t have direct parity with the production environment infrastructure. The differences in configuration (i.e. configuration drift) can cause issues that wouldn’t show up in testing. So, testing in production can help surface those hard-to-find problems. With continuous vulnerability testing, critical vulnerabilities can be identified after production versions have been running for some time. In this case, the alert would be critical, and those responsible, such as the DevSec or SecOps teams, should respond immediately when new vulnerabilities show up.

Learning from what broke

Finally, application quality can benefit from incident response tooling and processes by learning from incidents that have come and gone. Previous incidents can indicate how quality engineering teams should adapt their quality strategy and what they should test for. Good incident response tools provide historical data on the impact of an incident, the path to resolution and a full audit trail. So, when an incident originated at the application layer, QE teams can incorporate the outcome into future tests or their general approach to testing.
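One simple way to use that history is to tally past incidents by where they originated and let the tally guide where new tests go. This is a sketch under assumptions: the incident records are plain dictionaries with a hypothetical `layer` field, and the function name is illustrative. Real incident response tools export history in their own formats.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_layers_by_incidents(incidents: Iterable[dict], top: int = 3) -> List[Tuple[str, int]]:
    """Return the layers with the most past incidents, most frequent first."""
    # Each record is assumed to carry a "layer" key such as "application"
    # or "infrastructure" describing where the incident originated.
    counts = Counter(incident["layer"] for incident in incidents)
    return counts.most_common(top)


# Hypothetical exported history, for illustration only.
history = [
    {"id": 1, "layer": "application"},
    {"id": 2, "layer": "infrastructure"},
    {"id": 3, "layer": "application"},
]

for layer, count in rank_layers_by_incidents(history):
    print(f"{layer}: {count} incident(s)")
```

A layer that keeps appearing at the top of this list is a reasonable candidate for additional test coverage.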

Powerful incident response and full test coverage mean more than getting alerts from all your monitoring tools. They also mean leveraging incident response throughout your software delivery chain, especially as it relates to application quality.

Chris is a bad-coder-turned-technology-advocate who understands the challenges and needs of modern engineers, as well as how technology fits into the business goals of companies in a demanding high-tech world. He speaks and engages with end-users regularly in the areas of modern AppDev, Site Reliability Engineering, DevOps, and Developer Relations. He was one of the original founders of the developer marketing agency Fixate IO, and currently works as a Sr. Manager in HubSpot’s Developer Relations team.

