A risk-averse business or engineering culture can slow product delivery and cost business opportunities. Drawing on risk management best practices, this post explains how to create a risk-aware culture that measures risk with a data-driven approach, smooths out the technical operations of a business, and speeds up continuous delivery.
Risk-Averse or Risk-Aware?
Software development has changed a great deal over the last few decades. Many engineering organizations have had to adapt to changing business requirements and a changing market landscape. In doing so, they often take on more risk, because release cycles are shortened to meet these needs, which include continuous delivery.
This increase in risk – whether real or imagined – often leads to organizations avoiding risk rather than embracing it as a natural part of their technical operations and software development processes. Businesses that do this become risk-averse, and consequently lose a competitive edge as their release cycles are slowed and they lose out on potential opportunities.
On the other hand, businesses that embrace a culture of risk awareness seek to manage risk in a way that allows them to maintain their edge in a competitive environment. This culture of awareness can be implemented at all levels and in all units of the organization through common risk management processes. Coupled with modern operational and development methodologies, these processes form a practical strategy that encourages innovation and keeps a business running smoothly.
Risk management is the practice of identifying and evaluating risks and their potential impact on a business, and implementing controls to mitigate the impact of those risks.
What are some risk management best practices?
1. Embrace Risk
Acceptance of failure and embracing risk are key tenets of DevOps and Site Reliability Engineering (SRE). Acknowledging risk is the first step to creating a risk-aware culture. One of the key things that every executive, manager, and software or operations engineer needs to acknowledge is that any business activity will always carry some level of risk.
Creating a culture of risk-awareness needs to start from the top and involve all of the stakeholders of the engineering organization. Roles and responsibilities must be clearly defined. A clear risk management plan must be documented through company policies, procedures, and guidelines. Everyone involved needs to feel like they are a part of the process and have some level of ownership over it.
One key piece of this is an Incident Response Plan that states how the organization responds to unforeseen risks, as well as how those risks are to be avoided in the future. Postmortems and retrospectives are key software development practices that should be leveraged as a part of a risk-aware culture.
But how do your stakeholders know what risks exist and how to address them?
2. Measure Risk
Another key tenet of DevOps and SRE is to measure everything. The next step in creating a risk-aware culture is to identify risks and evaluate their impact on business operations, including operations that ensure continuous delivery. One way of doing this is through the creation of a risk assessment.
Risk assessments are required by many certification and compliance frameworks, such as PCI DSS and ISO/IEC 27001. While the creation of a full risk assessment that is compliant with these standards is beyond the scope of this article, we’ll cover it here at a high level in the context of technical operations and software development.
A simple risk assessment from this perspective would contain the following elements:
- First, identify potential sources of risk. How would one do this? From an engineering perspective, a robust Application Performance Monitoring (APM) system measuring Key Performance Metrics such as Application Latency, Error Rate, and Availability helps to pinpoint and quantify sources of potential risk. Modern monitoring stacks have come a long way over the last two decades, providing visibility into all parts of the application stack, including backend databases, application instances, and infrastructure components, all the way out to front-end clients.
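- Second, determine the probability that a risk will occur. How is this accomplished? Using APM, one can count the occurrences and measure the duration of incidents over a given time period, then use that history to estimate how likely a recurrence is. These estimates feed into the risk matrix, a table that rates each risk by its likelihood and its severity.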
- Third, determine the severity of a risk should it occur. A good way to measure this is in terms of lost revenue. Using the Availability metric as an example, if the site goes down for an average of x minutes per month, then the lost revenue per month would be x multiplied by the revenue lost per minute of downtime.
- Finally, assign a risk rating based on overall impact to the business. In a simplified form, this would be labels of “Low,” “Medium,” and “High.” In the risk matrix, a risk with a low likelihood of occurrence and a low severity would be rated low. A risk with a low likelihood of occurrence but a severe impact might be rated medium. Along those lines, a risk with a high likelihood of occurrence and a high severity would be rated high.
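The four elements above can be sketched in a few lines of Python. This is a minimal illustration rather than a compliant risk assessment: the `Risk` class, the threshold values, the sample figures, and the two-level likelihood and severity scales are all assumptions made for the example. The text above only describes three cells of the matrix, so the high-likelihood, low-severity cell is assumed here to be "Medium".

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune these to your own business.
LIKELIHOOD_HIGH = 0.5      # monthly probability at or above this is "High"
SEVERITY_HIGH = 10_000.0   # monthly lost revenue at or above this is "High"

# Risk matrix: (likelihood level, severity level) -> risk rating.
RISK_MATRIX = {
    ("Low", "Low"): "Low",
    ("Low", "High"): "Medium",
    ("High", "Low"): "Medium",  # assumption: not specified in the article
    ("High", "High"): "High",
}


@dataclass
class Risk:
    name: str
    months_observed: int                   # length of the APM history
    months_with_incident: int              # months in which the risk occurred
    avg_downtime_minutes_per_month: float  # the "x" from the severity step
    cost_per_minute: float                 # revenue lost per minute of downtime

    @property
    def probability(self) -> float:
        """Estimated chance that the risk occurs in a given month."""
        return self.months_with_incident / self.months_observed

    @property
    def monthly_loss(self) -> float:
        """Severity as lost revenue: x multiplied by the cost per minute."""
        return self.avg_downtime_minutes_per_month * self.cost_per_minute

    @property
    def rating(self) -> str:
        likelihood = "High" if self.probability >= LIKELIHOOD_HIGH else "Low"
        severity = "High" if self.monthly_loss >= SEVERITY_HIGH else "Low"
        return RISK_MATRIX[(likelihood, severity)]


# Made-up sample data, for illustration only.
checkout_outage = Risk("Checkout outage", 12, 9, 30.0, 500.0)
dns_failure = Risk("DNS provider failure", 12, 1, 20.0, 500.0)
slow_reports = Risk("Slow reporting page", 12, 2, 3.0, 50.0)

for risk in (checkout_outage, dns_failure, slow_reports):
    print(f"{risk.name}: p={risk.probability:.2f}, "
          f"loss=${risk.monthly_loss:,.0f}/month, rating={risk.rating}")
```

A real assessment would likely use more than two levels per axis and a richer severity measure, but the shape stays the same: measured inputs go in, and a rating comes out of a lookup that everyone can read and challenge.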
Now that you’ve identified all of your risks, the probability they will occur, their severity, and risk rating, you can approach these risks in quantifiable terms and formulate a plan to address them. In a risk-aware business culture, what would be the next step?
3. Communicate Risk
Reducing organizational silos is another tenet of DevOps and SRE. Knowledge sharing and shared ownership of risk factor heavily into the success of a risk management plan.
Communication of risk is a critical piece of the risk management puzzle. As stated earlier, stakeholders must be involved at all levels of the company if you want to achieve a risk-aware culture. Knowledge is power, and one of the keys to achieving organizational knowledge is communication.
A thorough review of all identified risks and steps to mitigate them is required on an ongoing basis. As mitigation efforts are underway, regular reviews of these efforts should be conducted by all stakeholders. As new risks are identified, they must be included in the risk matrix and communicated as a part of these reviews. Shining a light on risks and mitigations allows the business to make informed decisions about where to focus resources, the effort required, progress made, and the effectiveness of controls in a comprehensive risk management plan.
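One lightweight way to support these reviews is to keep the risks in a shared register and generate the review agenda from it. The sketch below assumes a hypothetical `RiskRegister` with made-up fields and sample entries; it lists open risks with the highest rating first and flags anything identified since the last review.

```python
from dataclasses import dataclass, field
from datetime import date

# Sort order for the review agenda: most serious risks first.
RATING_ORDER = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class RegisterEntry:
    name: str
    rating: str          # "Low", "Medium", or "High"
    owner: str           # team or person accountable for the mitigation
    mitigation: str      # current mitigation plan or status
    identified_on: date
    mitigated: bool = False


@dataclass
class RiskRegister:
    entries: list = field(default_factory=list)

    def add(self, entry: RegisterEntry) -> None:
        """Record a newly identified risk."""
        if entry.rating not in RATING_ORDER:
            raise ValueError(f"unknown rating: {entry.rating!r}")
        self.entries.append(entry)

    def review_agenda(self, last_review: date) -> list:
        """Open risks, highest rating first, with new ones flagged."""
        open_entries = [e for e in self.entries if not e.mitigated]
        open_entries.sort(key=lambda e: RATING_ORDER[e.rating])
        lines = []
        for e in open_entries:
            tag = "NEW " if e.identified_on > last_review else ""
            lines.append(
                f"{tag}[{e.rating}] {e.name} (owner: {e.owner}): {e.mitigation}"
            )
        return lines


# Made-up sample entries, for illustration only.
register = RiskRegister()
register.add(RegisterEntry("Slow reporting page", "Low", "Web team",
                           "Add caching", date(2024, 1, 10)))
register.add(RegisterEntry("Checkout outage", "High", "Payments team",
                           "Add a second payment provider", date(2024, 3, 5)))
register.add(RegisterEntry("Expired TLS certificate", "Medium", "Platform team",
                           "Automate renewal", date(2024, 1, 20), mitigated=True))

agenda = register.review_agenda(last_review=date(2024, 3, 1))
for line in agenda:
    print(line)
```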
You’ve involved your stakeholders, and identified and communicated risks. What now?
4. Mitigate Risk
All of these efforts amount to nothing if mitigation isn’t included as part of a risk management plan. Once the risks have been reviewed and resources can be assigned to them, solutions will need to be presented and put into place to control risk. Technical skill and experience come into play during this phase, and this is where a robust SRE program can help.
Two other key tenets of DevOps and SRE are to implement gradual change, and to leverage tooling and automation. Through gradual change, the cost of failure (severity) becomes smaller, and risk is therefore reduced. Using automation, error-prone manual processes can be captured as code, which can be reviewed and updated as risks change. Automated responses to risks are also more consistent. Reducing or eliminating the chance of errors and creating consistency together address both the probability that a risk will occur and its severity should it occur.
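As a rough sketch of how gradual change and automation fit together, the function below shifts traffic to a new release in stages and rolls back automatically if the error rate climbs too high. The function name, its parameters, the stage percentages, and the 1% error threshold are all hypothetical. In a real system the two callables would talk to your load balancer and your monitoring stack, and each stage would wait long enough to collect meaningful data before checking the error rate.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    observed_error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
) -> bool:
    """Send traffic to a new release in steps, rolling back on high errors.

    Returns True if the final stage was reached, False if rolled back.
    """
    for percent in stages:
        set_traffic_percent(percent)
        # A real rollout would pause here to let metrics accumulate.
        if observed_error_rate() > max_error_rate:
            set_traffic_percent(0)  # automated, consistent rollback
            return False
    return True


# Simulated run: the error rate spikes once 50% of traffic is shifted.
traffic_history = []
simulated_error_rates = iter([0.001, 0.002, 0.05])

succeeded = staged_rollout(
    stages=[1, 10, 50, 100],
    set_traffic_percent=traffic_history.append,
    observed_error_rate=lambda: next(simulated_error_rates),
)
print(succeeded, traffic_history)
```

In this simulated run the rollout stops at the 50% stage and traffic to the new release is set back to zero, so the failure never reaches every user. That is the severity reduction that gradual change is meant to provide, and because the rollback is code, it behaves the same way every time.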
Managing uncertainty doesn’t need to be a scary thing. It can be analyzed and evaluated in an objective way that allows risk to be addressed using a data-driven approach. With any luck, some of the best practices outlined here will help you on your way to creating a risk-aware culture.