Why an AIOps solution is key to helping SREs connect data across development and operations
The role of site reliability engineers (SREs) has changed drastically over the past decade. Once firefighters who reacted to incidents, SREs now aim to go straight to the root cause of issues and address them at every stage, from step one all the way to deployment. Modern approaches like AIOps are also improving service levels in ways not previously thought possible. Let’s discuss the key trends that affect how code is deployed to production, and how SREs can use AIOps to improve the entire process.
Start with Code Creation
The software development life cycle has been shifting left. AIOps follows this trend and puts a tremendous focus on the initial parts of the process. Dev teams have become collaborative. As code is written by numerous developers all working in a distributed manner, their code contributions are managed using modern code repository solutions like Git.
Once written, code is committed from a local machine, and automated tests are run on the code. These build scans, or dry runs as they’re sometimes called, are preliminary checks for quality of code.
Following the shift-left movement, SREs now have good reason to encourage such early automated checks. The best time to spot a bug is during development. Previously, this wasn’t possible, but thanks to repository-based development, there is increased visibility and collaboration right from step one.
Enforce Quality Control Alongside QA Teams
Once code passes the dry run, the build process is initiated by a CI server like Jenkins. This CI process also includes automated testing. Here, unit tests and integration tests are run to see how the code interacts with existing services. This is a crucial step, not just for QA but also for SREs.
While QA owns the creation and execution of test scripts, SREs are key stakeholders who help decide which scripts should be written and what each test should check for. While QA tends to focus on code quality, SREs bring insight into the real-world reliability of the code.
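As a rough illustration of the kind of reliability-focused check an SRE might ask QA to add, the sketch below tests how code behaves when a dependency times out. The names `get_price`, `DependencyTimeout`, and the cached-price fallback are hypothetical and exist only for this example; they are not taken from any particular tool or codebase.

```python
import unittest


class DependencyTimeout(Exception):
    """Hypothetical error raised when a downstream service does not respond."""


def get_price(fetch_live_price, cached_price):
    """Return the live price, falling back to the cached one on a timeout."""
    try:
        return fetch_live_price()
    except DependencyTimeout:
        return cached_price


class GetPriceReliabilityTest(unittest.TestCase):
    def test_falls_back_to_cache_on_timeout(self):
        # Simulate the failure an SRE cares about: the dependency is down.
        def failing_fetch():
            raise DependencyTimeout("pricing service did not respond")

        self.assertEqual(get_price(failing_fetch, cached_price=9.99), 9.99)

    def test_uses_live_price_when_available(self):
        self.assertEqual(get_price(lambda: 12.50, cached_price=9.99), 12.50)
```

Saved as a module, these tests could be run with `python -m unittest` as part of the CI stage described above.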
This method of automating tests with scripts fits naturally with how AIOps works. It saves time by allowing far more tests to run in parallel and, more importantly, it improves the quality of test results because human error is greatly reduced.
Automate Code Deployments
AIOps influences the entire software development pipeline. Once code is built and tested the AIOps way, it greatly improves an SRE’s confidence in the resulting code. While the code is ready to be deployed, the last mile of deployment is not complete without automation.
For deployments, AIOps encourages the use of runbooks and automated scripts that deploy code to varied targets. This again follows the trend of deployment automation we’ve seen in the industry at large. AWS CloudFormation uses a template-based approach to automate deployments, and more recently, Helm charts have been growing wildly popular for their ability to automate deployments in a Kubernetes environment.
At this stage, it is helpful to add metadata such as tags and labels to the new functionality and code snippets that are being deployed. Categorizing events applies not just to code but to every deployment-related event, and it is a great way to see the whole story of how a deployment was executed and how it affected the rest of the application. This helps to organize operations and, importantly, brings greater visibility into issues that crop up after deployment.
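A minimal sketch of what tagged deployment events could look like follows. The `DeploymentEvent` class, its fields, the category values, and the sample data are all assumptions made for illustration; real AIOps and deployment tools define their own event schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class DeploymentEvent:
    """One deployment-related event, annotated with tags for later filtering."""
    service: str
    version: str
    category: str  # illustrative values: "deploy", "rollback", "config-change"
    timestamp: datetime
    tags: Dict[str, str] = field(default_factory=dict)


def filter_events(events: List[DeploymentEvent], **wanted_tags: str) -> List[DeploymentEvent]:
    """Return the events whose tags contain every requested key/value pair."""
    return [
        event for event in events
        if all(event.tags.get(key) == value for key, value in wanted_tags.items())
    ]


# Hypothetical sample data.
events = [
    DeploymentEvent("checkout", "1.4.2", "deploy",
                    datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc),
                    {"team": "payments", "feature": "express-pay"}),
    DeploymentEvent("search", "2.0.0", "deploy",
                    datetime(2024, 1, 10, 9, 5, tzinfo=timezone.utc),
                    {"team": "discovery"}),
    DeploymentEvent("checkout", "1.4.1", "rollback",
                    datetime(2024, 1, 10, 9, 40, tzinfo=timezone.utc),
                    {"team": "payments", "feature": "express-pay"}),
]

payments_events = filter_events(events, team="payments")
```

Filtering on `team="payments"` returns the checkout deploy and its later rollback together, which is the "whole story" view that consistent tagging makes possible.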
Keep Track of Change History
With all the automated processes and metadata available at each step thus far, AIOps gives SREs a lot to work with so they’re never left wondering what happened right after a deployment. Immediately after deployment is when most real-world performance issues are detected. With a wealth of information available in the form of logs, metrics, tags, and events, SREs can quickly perform root cause analysis. They can clearly trace the impact and origin of each issue, which services it affects, and what remedial action is required.
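One simple way to picture this kind of tracing is to list the changes that landed shortly before an incident. The sketch below assumes change records are plain dictionaries with hypothetical `service`, `change`, and `time` keys, and uses an arbitrary 30-minute default window; it only narrows down candidates and is not a complete root cause analysis.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_time, window_minutes=30):
    """Return changes made within the window before the incident, newest first."""
    window_start = incident_time - timedelta(minutes=window_minutes)
    candidates = [c for c in changes if window_start <= c["time"] <= incident_time]
    return sorted(candidates, key=lambda c: c["time"], reverse=True)


# Hypothetical change history across services.
change_log = [
    {"service": "search", "change": "deploy 2.0.0", "time": datetime(2024, 1, 10, 8, 0)},
    {"service": "checkout", "change": "deploy 1.4.2", "time": datetime(2024, 1, 10, 9, 0)},
    {"service": "checkout", "change": "config: raise timeout", "time": datetime(2024, 1, 10, 9, 10)},
]

# Changes that landed in the 30 minutes before a 09:20 incident.
suspects = recent_changes(change_log, incident_time=datetime(2024, 1, 10, 9, 20))
```

Here the two checkout changes are returned, most recent first, while the earlier search deployment falls outside the window.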
This ability to track change history over time and across various services is a central component of the AIOps strategy.
Monitor for Long-Term Reliability
Once a deployment is considered stable, the job is still not done. SREs now need to monitor the application and infrastructure to ensure service levels are adhered to. To enable this, AIOps puts focus on anomaly detection. This requires the use of intelligent monitoring tools that are powered by machine learning algorithms. These tools are able to identify patterns of normal behavior, and automatically pinpoint when something abnormal happens – whether that’s from within the system or from an external source.
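To make the idea of learning "normal" and flagging departures from it concrete, here is a deliberately simple statistical baseline: a rolling z-score over recent values. It is not a machine learning model and not how any specific AIOps product works; the function name, window size, threshold, and sample latencies are assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes of values that deviate strongly from the preceding window.

    Each value is compared with the mean and standard deviation of the
    `window` values before it; a z-score above `threshold` is flagged.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale for a z-score; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
spikes = find_anomalies(latencies_ms)
```

With this sample data only the 250 ms reading is flagged. A production tool would typically also account for seasonality, trends, and correlations across many metrics.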
As changes occur over time, new vulnerabilities may show up. An AIOps-capable monitoring tool should be able to proactively spot these vulnerabilities and alert the appropriate person. Going a step further, an AIOps monitoring tool may carry out certain actions on its own, such as quarantining the infected or vulnerable part of the system. Such autonomous remediation is the pinnacle of what is possible with AIOps. In the past, this was a dream for SREs, but with a modern AIOps platform, it is now a reality.
While the tools used across the software development pipeline may vary, the broader principles of AIOps should govern each step of the software delivery process. When deploying code to production, AIOps brings visibility, greater control, and better service levels every step of the way.
To learn more about Broadcom’s approach to the SRE model, read the white paper: Unlocking the Value of the SRE Model.