Simplicity is dead – accept it. I know it feels wrong, but the only area you can simplify is the customer’s experience when they interact with your organization. Your life in technology will never be as simple as that first “hello world” Python program.
Modern development and deployment best practices, like microservices on cloud-native infrastructure, are far more complex than any previous iteration of application architecture. CI/CD is faster, and adding new features to applications requires less custom code than ever before. But all that development speed has come at a cost: it often feels like glue and duct tape are holding things together.
Chaos engineering is based on the acceptance of this complexity as the new baseline. Chaos engineering is figuring out how to make the overall system as fault-tolerant and performant as possible without needing to know every in and out of every application component. There are just too many moving parts managed by too many disparate teams.
What is chaos engineering exactly?
Having a general feel for what chaos engineering is leads to a couple of questions, like “What exactly is chaos engineering in practical terms?” and “Which specific roles should be encouraged to practice chaos-fu?”
The first company to publicly embrace chaos engineering was Netflix, with its infamous Chaos Monkey project, which eventually became part of the overall Simian Army. The Simian Army has an active user base, and there are now companies (like Gremlin) that specialize in helping other organizations embrace the chaos. Netflix even published the original Principles of Chaos documentation.
The single driving principle that needs to be embraced and executed for chaos engineering is the following:
“Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
That’s right: experimenting in any and all environments, including production. This isn’t the dark side of the Force; this is the light side. Remember, only the Sith deal in absolutes, and chaos engineering embraces the complete opposite of absolutes by accepting that an infinite number of things can and will go wrong.
There is an old expression, “May you live in interesting times,” that very much applies to chaos engineering.
Who needs to embrace chaos engineering?
Everyone who knows anything about the system being experimented on needs to embrace chaos engineering. There are no exceptions. Not everyone will be engaged all the time, as the site reliability engineering (SRE) team will likely be running the experiments. But everyone needs to be aware that incidents may occur and will likely be highly prioritized.
Practical steps for chaos engineering
Step zero is the understanding that chaos engineering is not a 24/7 activity. Chaos engineering is best done during peak business hours, when lots of infrastructure and development staff are online and available for incident response and management. The idea is that problems are found when the right people are around to fix them, rather than causing an outage at 10:52 PM on a Sunday night when millions of people are streaming a live sports event and it’s a tie game with two minutes left.
Before introducing any chaos into the target infrastructure, the first actual step is to define a steady state for the infrastructure that will be experimented on. This will include instrumenting components with application performance management (APM) tools like SignalFx, infrastructure monitoring tools like Prometheus and Grafana, and log consolidation and analysis tools like Splunk Enterprise.
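To make the idea of a steady state concrete, here is a minimal sketch in Python. It assumes you have already pulled recent metric samples out of your monitoring tools. The metric names, the bounds and the SteadyState class are all hypothetical and only illustrate the shape of the check. They are not part of any of the tools named above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class SteadyState:
    """Acceptable bounds for one metric, derived from normal operation."""
    metric: str
    lower: float
    upper: float

    def holds(self, samples: Sequence[float]) -> bool:
        # The steady state holds when the average of recent samples is in bounds.
        return self.lower <= mean(samples) <= self.upper


def check_steady_state(states: Sequence[SteadyState],
                       observations: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of metrics whose recent samples violate their bounds."""
    return [s.metric for s in states if not s.holds(observations[s.metric])]


# Hypothetical baselines; real values would come from your own monitoring data.
baseline = [
    SteadyState("p99_latency_ms", lower=0.0, upper=250.0),
    SteadyState("error_rate", lower=0.0, upper=0.01),
]
observed = {
    "p99_latency_ms": [180.0, 210.0, 195.0],
    "error_rate": [0.002, 0.004, 0.003],
}

violations = check_steady_state(baseline, observed)
print(violations or "steady state holds")
```

Running the same check before, during and after an experiment gives a simple yes-or-no answer to whether the system is still behaving normally.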
Now, when introducing chaos, whether it’s manually done or handled by a tool like Chaos Monkey, the idea is to cause actual real-world failures at a reasonable rate – looking for single points of failure and performance bottlenecks. Going through previous incidents and looking for trends and types of incidents to cause is a solid place to start.
The typical activities start with taking random servers offline, killing application instances, or introducing failures and lag at the network layer. The idea here is that, while none of these should individually cause an outage, various combinations will produce unknown results. The more random the combination of intended outages, the closer the practice comes to true chaos engineering and the more it proves its value.
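Picking a random combination of failures can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Chaos Monkey or any other tool works internally: the fault names, the target names and the plan_experiment function are all hypothetical, and the sketch only prints what it would do instead of injecting anything.

```python
import random

# Hypothetical fault catalogue; each entry stands in for a real injection action.
FAULTS = [
    "terminate_server",
    "kill_app_instance",
    "add_network_latency",
    "drop_network_packets",
]


def plan_experiment(targets, faults=FAULTS, max_faults=2, seed=None):
    """Pick a random combination of (fault, target) pairs for one experiment."""
    rng = random.Random(seed)  # a seed makes a run reproducible for later review
    count = rng.randint(1, min(max_faults, len(faults)))
    chosen_faults = rng.sample(faults, count)  # distinct faults, no repeats
    return [(fault, rng.choice(targets)) for fault in chosen_faults]


plan = plan_experiment(["web-1", "web-2", "db-1"], seed=42)
for fault, target in plan:
    # A real runner would call the injection tooling here; this only prints.
    print(f"would inject {fault} on {target}")
```

Keeping max_faults small is one way to hold the failure rate at a reasonable level while still exercising combinations rather than single faults.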
All activities should be performed on an experimental group, with a control group left as a comparison point. Over time, the control group can get smaller and smaller. But there always needs to be a steady state to compare against, especially in cases like Chaos Gorilla (which simulates an entire availability zone, effectively a datacenter, going offline), where the experimental group may freak out and start producing errors that haven’t been seen before.
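The comparison between the two groups can be as simple as checking whether the experimental group's error rate drifts too far from the control group's. The Python sketch below assumes each group reports a count of errors and requests. The function names and the tolerance value are hypothetical and would need tuning for a real system.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def deviates(control, experimental, tolerance=0.005):
    """True when the experimental error rate exceeds control by more than tolerance.

    Each group is an (errors, requests) pair; the tolerance is an assumed value.
    """
    return error_rate(*experimental) - error_rate(*control) > tolerance


control_group = (12, 10_000)       # (errors, requests) with no chaos injected
experimental_group = (95, 10_000)  # (errors, requests) with chaos injected

if deviates(control_group, experimental_group):
    print("experimental group deviates from control: investigate or halt")
else:
    print("experimental group is within tolerance of control")
```

A check like this is also a natural place to hang an automatic stop condition, so an experiment is halted as soon as the two groups diverge.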
Chaos engineering and SRE drive proactive service resilience
Controlled experimentation has a solid history of providing some of the best advances society has ever made. Yes, describing it as chaos engineering is a bit hyperbolic. But, it effectively expresses the idea that results are unknown. At the end of the day, the more successful chaos engineering is in your organization, the more confidence everyone will have in the holistic system – leading to fewer unplanned outages and faster incident response.
Now take the basic and advanced Principles of Chaos to heart and go forth to break things when the sun is up so you can sleep in peace when the moon is out.