If you’ve managed reliability for either a microservices or a monolithic app, you know that – as we detailed in an earlier blog post – both types of environments come with their own reliability challenges.
What can you do about those challenges? Which best practices should SREs adopt in order to simplify reliability for both microservices and monoliths? Read on for guidance.
Reliability Engineering for Monoliths
One of the key challenges that SREs face when working with monoliths is that monoliths don’t expose as much data below the surface. They accept a request and spit something back. You can’t trace the request as it flows across different parts of the application in the way that you can with microservices.
This means that SREs who manage monoliths need to rely heavily on monitoring tools that track what is happening on the “surface” of the monolith: how many requests it receives per second, how long it takes to respond to requests, what its CPU and memory utilization are, and so on.
You can only do so much with this data, but correlating it is one basic best practice. How does a spike in CPU usage relate to slower response times, for example? How do the metrics for different instances of the same monolith compare? How do metrics change when you deploy a new release?
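As a rough illustration of correlating surface metrics, the sketch below computes a Pearson correlation coefficient between two series. The sample values and the names `cpu_percent` and `latency_ms` are made up for the example; in practice the series would come from whatever monitoring tool you already use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples from one monolith instance (assumed data).
cpu_percent = [35, 40, 38, 72, 90, 88, 45]
latency_ms = [120, 130, 125, 310, 480, 450, 150]

r = pearson(cpu_percent, latency_ms)
print(f"CPU vs. latency correlation: {r:.2f}")
```

A coefficient close to 1 suggests the two metrics rise and fall together. It does not prove that one causes the other, but it tells you where to look first.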
Data like this won’t provide deep insight into what is happening in your monolith, but it’s the deepest level of visibility you can gain from the architecture you’re working with.
On that note, you might conclude that the single best practice for reliability engineering with monoliths is to migrate them to a cloud-native, microservices-based architecture. Not only are microservices more resilient in many ways than monoliths, but they are also easier to monitor and manage because you can gain more meaningful data and context about application reliability by tracing requests across the services.
Reliability Engineering for Microservices
That doesn’t mean, of course, that managing microservices environments is easy or simple for SREs. Microservices architectures are more complex. There are more moving pieces, more alerts, and more things that can go wrong due to the complex dependencies within the service architecture.
Several best practices can help SREs manage these challenges effectively when supporting microservices environments.
Identify the Most Critical Microservices
Some microservices may be more important than others. A microservice that handles authentication for all users is probably more critical than one that takes a screenshot, for example, because users being unable to log in is worse from a business perspective than users being unable to take screenshots.
As an SRE, you should know which microservices are critical from a business perspective and, when faced with multiple problems with multiple services, prioritize the highest-value services.
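One simple way to put this into practice is to record a criticality tier for each service and sort incoming alerts by it. The sketch below assumes a hypothetical `CRITICALITY` table and `Alert` type; the service names and tiers are illustrative, not a recommendation.

```python
from dataclasses import dataclass

# Hypothetical criticality tiers: a lower number means more business-critical.
CRITICALITY = {"auth": 1, "checkout": 1, "search": 2, "screenshot": 3}
DEFAULT_TIER = 3  # services missing from the table are treated as least critical


@dataclass
class Alert:
    service: str
    summary: str


def triage(alerts):
    """Return alerts ordered by service criticality, most critical first.

    sorted() is stable, so alerts in the same tier keep their arrival order.
    """
    return sorted(alerts, key=lambda a: CRITICALITY.get(a.service, DEFAULT_TIER))


alerts = [
    Alert("screenshot", "capture requests timing out"),
    Alert("auth", "login error rate rising"),
    Alert("search", "slow queries"),
]

for alert in triage(alerts):
    print(alert.service, "-", alert.summary)
```

Keeping the tiers in one shared table also gives the team a single place to debate and update what counts as critical.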
Observe, Don’t Just Monitor
Managing microservices applications requires collecting more than surface-level metrics. You need to be able to observe how requests flow across services within the application, so you need to leverage observability tools, not just monitoring tools.
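To make the idea of following a request across services concrete, here is a toy sketch in which every service records a span tagged with a shared trace ID. The `traced` helper, the `SPANS` list, and the two services are all hypothetical; a real deployment would typically rely on a tracing library such as OpenTelemetry rather than hand-rolled code like this.

```python
import time
import uuid

SPANS = []  # stand-in for a tracing backend (assumption for this sketch)


def traced(trace_id, service, operation, work):
    """Run work() and record how long this service spent on the request."""
    start = time.perf_counter()
    try:
        return work()
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "service": service,
            "operation": operation,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def auth_service(trace_id, user):
    # Hypothetical downstream service: receives the caller's trace ID.
    return traced(trace_id, "auth", "check_user", lambda: user == "alice")


def frontend_service(user):
    # Hypothetical entry point: creates the trace ID and passes it along.
    trace_id = uuid.uuid4().hex
    allowed = traced(
        trace_id, "frontend", "handle_login",
        lambda: auth_service(trace_id, user),
    )
    return trace_id, allowed


trace_id, allowed = frontend_service("alice")
for span in SPANS:
    if span["trace_id"] == trace_id:
        print(f'{span["service"]}.{span["operation"]}: {span["duration_ms"]:.3f} ms')
```

Because both spans carry the same trace ID, you can see how much of the request’s total time was spent in each service, which surface-level metrics alone cannot show.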
Observe Services, Not Instances
In a microservices environment, it’s common to deploy multiple instances for each service. Because the failure of one instance is not a big deal as long as other instances remain healthy, you should focus on monitoring overall service health, not each instance.
Otherwise, you are setting yourself up for alert fatigue and false positives.
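A minimal sketch of service-level health checking might look like the following. The function name, the `min_healthy_fraction` parameter, and its default of 0.5 are assumptions chosen for illustration; the right threshold depends on how much spare capacity each service has.

```python
def service_is_healthy(instance_health, min_healthy_fraction=0.5):
    """Judge a service by the share of healthy instances, not by any single one.

    instance_health maps an instance name to True (healthy) or False.
    """
    if not instance_health:
        return False  # no instances at all means the service is down
    healthy = sum(1 for ok in instance_health.values() if ok)
    return healthy / len(instance_health) >= min_healthy_fraction


# Hypothetical snapshot: one of three instances has failed.
auth_instances = {"auth-1": True, "auth-2": False, "auth-3": True}

if service_is_healthy(auth_instances):
    print("auth: healthy, no page needed")
else:
    print("auth: unhealthy, page the on-call SRE")
```

With this approach, a single failed instance is logged for later follow-up but does not wake anyone up, while a loss of most instances still triggers an alert.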
Work with Developers, Not Against Them
As an SRE, your priority is reliability. Developers have a different priority; they want to release application updates quickly and continuously. The best SRE is one who lets them do that without introducing regressions and security problems. Therefore, you should work closely with developers to ensure they can update and deploy new microservices rapidly without compromising reliability.
Know Your Team
Your approach to reliability depends in part on the size and skillset of your team. You typically can’t solve every problem with every microservice. Instead, you need to know which services to prioritize, and who to assign when something goes wrong. Likewise, you need to know when to bring in developers, and when SREs can handle the issue on their own.
Best practices for reliability engineering vary depending on whether you’re dealing with monolithic or microservices applications. While in some ways microservices are easier to observe and manage, the ability to correlate multiple data points, triage issues, and connect with all stakeholders in the application delivery and deployment pipeline is critical for operating any type of application.