Evolving APM to Keep Pace with Cloud-Native Challenges


· · ·

At this point, Application Performance Management, or APM, is a very well-established discipline. It has been around since the early 2000s, when solutions like Wily first appeared.

The fundamentals of APM haven’t changed much since then. Yet the types of workloads that APM tools must manage have evolved significantly.

Most recently, the IT industry has undergone a turn toward cloud-native computing, which has introduced a number of new challenges that APM solutions must address in order to remain relevant.

Here’s a look at how to make sure your APM tools and strategy remain viable in the cloud-native age.

What APM Does

When I say that APM hasn’t changed much in two decades, I mean that the fundamentals of how APM functions have mostly stayed the same. APM tools work by collecting data, looking for anomalies and sending alerts when something seems off.

To be sure, the types and volume of data that APM tools work with have changed tremendously over the years. Nagios, for instance, began as a simple tool that pinged servers and waited for a response to determine whether they were down. Today's APM tools have much more sophisticated ways of determining when something is wrong than simply pinging hosts; they check a variety of other data points.

Still, by and large, the fundamental approach to APM remains the same. You collect data, analyze it and react when something seems wrong.
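That collect-analyze-react loop can be sketched in a few lines. This is purely illustrative: the fixed latency threshold, the sample values and the "alert channel" are placeholders, not any real APM tool's API.

```python
# Minimal sketch of the core APM loop: collect, analyze, alert.
# All names and thresholds here are illustrative placeholders.

def collect(samples):
    """Pretend collector: returns latency samples in milliseconds."""
    return samples

def analyze(samples, threshold_ms=500):
    """Flag any sample that exceeds a fixed latency threshold."""
    return [s for s in samples if s > threshold_ms]

def alert(anomalies):
    """Placeholder alert channel: returns messages instead of paging anyone."""
    return [f"latency spike: {a} ms" for a in anomalies]

latencies = collect([120, 95, 640, 110, 870])
print(alert(analyze(latencies)))  # the 640 ms and 870 ms samples are flagged
```

Everything that follows in this article is, in effect, about how much harder each of those three steps becomes in a cloud-native environment.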

APM for a Cloud-Native World

What has changed tremendously over the past few years, however, is the nature of applications and the environments that host them. Thanks especially to the rise of so-called cloud-native computing – defined by traits like loosely coupled systems, service-based applications and modular architectures – the typical application today looks very different than it did just a decade ago.

Let’s take a look at some of the most significant of these changes, and their impact on APM.

More data

For one, application environments now produce more data than ever. You have not just one log per application and server, but in many cases multiple logs: one for each microservice in your application, one or more for your orchestration engine, one for each server in your cluster, and so on. And that’s only log data!

For APM tools, this means not just that the sheer volume of data that must be collected and analyzed has increased, but also that the data types and formats are more diverse. Modern APM tools can't count on data to be neatly formatted and centrally aggregated for them. They must instead be able to collect and parse information, no matter where it is born or how it is structured.
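As a rough sketch of what "parse information however it is structured" means in practice, a collector might normalize both JSON-structured and plain-text log lines into one record shape. The formats and field names below are made up for illustration; real microservices and orchestrators each have their own conventions.

```python
# Sketch: normalizing heterogeneous log lines into one record shape.
# The input formats and field names are illustrative, not from any real tool.
import json
import re

def parse_line(line):
    """Accept either a JSON-structured line or a plain 'LEVEL service: message' line."""
    line = line.strip()
    if line.startswith("{"):
        record = json.loads(line)
        return {"service": record.get("svc", "unknown"),
                "level": record.get("level", "info"),
                "message": record.get("msg", "")}
    # Fallback for unstructured lines of the form "LEVEL service: message"
    m = re.match(r"(\w+)\s+([\w-]+):\s+(.*)", line)
    if m:
        level, service, message = m.groups()
        return {"service": service, "level": level.lower(), "message": message}
    return {"service": "unknown", "level": "info", "message": line}

print(parse_line('{"svc": "checkout", "level": "error", "msg": "timeout"}'))
print(parse_line("WARN payments-api: retry limit reached"))
```

Once every source is reduced to a common record shape, the analysis stage can treat a microservice log, an orchestrator log and a node log uniformly.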

Ephemeral data

Adding to this challenge is the fact that data in cloud-native environments tends to be ephemeral, meaning it is not stored persistently by default. The data inside containers is lost permanently when the containers stop running, unless it is stored somewhere else first. Data within individual server environments may be destroyed, too, if you treat them as “cattle,” starting and stopping them as scalability requirements dictate, with little regard for what is hosted on them.

APM tools, therefore, need to be able to ingest data in real time, from wherever it lives. If they wait to collect data every hour, or even every ten minutes, important information might be lost forever. In addition, the tools must be able to deliver historical insight to help engineers research past events, even if the data associated with those events no longer exists on the production systems.
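The difference between batch and streaming collection can be illustrated with a simulated container: if each line is shipped to durable storage the moment it is produced, nothing is lost when the container exits. This is only a sketch; a real collector would tail a container's stdout or a logging driver, not an in-memory list.

```python
# Sketch: streaming ingestion of ephemeral container output.
# The "container" is a plain list standing in for a container's stdout.

durable_store = []

def stream_collect(lines):
    """Ship every line to durable storage the moment it is produced."""
    for line in lines:
        durable_store.append(line)  # persisted before the container can exit

container_lines = ["GET /cart 200 12ms", "GET /cart 500 redis timeout"]
stream_collect(container_lines)
# After the container exits its stdout is gone, but durable_store still
# holds both lines, so engineers can research the event later.
```

A collector that polled every ten minutes instead would simply find nothing if the container had already been destroyed and replaced in the interim.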

Complex dependencies

Unlike traditional environments in which you had a handful of monolithic applications running on a single server, cloud-native environments typically involve dozens of moving pieces that relate to each other in complex ways. Each application could be composed of multiple microservices that are constantly exchanging data with each other. At the same time, orchestration tools automatically move workloads around within a cluster, meaning that there is no consistent relationship between the application and the underlying host hardware: It’s always changing.

To work effectively in these environments, APM tools must be able to map complex, highly dynamic relationships. A service-level problem such as a slow-to-respond application could be caused by any number of underlying issues: a hardware failure, an API error, a problem with the way two or more microservices interact with each other, or something else altogether. To determine the root cause of the issue quickly, APM tools have to be able to understand the complexities of the relationships between all these components.
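At its simplest, that kind of root-cause reasoning is a walk over a dependency graph: start at the symptomatic service and look for the deepest unhealthy dependency. The topology and health states below are invented for illustration; real APM tools build these maps automatically from traces and topology discovery.

```python
# Sketch: walking a service dependency graph to find candidate root causes.
# The topology and health data are made up for illustration.
from collections import deque

deps = {
    "web": ["auth", "catalog"],
    "catalog": ["db", "cache"],
    "auth": ["db"],
    "db": [],
    "cache": [],
}
health = {"web": "slow", "auth": "ok", "catalog": "slow",
          "db": "ok", "cache": "failing"}

def root_causes(start):
    """Breadth-first search: report unhealthy services whose own
    dependencies are all healthy -- the likely origin of the symptom."""
    causes, seen, queue = [], {start}, deque([start])
    while queue:
        svc = queue.popleft()
        bad_children = [d for d in deps[svc] if health[d] != "ok"]
        if health[svc] != "ok" and not bad_children:
            causes.append(svc)  # unhealthy, with no unhealthy dependency below it
        for d in deps[svc]:
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return causes

print(root_causes("web"))  # the failing cache, not the slow web front end
```

The point of the sketch is that "web is slow" and "catalog is slow" are symptoms; only by following the edges does the failing cache emerge as the likely cause. In a real cluster those edges also change constantly, which is what makes the mapping hard.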

Multi-layered systems

Along similar lines, cloud-native environments often consist of multi-layered hardware and software stacks. Again, gone are the days when you had a single application running on a single server. Today, you typically have a cluster of underlying bare-metal servers. On top of those could be virtual machines, which in turn host a series of containers, each of which runs one microservice within your application. The container orchestration tool represents another layer in the stack, as do the software-defined networking and storage systems that connect all these moving parts together.

Here again, APM tools need to be able to understand this complexity and recognize how the various components fit together, in order to identify problems. It's far from sufficient merely to know whether the physical servers are healthy, or whether an application deployment has succeeded. You need to monitor every layer continuously.
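Conceptually, "monitor every layer" means the health model is itself layered. The layer names and component statuses below are illustrative placeholders, but they show why a green hardware layer says nothing about the VM or container layers above it.

```python
# Sketch: aggregating health across the layers of a cloud-native stack.
# Layer names and component statuses are illustrative placeholders.
layers = {
    "hardware": {"node-1": "ok", "node-2": "ok"},
    "vm": {"vm-a": "ok", "vm-b": "degraded"},
    "container": {"svc-cart-7f": "ok"},
    "orchestrator": {"scheduler": "ok"},
    "network": {"overlay": "ok"},
}

def unhealthy_layers(stack):
    """Return each layer that has at least one non-ok component,
    along with the components in question."""
    return {layer: {c, } and {c: s for c, s in comps.items() if s != "ok"}
            for layer, comps in stack.items()
            if any(s != "ok" for s in comps.values())}

def unhealthy_layers(stack):
    """Return each layer that has at least one non-ok component."""
    return {layer: {c: s for c, s in comps.items() if s != "ok"}
            for layer, comps in stack.items()
            if any(s != "ok" for s in comps.values())}

print(unhealthy_layers(layers))  # only the vm layer, via vm-b, is unhealthy
```

In this example the hardware checks all pass, yet the stack is still degraded one layer up, which is exactly the blind spot a single-layer monitoring view creates.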

Rapid change

A final challenge of cloud-native APM is the rapid pace of change. We live in the era of continuous delivery; some organizations deploy application updates hundreds of times per day. At the same time, the automatic load-balancing and provisioning performed by orchestrators means that workloads are constantly moving around or being adjusted across host infrastructure.

For APM tools, this rapid and constant change means that you can't check in once per hour and have confidence that things are OK. Equally important, it means that there is no such thing as a "normal" baseline against which you can check for anomalies. The fact that the number of containers you have running, or the mapping of containers to nodes, looked one way yesterday but another way today may or may not signal an actual problem. Cloud-native APM tools must be able to discern the difference between "natural" change and true anomalies.
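One common way to cope with a moving "normal" is to make the baseline itself rolling, so it tracks recent behavior rather than a fixed snapshot. The sketch below uses a simple rolling z-score; the window size, threshold and sample values are arbitrary choices for illustration, and real tools use far more sophisticated models.

```python
# Sketch: anomaly detection against a rolling baseline, so "normal"
# tracks recent behavior. Window and threshold are arbitrary choices.
from collections import deque
import statistics

def make_detector(window=10, z_threshold=3.0):
    history = deque(maxlen=window)
    def observe(value):
        """Flag the value if it sits far outside the recent window."""
        anomalous = False
        if len(history) >= 3:
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history) or 1e-9
            anomalous = abs(value - mean) / stdev > z_threshold
        history.append(value)
        return anomalous
    return observe

detect = make_detector()
readings = [100, 102, 98, 101, 99, 300, 100]
flags = [detect(r) for r in readings]
print(flags)  # only the 300 ms spike is flagged
```

Because the window keeps sliding, a gradual shift in workload re-centers the baseline on its own, while an abrupt spike still stands out, which is a crude version of distinguishing "natural" change from a true anomaly.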

How To Succeed with Cloud-Native APM

You don’t need to change the fundamentals of your APM strategy in order to thrive in a cloud-native world. The essentials remain the same: You collect data and look for anomalies.

You do, however, need to take advantage of next-generation APM functionality that enables you to collect and analyze data in much more sophisticated ways. The ability to work with any type of data from any location is one requirement for cloud-native APM. Another is the ability to use AIOps techniques to interpret the vast volumes of rapidly changing data – a task that human engineers just can't feasibly manage at scale, given the deep complexities of modern environments.

Indeed, AIOps spells the difference between APM tools that can keep pace with cloud-native monitoring requirements and those that can't. AIOps is still founded on the core pillars of APM – collecting data and analyzing it – but it introduces a level of sophistication, automation and scalability that APM tools have traditionally lacked, and which is absolutely critical for cloud-native monitoring.


To learn about how Broadcom’s solution can help you monitor cloud-native applications, visit www.broadcom.com/info/aiops/kubernetes-monitoring


Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure and networking. He is Senior Editor of content and a DevOps Analyst at Fixate IO.

