The role of APM and distributed tracing in observability

Application performance management (APM) and distributed tracing are practices that many teams have used for years to detect and mitigate performance issues within applications. APM was born in the era of big, single-host monoliths. Distributed tracing is especially useful for applications built on a microservices architecture, where it is critical for pinpointing the source of performance issues.

Yet, even if APM and tracing provide value on their own, organizations today are doubling down on these practices by incorporating them into a broader strategy of observability.

This article defines APM and tracing, then explains what they have to do with observability.

What is APM?

Application performance management, sometimes also called “application performance monitoring,” is the practice of detecting and diagnosing performance issues within an application.

A typical APM workflow involves collecting data from applications (such as request rates and latency times), looking for anomalous patterns and generating alerts based on the anomalies. Engineers then drill into the data to determine the source of the performance issue (and to ascertain whether it is indeed a performance problem and not a mere fluctuation in performance trends).

In other words, the goal of APM is to detect performance problems, then diagnose their source.
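
The collect, detect and alert loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_latency_anomalies`, the rolling window of five samples and the three-standard-deviation threshold are hypothetical choices, not settings taken from any particular APM product.

```python
from statistics import mean, stdev

def find_latency_anomalies(latencies_ms, window=5, threshold=3.0):
    """Return (index, value) pairs whose latency sits more than `threshold`
    standard deviations above the mean of the preceding `window` samples.

    Hypothetical helper for illustration; real APM tools use richer models.
    """
    anomalies = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Skip a perfectly flat baseline to avoid dividing by zero.
        if sigma > 0 and (latencies_ms[i] - mu) / sigma > threshold:
            anomalies.append((i, latencies_ms[i]))
    return anomalies

# Made-up latency samples (milliseconds) with one obvious spike.
samples = [102, 98, 101, 99, 100, 103, 97, 450, 101, 99]
alerts = find_latency_anomalies(samples)
for index, value in alerts:
    print(f"ALERT: sample {index} has anomalous latency of {value} ms")
```

An alert like this tells engineers that something is wrong, but not where. That second step, diagnosing the source, is where they would drill into the data.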

What is tracing?

Tracing is the process of tracking transactions within an application as different parts of the application respond to them.

In complex applications, multiple parts of the application are typically involved in each transaction. A user might submit a request on the application frontend to start the transaction, for example, which in turn triggers a request to the database to retrieve data needed to handle the transaction. From there, the application processes the data and sends the result back to the user on the frontend.

By tracing transactions as they move between the various parts of an application, teams can gain granular visibility into application performance. APM alone can tell you that application latency is high. Tracing lets you pinpoint which part of the application (the frontend, the database, the business logic, or something else) is the weak link causing the latency problem. Most modern APM products use tracing data behind the scenes to connect the dots and surface causal relationships and dependencies, but they rarely offer the ability to inspect each transaction in detail. From that point of view, distributed tracing is a building block for APM, while APM is better understood as a use case built on that data.

APM tells you application latency is high. Tracing, shown in the lower image as part of observability, lets you pinpoint which part of the application is the weak link that is causing the latency problem.
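
To make the "weak link" idea concrete, here is a minimal sketch of a trace modeled as a list of spans, one per part of the application that handled the transaction. The `Span` class, its field names and the sample timings are assumptions made for illustration; they do not follow the data model of any specific tracing tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One unit of work in a trace (hypothetical, simplified model)."""
    span_id: str
    parent_id: Optional[str]
    component: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

def self_times(spans):
    """Map each span_id to its duration minus time spent in direct children.

    Assumes children of the same parent do not overlap in time.
    """
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id is not None and s.parent_id in result:
            result[s.parent_id] -= s.duration_ms
    return result

def slowest_component(spans):
    """Return (component, self_time_ms) for the span doing the most work itself."""
    times = self_times(spans)
    worst = max(spans, key=lambda s: times[s.span_id])
    return worst.component, times[worst.span_id]

# A made-up transaction: frontend -> business logic -> database.
trace = [
    Span("a", None, "frontend", 0, 500),
    Span("b", "a", "business-logic", 20, 480),
    Span("c", "b", "database", 40, 440),
]
print(slowest_component(trace))
```

The whole request takes 500 ms, but subtracting child time from each span shows that most of it is spent in the database span rather than in the frontend or the business logic.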

APM and tracing in isolation

Conventionally, teams used APM and tracing in isolation. APM tools alerted them to performance issues. Then, when necessary, they researched the problem by looking into detailed tracing data, which typically came from the APM tool itself but often from dedicated tracing tools as well.

Tracing isn’t the only way to gain context for APM. You can also look at logs or try to filter application metrics in such a way that you gain deeper insight into the root cause of a problem – a strategy that may or may not work well, depending on which type of application you are managing and how it is designed. It may be possible to look at metrics from individual containers within a microservices application to pinpoint the container that is causing a performance issue, for example. On the other hand, it may be impossible to determine, based on that approach alone, whether issues with individual containers are caused by the application code running inside those containers or an external problem like an orchestrator misconfiguration.

For this reason, tracing is typically the most effective way to gain context into APM issues in modern distributed applications, albeit not the only way.
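
The metric-filtering strategy mentioned above, looking at individual containers to find the one that misbehaves, might look like the following sketch. The function name `slow_containers`, the container IDs and the "twice the fleet median" rule are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def slow_containers(samples, factor=2.0):
    """Given (container_id, latency_ms) pairs, return the IDs of containers
    whose median latency exceeds `factor` times the median across containers.

    Hypothetical heuristic for illustration only.
    """
    by_container = defaultdict(list)
    for container_id, latency_ms in samples:
        by_container[container_id].append(latency_ms)
    medians = {cid: median(vals) for cid, vals in by_container.items()}
    fleet_median = median(medians.values())
    return sorted(cid for cid, m in medians.items() if m > factor * fleet_median)

# Made-up samples from three replicas of the same microservice.
samples = [
    ("cart-1", 40), ("cart-1", 42), ("cart-1", 41),
    ("cart-2", 39), ("cart-2", 43), ("cart-2", 40),
    ("cart-3", 300), ("cart-3", 320), ("cart-3", 310),
]
print(slow_containers(samples))
```

Note the limitation the text describes: this narrows the problem to one container, but it cannot say whether the cause is the code running inside it or something external, such as an orchestrator misconfiguration.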

APM, tracing and observability

Today, many teams are no longer settling for mere APM. They want observability.

The relationship between APM and observability is complicated and nuanced. In general, however, it can be summed up by saying that whereas APM focuses on well-known problem patterns and application architectures, observability aims to provide full visibility into the source of problems in any type of environment, by providing the means to analyze telemetry from modern applications regardless of their architecture. It does that mainly by correlating a variety of data points with each other to more easily determine the root cause of a performance issue or, at a minimum, to point engineers toward the most likely root cause so that they can make a manual determination. By extension, observability aims to enable a faster response with less manual effort.

There is a longer definition of observability that has to do with control theory and using external outputs to assess the internal state of a system. But it’s not necessary to get into that level of detail to understand what APM has to do with observability.

As for tracing, we have already seen that it supplies data for APM use cases, and it is also regarded as one of the so-called “pillars of observability.” The other main data sources for observability are logs and metrics, although arguably, teams should also leverage data sources like CI/CD pipeline events to gain as much context as possible when building observability solutions.
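
Correlating data points across traces, logs and metrics can be illustrated with a small sketch. Everything here is an assumption made for the example: the record types, the idea that log lines carry a `trace_id`, and the 30-second window used to match metrics from the same host.

```python
from dataclasses import dataclass

@dataclass
class TraceSummary:
    trace_id: str
    host: str
    timestamp: float      # seconds since epoch
    duration_ms: float

@dataclass
class LogLine:
    trace_id: str
    timestamp: float
    message: str

@dataclass
class MetricPoint:
    host: str
    timestamp: float
    name: str
    value: float

def correlate(trace, logs, metrics, window_s=30.0):
    """Gather the logs and metrics most likely related to one slow trace.

    Logs are matched on trace_id; metrics are matched on host and on being
    within `window_s` seconds of the trace. Hypothetical matching rules.
    """
    related_logs = [l for l in logs if l.trace_id == trace.trace_id]
    related_metrics = [
        m for m in metrics
        if m.host == trace.host and abs(m.timestamp - trace.timestamp) <= window_s
    ]
    return {"trace": trace, "logs": related_logs, "metrics": related_metrics}

# Made-up telemetry for one slow request.
slow = TraceSummary("t-42", "node-7", 1000.0, 2300.0)
logs = [
    LogLine("t-42", 1000.4, "connection pool exhausted"),
    LogLine("t-99", 1000.5, "request ok"),
]
metrics = [
    MetricPoint("node-7", 995.0, "cpu_percent", 97.0),
    MetricPoint("node-7", 1200.0, "cpu_percent", 35.0),
    MetricPoint("node-3", 1001.0, "cpu_percent", 20.0),
]
context = correlate(slow, logs, metrics)
print([l.message for l in context["logs"]])
print([(m.name, m.value) for m in context["metrics"]])
```

Joined this way, a slow trace arrives together with the log line and the CPU reading recorded around the same time on the same host, which is the kind of context that points an engineer toward the most likely root cause.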

Turning APM on its head

To put the relationship between APM, tracing, and observability another way, you could say that observability, by leveraging the core APM signal in the form of distributed tracing, offers a next-generation APM without the traditional limitation of covering only known unknowns.

Whereas APM automation in the past ended with detecting and diagnosing problems at an aggregate level, which works well for known, typical setups, APM in the context of observability has become a means toward a larger end. That larger end is automatically diagnosing complex performance issues, which traditional APM tools cannot do because they lack detailed and comprehensive data analysis.

Likewise, although tracing provides enough detail to troubleshoot any complex application architecture, it cannot be used in isolation, because it lacks the aggregate view and infrastructure context typically gained by analyzing metrics and logs. Tracing today is important, but it is only one of several sources required to gain observability into complex systems.

The rise of complex systems

On that note, we’ve mentioned complexity several times in this article. Let’s look at that topic in a little more detail to explain why most teams should not use APM and tracing in isolation and instead need observability.

In the past, application architectures and environments were relatively simple. Applications were monoliths or (at their most complex) used a service-oriented architecture that broke them up into a few big chunks. Each application was typically deployed on just one server, and automated orchestration was minimal if it existed at all.

Today’s applications, in contrast, are typically broken into a dozen or more microservices. The microservices run in containers that are automatically deployed across a large cluster of servers using a tool like Kubernetes. Moreover, application teams are free to choose their own designs, language frameworks, network layers and so on, which means that almost every modern, distributed, cloud-native application looks and behaves differently.

In this type of environment, identifying the root cause of performance issues has become significantly more difficult. If a monolithic application slows down or crashes, there are two possible culprits: either the application itself failed, in which case you can debug it to figure out where things went wrong, or the server hosting it had a problem. (The network could also be the cause, but that’s typically easy enough to diagnose by tracking basic network performance data.)

With complex, distributed applications, there are many more variables to contend with when troubleshooting performance issues. The root cause of an application failure could lie within a host server, an orchestrator, a container, an overlay network, or a load balancer (and so on).

Making matters more complicated, problems sometimes aren’t caused by any one component, but rather by how different components interact. Containers hosting different microservices might not be able to exchange data properly due to an RBAC configuration problem, for example.

Evolving from APM to observability is essential for thriving in the face of complexities like these. With APM alone, it is very difficult to take full advantage of the signal types that modern applications generate, such as logs, infrastructure metrics and transaction traces. But when you apply observability patterns and run analytics on raw data coming from all layers of your application stack, you can identify root causes much faster.

APM and tracing as building blocks of full observability

In short, although APM and tracing were once sufficient on their own as the foundation of application performance optimization, teams responsible for managing complex applications need more than these types of tools can provide. They need observability.

And while APM and tracing help achieve observability, they are only some of the ingredients required for complete observability. Observability also requires data sources that go beyond those traditionally associated with APM, and it demands the ability to correlate complex datasets efficiently in order to gain automated insight into the root cause of performance problems.

Interested in learning more? Read The Essential Guide to Observability to understand the basics and how it helps all kinds of teams deliver ongoing service reliability.

http://www.fixate.io

Chris Tozzi has worked as a journalist and Linux systems administrator. He has particular interests in open source, agile infrastructure and networking. He is Senior Editor of content and a DevOps Analyst at Fixate IO.

