Modern systems look very different from those of even a few years ago. For the most part, development organizations have moved away from building traditional monoliths and towards containerized applications running across highly distributed infrastructure. While this change has made systems more resilient, the increase in overall complexity has made it both more important and more challenging to identify and address problems at their root cause when issues occur.
Part of the solution to this challenge lies in leveraging tools and platforms that can effectively monitor the health of services and infrastructure. To that end, this post will explain how to monitor services and infrastructure using one of the most popular tools currently available – Prometheus. In addition, it will outline the reasons why Prometheus alone is not enough to monitor the complex, highly distributed system environments in use today.
What Is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit that was first developed by SoundCloud in 2012. Let’s take a look at what it can monitor, its architecture, and how it works in practice.
What Can Be Monitored with Prometheus?
Organizations use Prometheus to collect metrics regarding service and infrastructure performance. Depending upon the use case, this data may include performance markers such as CPU utilization, memory usage, request count, requests per second, exception count, and more. When leveraged effectively, this data can help organizations identify system issues in a timely manner.
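Some of these markers are derived from others. As a rough illustration, requests per second can be computed from two readings of a running request count. The sketch below assumes samples arrive as (timestamp in seconds, counter value) pairs; the function name is hypothetical, and this is a simplification rather than Prometheus's own calculation, since PromQL's rate() also handles counter resets and extrapolation.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last samples.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in
    chronological order. Simplifying assumption: the counter never resets.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (start_time, start_value), (end_time, end_value) = samples[0], samples[-1]
    if end_time <= start_time:
        raise ValueError("timestamps must increase")
    return (end_value - start_value) / (end_time - start_time)


# Hypothetical request counter sampled every 15 seconds.
request_count_samples = [(0, 100), (15, 130), (30, 190)]
requests_per_second = per_second_rate(request_count_samples)
```

Here the counter grows by 90 requests over 30 seconds, so the computed rate is 3 requests per second.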
Prometheus Server Architecture
The Prometheus server is central to the Prometheus architecture and performs the actual monitoring functions. It is made up of three major components:
Time Series Database – This component is responsible for storing metrics data. This data is stored as a time series, meaning that it is represented in the database as a series of timestamped data points belonging to the same metric and the same set of labeled dimensions.
Worker for data retrieval – This component does exactly what its name implies: it pulls metrics from “targets,” which can be applications, services, or other components of the system infrastructure, and writes them to the time series database. The data retrieval worker collects these metrics by scraping HTTP endpoints on the targets. By default, the endpoint is <hostaddress>/metrics. A target that does not expose metrics natively can be configured for use with Prometheus through an exporter. At its core, an exporter is a service that fetches metrics from the target, formats them properly, and exposes the /metrics endpoint so that the data retrieval worker can pull the data for storage in the time series database.
HTTP server – The third component of the Prometheus server is an HTTP server. This server accepts queries written in the Prometheus query language (PromQL) and returns data from the time series database. The HTTP server can be used by the Prometheus graph UI or another data visualization tool (such as Grafana) to give developers and IT personnel an interface for querying and visualizing these metrics in a useful, human-friendly format.
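To make the exporter idea more concrete, here is a minimal sketch of a service that exposes a /metrics endpoint in the Prometheus text format, using only the Python standard library. The metric names, values, host, and port are hypothetical, and a real exporter would more likely be built on an official client library; this only shows the shape of what the data retrieval worker scrapes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical samples: (metric name, label pairs) -> current value.
SAMPLES = {
    ("app_requests_total", (("method", "GET"),)): 1027,
    ("app_exceptions_total", ()): 3,
}


def render_metrics(samples):
    """Render samples as lines of `name{label="value"} value` text."""
    lines = []
    for (name, labels), value in sorted(samples.items()):
        if labels:
            label_text = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered samples at /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8000):
    """Block forever, answering scrape requests (host and port are assumptions)."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; a Prometheus server configured with this address as a target would then pull the two samples on each scrape.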
Managing Alerts from Prometheus
The Prometheus Alertmanager is also worth mentioning here. Alerting rules can be set up within the Prometheus configuration to define conditions, such as a threshold on a metric, that trigger an alert when they are met. When this happens, the Prometheus server pushes alerts to the Alertmanager. From there, the Alertmanager handles deduplicating, grouping, and routing these alerts to the proper personnel via email or another alerting integration.
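The deduplicating and grouping steps can be pictured with a toy model. The sketch below is not Alertmanager's implementation; it assumes each alert is a dictionary with a "labels" mapping, treats alerts with identical label sets as duplicates, and groups the rest by a chosen subset of labels. The alert names and label values are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Drop alerts with identical labels, then group the rest by `group_by`."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        identity = tuple(sorted(alert["labels"].items()))
        if identity in seen:
            continue  # same label set already seen: treat as a duplicate
        seen.add(identity)
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: the first two are exact duplicates.
incoming = [
    {"labels": {"alertname": "HighErrorRate", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighErrorRate", "service": "checkout", "instance": "a"}},
    {"labels": {"alertname": "HighErrorRate", "service": "checkout", "instance": "b"}},
    {"labels": {"alertname": "HighLatency", "service": "search", "instance": "c"}},
]
grouped = group_alerts(incoming)
```

With this input, the four incoming alerts collapse into two groups, so the on-call engineer for the checkout service would receive one notification covering both affected instances rather than three separate ones.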
Why Prometheus — On Its Own — Is Not Enough
As we know, modern development architectures have a much higher level of complexity than those of five to ten years ago. Today’s systems consist of a multitude of servers running containerized applications and services. These services are loosely coupled, calling one another in order to provide functionality to the end user. The complex nature of these systems can have the effect of obfuscating the causes of failures.
To address this challenge, organizations need granular insight into system behavior – and collecting and aggregating log event data is critical to this pursuit. This log data can be correlated with performance metrics, which will enable organizations to gain the insights and context necessary to drive efficient root cause analysis. While Prometheus collects metrics, it does not collect log data. Therefore, it does not provide the level of detail necessary to support effective incident response on its own.
Furthermore, Prometheus faces challenges when it is scaled significantly (a situation that is often unavoidable in the era of highly distributed modern systems). Prometheus was not originally built to query and aggregate metrics from multiple instances, and configuring it to do so adds complexity to the organization’s Prometheus deployment. This makes it harder to attain a holistic view of the entire system, which is critical to efficient incident response.
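To illustrate the extra work involved, the sketch below queries several Prometheus servers through the HTTP API's /api/v1/query endpoint and sums the results by hand. The server URLs and the PromQL expression are hypothetical, and the approach assumes each server scrapes a disjoint set of targets; if two servers scrape the same target, a naive sum like this would double count.

```python
import json
import urllib.parse
import urllib.request


def fetch_instant_vector(base_url, promql):
    """Run an instant query against one Prometheus server and return its result list."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed on {base_url}: {payload}")
    return payload["data"]["result"]


def sum_across_instances(base_urls, promql, fetch=fetch_instant_vector):
    """Sum every series value returned for `promql` by every server.

    `fetch` is injectable so the merge logic can be exercised without a network.
    """
    total = 0.0
    for base_url in base_urls:
        for series in fetch(base_url, promql):
            _timestamp, value = series["value"]  # the API returns the value as a string
            total += float(value)
    return total
```

A call such as sum_across_instances(["http://prom-a:9090", "http://prom-b:9090"], "sum(rate(app_requests_total[5m]))") would return a single system-wide figure, but the caller now owns the list of servers, error handling, and any deduplication.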
Finally, Prometheus was not built to retain metrics data for long periods of time. For organizations managing complex environments, access to this type of historical data can be invaluable. For one, organizations may want to analyze these metrics to detect patterns that occur over the course of a few months or even a year so that they can gain an understanding of system usage during a specific time period. Such insights can dictate strategies for scaling during times when systems may be pushed to their limits.
Sumo Logic + Prometheus
Prometheus is an excellent tool for gathering high-level metrics that are critical for monitoring the health of services and their infrastructure. In addition to being easy to set up and use, it has a very active developer and user community. This, combined with the fact that it is freely available as open-source software, makes Prometheus a valuable option for any organization’s automated monitoring strategy.
With that said, Prometheus is not without its limitations. It is equipped to provide organizations with insights about what is going wrong within their systems, but without combining these metrics with log data, it’s difficult to attain the holistic view that is often necessary to truly understand why. Moreover, when an implementation of Prometheus is scaled, it faces data aggregation, storage, and visibility challenges, making additional tooling a necessity in many cases.
Many of these challenges can be mitigated by using Sumo Logic in conjunction with Prometheus. Sumo Logic aggregates Prometheus data across all instances, analyzes log data alongside these metrics, and supports efficient long-term data retention. Together, these capabilities enable DevOps teams to use Prometheus effectively in complex, highly distributed environments.