Kubernetes doesn’t come with built-in monitoring capabilities. Instead, it relies on the open source ecosystem to build tools for vital operational tasks like monitoring. The Cloud Native Computing Foundation (CNCF) is the governing body that oversees the development of such projects. Two of the most successful CNCF projects to date are Prometheus and Istio: Prometheus handles monitoring, and Istio manages network communication in the form of a service mesh. In this post, we look at how these two tools can be used together to better manage and monitor applications and services running on Kubernetes. Specifically, we dive into how Prometheus can be used to gain visibility into Istio’s metrics.
Prometheus – Monitoring for K8s
Prometheus is a time-series monitoring tool. It scrapes metrics from Kubernetes components and workloads via HTTP endpoints. Prometheus is not a visualization tool; it ships with only a basic built-in interface called the Expression Browser. What it does really well is ingest data in a time-series format and make that data available for processing and analysis.
Prometheus’ biggest strengths lie in scaling to large volumes of data, and in its ability to analyze that data using its own query language, PromQL. In a system like Kubernetes, where large volumes of data are churned out continuously, it takes a robust, near-real-time monitoring tool like Prometheus to stay on top of performance at every moment.
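To make the scrape step more concrete, here is a minimal Python sketch that parses a small subset of the text format a scrape target returns over HTTP. The payload and the `parse_exposition` helper are illustrative assumptions; the metric and label names follow Istio’s standard metrics, and a real scraper handles more cases (timestamps, full escaping rules, histograms).

```python
import re

# One sample line looks like: metric_name{label="value",...} 42
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse a simplified subset of the Prometheus text exposition format.

    Returns a list of (metric_name, labels_dict, value) tuples.
    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this sketch does not understand
        name, label_text, value = match.groups()
        labels = dict(LABEL_PAIR.findall(label_text or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, shaped like what a scrape target might expose.
PAYLOAD = """\
# HELP istio_requests_total Total requests handled by the proxy.
# TYPE istio_requests_total counter
istio_requests_total{destination_service="reviews",response_code="200"} 1027
istio_requests_total{destination_service="reviews",response_code="503"} 3
"""

for name, labels, value in parse_exposition(PAYLOAD):
    print(name, labels, value)
```

Each parsed line becomes one sample of a time series; Prometheus attaches the scrape timestamp and stores the result.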
Prometheus gives you a multi-dimensional view of data. This means the same base metric can be sliced along several dimensions. Each time series is identified by a metric name plus a set of key-value labels, and each label adds a different context, such as the service that handled a request or the response code it returned. For visualization, Prometheus integrates with Grafana, a mature open-source visualization tool. While Prometheus monitors all parts of Kubernetes, we focus here on how it enables better Istio monitoring.
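The idea of viewing one metric along different dimensions can be sketched in a few lines of Python. The samples and the `sum_by` helper below are hypothetical; the helper only mimics, in a simplified way, what PromQL’s `sum by (label)` aggregation does on the server.

```python
from collections import defaultdict

# Hypothetical samples of a single metric: (labels, value).
SAMPLES = [
    ({"destination_service": "reviews", "response_code": "200"}, 90.0),
    ({"destination_service": "reviews", "response_code": "503"}, 10.0),
    ({"destination_service": "ratings", "response_code": "200"}, 40.0),
]


def sum_by(samples, label):
    """Sum sample values grouped by one label, similar to `sum by (label)`."""
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# The same three samples, viewed per service and then per response code.
print(sum_by(SAMPLES, "destination_service"))  # {'reviews': 100.0, 'ratings': 40.0}
print(sum_by(SAMPLES, "response_code"))        # {'200': 130.0, '503': 10.0}
```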
Istio – Service Mesh for K8s
Istio is one of the most widely used service meshes in the Kubernetes ecosystem. Istio improves observability for microservices, focusing particularly on the networking between services. Istio simplifies network communication by separating the data plane from the control plane.
Envoy is the central component of Istio’s data plane. Envoy sidecar proxies are attached to every service, and these proxies handle all network communication between services. They can be configured with rules for handling outgoing and incoming requests to every service. Apart from managing communication, the sidecar proxies also collect metrics on requests and services. In Istio releases that include Mixer, the proxies send these metrics to Mixer, where Istio aggregates telemetry and makes it available to external tools for further analysis; Mixer was deprecated in Istio 1.5, and newer releases expose the metrics directly from the Envoy proxies. Either way, Prometheus works on a pull model: it proactively scrapes the metrics rather than listening for updates.
Metrics Generated in Istio
As mentioned earlier, Istio reports two kinds of metrics: request-level and service-level metrics. While requests are the foundation, request metrics can be combined to gain visibility into service performance. Here are the basic request metrics generated in Istio that can be analyzed using Prometheus:
- Total request count
- Request size
- Requests per second
- Request completion rate
- Failed requests
- Error rate
- Request timeout
- Retries
- Requests by service
- Response size
- TCP bytes sent and received
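As a rough illustration, several of the request metrics above map onto PromQL expressions like the ones in this Python sketch. The metric names (`istio_requests_total`, `istio_tcp_sent_bytes_total`) and labels follow Istio’s standard metrics, but they are worth confirming against the Istio version you run. The `QUERY_TEMPLATES` table and `build_query` helper are hypothetical, and treating 5xx responses as errors is an assumption.

```python
# PromQL templates for a few of the listed request metrics.
# Assumption: an "error" is any response whose code matches 5xx.
QUERY_TEMPLATES = {
    "requests_per_second": "sum(rate(istio_requests_total[{window}]))",
    "error_rate": (
        'sum(rate(istio_requests_total{{response_code=~"5.."}}[{window}]))'
        " / sum(rate(istio_requests_total[{window}]))"
    ),
    "requests_by_service": (
        "sum by (destination_service) (rate(istio_requests_total[{window}]))"
    ),
    "tcp_bytes_sent_per_second": "sum(rate(istio_tcp_sent_bytes_total[{window}]))",
}


def build_query(name, window="5m"):
    """Fill the rate window into one of the query templates."""
    return QUERY_TEMPLATES[name].format(window=window)


print(build_query("requests_per_second"))
print(build_query("error_rate", window="1m"))
```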
These base metrics can be combined using labels to gain visibility into service performance. Here are some of the labels that can be viewed using Istio:
- Source service
- Destination service
- Service failure, unavailability
- Request routing
- Load balancing
A longer list of metrics and labels is available in the Istio documentation. All of these metrics and labels can be monitored using Prometheus.
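Labels such as source and destination service become filters in a PromQL selector. The sketch below builds such a selector in Python. The `selector` and `escape_label_value` helpers are hypothetical, the label names `source_workload` and `destination_service` follow Istio’s standard metrics, and the workload and service values are made up for illustration.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines for a PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def selector(metric, **labels):
    """Build a PromQL selector such as metric{a="x",b="y"} using exact matches."""
    if not labels:
        return metric
    pairs = ",".join(
        f'{name}="{escape_label_value(value)}"'
        for name, value in sorted(labels.items())
    )
    return f"{metric}{{{pairs}}}"


# Traffic from one (hypothetical) workload to one (hypothetical) service.
query = selector(
    "istio_requests_total",
    source_workload="productpage-v1",
    destination_service="reviews.default.svc.cluster.local",
)
print(f"sum(rate({query}[5m]))")
```

Sorting the labels keeps the generated query stable, which helps when queries are stored in dashboards or compared in tests.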
Consuming Istio Metrics in Prometheus
Once these metrics are scraped by Prometheus, it opens up a world of possibilities. At the most basic level, you can view a time-series representation of each metric in Prometheus. Using PromQL you can select and query each metric. This is useful for ad hoc analysis of data. For more routine day-to-day analysis, it helps to set up Grafana dashboards of the most important metrics.
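For ad hoc analysis outside the Expression Browser, the same queries can be sent to Prometheus’ HTTP API at `/api/v1/query`. The following Python sketch builds the request URL and parses an instant-query response. The base URL `http://localhost:9090` and the canned response are assumptions for illustration, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_instant_vector(body):
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample's value is [unix_timestamp, "value-as-string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Canned response with made-up numbers, in the shape the API returns.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"destination_service": "reviews"},
             "value": [1700000000, "12.5"]},
        ],
    },
})

url = instant_query_url(
    "http://localhost:9090",
    "sum by (destination_service) (rate(istio_requests_total[5m]))",
)
print(url)
print(parse_instant_vector(SAMPLE_RESPONSE))
```

Against a live server, the response body could be fetched with `urllib.request.urlopen(url).read()` and passed to `parse_instant_vector`.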
Beyond this, you can use Prometheus’ recording rules to precompute frequently used or expensive expressions and save the results as new time series. Rather than viewing very granular metrics at a per-second or per-minute interval, you may want to aggregate the data for a broader view. Querying the precomputed series is faster than evaluating the original expression each time, which can make Grafana dashboards refresh more quickly.
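A recording rule lives in a YAML rule file that Prometheus loads through the `rule_files` setting. As a sketch, the Python helper below renders such a file for one rule. The helper `recording_rule_file`, the group name, and the recorded series name are hypothetical; the series name simply follows the common `level:metric:operations` naming pattern.

```python
import json


def recording_rule_file(group, record, expr):
    """Render a Prometheus rule file that contains one recording rule."""
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines = [
        "groups:",
        f"  - name: {json.dumps(group)}",
        "    rules:",
        f"      - record: {record}",
        f"        expr: {json.dumps(expr)}",
    ]
    return "\n".join(lines) + "\n"


# Precompute the per-service request rate as a new time series.
RULE_FILE = recording_rule_file(
    group="istio.rules",
    record="destination_service:istio_requests_total:rate5m",
    expr="sum by (destination_service) (rate(istio_requests_total[5m]))",
)
print(RULE_FILE)
```

A dashboard can then query `destination_service:istio_requests_total:rate5m` directly. A generated file can be checked with `promtool check rules` before Prometheus loads it.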
Apart from aggregating metrics, Prometheus allows you to manage notifications or alerts based on these metrics. Istio sends out lots of monitoring data, and this can generate an equally high volume of alerts. With Prometheus, you can set alerting rules to define which conditions should trigger alerts.
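An alerting rule pairs a PromQL condition with a duration it must hold for, plus labels and annotations for routing and context. The Python sketch below renders one such rule as a YAML rule file. The helper `alerting_rule_file`, the alert name, the 5% threshold, the 10-minute duration, and the severity value are all assumptions chosen for illustration.

```python
import json


def alerting_rule_file(group, alert, expr, hold_for, severity, summary):
    """Render a Prometheus rule file that contains one alerting rule."""
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines = [
        "groups:",
        f"  - name: {json.dumps(group)}",
        "    rules:",
        f"      - alert: {alert}",
        f"        expr: {json.dumps(expr)}",
        f"        for: {hold_for}",
        "        labels:",
        f"          severity: {json.dumps(severity)}",
        "        annotations:",
        f"          summary: {json.dumps(summary)}",
    ]
    return "\n".join(lines) + "\n"


# Fire when more than 5% of a service's requests fail for 10 minutes.
ALERT_FILE = alerting_rule_file(
    group="istio.alerts",
    alert="IstioHighErrorRate",
    expr=(
        'sum by (destination_service) (rate(istio_requests_total{response_code=~"5.."}[5m]))'
        " / sum by (destination_service) (rate(istio_requests_total[5m])) > 0.05"
    ),
    hold_for="10m",
    severity="page",
    summary="High 5xx rate for {{ $labels.destination_service }}",
)
print(ALERT_FILE)
```

The `for` clause is what keeps a brief spike from paging anyone: the condition has to stay true for the whole duration before the alert fires.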
Prometheus’ Alertmanager allows you to group, deduplicate, and route alerts based on your needs. Email is a basic way to start receiving alerts. However, since Prometheus focuses on monitoring rather than on-call management, it is often better to route alerts to a dedicated notification or incident management tool such as Slack or VictorOps.
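To show what grouping and deduplication mean in practice, here is a simplified conceptual model in Python. It is not Alertmanager’s implementation: the `group_alerts` helper and the sample alerts are hypothetical, and the real Alertmanager also handles timing, silences, and inhibition.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop alerts with identical label sets, then bucket the rest by `group_by`."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = frozenset(labels.items())
        if fingerprint in seen:
            continue  # exact duplicate of an alert already handled
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts, each represented by its label set.
ALERTS = [
    {"alertname": "IstioHighErrorRate", "destination_service": "reviews"},
    {"alertname": "IstioHighErrorRate", "destination_service": "reviews"},
    {"alertname": "IstioHighErrorRate", "destination_service": "ratings"},
    {"alertname": "IstioHighLatency", "destination_service": "reviews"},
]

# One notification per alert name instead of one per alert.
for key, members in group_alerts(ALERTS, ["alertname"]).items():
    print(key, len(members))
```

In Alertmanager itself, the grouping labels are set with `group_by` on a route, and each group is delivered to the receiver configured for that route.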
While Prometheus is the central monitoring solution for Istio, end-to-end visibility requires more than just time-series data – it needs to be complemented with logging and tracing data. Tracing helps to uncover the path of requests as they travel across the network. This helps to identify bottlenecks. Logging helps dig deeper into specific errors and issues. The EFK stack is a capable open-source logging solution, and Jaeger is an equivalent solution for tracing.
In conclusion, Istio promises deeper observability into microservices applications powered by Kubernetes. While Istio makes the base monitoring data available, this data needs to be analyzed by Prometheus before it can be put to use. With capabilities like PromQL, recording rules, alerting rules, and Grafana for visualization, it’s no wonder Prometheus has taken the top spot when it comes to monitoring Istio.
If you’re using Prometheus, or considering using it to monitor Istio, DX APM now has the ability to remotely ingest and report Prometheus metrics. To see how to get started, check out this video.