This article is part 2 of a four-part series about Elasticsearch monitoring. In this part, we’re going to look at Elasticsearch’s capabilities and potential use cases. We’ll also walk through what it takes to monitor an Elasticsearch cluster, and we’ll identify the key metrics you need to monitor and why they’re essential to maintaining the health and performance of your cluster.
(To learn more about Elasticsearch open source monitoring tools, check out part 3 of this series. You can learn how to monitor Elasticsearch with Sematext in part 4.)
Elasticsearch Use Cases
First released in 2010, Elasticsearch is a search engine based on the open source Apache Lucene library. It is developed in Java and supports clients in many different languages, such as PHP, Python, C#, and Ruby, and it is currently the most popular enterprise search engine.
As a distributed tool, Elasticsearch is highly scalable and offers near real-time search capabilities. All of this adds up to a tool which can support a multitude of critical business needs and use cases. Organizations have used Elasticsearch as a solution to support the following:
- Log Collection and Aggregation
- Gathering and Indexing Large Datasets
- Collection and Management of Metrics and Event Data
- Analysis of Logs, Large Datasets, and Metric Data
- Full Text and Document Search Capabilities
Importance of Monitoring your Elasticsearch Cluster
As versatile, scalable and useful as Elasticsearch is, it’s essential that the infrastructure which hosts your cluster meets its needs, and that the cluster is sized correctly to support its data store and the volume of requests it must handle. Improperly sized infrastructure and misconfigurations can result in everything from sluggish performance to the entire cluster becoming unresponsive and crashing.
Appropriately monitoring your cluster can help you ensure that the cluster is appropriately sized and that it handles all data requests efficiently. We’re going to look at the cluster from five different perspectives, and discuss critical metrics to monitor from each of these perspectives and what potential problems you can avert by watching these metrics.
Five Areas of Concern
The five areas we’re going to consider are:
- Cluster Health
- Search Performance
- Index Performance
- Node Health
- JVM Health
Let’s look at each area in turn and discuss why each area is integral to the health of your Elasticsearch cluster.
Cluster Health: Shards and Nodes
An Elasticsearch cluster can consist of one or more nodes. A node is a member of the cluster, hosted on an individual server. Adding additional nodes is what allows us to scale the cluster horizontally. Indexes organize the data within the cluster. An index is a collection of documents which share a similar characteristic.
Consider the example of an Elasticsearch cluster deployed to store log entries from an application. An index might be set up to collect all log entries for a day. Each log entry is a document which contains the contents of the log and associated metadata.
In large datasets, the size of an index might exceed the storage capacity of a single node. We also want to keep redundant copies of our index in case something happens to a node. Elasticsearch handles this by dividing an index into a defined number of shards and distributing those shards across all nodes in the cluster. By default, an Elasticsearch index has five shards with one replica (as of Elasticsearch 7.0, the default is a single shard with one replica). The result of this default configuration is an index divided into five shards, each with a single replica stored on a different node.
It is essential to find the right number of shards for an index because too few shards may negatively affect search efficiency and the distribution of data across the nodes. Conversely, too many shards create an excessive demand on the resources of the cluster for their management.
When monitoring your cluster, you can query the cluster health endpoint and receive information about the status of the cluster, the number of nodes, and the counts of active shards. You can also see counts for relocating shards, initializing shards and unassigned shards. An example response of such a request can be seen below.
Fig. 1. Example of Response to a Request Sent to GET _cluster/health
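As a sketch of how such a response might be consumed by a monitoring script, the snippet below inspects a sample payload. The field names follow the `_cluster/health` API; the values themselves are invented for illustration:

```python
import json

# Hypothetical response body from GET _cluster/health (values are invented).
raw = """{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "number_of_nodes": 3,
  "active_primary_shards": 5,
  "active_shards": 9,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 1
}"""

health = json.loads(raw)

# A yellow status with unassigned shards usually means replicas could
# not be allocated, e.g. after losing a node.
if health["status"] != "green" or health["unassigned_shards"] > 0:
    print(f"cluster {health['cluster_name']} needs attention: "
          f"status={health['status']}, "
          f"unassigned={health['unassigned_shards']}")
```

In practice you would fetch the payload from your cluster’s `_cluster/health` endpoint on a schedule and alert on any transition away from `green`.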
Relocating and initializing shards indicate rebalancing on the cluster or the creation of new shards. Rebalancing occurs when a node is added or removed from the cluster and will affect the performance of the cluster. By understanding these metrics and how they affect cluster performance, you will have more insight into the cluster and can tune the cluster for better performance. One such adjustment is adding a shard relocation delay when a node leaves the cluster, eliminating excessive overhead if it returns quickly.
Fig. 2: Dropping nodes and relocation of shards
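The relocation-delay tweak mentioned above corresponds to the `index.unassigned.node_left.delayed_timeout` setting. A minimal sketch of the request body you would `PUT` to the index settings endpoint follows; the five-minute value is only an example, not a recommendation:

```python
import json

# Delay replica re-allocation for 5 minutes after a node leaves, so a
# node that restarts quickly does not trigger a full shard rebalance.
# (The "5m" timeout is an example value; tune it for your environment.)
settings_body = {
    "settings": {
        "index.unassigned.node_left.delayed_timeout": "5m"
    }
}

# You would send this as: PUT /_all/_settings (or to a specific index)
print(json.dumps(settings_body, indent=2))
```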
Each of these metrics requires continuous monitoring and an understanding of how the cluster manages its resources to ensure it remains healthy. (We’ll also talk about monitoring the system health of each of the nodes in a subsequent section.)
|Important Metrics for Cluster Health|
|Status||The status of the cluster: green (all primary and replica shards are allocated), yellow (all primary shards are allocated, but one or more replicas are not), or red (one or more primary shards are unassigned).|
|Nodes||The total number of nodes in the cluster, along with the number of data nodes.|
|Count of Active Shards||The number of active shards within the cluster.|
|Relocating Shards||Count of shards currently being moved to another node, e.g. during rebalancing or after the loss of a node.|
|Initializing Shards||Count of shards being initialized due to the addition of an index.|
|Unassigned Shards||Count of shards for which replicas have not been created or assigned yet.|
Search Performance: Request Rate and Latency
A data store is only as useful as it is responsive, and we can measure the effectiveness of the cluster by the rate at which it processes requests and how long each request takes.
When the cluster receives a request, it may need to access data from multiple shards, across multiple nodes. Knowing the rate at which the system is processing and returning requests, how many requests are currently in progress, and how long requests are taking can provide valuable insights into the health of the cluster.
The request process itself is divided into two phases. The first is the query phase, during which the cluster distributes the request to a copy (either primary or replica) of every shard in the index. During the second, the fetch phase, the results of the query are gathered, compiled, and returned to the user.
Fig. 3: Query and fetch rate
We want to be aware of spikes in any of these metrics, as well as any emerging trends which might indicate growing problems within the cluster. These metrics are calculated by index and are available from the RESTful endpoints on the cluster itself.
Please refer to the table below for the metrics available from the index stats endpoint, found at /index_name/_stats, where index_name is the name of the index.
|Important Metrics for Request Performance|
|Number of queries currently in progress||Count of queries currently being processed by the cluster.|
|Number of fetches currently in progress||Count of fetches in progress within the cluster.|
|Total number of queries||Aggregated number of all queries processed by the cluster.|
|Total time spent on queries||Total time consumed by all queries in milliseconds.|
|Total number of fetches||Aggregated number of all fetches processed by the cluster.|
|Total time spent on fetches||Total time consumed by all fetches in milliseconds.|
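Since the totals above are cumulative counters, turning them into a request rate and an average latency means sampling them twice and taking the difference. A sketch, with invented numbers standing in for two samples of the query counters from `/index_name/_stats`:

```python
# Two hypothetical samples of the cumulative query counters, taken
# 60 seconds apart (all numbers are invented for illustration).
sample_interval_s = 60
query_total_t0, query_time_ms_t0 = 14_000, 91_000
query_total_t1, query_time_ms_t1 = 15_200, 99_400

queries = query_total_t1 - query_total_t0
time_ms = query_time_ms_t1 - query_time_ms_t0

query_rate = queries / sample_interval_s   # queries per second
avg_latency_ms = time_ms / queries         # average time per query

print(f"rate: {query_rate:.1f} q/s, avg latency: {avg_latency_ms:.1f} ms")
```

The same two-sample technique applies to the fetch counters, giving you fetch rate and fetch latency per interval.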
Index Performance: Refresh and Merge Times
As documents are added, updated, and removed from an index, the cluster must continually update its indexes and then refresh them across all the nodes. All of this is taken care of by the cluster, and as a user, you have limited control over the process, other than configuring the refresh interval rate.
Additions, updates, and deletions are batched and flushed to disk as new segments. Because each segment consumes resources, it is important for performance that smaller segments are consolidated and merged into larger ones. Like indexing, this is managed by the cluster itself.
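The one knob you do control, the refresh interval, maps to the `index.refresh_interval` setting. A minimal sketch of the settings body you would `PUT` to an index; the 30-second value is only an example for a write-heavy index:

```python
import json

# Refresh less often than the 1s default to reduce refresh overhead
# on write-heavy indices. ("30s" is an example value, not a
# recommendation; it trades search freshness for indexing throughput.)
refresh_settings = {
    "index": {
        "refresh_interval": "30s"
    }
}

# You would send this as: PUT /index_name/_settings
print(json.dumps(refresh_settings, indent=2))
```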
Fig. 4: Refresh, flush and merge stats
Monitoring the indexing rate of documents and merge time can help with identifying anomalies and related problems before they begin to affect the performance of the cluster. Considering these metrics in parallel with the health of each node can provide essential clues to potential problems within the system, or opportunities to optimize performance.
Fig. 5: Indexing rate
Index performance metrics can be retrieved from the /_nodes/stats endpoint and can be summarized at the node, index or shard level. This endpoint has a plethora of information, and the sections under merges and refresh are where you’ll find relevant metrics for index performance.
Fig. 6. Example of Response to a Request Sent to GET /_nodes/stats
|Important Metrics for Index Performance|
|Total refreshes||Count of the total number of refreshes.|
|Total time spent refreshing||Aggregation of all time spent refreshing. Measured in milliseconds.|
|Current merges||Merges currently being processed.|
|Total merges||Count of the total number of merges.|
|Total time spent merging||Aggregation of all time spent merging segments.|
Node Health: Memory, Disk, and CPU Metrics
Each node runs on physical hardware and needs access to system memory, disk storage, and CPU cycles to manage the data under its control and respond to requests to the cluster.
Elasticsearch is a system which is heavily reliant on memory to be performant, and so keeping a close eye on memory usage is particularly relevant to the health and performance of each node. Configuration changes to improve metrics may also adversely affect memory allocation and usage, so it’s important to remember to view system health holistically.
Fig. 7: System memory usage
Monitoring the CPU usage for a node and looking for spikes can help identify inefficient processes or potential problems within the node. CPU performance correlates closely to the garbage collection process of the Java Virtual Machine (JVM), which we’ll discuss next. Using software to correlate metrics from one area of concern to those of another can also help reduce false alarms, or better identify anomalies.
Fig. 8: System CPU usage
Fig. 9. JVM Garbage collection time
Finally, high disk reads and writes can indicate a poorly tuned system. Since accessing the disk is an expensive process in terms of time, a well-tuned system should reduce disk I/O wherever possible.
Fig. 10. Disk I/O
These metrics are measured at the node level, and reflect the performance of the instance or machine on which the node is running. The most succinct source of these metrics is the /_cat/nodes endpoint, and you can pass the h parameter to define which columns it should return.
|Important Metrics for Node Health|
|Total disk capacity||Total disk capacity on the node’s host machine.|
|Total disk usage||Total disk usage on the node’s host machine.|
|Total available disk space||Total disk space available.|
|Percentage of disk used||Percentage of disk which is already used.|
|Current RAM usage||Current memory usage on the node’s host machine.|
|RAM percentage||Percentage of memory being used.|
|Maximum RAM||Total amount of memory on the node’s host machine|
|CPU||Percentage of the CPU in use.|
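A sketch of such a request follows. The column names passed via `h` correspond to the metrics in the table above; the host and port are assumptions, so substitute your cluster’s address:

```python
# Column names accepted by the _cat/nodes `h` parameter that cover
# the node-health metrics listed above.
columns = [
    "disk.total", "disk.used", "disk.avail", "disk.used_percent",
    "ram.current", "ram.percent", "ram.max", "cpu",
]

# localhost:9200 is an assumption; point this at your own cluster.
url = "http://localhost:9200/_cat/nodes?v&h=" + ",".join(columns)
print(url)
```

Requesting this URL (e.g. with curl) returns one row per node, making it easy to spot a node whose disk or CPU usage diverges from its peers.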
JVM Health: Heap, GC, and Pool Size
As a Java-based application, Elasticsearch runs within a Java Virtual Machine (JVM). The JVM manages its memory within its ‘heap’ allocation and evicts objects from the heap with a garbage collection process.
If the needs of the application exceed the capacity of the heap, the application is forced to begin using swap space on attached storage media. While this prevents the system from crashing, it can wreak havoc on the performance of the cluster. Monitoring the available heap space to ensure the system has sufficient capacity is essential to a healthy system.
JVM memory is allocated to different memory pools. You’ll want to keep an eye on each of these pools to ensure that they have been adequately utilized and are not in danger of being overrun.
The garbage collector (GC) is much like a physical garbage collection service. We want to keep it running regularly and ensure that the system doesn’t overburden it. Ideally, a view of GC performance should indicate regular executions of a similar size. Spikes and anomalies can be indicators of much deeper problems. See also: Garbage Collection Settings for Elasticsearch Master Nodes
Fig. 11. Example of Healthy JVM Heap and GC Management.
The image above shows an example of the healthy “sawtooth” pattern we expect when monitoring memory usage on a healthy JVM on a single node. Routine and predictable increases and decreases in memory usage indicate that memory is being actively managed and that the garbage collector is not being taxed beyond its abilities or degrading performance as a result.
JVM metrics can be retrieved from the /_nodes/stats endpoint.
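For example, heap pressure on a node can be derived from the `jvm` section of that endpoint. The sketch below checks a sample value against a 75% threshold, which is a commonly used warning level rather than an official limit; the payload values are invented, though the field names follow the API:

```python
# Hypothetical fragment of the `jvm` section from /_nodes/stats
# (field names follow the API; the values are invented).
jvm_stats = {
    "mem": {
        "heap_used_in_bytes": 3_221_225_472,
        "heap_max_in_bytes": 4_294_967_296,
        "heap_used_percent": 75,
    },
    "gc": {
        "collectors": {
            "old": {"collection_count": 8, "collection_time_in_millis": 420},
        }
    },
}

heap_used = jvm_stats["mem"]["heap_used_percent"]

# 75% is a commonly used warning threshold, not an official limit.
if heap_used >= 75:
    print(f"heap usage at {heap_used}% - risk of long GC pauses")
```

Tracking `collection_count` and `collection_time_in_millis` over time in the same way reveals whether GC runs are staying regular and short, matching the healthy pattern described above.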
|Important Metrics for JVM Health|
|Memory usage||Usage statistics for heap and non-heap processes and pools.|
|Threads||Current threads in use, and maximum number.|
|Garbage collection||Counts and total time spent with garbage collection.|
So there you have it — the top Elasticsearch metrics to monitor:
- Cluster Health – Nodes and Shards
- Search Performance – Request Latency
- Search Performance – Request Rate
- Indexing Performance – Refresh Times
- Indexing Performance – Merge Times
- Node Health – Memory Usage
- Node Health – Disk I/O
- Node Health – CPU
- JVM Health – Heap Usage and Garbage Collection
- JVM health – JVM Pool Size
It’s hard to do justice to each area of concern when monitoring your Elasticsearch cluster. The tight coupling between different metrics and understanding how changes in configuration might affect each requires a team of experienced and well-trained engineers. Unfortunately, such a team may also require a substantial investment of time and money for an organization.
Investing in a comprehensive monitoring strategy is critical for any organization which implements Elasticsearch as a solution. Effective monitoring saves organizations downtime and lost revenue due to an unresponsive or irreparable cluster. You can learn how to monitor Elasticsearch with Sematext in part 4.