If you keep up to date with the latest developments in the world of databases, you are probably familiar with ClickHouse, a columnar database management system designed for OLAP. Developed by Yandex, ClickHouse was open-sourced in 2016, which makes it one of the most recent database management systems to become widely available as an open source tool.
Because ClickHouse supports real-time, high-speed reporting, it’s a powerful tool, especially for modern DevOps teams who need fast and flexible ways of analyzing data.
Like most DevOps tools, however, ClickHouse delivers great value only when it is properly managed and monitored. Even though the tool is designed for high performance, actually achieving high performance requires careful attention to the health of the system.
With this need in mind, this article (the first of a three-part series) explains how to get started developing a monitoring strategy by identifying which types of ClickHouse metrics to monitor. In subsequent articles (part 2 and part 3), we’ll discuss ClickHouse monitoring tools, then explore how to monitor ClickHouse with Sematext.
Essential Metrics for ClickHouse monitoring
Without further ado, let’s explore the key metrics to monitor in order to manage ClickHouse effectively. The metric names and keys identified below are based on those that are monitored via Sematext’s ClickHouse integration.
ClickHouse Events metrics
The first and broadest category of key metrics covers events within ClickHouse. There are several specific types of events to monitor within this category:
- Total query count – clickhouse.query.count. This number represents the total number of queries processed by your ClickHouse installation. It’s a key metric for assessing the overall level of activity in your ClickHouse system.
- Inserted rows – clickhouse.insert.rows. This metric represents the number of rows inserted in all tables and reflects the level of activity within your database, as well as database size.
- Inserted bytes – clickhouse.insert.bytes. The number of uncompressed bytes inserted in all tables. Also a reflection of activity level and database size.
Query count, inserted rows and bytes
- Merged rows – clickhouse.merge.rows. The number of rows read for background merges, counted before the merge takes place.
- Uncompressed bytes merged – clickhouse.merge.bytes.uncompressed. The number of uncompressed bytes read for background merges, also counted before the merge takes place.
Merged rows and uncompressed bytes merged
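Event counters like these are cumulative, so what you usually want to chart is the rate of change between two samples. The sketch below assumes the counters are read from ClickHouse's `system.events` table, where the corresponding event names are `Query`, `InsertedRows`, `InsertedBytes`, `MergedRows` and `MergedUncompressedBytes`. How those events map to the metric keys listed above is an assumption made for illustration, and the helper names are hypothetical.

```python
# Event names as they appear in ClickHouse's system.events table.
EVENTS = ("Query", "InsertedRows", "InsertedBytes",
          "MergedRows", "MergedUncompressedBytes")

# SQL you could send to ClickHouse to take one snapshot of the counters.
SNAPSHOT_SQL = (
    "SELECT event, value FROM system.events "
    "WHERE event IN ({}) FORMAT TabSeparated"
).format(", ".join("'{}'".format(e) for e in EVENTS))


def parse_snapshot(tsv_text):
    """Turn 'name<TAB>value' lines into a dict of integer counters."""
    counters = {}
    for line in tsv_text.strip().splitlines():
        name, value = line.split("\t")
        counters[name] = int(value)
    return counters


def rates_per_second(previous, current, elapsed_seconds):
    """Convert two cumulative snapshots into per-second rates."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rates = {}
    for name in EVENTS:
        delta = current.get(name, 0) - previous.get(name, 0)
        # A negative delta suggests the server restarted and counters reset,
        # so fall back to the current value.
        if delta < 0:
            delta = current.get(name, 0)
        rates[name] = delta / elapsed_seconds
    return rates
```

Events that have not occurred yet are simply absent from a snapshot, which is why missing names are treated as zero.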
ClickHouse network metrics
Although ClickHouse is not a networking tool, it relies on the network to transmit information. For that reason, network metrics provide a useful way of assessing ClickHouse performance and health. In particular, you will want to track the following:
- TCP Connections – clickhouse.connection.tcp.count. The total number of connections to the TCP server. Helps measure the load of your ClickHouse installation.
- HTTP Connections – clickhouse.connection.http.count. The number of connections to the HTTP server. Also a reflection of load.
- Interserver Connections – clickhouse.connection.interserver.count. This metric represents the number of connections from other replicas to fetch parts. It’s not directly tied to overall system load, but it is useful for assessing and optimizing the performance of your ClickHouse installation.
HTTP connections, TCP connections, Interserver Connections
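If you want to spot-check these connection counts by hand, one option is to read them from the `system.metrics` table over ClickHouse's HTTP interface. The following is a minimal sketch, assuming a server reachable at `http://localhost:8123/` without authentication; the function names are hypothetical, and treating `TCPConnection`, `HTTPConnection` and `InterserverConnection` as the source of the three metrics above is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Gauge names as they appear in ClickHouse's system.metrics table.
CONNECTION_METRICS = ("TCPConnection", "HTTPConnection", "InterserverConnection")


def build_url(base_url="http://localhost:8123/"):
    """Build an HTTP-interface URL that selects the connection gauges."""
    names = ", ".join("'{}'".format(m) for m in CONNECTION_METRICS)
    sql = ("SELECT metric, value FROM system.metrics "
           "WHERE metric IN ({}) FORMAT TabSeparated").format(names)
    return base_url + "?" + urlencode({"query": sql})


def parse_metrics(tsv_text):
    """Parse 'name<TAB>value' lines; gauges missing from the output stay at 0."""
    gauges = {name: 0 for name in CONNECTION_METRICS}
    for line in tsv_text.strip().splitlines():
        name, value = line.split("\t")
        gauges[name] = int(value)
    return gauges


def fetch_connection_counts(base_url="http://localhost:8123/"):
    """Query the server and return the current connection counts."""
    with urlopen(build_url(base_url), timeout=5) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Unlike event counters, these values are gauges, so they can be charted directly without computing a rate.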
ClickHouse uses Apache ZooKeeper to coordinate replication and other distributed operations, so monitoring ZooKeeper is important for keeping ClickHouse running properly. ZooKeeper metrics help you understand the state of your ClickHouse installation. Within this category, the key metrics to follow are:
- ZooKeeper watches – clickhouse.zk.watches. The number of watches (i.e., event subscriptions) in ZooKeeper.
- ZooKeeper wait – clickhouse.zk.wait.time. Time spent waiting for ZooKeeper operations.
- ZooKeeper requests – clickhouse.zk.requests. Number of requests to ZooKeeper in progress.
ZooKeeper watches, wait time and requests
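Total wait time is easier to interpret when you relate it to the number of ZooKeeper operations performed. As a rough sketch, assuming you sample the cumulative `ZooKeeperWaitMicroseconds` and `ZooKeeperTransactions` counters from `system.events` at two points in time, you could derive an average wait per operation like this (the function name is hypothetical):

```python
def average_zk_wait_ms(previous, current):
    """Average wait per ZooKeeper transaction between two snapshots, in ms.

    Both arguments are dicts holding the cumulative counters
    'ZooKeeperWaitMicroseconds' and 'ZooKeeperTransactions'.
    Returns None when no transactions happened in the interval.
    """
    waited_us = (current["ZooKeeperWaitMicroseconds"]
                 - previous["ZooKeeperWaitMicroseconds"])
    transactions = (current["ZooKeeperTransactions"]
                    - previous["ZooKeeperTransactions"])
    if transactions <= 0:
        return None
    return waited_us / transactions / 1000.0
```

A rising average with a steady transaction count points at ZooKeeper itself or the network path to it, rather than at increased ClickHouse activity.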
The following asynchronous metric is another essential ClickHouse metric to monitor:
- Max active part count – clickhouse.part.count.max. This metric represents the maximum number of active parts found in any single partition. An active part is currently used by a table. Inactive parts are left over after merges and are deleted later.
Data part metrics
In addition to the asynchronous ClickHouse metrics described above, you’ll want to monitor the following MergeTree data part metrics:
- Active part count – clickhouse.mergetree.table.parts. The number of active parts in MergeTree tables.
- Row count – clickhouse.mergetree.table.rows. The number of rows in MergeTree tables.
Row count and active part count
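To see where part and row counts come from, you can aggregate the `system.parts` table yourself. The sketch below assumes you have already run the query shown in `PARTS_SQL` and loaded its result as tuples. The function name is hypothetical, and the result is only an approximation of what a monitoring agent reports.

```python
from collections import defaultdict

# One row per partition: how many active parts and rows it holds.
PARTS_SQL = (
    "SELECT database, table, partition, count() AS parts, sum(rows) AS rows "
    "FROM system.parts WHERE active "
    "GROUP BY database, table, partition"
)


def summarize_parts(partition_rows):
    """Aggregate per-partition rows into per-table totals.

    partition_rows: iterable of (database, table, partition, parts, rows).
    Returns (tables, max_parts), where tables maps 'db.table' to its totals
    and max_parts is the largest active part count seen in one partition.
    """
    tables = defaultdict(lambda: {"parts": 0, "rows": 0})
    max_parts = 0
    for database, table, _partition, parts, rows in partition_rows:
        key = "{}.{}".format(database, table)
        tables[key]["parts"] += parts
        tables[key]["rows"] += rows
        max_parts = max(max_parts, parts)
    return dict(tables), max_parts
```

The per-partition maximum is the number to watch most closely: when a partition accumulates parts faster than background merges can combine them, ClickHouse eventually starts rejecting inserts with a "Too many parts" error.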
Last but not least is the following replica status metric:
- Replica queue size – clickhouse.replica.queue.size. This metric represents the size of the queue for operations waiting to be performed. In this case, operations include inserting blocks of data, merges, and certain other actions.
Replica queue size
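A simple way to act on this metric is to flag replicas whose queue grows beyond a limit you choose. The sketch below assumes the queue sizes come from the `queue_size` column of the `system.replicas` table in tab-separated form. The function name is hypothetical and the default limit of 20 is purely illustrative, so tune it to your workload.

```python
# SQL you could send to ClickHouse to list replica queue sizes.
REPLICA_SQL = (
    "SELECT database, table, queue_size FROM system.replicas "
    "FORMAT TabSeparated"
)


def lagging_replicas(tsv_text, max_queue_size=20):
    """Return (table, queue_size) pairs above the limit, largest first."""
    lagging = []
    for line in tsv_text.strip().splitlines():
        database, table, queue_size = line.split("\t")
        size = int(queue_size)
        if size > max_queue_size:
            lagging.append(("{}.{}".format(database, table), size))
    return sorted(lagging, key=lambda item: item[1], reverse=True)
```

A queue that stays large or keeps growing suggests the replica cannot keep up with fetches and merges, which is worth investigating before it falls too far behind.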
Effective ClickHouse monitoring requires tracking a variety of metrics that reflect the availability, activity level and performance of your ClickHouse installation. Those described above represent only the most important metrics to monitor. For a longer list of the ClickHouse metrics you can collect using a tool such as Sematext, see this documentation page.