Mainframe computing continues to serve as a backbone for mission-critical business applications, even as those applications are serving an increasingly diverse collection of frontend components and service-related APIs. This new IT landscape allows businesses to be more dynamic and serve new and changing segments of the population, but it also introduces complexities when problems occur within the system. In this article, we’re going to discuss the Key Performance Indicators (KPIs) you need to be aware of within your IBM® z/OS infrastructure. We’ll identify these critical metrics, discuss their importance to your infrastructure, and talk about how you can increase observability within your organization.
MTTD, MTTR and the Importance of Performance Monitoring
Two of the most important metrics a business can monitor are MTTD and MTTR. MTTD stands for Mean Time To Detection. This metric measures the time between the system beginning to encounter an error and the business knowing that an error has occurred. Once we are aware that an error is occurring, MTTR, or Mean Time To Resolution, is a measure of the time it takes to resolve the error and return systems to normal.
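To make the two definitions concrete, here is a minimal sketch that computes both averages from a list of incident records. The `Incident` class, its field names, and the sample timestamps are illustrative assumptions, not part of any IBM product. MTTR is measured from detection to resolution, following the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """One incident record. The field names are hypothetical."""
    started: datetime    # when the system first began encountering the error
    detected: datetime   # when the business became aware of the error
    resolved: datetime   # when systems returned to normal


def mean_time_to_detection(incidents):
    """Average gap between an error starting and being detected."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mean_time_to_resolution(incidents):
    """Average gap between detection and resolution."""
    return timedelta(seconds=mean(
        (i.resolved - i.detected).total_seconds() for i in incidents))


# Made-up sample data: two incidents with different detection delays.
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0),
             datetime(2024, 1, 5, 9, 10),
             datetime(2024, 1, 5, 10, 10)),
    Incident(datetime(2024, 2, 1, 14, 0),
             datetime(2024, 2, 1, 14, 20),
             datetime(2024, 2, 1, 14, 50)),
]

mttd = mean_time_to_detection(incidents)
mttr = mean_time_to_resolution(incidents)
print("MTTD:", mttd)
print("MTTR:", mttr)
```

Tracking both numbers over time shows whether monitoring changes are shortening detection, resolution, or both.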
Application Performance Monitoring (APM) is the primary weapon in your defensive arsenal. When you are monitoring the right metrics within your infrastructure, an effective APM solution notifies critical support personnel before or when errors begin to occur, reducing the MTTD, and engaging the appropriate resources to begin resolving the problem as soon as possible.
APM software also provides a holistic view of an entire system, allowing engineers from any part of that system to understand relationships between subsystems and better identify where any errors are originating. Advanced APM systems can also integrate machine learning models into their monitoring to understand trends within the system, and in some cases, proactively prevent problems before they rise to the level of an error or system outage.
Visibility with IBM® Z APM Connect
The metrics we’re going to talk about below can all be collected, visualized and monitored using IBM® Z Application Performance Management Connect. This system collects these metrics from OMEGAMON events which originate in each of the z/OS subsystems. A tremendous benefit of using APM Connect is that it can be used as part of broader APM systems, providing IT operations and support personnel with detailed and targeted insights into all the systems they support.
Understanding What Needs to be Monitored
In their most basic form, all computer systems share some essential characteristics. The system should be able to process data. While data is waiting to be processed, it needs to be stored, and it also needs a way of flowing in and out of the system. Monitoring allows us to ensure that each action completes within acceptable limits.
An efficient computer system relies on each of the characteristics listed above working together in harmony. Observing these individually and collectively allows us to understand the nature of the work which the system is conducting, and can either confirm the efficiency of the system or indicate problems which need to be investigated and resolved. Let’s start by taking a general look at the classes of metrics we should have on our radar.
Data Processing: Measuring the percentage of CPU in use at any time provides insights into how efficiently data is processed. For transactional systems, we can observe the time taken for a transaction to complete, as well as the counts of successful and failed transactions over a period.
Data Storage: In high-performing systems, we need to pay particular attention to the amount of memory in use, and avoid the need for the system to use storage media to house data temporarily while it is actively processing. Writing the data to disk is referred to as paging, and has a markedly negative effect on performance.
Data I/O: System performance is affected by how quickly we can process data, and how quickly we can move data in and out of the system. Data transfer rates are a key performance indicator. We also want to monitor the size of the data packets moving through our system and be aware of the effects that different-sized packets have on storage and processing.
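The three classes of metrics above can be pictured as one sample per subsystem per interval, checked against acceptable limits. The sketch below is only an illustration: the `MetricSample` fields, the `LIMITS` values, and the `breaches` helper are assumptions chosen for the example, and real limits depend on your workload and service levels.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    """One reading per metric class. The field names are hypothetical."""
    cpu_utilization_pct: float   # data processing
    paging_rate_per_sec: float   # data storage
    io_rate_per_sec: float       # data I/O


# Hypothetical upper limits. Metrics without an entry are not checked.
LIMITS = {
    "cpu_utilization_pct": 90.0,
    "paging_rate_per_sec": 5.0,
}


def breaches(sample, limits=LIMITS):
    """Return the names of metrics whose value exceeds its limit."""
    return [name for name, limit in limits.items()
            if getattr(sample, name) > limit]


sample = MetricSample(cpu_utilization_pct=95.0,
                      paging_rate_per_sec=0.0,
                      io_rate_per_sec=1200.0)
flagged = breaches(sample)
print(flagged)
```

Keeping the limits in a plain dictionary makes it easy to review them with product owners and adjust them per subsystem.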
Specific KPIs to Monitor
We’ll be looking at four of the critical IBM® Z subsystems and identifying important KPIs which you should be actively monitoring on each one. Having access to these KPIs might not necessarily give you the information needed to resolve a problem, but they help you identify problem areas quickly and reduce your MTTR. The subsystems we’ll be looking at are CICS, DB2, IMS, and MQ.
CICS exposes an extensive collection of metrics which show how well it is processing information. We can start with the total transaction count, and the number of failed or abended transactions as a percentage of the total. For individual transactions, we can also examine the dispatch time and the time which elapsed during processing. We should also consider the average size of the dataset.
Our metrics should also include the CPU percentage of utilization, the paging rate, and the response time of the system.
|Transaction Count||Abend Count||Dispatch Time||Elapsed Time|
|I/O Rate||Paging Rate||Working Set Size||CPU Utilization|
|Response Time|
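As a rough illustration of how two of the CICS figures might be derived, the sketch below computes the abend percentage and the average elapsed time for one monitoring interval. The function names and the sample figures are made up for the example and do not correspond to any OMEGAMON or APM Connect interface.

```python
def abend_percentage(transaction_count, abend_count):
    """Abended transactions as a percentage of all transactions."""
    if transaction_count == 0:
        return 0.0  # no transactions ran, so nothing abended
    return 100.0 * abend_count / transaction_count


def mean_elapsed_seconds(elapsed_times):
    """Average elapsed time per transaction, or None with no samples."""
    if not elapsed_times:
        return None
    return sum(elapsed_times) / len(elapsed_times)


# Made-up figures for a single monitoring interval.
rate = abend_percentage(transaction_count=20_000, abend_count=50)
avg = mean_elapsed_seconds([0.5, 0.25, 0.75])
print(f"{rate:.2f}% abended, {avg:.2f}s average elapsed time")
```

Expressing abends as a percentage rather than a raw count keeps the figure comparable between quiet and busy periods.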
Because DB2 is a database service, we need to be concerned with how it is utilizing its CPU, whether or not it is paging, and the rate at which data is moving in and out of the service. To help maintain the integrity of the stored data, database services routinely lock records against editing while they perform updates on those records. The system may fail to release these locks due to system failures or other problems, and this can cause cascading problems throughout the system.
|CPU Utilization||I/O Rate||Paging Rate||Transaction Count|
|Working Set Size||Lock/Latch Wait Time||Lock Suspends||Lock Timeouts|
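One way to watch for lock problems is to compare successive readings of the lock counters and flag any interval in which timeouts increase. The sketch below assumes the counters are cumulative, which may not match how your monitoring tool reports them. The `LockCounters` class, the function names, and the default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockCounters:
    """Lock counters at one point in time (assumed cumulative)."""
    lock_suspends: int
    lock_timeouts: int


def interval_change(previous, current):
    """Difference between two cumulative readings."""
    return LockCounters(
        lock_suspends=current.lock_suspends - previous.lock_suspends,
        lock_timeouts=current.lock_timeouts - previous.lock_timeouts,
    )


def needs_attention(change, max_timeouts=0):
    """Flag the interval if more timeouts occurred than allowed."""
    return change.lock_timeouts > max_timeouts


# Made-up readings taken at the start and end of one interval.
earlier = LockCounters(lock_suspends=120, lock_timeouts=3)
later = LockCounters(lock_suspends=180, lock_timeouts=7)

change = interval_change(earlier, later)
print(change, needs_attention(change))
```

Alerting on the change per interval, rather than on the running total, avoids repeated alarms for timeouts that happened long ago.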
IMS is responsible for managing the flow of data through other systems. While IMS lacks the complexity of systems like CICS and DB2, it’s essential that it functions efficiently as part of a healthy system. We need to monitor the CPU utilization, I/O rate, paging rate, and the size of the data packets it is managing.
|CPU Utilization||I/O Rate||Paging Rate||Working Set Size|
As a messaging system, MQ lacks some of the complexities of other subsystems, but is still vital to the overall health of the system. We need to monitor the CPU utilization, I/O rate, paging rate, and the size of the messages moving through its queues.
|CPU Utilization||I/O Rate||Paging Rate||Working Set Size|
If your organization doesn’t yet have a consolidated dashboard which surfaces the health of your Z systems, your highest priority should be working with stakeholders and individual product owners to put such a dashboard in place. Your IBM® representative is an invaluable resource in determining the best approach to collecting these metrics from your infrastructure.
If you already have such a dashboard, then your next steps are to ensure that the data is being actively used to drive operations decisions within your organization, and to investigate how to automate as much of your monitoring activity as possible.