MPCDF HPC Performance Monitoring System

_images/hpcmd_graph_v2.png

Fig. 1 Schematic showing the architecture of the MPCDF HPC monitoring system. The hpcmd middleware and various Splunk analysis dashboards were written by the authors and build upon existing rsyslog and Splunk infrastructure. Automatic analysis using machine learning techniques is currently under development.

Introduction

The HPC monitoring daemon (hpcmd) is a lightweight middleware designed to measure performance data on HPC compute nodes, to compute derived metrics, and to write the results to syslog, to standard output, or to a file.
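To illustrate the flow described above (raw measurement, derived metric, output to syslog, standard output, or a file), here is a minimal sketch. It is not taken from the hpcmd source: the function names, the key=value message layout, the logger name, and the sample values are all assumptions made for illustration.

```python
import logging
import logging.handlers
import sys


def format_record(metrics):
    """Render a dict of metrics as one 'key=value' line (hypothetical layout)."""
    return " ".join(f"{key}={value}" for key, value in sorted(metrics.items()))


def make_logger(target, path=None):
    """Return a logger writing to syslog, standard output, or a file."""
    logger = logging.getLogger(f"hpcmd_sketch.{target}")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        if target == "syslog":
            # Assumes a local syslog socket at /dev/log (typical on Linux).
            handler = logging.handlers.SysLogHandler(address="/dev/log")
        elif target == "file":
            handler = logging.FileHandler(path)
        else:
            handler = logging.StreamHandler(sys.stdout)
        logger.addHandler(handler)
    return logger


# Derived metric: GFLOP/s from a raw counter value and the sampling interval.
# The numbers below are made-up sample values, not real measurements.
raw = {"flops": 3.2e11, "interval_s": 10}
metrics = {
    "jobid": 4711,
    "gflops": round(raw["flops"] / raw["interval_s"] / 1e9, 2),
}
line = format_record(metrics)
make_logger("stdout").info(line)
```

Switching the target to "syslog" or "file" changes only the handler; the record itself stays the same, which keeps downstream parsing independent of the output channel.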

Simple by design, hpcmd is largely written in Python. It queries standard Linux command-line tools (e.g., perf, ps) and virtual file systems (/proc, /sys), which are available on any HPC system, and, if needed, a few proprietary tools (e.g., opainfo and nvidia-smi). In addition, hpcmd integrates with the SLURM batch system to gather information such as the job ID and the requested number of nodes, processes, and threads, which complements the actual performance data. hpcmd is very lightweight and runs at a low scheduling priority, making its overhead negligible.
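As a rough sketch of what querying a virtual file system and deriving a metric can look like, the following parses the aggregate cpu line of /proc/stat and computes CPU utilization between two samples. The helper names are hypothetical and this is not hpcmd's actual implementation. The SLURM helper reads environment variables that SLURM sets inside a job; the source does not say how hpcmd itself obtains job information, and a daemon running as a system service would need a different mechanism.

```python
import os


def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Columns: user nice system idle iowait irq softirq steal.
            # Guest columns are skipped; they are already counted in user/nice.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return idle, sum(values)
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before, after):
    """Derived metric: fraction of non-idle time between two samples."""
    idle = after[0] - before[0]
    total = after[1] - before[1]
    return 0.0 if total <= 0 else 1.0 - idle / total


def slurm_context(env=os.environ):
    """Collect job information from SLURM environment variables, if set."""
    names = (
        "SLURM_JOB_ID",
        "SLURM_JOB_NUM_NODES",
        "SLURM_NTASKS",
        "SLURM_CPUS_PER_TASK",
    )
    return {name: env[name] for name in names if name in env}


# Two made-up snapshots in /proc/stat format, standing in for real reads.
first = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0"
second = "cpu  300 0 100 1500 100 0 0 0 0 0\ncpu0 300 0 100 1500 100 0 0 0 0 0"
utilization = cpu_utilization(parse_cpu_times(first), parse_cpu_times(second))
```

On a Linux node, the snapshots would instead come from reading /proc/stat twice, one sampling interval apart.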

hpcmd can operate in two different modes. In the first mode, it runs as a systemd service on each node of an HPC system and performs measurements at regular, synchronized intervals, based on a system-wide configuration. In the second mode, hpcmd is launched manually with user privileges, and its measurement metrics and sampling frequency can be fully configured by the user. In particular, a user job may suspend the hpcmd instance running in systemd mode on a node and launch its own hpcmd in user mode. Note that the systemd mode is considered the main use case.
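One simple way to obtain regular, synchronized sampling across nodes is to align each measurement to wall-clock multiples of the interval. The sketch below shows that idea only; it assumes the node clocks are synchronized (for example via NTP), and the function names and the scheduling approach are illustrative assumptions, not a description of how hpcmd schedules its measurements.

```python
import math
import time


def next_sample_time(now, interval):
    """Return the next wall-clock time that is an integer multiple of interval."""
    return (math.floor(now / interval) + 1) * interval


def run(sample, interval, iterations, clock=time.time, sleep=time.sleep):
    """Call sample(timestamp) at interval boundaries, iterations times.

    clock and sleep are injectable so the loop can be exercised without waiting.
    """
    for _ in range(iterations):
        target = next_sample_time(clock(), interval)
        # Sleep until the boundary; never pass a negative duration.
        sleep(max(0.0, target - clock()))
        sample(target)
```

Because every node computes the same boundaries from its own clock, samples from different nodes carry matching timestamps and can be joined per interval during later analysis.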

On HPC cluster installations, the syslog messages generated by hpcmd can be collected using rsyslog and fed into a central database for further analysis. We use the Splunk framework for aggregation and visual analysis, although other frameworks can be used as well. We developed several Splunk dashboards for typical application performance analysis tasks. The source code of these dashboards is provided in addition to the hpcmd package.

Since 2018, hpcmd has been monitoring two major HPC systems at the MPCDF with about 160,000 CPU cores in total.
