MPCDF HPC Performance Monitoring System
Introduction
The HPC monitoring daemon (hpcmd) is a lightweight middleware designed to measure performance data on HPC compute nodes, to compute derived metrics, and to write the results to syslog, to standard output, or to a file.
Simple by design, hpcmd is largely written in Python. It queries standard Linux command line tools (e.g., perf, ps) and virtual file systems (/proc, /sys), both of which are available on any HPC system, as well as a few proprietary tools (e.g., opainfo and nvidia-smi) where needed. In addition, hpcmd integrates with the SLURM batch system to gather information such as the job id and the requested number of nodes, processes, and threads, which complements the actual performance data. hpcmd is very lightweight and runs with a low scheduling priority, which keeps its overhead negligible.
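As an illustration of the approach described above, the following minimal sketch derives a CPU utilization metric from two snapshots of /proc/stat and collects job information from the environment. It is not taken from the hpcmd source: the function names (parse_cpu_times, cpu_utilization, slurm_context) are hypothetical, and reading the job context from the SLURM_* environment variables is an assumption about one possible way to obtain it.

```python
import os


def parse_cpu_times(proc_stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # guest and guest_nice are already contained in user and nice,
            # so only the first eight fields are summed.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            total = sum(values[:8])
            return total - idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before_text, after_text):
    """Derived metric: fraction of non-idle CPU time between two snapshots."""
    busy0, total0 = parse_cpu_times(before_text)
    busy1, total1 = parse_cpu_times(after_text)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed > 0 else 0.0


def slurm_context(environ=os.environ):
    """Collect job information from SLURM environment variables (assumed source)."""
    return {
        "job_id": environ.get("SLURM_JOB_ID"),
        "num_nodes": environ.get("SLURM_JOB_NUM_NODES"),
        "ntasks": environ.get("SLURM_NTASKS"),
        "cpus_per_task": environ.get("SLURM_CPUS_PER_TASK"),
    }
```

On a compute node, the two snapshots would be obtained by reading /proc/stat twice with a pause in between. Variables that SLURM does not set for a given job simply come back as None.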
hpcmd can operate in two different modes. In the first mode, it runs as a systemd service on each node of an HPC system and performs measurements at regular, synchronized intervals, based on a system-wide configuration. In the second mode, hpcmd is launched manually with user privileges, and its measurement metrics and sampling frequency can be fully configured by the user. In particular, a user job may suspend the hpcmd instance running in systemd mode on a node and launch its own hpcmd in user mode. Note that the systemd mode is considered the main use case.
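One way to obtain synchronized measurements across nodes is to align each sample to a multiple of the sampling interval on the wall clock, so that all nodes with synchronized clocks sample at the same instants. The sketch below shows this idea only; it is an assumption about how such synchronization could be done, not a description of hpcmd's actual scheduler, and the names next_aligned_tick and run_sampler are hypothetical.

```python
import math
import time


def next_aligned_tick(now, interval):
    """Return the next wall-clock time that is a whole multiple of `interval`."""
    return (math.floor(now / interval) + 1) * interval


def run_sampler(measure, interval, n_samples, clock=time.time, sleep=time.sleep):
    """Call `measure` n_samples times, each at an interval-aligned instant.

    `clock` and `sleep` are injectable so the loop can be tested without waiting.
    """
    results = []
    for _ in range(n_samples):
        delay = next_aligned_tick(clock(), interval) - clock()
        sleep(max(0.0, delay))
        results.append(measure())
    return results
```

With an interval of 600 seconds, for example, every node would sample at wall-clock times that are exact multiples of ten minutes, regardless of when the daemon was started.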
On HPC cluster installations, the syslog messages generated by hpcmd can be collected using rsyslog and fed into a central database for further analysis. We use the Splunk framework for aggregation and visual analysis, although other frameworks can be used as well. We developed several Splunk dashboards for typical application performance analysis tasks. The source code of these dashboards is provided alongside the hpcmd package.
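To make the syslog path concrete, the sketch below formats a set of metrics as a single line of key=value pairs, a layout that log analysis tools such as Splunk can typically split into fields, and hands it to the local syslog daemon through Python's standard syslog module. The message layout, the field names, and the ident string are assumptions made for illustration; the actual hpcmd message format is not described here.

```python
def format_record(fields):
    """Render a dict of metrics as one line of space-separated key=value pairs."""
    parts = []
    for key in sorted(fields):
        value = fields[key]
        # Fixed precision keeps floating-point metrics compact and comparable.
        text = f"{value:.3f}" if isinstance(value, float) else str(value)
        # Quote values that would otherwise break the key=value splitting.
        if " " in text or '"' in text:
            text = '"' + text.replace('"', '\\"') + '"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


def emit(fields, ident="hpcmd-sketch"):
    """Send one formatted record to the local syslog daemon (Unix only)."""
    import syslog  # imported lazily because the module exists only on Unix

    syslog.openlog(ident=ident, facility=syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_INFO, format_record(fields))
```

From there, an rsyslog rule matching the ident could forward the records to the central collector.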
Since 2018, hpcmd has been successfully monitoring two major HPC systems at the MPCDF, with about 160,000 CPU cores in total.
Documentation
Copyright and license
Copyright 2018 – 2019 Max Planck Computing and Data Facility, Giessenbachstrasse 2, 85748 Garching
hpcmd is developed by Luka Stanisic <luka.stanisic@mpcdf.mpg.de> and Klaus Reuter <klaus.reuter@mpcdf.mpg.de>. hpcmd is released under the MIT license.