Telegraf Plugins

Whereas status plugins are small scripts that run during the periodic status-update cycle, Telegraf plugins are small configuration files that the HPC admin can enable or disable. A Telegraf plugin usually targets one particular kind of data, e.g. CPU usage or memory usage.

The clusterware-telegraf package can be installed on either a compute node or a head node, but the on-the-fly plugin system currently works only on compute nodes.

For larger-scale management and control, one can set the _telegraf_plugins attribute inside an attribute group and then join nodes to that group.

Building Telegraf Plugins into an Image

Similar to status plugins, there is another directory:

/opt/scyld/clusterware-telegraf/telegraf-available

that contains Penguin-provided config files. These can be symlinked into ./telegraf-enabled inside a disk image.
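
The symlink step can be sketched as a small shell helper. This is only an illustration: the function name enable_plugin and the CW_TELEGRAF_DIR variable are hypothetical, and it assumes telegraf-enabled sits next to telegraf-available under the same base directory. Pass the file name exactly as it appears in telegraf-available.

```shell
#!/bin/sh
# Sketch only: enable_plugin and CW_TELEGRAF_DIR are hypothetical names,
# not part of ClusterWare. Assumes telegraf-enabled lives beside
# telegraf-available under the same base directory.
CW_TELEGRAF_DIR="${CW_TELEGRAF_DIR:-/opt/scyld/clusterware-telegraf}"

enable_plugin() {
    # $1: file name exactly as it appears in telegraf-available
    src="$CW_TELEGRAF_DIR/telegraf-available/$1"
    dst_dir="$CW_TELEGRAF_DIR/telegraf-enabled"
    if [ ! -e "$src" ]; then
        echo "no such plugin file: $src" >&2
        return 1
    fi
    mkdir -p "$dst_dir"
    # A relative link target stays valid when the image root is
    # unpacked or mounted at a different path.
    ln -sf "../telegraf-available/$1" "$dst_dir/$1"
}

# Example, run against an unpacked image (paths are placeholders):
#   CW_TELEGRAF_DIR=<image root>/opt/scyld/clusterware-telegraf
#   enable_plugin <file name from telegraf-available>
```

The relative link is a design choice rather than a requirement; an absolute link would also work when the command is run from inside the image itself.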

On-The-Fly Telegraf Plugins

On compute nodes, an admin can enable/disable “on-the-fly” plugins by setting or clearing out that node's _telegraf_plugins attribute.

Note

Changes to _telegraf_plugins will force a full restart of the Telegraf daemon, so frequent changes could cause performance degradation.
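
Because every change restarts Telegraf, it helps to apply the whole plugin list in one step. The sketch below only prints the command it would run so that it can be reviewed first. It rests on assumptions that should be checked against your ClusterWare release: that node attributes are set and cleared with scyld-nodectl, and that _telegraf_plugins takes a comma-separated list of plugin names. The helper name plugins_cmd and the node name n0 are hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumptions (verify before use): attributes are managed
# with scyld-nodectl set/clear, and _telegraf_plugins holds a
# comma-separated list. plugins_cmd is a hypothetical helper that
# prints the command instead of running it.
plugins_cmd() {
    node="$1"
    shift
    if [ "$#" -eq 0 ]; then
        # No plugins listed: clear the attribute.
        echo "scyld-nodectl -i $node clear _telegraf_plugins"
    else
        # Join the remaining arguments with commas.
        list=$(IFS=,; echo "$*")
        echo "scyld-nodectl -i $node set _telegraf_plugins=$list"
    fi
}

plugins_cmd n0 cpu mem diskio
plugins_cmd n0
```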

Available Telegraf Plugins

  • amd-rocm-smi

    • provides information from AMD ROCm-based GPUs, including GPU and memory usage.

  • chrony

    • provides information from Chrony on the time synchronization of the system.

  • cpu

    • gives aggregate and per-CPU-core utilization data.

  • cw-attribs

    • injects ClusterWare node attributes and fields into the Telegraf/Influx data stream. This can be helpful for customizing dashboards for different "types" of nodes.

    • By default, all node attributes and fields are sent to Telegraf. This can be modified with the reserved attribute _telegraf_omit_pattern. The pattern is an awk regex, usually of the form (word1|word2); any matching fields are omitted.

  • disk

    • provides disk space utilization (used and free; both inode and capacity utilization).

  • disk_head

    • reports disk utilization, customized for head nodes; includes more disk types in its data.

  • diskio

    • gathers per-device disk I/O rates.

  • hddtemp

    • reports data from hddtemp daemons.

  • infiniband

    • gathers information for all InfiniBand devices and ports on the system.

  • intel-powerstat

    • provides data from Intel's Powerstat features.

  • interrupts

    • reports data from IRQs, including interrupts and soft-interrupts.

  • ipmisensor

    • collects data from the IPMI system. Note that more configuration may be required for this plugin to provide useful data (e.g. full URL, username, password).

  • kernel

    • gathers kernel statistics from /proc/stat.

  • kernel-vmstat

    • gathers kernel statistics from /proc/vmstat.

  • lm-sensors

    • collects sensor information from the lm-sensors package.

  • lustre

    • gathers job-level data on Lustre file system usage. Note that more configuration may be required for this plugin to provide useful data.

  • mem

    • reports memory statistics for the system, including free and used amounts as well as vmalloc, cache, and high-memory areas.

  • net

    • provides aggregate and per-device network information including bytes sent and received, packets sent and received, errors, and more.

  • netresponse

    • reports the network response time to contact the parent head node.

  • netstat

    • gathers TCP metrics from lsof, including established connections, time-wait, and socket counts.

  • nfsclient

    • collects per-mount-point statistics for any NFS file systems (/proc/self/mountstats).

  • nvidia-smi

    • provides information from NVIDIA GPUs, including GPU and memory usage, temperatures, and more.

  • ping

    • records ping times back to the current parent head node.

  • processes

    • reports information on the number of processes on the system that are in different states (zombie, sleeping, running, etc.).

  • rsyslog

    • provides a local listening port which can receive syslog-formatted data (e.g. from rsyslog forwarding).

    • Note that by default ClusterWare does not forward compute-node data. In addition to enabling the rsyslog Telegraf plugin, admins must also configure rsyslog forwarding. ClusterWare includes a "disabled" rsyslog config file, /etc/rsyslog.d/cw_local_telegraf.conf.disabled, which can be enabled by removing .disabled from the filename.

  • swap

    • provides information about swap memory usage.

  • temp

    • collects temperature data from sensors on the system.
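
The _telegraf_omit_pattern behavior described under cw-attribs can be illustrated with plain awk. This is a sketch, not the plugin's actual implementation: it assumes the pattern is matched against each field name, and the helper name filter_fields and the sample field names are made up.

```shell
#!/bin/sh
# Sketch only: filter_fields is a hypothetical helper. It reads one
# field name per line on stdin and drops every name that matches the
# awk regex given as $1. How cw-attribs applies the pattern internally
# is an assumption.
filter_fields() {
    awk -v pat="$1" '$0 !~ pat'
}

# Made-up field names; "mac" and "bmc_password" match the pattern.
printf '%s\n' mac bmc_password rack role | filter_fields '(mac|password)'
# prints:
#   rack
#   role
```

Note that the match is a substring match, so a short word in the pattern will also omit any longer field name that contains it.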

Head Node Functionality

The ClusterWare-telegraf plugin system has reduced functionality on head nodes. Since head nodes do not currently have attributes, there is no way to make on-the-fly changes to the Telegraf plugins. Adding entries to telegraf-enabled is therefore the only way to add plugins to the system.

Additionally, while compute nodes detect changes and start or restart Telegraf automatically, changes to head nodes must be handled manually.

Once the telegraf-enabled directory is ready on the head node, admins should run reconfig-telegraf.sh to push the enabled plugins into production (this will also restart Telegraf).

/opt/scyld/clusterware-telegraf/bin/reconfig-telegraf.sh

Note

Changes to _telegraf_plugins will be processed by ClusterWare on the next status-update cycle, usually every 10 seconds unless changed by the admin. However, it may take several seconds before Telegraf actually restarts, and it then has to go through its own “data refresh” cycle (again, usually every 10 seconds, unless changed by the admin). So there could be a non-trivial delay (30-40 sec) before a new plugin's data is actually visible on a dashboard.
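
The delay described in this note is roughly the sum of its stages. A minimal sketch of that arithmetic follows; the helper name estimate_delay is hypothetical, and the 10-second restart figure is an assumed upper value for "several seconds", not a measured one.

```shell
#!/bin/sh
# Sketch only: estimate_delay is a hypothetical helper that adds up the
# worst-case wait, in seconds, for each stage described in the note.
estimate_delay() {
    status_interval="$1"    # ClusterWare status-update cycle
    restart_time="$2"       # assumed time for Telegraf to restart
    collect_interval="$3"   # Telegraf data refresh cycle
    echo $(( status_interval + restart_time + collect_interval ))
}

# Default 10-second cycles plus an assumed 10-second restart.
estimate_delay 10 10 10   # prints 30
```

Any refresh interval on the dashboard itself would add to this figure.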