Telegraf Plugins
Whereas status plugins are small scripts that run during the periodic status-update cycle, Telegraf plugins are small configuration files that the HPC admin can enable or disable. A Telegraf plugin usually targets one particular kind of data, e.g. CPU usage or memory usage.
The clusterware-telegraf package can be installed on either a compute node or a head node, but the on-the-fly plugin system currently only works on compute nodes.
For larger-scale management and control, one can set the _telegraf_plugins attribute inside an attribute group and then join nodes to that group.
Building Telegraf Plugins into an Image
As with status plugins, there is a directory, /opt/scyld/clusterware-telegraf/telegraf-available, that contains Penguin-provided config files. These can be symlinked into ./telegraf-enabled inside a disk image.
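The symlink step can be sketched in Python as follows. This is a minimal illustration, not a ClusterWare tool: the helper name enable_plugin, the config file name cpu.conf, and the temporary directory standing in for the image's copy of /opt/scyld/clusterware-telegraf are all assumptions made for the example. Check the actual file names in telegraf-available before adapting it.

```python
import os
import tempfile
from pathlib import Path


def enable_plugin(telegraf_dir: Path, config_name: str) -> Path:
    """Symlink telegraf-available/<config_name> into telegraf-enabled/."""
    available = telegraf_dir / "telegraf-available" / config_name
    enabled_dir = telegraf_dir / "telegraf-enabled"
    if not available.is_file():
        raise FileNotFoundError(f"no such plugin config: {available}")
    enabled_dir.mkdir(exist_ok=True)
    link = enabled_dir / config_name
    if not link.is_symlink():
        # Relative target, so the link still resolves when the image is
        # mounted or chrooted at a different path.
        os.symlink(os.path.join("..", "telegraf-available", config_name), link)
    return link


# Demo in a throwaway directory that stands in for
# <image root>/opt/scyld/clusterware-telegraf (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "telegraf-available").mkdir()
    # "cpu.conf" is a made-up file name used only for this demo.
    (base / "telegraf-available" / "cpu.conf").write_text("# placeholder\n")
    link = enable_plugin(base, "cpu.conf")
    print(link.name, "->", os.readlink(link))
```

A relative link target is used here so that the link does not depend on where the image happens to be unpacked; whether that matters depends on how the image is edited at a given site.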
On-The-Fly Telegraf Plugins
On compute nodes, an admin can enable/disable “on-the-fly” plugins by setting or clearing that node's _telegraf_plugins attribute.
Note
Changes to _telegraf_plugins will force a full restart of the Telegraf daemon, so frequent changes could cause performance degradation.
Available Telegraf Plugins
amd-rocm-smi
provides information from AMD ROCm-based GPUs, including GPU and memory usage.
chrony
provides information from Chrony on the time synchronization of the system.
cpu
gives aggregate and per-CPU-core utilization data.
cw-attribs
injects ClusterWare node attributes and fields into the Telegraf/Influx data stream. This can be helpful for customizing dashboards for different "types" of nodes.
By default, all node attributes and fields will be sent to Telegraf. This can be modified with the reserved attribute _telegraf_omit_pattern. The pattern is an awk regex, usually of the form (word1|word2), and any matching fields will be omitted.
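To get a feel for how an omit pattern of that form behaves, the filtering can be approximated in Python. This sketch makes several assumptions: Python's re module stands in for awk (the two agree for a simple alternation like this one), the pattern is assumed to be matched unanchored against the attribute or field name, and the names in node_data are invented for illustration. The plugin's actual matching rules may differ.

```python
import re


def omit_fields(fields: dict, omit_pattern: str) -> dict:
    """Return a copy of fields without entries whose name matches the pattern."""
    if not omit_pattern:
        return dict(fields)
    rx = re.compile(omit_pattern)
    # search() is unanchored, like awk's `name ~ /pattern/`.
    return {k: v for k, v in fields.items() if not rx.search(k)}


# Hypothetical attribute/field names, for illustration only.
node_data = {
    "rack": "r12",
    "gpu_type": "a100",
    "boot_config": "DefaultBoot",
    "owner": "physics",
}
print(omit_fields(node_data, "(boot|owner)"))
```

Because the match is unanchored in this sketch, a short word in the pattern also removes any longer name that contains it (here, boot removes boot_config), which is worth keeping in mind when choosing pattern words.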
disk
provides disk space utilization (used and free; both inode and capacity utilization).
disk_head
disk utilization, customized for head nodes; includes more disk types in its data.
diskio
gathers per-device disk I/O rates.
hddtemp
reports data from hddtemp daemons.
infiniband
gathers information for all Infiniband devices and ports on the system.
intel-powerstat
provides data from Intel's Powerstat features.
interrupts
reports data from IRQs, including interrupts and soft-interrupts.
ipmisensor
collects data from the IPMI system. Note that more configuration may be required for this plugin to provide useful data (e.g. full URL, username, and password).
kernel
gathers kernel statistics from /proc/stat.
kernel-vmstat
gathers kernel statistics from /proc/vmstat.
lm-sensors
collects sensor information from the lm-sensors package.
lustre
gathers job-level data on Lustre file system usage. Note that more configuration may be required for this plugin to provide useful data.
mem
reports memory statistics for the system, including free and used amounts as well as vmalloc, cache, and high-memory areas.
net
provides aggregate and per-device network information including bytes sent and received, packets sent and received, errors, and more.
netresponse
reports the network response time to contact the parent head node.
netstat
gathers TCP metrics from lsof, including established connections, time-wait, and socket counts.
nfsclient
collects per-mount-point statistics for any NFS file systems (/proc/self/mountstats).
nvidia-smi
provides information from NVIDIA GPUs, including GPU and memory usage, temperatures, and more.
ping
records ping times back to the current parent head node.
processes
reports information on the number of processes on the system that are in different states (zombie, sleeping, running, etc.).
rsyslog
provides a local listening port which can receive syslog-formatted data (e.g. from rsyslog forwarding).
Note that by default ClusterWare does not forward compute-node data. In addition to enabling the rsyslog Telegraf plugin, admins must also configure rsyslog forwarding. ClusterWare includes a "disabled" rsyslog config file, /etc/rsyslog.d/cw_local_telegraf.conf.disabled, which can be enabled by removing the .disabled suffix from the filename.
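The rename can be scripted when it has to be applied consistently, for example while preparing an image. The following is a minimal sketch, not a ClusterWare utility: the helper name enable_disabled_config is invented, and the demo works on a placeholder file in a temporary directory rather than on /etc/rsyslog.d.

```python
import tempfile
from pathlib import Path


def enable_disabled_config(path: Path) -> Path:
    """Rename <name>.disabled to <name> and return the new path."""
    if path.suffix != ".disabled":
        raise ValueError(f"{path} does not end in .disabled")
    target = path.with_suffix("")  # strips only the trailing ".disabled"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    path.rename(target)
    return target


# Demo on a placeholder file; on a real node the path would be
# /etc/rsyslog.d/cw_local_telegraf.conf.disabled.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "cw_local_telegraf.conf.disabled"
    src.write_text("# placeholder\n")
    print(enable_disabled_config(src).name)
```

rsyslog generally reads its configuration at startup, so the service will most likely need a restart before the newly enabled file takes effect; follow local procedures for that step.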
swap
provides information about swap memory usage.
temp
collects temperature data from sensors on the system.
Head Node Functionality
The ClusterWare-telegraf plugin system has reduced functionality on head nodes. Since head nodes do not currently have attributes, there is no way to make on-the-fly changes to the Telegraf plugins, so adding entries to telegraf-enabled is the only way to add plugins to the system.
Additionally, while compute nodes automatically detect changes and start or restart Telegraf, changes to head nodes must be handled manually.
Once the telegraf-enabled directory is ready on the head node, admins should run reconfig-telegraf.sh to push the enabled plugins into production (this will also restart Telegraf).
/opt/scyld/clusterware-telegraf/bin/reconfig-telegraf.sh
Note
Changes to _telegraf_plugins will be processed by ClusterWare on the next status-update cycle, usually every 10 seconds unless changed by the admin. However, it may take several seconds before Telegraf actually restarts, and it then has to go through its own “data refresh” cycle (again, usually every 10 seconds, unless changed by the admin). So there could be a non-trivial delay (30-40 seconds) before a new plugin's data is actually visible on a dashboard.