Status Plugins
Status plugins are used to report useful information about the compute node back to the server, e.g. CPU load, memory usage, available disk space, etc. This may be used for simple status monitoring to get an overall sense of how loaded the cluster is, for example:
scyld-nodectl --all status -L
And since this data is stored in the ClusterWare database, it can also be used to target actions against groups of nodes:
scyld-nodectl -s "status[ram_free] < 1GB" ls
This selects (-s) all nodes with less than 1GB of free memory and then runs the "ls" command on them (i.e., lists those nodes); any other node command (power on/off, reboot, etc.) could be run instead.
Status data is collected every 10 seconds by default, but this may be overridden on a node-by-node basis with the _status_secs attribute.
For larger-scale management and control, one can set the _status_plugins and/or _status_secs attributes inside an attribute group and then join nodes to that group.
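As a rough sketch of the per-node and per-group overrides described above, the script below prints the commands it would issue rather than running them. The subcommand syntax (set, create name=..., join) is an assumption that this page does not confirm, and the node names n0 and n[0-3] and the group name slowpoll are hypothetical; check scyld-nodectl --help and scyld-attribctl --help before applying anything.

```shell
#!/bin/bash
# Print each command instead of executing it unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Per-node override: report status from node n0 every 30 seconds.
# ASSUMPTION: "set key=value" syntax; n0 is a placeholder node name.
run scyld-nodectl -i n0 set _status_secs=30

# Group-wide override: create a group, set the attribute, join nodes.
# ASSUMPTION: scyld-attribctl syntax and the group name "slowpoll".
run scyld-attribctl create name=slowpoll
run scyld-attribctl -i slowpoll set _status_secs=60
run scyld-nodectl -i "n[0-3]" join slowpoll
```

Running the script as-is only echoes the four commands, which makes it easy to review them before setting APPLY=1.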
Building Status Plugins into an Image
The current approach relies on two directories in the clusterware-node package, which must be installed on all nodes: scripts-available/status and scripts-enabled/status. The scripts-available/status directory is populated with Penguin-provided scripts that report status information in a variety of categories. An admin can copy or symlink those scripts into scripts-enabled/status to permanently add them to the image.
For example, to include the timedatectl plugin in the image:
% scyld-modimg -iNewImage --chroot --upload --overwrite
Downloading and unpacking image f21b65b1……0aafef663
100.0% complete, elapsed: 0:00:02.2 remaining: 0:00:00.0
elapsed: 0:00:06.0
Executing step: Chroot
Dropping into a /bin/bash shell. Exit when done.
[CW:NewImage /]# cd /opt/scyld/clusterware-node/scripts-enabled/status/
[CW:NewImage status]# ln -s ../../scripts-available/status/timedatectl.sh .
[CW:NewImage status]# exit
Upon reboot, any nodes using NewImage will have timedatectl enabled automatically.
Note that scripts added to scripts-enabled/status are permanently enabled and cannot be disabled later without rebuilding and redeploying the disk image.
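Because a mistake in scripts-enabled/status requires rebuilding the image, it can help to rehearse the relative symlink outside the image first. The following is a minimal sketch that recreates the two directories in a throwaway location and checks that the link resolves; the placeholder script body and the temporary directory are illustrative and not part of ClusterWare.

```shell
#!/bin/bash
# Build a scratch copy of the directory layout used by the chroot example.
root=$(mktemp -d)
mkdir -p "$root/scripts-available/status" "$root/scripts-enabled/status"

# Placeholder plugin script; the real one ships in the clusterware-node package.
printf '#!/bin/bash\n' > "$root/scripts-available/status/timedatectl.sh"

# Same ln command as in the chroot session, run from scripts-enabled/status.
(cd "$root/scripts-enabled/status" && \
    ln -s ../../scripts-available/status/timedatectl.sh .)

# Report each enabled plugin and flag any link that does not resolve.
for link in "$root"/scripts-enabled/status/*.sh; do
    if [ -e "$link" ]; then
        echo "enabled: $(basename "$link" .sh)"
    else
        echo "broken link: $link"
    fi
done
```

The same loop can be pointed at /opt/scyld/clusterware-node/scripts-enabled/status inside the chroot to review what the image will enable before exiting.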
Admins can also create their own scripts (see Appendix: Creating New Plugins) and add them to either scripts-available or scripts-enabled. Placing a script in scripts-available allows it to be enabled or disabled on the fly later.
On-The-Fly Plugins
An admin can enable and disable “on-the-fly” plugins by adding them to or removing them from the _status_plugins attribute. If _status_plugins=nvidia, then the system will look for the script at scripts-available/status/nvidia.sh.
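The name-to-script mapping can be illustrated with a small shell helper. This is only a sketch of the naming convention, not the lookup ClusterWare actually performs; the function name plugin_script_path is hypothetical, and the base directory is inferred from the paths in the chroot example earlier on this page.

```shell
#!/bin/bash
# Base directory inferred from the chroot example; adjust if your install differs.
AVAILABLE_DIR=/opt/scyld/clusterware-node/scripts-available/status

# Hypothetical helper: map a plugin name, as it would appear in
# _status_plugins, to the script path that name implies.
plugin_script_path() {
    echo "${AVAILABLE_DIR}/$1.sh"
}

# Check whether the script for a given plugin name is present on this node.
path=$(plugin_script_path nvidia)
if [ -f "$path" ]; then
    echo "found: $path"
else
    echo "missing: $path"
fi
```

Running a check like this on a node before setting the attribute helps catch a misspelled plugin name or a script that was never added to the image.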
Available status plugins
corestatus
provides basic information about the server: uptime, load average, free RAM, current time measurement, loaded modules and kernel command line, OS release, and ssh keys
Note that ram_free is reported in KiB, not bytes, so ram_free=1000 indicates that 1024000 bytes of memory are currently free.
corenetwork
provides basic network information about the server; for each network device: IP address(es) and MAC
selinux
provides basic information on whether SELinux is running and what mode it is in; also reports on FIPS status if that can be determined.
ipmi
provides basic information that it can find through IPMI
virt
provides basic information if the system is running on a virtual machine
chrony
provides basic status on the Chrony (time-sync) daemon. Many queueing systems, as well as ClusterWare itself, require well-synchronized clocks across the cluster, so this status information could be useful in triaging, e.g., Slurm start-up issues.
timedatectl
provides basic status on the time and date system in the kernel. Again, many parts of a cluster need accurate time-sync to work properly.
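Since corestatus reports ram_free in KiB, a quick conversion is handy when comparing it with byte-based figures from other tools. A minimal sketch using shell arithmetic; the helper name kib_to_bytes is hypothetical and not a ClusterWare command.

```shell
#!/bin/bash
# Convert a KiB value (the unit used by ram_free) to bytes.
# 1 KiB = 1024 bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

kib_to_bytes 1000   # prints 1024000
```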
Note
Changes to _status_plugins will be processed on the next status-update cycle, usually every 10 seconds unless changed by the admin.