Status Plugins

Status plugins report useful information about a compute node back to the server, such as CPU load, memory usage, and available disk space. This data can be used for simple status monitoring to get an overall sense of how loaded the cluster is, for example:

scyld-nodectl --all status -L

Because this data is stored in the ClusterWare database, it can also be used to target actions against groups of nodes:

scyld-nodectl -s "status[ram_free] < 1GB" ls

This would select (-s) all nodes with less than 1GB of free memory and then call the “ls” command on those nodes (i.e. listing those nodes); any other node command could also be executed here (power on/off, reboot, etc.).

By default, status data is collected every 10 seconds, but this may be overridden on a node-by-node basis with the _status_secs attribute.

For larger-scale management and control, one can set the _status_plugins and/or _status_secs attributes inside an attribute group and then join nodes to that group.
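As a rough sketch of the group-based workflow described above, the following shell helper prints (or, with DRY_RUN=0, runs) three steps: create an attribute group, set _status_secs on it, and join nodes to it. The helper names, the group name slow_status, the 30-second interval, and the node range n[0-3] are all hypothetical. The scyld-attribctl create/set and scyld-nodectl join invocations are assumptions about the ClusterWare CLI that this section does not spell out, so check each tool's --help output before relying on them.

```shell
#!/bin/sh
# Sketch under assumptions: command syntax should be verified with --help.
# Commands are only printed by default; set DRY_RUN=0 to execute them
# on a ClusterWare head node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Create an attribute group, set its status interval, and join nodes to it.
# $1 = group name, $2 = interval in seconds, $3 = node selection
set_group_status_secs() {
    group=$1
    secs=$2
    nodes=$3
    run scyld-attribctl create "name=$group"
    run scyld-attribctl -i "$group" set "_status_secs=$secs"
    run scyld-nodectl -i "$nodes" join "$group"
}

# Example (hypothetical names; quote the range so the shell does not glob it):
#   set_group_status_secs slow_status 30 'n[0-3]'
```

Printing the commands first makes it easy to review what would change before touching a large set of nodes.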

Building Status Plugins into an Image

The current approach relies on two directories provided by the clusterware-node package, which must be installed on all nodes: scripts-available/status and scripts-enabled/status. The scripts-available/status directory is populated with Penguin-provided scripts that report status information in a variety of categories. An admin can copy or symlink those scripts into scripts-enabled/status to permanently add them to the image.

For example, to include the timedatectl plugin in the image:

% scyld-modimg -iNewImage --chroot --upload --overwrite
Downloading and unpacking image f21b65b1……0aafef663
  100.0% complete, elapsed: 0:00:02.2 remaining: 0:00:00.0
  elapsed: 0:00:06.0
Executing step: Chroot
Dropping into a /bin/bash shell. Exit when done.
[CW:NewImage /]# cd /opt/scyld/clusterware-node/scripts-enabled/status/
[CW:NewImage status]# ln -s ../../scripts-available/status/timedatectl.sh .
[CW:NewImage status]# exit

Upon reboot, any nodes using NewImage will have timedatectl enabled automatically.

Note that scripts added to scripts-enabled/status are permanently enabled and cannot be disabled later without rebuilding and redeploying the disk image.
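The chroot steps above can be wrapped in a small helper for repeated use. This is a minimal sketch, not a ClusterWare tool: the function name enable_status_plugin and its image-root argument are hypothetical, and the directory layout is taken from the transcript above. Inside the scyld-modimg chroot, the image root is simply /.

```shell
#!/bin/sh
# Hypothetical helper: permanently enable a status plugin inside an image tree.
# $1 = image root ("/" when run inside the scyld-modimg chroot)
# $2 = plugin name without the .sh suffix, e.g. timedatectl
enable_status_plugin() {
    base="${1%/}/opt/scyld/clusterware-node"
    name=$2
    if [ ! -f "$base/scripts-available/status/$name.sh" ]; then
        printf 'no such plugin: %s\n' "$name" >&2
        return 1
    fi
    # Relative link, matching the manual "ln -s ../../scripts-available/..."
    # step, so it stays valid wherever the image is unpacked.
    ln -sf "../../scripts-available/status/$name.sh" \
        "$base/scripts-enabled/status/$name.sh"
}

# Example, inside the chroot:
#   enable_status_plugin / timedatectl
```

Checking for the script first avoids leaving a dangling symlink in scripts-enabled/status when a plugin name is mistyped.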

Admins can also create their own scripts (see Appendix: Creating New Plugins) and add them to either scripts-available/status or scripts-enabled/status. Placing a script in scripts-available/status allows it to be enabled and disabled on-the-fly later.

On-The-Fly Plugins

An admin can enable and disable “on-the-fly” plugins by adding or removing them from the _status_plugins attribute.

If _status_plugins=nvidia, then the system will look for the script in scripts-available/status/nvidia.sh.
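To make the lookup rule concrete, the sketch below maps a plugin name to the script path described above. The function names are hypothetical, and the /opt/scyld/clusterware-node prefix is assumed from the chroot example earlier in this section. It illustrates the naming convention only, not ClusterWare's actual implementation.

```shell
#!/bin/sh
# Assumed install prefix (taken from the chroot transcript); override for testing.
STATUS_BASE=${STATUS_BASE:-/opt/scyld/clusterware-node}

# Print the script path that a _status_plugins entry would correspond to.
plugin_script_path() {
    printf '%s/scripts-available/status/%s.sh\n' "$STATUS_BASE" "$1"
}

# Succeed only if that script exists on this node.
plugin_is_available() {
    [ -f "$(plugin_script_path "$1")" ]
}
```

With the default prefix, plugin_script_path nvidia prints /opt/scyld/clusterware-node/scripts-available/status/nvidia.sh. On the head node, the attribute itself would presumably be set with something like scyld-nodectl -i n0 set _status_plugins=nvidia, where n0 is a placeholder node name and the exact syntax should be confirmed against the scyld-nodectl documentation.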

Available Status Plugins

  • corestatus

    • provides basic information about the server: uptime, load average, free RAM, current time measurement, loaded modules and kernel command line, OS release, and ssh keys

    • Note that ram_free is reported in KiB, not bytes, so ram_free=1000 indicates that 1024000 bytes of memory are currently free

  • corenetwork

    • provides basic network information about the server; for each network device: IP address(es) and MAC

  • selinux

    • provides basic information on whether SELinux is running and what mode it is in; also reports FIPS status if it can be determined

  • ipmi

    • provides basic information that it can find through IPMI

  • virt

    • provides basic information if the system is running on a virtual machine

  • chrony

    • provides basic status on the Chrony (time-sync) daemon. Many queueing systems, as well as ClusterWare itself, require well-synchronized clocks across the cluster, so this status information could be useful in triaging, e.g., Slurm start-up issues.

  • timedatectl

    • provides basic status on the time and date system in the kernel. Again, many parts of a cluster need accurate time-sync to work properly.
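Because ram_free (reported by corestatus, above) is in KiB rather than bytes, a quick conversion helps when reading raw values. This is a minimal sketch; the function name kib_to_bytes is purely illustrative.

```shell
#!/bin/sh
# Convert a KiB count (as reported in ram_free) to bytes: 1 KiB = 1024 bytes.
kib_to_bytes() {
    echo $(( $1 * 1024 ))
}

kib_to_bytes 1000   # prints 1024000, matching the corestatus note
```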

Note

Changes to _status_plugins are processed on the next status-update cycle, which by default means within 10 seconds, or within the interval set by _status_secs if the admin has changed it.