Health-Check Plugins

Health-check plugins report more straightforward information about a node's health, typically information that is not expected to change as frequently as the data gathered by status plugins. Their results are stored as a sub-field of a node's "status" information and can be viewed with:

scyld-nodectl --all status -L

As with status plugins, scripts-available/health is populated with Penguin-provided scripts on a variety of topics.

By default, health data is collected every 300 seconds; this can be overridden on a node-by-node basis with the _health_secs attribute.

For larger-scale management and control, one can set the _health_plugins and/or _health_secs attributes inside an attribute group and then join nodes to that group.
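As a rough sketch of the group-based approach, the helper below prints (or, with DRY_RUN=0, runs) the commands one might use to create an attribute group, set the health attributes on it, and join nodes to it. The group name hc_group, the node selector n[0-3], the 600-second interval, and the exact scyld-attribctl / scyld-nodectl argument forms are assumptions for illustration; check them against the command help on your system before use.

```shell
# Dry-run by default: print each command instead of executing it.
# Set DRY_RUN=0 to actually run the commands on a head node.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: configure_health_group GROUP NODES SECS PLUGINS
# All four values are caller-supplied; none are product defaults.
configure_health_group() {
    group=$1
    nodes=$2
    secs=$3
    plugins=$4
    # Assumed command forms; verify against your installed tools.
    run scyld-attribctl create "name=$group"
    run scyld-attribctl -i "$group" set "_health_secs=$secs"
    run scyld-attribctl -i "$group" set "_health_plugins=$plugins"
    run scyld-nodectl -i "$nodes" join "$group"
}
```

For example, configure_health_group hc_group 'n[0-3]' 600 rasmem prints the four commands so they can be reviewed before being run for real.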

Building Health Plugins into an Image

As with status plugins, admins can sym-link from scripts-available/health into scripts-enabled/health inside a disk image.
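The sym-link step could look roughly like the following. The function takes the directory that contains scripts-available and scripts-enabled as its first argument, since the full path inside the image is not given here; the function name and the relative-link layout are illustrative choices, not a documented procedure.

```shell
# usage: enable_health_plugin ROOT SCRIPT
#   ROOT   directory containing scripts-available/ and scripts-enabled/
#          (hypothetical parameter; use the real location inside your image)
#   SCRIPT file name of the plugin, e.g. rasmem.sh
enable_health_plugin() {
    root=$1
    script=$2
    if [ ! -f "$root/scripts-available/health/$script" ]; then
        echo "no such health plugin: $script" >&2
        return 1
    fi
    mkdir -p "$root/scripts-enabled/health"
    # A relative link keeps working wherever the image root is mounted.
    ln -sf "../../scripts-available/health/$script" \
        "$root/scripts-enabled/health/$script"
}
```

Run it from within the image (for instance in a chroot of the image), or pass the mounted image path as ROOT.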

On-The-Fly Health Plugins

Similar to status plugins, admins can set the attribute _health_plugins to indicate a list of on-the-fly health plugins.

For example, if _health_plugins=rasmem, the system looks for the script at scripts-available/health/rasmem.sh.
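The name-to-script mapping described above can be sketched as a small helper. It assumes that multiple plugin names are separated by commas and that the caller supplies the base directory holding scripts-available; both are assumptions made for illustration, as is the function name.

```shell
# usage: health_plugin_paths BASE PLUGINS
#   BASE     directory containing scripts-available/ (hypothetical parameter)
#   PLUGINS  value of _health_plugins, assumed here to be comma-separated
# Prints one expected script path per plugin name.
health_plugin_paths() {
    base=$1
    plugins=$2
    old_ifs=$IFS
    IFS=,
    for name in $plugins; do
        printf '%s/scripts-available/health/%s.sh\n' "$base" "$name"
    done
    IFS=$old_ifs
}
```

This only shows where each script would be expected; it does not check that the file exists.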

Available Health Plugins

  • disk

    • provides a basic disk-usage check based on a threshold. If the disk usage on the node is greater than the threshold, the node is considered "unhealthy".

    • Set the attribute _hc_disk_avail_threshold to set the threshold (can be done at the node- or group-level); can be “123” (amount in KB) or “75%” (for percentage-based calculations).

  • mem

    • provides a basic "memory health" check based on a threshold. If the current memory used is greater than the threshold, the node is considered "unhealthy".

    • Set the attribute _hc_mem_avail_threshold to set the threshold (can be done at the node- or group-level); can be “500” (amount in KB) or “75%” (for percentage-based calculations).

  • pingtest

    • provides a basic "network health" check based on whether the node can successfully ping one or more servers. Each server is pinged 3 times, and if any of the pings fail, the node is considered "unhealthy". If the servers can be pinged but the average ping time is greater than the threshold, the node is also considered "unhealthy".

    • Set the attribute _hc_ping_servers to give a comma-separated list of servers to ping (defaults to the parent head node).

    • Set the attribute _hc_ping_msecs to set the threshold for the average ping time (default is 5 msec).

  • rasmem

    • uses the ras-mc-ctl tool to provide a basic “memory hardware” check.

  • timesync

    • provides a basic "time sync" check based on whether the timedatectl tool considers the node to be synchronized to upstream time servers. If the tool reports that it is not synchronized, then the node is considered "unhealthy".
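To make the threshold rules above concrete, here is a minimal sketch of how a plain-KB or percentage threshold (as used by the disk and mem checks) and an average-ping threshold (as used by pingtest) might be evaluated. These are not the Penguin-provided scripts; the function names are hypothetical, and the comparison follows the wording above, where usage greater than the threshold means "unhealthy".

```shell
# usage: exceeds_threshold USED_KB TOTAL_KB THRESHOLD
# THRESHOLD is a plain number (KB) or a percentage such as "75%".
# Returns 0 (true) when usage is greater than the threshold.
exceeds_threshold() {
    used_kb=$1
    total_kb=$2
    threshold=$3
    case $threshold in
        *%)
            pct=${threshold%\%}
            # used/total > pct/100, rewritten to stay in integer math
            [ $((used_kb * 100)) -gt $((total_kb * pct)) ]
            ;;
        *)
            [ "$used_kb" -gt "$threshold" ]
            ;;
    esac
}

# usage: ping_avg_exceeds LIMIT_MSEC TIME_MSEC...
# Returns 0 (true) when the average of the given ping times is
# greater than LIMIT_MSEC. awk handles the fractional values.
ping_avg_exceeds() {
    limit=$1
    shift
    printf '%s\n' "$@" | awk -v limit="$limit" '
        { sum += $1; n++ }
        END { exit !(n > 0 && sum / n > limit) }'
}
```

For instance, exceeds_threshold 800 1000 75% succeeds because 80% usage is above the 75% threshold, while ping_avg_exceeds 5 1.0 2.0 3.0 fails because the 2 msec average is under the 5 msec limit.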

Note

Changes to _health_plugins will likely be detected on the next status cycle, usually every 10 seconds, but will not be processed until the next health-update cycle, usually every 300 seconds unless changed by the admin.