Health-Check Plugins
Health-check plugins report more straightforward information about a node's health: information that should not change as frequently. Their results are stored as a sub-field of a node's "status" information and can be viewed with:
scyld-nodectl --all status -L
As with status plugins, scripts-available/health is populated with Penguin-provided scripts on a variety of topics. By default, health data is collected every 300 seconds, but this may be overridden on a node-by-node basis with the attribute _health_secs.
For larger-scale management and control, one can set the _health_plugins and/or _health_secs attributes inside an attribute group and then join nodes to that group.
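As a rough sketch of both approaches, the function below overrides the interval on one node and then applies shared settings through an attribute group. The node names (n0, n[0-3]), the group name health_slow, and the chosen values are hypothetical, and the set, create, and join subcommand syntax is assumed from typical ClusterWare usage rather than taken from this page; check the command reference for your release. The run helper is also an illustration only: it prints each command instead of executing it unless DRY_RUN=0 is set.

```shell
# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

apply_health_settings() {
    # Hypothetical node n0: collect health data every 600 seconds.
    run scyld-nodectl -i n0 set _health_secs=600

    # Hypothetical attribute group carrying shared health settings.
    run scyld-attribctl create name=health_slow
    run scyld-attribctl -i health_slow set _health_secs=900
    run scyld-attribctl -i health_slow set _health_plugins=rasmem

    # Join hypothetical nodes n0..n3 to the group
    # (the node list is quoted to avoid shell globbing).
    run scyld-nodectl -i 'n[0-3]' join health_slow
}
```

Calling apply_health_settings prints the five commands for review before anything is changed on the cluster.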
Building Health Plugins into an Image
As with status plugins, admins can sym-link from scripts-available/health into scripts-enabled/health inside a disk image.
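A minimal sketch of that sym-link step, assuming the image has been unpacked or mounted so that its scripts-available and scripts-enabled directories sit side by side under one base directory, and that plugin files are named after the plugin with a .sh suffix. The function name and its arguments are hypothetical.

```shell
# Enable a provided health plugin inside an unpacked image tree.
#   $1 = directory containing scripts-available/ and scripts-enabled/
#   $2 = plugin name, e.g. "disk"
enable_health_plugin() {
    base=$1
    name=$2
    src="$base/scripts-available/health/$name.sh"
    if [ ! -e "$src" ]; then
        echo "no such plugin: $src" >&2
        return 1
    fi
    mkdir -p "$base/scripts-enabled/health"
    # A relative target keeps the link valid wherever the image is mounted.
    ln -sf "../../scripts-available/health/$name.sh" \
        "$base/scripts-enabled/health/$name.sh"
}
```

The relative link target is a design choice: an absolute path would point outside the image once it boots on a node.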
On-The-Fly Health Plugins
Similar to status plugins, admins can set the attribute _health_plugins to a list of on-the-fly health plugins. For example, if _health_plugins=rasmem, then the system will look for the script in scripts-available/health/rasmem.sh.
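The name-to-script rule above can be illustrated with a small helper. This is only a sketch of the lookup convention, not the system's own code; the function name is hypothetical, and it assumes that multiple plugin names are separated by commas, which the text does not state.

```shell
# Map a _health_plugins value to the script paths that would be looked up.
# Assumption: multiple plugin names are comma-separated.
# The body runs in a subshell so the IFS and globbing changes stay local.
health_plugin_paths() (
    IFS=,
    set -f
    for name in $1; do
        if [ -n "$name" ]; then
            echo "scripts-available/health/$name.sh"
        fi
    done
)
```

For instance, health_plugin_paths rasmem prints scripts-available/health/rasmem.sh, matching the example in the text.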
Available Health Plugins
disk
provides a basic disk-usage check based on a threshold. If the storage on the node is greater than the threshold, the node is considered "unhealthy".
Set the attribute _hc_disk_avail_threshold to define the threshold (this can be done at the node or group level). The value can be “123” (an amount in KB) or “75%” (for percentage-based calculations).
mem
provides a basic "memory health" check based on a threshold. If the current memory used is greater than the threshold, the node is considered "unhealthy".
Set the attribute _hc_mem_avail_threshold to define the threshold (this can be done at the node or group level). The value can be “500” (an amount in KB) or “75%” (a percentage).
pingtest
provides a basic "network health" check based on whether the node can successfully ping one or more servers. Each server is pinged 3 times, and if any of the pings fail, the node is considered "unhealthy". If the servers can be pinged but the average ping time is greater than the threshold, the node is also considered "unhealthy".
Set the attribute _hc_ping_servers to give a comma-separated list of servers to ping (defaults to the parent head node). Set the attribute _hc_ping_msecs to define the average ping-time threshold (default is 5 msec).
rasmem
uses the ras-mc-ctl tool to provide a basic “memory hardware” check.
timesync
provides a basic "time sync" check based on whether the timedatectl tool considers the node to be synchronized to upstream time servers. If the tool reports that it is not synchronized, then the node is considered "unhealthy".
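The threshold rules described for the disk, mem, and pingtest plugins above can be sketched as two small shell functions. These are illustrations of the stated rules, not the shipped plugin code: the function names and argument layout are hypothetical, and the comparison direction simply follows the text ("greater than the threshold" means unhealthy).

```shell
# Return 0 (true) when a measured value exceeds a threshold.
#   $1 = measured amount in KB
#   $2 = total capacity in KB (used only for percentage thresholds)
#   $3 = threshold: a plain KB amount ("500") or a percentage ("75%")
over_threshold() {
    value_kb=$1
    total_kb=$2
    case $3 in
        *%)
            pct=${3%\%}
            # value/total > pct/100, rearranged to stay in integer arithmetic.
            [ $((value_kb * 100)) -gt $((total_kb * pct)) ]
            ;;
        *)
            [ "$value_kb" -gt "$3" ]
            ;;
    esac
}

# pingtest rule: unhealthy if any ping failed or the average is too slow.
#   $1 = number of failed pings
#   $2 = average round-trip time in msec (may be fractional)
#   $3 = limit in msec (5 if omitted, the default given in the text)
ping_unhealthy() {
    [ "$1" -gt 0 ] && return 0
    # awk handles the fractional comparison that plain sh arithmetic cannot.
    awk -v avg="$2" -v max="${3:-5}" 'BEGIN { exit !(avg > max) }'
}
```

With these definitions, over_threshold 800 1000 75% succeeds (80% exceeds 75%), while ping_unhealthy 0 1.3 fails because no ping was lost and 1.3 msec is under the 5 msec default.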
Note
Changes to _health_plugins will likely be detected on the next status cycle, usually every 10 seconds, but will not be processed until the next health-update cycle, usually every 300 seconds unless changed by the admin.