Managing Node Failures¶
Node failures are an unfortunate reality of any computer system, and failures in a Scyld ClusterWare cluster are inevitable and hopefully rare. Various strategies and techniques are available to lessen the impact of node failures.
Protecting an Application from Node Failure¶
There is only one good solution for protecting your application from node failure, and that is checkpointing. With checkpointing, at regular intervals your application writes its progress to disk, and at startup it reads that file so it can resume from the point at which the last checkpoint was written.
The checkpointing approach that gives you the highest chance of recovery is to send the data back to the master node and write the checkpoint there, and also to make regular backups of those files on the master node.
When setting up checkpointing, it is important to think carefully about how often you want to checkpoint. A job with little data to save can checkpoint as often as every 5 minutes, whereas a job with a large data set may be better served by checkpointing every hour, day, week, or even less frequently. It depends heavily on your application. If you have a lot of data to checkpoint, you do not want to do it often, as that will drastically increase your run time. However, you also want to be sure that if you checkpoint only once every two days, you can live with losing two days' worth of work if there is ever a problem.
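As a minimal sketch of copying checkpoints back to the master, assume a hypothetical application that writes its checkpoint to /scratch/app.state on the compute node and that /home is NFS-mounted from the master; all paths and the one-hour interval are illustrative:

    # Illustrative only: every hour, copy the node-local checkpoint back to a
    # master-hosted directory, keeping the previous copy as a backup.
    while true; do
        sleep 3600
        if [ -f /scratch/app.state ]; then
            cp -f /home/user/ckpt/app.state /home/user/ckpt/app.state.prev 2>/dev/null
            cp -f /scratch/app.state /home/user/ckpt/app.state
        fi
    done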
Compute Node Failure¶
A compute node can fail for any of a variety of reasons, e.g., broken
node hardware, a broken network, software bugs, or inadequate hardware
resources. A common example of the latter is a condition known as Out
Of Memory, or OOM, which occurs when one or more applications on the
node have consumed all available RAM memory and no swap space is
available. The Linux kernel detects an OOM condition, attempts to report
what is happening to the cluster’s syslog server, and begins to kill
processes on the node in an attempt to eliminate the process that is
triggering the problem. While this kernel response may occasionally be
successful, more commonly it will kill one or more processes that are
important for proper node behavior (e.g., a job manager daemon, the
crucial Scyld bpslave daemon, or even a daemon that is required for
the kernel’s syslog messages to get communicated to the cluster’s syslog
server). When that happens, the node may still remain up in a
technical sense, but the node is useless and must be rebooted.
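If you suspect a node went down because of an OOM event, the kernel's OOM-killer messages in the cluster's syslog are a quick confirmation. A minimal check, assuming the master collects compute node syslog output in /var/log/messages (the log path and exact message wording vary by distribution):

    grep -i "oom-killer" /var/log/messages
    grep -i "out of memory" /var/log/messages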
When Compute Nodes Fail¶
When a compute node fails, all jobs running on that node will fail. If an MPI job was using that node, the entire job will fail on all the nodes on which the MPI program was running.
Even though the jobs running on the failed node fail, jobs running on other nodes that were not communicating with jobs on the failed node will continue to run without a problem.
If the problem with the node is easily fixed and you want to bring the node back into the cluster, then you can try to reboot it using bpctl -S nodenumber -R. If the compute node has failed in a more catastrophic way, then such a graceful reboot will not work, and you will need to powercycle or manually reset the hardware. When the node returns to the up state, new jobs can be spawned that will use it.
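For example, to request a graceful reboot of node 5 (an arbitrary node number) and then watch its state:

    bpctl -S 5 -R     # ask node 5 to reboot gracefully
    bpstat            # node 5 should cycle from down back to up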
If you wish to switch out the node for a new physical machine, then you must replace the broken node's MAC addresses with the new machine's MAC addresses. When you boot the new machine, it either appears as a new cluster node that is appended to the end of the list of nodes (if the config file says nodeassign append and there is room for new nodes), or else the node's MAC addresses get written to the /var/beowulf/unknown_addresses file. Alternatively, manually edit the config file to change the MAC addresses of the broken node to the MAC addresses of the new machine, followed by the command systemctl reload clusterware. Reboot this node, or use IPMI to powercycle it, and the new machine reboots in the correct node order.
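As a sketch, assuming the cluster config file is /etc/beowulf/config and the broken node is listed there by its MAC address (the addresses shown are placeholders):

    # In /etc/beowulf/config, change the broken node's entry, e.g. from
    #     node 00:25:90:aa:bb:cc
    # to the replacement machine's address
    #     node 00:25:90:dd:ee:ff
    vi /etc/beowulf/config
    systemctl reload clusterware    # have ClusterWare re-read the config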
Compute Node Data¶
What happens to data on a compute node after the node goes down depends on how you have set up the file system on the node. If you are only using a RAMdisk on your compute nodes, then all data stored on your compute node will be lost when it goes down.
If you are using the hard drive on your compute nodes, there are a few more variables to take into account. If you have your cluster configured to run mke2fs on every compute node boot, then all data that was stored on ext2 file systems on the compute nodes will be destroyed. If mke2fs does not execute, then fsck will try to recover the ext2 file systems; however, there is no guarantee that the file systems will be recoverable.
Note that even if fsck is able to recover the file system, files you were writing to at the moment of node failure may still be left in a corrupt or inconsistent state.
Master Node Failure¶
A master node can fail for the same reasons a compute node can fail, i.e., hardware faults or software faults. An Out-Of-Memory condition is more rare on a master node because the master node is typically configured with more physical RAM, more swap space, and is less commonly a participant in user application execution than is a compute node. However, in a Scyld ClusterWare cluster the master node plays an important role in the centralized management of the cluster, so the loss of a master node for any reason has more severe consequences than the loss of a single compute node. One common strategy for reducing the impact of a master node failure is to employ multiple master nodes in the cluster. See Managing Multiple Master Nodes for details.
Another moderating strategy is to enable Run-to-Completion. If the bpslave daemon that runs on each compute node detects that its master node has become unresponsive, then the compute node becomes an orphan. What happens next depends upon whether or not the compute nodes have been configured for Run-to-Completion.
When Master Nodes Fail - Without Run-to-Completion¶
The default behavior of an orphaned bpslave is to initiate a reboot. All currently executing jobs on the compute node will therefore fail. The reboot generates a new DHCP request and a PXEboot. If multiple master nodes are available, then eventually one master node will respond. The compute node reconnects to this master (perhaps the same master that failed and has itself restarted, or perhaps a different master) and the compute node will be available to accept new jobs.
Currently, Scyld only offers Cold Re-parenting of a compute node, in which a compute node must perform a full reboot in order to “fail-over” and reconnect to a master. See Managing Multiple Master Nodes for details.
When Master Nodes Fail - With Run-to-Completion¶
You can enable Run-to-Completion by enabling the ClusterWare script: beochkconfig 85run2complete on. When enabled, if the compute node becomes orphaned because its bpslave daemon has lost contact with its master node's bpmaster daemon, then the compute node does not immediately reboot. Instead, the bpslave daemon keeps the node up and running as best it can without the cooperation of an active master node. In an ideal world, most or all jobs running on that compute node will continue to execute until they complete or until they require some external resource that causes them to hang indefinitely.
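For example, assuming the per-node scripts live in the standard /etc/beowulf/init.d directory:

    beochkconfig 85run2complete on              # enable Run-to-Completion
    ls -l /etc/beowulf/init.d/85run2complete    # the script should now be enabled (executable)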
Run-to-Completion enjoys greatest success when the private cluster network uses file server(s) that require no involvement of any compute node's active master node. In particular, this means not using the master node as an NFS server, and not using a file server that is accessed using IP-forwarding through the master node. Otherwise, an unresponsive master also means an unresponsive file server, and that circumstance is often fatal to a job. Keep in mind that the default /etc/beowulf/fstab uses $MASTER as the NFS server. You should edit /etc/beowulf/fstab to change $MASTER to the IP address of the dedicated (and hopefully long-lived) non-master NFS server.
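For example, a default /home entry that points at $MASTER might be changed to name a dedicated NFS server instead (the IP address and mount options shown here are placeholders):

    # before:  $MASTER:/home        /home  nfs  defaults  0 0
    # after:   192.168.10.50:/home  /home  nfs  defaults  0 0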
Stopping or restarting the clusterware service, or just rebooting the compute nodes with bpctl -S all -R, will not put the compute nodes into an orphan state. These actions instruct each compute node to perform an immediate graceful shutdown and to restart with a PXEboot request to its active master node. Similarly, rebooting the master node will also stop the service with a systemctl stop clusterware as part of the master shutdown, and the compute nodes will immediately reboot and attempt to PXEboot before the master node has fully rebooted and is ready to service the nodes. This will be a problem unless another master node is running on the private cluster network that will respond to the PXEboot request, or unless the nodes' BIOS have been configured to perpetually retry the PXEboot, or unless you explicitly force all the compute nodes to immediately become orphans prior to rebooting the master with bpctl -S all -O, thereby delaying the nodes' reboots until the master has time to reboot.
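For example, to ride out a planned master reboot without losing running Run-to-Completion jobs, orphan the compute nodes first and then reboot the master:

    bpctl -S all -O    # force every compute node into the orphan state now
    reboot             # then reboot the master node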
Once a compute node has become orphaned, it can only rejoin the cluster by rebooting, i.e., a so-called Cold Re-parenting. There are two modes that bpslave can employ:
No automatic reboot. The cluster administrator must reboot each orphaned node using IPMI or by manually powercycling the server(s).
Reboot the node after being “effectively idle” for a span of N seconds. This is the default mode. The default N is 300 seconds, and the default “effectively idle” is cpu usage below 1% of one cpu’s available cpu cycles.
Edit the 85run2complete script to change the defaults. Alternatively, the bpctl command can set (or reset) the run-to-completion modes and values. See man bpctl.
The term “effectively idle” means a condition wherein the cpu usage on the compute node is so small as to be interpreted as insignificant, e.g., attributed to various daemons such as bpslave, sendstats, and pbs_mom, which periodically awaken, check fruitlessly for pending work, and quickly go back to sleep. An orphaned node’s bpslave periodically computes cpu usage across short time intervals. If the cpu usage is below a threshold percentage P (default 1%) of one cpu’s total available cpu cycles, then the node is deemed “effectively idle” across that short time interval. If and when the “effectively idle” condition persists for the full N seconds time span (default 300 seconds), then the node reboots. If the cpu usage exceeds that threshold percentage during any one of those short time intervals, then the time-until-reboot is reset back to the full N seconds.
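The decision can be pictured as a countdown that is reset whenever the node looks busy. The following is an illustrative sketch only, not the actual bpslave implementation; the 10-second sampling interval is an assumption, and for simplicity it measures usage across all cpus combined rather than relative to a single cpu:

    N=300        # seconds of sustained idleness required before rebooting
    P=1          # cpu-usage threshold, in percent
    INTERVAL=10  # length of each sampling interval, in seconds

    cpu_busy_percent() {
        # Sample /proc/stat twice and report the busy percentage in between.
        read -r _ u1 n1 s1 i1 _rest < /proc/stat
        sleep "$INTERVAL"
        read -r _ u2 n2 s2 i2 _rest < /proc/stat
        busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
        idle=$(( i2 - i1 ))
        echo $(( 100 * busy / (busy + idle) ))
    }

    remaining=$N
    while [ "$remaining" -gt 0 ]; do
        if [ "$(cpu_busy_percent)" -lt "$P" ]; then
            remaining=$(( remaining - INTERVAL ))   # still idle: keep counting down
        else
            remaining=$N                            # busy: reset the countdown
        fi
    done
    reboot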
If the cluster uses TORQUE as a job manager, Run-to-Completion works best if TORQUE is configured for High Availability. Refer to PBS TORQUE documentation for details.
RHEL/CentOS 7 has deprecated the earlier Heartbeat software in favor of Corosync.