Managing Node Failures¶
Node failures are an unfortunate reality of any computer system, and failures in a Scyld ClusterWare cluster are inevitable but hopefully rare. Various strategies and techniques are available to lessen the impact of node failures.
Protecting an Application from Node Failure¶
There is only one good solution for protecting your application from node failure, and that is checkpointing. With checkpointing, your application periodically writes its progress so far to disk; at startup, it reads that file and resumes from the point at which it last wrote the file.
The checkpointing approach that gives you the highest chance of recovery is to send the data back to the master node and write the checkpoint there, and also to make regular backups of your files on the master node.
When setting up checkpointing, think carefully about how often you want to checkpoint. A job with little data to save can checkpoint as often as every 5 minutes, whereas a job with a large data set might be better served by checkpointing every hour, day, week, or longer. It depends a lot on your application. If you have a lot of data to checkpoint, you don't want to do it often, as that will drastically increase your run time. However, if you only checkpoint once every two days, make sure you can live with losing two days worth of work if there is ever a problem.
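As a minimal sketch of this idea (the file name checkpoint.dat and the 10-step work loop are hypothetical, not part of ClusterWare), a long-running job can record each completed step and resume from it after a restart:

```shell
#!/bin/sh
# Hypothetical checkpoint/restart sketch: a 10-step job that resumes
# from wherever a previous run stopped.
CKPT=checkpoint.dat

# At startup, read the last completed step from the checkpoint file, if any.
if [ -f "$CKPT" ]; then
    last=$(cat "$CKPT")
else
    last=0
fi

step=$((last + 1))
while [ "$step" -le 10 ]; do
    echo "working on step $step"   # stand-in for the real computation
    echo "$step" > "$CKPT"         # checkpoint: record the completed step
    step=$((step + 1))
done
```

If the node fails after step 6 has been checkpointed, a rerun resumes at step 7 instead of repeating steps 1 through 6.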
Compute Node Failure¶
A compute node can fail for any of a variety of reasons, e.g., broken
node hardware, a broken network, software bugs, or inadequate hardware
resources. A common example of the latter is a condition known as Out
Of Memory, or OOM, which occurs when one or more applications on the
node have consumed all available RAM and no swap space is
available. The Linux kernel detects an OOM condition, attempts to report
what is happening to the cluster's syslog server, and begins to kill
processes on the node in an attempt to eliminate the process that is
triggering the problem. While this kernel response may occasionally be
successful, more commonly it will kill one or more processes that are
important for proper node behavior (e.g., a job manager daemon, the
bpslave daemon, or even a daemon that is required for
the kernel's syslog messages to get communicated to the cluster's syslog
server). When that happens, the node may still remain up in a
technical sense, but the node is useless and must be rebooted.
When Compute Nodes Fail¶
When a compute node fails, all jobs running on that node will fail. If there was an MPI job running that was using that node, the entire job will fail on all the nodes on which the MPI program was running.
Although the jobs running on that node failed, jobs running on other nodes that weren't communicating with jobs on the failed node will continue to run without a problem.
If the problem with the node is easily fixed and you want to bring the
node back into the cluster, then you can try to reboot it using
bpctl -S nodenumber -R. If the compute node has failed in a
more catastrophic way, then such a graceful reboot will not work, and
you will need to powercycle or manually reset the hardware. When the
node returns to the up state, new jobs can be spawned that will use
that node.
If you wish to switch out the node for a new physical machine, then you
must replace the broken node's MAC addresses with the new machine's MAC
addresses. When you boot the new machine, it either appears as a new
cluster node that is appended to the end of the list of nodes (if the
config file says nodeassign append and there is room for new
nodes), or else the node's MAC addresses get written to the
/var/beowulf/unknown_addresses file. Alternatively, manually edit the
config file to change the MAC addresses of the broken node to the MAC
addresses of the new machine, followed by the command
service beowulf reload. Reboot this
node, or use IPMI to powercycle it, and the new machine reboots in the
correct node order.
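For illustration, assuming the config file identifies each node with a node entry listing its MAC addresses (the entry format and the addresses below are hypothetical examples, so check your own config file's syntax), the edit would replace the broken machine's addresses with the new machine's:

```
# Before: the broken machine's MAC addresses
node 00:25:90:AA:BB:01 00:25:90:AA:BB:02
# After: the replacement machine's MAC addresses
node 00:1E:67:CC:DD:01 00:1E:67:CC:DD:02
```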
Compute Node Data¶
What happens to data on a compute node after the node goes down depends on how you have set up the file system on the node. If you are only using a RAMdisk on your compute nodes, then all data stored on your compute node will be lost when it goes down.
If you are using the harddrive on your compute nodes, there are a few
more variables to take into account. If you have your cluster configured
to run mke2fs on every compute node boot, then all data that was in the
ext2 file systems on the compute nodes will be destroyed. If
mke2fs does not execute, then
fsck will try to recover the
ext2 file systems; however, there are no guarantees that the file
system will be recoverable.
Note that even if
fsck is able to recover the file system, there is
a possibility that files you were writing to at the moment of node
failure may be in a corrupt or unstable state.
Master Node Failure¶
A master node can fail for the same reasons a compute node can fail, i.e., hardware faults or software faults. An Out-Of-Memory condition is rarer on a master node because the master node is typically configured with more physical RAM and more swap space, and it less commonly participates in user application execution than a compute node does. However, in a Scyld ClusterWare cluster the master node plays an important role in the centralized management of the cluster, so the loss of a master node for any reason has more severe consequences than the loss of a single compute node. One common strategy for reducing the impact of a master node failure is to employ multiple master nodes in the cluster. See Managing Multiple Master Nodes for details.
Another moderating strategy is to enable Run-to-Completion. If the
bpslave daemon that runs on each compute node detects that its
master node has become unresponsive, then the compute node becomes an
orphan. What happens next depends upon whether or not the compute
nodes have been configured for Run-to-Completion.
When Master Nodes Fail - Without Run-to-Completion¶
The default behavior of an orphaned
bpslave is to initiate a reboot.
All currently executing jobs on the compute node will therefore fail.
The reboot generates a new DHCP request and a PXEboot. If multiple
master nodes are available, then eventually one master node will
respond. The compute node reconnects to this master - perhaps the same
master that failed and has itself restarted, or perhaps a different
master - and the compute node will be available to accept new jobs.
Currently, Scyld only offers Cold Re-parenting of a compute node, in which a compute node must perform a full reboot in order to "fail-over" and reconnect to a master. See Managing Multiple Master Nodes for details.
When Master Nodes Fail - With Run-to-Completion¶
You can enable Run-to-Completion by turning on the ClusterWare script with
beochkconfig 85run2complete on. When enabled, if the compute node
becomes orphaned because its
bpslave daemon has lost contact with
its master node's
bpmaster daemon, then the compute node does not
immediately reboot. Instead, the
bpslave daemon keeps the node up
and running as best it can without the cooperation of an active master
node. In an ideal world, most or all jobs running on that compute node
will continue to execute until they complete or until they require some
external resource that causes them to hang indefinitely.
Run-to-Completion enjoys greatest success when the private cluster
network uses file server(s) that require no involvement of any compute
node's active master node. In particular, this means not using the
master node as an NFS server, and not using a file server that is
accessed using IP-forwarding through the master node. Otherwise, an
unresponsive master also means an unresponsive file server, and that
circumstance is often fatal to a job. Keep in mind that the default
/etc/beowulf/fstab uses $MASTER as the NFS server. You should edit
/etc/beowulf/fstab to change $MASTER to the IP address of the
dedicated (and hopefully long-lived) non-master NFS server.
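For example, an entry of the following form in /etc/beowulf/fstab (10.1.1.4 is a hypothetical address for the dedicated NFS server, and the mount options follow the NFS example later in this section) removes the dependency on the master node:

```
# Before: the master node itself is the NFS server
$MASTER:/home /home nfs nolock,nonfatal 0 0
# After: a dedicated non-master NFS server at a fixed IP address
10.1.1.4:/home /home nfs nolock,nonfatal 0 0
```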
Stopping or restarting the beowulf service, or just rebooting the
compute nodes by executing
bpctl -S all -R, will not put the compute nodes
into an orphan state. These actions instruct each compute node to
perform an immediate graceful shutdown and to restart with a PXEboot
request to its active master node. Similarly, rebooting the master node
will also stop the service with a
service beowulf stop as part of the master's shutdown sequence,
and the compute nodes will immediately reboot and attempt to PXEboot
before the master node has fully rebooted and is thus ready to service the
nodes. This will be a problem unless another master node is running on
the private cluster network that will respond to the PXEboot request, or
unless the nodes' BIOS have been configured to perpetually retry the
PXEboot, or unless you explicitly force all the compute nodes to
immediately become orphans prior to rebooting the master with
bpctl -S all -O, thereby delaying the nodes' reboots until the
master has time to reboot.
Once a compute node has become orphaned, it can only rejoin the cluster
by rebooting, i.e., a so-called Cold Re-parenting. There are two modes
bpslave can employ:
- No automatic reboot. The cluster administrator must reboot each orphaned node using IPMI or by manually powercycling the server(s).
- Reboot the node after being "effectively idle" for a span of N seconds. This is the default mode. The default N is 300 seconds, and the default "effectively idle" is cpu usage below 1% of one cpu's available cpu cycles.
Edit the 85run2complete script to change the defaults. The
bpctl command can also set (or reset) the run-to-completion
modes and values.
The term "effectively idle" means a condition wherein the cpu usage on
the compute node is so small as to be interpreted as insignificant,
e.g., attributed to various daemons such as
pbs_mom, which periodically awaken, check fruitlessly for
pending work, and quickly go back to sleep. An orphaned node's
bpslave periodically computes cpu usage across short time intervals.
If the cpu usage is below a threshold percentage P (default 1%) of one
cpu's total available cpu cycles, then the node is deemed "effectively
idle" across that short time interval. If and when the "effectively
idle" condition persists for the full N seconds time span (default 300
seconds), then the node reboots. If the cpu usage exceeds that threshold
percentage during any one of those short time intervals, then the
time-until-reboot is reset back to the full N seconds.
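The countdown described above can be modeled with a short sketch (a simplified illustration, not the actual bpslave implementation; the threshold P, span N, and 60-second sample interval are parameters):

```shell
#!/bin/sh
# Simplified model of the orphaned node's reboot countdown; not actual
# bpslave code. Each argument is the cpu usage (percent of one cpu)
# measured during one 60-second sample interval.
decide() {
    P=1            # threshold percentage of one cpu
    N=300          # required "effectively idle" span, in seconds
    INTERVAL=60    # length of one sample interval, in seconds
    idle_for=0
    for usage in "$@"; do
        if [ "$usage" -lt "$P" ]; then
            idle_for=$((idle_for + INTERVAL))
        else
            idle_for=0                 # any busy interval resets the countdown
        fi
        if [ "$idle_for" -ge "$N" ]; then
            echo "reboot"
            return
        fi
    done
    echo "keep running"
}

decide 0 0 0 0 0       # five idle intervals in a row: prints "reboot"
decide 0 0 9 0 0 0 0   # a busy interval resets the timer: prints "keep running"
```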
If the cluster uses TORQUE as a job manager, Run-to-Completion works best if TORQUE is configured for High Availability.
TORQUE high availability is best achieved using multiple master nodes, a
shared file server that is separate from the master nodes, ClusterWare
Run-to-Completion, and a carefully applied configuration recipe. This
provides an environment that supports Failover: where the loss of a
master node allows for the probable completion of currently executing
jobs, and the reconnecting of the orphaned compute nodes to a different
master node with its cooperating pbs_server daemon.
A critical element of the configuration is using shared storage for the
pbs_server daemon that executes on each master node and for the
pbs_mom daemon that executes on each compute node. For example,
on the file server we use one directory for TORQUE job data and another
for TORQUE's server_priv private data:
mkdir -p /share/data
mkdir -p /share/torque_server_priv
The /etc/exports file contains entries to share these directories
with the cluster:
/share/data *(rw,sync,no_root_squash)
/share/torque_server_priv *(rw,sync,no_root_squash)
This example shares to the world, although you may want to make your
access rules more restrictive. Don't forget to reload the new exports,
e.g., with exportfs -ra.
On the client side, i.e., the master nodes and compute nodes, these file
server directories are mounted as /data and
/var/spool/torque/server_priv. Each master node has mountpoints:
mkdir -p /data
mkdir -p /var/spool/torque/server_priv
and the master nodes'
/etc/fstab has entries:
fileserver:/share/data /data nfs defaults 0 0
fileserver:/share/torque_server_priv /var/spool/torque/server_priv nfs defaults 0 0
where fileserver is the hostname of the file server. The mounts can
now be enabled immediately using mount -a.
Each master node's
/etc/beowulf/fstab has a similar entry that mounts
/data on each compute node as it boots:
10.1.1.4:/share/data /data nfs nolock,nonfatal 0 0
except here we use
10.1.1.4 as the IP address of the file server,
rather than the fileserver hostname, because a compute node's name
service isn't available at compute node boot time to translate
fileserver into its IP address.
On each master node, make sure the TORQUE script is enabled:
beochkconfig 90torque on. Next, verify that all compute nodes are
listed in /var/spool/torque/server_priv/nodes. For example, for two
nodes, n0 and n1, where each node has four cores and thus you want
TORQUE to schedule jobs on all cores, the file contents would be:
n0 np=4
n1 np=4
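For more than a handful of nodes, this file can be generated rather than typed by hand. A sketch (the gen_nodes helper, node count, and core count are hypothetical; redirect the output into /var/spool/torque/server_priv/nodes on your cluster):

```shell
#!/bin/sh
# Emit one "nX np=CORES" line per node for TORQUE's server_priv/nodes file.
# gen_nodes is a hypothetical helper; the counts below are examples only.
gen_nodes() {
    nodes=$1    # number of compute nodes
    cores=$2    # cores per node
    i=0
    while [ "$i" -lt "$nodes" ]; do
        echo "n$i np=$cores"
        i=$((i + 1))
    done
}

gen_nodes 2 4   # prints "n0 np=4" and "n1 np=4"
```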
If /var/spool/torque/server_priv/nodes is empty and you have no
queues configured, you can configure a single batch queue as well as the
server_priv/nodes file by reconfiguring TORQUE:
# Temporarily start Beowulf services:
service beowulf start
# After all nodes are 'up':
service torque reconfigure
# For now, stop Beowulf services while we continue configuring TORQUE:
service beowulf stop
Next, configure the
pbs_mom daemons on the compute nodes to use the
/bin/cp command to send output files to
/data. Edit the file
/var/spool/torque/mom_priv/config to include:
$pbsserver master
$usecp *:/home /home
$usecp *:/data /data
Now install the Heartbeat service on all master nodes, if it is not already installed, and enable it to start at master node boot time:
yum install heartbeat
chkconfig heartbeat on
and disable Beowulf and TORQUE from starting at master node boot time, since Heartbeat will manage these services for us:
chkconfig beowulf off
chkconfig torque off
On all masters, configure the Heartbeat service to manage Beowulf and
TORQUE services, with the primary master named master0. The file
/etc/ha.d/haresources needs to contain a line:
master0 beowulf torque
Start Beowulf and TORQUE by starting Heartbeat on the primary master
node with
service heartbeat start. The Heartbeat daemon reads the
/etc/ha.d/haresources file and starts services in order: first
beowulf, then torque. Then execute
service heartbeat start on the
other master nodes, which have also been configured to understand that
master0 is the Heartbeat master.