Managing Node Failures

Node failures are an unfortunate reality of any computer system, and failures in a Scyld ClusterWare cluster are inevitable, though ideally rare. Various strategies and techniques are available to lessen the impact of node failures.

Protecting an Application from Node Failure

There is only one good solution for protecting your application from node failure, and that is checkpointing. With checkpointing, your application periodically writes its progress to disk, and at startup it reads that file so it can resume from the point at which the file was last written.

The checkpointing approach that gives you the highest chance of recovery is to send the data back to the master node and write the checkpoint there, and also to make regular backups of the checkpoint files on the master node.

When setting up checkpointing, think carefully about how often you want to checkpoint. A job with little data to save can checkpoint as often as every 5 minutes, whereas a job with a large data set may be better served by checkpointing every hour, day, week, or even less frequently; it depends a lot on your application. If you have a lot of data to checkpoint, you do not want to do so often, because that will drastically increase your run time. However, if you checkpoint only once every two days, be sure you can live with losing two days' worth of work if there is ever a problem.
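
As an illustration only, a checkpoint-aware job script might look like the following sketch. The application name mysim and its --restart, --checkpoint-file, and --checkpoint-interval options are hypothetical; substitute your application's own checkpoint mechanism and an interval appropriate for your data size.

# Resume from a checkpoint on the master's NFS-mounted /home if one
# exists, otherwise start fresh and checkpoint once per hour.
CKPT=/home/user/mysim.ckpt
if [ -f "$CKPT" ]; then
    mysim --restart "$CKPT"
else
    mysim --checkpoint-file "$CKPT" --checkpoint-interval 3600
fi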

Compute Node Failure

A compute node can fail for any of a variety of reasons, e.g., broken node hardware, a broken network, software bugs, or inadequate hardware resources. A common example of the latter is a condition known as Out Of Memory, or OOM, which occurs when one or more applications on the node have consumed all available RAM and no swap space is available. The Linux kernel detects an OOM condition, attempts to report what is happening to the cluster's syslog server, and begins to kill processes on the node in an attempt to eliminate the process that is triggering the problem. While this kernel response may occasionally be successful, more commonly it will kill one or more processes that are important for proper node behavior (e.g., a job manager daemon, the crucial Scyld bpslave daemon, or even a daemon that is required for the kernel's syslog messages to reach the cluster's syslog server). When that happens, the node may technically remain up, but it is useless and must be rebooted.
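
If you suspect an OOM condition, the cluster's syslog is usually the first place to look. For example, assuming the master node is the cluster's syslog server and logs to the standard /var/log/messages (the log location and message format may differ on your system):

# Search the syslog for kernel OOM-killer activity reported by compute nodes:
grep -i "out of memory\|oom-killer" /var/log/messages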

When Compute Nodes Fail

When a compute node fails, all jobs running on that node will fail. If an MPI job was using that node, the entire job will fail on all the nodes on which the MPI program was running.

Although the jobs running on the failed node have failed, jobs running on other nodes that were not communicating with jobs on the failed node will continue to run without a problem.

If the problem with the node is easily fixed and you want to bring the node back into the cluster, then you can try to reboot it using bpctl -S nodenumber -R. If the compute node has failed in a more catastrophic way, then such a graceful reboot will not work, and you will need to powercycle or manually reset the hardware. When the node returns to the up state, new jobs can be spawned that will use it.
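
For example, to attempt a graceful reboot of node 5 (an arbitrary node number) and then watch for it to return to the up state:

bpctl -S 5 -R
# Monitor the node status listing until node 5 shows "up" again:
watch bpstat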

If you wish to switch out the node for a new physical machine, then you must replace the broken node's MAC addresses with the new machine's MAC addresses. When you boot the new machine, either it appears as a new cluster node that is appended to the end of the list of nodes (if the config file says nodeassign append and there is room for new nodes), or else the new machine's MAC addresses are written to the /var/beowulf/unknown_addresses file. Alternatively, manually edit the config to change the MAC addresses of the broken node to the MAC addresses of the new machine, followed by the command service beowulf reload. Reboot this node, or use IPMI to powercycle it, and the new machine reboots in the correct node order.
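
As a hypothetical example, suppose broken node 5's NIC had MAC address 00:25:90:aa:bb:cc and the replacement machine's NIC is 00:25:90:dd:ee:ff. Edit the corresponding node entry in the config file (typically /etc/beowulf/config), then reload:

# Swap the old MAC address for the new one in the node's entry:
sed -i 's/00:25:90:aa:bb:cc/00:25:90:dd:ee:ff/' /etc/beowulf/config
service beowulf reload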

Compute Node Data

What happens to data on a compute node after the node goes down depends on how you have set up the file system on the node. If you are only using a RAMdisk on your compute nodes, then all data stored on your compute node will be lost when it goes down.

If you are using the hard drive on your compute nodes, there are a few more variables to take into account. If you have your cluster configured to run mke2fs on every compute node boot, then all data that was stored on ext2 file systems on the compute nodes will be destroyed. If mke2fs does not execute, then fsck will try to recover the ext2 file systems; however, there are no guarantees that the file system will be recoverable.

Note that even if fsck is able to recover the file system, there is a possibility that files you were writing to at the moment of node failure may be in a corrupt or unstable state.

Master Node Failure

A master node can fail for the same reasons a compute node can fail, i.e., hardware faults or software faults. An Out-Of-Memory condition is less common on a master node because the master node is typically configured with more physical RAM and more swap space, and it less commonly participates in user application execution than a compute node does. However, in a Scyld ClusterWare cluster the master node plays an important role in the centralized management of the cluster, so the loss of a master node for any reason has more severe consequences than the loss of a single compute node. One common strategy for reducing the impact of a master node failure is to employ multiple master nodes in the cluster. See Managing Multiple Master Nodes for details.

Another moderating strategy is to enable Run-to-Completion. If the bpslave daemon that runs on each compute node detects that its master node has become unresponsive, then the compute node becomes an orphan. What happens next depends upon whether or not the compute nodes have been configured for Run-to-Completion.

When Master Nodes Fail - Without Run-to-Completion

The default behavior of an orphaned bpslave is to initiate a reboot. All currently executing jobs on the compute node will therefore fail. The reboot generates a new DHCP request and a PXEboot. If multiple master nodes are available, then eventually one master node will respond. The compute node reconnects to this master - perhaps the same master that failed and has itself restarted, or perhaps a different master - and the compute node will be available to accept new jobs.

Currently, Scyld only offers Cold Re-parenting of a compute node, in which a compute node must perform a full reboot in order to "fail-over" and reconnect to a master. See Managing Multiple Master Nodes for details.

When Master Nodes Fail - With Run-to-Completion

You can enable Run-to-Completion by enabling the ClusterWare script 85run2complete with the command beochkconfig 85run2complete on. When enabled, if the compute node becomes orphaned because its bpslave daemon has lost contact with its master node's bpmaster daemon, then the compute node does not immediately reboot. Instead, the bpslave daemon keeps the node up and running as best it can without the cooperation of an active master node. In an ideal world, most or all jobs running on that compute node will continue to execute until they complete or until they require some external resource that causes them to hang indefinitely.

Run-to-Completion enjoys greatest success when the private cluster network uses file server(s) that require no involvement of any compute node's active master node. In particular, this means not using the master node as an NFS server, and not using a file server that is accessed using IP-forwarding through the master node. Otherwise, an unresponsive master also means an unresponsive file server, and that circumstance is often fatal to a job. Keep in mind that the default /etc/beowulf/fstab uses $MASTER as the NFS server. You should edit /etc/beowulf/fstab to change $MASTER to the IP address of the dedicated (and hopefully long-lived) non-master NFS server.
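
For example, assuming a dedicated file server at 10.1.1.4 exports /home, the /home entry in /etc/beowulf/fstab might change as follows (the address and mount options shown are illustrative):

# Default: the master node itself serves /home
# $MASTER:/home   /home   nfs   nolock,nonfatal   0 0
# Changed to mount from the dedicated file server instead:
10.1.1.4:/home    /home   nfs   nolock,nonfatal   0 0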

Stopping or restarting the beowulf service, or simply rebooting the compute nodes with bpctl -S all -R, will not put the compute nodes into an orphan state. These actions instruct each compute node to perform an immediate graceful shutdown and to restart with a PXEboot request to its active master node. Similarly, rebooting the master node stops the service with a service beowulf stop as part of the master shutdown, and the compute nodes will immediately reboot and attempt to PXEboot before the master node has fully rebooted and is ready to service them. This will be a problem unless another master node is running on the private cluster network to respond to the PXEboot request, or the nodes' BIOS has been configured to retry the PXEboot perpetually, or you explicitly force all the compute nodes to become orphans prior to rebooting the master with bpctl -S all -O, thereby delaying the nodes' reboots until the master has had time to reboot.
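
For example, to reboot the master node while letting the compute nodes' running jobs continue under Run-to-Completion:

# Explicitly orphan all compute nodes so they do not immediately reboot,
# then reboot the master:
bpctl -S all -O
reboot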

Once a compute node has become orphaned, it can only rejoin the cluster by rebooting, i.e., a so-called Cold Re-parenting. There are two modes that bpslave can employ:

  1. No automatic reboot. The cluster administrator must reboot each orphaned node using IPMI or by manually powercycling the server(s).
  2. Reboot the node after being "effectively idle" for a span of N seconds. This is the default mode. The default N is 300 seconds, and the default "effectively idle" is cpu usage below 1% of one cpu's available cpu cycles.

Edit the 85run2complete script to change the defaults. Alternatively, bpctl can set (or reset) the run-to-completion modes and values. See man bpctl.

The term "effectively idle" means a condition wherein the cpu usage on the compute node is so small as to be interpreted as insignificant, e.g., attributed to various daemons such as bpslave, sendstats, and pbs_mom, which periodically awaken, check fruitlessly for pending work, and quickly go back to sleep. An orphaned node's bpslave periodically computes cpu usage across short time intervals. If the cpu usage is below a threshold percentage P (default 1%) of one cpu's total available cpu cycles, then the node is deemed "effectively idle" across that short time interval. If and when the "effectively idle" condition persists for the full N seconds time span (default 300 seconds), then the node reboots. If the cpu usage exceeds that threshold percentage during any one of those short time intervals, then the time-until-reboot is reset back to the full N seconds.

If the cluster uses TORQUE as a job manager, Run-to-Completion works best if TORQUE is configured for High Availability.

TORQUE high availability is best achieved using multiple master nodes, a shared file server that is separate from the master nodes, ClusterWare Run-to-Completion, and a carefully applied configuration recipe. This provides an environment that supports failover: when a master node is lost, currently executing jobs will probably run to completion, and the orphaned compute nodes can reconnect to a different master node and its cooperating pbs_server.

A critical element of the configuration is using shared storage for the TORQUE pbs_server daemon that executes on each master node and for the pbs_mom daemon that executes on each compute node. For example, on the file server we use one directory for TORQUE job data and another for TORQUE's server_priv private data:

mkdir -p /share/data
mkdir -p /share/torque_server_priv

and the /etc/exports contains entries to share these directories with the cluster:

/share/data *(rw,sync,no_root_squash)
/share/torque_server_priv *(rw,sync,no_root_squash)

This example shares to the world, although you may want to make your access rules more restrictive. Don't forget to reload the new exports with exportfs -r.

On the client side, i.e., the master nodes and compute nodes, these file server directories are mounted as /data and /var/spool/torque/server_priv. Each master node has mountpoints:

mkdir -p /data
mkdir -p /var/spool/torque/server_priv

and the master nodes' /etc/fstab has entries:

fileserver:/share/data /data  nfs  defaults 0 0
fileserver:/share/torque_server_priv /var/spool/torque/server_priv nfs defaults 0 0

where fileserver is the hostname of the file server. The mounts can now be enabled immediately using mount -a.

Each master node's /etc/beowulf/fstab has a similar entry that mounts /data on each compute node as it boots:

10.1.1.4:/share/data   /data   nfs   nolock,nonfatal  0 0

except here we use 10.1.1.4 as the IP address of the file server, rather than the fileserver hostname, because a compute node's name service is not yet available at compute node boot time to translate fileserver into its IP address.

On each master node, make sure the TORQUE script is enabled: beochkconfig 90torque on. Next, verify that all compute nodes are listed in /var/spool/torque/server_priv/nodes. For example, for two nodes, n0 and n1, where each node has four cores and thus you want TORQUE to schedule jobs on all cores, the file contents would be:

n0 np=4
n1 np=4

If /var/spool/torque/server_priv/nodes is empty and you have no queues configured, you can configure a single batch queue as well as the server_priv/nodes file by reconfiguring torque:

# Temporarily start Beowulf services:
service beowulf start
# After all nodes are 'up', reconfigure TORQUE:
service torque reconfigure
# For now, stop Beowulf services while we continue configuring TORQUE:
service beowulf stop

Next, configure the pbs_mom daemons on the compute nodes to use the /bin/cp command to send output files to /data. Edit the file /var/spool/torque/mom_priv/config to include:

$pbsserver      master
$usecp          *:/home /home
$usecp          *:/data /data

Now install the Heartbeat service on all master nodes, if it is not already installed, and enable it to start at master node boot time:

yum install heartbeat
chkconfig heartbeat on

and disable Beowulf and TORQUE from starting at master node boot time, since Heartbeat will manage these services for us:

chkconfig beowulf off
chkconfig torque off

On all masters, configure the Heartbeat service to manage Beowulf and TORQUE services, with the primary master named master0. The file /etc/ha.d/haresources needs to contain a line:

master0 beowulf torque

Start Beowulf and TORQUE by starting Heartbeat on the primary master node master0: service heartbeat start. The Heartbeat daemon reads the /etc/ha.d/haresources file and starts the services in order: first beowulf, then torque. Then run service heartbeat start on the other master nodes, which have also been configured to understand that master0 is the Heartbeat master.
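
For example, assuming master0 is the primary master and master1 is a backup master:

# 1. On the primary master node master0:
service heartbeat start
# 2. Then on each backup master node, such as master1:
service heartbeat start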