Configuring the Cluster¶

The Scyld ClusterWare configuration is defined by the contents of several flat ASCII files. Most of these files reside in the /etc/beowulf/ directory. Various ClusterWare scripts (which mostly reside in /usr/lib/beoboot/bin), daemons, and commands read (and some occasionally update) these flat files.

The root user can manipulate the configuration manually using a text editor.

Configuring the Cluster Manually¶

This section discusses how to configure a cluster. Penguin Computing strongly recommends that the administrator use Manual editing of configuration files, especially the centerpiece /etc/beowulf/config file, should only be done with care, together with sufficient understanding of the ramifications of the manual manipulations.

Caution

If manual edits are made to the config file for a running cluster, then after saving the file, but sure to execute service beowulf reload, which will immediately send a SIGHUP signal to the bpmaster and beoserv daemons that notifies each to re-read the config file.

Configuration Files¶

/etc/beowulf/config¶

The file /etc/beowulf/config is the principal configuration file for the cluster. The config file is organized using keywords and values, which are used to control most aspects of running the cluster, including the following:

The name, IP address and netmask of the network interface connected to the private cluster network
The network port numbers used by ClusterWare for various services
The IP address range to assign to the compute nodes
The MAC (hardware) address of each identified node accepted into the cluster
The node number and IP address assigned to each hardware address
The default kernel and kernel command line to use when creating a boot file
A list of kernel modules to be available for loading on compute nodes at runtime
A list of shared library directories to cache on the compute nodes
A list of files to prestage on the compute nodes
Compute node filesystem startup policy
The name of the final boot file to send to the compute nodes at boot time
The hostname and hostname aliases of compute nodes
Compute node policies for handling local disks and filesystems, responding to master node failure, etc.

The following sections briefly discuss some key aspects of the configuration file. See the Reference Guide (or man beowulf-config) for details on the specific keywords and values in /etc/beowulf/config.

Setting the IP Address Range¶

The IP address range should be kept to a minimum, as all the cluster utilities will loop through this range. Having a few spare addresses is a good idea to allow for growth in the cluster. However, having a large number of addresses that will never be used will be an unnecessary waste of resources.

Identifying New Nodes¶

When a new node boots, it issues a DHCP request to the network in order to get an IP address assigned to it. The master's beoserv detects these DHCP packets, and its response is dependent upon the current nodeassign policy. With a default append policy, beoserv appends a new node entry to the end of the /etc/beowulf/config file. This new entry identifies the node's MAC address(es), and the relative ordering of the node entry defines the node's number and what IP address is assigned to it. With a manual policy, beoserv appends the new node's MAC address to the file /var/beowulf/unknown_addresses, and then assigns a temporary IP address to the node that is outside the iprange address range and which does not integrate this new node into the cluster. It is expected that the cluster administrator will eventually assign this new MAC address to a cluster node, giving it a node entry with an appropriate position and node number. Upon cluster restart, when the node reboots (after a manual reset or an IPMI powercycle), the node will assume its assigned place in the cluster. With a locked policy, the new node gets ignored completely: no recording of its MAC address, and no IP address assignment.

Assigning Node Numbers and IP Addresses¶

Two config file keywords control the assignment of IP addresses to compute nodes on the private cluster network: nodes and iprange. The nodes keyword specifies the max number of compute nodes, and the iprange specifies the range of IP addresses that are assigned to those compute nodes.

By default and in general practice, node numbers and IP addresses are assigned to the compute nodes in the order that their node entries appear in the config file, beginning with node 0 and the first IP address specified by the iprange entry in the config file. For example, the config file entries:

nodes 8
iprange 10.20.30.100 10.20.30.107
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B

specify a network that contains a maximum of eight nodes, with four nodes currently known, and with an IP address range that falls between the 10.20.30.100 lowerbound and the 10.20.30.107 upperbound. Here the node with MAC address 00:01:02:03:04:1C is node 2 and will be assigned an IP address 10.20.30.102.

ClusterWare treats the upperbound IP address as optional, so all that is necessary to specify is:

nodes 8
iprange 10.20.30.100

and ClusterWare calculates the upperbound IP address. This is especially useful when dealing with large nodes counts, e.g.:

nodes 1357
iprange 10.20.30.100

when it becomes increasingly clumsy for the cluster administrator to accurately calculate the upperbound address.

An optional node number can explicitly specify an override node number:

node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 2 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B

explicitly (and redundantly) specifies the node 2 numbering. Alternatively:

node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 5 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B

explicitly names that node as node 5 with IP address 10.20.30.105, and the next node (with MAC address 00:01:02:03:04:1D will now be node 6 with IP address 10.20.30.106.

In another variation, commenting-out the MAC address(es) leaves a node numbering gap for node 2, and MAC address 00:01:02:03:04:1D continues to be known as node 3:

node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node # 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B

However, if the node with that commented-out MAC address 00:01:02:03:04:1C does attempt to PXE boot, then beoserv assigns a new node number (4) to that physical node and automatically appends a new node entry to the list (assuming the nodeassign policy is append, and assuming the iprange and nodes entries allow room for expansion). This appending results in:

node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node # 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B
node 00:01:02:03:04:1C 00:01:02:03:05:2B

If you want to have beoserv ignore that physical node and keep the remaining nodes numbered without change, then use the keyword off:

node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node off 00:01:02:03:04:1C 00:01:02:03:05:2B
node 00:01:02:03:04:1D 00:01:02:03:05:2B

A node entry can identify itself as a non-Scyld node and can direct beoserv to respond to the node in a variety of ways, including telling the node to boot from a local harddrive, or provisioning the node with specific kernel and initrd images. See Managing Non-Scyld Nodes for details.

Caching Shared Libraries¶

To add a shared library to the list of libraries cached on the compute nodes, specify the pathname of the individual file or the pathname of the entire directory in which the file resides using the libraries directive. An open() syscall on a compute node to open a file thus named, or to open a file that resides in a named directory, will cause that file to be pulled from the master node to the compute node and saved in the local RAM filesystem.

The prestage directive names specific files to be pulled onto each compute node at node boot time. If a file pathname resides in one of the libraries directories, then BProc's filecache functionality pulls the file from the master node. Otherwise, the specified file is pushed from the master to the compute node at startup, with directories created as needed.

Specifying node names and aliases¶

The nodename keyword in the master's /etc/beowulf/config affects the behavior of the ClusterWare NSS. Using the nodename keyword, one may redefine the primary host-name of the cluster, define additional hostname aliases for compute nodes, and define additional hostname (and hostname aliases) for entities loosely associated with the compute node's cluster position.

nodename [name-format] <IPv4 Offset or base> <netgroup>

The presence of the optional IPv4 argument defines if the entry is for "compute nodes" (i.e. the entry will resolve to the 'dot-number' name) or if the entry is for non-cluster entities that are loosely associated with the compute node. In the case where there is an IPv4 argument, the nodename keyword defines an additional hostname name that maps to an IPv4 address loosely associated with the node number. In case where IPv4 argument is present, the nodename keyword defines hostname and hostname aliases for the clustering interface (i.e. the compute nodes). Subsequent nodename entries without an IPv4 argument specify additional hostname aliases for compute nodes. In either case, the format string must contain a conversion specification for node number substitution. The conversion specification is introduced by a '%'. An optional following digit in the range 1..5 specifies a zero-padded minimum field width. The specification is completed with an 'N'. An unspecified or zero field width allows numeric interpretation to match compute node host names. For example, n%N will match n23, n+23, and n0000023. By contrast, n%3N will only match n001 or node023, but not n1 or n23.

Compute node command-line options¶

The kernelcommandline directive is a method of passing various options to the compute node's kernel and to Beowulf on the node. There are a large number of different command line options that you can employ. This section covers some of them.

Some options are interpreted by the kernel on the compute node and ignored by Beowulf:

apic

This option turns on APIC support on the compute node. APIC is the newer of two different mechanisms Intel provides for invoking interrupts. It works better with SMP systems than the older mechanism, called XT-PIC. However, not every motherboard and chipset works correctly with APIC, so this option is disabled by default to avoid problems for those machines that do not support it.

If you find that your cluster nodes kernel panic or crash immediately upon boot, you probably want to turn off APIC by specifying noapic in the command line options. If you have many devices that generate interrupts (such as hard disk controllers, network adapters, etc.) you may want to try turning on APIC to see if there is any performance advantage for your cluster.

panic=<seconds>

This option allows you to specify how many seconds the kernel should wait to reboot after a kernel panic. For example, if you specify panic=60, then the kernel will wait 60 seconds before rebooting. Note that Beowulf automatically adds panic=30 to final boot images.

apm=<action>

This option allows you to specify APM options on the compute node. Acceptable <action> values are on (to turn APM completely on), off (to turn it completely off), debug (to turn on debugging), and power-off (to turn on only the power-off part of APM).

APM is not SMP-safe in the kernel; it will auto-disable itself if turned completely on for an SMP box. However, the power-off part of APM is SMP safe; thus, if you want to be able to power-off SMP boxes, you can do so by specifying apm=power-off. Note that apm=power-off is specified in the default kernelcommandline directive.

console=<device>, <options>

This option is used to select which device(s) to use for console output. For <device> use tty0 for the foreground virtual console, ttyX (e.g., tty1) for any other virtual console, and ttySx (e.g., ttyS0 for a serial port.

For the serial port, <options> defines the baud rate/parity/bits of the port in the format "BBBBPN", where "BBBB" is the speed, "P" is parity (n/o/e), and "N" is bits. The default setting is 9600n8, and the maximum baud rate is 115200. For example, to use the serial port at the maximum baud rate, specify console=ttyS0,115200n8r

Other options are interpreted by Beowulf on the compute node:

rootfs_size=<size>

A compute node employs a RAM-based root filesystem for local non-persistent storage, typically used to contain BProc's filecache libraries and other files, the /tmp directory, and other directories that are not mounted using some variety of global storage (e.g., NFS or Panfs) or on local harddrives. This tmpfs root filesystem consumes physical memory only as needed, which commonly is about 100- to 200-MBytes unless user workloads impose greater demands on (for example) /tmp space. However, by default the rootfs is allowed to grow to consume a maximum of 50% of physical memory, which has the potential of allowing users to consume (perhaps inadvertently) an excessive amount of RAM that would otherwise be available to applications' virtual memory needs.

This 50% default can be overridden by the judicious use of the <size> option, where <size> can be expressed as numeric bytes, megabytes (appending "m" or "M"), or gigabytes (appending "g" or "G"), or as a percentage of total physical memory (appending numeric value and "%"). Examples:

rootfs_size=2048m
rootfs_size=1G
rootfs_size=15%

Note that this override is rarely needed, and it must be utilized with care. An inappropriately constrained root filesystem will cripple the node, just as an inadequate amount of physical memory that is available for virtual memory will trigger Out-Of-Memory failures. The cluster administrator is encouraged to limit user filespace usage in other ways, such as declaring /etc/security/limits.conf limits on the max number of open files and/or the maximum filesize.

rootfs_timeout=<seconds>; getfile_timeout=<seconds>

The beoclient daemon on each compute node manages the early boot process, such as using tftp to read the kernel image and initrd files from the master node's beoserv daemon, and using tcp to read the initial root filesystem image (rootfs) from beoserv. After the node boots, BProc's filecache functionality on the compute node also uses tcp to read files from the master, as needed by applications.

The default timeout for these tcp reads is 30 seconds. If this timeout is too short, then add one of these options to the kernelcommandline to override the default. The option getfile_timeout overrides the timeout for all beoclient tcp read operations. The option rootfs_timeout overrides the timeout only for the tcp read of the root filesystem at node boot time.

syslog_server=<IPaddress>

By default, a compute node forwards its kernel messages and syslog messages back to the master node's syslog or rsyslog service, which then appends these log messages to the master's /var/log/messages file. Alternatively, the cluster administrator may choose to instead forward these compute node log messages to another server by using the syslog_server option to identify the <IPaddress> of that server. This should be an IPv4 address, e.g., syslog_server=10.20.30.2.

Scyld ClusterWare automatically configures the master node's log service to handle incoming log messages from remote compute nodes. However, the cluster administrator must manually configure the alternate syslog server:

For the syslog service (Scyld ClusterWare 4 and 5), edit /etc/sysconfig/syslog on the alternate server to add "-r -x" to the variable SYSLOGD_OPTIONS.

For the rsyslog service (Scyld ClusterWare 6), edit /etc/sysconfig/rsyslog on the alternate server to add "-x" to the variable SYSLOGD_OPTIONS, and edit /etc/rsyslog.conf to un-comment the following lines to expose them, i.e., just as Scyld ClusterWare has done in the master node's /etc/rsyslog.conf file:

$ModLoad imudp.so
$UDPServerRun 514

Finally, restart the service on both the master node and the alternate syslog server before restarting the cluster.

legacy_syslog=<num>

The legacy behavior of the the compute node's syslog handling has been to introduce a date-time string to the message text, then forward the message to the syslog server (typically on the master node), which would add its own date-time string. This redundant timestamp violates the RFC 3164 format standard, and recent ClusterWare releases strips the compute node's timestamp before sending the text to the master server. If for some reason a local cluster administrator wishes to revert to the previous behavior, then add legacy_syslog=1. The default is legacy_syslog=0.

Specifying kernel modules for use on compute nodes¶

Each bootmodule entry identifies a kernel module to be added to the initrd that is passed to each compute node at boot time. These entries typically name possible Ethernet drivers used by nodes supplied by Penguin Computing. If the cluster contains nodes not supplied by Penguin Computing, then the cluster administrator should examine the default list and add new bootmodule entries as needed.

At boot time, Beowulf scans the node's PCI bus to determine what devices are present and what driver is required for each device. If the specified driver is named by a bootmodule entry, then Beowulf loads the module and all its dependencies. However, some needed modules are not found by this PCI scan, e.g., those used to manage specific filesystem types. These modules require adding an additional config file entry: modprobe. For example:

modprobe xfs

Note that each named modprobe module must also be named as a bootmodule.

You may also specify module-specific arguments to be applied at module load time, e.g.,

modarg forcedeth optimization_mode=1

RHEL6 introduced externally visible discrete firmware files that are associated with specific kernel software drivers. When modprobe attempts to load a kernel module that contains such a software driver, and that driver determines that the controller hardware needs one or more specific firmware images (which are commonly found in /lib/firmware), then the kernel first looks at its list of built-in firmware files. If the desired file is not found in that list, then the kernel sends a request to the udevd daemon to locate the file and to pass its contents back to the driver, which then downloads the contents to the controller. This functionality is problematic if the kernel module is an /etc/beowulf/config bootmodule and is an Ethernet driver that is necessary to boot a particular compute node in the cluster. The number of /lib/firmware/ files associated with every possible bootmodule module is too large to embed into the initrd image common to all compute nodes, as that burdens every node with a likely unnecessarily oversized initrd to download. Accordingly, the cluster administrator must determine which specific firmware file(s) are actually required for a particular cluster and are not yet built-in to the kernel, then add firmware directive(s) for those files.

A bootmodule firmware problem exhibits itself as a compute node which does not boot because the needed Ethernet driver cannot be modprobe'd because it cannot load a specific firmware file. After a timeout waiting for udevd to unsuccessfully find the file, the compute node typically reboots - endlessly, as it continues to be unable to load the needed firmware file.

The cluster administrator can use the firmware directive to add specific firmware files to the compute node initrd, as needed. The compute node kernel writes the relevant firmware filename information to its console, e.g. a line of the form:

Failed to load firmware "bnx2/bnx2-mips-06-6.2.1.fw"

Ideally, the administrator gains access to the node's console to see the specific filename, then adds a directive to /etc/beowulf/config:

firmware bnx2/bnx2-mips-06-6.2.1.fw

and rebuilds the initrd:

[root@cluster ~] # service beowulf reload

(Note: reload, not restart)

If the node continues to fail to boot, then the failure is likely due to another missing firmware file. Check the node's console output again, and add the specified file to the firmware directive.

If the cluster administrator cannot easily see the node's console output to determine what firmware files are needed, then if the administrator knows the likely bootmodule module culprit, then the administrator can brute-force every known firmware file for that module using a directive of the form:

firmware bnx2

that names an entire /lib/firmware/ subdirectory. This will likely create a huge initrd that will (if the correct bootmodule module is specified) successfully boot the compute node. The administrator should then examine the node's syslog output, which is typically seen in /var/log/messages, to determine the specific individual firmware filenames that were actually needed, and then the administrator replaces the subdirectory name with the now-known specific firmware filenames. Subsequently, the cluster administrator should contact Penguin Computing Support to inform us what those needed firmware files are, so that we can build-in these files into future kernel images and thus allow the cluster administrator to remove the firmware directives and thus reduce the initrd size, which contains not only the firmware images, but additionally includes various executable binaries and libraries that are only needed for this dynamic udevd functionality.

/etc/beowulf/fdisk¶

The /etc/beowulf/fdisk directory is created by the beofdisk utility when it evaluates local disks on individual compute nodes and creates partition tables for them. For each unique drive geometry discovered among the local disks on the compute nodes, beofdisk creates a file within this directory. The file naming convention is "head;ccc;hhh;sss", where "ccc" is the number of cylinders on the disk, "hhh" is the number of heads, and "sss" is the number of sectors per track.

These files contain the partition table information as read by beofdisk. Normally, these files should not be edited by hand.

You may create separate versions of this directory that end with the node number (for example, /etc/beowulf/fdisk.3). The master's BeoBoot software will look for these directories before using the general /etc/beowulf/fdisk directory.

For more information, see the section on beofdisk in the Reference Guide.

/etc/beowulf/fstab¶

This is the filesystem table for the mount points of the partitions on the compute nodes. It should be familiar to anyone who has dealt with an /etc/fstab file in a standard Linux system, though with a few Scyld ClusterWare extensions. For details, see the Reference Guide or execute man beowulf-fstab.

You may create separate node-specific versions by appending the node number, e.g., /etc/beowulf/fstab.3 for node 3. The master's beoboot node_up script looks first for a node_specific fstab.N file, then if no such file exists will use the default /etc/beowulf/fstab file.

Caution

On compute nodes, NFS directories must be mounted using either the IP address or the $MASTER keyword; the master node's hostname cannot be used. This is because /etc/beowulf/fstab is evaluated before the Scyld ClusterWare name service is initialized, which means hostnames cannot be resolved on a compute node at that point.

/etc/beowulf/backups/¶

This directory contains time-stamped backups of older versions of various configuration files, e.g., /etc/beowulf/config and /etc/beowulf/fstab, to assist in the recovery of a working configuration after an invalid edit.

/etc/beowulf/conf.d/¶

This directory contains various configuration files that are involved when booting a compute node. In particular, the node_up script pushes the master node's /etc/beowulf/conf.d/limits.conf to each compute node as /etc/security/limits.conf, and pushes /etc/beowulf/conf.d/sysctl.conf to each compute node as /etc/sysctl.conf. If /etc/beowulf/conf.d/limits.conf does not exist, then node_up creates an initial file as a concatenation of the master node's /etc/security/limits.conf plus all files in the directory /etc/security/limits.d/. Similarly, node_up creates an initial /etc/beowulf/conf.d/sysctl.conf (if it doesn't already exist) as a copy of the master's /etc/sysctl.conf. The cluster administrator may subsequently modify these initial "best guess" configuration files as needed for compute nodes.

Command Line Tools¶

bpstat¶

The command bpstat can be used to quickly check the status of the cluster nodes and/or see what processes are running on the compute nodes. See the Reference Guide for details on usage.

bpctl¶

To reboot or set the state of a node via the command line, one can use the bpctl command. For example, to reboot node 5:

[root@cluster ~] # bpctl -S 5 -R

As the administrator, you may at some point have reason to prevent other users from running new jobs on a specific node, but you do not want to shut it down. For this purpose we have the unavailable state. When a node is set to unavailable non-root users will be unable to start new jobs on that node, but existing jobs will continue running. To do this, set the state to unavailable using the bpctl command. For example, to set node 5 to unavailable:

[root@cluster ~] # bpctl -S 5 -s unavailable

node_down¶

If you are mounting local filesystems on the compute nodes, you should shut down the node cleanly so that the filesystems on the harddrives stay in a consistent state. The node_down script in /usr/lib/beoboot/bin does exactly this. It takes two arguments; the first is the node number, and the second is the state to which you want the node to go. For example, to cleanly reboot node 5:

[root@cluster ~] # /usr/lib/beoboot/bin/node_down 5 reboot

Alternatively, to cleanly power-off node 5:

[root@cluster ~] # /usr/lib/beoboot/bin/node_down 5 pwroff

The node_down script works by first setting the node's state to unavailable, then remounting the filesystems on the compute node read-only, then calling bpctl to change the node state. This can all be done by hand, but the script saves some keystrokes.

To configure node_down to use IPMI, set the ipmi value in /etc/beowulf/config to enabled as follows:

[root@cluster ~] # beoconfig ipmi enabled

Configuring CPU speed/power for Compute Nodes¶

Modern motherboards and processors support a degree of administrator management of CPU frequency within a range defined by the motherboard's BIOS. Scyld ClusterWare provides the /etc/beowulf/init.d/30cpuspeed script and its associated /etc/beowulf/conf.d/cpuspeed.conf configuration file to implement this management for compute nodes. The local cluster administrator is encouraged to review the cpuspeed.conf config file's section labeled Scaling governor values and potentially adjust the environment variable SCALINGGOV as desired, and then to enable the 30cpuspeed script:

[root@cluster ~] # beochkconfig 30cpuspeed on

The administrator should also ensure that no other cpuspeed or cpupower script is enabled for compute nodes.

In brief, the administrator can choose among four CPU scaling governor settings:

performance, which directs the CPUs to execute at the maximum frequency supported by the motherboard and processor, as specified by the motherboard BIOS.
powersave, which directs the CPUs to execute at the minimum frequency supported by the motherboard and processor.
ondemand, which directs the kernel to adjust the CPU frequency between the minimum and maximum. An idle CPU executes at the minimum. As a load appears, the frequency increases relatively quickly to the maximum, and if and when the load subsides, then the frequency decreases back to the minimum. This is the default setting.
conservative, which similarly directs the kernel to adjust the CPU frequency between the minimum and maximum, albeit making those adjustments with somewhat longer latency than is done for ondemand.

The upside of the performance scaling governor is that applications running on compute nodes always enjoy the maximum CPU frequencies that are supported by the node hardware. The downside is that even idle CPUs consume that same maximum power and thus generate maximum heat. For the scaling governors performance, ondemand, and conservative, a computebound workload drives the CPU frequencies (and power and heat) to the maximum, and thus computebound application performance will exhibit little or no difference among those governors. However, a workload of rapid context switching and frequent idle time will show perhaps 10-20% lower performance for ondemand versus performance, and possibly an even larger decline with conservative. The powersave governor is typically only employed when a need to minimize the cluster power consumption and/or minimize thermal levels outweighs a need to achieve maximum performance.

A broader discussion can be found in the /usr/share/doc/kernel-doc-2.6.32/Documentation/cpu-freq/ documents, e.g., governors.txt. Install the RHEL6 or CentOS6 base distribution's kernel-doc package to access these documents.

Adding New Kernel Modules¶

The modprobe command uses /usr/lib/`uname -r`/modules.dep.bin to determine the pathnames of the specified kernel module and that module's dependencies. The depmod command builds the human-readable modules.dep and the binary module.dep.bin files, and it should be executed on the master node after installing any new kernel module.

Executing modprobe on a compute node requires additional caution. The first use of modprobe retrieves the current modules.dep.bin from the master node using bproc's filecache functionality. Since any subsequent depmod on the master node rebuilds modules.dep.bin, then a subsequent modprobe on a compute node will only see the new modules.dep.bin if that file is copied to the node using bpcp, or if the node is rebooted and thereby silently retrieves the new file.

In general, you should not execute depmod on a compute node, since that command will only see those few kernel modules that have previously been retrieved from the master node, which means the node's newly built modules.dep.bin will only be a sparse subset of the master node's full module.dep.bin. Bproc's filecache functionality will always properly retrieve a kernel module from the master node, as long as the node's module.dep.bin properly specifies the pathname of that module, so the key is to have the node's module.dep.bin be a current copy of the master's file.

Many device drivers are included with Scyld ClusterWare and are supported out-of-the-box for both the master and the compute nodes. If you find that a device, such as your Ethernet adapter, is not supported and a Linux source code driver exists for it, then you will need to build the driver modules for the master.

To do this, you will need to install the RPM of kernel source code (if you haven't already done so). Next, compile the source code using the following extra GNU C Compiler (gcc) options.

-D__BOOT_KERNEL_SMP=1 -D__BOOT_KERNEL_UP=0

The compiled modules must be installed in the appropriate directories under /lib/modules. For example, if you are currently running under the 2.6.9-67.0.4.ELsmp kernel version, the compiled module for an Ethernet driver would be put in the following directory:

/lib/modules/2.6.9-67.0.4.ELsmp/kernel/drivers/net

Any kernel module that is required to boot a compute node, e.g., most commonly the Ethernet driver(s) used by compute nodes, needs special treatment. Edit the config file /etc/beowulf/config to add the name of the driver to the bootmodule list; you can add more bootmodule lines if needed. See Compute Node Boot Options.

Next, you need to configure how the device driver gets loaded. You can set it up so that the device driver only loads if the specific device is found on the compute node. To do this, you need to add the PCI vendor/device ID pair to the PCI table information in the /usr/share/hwdata/pcitable file. You can figure out what these values are by using a combination of lspci and lspci -n.

So that your new kernel module is always loaded on the compute nodes, include the module in the initial RAM disk by adding a modprobe line to /etc/beowulf/config. The line should look like the following:

modprobe <module>

where <module> is the kernel module in question.

Finally, you can regenerate the BeoBoot images by running service beowulf reload. For more details, see Compute Node Boot Options.

Accessing External License Servers¶

To configure the firewall for accessing external license servers, enable ipforward in the /etc/beowulf/config file. The line should read as follows:

ipforward yes

You must then reboot the compute nodes and restart the cluster services. To do so, run the following two commands as root in quick succession:

[root@cluster ~] # bpctl -S all -R
[root@cluster ~] # service beowulf restart

Tip

If IP forwarding is enabled in /etc/beowulf/config but is still not working, then check /etc/sysctl.conf to see if it is disabled.

Check for the line "net.ipv4.ip_forward = 1". If the value is set to 0 (zero) instead of 1, then IP forwarding will be disabled, even if it is enabled in /etc/beowulf/config.

Configuring SSH for Remote Job Execution¶

Most applications that leverage /usr/bin/ssh on compute nodes can be configured to use /usr/bin/rsh. In the event that your application requires SSH access to compute nodes, ClusterWare provides this ability through /etc/beowulf/init.d/81sshd. To start sshd on compute nodes, enable the 81sshd script and reboot your nodes:

[root@cluster ~] # beochkconfig 81sshd on
[root@cluster ~] # bpctl -S all -R

When each node boots, 81sshd starts sshd on the node, and the master's root user will be able to SSH to a compute node without a password, e.g.:

[root@cluster ~] # ssh n0 ls

By default, compute node sshd daemons do not allow for password-based authentication -- only key-based authentication is available -- and only the root user's SSH keys have been configured.

If a non-root user needs SSH access to compute nodes, the user's SSH keys will need to be configured. For example, create a DSA key using ssh-keygen, and hit Enter when prompted for a password if you want password-less authentication:

[user1@cluster ~] $ ssh-keygen -t dsa

Since the master's /home directory is mounted (by default) as /home on the compute nodes, just copy the public key to ~/.ssh/authorized_keys:

[user1@cluster ~] $ cp -a ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys

Now the user can run commands over SSH to any node using shared key authentication:

[user1@cluster ~] $ ssh n0 date

If you wish to modify sshd's settings, you can edit /etc/beowulf/conf.d/sshd_config and then reboot the nodes. Node-specific sshd configuration settings can be saved as /etc/beowulf/conf.d/sshd_config.$NODE.

Client behavior for SSH on the nodes can be adjusted by editing the global /etc/beowulf/conf.d/ssh_config or a node-specific /etc/beowulf/conf.d/ssh_config.$NODE. This SSH client configuration will only be useful when using SSH from node to node. For example:

[user1@cluster ~] $ ssh n0 ssh n1 ls

Note that /etc/beowulf/conf.d/sshd_config and ssh_config only affect SSH behavior on compute nodes. The master's SSH configuration will not be affected.

Interconnects¶

There are many different types of network fabric one can use to interconnect the nodes of your cluster. The least expensive and most common is Fast (100Mbps) and Gigabit (1000Mbps) Ethernet. Other cluster-specific network types, such Infiniband, offer lower latency, higher bandwidth and features such as RDMA (Remote Direct Memory Access).

Ethernet¶

Switching fabric is always the most important (and expensive) part of any interconnected sub-system. Ethernet switches with up to 48 ports are extremely cost effective; however, anything larger becomes expensive quickly. Intelligent switches (those with software monitoring and configuration) can be used effectively to partition sets of nodes into separate clusters using VLANs; this allows nodes to be easily reconfigured between clusters if necessary.

Adding a New Ethernet Driver¶

Drivers for most Ethernet adapters are included with the Linux distribution, and are supported out of the box for both the master and the compute nodes. If you find that your card is not supported, and a Linux source code driver exists for it, you need to compile it against the master's kernel, and then add it to the cluster config file using the bootmodule keyword. See the Reference Guide for a discussion on the cluster config file.

For details on adding new kernel modules, see Adding New Kernel Modules.

Gigabit Ethernet vs. Specialized Cluster Interconnects¶

Surprisingly, the packet latency for Gigabit Ethernet is approximately the same as for Fast Ethernet. In some cases, the latency may even be slightly higher, as the network is tuned for high bandwidth with low system impact utilization. Thus Gigabit Ethernet will not give significant improvement over Fast Ethernet to fine-grained communication-bound parallel applications, where specialized interconnects have a significant performance advantage.

However, Gigabit Ethernet can be very efficient when doing large I/O transfers, which may dominate the overall run-time of a system.

Other Interconnects¶

Infiniband is a new, standardized interconnect for system area networking. While the hardware interface is an industry standard, the details of the hardware device interface are vendor specific and change rapidly. Contact Scyld Customer Support for details on which Infiniband host adapters and switches are currently supported.

With the exception of unique network monitoring tools for each, the administrative and end user interaction is unchanged from the base Scyld ClusterWare system.