Configuring the Cluster¶
The Scyld ClusterWare configuration is defined by the contents of several flat ASCII
files. Most of these files reside in the /etc/beowulf/
directory.
Various ClusterWare scripts (which mostly reside in
/usr/lib/beoboot/bin
), daemons, and commands read (and some
occasionally update) these flat files.
The root user can manipulate the configuration manually using a text editor.
Configuring the Cluster Manually¶
This section discusses how to configure a cluster by hand. Penguin Computing
strongly recommends caution: manual editing of
configuration files, especially the centerpiece /etc/beowulf/config
file, should only be done with care, together with a sufficient
understanding of the ramifications of the manual changes.
Caution
If manual edits are made to the config file for a running cluster, then after saving the file, be sure to execute systemctl reload clusterware, which will immediately send a SIGHUP signal to the bpmaster and beoserv daemons, notifying each to re-read the config file.
Configuration Files¶
/etc/beowulf/config¶
The file /etc/beowulf/config
is the principal configuration file for
the cluster. The config
file is organized using keywords and values,
which are used to control most aspects of running the cluster, including
the following:
The name, IP address and netmask of the network interface connected to the private cluster network
The network port numbers used by ClusterWare for various services
The IP address range to assign to the compute nodes
The MAC (hardware) address of each identified node accepted into the cluster
The node number and IP address assigned to each hardware address
The default kernel and kernel command line to use when creating a boot file
A list of kernel modules to be available for loading on compute nodes at runtime
A list of shared library directories to cache on the compute nodes
A list of files to prestage on the compute nodes
Compute node filesystem startup policy
The name of the final boot file to send to the compute nodes at boot time
The hostname and hostname aliases of compute nodes
Compute node policies for handling local disks and filesystems, responding to master node failure, etc.
The following sections briefly discuss some key aspects of the
configuration file. See the Reference Guide (or man beowulf-config
) for
details on the specific keywords and values in /etc/beowulf/config
.
Setting the IP Address Range¶
The IP address range should be kept to a minimum, as all the cluster utilities will loop through this range. Having a few spare addresses is a good idea to allow for growth in the cluster. However, having a large number of addresses that will never be used will be an unnecessary waste of resources.
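For example (a sketch with illustrative addresses only), a cluster expected to hold 32 compute nodes might reserve a few spare addresses for growth using the nodes and iprange keywords described below:
nodes 36
iprange 10.20.30.100 10.20.30.135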
Identifying New Nodes¶
When a new node boots, it issues a DHCP request to the network in order
to get an IP address assigned to it. The master’s beoserv
detects
these DHCP packets, and its response is dependent upon the current
nodeassign policy. With a default append policy, beoserv
appends
a new node entry to the end of the /etc/beowulf/config
file. This
new entry identifies the node’s MAC address(es), and the relative
ordering of the node entry defines the node’s number and what IP
address is assigned to it. With a manual policy, beoserv
appends
the new node’s MAC address to the file
/var/beowulf/unknown_addresses
, and then assigns the node a temporary IP address outside the iprange address range; this temporary address does not integrate the new node into the cluster. It is expected
that the cluster administrator will eventually assign this new MAC
address to a cluster node, giving it a node entry with an appropriate
position and node number. Upon cluster restart, when the node reboots
(after a manual reset or an IPMI powercycle), the node will assume its
assigned place in the cluster. With a locked policy, the new node gets
ignored completely: no recording of its MAC address, and no IP address
assignment.
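The policy itself is selected with the nodeassign keyword in /etc/beowulf/config. As an illustrative sketch (see man beowulf-config for the exact syntax), the default behavior corresponds to an entry such as:
nodeassign append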
Assigning Node Numbers and IP Addresses¶
Two config
file keywords control the assignment of IP addresses to
compute nodes on the private cluster network: nodes and iprange. The
nodes keyword specifies the max number of compute nodes, and the
iprange specifies the range of IP addresses that are assigned to those
compute nodes.
By default and in general practice, node numbers and IP addresses are assigned to the compute nodes in the order that their node entries appear in the config file, beginning with node 0 and the first IP address specified by the iprange entry in the config file. For example, the config file entries:
nodes 8
iprange 10.20.30.100 10.20.30.107
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
specify a network that contains a maximum of eight nodes, with four
nodes currently known, and with an IP address range that falls between
the 10.20.30.100 lowerbound and the 10.20.30.107 upperbound. Here the
node with MAC address 00:01:02:03:04:1C
is node 2 and will be
assigned an IP address 10.20.30.102.
ClusterWare treats the upperbound IP address as optional, so all that is necessary to specify is:
nodes 8
iprange 10.20.30.100
and ClusterWare calculates the upperbound IP address. This is especially useful when dealing with large node counts, e.g.:
nodes 1357
iprange 10.20.30.100
where it becomes increasingly clumsy for the cluster administrator to calculate the upperbound address accurately.
An optional node number argument can explicitly override the default node numbering:
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 2 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
explicitly (and redundantly) specifies the node 2 numbering. Alternatively:
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node 5 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
explicitly names that node as node 5 with IP address 10.20.30.105, and
the next node (with MAC address 00:01:02:03:04:1D) will now be node 6
with IP address 10.20.30.106.
In another variation, commenting-out the MAC address(es) leaves a node
numbering gap for node 2, and MAC address 00:01:02:03:04:1D
continues to be known as node 3:
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node # 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
However, if the node with that commented-out MAC address
00:01:02:03:04:1C
does attempt to PXE boot, then beoserv
assigns
a new node number (4) to that physical node and automatically appends a
new node entry to the list (assuming the nodeassign policy is
append, and assuming the iprange and nodes entries allow room for
expansion). This appending results in:
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node # 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
node 00:01:02:03:04:1C 00:01:02:03:05:2C
If you want to have beoserv
ignore that physical node and keep the
remaining nodes numbered without change, then use the keyword off:
node 00:01:02:03:04:1A 00:01:02:03:05:2A
node 00:01:02:03:04:1B 00:01:02:03:05:2B
node off 00:01:02:03:04:1C 00:01:02:03:05:2C
node 00:01:02:03:04:1D 00:01:02:03:05:2D
A node entry can identify itself as a non-Scyld node and can direct
beoserv
to respond to the node in a variety of ways, including
telling the node to boot from a local harddrive, or provisioning the
node with specific kernel and initrd images.
See Managing Non-Scyld Nodes for details.
Specifying node names and aliases¶
The nodename keyword in the master’s /etc/beowulf/config
affects
the behavior of the ClusterWare NSS. Using the nodename keyword, one
may redefine the primary host-name of the cluster, define additional
hostname aliases for compute nodes, and define additional hostname (and
hostname aliases) for entities loosely associated with the compute
node’s cluster position.
nodename [name-format] <IPv4 Offset or base> <netgroup>
The presence or absence of the optional IPv4 argument determines whether the entry is for
the compute nodes themselves (i.e., the entry will resolve to the ‘dot-number’ name)
or for non-cluster entities that are loosely associated
with the compute node. When an IPv4 argument is present,
the nodename keyword defines an additional hostname that maps to
an IPv4 address loosely associated with the node number. When no
IPv4 argument is present, the nodename keyword defines the hostname and
hostname aliases for the clustering interface (i.e., the compute nodes).
Subsequent nodename entries without an IPv4 argument specify
additional hostname aliases for compute nodes. In either case, the
format string must contain a conversion specification for node number
substitution. The conversion specification is introduced by a ‘%’. An
optional following digit in the range 1..5 specifies a zero-padded
minimum field width. The specification is completed with an ‘N’. An
unspecified or zero field width allows numeric interpretation to match
compute node host names. For example, n%N will match n23, n023,
and n0000023. By contrast, n%3N will only match n001 or n023,
but not n1 or n23.
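As an illustrative sketch (the alias format chosen here is hypothetical), a nodename entry without an IPv4 argument that adds a zero-padded alias for each compute node might look like:
nodename node%3N
With such an entry, node 23 would also be resolvable by the hostname node023.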
Compute node command-line options¶
The kernelcommandline directive is a method of passing various options to the compute node’s kernel and to Beowulf on the node. There are a large number of different command line options that you can employ. This section covers some of them.
Some options are interpreted by the kernel on the compute node and ignored by Beowulf:
- apic
This option turns on APIC support on the compute node. APIC is the newer of two different mechanisms Intel provides for invoking interrupts. It works better with SMP systems than the older mechanism, called XT-PIC. However, not every motherboard and chipset works correctly with APIC, so this option is disabled by default to avoid problems for those machines that do not support it.
If you find that your cluster nodes kernel panic or crash immediately upon boot, you probably want to turn off APIC by specifying noapic in the command line options. If you have many devices that generate interrupts (such as hard disk controllers, network adapters, etc.) you may want to try turning on APIC to see if there is any performance advantage for your cluster.
- panic=<seconds>
This option allows you to specify how many seconds the kernel should wait to reboot after a kernel panic. For example, if you specify panic=60, then the kernel will wait 60 seconds before rebooting. Note that Beowulf automatically adds panic=30 to final boot images.
- apm=<action>
This option allows you to specify APM options on the compute node. Acceptable <action> values are on (to turn APM completely on), off (to turn it completely off), debug (to turn on debugging), and power-off (to turn on only the power-off part of APM).
APM is not SMP-safe in the kernel; it will auto-disable itself if turned completely on for an SMP box. However, the power-off part of APM is SMP safe; thus, if you want to be able to power-off SMP boxes, you can do so by specifying apm=power-off. Note that apm=power-off is specified in the default kernelcommandline directive.
- console=<device>, <options>
This option is used to select which device(s) to use for console output. For <device> use tty0 for the foreground virtual console, ttyX (e.g., tty1) for any other virtual console, and ttySx (e.g., ttyS0) for a serial port.
For the serial port, <options> defines the baud rate/parity/bits of the port in the format “BBBBPN”, where “BBBB” is the speed, “P” is parity (n/o/e), and “N” is bits. The default setting is 9600n8, and the maximum baud rate is 115200. For example, to use the serial port at the maximum baud rate, specify console=ttyS0,115200n8r.
Other options are interpreted by Beowulf on the compute node:
- rootfs_size=<size>
A compute node employs a RAM-based root filesystem for local non-persistent storage, typically used to contain BProc’s filecache libraries and other files, the
/tmp
directory, and other directories that are not mounted using some variety of global storage (e.g., NFS or PanFS) or on local harddrives. This tmpfs root filesystem consumes physical memory only as needed, which commonly is about 100 to 200 MBytes unless user workloads impose greater demands on (for example) /tmp
space. However, by default the rootfs is allowed to grow to consume a maximum of 50% of physical memory, which has the potential of allowing users to consume (perhaps inadvertently) an excessive amount of RAM that would otherwise be available to applications’ virtual memory needs. This 50% default can be overridden by the judicious use of the <size> option, where <size> can be expressed as numeric bytes, megabytes (appending “m” or “M”), gigabytes (appending “g” or “G”), or as a percentage of total physical memory (appending a numeric value and “%”). Examples:
rootfs_size=2048m rootfs_size=1G rootfs_size=15%
Note that this override is rarely needed, and it must be utilized with care. An inappropriately constrained root filesystem will cripple the node, just as an inadequate amount of physical memory that is available for virtual memory will trigger Out-Of-Memory failures. The cluster administrator is encouraged to limit user filespace usage in other ways, such as declaring
/etc/security/limits.conf
limits on the max number of open files and/or the maximum filesize.
- rootfs_timeout=<seconds>; getfile_timeout=<seconds>
The
beoclient
daemon on each compute node manages the early boot process, such as using tftp to read the kernel image and initrd files from the master node’s beoserv
daemon, and using tcp to read the initial root filesystem image (rootfs) from beoserv. After the node boots, BProc’s filecache functionality on the compute node also uses tcp to read files from the master, as needed by applications. The default timeout for these tcp reads is 30 seconds. If this timeout is too short, then add one of these options to the kernelcommandline to override the default. The option getfile_timeout overrides the timeout for all beoclient tcp read operations. The option rootfs_timeout overrides the timeout only for the tcp read of the root filesystem at node boot time.
- syslog_server=<IPaddress>
By default, a compute node forwards its kernel messages and syslog messages back to the master node’s
syslog
or rsyslog
service, which then appends these log messages to the master’s /var/log/messages
file. Alternatively, the cluster administrator may choose to instead forward these compute node log messages to another server by using the syslog_server option to identify the <IPaddress> of that server. This should be an IPv4 address, e.g., syslog_server=10.20.30.2. Scyld ClusterWare automatically configures the master node’s log service to handle incoming log messages from remote compute nodes. However, the cluster administrator must manually configure the alternate syslog server:
For the syslog service (Scyld ClusterWare 4 and 5), edit /etc/sysconfig/syslog on the alternate server to add “-r -x” to the variable SYSLOGD_OPTIONS.
For the rsyslog service (Scyld ClusterWare 6), edit /etc/sysconfig/rsyslog on the alternate server to add “-x” to the variable SYSLOGD_OPTIONS, and edit /etc/rsyslog.conf to un-comment the following lines, i.e., just as Scyld ClusterWare has done in the master node’s /etc/rsyslog.conf file:
$ModLoad imudp.so
$UDPServerRun 514
Finally, restart the service on both the master node and the alternate syslog server before restarting the cluster.
- legacy_syslog=<num>
The legacy behavior of the compute node’s syslog handling has been to prepend a date-time string to the message text, then forward the message to the syslog server (typically on the master node), which would add its own date-time string. This redundant timestamp violates the RFC 3164 format standard, so recent ClusterWare releases strip the compute node’s timestamp before sending the text to the master server. If for some reason a local cluster administrator wishes to revert to the previous behavior, then add legacy_syslog=1. The default is legacy_syslog=0.
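Pulling several of these options together, an example kernelcommandline entry in /etc/beowulf/config might look like the following (the combination shown is only illustrative, not a recommended default):
kernelcommandline apm=power-off console=ttyS0,115200n8 rootfs_size=25% syslog_server=10.20.30.2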
Specifying kernel modules for use on compute nodes¶
Each bootmodule entry identifies a kernel module to be added to the initrd that is passed to each compute node at boot time. These entries typically name possible Ethernet drivers used by nodes supplied by Penguin Computing. If the cluster contains nodes not supplied by Penguin Computing, then the cluster administrator should examine the default list and add new bootmodule entries as needed.
At boot time, Beowulf scans the node’s PCI bus to determine what devices
are present and what driver is required for each device. If the
specified driver is named by a bootmodule entry, then Beowulf loads
the module and all its dependencies. However, some needed modules are
not found by this PCI scan, e.g., those used to manage specific
filesystem types. These modules require adding an additional config
file entry: modprobe. For example:
modprobe xfs
Note that each named modprobe module must also be named as a bootmodule.
You may also specify module-specific arguments to be applied at module load time, e.g.,
modarg forcedeth optimization_mode=1
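For instance, to make the xfs filesystem module loadable on compute nodes alongside an Ethernet driver that takes a module argument, the relevant config entries could be combined as follows (a sketch; the specific modules are only examples):
bootmodule forcedeth
bootmodule xfs
modprobe xfs
modarg forcedeth optimization_mode=1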
RHEL7 introduced externally visible discrete firmware files that
are associated with specific kernel software drivers. When modprobe
attempts to load a kernel module that contains such a software driver,
and that driver determines that the controller hardware needs one or
more specific firmware images (which are commonly found in
/lib/firmware
), then the kernel first looks at its list of built-in
firmware files. If the desired file is not found in that list, then the
kernel sends a request to the udevd
daemon to locate the file and to
pass its contents back to the driver, which then downloads the contents
to the controller. This functionality is problematic if the kernel
module is an /etc/beowulf/config
bootmodule and is an Ethernet
driver that is necessary to boot a particular compute node in the
cluster. The number of /lib/firmware/
files associated with every
possible bootmodule module is too large to embed into the initrd
image common to all compute nodes, as that burdens every node with a
likely unnecessarily oversized initrd
to download. Accordingly, the
cluster administrator must determine which specific firmware file(s) are
actually required for a particular cluster and are not yet built-in to
the kernel, then add firmware directive(s) for those files.
A bootmodule firmware problem exhibits itself as a compute node which
does not boot because the needed Ethernet driver cannot be
modprobe
’d because it cannot load a specific firmware file. After udevd times out without finding the file, the compute node typically reboots, endlessly, as it continues to be unable to load the needed firmware file.
The cluster administrator can use the firmware directive to add
specific firmware files to the compute node initrd
, as needed. The
compute node kernel writes the relevant firmware filename information to
its console, e.g. a line of the form:
Failed to load firmware "bnx2/bnx2-mips-06-6.2.1.fw"
Ideally, the administrator gains access to the node’s console to see the
specific filename, then adds a directive to /etc/beowulf/config
:
firmware bnx2/bnx2-mips-06-6.2.1.fw
and rebuilds the initrd:
[root@cluster ~] # systemctl reload clusterware
(Note: reload, not restart)
If the node continues to fail to boot, then the failure is likely due to another missing firmware file. Check the node’s console output again, and add the specified file to the firmware directive.
If the cluster administrator cannot easily see the node’s console output to determine which firmware files are needed, but does know the likely bootmodule module culprit, then the administrator can brute-force the inclusion of every known firmware file for that module using a directive of the form:
firmware bnx2
that names an entire /lib/firmware/
subdirectory. This will likely
create a huge initrd
that will (if the correct bootmodule module
is specified) successfully boot the compute node. The administrator
should then examine the node’s syslog output, which is typically seen in
/var/log/messages
, to determine the specific individual firmware
filenames that were actually needed, and then the administrator replaces
the subdirectory name with the now-known specific firmware filenames.
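One simple way to collect those filenames after the node boots with the oversized initrd is to search the node’s forwarded log messages on the master, assuming they land in the default /var/log/messages:
[root@cluster ~] # grep -i firmware /var/log/messages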
Subsequently, the cluster administrator should contact Penguin Computing
Support to report which firmware files were needed, so that we
can build these files into future kernel images, allowing the
cluster administrator to remove the firmware directives and thereby
reduce the initrd
size, which contains not only the firmware images,
but also various executable binaries and libraries that
are only needed for this dynamic udevd
functionality.
/etc/beowulf/fdisk¶
The /etc/beowulf/fdisk
directory is created by the beofdisk
utility when it evaluates local disks on individual compute nodes and
creates partition tables for them. For each unique drive geometry
discovered among the local disks on the compute nodes, beofdisk
creates a file within this directory. The file naming convention is
“head;ccc;hhh;sss”, where “ccc” is the number of cylinders on the disk,
“hhh” is the number of heads, and “sss” is the number of sectors per
track.
These files contain the partition table information as read by
beofdisk
. Normally, these files should not be edited by hand.
You may create separate versions of this directory that end with the
node number (for example, /etc/beowulf/fdisk.3
). The master’s
BeoBoot
software will look for these directories before using the
general /etc/beowulf/fdisk
directory.
For more information, see the section on beofdisk
in the
Reference Guide.
/etc/beowulf/fstab¶
This is the filesystem table for the mount points of the partitions on
the compute nodes. It should be familiar to anyone who has dealt with an
/etc/fstab
file in a standard Linux system, though with a few Scyld ClusterWare
extensions. For details, see the Reference Guide or execute
man beowulf-fstab
.
You may create separate node-specific versions by appending the node
number, e.g., /etc/beowulf/fstab.3
for node 3. The master’s beoboot
node_up
script looks first for a node-specific fstab.N
file, then if no such file exists will use the default
/etc/beowulf/fstab
file.
Caution
On compute nodes, NFS directories must be mounted using either the IP address or the $MASTER keyword; the master node’s hostname cannot be used. This is because
/etc/beowulf/fstab
is evaluated before the Scyld ClusterWare name service is initialized, which means hostnames cannot be resolved on a compute node at that point.
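For example, an NFS mount of the master’s /home directory using the $MASTER keyword might appear in /etc/beowulf/fstab as the following line (the mount options shown are illustrative):
$MASTER:/home  /home  nfs  defaults  0 0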
/etc/beowulf/backups/¶
This directory contains time-stamped backups of older versions of
various configuration files, e.g., /etc/beowulf/config
and
/etc/beowulf/fstab
, to assist in the recovery of a working
configuration after an invalid edit.
/etc/beowulf/conf.d/¶
This directory contains various configuration files that are involved
when booting a compute node. In particular, the node_up
script
pushes the master node’s /etc/beowulf/conf.d/limits.conf
to each
compute node as /etc/security/limits.conf
, and pushes
/etc/beowulf/conf.d/sysctl.conf
to each compute node as
/etc/sysctl.conf
. If /etc/beowulf/conf.d/limits.conf
does not
exist, then node_up
creates an initial file as a concatenation of
the master node’s /etc/security/limits.conf
plus all files in the
directory /etc/security/limits.d/
. Similarly, node_up
creates an
initial /etc/beowulf/conf.d/sysctl.conf
(if it doesn’t already
exist) as a copy of the master’s /etc/sysctl.conf
. The cluster
administrator may subsequently modify these initial “best guess”
configuration files as needed for compute nodes.
Command Line Tools¶
bpstat¶
The command bpstat
can be used to quickly check the status of the
cluster nodes and/or see what processes are running on the compute
nodes. See the Reference Guide for details on usage.
bpctl¶
To reboot or set the state of a node via the command line, one can use
the bpctl
command. For example, to reboot node 5:
[root@cluster ~] # bpctl -S 5 -R
As the administrator, you may at some point have reason to prevent other
users from running new jobs on a specific node, but you do not want to
shut it down. For this purpose we have the unavailable state. When a node is set to unavailable, non-root users will be unable to start new jobs on that node, but existing jobs will continue running. To do this,
set the state to unavailable using the bpctl
command. For example,
to set node 5 to unavailable:
[root@cluster ~] # bpctl -S 5 -s unavailable
node_down¶
If you are mounting local filesystems on the compute nodes, you should
shut down the node cleanly so that the filesystems on the harddrives
stay in a consistent state. The node_down
script in
/usr/lib/beoboot/bin
does exactly this. It takes two arguments; the
first is the node number, and the second is the state to which you want
the node to go. For example, to cleanly reboot node 5:
[root@cluster ~] # /usr/lib/beoboot/bin/node_down 5 reboot
Alternatively, to cleanly power-off node 5:
[root@cluster ~] # /usr/lib/beoboot/bin/node_down 5 pwroff
The node_down
script works by first setting the node’s state to
unavailable, then remounting the filesystems on the compute node
read-only, then calling bpctl
to change the node state. This can all
be done by hand, but the script saves some keystrokes.
To configure node_down
to use IPMI, set the ipmi
value in
/etc/beowulf/config
to enabled as follows:
[root@cluster ~] # beoconfig ipmi enabled
Configuring CPU speed/power for Compute Nodes¶
Modern motherboards and processors support a degree of administrator
management of CPU frequency within a range defined by the motherboard’s
BIOS. Scyld ClusterWare provides the /etc/beowulf/init.d/30cpuspeed
script and its
associated /etc/beowulf/conf.d/cpuspeed.conf
configuration file to
implement this management for compute nodes. The local cluster
administrator is encouraged to review the cpuspeed.conf
config
file’s section labeled Scaling governor values and potentially adjust
the environment variable SCALINGGOV as desired, and then to enable the
30cpuspeed
script:
[root@cluster ~] # beochkconfig 30cpuspeed on
The administrator should also ensure that no other cpuspeed or cpupower script is enabled for compute nodes.
In brief, the administrator can choose among four CPU scaling governor settings:
performance, which directs the CPUs to execute at the maximum frequency supported by the motherboard and processor, as specified by the motherboard BIOS.
powersave, which directs the CPUs to execute at the minimum frequency supported by the motherboard and processor.
ondemand, which directs the kernel to adjust the CPU frequency between the minimum and maximum. An idle CPU executes at the minimum. As a load appears, the frequency increases relatively quickly to the maximum, and if and when the load subsides, then the frequency decreases back to the minimum. This is the default setting.
conservative, which similarly directs the kernel to adjust the CPU frequency between the minimum and maximum, albeit making those adjustments with somewhat longer latency than is done for ondemand.
The upside of the performance scaling governor is that applications running on compute nodes always enjoy the maximum CPU frequencies that are supported by the node hardware. The downside is that even idle CPUs consume that same maximum power and thus generate maximum heat. For the scaling governors performance, ondemand, and conservative, a compute-bound workload drives the CPU frequencies (and power and heat) to the maximum, and thus compute-bound application performance will exhibit little or no difference among those governors. However, a workload of rapid context switching and frequent idle time may show perhaps 10-20% lower performance for ondemand versus performance, and possibly an even larger decline with conservative. The powersave governor is typically only employed when a need to minimize the cluster power consumption and/or minimize thermal levels outweighs a need to achieve maximum performance.
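For example, to select the powersave governor for compute nodes, the Scaling governor values section of /etc/beowulf/conf.d/cpuspeed.conf would set something like the following (the exact assignment syntax in that file may differ):
SCALINGGOV=powersave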
A broader discussion can be found in the
/usr/share/doc/kernel-doc-2.6.32/Documentation/cpu-freq/
documents,
e.g., governors.txt
. Install the RHEL7 or CentOS7 base
distribution’s kernel-doc package to access these documents.
Adding New Kernel Modules¶
The modprobe
command uses /lib/modules/`uname -r`/modules.dep.bin
to
determine the pathnames of the specified kernel module and that module’s
dependencies. The depmod
command builds the human-readable
modules.dep
and the binary modules.dep.bin
files, and it should
be executed on the master node after installing any new kernel module.
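For example, after installing a new kernel module on the master node, rebuild the dependency files for the running kernel:
[root@cluster ~] # depmod -a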
Executing modprobe
on a compute node requires additional caution.
The first use of modprobe
retrieves the current modules.dep.bin
from the master node using bproc’s filecache functionality. Since any
subsequent depmod
on the master node rebuilds modules.dep.bin
,
then a subsequent modprobe
on a compute node will only see the new
modules.dep.bin
if that file is copied to the node using bpcp
,
or if the node is rebooted and thereby silently retrieves the new file.
In general, you should not execute depmod
on a compute node, since
that command will only see those few kernel modules that have previously
been retrieved from the master node, which means the node’s newly built
modules.dep.bin
will only be a sparse subset of the master node’s
full modules.dep.bin
. Bproc’s filecache functionality will always
properly retrieve a kernel module from the master node, as long as the
node’s modules.dep.bin
properly specifies the pathname of that
module, so the key is to have the node’s modules.dep.bin
be a current
copy of the master’s file.
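A minimal sketch of pushing the master’s freshly rebuilt file out to node 5, assuming bpcp’s rcp-like node:path syntax, is:
[root@cluster ~] # bpcp /lib/modules/`uname -r`/modules.dep.bin 5:/lib/modules/`uname -r`/modules.dep.bin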
Many device drivers are included with Scyld ClusterWare and are supported out-of-the-box for both the master and the compute nodes. If you find that a device, such as your Ethernet adapter, is not supported and a Linux source code driver exists for it, then you will need to build the driver modules for the master.
To do this, you will need to install the RPM of kernel source code (if you haven’t already done so). Next, compile the source code using the following extra GNU C Compiler (gcc) options.
-D__BOOT_KERNEL_SMP=1 -D__BOOT_KERNEL_UP=0
The compiled modules must be installed in the appropriate directories
under /lib/modules
. For example, if you are currently running under
the 2.6.9-67.0.4.ELsmp kernel version, the compiled module for an
Ethernet driver would be put in the following directory:
/lib/modules/2.6.9-67.0.4.ELsmp/kernel/drivers/net
Any kernel module that is required to boot a compute node, e.g., most
commonly the Ethernet driver(s) used by compute nodes, needs special
treatment. Edit the config file /etc/beowulf/config
to add the name
of the driver to the bootmodule list; you can add more bootmodule
lines if needed. See Compute Node Boot Options.
Next, you need to configure how the device driver gets loaded. You can
set it up so that the device driver only loads if the specific device is
found on the compute node. To do this, you need to add the PCI
vendor/device ID pair to the PCI table information in the
/usr/share/hwdata/pcitable
file. You can figure out what these
values are by using a combination of lspci
and lspci -n
.
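One straightforward way to find the numeric vendor/device ID pair is to compare the human-readable and numeric listings, e.g., for an Ethernet controller:
[root@cluster ~] # lspci | grep -i ethernet
[root@cluster ~] # lspci -n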
So that your new kernel module is always loaded on the compute nodes,
include the module in the initial RAM disk by adding a modprobe line
to /etc/beowulf/config
. The line should look like the following:
modprobe <module>
where <module> is the kernel module in question.
Finally, you can regenerate the BeoBoot
images by running
systemctl reload clusterware
. For more details,
see Compute Node Boot Options.
Accessing External License Servers¶
To configure the firewall for accessing external license servers, enable
ipforward
in the /etc/beowulf/config
file. The line should read
as follows:
ipforward yes
You must then reboot the compute nodes and restart the cluster services. To do so, run the following two commands as root in quick succession:
[root@cluster ~] # bpctl -S all -R
[root@cluster ~] # systemctl restart clusterware
Tip
If IP forwarding is enabled in /etc/beowulf/config
but is still
not working, then check /etc/sysctl.conf
to see if it is
disabled.
Check for the line “net.ipv4.ip_forward = 1”. If the value is set
to 0 (zero) instead of 1, then IP forwarding will be disabled, even
if it is enabled in /etc/beowulf/config
.
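For example, a quick check on the master node:
[root@cluster ~] # grep net.ipv4.ip_forward /etc/sysctl.conf
[root@cluster ~] # sysctl net.ipv4.ip_forward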
Configuring SSH for Remote Job Execution¶
Most applications that leverage /usr/bin/ssh
on compute nodes can be
configured to use /usr/bin/rsh
. In the event that your application
requires SSH access to compute nodes, ClusterWare provides this ability
through /etc/beowulf/init.d/81sshd
. To start sshd
on compute
nodes, enable the 81sshd
script and reboot your nodes:
[root@cluster ~] # beochkconfig 81sshd on
[root@cluster ~] # bpctl -S all -R
When each node boots, 81sshd
starts sshd
on the node, and the
master’s root user will be able to SSH to a compute node without a
password, e.g.:
[root@cluster ~] # ssh n0 ls
By default, compute node sshd
daemons do not allow for
password-based authentication – only key-based authentication is
available – and only the root user’s SSH keys have been configured.
If a non-root user needs SSH access to compute nodes, the user’s SSH keys will need to be configured. For example, create a DSA key using ssh-keygen, and hit Enter when prompted for a password if you want password-less authentication:
[user1@cluster ~] $ ssh-keygen -t dsa
Since the master’s /home
directory is mounted (by default) as
/home
on the compute nodes, just copy the public key to
~/.ssh/authorized_keys:
[user1@cluster ~] $ cp -a ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
Now the user can run commands over SSH to any node using shared key authentication:
[user1@cluster ~] $ ssh n0 date
If you wish to modify sshd
’s settings, you can edit
/etc/beowulf/conf.d/sshd_config
and then reboot the nodes.
Node-specific sshd
configuration settings can be saved as
/etc/beowulf/conf.d/sshd_config.$NODE
.
Client behavior for SSH on the nodes can be adjusted by editing the
global /etc/beowulf/conf.d/ssh_config
or a node-specific
/etc/beowulf/conf.d/ssh_config.$NODE
. This SSH client configuration
will only be useful when using SSH from node to node. For example:
[user1@cluster ~] $ ssh n0 ssh n1 ls
Note that /etc/beowulf/conf.d/sshd_config
and ssh_config
only
affect SSH behavior on compute nodes. The master’s SSH configuration
will not be affected.
Interconnects¶
There are many different types of network fabric one can use to interconnect the nodes of your cluster. The least expensive and most common are Fast (100Mbps) and Gigabit (1000Mbps) Ethernet. Other cluster-specific network types, such as InfiniBand, offer lower latency, higher bandwidth, and features such as RDMA (Remote Direct Memory Access).
Ethernet¶
Switching fabric is always the most important (and expensive) part of any interconnected sub-system. Ethernet switches with up to 48 ports are extremely cost effective; however, anything larger becomes expensive quickly. Intelligent switches (those with software monitoring and configuration) can be used effectively to partition sets of nodes into separate clusters using VLANs; this allows nodes to be easily reconfigured between clusters if necessary.
Adding a New Ethernet Driver¶
Drivers for most Ethernet adapters are included with the Linux distribution, and are supported out of the box for both the master and the compute nodes. If you find that your card is not supported, and a Linux source code driver exists for it, you need to compile it against the master’s kernel, and then add it to the cluster config file using the bootmodule keyword. See the Reference Guide for a discussion on the cluster config file.
For details on adding new kernel modules, see Adding New Kernel Modules.
Gigabit Ethernet vs. Specialized Cluster Interconnects¶
Surprisingly, the packet latency for Gigabit Ethernet is approximately the same as for Fast Ethernet. In some cases, the latency may even be slightly higher, as the network is tuned for high bandwidth with low impact on system utilization. Thus Gigabit Ethernet will not give significant improvement over Fast Ethernet for fine-grained communication-bound parallel applications, where specialized interconnects have a significant performance advantage.
However, Gigabit Ethernet can be very efficient when doing large I/O transfers, which may dominate the overall run-time of a system.
Other Interconnects¶
InfiniBand is a standardized interconnect for system area networking. While the hardware interface is an industry standard, the details of the hardware device interface are vendor-specific and change rapidly. Contact Scyld Customer Support for details on which InfiniBand host adapters and switches are currently supported.
With the exception of unique network monitoring tools for each, the administrative and end user interaction is unchanged from the base Scyld ClusterWare system.