beowulf-config

Name

/etc/beowulf/config -- Scyld ClusterWare Configuration file

Description

The Beowulf config file /etc/beowulf/config defines the structure of a Scyld ClusterWare cluster and provides a central location for many of the operational parameters. The file contains the settings for beoboot, node initialization, BProc communication parameters, and other aspects of cluster operation.

The syntax of the ClusterWare configuration files is standardized and is intended for human editing with embedded comments. Tools are provided for reading and writing from common programming and scripting languages, with writing retaining comments and formatting.

Tip

Care must be taken when editing or otherwise modifying /etc/beowulf/config, e.g., avoid editing while new compute nodes are coming online and ClusterWare itself is adding or modifying 'node' lines. Also note that incorrect editing may leave the cluster unusable.

Config File Format

The config file is a line-oriented sequence of configuration entries. Each configuration entry starts with a keyword followed by parameters. A line is terminated by a newline or '#'. The latter character starts a comment.

The keyword and following parameters have the same syntax rules: they may be preceded by whitespace and continue to the next whitespace or the end of the line.

Keywords and following parameters may include whitespace by quoting between a matching pair of '"' (double quote) or ''' (single quote) characters. A '\' (backslash) removes the special meaning of the following quote character.

Note that comments and newlines take precedence over any other processing, thus a '#' may not be used in a keyword or embedded in a parameter, and a backslash followed by a newline does not join lines.

Each configuration option is contained on a single line, with a keyword and optional parameters. Blank lines are ignored. Comments begin with an unquoted '#' and continue to the end of the line.
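
The parsing rules above can be sketched in Python. This is only one reading of the rules, not the parser that ClusterWare ships: parse_config_line is a hypothetical helper, and its handling of an unterminated quote (the token simply ends at the end of the line) is an assumption.

```python
def parse_config_line(line):
    """Split one config line into [keyword, param, ...] (hypothetical helper)."""
    # Comments take precedence over quoting: cut at the first '#'.
    line = line.split("#", 1)[0]
    tokens, current = [], []
    in_token = False
    quote = None
    i = 0
    while i < len(line):
        ch = line[i]
        # A backslash removes the special meaning of a following quote character.
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "\"'":
            current.append(line[i + 1])
            in_token = True
            i += 2
            continue
        if quote:
            if ch == quote:
                quote = None          # closing quote
            else:
                current.append(ch)    # whitespace is literal inside quotes
        elif ch in "\"'":
            quote = ch                # opening quote
            in_token = True
        elif ch.isspace():
            if in_token:
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    return tokens


print(parse_config_line("bootmodule e100 bcm5700 # gm"))
print(parse_config_line('nodeaccess -S 5 -g "physics dept"'))
```

A blank or comment-only line yields an empty list, which a caller can skip.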

Keywords

bootmodule modulename
The bootmodule keyword specifies that the kernel binary module modulename be included in the compute nodes' initrd image. These are typically network drivers needed to fully initialize a booting node. At node startup, the beoclient daemon on a compute node scans the node's /proc/bus/pci/devices list and automatically executes a modprobe for every modulename driver named by a PCI device so discovered. However, note that if the PCI scan does not find a need for a particular driver, then no automatic modprobe occurs. Add an additional modprobe keyword to forcibly load the modulename.
firmware firmfile
The firmware keyword specifies that the firmfile file, which typically resides on the master node in /lib/firmware/firmfile, be included in the compute nodes' initrd image, if known to be needed by a particular bootmodule modulename. Adding one or more firmware keywords significantly increases the size of the initrd image. See the Administrator's Guide for details.
fsck fsck-policy

The fsck keyword specifies the file system checking policy to be used at node boot time. The valid policies are "never", "safe" or "full".

never
The file system on the compute nodes will not be checked on boot.
safe
The file system on the compute nodes will go through a safe check every time the compute node boots.
full
The file system on the compute nodes will go through a full check every time the compute node boots. The full check may remove files from the file system if they cannot be repaired.
host MACaddress IPaddress [hostname(s)]
The host keyword assigns an IP address to a specific client device identified by its MAC address, if and when that client makes a DHCP request to the master. The IP address must be in dotted notation (e.g., 192.168.1.100), and it must be within the range of one of the hostrange IP address ranges. These host clients are not Scyld nodes, which are identified by node keywords and are assigned IP addresses from the iprange range. Rather, typically they are devices like smart Ethernet switches that connect to the cluster private network and issue a DHCP request to obtain an IP address. Up to six optional hostname names may be assigned to a client, and these names are recognized by the Beo NSS service.
hostrange [name] IPaddress-lwb IPaddress-upb
The hostrange keyword is used in conjunction with the host keyword. It declares a range of IP addresses that may later be used for host clients doing DHCP requests. An optional name may be associated with this range. Multiple hostrange keywords may be present.
ignore MACaddress
The ignore keyword specifies a MAC address (e.g., 00:11:22:AA:BB:CC) whose DHCP and PXE requests beoserv should ignore. Multiple ignore keywords are allowed.
initrdimage [noderange] imagename
The initrdimage keyword specifies the full path to the initrd image that should be used when creating the final boot images for the compute nodes. If noderange is specified, then this imagename applies only to the specified range of nodes; otherwise, imagename applies to all nodes.
insmod module-name [options]
The insmod keyword specifies a kernel module to be loaded (usually a network driver). Options for the module may be specified as well.
interface interfacename
The interface keyword specifies the name of the interface that connects the master node to the compute nodes. This is used by the cluster services and management tools such as the bpmaster daemon and the beoserv daemon. Common values are "eth0" or "eth1". If present, entries after the interface name specify the IP address and netmask that the interface should be configured to.
iprange [nodenumber] IPaddress1 IPaddress2
The iprange keyword specifies the range of IP addresses to be assigned to nodes. If the optional nodenumber is given, the first address in the range will be assigned to that node, the second address to the next node, etc. If no node number is given, the address assignment will begin with the node following the node that was last assigned. If no nodes have been assigned, the assignment will begin with node 0.
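
The assignment rule described above can be sketched in Python. The function name assign_ipranges and the tuple layout are hypothetical, chosen only for illustration. Each entry stands for one iprange line, with None where the optional nodenumber is omitted.

```python
import ipaddress


def assign_ipranges(ranges):
    """Map node numbers to IP addresses for a list of iprange entries.

    ranges: list of (nodenumber_or_None, first_ip, last_ip) tuples
    (a hypothetical representation of successive iprange lines).
    """
    table = {}
    next_node = 0  # with no nodes assigned yet, assignment begins at node 0
    for start, first_ip, last_ip in ranges:
        node = next_node if start is None else start
        first = ipaddress.IPv4Address(first_ip)
        last = ipaddress.IPv4Address(last_ip)
        for offset in range(int(last) - int(first) + 1):
            table[node] = str(first + offset)
            node += 1
        next_node = node  # the next range continues after the last node assigned
    return table


table = assign_ipranges([(None, "192.168.1.100", "192.168.1.107")])
print(len(table), table[0], table[7])
```

The number of entries in the resulting table is the value that the nodes keyword would normally be set to.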
kernelcommandline [noderange] options
The kernelcommandline keyword specifies any options you wish to have passed to the kernel on the compute nodes. These are the same options that are normally passed with "append=" in lilo, or on the lilo prompt while the machine is booting (e.g., "kernelcommandline apm=power-off"). If noderange is specified, then these options apply only to the specified range of nodes; otherwise, options apply to all nodes.
kernelimage [noderange] imagename
The kernelimage keyword specifies the full path to the kernel that should be used when creating the final boot images for the compute nodes. If noderange is specified, then this imagename applies only to the specified range of nodes; otherwise, imagename applies to all nodes.
libraries librarypath1 [, librarypath2, ...]
The libraries keyword specifies a list of libraries that should be cached on the compute nodes when an application on the node references the library. The library path can be a directory or file. If a file name is specified, then that specific file may be cached, if needed. If a directory name is specified, then every file in that directory may be cached. If the directory name ends with "/", then subdirectories under the specified directory may be cached.
logfacility facility
The logfacility keyword specifies the log facility that the BProc master daemon should use. Some example log facility names are "daemon", "syslog", and "local0" (see the syslog documentation for more information). The default log facility is "daemon".
masterdelay SECS
The masterdelay keyword specifies the timeout value in seconds for a non-primary master node to delay sending a response to an incoming DHCP request. The default value is 15 seconds.
masterorder nodes IPaddress_primary IPaddress_secondary

The masterorder keyword specifies the cluster IP addresses of the primary master node and the secondary master node(s) for a given set of nodes. This is used by the beoserv daemon for Master-Failover (cold reparenting). A compute node's PXE request broadcasts across the cluster network. The primary master node is given masterpxedelay seconds to respond, after which the first secondary master node will respond. If multiple secondary master nodes are specified, then each waits in turn for masterpxedelay seconds for a preferred master to respond. Similarly, the compute node's subsequent DHCP broadcast gets serviced in the same order, with each secondary master waiting masterdelay seconds for a preferred master to respond.

Example:

masterorder 0,5,10-20 10.1.0.1 10.2.0.1
masterorder 1-4,21-30 10.2.0.1 10.1.0.1

If master 10.1.0.1 is down or fails to respond to PXE/DHCP requests from compute node 10, then master 10.2.0.1 becomes the primary parent for compute node 10.
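
As a rough illustration of how the node list in the example resolves to a preference order, the following Python sketch expands the node list and looks up the masters for a given node. The helpers parse_noderange and masters_for are hypothetical names. The sketch does not model the masterpxedelay and masterdelay timing.

```python
def parse_noderange(spec):
    """Expand a node list such as '0,5,10-20' into a set of node numbers."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            nodes.update(range(int(low), int(high) + 1))
        else:
            nodes.add(int(part))
    return nodes


def masters_for(node, masterorder_lines):
    """Return the master IP addresses for a node, most preferred first."""
    for line in masterorder_lines:
        keyword, noderange, *masters = line.split()
        if keyword == "masterorder" and node in parse_noderange(noderange):
            return masters
    return []  # no masterorder entry covers this node


config = [
    "masterorder 0,5,10-20 10.1.0.1 10.2.0.1",
    "masterorder 1-4,21-30 10.2.0.1 10.1.0.1",
]
print(masters_for(10, config))
print(masters_for(3, config))
```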

masterpxedelay SECS
The masterpxedelay keyword specifies the timeout value in seconds for a non-primary master node to delay sending a response to an incoming PXE request. The default value is 5 seconds.
mcastbcast interface
The mcastbcast keyword directs the beoserv daemon to use broadcast instead of multicast when transmitting files over the interface. This is useful when network equipment has trouble with heavy multicast traffic.
mcastthrottle interface rate
The mcastthrottle keyword controls the rate at which data is transmitted over the specified interface. The rate is given in megabits per second. This is useful when the compute node interfaces cannot keep up with the master interface when sending large files.
mkfs mkfs-policy

The mkfs keyword specifies the policy to use when building a Linux file system on the compute nodes. The valid policies are "never", "if_needed", or "always".

never
The filesystem on the compute nodes will never be recreated on boot.
if_needed
The filesystem on the compute nodes will only be recreated if the filesystem check fails.
always
The filesystem on the compute nodes will be recreated on every boot. fsck will be assumed to be set to "never" when this is set.
modarg options
The modarg keyword specifies options to be used for modules that are loaded during the boot process without options. This is useful for specifying options to modules that get loaded during the PCI scan.
moddep module-list
The moddep keyword is used to specify module dependencies. The first module listed is dependent on the remaining modules in the space-separated list. The first module will be loaded after all other listed modules. Module dependency information is normally automatically generated by the beoboot script.
modprobe modulename [options]
The modprobe keyword specifies the name of the kernel module to be loaded with dependency checking, along with any specified module options. Note that the modulename must also be named by a bootmodule keyword.
node [nodenumber] MACaddress

The node keyword is used to assign MAC addresses to node numbers. There should be one of these lines for each node in your cluster. Note the following:

  • If a value is not provided for the nodenumber argument, the first node entry is node 0, the second is node 1, the third is node 2, etc.
  • The value "off" can be used for the MACaddress argument to leave a place holder for that node number.
  • To skip a node number, use the value "node" or "node off" for the MACaddress argument.
  • To skip a node number and ensure that it is never filled in automatically later, use the value "node reserved" for the MACaddress argument.
nodeaccess [ -M | -S nodenumber | all ] arglist

The nodeaccess keyword overrides the default access permissions for the master node (-M), for all compute nodes (all), or for a specific compute node (nodenumber). The remaining arglist is passed directly to the bpctl command for parsing and execution. See the Administrator's Guide for details about node access permissions.

Example:

nodeaccess -M -m 0110
nodeaccess -S 5 -g physics
nodeaccess -S 6 -g physics
nodeassign nodeassign-method

The nodeassign keyword specifies the node assignment strategy used when the beoserv daemon receives a new, unknown MAC address from a computer that is not currently entered in the node database. The total number of entries in the node database is limited to the number specified with the nodes keyword (see below).

The valid node assignment methods are "append", "insert", "manual", or "locked". Note the following:

  • "Append" and "insert" are the only two choices that allow new nodes to be automatically given node numbers and welcomed into the cluster.
  • Any failures of automatic node assignment through "append" or "insert" (such as when the node table is full) will cause the node assignment to be treated as "manual".
append
This is the default setting. The system will append new MAC addresses to the end of the node list in the /etc/beowulf/config file. This is done by seeking out the highest already-assigned node number and attempting to go one number beyond it. If the highest node number in the cluster has already been assigned, the "append" method will fail and the "manual" method will take precedence.
insert
The system will insert new MAC addresses into the node list in the /etc/beowulf/config file, starting with the lowest vacant node number. If no spaces are available, the "append" method will be used instead. Typically, a user would choose "insert" when replacing a single node if they want the new node entry to appear in the same place as the old node entry. If the node table is full, the "insert" method will fail and the "manual" method will take precedence.
manual
The system will enter new MAC addresses in the /var/beowulf/unknown_addresses file, and require the user to manually assign the new nodes. The node entries will appear in the "Unknown" list in the BeoSetup GUI, which simplifies the node assignment process. An alternative to using the BeoSetup GUI is to manually edit the /etc/beowulf/config file and copy in the new MAC addresses from the /var/beowulf/unknown_addresses file.
locked
The system will ignore DHCP requests from any MAC addresses not already listed in the /etc/beowulf/config file. This prevents nodes from getting added to the cluster accidentally. This is particularly useful in a cluster with multiple masters, because it enables the Cluster Administrator to control which master responds to a new node request. When you are troubleshooting issues related to the cluster not "seeing" new nodes, one of the first things to check is whether nodeassign is set to "locked".

See the Administrator's Guide for additional information on configuring nodes with the BeoSetup GUI and on manual node configuration.
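
The difference between "append" and "insert" can be sketched in Python. This is a simplified model, not the beoserv implementation: pick_node_number is a hypothetical helper, a "vacant" node number is modeled simply as a number absent from the assigned set, and placeholder entries such as "off" or "reserved" are not modeled. A return value of None stands for the cases where no number is assigned automatically.

```python
def pick_node_number(assigned, numnodes, method):
    """Choose a node number for a new MAC address, or None.

    assigned: set of node numbers already in use
    numnodes: value of the nodes keyword (valid numbers are 0..numnodes-1)
    method:   "append", "insert", "manual", or "locked"
    """
    if method == "append":
        # One beyond the highest number already assigned.
        candidate = max(assigned) + 1 if assigned else 0
        return candidate if candidate < numnodes else None
    if method == "insert":
        # Lowest vacant number; with no gaps this is the same as appending.
        for number in range(numnodes):
            if number not in assigned:
                return number
        return None  # node table is full
    return None  # "manual" and "locked" never assign automatically


print(pick_node_number({0, 1, 3}, 8, "append"))
print(pick_node_number({0, 1, 3}, 8, "insert"))
```

With nodes 0, 1, and 3 assigned, "append" picks node 4 while "insert" fills the gap at node 2.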

nodename name-format [IPv4 Offset or Base] [netgroup]

The nodename keyword defines the primary hostname, as well as additional hostname-aliases for compute nodes. It can also be used to define hostnames and hostname-aliases for non-compute node entities with a per compute node relationship (e.g., to define a hostname and IP address for the IPMI management interface on each compute node). The presence of the (optional) IPv4 parameter determines if the entry is for compute nodes or for non-compute node entities. If no 'nodename' keyword is defined for compute nodes, then compute nodes' primary hostname is of the 'dot-number' format (e.g., node 10's primary hostname is '.10').

name-format

Define a hostname or hostname-alias. The first instance of the nodename keyword with no IPv4 parameter defines the primary hostname format for compute nodes. While the user may define the primary hostname, the FIRST hostname alias shall always be of the 'dot-number' format. This allows compute nodes to always resolve their address from the 'dot-number' notation. Additional nodename entries without an IPv4 parameter define additional hostname aliases.

The name-format string must contain a conversion specification for node number substitution. The conversion specification is introduced by a percent sign (the '%' symbol). An optional following digit in the range 1..5 specifies a zero-padded minimum field width. The specification is completed with an 'N'. An unspecified or zero field width allows numeric interpretation to match compute node host names. For example, "n%N" will match "n23", "n+23", and "n0000023". By contrast, "n%3N" will only match "n001" or "n023", but not "n1" or "n23".

IPv4 Offset or Base
The presence of the optional IPv4 argument defines if the entry is for "compute nodes" (i.e. the entry will resolve to the 'dot-number' name) or if the entry is for non-cluster entities that are loosely associated with the compute node. If the argument has a leading zero, then the parameter specifies an IPv4 Offset. If the argument does not lead with a zero, then the argument specifies a 'base' from which IP addresses are computed, by adding the 'node-number' associated with the non-compute node entity.
Netgroup
The netgroup parameter specifies a netgroup that contains all the entries generated by the nodename entry.
nodes numnodes
The nodes keyword specifies the total possible number of nodes in the cluster. This should normally be set to match the iprange. However, if multiple ipranges are specified, then this value should represent the total number of nodes in all the iprange entries.
pingtimeout SECS

The bpmaster daemon that executes on the master node sends periodic "ping" messages to the bpslave daemon that executes on each compute node, and each bpslave dutifully responds. This interaction serves as mutual bpmaster-bpslave assurance that the other daemon and the network link are still alive and well. If bpslave does not see this "ping" message for SECS seconds, then the bpslave goes into "orphan mode". If run-to-completion is enabled (see the Administrator's Guide for details), then the node attempts to remain alive and functioning, despite its apparent inability to communicate with the master node. If run-to-completion is not enabled (which is the default), then the node reboots immediately. If bpmaster does not see a ping reply for SECS seconds, then it syslogs this event and breaks its side of the network connection to the compute node.

The default pingtimeout value is 32 seconds. In rare cases, a particular workload may trigger such a "ping timeout" and its associated spontaneous reboot, and using a pingtimeout keyword to increase the timeout value may stop the spontaneous rebooting.

pci vendorid deviceid drivername
The pci keyword specifies what driver should be used in support of the specified PCI device. A device is identified by a unique vendor ID and device ID pair. The vendor and device IDs can be either in decimal or in hexadecimal with the "0x" notation. You should have one of these lines for each PCI ID (a vendor ID combined with a device ID) for each device on your compute nodes that is not already recognized. Any module dependencies or arguments should be specified with moddep and modarg.
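
As a small illustration of the two ID notations, the Python sketch below reads a pci entry and normalizes both IDs to integers. The helper parse_pci_entry is hypothetical, and the vendor/device pair and driver name in the sample lines are illustrative values only, not taken from a shipped configuration.

```python
def parse_pci_entry(line):
    """Return (vendor_id, device_id, drivername) from a pci config line."""
    keyword, vendor, device, driver = line.split()
    if keyword != "pci":
        raise ValueError("not a pci entry: %r" % line)
    # int(text, 0) accepts plain decimal as well as "0x"-prefixed hexadecimal.
    return int(vendor, 0), int(device, 0), driver


# The same device written in hexadecimal and in decimal (illustrative values).
print(parse_pci_entry("pci 0x8086 0x100e e1000"))
print(parse_pci_entry("pci 32902 4110 e1000"))
```

Both lines yield the same vendor and device numbers, so either notation identifies the same device.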
prestage pathname
The prestage keyword names a specific file that each compute node pulls from the master at node boot time. Multiple instances of prestage can be used. If the pathname is a file in one of the libraries directories, then the pathname gets pulled into the compute node's library cache. Otherwise, the file (and its directory hierarchy) is copied from the master to the compute nodes.
server transport-protocol port
The server keyword specifies the port numbers that ClusterWare uses for specified transport protocols. Each transport protocol uses a unique default port number. In the event that a default port value conflicts with a port number used by another service (typically, specified in /etc/services), a server keyword must specify an override value. The allowable transport-protocol keywords are "beofs2" (default port 932), "bproc" (default port 933), "beonss" (default port 3045), and "beostats" (default port 5545). (The keyword "tcp" is deprecated - use "beofs2" instead.)

Examples

iprange 192.168.1.0 192.168.1.50
nodename ipmi-n%N 0.0.1.0

In the above example, the hostname "ipmi-n0" has an address of 192.168.2.0. That is, the compute node's address (192.168.1.0 for compute node 0) plus the IPv4 Offset of 0.0.1.0. The hostname "ipmi-n12" has an address of 192.168.2.12, which is compute node 12's address plus the IPv4 Offset of 0.0.1.0.

nodename ib0-n%N 0.1.0.0 infiniband

The above example defines a hostname for the infiniband interface of each compute node. Using the iprange values in the previous example, the infiniband interface for compute node 0 has a primary hostname of "ib0-n0" and resolves to the address 192.169.1.0: node 0's basic iprange IP address, plus the increment 0.1.0.0. The infiniband interface for compute node 10 has a primary hostname of "ib0-n10" and resolves to the address 192.169.1.10. Each of the "ib0-n%N" hostnames belongs to the "infiniband" netgroup.
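
The address arithmetic used in the two nodename examples above can be sketched in Python. The helper related_address is hypothetical. Treating "a leading zero" as "the first octet is 0" is an assumption about how the Offset and Base forms are told apart, and the base address 10.54.0.0 is an illustrative value.

```python
import ipaddress


def related_address(node_ip, node_number, arg):
    """Compute the address generated by a nodename IPv4 argument.

    Assumption: an argument whose first octet is 0 is an Offset added to the
    compute node's own address; otherwise it is a Base to which the node
    number is added.
    """
    value = int(ipaddress.IPv4Address(arg))
    if arg.split(".")[0] == "0":
        result = int(ipaddress.IPv4Address(node_ip)) + value  # Offset form
    else:
        result = value + node_number                          # Base form
    return str(ipaddress.IPv4Address(result))


# Node 12 is 192.168.1.12 and node 10 is 192.168.1.10 under the iprange above.
print(related_address("192.168.1.12", 12, "0.0.1.0"))
print(related_address("192.168.1.10", 10, "0.1.0.0"))
print(related_address("192.168.1.5", 5, "10.54.0.0"))
```

The first two calls reproduce the "ipmi-n12" and node 10 infiniband addresses from the examples; the third shows the Base form.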

nodename computenode%N
nodename cnode%3N

In the above example, the primary hostname for compute node 0 is "computenode0", and the primary hostname for compute node 12 is "computenode12". The second nodename entry defines additional hostname aliases. The FIRST hostname alias will always be the 'dot-number' notation, so compute node 12's first hostname alias is ".12", and the second hostname alias will be "cnode012". The '%' followed by a three specifies a three-digit field width format for the entry.
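
The name-format substitution and matching rules can be sketched in Python. Both helpers are hypothetical and reflect one interpretation of the rules described for the nodename keyword, not the Beo NSS implementation. In particular, the matching rule for a nonzero field width (the hostname must equal the zero-padded form exactly) is an assumption.

```python
import re

_SPEC = re.compile(r"%([0-5]?)N")  # '%', optional width digit, then 'N'


def format_nodename(name_format, node):
    """Substitute the node number into a name-format string."""
    def substitute(match):
        width = int(match.group(1) or 0)
        return str(node).zfill(width)  # zero-padded minimum field width
    return _SPEC.sub(substitute, name_format, count=1)


def match_nodename(name_format, hostname):
    """Return the node number that hostname refers to, or None."""
    spec = _SPEC.search(name_format)
    if spec is None:
        raise ValueError("name-format needs a %N conversion specification")
    width = int(spec.group(1) or 0)
    prefix = re.escape(name_format[:spec.start()])
    suffix = re.escape(name_format[spec.end():])
    # Width 0: free numeric interpretation ("23", "+23", "0000023").
    number = r"\+?(\d+)" if width == 0 else r"(\d{%d,})" % width
    found = re.fullmatch(prefix + number + suffix, hostname)
    if found is None:
        return None
    node = int(found.group(1))
    if width and format_nodename(name_format, node) != hostname:
        return None  # padded form must match exactly
    return node


print(format_nodename("computenode%N", 12))
print(format_nodename("cnode%3N", 12))
print(match_nodename("n%N", "n0000023"), match_nodename("n%3N", "n23"))
```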

The following is an example of a complete Beowulf configuration file:

# Beowulf Configuration file

# Network interface used for Beowulf
# Only first argument to interface is important
interface eth1 192.168.1.1 255.255.255.0

# These two should probably agree for most users
iprange 192.168.1.100 192.168.1.107
nodes 8

# Default location of boot images
bootfile /var/beowulf/boot.img
kernelimage /boot/vmlinuz-2.4.17-0.18.12.beo
kernelcommandline apm=power-off

# Default libraries
libraries /lib /usr/lib

# Default file system policies.
fsck full
mkfs if_needed

# beoserv settings
server beofs2 932

# Default Modules
bootmodule 3c59x 8139too dmfe eepro100 epic100 hp100 natsemi
bootmodule ne2k-pci pcnet32 sis900 starfire sundance tlan
bootmodule tulip via-rhine winbond-840 yellowfin

# Non-kernel integrated drivers
bootmodule e100 bcm5700 # gm

# Node assignment method
nodeassign append

# PCI Gigabit Ethernet.
#  * AceNIC and SysKonnect firmwares are very large.
#  * Some of these are distributed separate from the kernel
bootmodule dl2k hamachi e1000 ns83820 # acenic sk98lin

node 00:50:8B:D3:25:4D
node 00:50:8B:D3:07:8B
ignore 00:50:8B:D3:31:FB
node 00:50:8B:D3:62:A0
node 00:50:8B:D3:00:66
node 00:50:8B:D3:30:42
node 00:50:8B:D3:98:EA