/etc/beowulf/config – Scyld ClusterWare Configuration file
The Beowulf config file
/etc/beowulf/config defines the structure of
a Scyld ClusterWare cluster and provides a central location for many of the
operational parameters. The file contains the settings for
BProc communication parameters, and other
aspects of cluster operation.
The syntax of the ClusterWare configuration files is standardized and is intended for human editing with embedded comments. Tools are provided for reading and writing from common programming and scripting languages, with writing retaining comments and formatting.
Care must be taken when editing or otherwise modifying
/etc/beowulf/config. For example, avoid editing while new compute nodes are coming online and ClusterWare itself is adding or modifying ‘node’ lines. Also note that incorrect editing may leave the cluster unusable.
Config File Format
The config file is a line-oriented sequence of configuration entries. Each configuration entry starts with a keyword followed by parameters. A line is terminated by a newline or ‘#’. The latter character starts a comment.
The keyword and following parameters have the same syntax rules: they may be preceded by whitespace and continue to the next whitespace or the end of the line.
Keywords and following parameters may include whitespace by quoting between a matching pair of ‘”’ (double quote) or ‘’’ (single quote) characters. A ‘\’ (backslash) removes the special meaning of the following quote character.
Note that comments and newlines take precedence over any other processing, thus a ‘#’ may not be used in a keyword or embedded in a parameter, and a backslash followed by a newline does not join lines.
Each configuration option is contained on a single line, with a keyword and optional parameters. Blank lines are ignored. Comments begin with an unquoted ‘#’ and continue to the end of the line.
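The tokenizing rules above can be illustrated with a minimal Python sketch. It is not the ClusterWare parser; `parse_config_line` is a hypothetical helper name. It assumes, following the "comments take precedence" rule, that a ‘#’ ends the line even inside quotes, and that a backslash is only special immediately before a quote character.

```python
def parse_config_line(line):
    """Split one config line into (keyword, params); None for blank/comment lines.

    Illustrative only: assumes '#' always ends the line (comments take
    precedence over quoting) and that a backslash only escapes a quote.
    """
    line = line.split("#", 1)[0]          # strip the comment first
    tokens, current = [], []
    quote = None                          # active quote character, if any
    in_token = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] in "\"'":
            current.append(line[i + 1])   # escaped quote becomes a literal
            in_token = True
            i += 2
            continue
        if quote:
            if ch == quote:
                quote = None              # closing quote
            else:
                current.append(ch)        # whitespace is kept inside quotes
        elif ch in "\"'":
            quote = ch                    # opening quote
            in_token = True
        elif ch.isspace():
            if in_token:                  # whitespace ends the current token
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    if not tokens:
        return None
    return tokens[0], tokens[1:]
```

For instance, `parse_config_line('kernelcommandline "console=ttyS0 quiet"')` yields the keyword `kernelcommandline` with a single parameter that contains a space.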
- bootmodule modulename
The bootmodule keyword specifies that the kernel binary module modulename be included in the compute nodes’ initrd image. These are typically network drivers needed to fully initialize a booting node. At node startup, the
beoclient daemon on a compute node scans the node’s
/proc/bus/pci/devices list and automatically executes a
modprobe for every modulename driver named by a PCI device so discovered. However, note that if the PCI scan does not find a need for a particular driver, then no automatic
modprobe occurs. Add an additional
modprobe keyword to forcibly load the modulename.
- firmware firmfile
The firmware keyword specifies that the firmfile file, which typically resides on the master node in
/lib/firmware/firmfile, be included in the compute nodes’ initrd image, if known to be needed by a particular
bootmodule modulename. Adding one or more
firmware keywords significantly increases the size of the initrd image. See the Administrator’s Guide for details.
- fsck fsck-policy
The fsck keyword specifies the file system checking policy to be used at node boot time. The valid policies are “never”, “safe” or “full”.
“never”: The file system on the compute nodes will not be checked on boot.
“safe”: The file system on the compute nodes will go through a safe check every time the compute node boots.
“full”: The file system on the compute nodes will go through a full check every time the compute node boots. The full check may remove files from the file system if they cannot be repaired.
- host MACaddress IPaddress [hostname(s)]
The host keyword assigns an IP address to a specific client device identified by its MAC address, if and when that client makes a DHCP request to the master. The IP address must be in dotted notation (e.g., 192.168.1.100), and it must be within the range of one of the
hostrange IP address ranges. These
host clients are not Scyld nodes, which are identified by
node keywords and are assigned IP addresses from the
iprange range. Rather, typically they are devices like smart Ethernet switches that connect to the cluster private network and issue a DHCP request to obtain an IP address. Up to six optional hostnames may be assigned to a client, and these names are recognized by the Beo NSS service.
- hostrange [name] IPaddress-lwb IPaddress-upb
The hostrange keyword is used in conjunction with the
host keyword. It declares a range of IP addresses that may later be used for
host clients doing DHCP requests. An optional name may be associated with this range. Multiple
hostrange keywords may be present.
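The requirement that a host address fall inside a declared hostrange can be checked with a short Python sketch. `in_hostrange` is a hypothetical helper, and the addresses in the usage note are illustrative values rather than anything taken from a real cluster.

```python
import ipaddress

def in_hostrange(ip, hostranges):
    """Return True if ip lies within any (lower, upper) hostrange pair, inclusive."""
    addr = ipaddress.IPv4Address(ip)
    return any(
        ipaddress.IPv4Address(lower) <= addr <= ipaddress.IPv4Address(upper)
        for lower, upper in hostranges
    )
```

With a single range of 192.168.1.200 through 192.168.1.220, the address 192.168.1.210 is accepted and 192.168.1.100 is not.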
- ignore MACaddress
The ignore keyword specifies a MAC address (e.g., 00:11:22:AA:BB:CC) whose DHCP and PXE requests
beoserv should ignore. Multiple
ignore keywords are allowed.
- initrdimage [noderange] imagename
The initrdimage keyword specifies the full path to the initrd image that should be used when creating the final boot images for the compute nodes. If noderange is specified, then this imagename applies only to the specified range of nodes; otherwise, imagename applies to all nodes.
- insmod module-name [options]
The insmod keyword specifies a kernel module to be loaded (usually a network driver). Options for the module may be specified as well.
- interface interfacename
The interface keyword specifies the name of the interface that connects the master node to the compute nodes. This is used by the cluster services and management tools such as the
bpmaster daemon and the
beoserv daemon. Common values are “eth0” or “eth1”. If present, entries after the interface name specify the IP address and netmask with which the interface should be configured.
- iprange [nodenumber] IPaddress1 IPaddress2
The iprange keyword specifies the range of IP addresses to be assigned to nodes. If the optional nodenumber is given, the first address in the range will be assigned to that node, the second address to the next node, etc. If no node number is given, the address assignment will begin with the node following the node that was last assigned. If no nodes have been assigned, the assignment will begin with node 0.
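The numbering rule for iprange entries can be sketched in Python as follows. This is an illustration of the rule as described, not ClusterWare code; `assign_ipranges` and its tuple input format are assumptions made for the example.

```python
import ipaddress

def assign_ipranges(entries):
    """Map node numbers to addresses for a list of iprange entries.

    Each entry is (nodenumber_or_None, first_ip, last_ip). Without a node
    number, numbering continues after the last node assigned (or starts at 0).
    """
    mapping = {}
    next_node = 0
    for start_node, first_ip, last_ip in entries:
        node = next_node if start_node is None else start_node
        first = int(ipaddress.IPv4Address(first_ip))
        last = int(ipaddress.IPv4Address(last_ip))
        for value in range(first, last + 1):
            mapping[node] = str(ipaddress.IPv4Address(value))
            node += 1
        next_node = node
    return mapping
```

For `iprange 192.168.1.100 192.168.1.107`, this maps node 0 to 192.168.1.100 and node 7 to 192.168.1.107, eight nodes in total.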
- kernelcommandline [noderange] options
The kernelcommandline keyword specifies any options you wish to have passed to the kernel on the compute nodes. These are the same options that are normally passed with “append=” in
lilo, or on the
lilo prompt while the machine is booting (e.g., “kernelcommandline apm=power-off”). If noderange is specified, then these options apply only to the specified range of nodes; otherwise, options apply to all nodes.
- kernelimage [noderange] imagename
The kernelimage keyword specifies the full path to the kernel that should be used when creating the final boot images for the compute nodes. If noderange is specified, then this imagename applies only to the specified range of nodes; otherwise, imagename applies to all nodes.
- libraries librarypath1 [, librarypath2, …]
The libraries keyword specifies a list of libraries that should be cached on the compute nodes when an application on the node references the library. The library path can be a directory or file. If a file name is specified, then that specific file may be cached, if needed. If a directory name is specified, then every file in that directory may be cached. If the directory name ends with “/”, then subdirectories under the specified directory may be cached.
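The three matching cases for a libraries entry can be sketched in Python. This is a simplified reading of the rule, not the actual caching logic: `may_be_cached` is a hypothetical helper that works on path strings only, so an entry without a trailing “/” is treated as either an exact file or the direct parent directory.

```python
import posixpath

def may_be_cached(path, libraries):
    """Return True if path is eligible for caching under the libraries entries."""
    for entry in libraries:
        if entry.endswith("/"):
            # Trailing "/": files in subdirectories are eligible as well.
            if path.startswith(entry):
                return True
        elif path == entry or posixpath.dirname(path) == entry:
            # Exact file match, or a file directly inside the named directory.
            return True
    return False
```

Under this reading, `libraries /usr/lib` covers /usr/lib/libm.so but not /usr/lib/python/x.so, while `libraries /usr/lib/` covers both.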
- logfacility facility
The logfacility keyword specifies the log facility that the
BProc master daemon should use. Some example log facility names are “daemon”, “syslog”, and “local0” (see the
syslog documentation for more information). The default log facility is “daemon”.
- masterdelay SECS
The masterdelay keyword specifies the timeout value in seconds for a non-primary master node to delay sending a response to an incoming DHCP request. The default value is 15 seconds.
- masterorder nodes IPaddress_primary IPaddress_secondary
The masterorder keyword specifies the cluster IP addresses of the primary master node and the secondary master node(s) for a given set of nodes. This is used by the
beoserv daemon for Master-Failover (cold reparenting). A compute node’s PXE request broadcasts across the cluster network. The primary master node is given
masterpxedelay seconds to respond, after which the first secondary master node will respond. If multiple secondary master nodes are specified, then each waits in turn for
masterpxedelay seconds for a preferred master to respond. Similarly, the compute node’s subsequent DHCP broadcast gets serviced in the same order, with each secondary master waiting
masterdelay seconds for a preferred master to respond.
    masterorder 0,5,10-20 10.1.0.1 10.2.0.1
    masterorder 1-4,21-30 10.2.0.1 10.1.0.1
In this example, if master 10.1.0.1 is down or fails to respond to PXE/DHCP requests from compute node 10, then master 10.2.0.1 becomes the primary parent for compute node 10.
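The lookup from a compute node to its ordered list of masters can be sketched in Python. The node-list syntax (comma-separated numbers and dash ranges) is inferred from the masterorder example lines; `parse_node_list` and `masters_for_node` are hypothetical helpers, not part of ClusterWare.

```python
def parse_node_list(spec):
    """Expand a node list such as '0,5,10-20' into a set of node numbers."""
    nodes = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            nodes.update(range(int(low), int(high) + 1))
        else:
            nodes.add(int(part))
    return nodes

def masters_for_node(config_lines, node):
    """Return the masters for a node, primary first, or [] if none is listed."""
    for line in config_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "masterorder":
            if node in parse_node_list(fields[1]):
                return fields[2:]
    return []
```

Applied to the two masterorder lines in the example, node 10 resolves to 10.1.0.1 followed by 10.2.0.1, and node 3 resolves to the reverse order.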
- masterpxedelay SECS
The masterpxedelay keyword specifies the timeout value in seconds for a non-primary master node to delay sending a response to an incoming PXE request. The default value is 5 seconds.
- mcastbcast interface
The mcastbcast keyword directs the
beoserv daemon to use broadcast instead of multicast when transmitting files over the interface. This is useful when network equipment has trouble with heavy multicast traffic.
- mcastthrottle interface rate
The mcastthrottle keyword controls the rate at which data is transmitted over the specified interface. The rate is given in megabits per second. This is useful when the compute node interfaces cannot keep up with the master interface when sending large files.
- mkfs mkfs-policy
The mkfs keyword specifies the policy to use when building a Linux file system on the compute nodes. The valid policies are “never”, “if_needed”, or “always”.
“never”: The file system on the compute nodes will never be recreated on boot.
“if_needed”: The file system on the compute nodes will only be recreated if the file system check fails.
“always”: The file system on the compute nodes will be recreated on every boot.
fsck will be assumed to be set to “never” when this is set.
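The interaction between the mkfs and fsck policies can be expressed as a small Python sketch. `effective_boot_policies` is a hypothetical helper, and it assumes the override applies only when mkfs is set to “always”, as the text above suggests.

```python
FSCK_POLICIES = ("never", "safe", "full")
MKFS_POLICIES = ("never", "if_needed", "always")

def effective_boot_policies(fsck, mkfs):
    """Validate both policies and apply the mkfs 'always' override of fsck."""
    if fsck not in FSCK_POLICIES:
        raise ValueError("unknown fsck policy: %r" % fsck)
    if mkfs not in MKFS_POLICIES:
        raise ValueError("unknown mkfs policy: %r" % mkfs)
    if mkfs == "always":
        fsck = "never"   # checking a file system that is about to be rebuilt is pointless
    return fsck, mkfs
```

So a config with `fsck full` and `mkfs always` behaves as if `fsck never` had been given, while `fsck full` with `mkfs if_needed` is left unchanged.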
- modarg options
The modarg keyword specifies options to be used for modules that are loaded during the boot process without options. This is useful for specifying options to modules that get loaded during the PCI scan.
- moddep module-list
The moddep keyword is used to specify module dependencies. The first module listed is dependent on the remaining modules in the space-separated list. The first module will be loaded after all other listed modules. Module dependency information is normally automatically generated by the
- modprobe modulename [options]
The modprobe keyword specifies the name of the kernel module to be loaded with dependency checking, along with any specified module options. Note that the modulename must also be named by a
- node [nodenumber] MACaddress
The node keyword is used to assign MAC addresses to node numbers. There should be one of these lines for each node in your cluster. Note the following:
If a value is not provided for the nodenumber argument, the first node entry is node 0, the second is node 1, the third is node 2, etc.
The value “off” can be used for the MACaddress argument to leave a place holder for that node number.
To skip a node number, use the value “node” or “node off” for the MACaddress argument.
To skip a node number and make sure it will never be automatically filled in by something later in the future, use the value “node reserved” for the MACaddress argument.
- nodeaccess [ -M | -S nodenumber | all ] arglist
The nodeaccess keyword overrides the default access permissions for the master node (
-M), for all compute nodes (
all), or for a specific compute node (nodenumber). The remaining arglist is passed directly to the
bpctl command for parsing and execution. See the Administrator’s Guide for details about node access permissions.
    nodeaccess -M -m 0110
    nodeaccess -S 5 -g physics
    nodeaccess -S 6 -g physics
- nodeassign nodeassign-method
The nodeassign keyword specifies the node assignment strategy used when the
beoserv daemon receives a new, unknown MAC address from a computer that is not currently entered in the node database. The total number of entries in the node database is limited to the number specified with the
nodes keyword (see below).
The valid node assignment methods are “append”, “insert”, “manual”, or “locked”. Note the following:
“Append” and “insert” are the only two choices that allow new nodes to be automatically given node numbers and welcomed into the cluster.
Any failures of automatic node assignment through “append” or “insert” (such as when the node table is full) will cause the node assignment to be treated as “manual”.
“append”: This is the default setting. The system will append new MAC addresses to the end of the node list in the
/etc/beowulf/config file. This is done by seeking out the highest already-assigned node number and attempting to go one number beyond it. If the highest node number in the cluster has already been assigned, the “append” method will fail and the “manual” method will take precedence.
“insert”: The system will insert new MAC addresses into the node list in the
/etc/beowulf/config file, starting with the lowest vacant node number. If no spaces are available, the “append” method will be used instead. Typically, a user would choose “insert” when replacing a single node if they want the new node entry to appear in the same place as the old node entry. If the node table is full, the “insert” method will fail and the “manual” method will take precedence.
“manual”: The system will enter new MAC addresses in the
/var/beowulf/unknown_addresses file, and require the user to manually assign the new nodes. The node entries will appear in the “Unknown” list in the BeoSetup GUI, which simplifies the node assignment process. An alternative to using the BeoSetup GUI is to manually edit the
/etc/beowulf/config file and copy in the new MAC addresses from the
“locked”: The system will ignore DHCP requests from any MAC addresses not already listed in the
/etc/beowulf/config file. This prevents nodes from getting added to the cluster accidentally. This is particularly useful in a cluster with multiple masters, because it enables the Cluster Administrator to control which master responds to a new node request. When you are troubleshooting issues related to the cluster not “seeing” new nodes, one of the first things to check is whether
nodeassign is set to “locked”.
See the Administrator’s Guide for additional information on configuring nodes with the BeoSetup GUI and on manual node configuration.
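The difference between the “append” and “insert” strategies can be sketched in Python. This is one interpretation of the description, not the beoserv implementation: `choose_node_number` is a hypothetical helper, `assigned` is assumed to already include any reserved or placeholder node numbers, and a return value of `None` stands for falling back to manual assignment.

```python
def choose_node_number(assigned, max_nodes, method="append"):
    """Pick a node number for a new MAC address, or None for manual handling.

    assigned:  set of node numbers already taken (including reserved slots).
    max_nodes: the value given with the nodes keyword.
    """
    highest = max(assigned, default=-1)
    if method == "insert":
        for number in range(highest):      # lowest vacant slot below the highest
            if number not in assigned:
                return number
        method = "append"                   # no gap available: behave like append
    if method == "append":
        candidate = highest + 1             # one beyond the highest assigned node
        return candidate if candidate < max_nodes else None
    return None                             # "manual" and "locked" never auto-assign
```

With nodes 0, 1, and 3 assigned and `nodes 8`, “insert” picks node 2 while “append” picks node 4. Once the table is full, both return `None`.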
- nodename name-format [IPv4 Offset or Base] [netgroup]
The nodename keyword defines the primary hostname, as well as additional hostname-aliases for compute nodes. It can also be used to define hostnames and hostname-aliases for non-compute node entities with a per compute node relationship (e.g., to define a hostname and IP address for the IPMI management interface on each compute node). The presence of the (optional) IPv4 parameter determines if the entry is for compute nodes or for non-compute node entities. If no ‘nodename’ keyword is defined for compute nodes, then compute nodes’ primary hostname is of the ‘dot-number’ format (e.g., node 10’s primary hostname is ‘.10’).
name-format: Define a hostname or hostname-alias. The first instance of the nodename keyword with no IPv4 parameter defines the primary hostname format for compute nodes. While the user may define the primary hostname, the FIRST hostname alias shall always be of the ‘dot-number’ format. This allows compute nodes to always resolve their address from the ‘dot-number’ notation. Additional nodename entries without an IPv4 parameter define additional hostname aliases.
The name-format string must contain a conversion specification for node number substitution. The conversion specification is introduced by a percent sign (the ‘%’ symbol). An optional following digit in the range 1..5 specifies a zero-padded minimum field width. The specification is completed with an ‘N’. An unspecified or zero field width allows numeric interpretation to match compute node host names. For example, “n%N” will match “n23”, “n+23”, and “n0000023”. By contrast, “n%3N” will only match “n001” or “n023”, but not “n1” or “n23”.
IPv4 Offset or Base: The presence of the optional IPv4 argument defines if the entry is for “compute nodes” (i.e., the entry will resolve to the ‘dot-number’ name) or if the entry is for non-cluster entities that are loosely associated with the compute node. If the argument has a leading zero, then the parameter specifies an IPv4 Offset. If the argument does not lead with a zero, then the argument specifies a ‘base’ from which IP addresses are computed, by adding the ‘node-number’ associated with the non-compute node entity.
netgroup: The netgroup parameter specifies a netgroup that contains all the entries generated by the nodename entry.
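The ‘%N’ conversion can be illustrated with a Python sketch that both generates a hostname from a node number and matches a hostname back to a node number. The helper names are hypothetical. The matching side assumes that a non-zero width means "at least that many digits" and that a zero or missing width accepts an optional ‘+’ sign; the real resolver may differ in such edge cases.

```python
import re

_SPEC = re.compile(r"%([0-5]?)N")   # '%', optional width digit, then 'N'

def format_hostname(name_format, node):
    """Substitute the node number into name_format, zero-padded to the width."""
    def substitute(match):
        width = int(match.group(1) or 0)
        return str(node).zfill(width)
    return _SPEC.sub(substitute, name_format)

def match_hostname(name_format, hostname):
    """Return the node number encoded in hostname, or None if it does not match."""
    spec = _SPEC.search(name_format)
    if spec is None:
        return None
    width = int(spec.group(1) or 0)
    if width == 0:
        number = r"(\+?\d+)"            # free numeric interpretation
    else:
        number = r"(\d{%d,})" % width   # assumed: width is a minimum digit count
    pattern = (re.escape(name_format[:spec.start()]) + number
               + re.escape(name_format[spec.end():]))
    found = re.fullmatch(pattern, hostname)
    return int(found.group(1)) if found else None
```

For example, `format_hostname("cnode%3N", 12)` produces `cnode012`, and `match_hostname("n%3N", "n23")` returns `None` because the name has fewer than three digits.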
- nodes numnodes
The nodes keyword specifies the total possible number of nodes in the cluster. This should normally be set to match the
iprange. However, if multiple
ipranges are specified, then this value should represent the total number of nodes in all the
- pingtimeout SECS
The bpmaster daemon that executes on the master node sends periodic “ping” messages to the
bpslave daemon that executes on each compute node, and each bpslave dutifully responds. This interaction serves as mutual bpmaster/bpslave assurance that the other daemon and the network link are still alive and well. If bpslave does not see this “ping” message for SECS seconds, then the bpslave goes into “orphan mode”. If run-to-completion is enabled (see the Administrator’s Guide for details), then the node attempts to remain alive and functioning, despite its apparent inability to communicate with the master node. If run-to-completion is not enabled (which is the default), then the node reboots immediately. If bpmaster does not see a ping reply for SECS seconds, then it syslogs this event and breaks its side of the network connection to the compute node.
The default pingtimeout value is 32 seconds. In rare cases, a particular workload may trigger such a “ping timeout” and its associated spontaneous reboot, and using a
pingtimeout keyword to increase the timeout value may stop the spontaneous rebooting.
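The compute-node side of the ping timeout can be summarized as a small decision function in Python. This is only a model of the behavior described above: `bpslave_reaction` and its return strings are made up for the illustration, and whether the comparison at exactly SECS seconds is inclusive is an assumption.

```python
def bpslave_reaction(seconds_since_ping, pingtimeout=32, run_to_completion=False):
    """Model what a compute node does after a given silence from bpmaster."""
    if seconds_since_ping < pingtimeout:
        return "connected"
    # Timeout reached: the node is orphaned from its master.
    if run_to_completion:
        return "orphan: keep running"
    return "orphan: reboot"
```

With the defaults, 40 seconds of silence leads to a reboot. Raising `pingtimeout` to 60 would keep the same node connected.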
- pci vendorid deviceid drivername
The pci keyword specifies what driver should be used in support of the specified PCI device. A device is identified by a unique vendor ID and device ID pair. The vendor and device IDs can be either in decimal or hexadecimal with the “0x” notation. You should have one of these lines for each PCI ID (a vendor ID combined with a device ID) for each device on your compute nodes that is not already recognized. Any module dependencies or arguments should be specified with
- prestage pathname
The prestage keyword names a specific file that each compute node pulls from the master at node boot time. Multiple instances of
prestage can be used. If the pathname is a file in one of the
libraries directories, then the pathname gets pulled into the compute node’s library cache. Otherwise, the file (and its directory hierarchy) is copied from the master to the compute nodes.
- server transport-protocol port
The server keyword specifies the port numbers that ClusterWare uses for specified transport protocols. Each transport protocol uses a unique default port number. In the event that a default port value conflicts with a port number used by another service (typically, specified in
server keyword must specify an override value. The allowable transport-protocol keywords are “beofs2” (default port 932), “bproc” (default port 933), “beonss” (default port 3045), and “beostats” (default port 5545). (The keyword “tcp” is deprecated; use “beofs2” instead.)
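The default ports and their overrides can be modeled with a short Python sketch. `effective_ports` is a hypothetical helper. Treating the deprecated “tcp” keyword as an alias for “beofs2” is an assumption based on the deprecation note, and the port 1932 in the usage note is an arbitrary example value.

```python
DEFAULT_PORTS = {"beofs2": 932, "bproc": 933, "beonss": 3045, "beostats": 5545}

def effective_ports(config_lines):
    """Return the port table after applying any server keyword overrides."""
    ports = dict(DEFAULT_PORTS)
    for line in config_lines:
        fields = line.split("#", 1)[0].split()   # drop comments, then tokenize
        if len(fields) == 3 and fields[0] == "server":
            protocol = "beofs2" if fields[1] == "tcp" else fields[1]
            if protocol in ports:
                ports[protocol] = int(fields[2])
    return ports
```

A config containing `server beofs2 1932` changes only the beofs2 entry and leaves bproc, beonss, and beostats on their defaults.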
    iprange 192.168.1.0 192.168.1.50
    nodename ipmi-n%N 0.0.1.0
In the above example, the hostname “ipmi-n0” has an address of 192.168.2.0. That is, the compute node’s address (192.168.1.0 for compute node 0) plus the IPv4 Offset of 0.0.1.0. The hostname “ipmi-n12” has an address of 192.168.2.12, which is compute node 12’s address plus the IPv4 Offset of 0.0.1.0.
nodename ib0-n%N 0.1.0.0 infiniband
In the above example, a hostname is defined for the infiniband interface of
each compute node. Using the
iprange values in the previous example,
the infiniband interface for compute node 0 has a primary hostname of
“ib0-n0” and resolves to the address 192.169.1.0: node 0’s basic
iprange IP address, plus the increment 0.1.0.0. The infiniband
interface for compute node 10 has a primary hostname of “ib0-n10” and
resolves to the address 192.169.1.10. Each of the “ib0-n%N” hostnames
belongs to the “infiniband” netgroup.
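The offset-versus-base arithmetic for the IPv4 parameter can be sketched in Python with the standard `ipaddress` module. `entity_address` is a hypothetical helper, and the base value in the usage note is illustrative. The leading-zero test follows the rule stated for the IPv4 Offset or Base parameter.

```python
import ipaddress

def entity_address(ipv4_param, node_number, node_address):
    """Compute the address of a per-node entity from a nodename IPv4 parameter.

    A parameter with a leading zero is an offset added to the node's own
    address; otherwise it is a base to which the node number is added.
    """
    value = int(ipaddress.IPv4Address(ipv4_param))
    if ipv4_param.startswith("0"):
        result = int(ipaddress.IPv4Address(node_address)) + value
    else:
        result = value + node_number
    return str(ipaddress.IPv4Address(result))
```

For compute node 12 at 192.168.1.12, the offset 0.0.1.0 gives 192.168.2.12, matching the “ipmi-n12” example. A base such as 10.20.0.0 would give 10.20.0.12 for the same node.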
    nodename computenode%N
    nodename cnode%3N
In the above example, the primary hostname for compute node 0 is “computenode0”, and the primary hostname for compute node 12 is “computenode12”. The second nodename entry defines additional hostname aliases. The FIRST hostname alias will always be the ‘dot-number’ notation, so compute node 12’s first hostname alias is “.12”, and the second hostname alias will be “cnode012”. The ‘%’ followed by a three specifies a three-digit field width format for the entry.
The following is an example of a complete Beowulf configuration file:

    # Beowulf Configuration file

    # Network interface used for Beowulf
    # Only first argument to interface is important
    interface eth1 192.168.1.1 255.255.255.0

    # These two should probably agree for most users
    iprange 192.168.1.100 192.168.1.107
    nodes 8

    # Default location of boot images
    bootfile /var/beowulf/boot.img
    kernelimage /boot/vmlinuz-2.4.17-0.18.12.beo
    kernelcommandline apm=power-off

    # Default libraries
    libraries /lib /usr/lib

    # Default file system policies.
    fsck full
    mkfs if_needed

    # beoserv settings
    server beofs2 932

    # Default Modules
    bootmodule 3c59x 8139too dmfe eepro100 epic100 hp100 natsemi
    bootmodule ne2k-pci pcnet32 sis900 starfire sundance tlan
    bootmodule tulip via-rhine winbond-840 yellowfin

    # Non-kernel integrated drivers
    bootmodule e100 bcm5700 # gm

    # Node assignment method
    nodeassign append

    # PCI Gigabit Ethernet.
    # * AceNIC and SysKonnect firmwares are very large.
    # * Some of these are distributed separate from the kernel
    bootmodule dl2k hamachi e1000 ns83820 # acenic sk98lin

    node 00:50:8B:D3:25:4D
    node 00:50:8B:D3:07:8B
    ignore 00:50:8B:D3:31:FB
    node 00:50:8B:D3:62:A0
    node 00:50:8B:D3:00:66
    node 00:50:8B:D3:30:42
    node 00:50:8B:D3:98:EA