Interacting with Compute Nodes

The primary tool for interacting with nodes from the command line is scyld-nodectl. An administrator uses this tool to add nodes, set or check a node's configuration details, view basic node hardware and status, join a node to or remove it from attribute groups, reboot or power down a node, and execute commands on nodes.

This section presents a number of examples, discusses what information an administrator can get and set through the scyld-nodectl tool, and references other resources for further details.

Nodes are named by default in the form of nX, where X is a numeric zero-based index. More complicated clusters may benefit from more flexible naming schemes. See Node Names and Pools for details.

Node Creation with Known MAC address(es)

When a new node's MAC address is known to the cluster administrator, the simplest way to add the node to the cluster is to use the scyld-nodectl create action and supply that node's MAC address:

scyld-nodectl create mac=11:22:33:44:55:66

and the node is assigned the next available node index and associated IP address.

The administrator can also add the node at an index other than the next available index, e.g., to add a node n10:

scyld-nodectl create mac=11:22:33:44:55:66 index=10

Of course, if a node already exists for the specified MAC or index, then an error is returned and no node is created.

Adding nodes one at a time would be tedious for a large cluster, so an administrator can also provide JSON formatted content to the create action. For example,

scyld-nodectl create --content @path/to/file.json

where that file.json contains an array of JSON objects, each object describing a single node, e.g., for two nodes:

[
  { "mac": "11:22:33:44:55:66" },
  { "mac": "00:11:22:33:44:55" }
]

The content argument can also directly accept JSON, or an INI formatted file, or a specially formatted text file. Details of how to use these alternative formats are available in the Reference Guide in Introduction to Tools.
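For example, a small batch of nodes might be passed inline as JSON (a sketch; the exact quoting expected for inline content is an assumption, so consult that reference for the authoritative forms):

scyld-nodectl create --content '[{"mac": "11:22:33:44:55:66"}, {"mac": "00:11:22:33:44:55"}]'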

Node Creation with Unknown MAC address(es)

A reset or powercycle of a node triggers a DHCP client request that embeds the node's MAC address. A head node with an interface listening on that private cluster network, and which recognizes that MAC address, will respond with the IP address associated with that MAC unless it has been directed to ignore the node. A ClusterWare head node can be directed to ignore a known-MAC node by using the _no_boot attribute (see _no_boot), and a ClusterWare 6 or 7 master node can employ a masterorder configuration directive in its /etc/beowulf/config file to consider this known-MAC node to be owned by another head/master node.
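For example, a specific known node can be ignored by setting its _no_boot attribute (a sketch; the value shown here is an assumption, so see _no_boot for the accepted values):

scyld-nodectl -i n5 set _no_boot=1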

A ClusterWare DHCP server which does not recognize the incoming MAC will by default ignore the incoming DHCP client request. To override this default:

scyld-clusterctl --set-accept-nodes True

and then any head node that shares the same database will add that new MAC to the shared ClusterWare database, assign to it the next available node index and associated IP address, and proceed to attempt to boot the node.

If a ClusterWare 6 or 7 beoserv daemon is alive and listening on the same private cluster network, then that master node should have its /etc/beowulf/config specify nodeassign locked, which directs its beoserv to ignore unknown MAC addresses.
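That directive is a single line in the master node's configuration file, e.g.:

# in /etc/beowulf/config on the ClusterWare 6 or 7 master node
nodeassign locked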

When all new nodes with previously unknown MAC addresses have thus been merged into the ClusterWare cluster, the cluster administrator should re-enable the default behavior with:

scyld-clusterctl --set-accept-nodes False

If multiple new nodes concurrently initiate their DHCP client requests, then the likely result is a jumbled assignment of indices and IP addresses. Cluster administrators often prefer nodes in a rack to have ordered indices and IP addresses. This ordered assignment can be accomplished afterward with carefully crafted scyld-nodectl update actions, e.g.,

scyld-nodectl -i n10 update index=100
scyld-nodectl -i n11 update index=101
scyld-nodectl -i n12 update index=102
scyld-nodectl -i n[100-102] reboot   # at a minimum, reboot the updated nodes

Note

Desired ordering can more easily be accomplished by performing the initial node resets or powercycling for each individual node in sequence, one at a time, and allowing each node to boot and get added to the database before initiating the next node's DHCP request.

Changing IP addresses

To change IP addresses on a cluster, generate a configuration file reflecting the current state of the nodes with their current IP addresses, edit the file to change one or more IP addresses as desired, reload the file, and trigger the head node to recompute the new addresses and update the database. For example:

scyld-cluster-conf save new_cluster.conf
# manually edit new_cluster.conf to change IP addresses
scyld-cluster-conf load new_cluster.conf
scyld-nodectl -i <NODES_THAT_CHANGE> update ip=

The new addresses are not seen by compute nodes until they reboot or perform a DHCP renewal.
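For example, reusing the placeholder from the commands above, the affected nodes can simply be rebooted to pick up their new addresses:

scyld-nodectl -i <NODES_THAT_CHANGE> reboot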

Replacing Failed Nodes

Since nodes are identified by their MAC addresses, replacing a node in the database is relatively simple. If the node (n23 in the following example) was repaired and the same network interface is still in use, then no changes are necessary; however, if the network card itself failed and was replaced, then the node's MAC address can be updated with one command:

scyld-nodectl -i n23 update mac=44:22:33:44:55:66

If the entire node was replaced, then instead of just updating the MAC address, the administrator would likely prefer to clear the node status and any history associated with that node. To do this, delete and recreate the failed node:

scyld-nodectl -i n23 delete
scyld-nodectl create index=23 mac=44:22:33:44:55:66
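The result can be checked by listing the recreated node, e.g.:

scyld-nodectl -i n23 ls -l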

Node Name Resolution

The scyld-install script installs the clusterware-dnsmasq package, which provides node name resolution services. Like the clusterware-iscdhcp package, this package depends on a standard OS-provided service, although it runs a private instance of that service, configuring it through the templated configuration file /opt/scyld/clusterware-dnsmasq/dnsmasq.conf.template. Within that file, fields such as "<DOMAIN>" are substituted with appropriate values from the cluster network configuration, and the resulting file is written out.

Specifically, the "domain" field (defaulting to .cluster.local) is appended to compute node names (n0, n1, etc.) to produce a fully-qualified domain name. That default value can be overridden in the cluster configuration provided at installation time or loaded via the scyld-cluster-conf command. Multiple domains can be defined in that configuration file and are applied to any subsequently defined network segments until a later line sets a new domain value. Note that when changing this value on an established cluster, the cluster administrator may want to only load the networking portion of the cluster configuration instead of recreating already configured compute nodes:

scyld-cluster-conf load --nets-only cluster.conf
sudo systemctl restart clusterware
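For reference, the domain is set in the cluster configuration file by a "domain" line placed before the network definition it applies to, e.g. (a sketch borrowing the syntax from the saved configuration shown later in Node Names and Pools; the network values are placeholders):

domain cluster.local
iprange 10.10.24.100/24 10.10.24.250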

By default, any hosts listed in the /etc/hosts file on the head node will also resolve on the compute nodes through dnsmasq. This setting and many others can be changed in the dnsmasq configuration template.

An administrator may modify the template file to completely remove the domain or to otherwise modify the dnsmasq configuration. Please see the dnsmasq project documentation for details of the options that service supports. Similarly, the dhcpd configuration template is located at /opt/scyld/clusterware-iscdhcp/dhcpd.conf.template, although as that service is much more integral to the proper operation of ClusterWare, changes should be kept to an absolute minimum. Administrators of more complicated clusters may add additional "options" lines or similarly remove the "option domain-name" line depending on their specific network needs. Additional DNS servers can also be provided to compute nodes through the "option domain-name-servers" lines. As with dnsmasq, please see the ISC DHCP documentation for supported options.

During compute node boot, dracut configures the bootnet interface of the node with the DNS servers and other network settings. These settings may be changed by cluster administrators in startup scripts as long as the head node(s) remain accessible to the compute nodes and vice versa.

During initial installation, the scyld-install script attempts to add the local dnsmasq instance (listening on the standard DNS port 53) as the first DNS server for the head node. If this is unsuccessful, DNS resolution will still work on compute nodes, although the administrator may need to add local DNS resolution before ssh and similar tools can reach the compute nodes. Please consult your Linux distribution documentation for details. Note that DNS is not used for compute node name resolution within the REST API or by the ClusterWare administrative tools; rather, the database is referenced in order to map node ids to IP addresses.

Executing Commands

A cluster administrator can execute commands on one or more compute nodes using the scyld-nodectl tool. For example:

scyld-nodectl -i n0 exec ls -l /

passes the command, e.g. ls -l /, to the head node together with a list of target compute nodes. The head node then uses its SSH key to ssh to each compute node, executes the command, and returns the output to the calling tool, which displays the results. Note that this relay through the REST API is done because the ClusterWare tools may be installed on a machine that is not a head node and cannot directly access the compute nodes.

Note that even if DNS resolution of compute node names is not possible on the local machine, scyld-nodectl exec will still work because it retrieves the node IP addresses from the ClusterWare database via the head node. Further, once an administrator has appropriate keys on the compute nodes and has DNS resolution of compute node names, they are encouraged to manage nodes either directly using the ssh or pdsh commands or at a higher level with a tool such as ansible.

Commands executed through scyld-nodectl exec run in parallel across the selected nodes. By default 64 nodes are accessed at a time, but this is adjustable by setting the ssh_runner.fanout variable to a larger or smaller number. This variable can be set in an administrator's ~/.scyldcw/settings.ini or in /opt/scyld/clusterware/conf/base.ini on a head node. Setting ssh_runner.fanout to a value less than or equal to 1 causes all commands to be executed serially across the nodes.
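For example, to widen the fanout to 128 nodes an administrator might add the following to ~/.scyldcw/settings.ini (a sketch; the exact layout of the file, including whether the dotted name appears at the top level or within a section, is an assumption, so consult the Reference Guide for the authoritative format):

ssh_runner.fanout = 128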

Some limited support is also provided for sending content to the stdin of the remote command. That content can be provided in a file via an option, e.g.:

scyld-nodectl -i n0 exec --stdin=@input.txt dd of=/root/output.txt

or the content can be provided directly:

scyld-nodectl -i n0 exec --stdin='Hello World' dd of=/root/output.txt

or the content can be piped to scyld-nodectl:

echo 'Hello world' | scyld-nodectl -i n0 exec dd of=/root/output.txt

When a command is executed on a single node, the command's stdout and stderr streams will be sent unmodified to the matching file descriptor of the scyld-nodectl command. This allows an administrator to include remote commands in a pipe much like ssh. For example:

echo 'Hello world' | scyld-nodectl -i n0 exec tr 'a-z' 'A-Z' > output.txt

will result in the local file output.txt containing the text "HELLO WORLD". The scyld-nodectl exec exit code will also be set to the exit code of the underlying command. When a command is executed on multiple nodes, the individual lines of the resulting output will be prefixed with the node names:

[admin@virthead]$ scyld-nodectl -in[0-1] exec ls -l
n0: total 4
n0: -rw-r--r--. 1 root root 13 Apr  5 20:39 output.txt
n1: total 0

When executing a command on multiple nodes, the exit code of the scyld-nodectl exec command will only be 0 if the command exits with a 0 on each node. Otherwise the tool return code will match the non-zero status of the underlying command from one of the failing instances.
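A quick way to confirm this exit-code propagation is to run a command with a known non-zero status and then inspect the local shell's status (a sketch):

scyld-nodectl -i n0 exec false
echo $?    # non-zero, matching the remote command's exit code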

The mechanism for passing stdin should not be used to transfer large amounts of data to the compute nodes, as the contents will be forwarded to the head node, briefly cached, and copied to all compute nodes. Further, if the data is passed as a stream, either by piping to the scyld-nodectl command or by passing the path to a large file via the --stdin=@/path/to/file mechanism, the nodes will be accessed serially, not in parallel, so that the stream can be rewound between executions. This is supported as a convenience for small payloads, but it is not efficient in large clusters. A more direct method such as scp or pdcp should be used when the content is more than a few megabytes in size. Also note that even when communicating with a single compute node, this is not truly interactive because all of stdin must be available and sent to the head node before the remote command is executed.
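For example, once keys are in place and compute node names resolve as described earlier, a large file can be copied directly (the file name and destination here are illustrative):

scp ./large-dataset.tar.gz n0:/tmp/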

Node Attributes

The names and uses of the fields associated with each database object are fixed, although nodes can be augmented with attribute lists for more flexible management. These attribute lists are stored in the attributes field of a node and consist of names (ideally legal Javascript variable names) and textual values. Attribute names prefixed with an underscore, such as _boot_config or _boot_style, are reserved for use by ClusterWare. These attributes may be referenced or modified by administrator-defined scripting, but be aware that changing their values will modify the behavior of ClusterWare.

Beyond their internal use, e.g. for controlling boot details, attributes are intended for cluster administrators to mark nodes for specific purposes, record important hardware and networking details, record physical rack locations, or capture whatever else the administrator may find useful. All attributes for a given node are available, and periodically updated, on the node in the file /opt/scyld/clusterware-node/etc/attributes.ini. The directory /opt/scyld/clusterware-node/etc/ is also symlinked to /etc/clusterware.

Attributes can also be collected together into attribute groups that are stored separately from the node database objects. Administrators can then assign nodes to these groups and thereby change the attributes for a selection of nodes all at once.

Each node has a list of groups to which it belongs, and the order of this list is important. Attribute groups appearing later in the list can override attributes provided by groups earlier in the list. For any given node there are two special groups: the global default group and the node-specific group. The global default group, which is defined during the installation process and initially named "DefaultAttribs", is always applied first, and the node-specific group contained in the node database object is always applied last. Any attribute group can be assigned to be the default group through the scyld-clusterctl command, e.g.,

scyld-clusterctl --set-group GroupNameOrUID

An example should clarify how attributes are determined for a node. Immediately after installation the "DefaultAttribs" group contains a single value:

[example@head ~]$ scyld-attribctl ls -l
Attribute Groups
  DefaultAttribs
    attributes
      _boot_config: DefaultBoot

Note that fields extraneous to this example have been trimmed from this output, although some are discussed further in the Reference Guide. Looking at two nodes on this same cluster:

[example@head ~]$ scyld-nodectl ls -l
Nodes
  n0
    attributes:
      _boot_config: DefaultBoot
    groups: []

  n1
    attributes:
      _boot_config: DefaultBoot
    groups: []

By default no attributes are defined at the node level, although all nodes inherit the _boot_config value from the "DefaultAttribs" group. If an administrator creates a new boot configuration (possibly by using the scyld-add-boot-config script mentioned earlier) and calls it "AlternateBoot", then she could assign a single node to that configuration using the scyld-nodectl tool, e.g.,

scyld-nodectl -i n0 set _boot_config=AlternateBoot

Examining the same nodes after this change would show:

[example@head ~]$ scyld-nodectl ls -l
Nodes
  n0
    attributes:
      _boot_config: AlternateBoot
    groups: []

  n1
    attributes:
      _boot_config: DefaultBoot
    groups: []

Of course, managing nodes by changing their individual attributes on a per-node basis is cumbersome in larger clusters, so a savvy administrator can create a group and assign nodes to that group:

scyld-attribctl create name=AltAttribs
scyld-attribctl -i AltAttribs set _boot_config=ThirdBoot

Assigning additional nodes to that group is done by "joining" them to the attribute group:

scyld-nodectl -i n[11-20] join AltAttribs

After the above changes, node n0 is assigned to the "AlternateBoot" configuration, n11 through n20 would boot using the "ThirdBoot" configuration, and any other nodes in the system will continue to use "DefaultBoot". This approach allows administrators to efficiently aggregate a set of nodes in anticipation of an action against the entire set, for example when testing new images, or if some nodes need specific configuration differences due to hardware differences such as containing GPU hardware.

For a more technical discussion of setting and clearing attributes as well as nodes joining and leaving groups, please see the appropriate section of the Reference Guide.

Node Names and Pools

By default all compute nodes are named nX, where X is a numeric zero-based node index. This pattern can be changed using "nodename" lines found in a cluster configuration file. For example, a line "nodename compute{}" early in such a file will change the default node naming to computeX. This changes both the default node hostnames as well as the names recognized by the scyld-nodectl command.

For homogeneous clusters where all compute nodes are essentially the same, this is usually adequate. In more complex environments, however, customized hostnames are useful for quickly identifying core compute node capabilities. For example, high memory nodes and general purpose GPU compute nodes could be named "hmX" and "gpgpuX". These names can be assigned via the _hostname attribute as described in Reserved Attributes, although the scyld-nodectl command will still refer to the nodes as "nX".
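For example, a high memory node could be given such a hostname through that attribute (a sketch; the node and hostname value are illustrative):

scyld-nodectl -i n37 set _hostname=hm1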

To support multiple name groupings within the scyld-*ctl tools, the ClusterWare system includes the concept of a naming pool. These pools are defined and modified through the scyld-clusterctl pools command line interface. Once the appropriate pools are in place, compute nodes can be added to them. Continuing the example described previously:

scyld-clusterctl pools create name=high_mem pattern=hm{} first_index=1
scyld-clusterctl pools create name=general_gpu pattern=gpgpu{} first_index=1
scyld-nodectl -in[37-40] update naming_pool=high_mem
scyld-nodectl -in[41,42] update naming_pool=general_gpu

After these changes the scyld-nodectl status and scyld-nodectl ls output will include the specified nodes as "hm[1-4]" and "gpgpu[1-2]". Any commands that previously used "nX" names will then accept "hmX" or "gpgpuX" names to refer to those renamed nodes. The first_index= field of the naming pool forces node numbering to begin with a specific value, defaulting to 0. Any nodes not explicitly attached to a naming pool will use the general cluster naming pattern controlled through the scyld-clusterctl --set-naming PATTERN command. This can be considered the default naming pool.
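For example, the default pool's pattern could be changed as follows (a sketch; the pattern value is illustrative):

scyld-clusterctl --set-naming 'node{}'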

Important

Please note that when moving multiple compute nodes from one naming pool to another, the node order may not be preserved. Moving them individually, or specifying their MAC addresses in a cluster configuration file, may be more predictable.
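When moving nodes one at a time, an explicit index can be supplied to control the renumbering directly, as discussed below (a sketch reusing the pool names from the earlier example; the node name and index are illustrative):

scyld-nodectl -i n43 update naming_pool=high_mem index=5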

When moving a node from one naming pool to another via the scyld-nodectl command, the node index will be reset to the next available index in the destination pool. Using an explicit index=X argument allows the cluster administrator to directly control the node renumbering. Note that nodes in different naming pools may have the same index, so in this configuration the index is no longer a unique identifier for individual nodes. Further, the --up, --down, --all node selectors are not restricted to a single naming pool and will affect nodes in all pools that match the selection constraint. Nodes in scyld-nodectl output will be ordered by index within their naming pool, although the order of the naming pools themselves is not guaranteed. For example:

[admin@head clusterware]$ scyld-nodectl ls
Nodes
  n1
  n2
  n3
  n4
  n5
  login6
  login7
  login8
  login9

Similarly, the nodes are grouped by naming pool in scyld-cluster-conf save output with "nodename" lines and explicit node indices inserted as needed:

[admin@head clusterware]$ scyld-cluster-conf save -
# Exported Scyld ClusterWare Configuration file
#
# This file contains the cluster configuration.
# Details of the syntax and semantics are covered in the
# Scyld ClusterWare Administrators Guide.
#
nodename n{}

# 10.10.24.0/24 network
domain cluster.local
1 10.10.24.101/24 10.10.24.115
node 1 00:00:00:00:00:01  # n1
node 00:00:00:00:00:02  # n2
node 00:00:00:00:00:03  # n3
node 00:00:00:00:00:04  # n4
node 00:00:00:00:00:05  # n5
nodename login{}
node 6 00:00:00:00:00:06  # login6
node 00:00:00:00:00:07  # login7
node 00:00:00:00:00:08  # login8
node 00:00:00:00:00:09  # login9

The organization of node naming pools is intentionally independent of node networking considerations. The cluster administrator may choose to combine these concepts by creating separate naming pools for each network segment, although this is not necessary.

Secondary DNS names can also be defined using "nodename":

nodename <pattern> <ip> [pool_name]

A "nodename" line containing an IP address (or IP offset such as "0.0.1.0") can define a name change at an offset within the IP space or define a secondary DNS name depending on whether the IP is within a defined network. For example:

iprange 10.10.124.100/24 10.10.124.250
node
node 08:00:27:F0:44:35  # n1 @ 10.10.124.101

nodename hello{}/5 10.10.124.105
node 08:00:27:A2:3F:C9  # hello5 @ 10.10.124.105

nodename world{}/10 10.10.124.155
node 12 08:00:27:E5:19:E5  # world12 @ 10.10.124.157

nodename n%N-ipmi 10.2.255.37 ipmi
# world12 maps to n2-ipmi @ 10.2.255.39

nodename world%N-ipmi/10 10.2.254.37 ipmi
# world12 maps to world12-ipmi @ 10.2.254.39

Note that the "<pattern>/X" syntax defines the lowest node index allowed within the naming pool.

Attribute Groups and Dynamic Groups

The scyld-install script creates a default attribute group called DefaultAttribs. That group can be modified or replaced, although all nodes are always joined to the default group. The cluster administrator can create additional attribute groups, e.g.,:

scyld-attribctl create name=dept_geophysics
scyld-attribctl create name=dept_atmospherics
scyld-attribctl create name=gpu

and then assign or remove one or more groups to specific nodes, e.g.,:

scyld-nodectl -i n[0-7] join dept_geophysics
scyld-nodectl -i n[8-11] join dept_atmospherics
scyld-nodectl -i n[0-3,7-9] join gpu
scyld-nodectl -i n7 leave gpu

These group assignments can be viewed either by specific nodes:

scyld-nodectl -i n0 ls -l
scyld-nodectl -i n[4-7] ls -l

or as a table:

[admin]$ scyld-nodectl --fields groups --table ls -l
Nodes |                       groups
------+-----------------------------
   n0 |   ['dept_geophysics', 'gpu']
   n1 |   ['dept_geophysics', 'gpu']
   n2 |   ['dept_geophysics', 'gpu']
   n3 |   ['dept_geophysics', 'gpu']
   n4 |          ['dept_geophysics']
   n5 |          ['dept_geophysics']
   n6 |          ['dept_geophysics']
   n7 |          ['dept_geophysics']
   n8 | ['dept_atmospherics', 'gpu']
   n9 | ['dept_atmospherics', 'gpu']
  n10 |        ['dept_atmospherics']
  n11 |        ['dept_atmospherics']
  n12 |                           []
  n13 |                           []
  n14 |                           []
  n15 |                           []

Scyld commands that accept group lists can reference nodes by their group name(s) (expressed with a % prefix) instead of their node names, e.g.,:

scyld-nodectl -i %dept_atmospherics
scyld-nodectl -i %gpu
scyld-nodectl -i %dept_geophysics status -L

Both the Kubernetes scyld-kube --init command (see Kubernetes) and the Job Scheduler ${jobsched}-scyld.setup init, reconfigure, and update-nodes actions accept --ids %<GROUP> as well as --ids <NODES> (see Job Schedulers).

In addition to attribute groups, ClusterWare also supports admin-defined dynamic groups using a query language that allows for simple compound expressions. These expressions can reference individual attributes, group membership, hardware fields, or status fields. For example, suppose we define attribute groups "dc1" and "dc2":

scyld-attribctl create name=dc1 description='Data center located in rear of building 1'
scyld-attribctl create name=dc2 description='Data center in building 2'

and then add nodes to appropriate groups:

scyld-nodectl -i n[0-31]  join dc1
scyld-nodectl -i n[32-63] join dc2

and for each node, identify its rack number in an attribute:

scyld-nodectl -i n[0-15]  set rack=1
scyld-nodectl -i n[16-31] set rack=2
scyld-nodectl -i n[32-47] set rack=1
scyld-nodectl -i n[48-63] set rack=2

Note that all attribute values are saved as strings, not integers, so subsequent selector expressions must enclose these values in double-quotes.

Now you can query a list of nodes in a particular rack of a particular building using a --selector (or -s) expression, and perform an action on the results of that selection:

scyld-nodectl -s 'in dc1 and attributes[rack] == "2"' status
# or use 'a' as the abbreviation of 'attributes'
scyld-nodectl -s 'in dc1 and a[rack] == "2"' set _boot_config=TestBoot

# Show the nodes that have 32 CPUs.
# These hardware _cpu_count values are integers, not strings, and are
#  not enclosed in double-quotes.
scyld-nodectl -s 'hardware[cpu_count] == 32' ls
# or use 'h' as the abbreviation of 'hardware'
scyld-nodectl -s 'h[cpu_count] == 32' ls

# Show the nodes that do not have 32 CPUs
scyld-nodectl -s 'h[cpu_count] != 32' ls

You can also create a dynamic group of a specific selector for later use:

scyld-clusterctl dyngroups create name=b1_rack1 selector='in dc1 and a[rack] == "1"'
scyld-clusterctl dyngroups create name=b1_rack2 selector='in dc1 and a[rack] == "2"'

# Show the nodes in building 1, rack 2
scyld-nodectl -i %b1_rack2 ls

# Show only those %b1_rack2 nodes with 32 CPUs
scyld-nodectl -i %b1_rack2 -s 'h[cpu_count] == 32' ls

You can list the dynamic groups using scyld-clusterctl:

# Show the list of dynamic groups
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups ls
Dynamic Groups
  b1_rack1
  b1_rack2

You can also show details of one or more dynamic groups. For example:

# Show the selector associated with a specific dynamic group
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups -i b1_rack1 ls -l
Dynamic Groups
  b1_rack1
    name: b1_rack1
    selector: in dc1 and a[rack] == "1"

# Or show the selector associated with a specific dynamic group in full detail
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups -i b1_rack1 ls -L
Dynamic Groups
  b1_rack1
    name: b1_rack1
    parsed: ((in "dc1") and (attributes["rack"] == "1"))
    selector: in dc1 and a[rack] == "1"

The parsed line in the above output can be useful when debugging queries to confirm how Scyld parsed the provided query text.