Interacting with Compute Nodes¶
The primary tool for interacting with nodes from the command line is
scyld-nodectl. This tool is how an administrator would add a
node, set or check configuration details of a node, see the basic node
hardware, see basic status, cause a node to join or leave attribute
groups, reboot or powerdown a node, or execute commands on the node.
In this section we will show a number of examples and discuss what
information an administrator can both get and set through the
scyld-nodectl tool, as well as reference other resources for further
details.
Nodes are named by default in the form of nX, where X is a numeric zero-based index. More complicated clusters may benefit from more flexible naming schemes. See Node Names and Pools for details.
Node Creation with Known MAC address(es)¶
When a new node's MAC address is known to the cluster administrator,
the simplest method to add the node to the cluster is to use the
scyld-nodectl create action and supply that node's MAC address:
scyld-nodectl create mac=11:22:33:44:55:66
and the node is assigned the next available node index and associated IP address.
The administrator can also add the node at an index other than the next available index, e.g., to add a node n10:
scyld-nodectl create mac=11:22:33:44:55:66 index=10
Of course, if a node already exists for the specified MAC or index, then an error is returned and no node is created.
Adding nodes one at a time would be tedious for a large cluster,
so an administrator can also provide JSON formatted content to the create
action. For example,
scyld-nodectl create --content @path/to/file.json
where that file.json contains an array of JSON objects,
each object describing a single node, e.g., for two nodes:
[
{ "mac": "11:22:33:44:55:66" },
{ "mac": "00:11:22:33:44:55" }
]
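When the MAC addresses are already collected in a list, the JSON file can be generated rather than typed by hand. The following is a minimal sketch, assuming a POSIX shell; the helper name macs_to_json is hypothetical and not part of ClusterWare, and it emits only the "mac" key shown above.

```shell
# Hypothetical helper (not part of ClusterWare): print a JSON array with one
# { "mac": ... } object per MAC address given as an argument.
macs_to_json() {
    printf '[\n'
    while [ "$#" -gt 0 ]; do
        # Every object except the last one is followed by a comma.
        if [ "$#" -gt 1 ]; then sep=','; else sep=''; fi
        printf '  { "mac": "%s" }%s\n' "$1" "$sep"
        shift
    done
    printf ']\n'
}

# Print the array for two nodes; redirect to a file to use it with --content.
macs_to_json 11:22:33:44:55:66 00:11:22:33:44:55
```

Redirecting the output to a file such as nodes.json would produce content in the same shape as the example above, which could then be passed as --content @nodes.json.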
The content argument can also directly accept JSON, or an INI formatted
file, or a specially formatted text file. Details of how to use these
alternative formats are available in the Reference Guide in
Introduction to Tools.
Node Creation with Unknown MAC address(es)¶
A reset or powercycle of a node triggers a DHCP client request which embeds the
node's MAC address.
A head node with an interface that is listening on that private cluster network
and which recognizes that MAC address will respond with an IP address that is
associated with that MAC, unless directed to ignore that node.
A ClusterWare head node can be directed to ignore the known-MAC node
by using a _no_boot attribute (see _no_boot),
and a ClusterWare 6 or 7 master node can employ a masterorder configuration
directive in its /etc/beowulf/config file to consider this known-MAC node to be
owned by another head/master node.
A ClusterWare DHCP server which does not recognize the incoming MAC will by default ignore the incoming DHCP client request. To override this default:
scyld-clusterctl --set-accept-nodes True
and then any head node that shares the same database will add that new MAC to the shared ClusterWare database, assign to it the next available node index and associated IP address, and proceed to attempt to boot the node.
If a ClusterWare 6 or 7 beoserv daemon is alive and listening on the
same private cluster network,
then that master node should have its /etc/beowulf/config specify
nodeassign locked, which directs its beoserv to ignore unknown
MAC addresses.
Once all new nodes with previously unknown MAC addresses have been merged into the ClusterWare cluster, the cluster administrator should restore the default behavior with:
scyld-clusterctl --set-accept-nodes False
If multiple new nodes concurrently initiate their DHCP client requests,
then the likely result is a jumbled assignment of indices and IP addresses.
Cluster administrators often prefer nodes in a rack to have ordered indices and
IP addresses.
This ordered assignment can be accomplished by performing subsequent
carefully crafted scyld-nodectl update actions, e.g.,
scyld-nodectl -i n10 update index=100
scyld-nodectl -i n11 update index=101
scyld-nodectl -i n12 update index=102
scyld-nodectl -i n10,n11,n12 reboot # at a minimum, reboot the updated nodes
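For a longer run of nodes, the update commands can be generated and reviewed before anything is executed. The following is a minimal sketch, assuming the default nX naming and that the target indices are not already in use; the helper name print_reindex_cmds is hypothetical, and it only prints commands rather than running them.

```shell
# Hypothetical helper: print (do not run) the update commands that move COUNT
# consecutive nodes starting at index OLD to consecutive indices starting at NEW.
# Usage: print_reindex_cmds OLD NEW COUNT
print_reindex_cmds() {
    old=$1 new=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'scyld-nodectl -i n%d update index=%d\n' "$((old + i))" "$((new + i))"
        i=$((i + 1))
    done
}

# Reproduce the three update commands from the example above.
print_reindex_cmds 10 100 3
```

Printing first makes it possible to inspect the list, and then pipe it to sh once it looks right, followed by a reboot of the affected nodes.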
Note
Desired ordering can more easily be accomplished by performing the initial node resets or powercycling for each individual node in sequence, one at a time, and allowing each node to boot and get added to the database before initiating the next node's DHCP request.
Changing IP addresses¶
To change IP addresses on a cluster, generate a configuration file of the current state of the nodes with their current IP addresses, edit the file to change one or more IP addresses as desired, reload the file, and trigger the head node to recompute the new addresses and update the database. For example:
scyld-cluster-conf save new_cluster.conf
# manually edit new_cluster.conf to change IP addresses
scyld-cluster-conf load new_cluster.conf
scyld-nodectl -i <NODES_THAT_CHANGE> update ip=
The new addresses are not seen by compute nodes until they reboot or perform a DHCP renewal.
Replacing Failed Nodes¶
Since nodes are identified by their MAC addresses, replacing a node in the database is relatively simple. If the node (n23 in the following example) was repaired but the same network interface is still being used, then no changes are necessary; however, if it was the network card that failed and it was replaced, then the node's MAC address can be updated with one command:
scyld-nodectl -i n23 update mac=44:22:33:44:55:66
If the entire node was replaced, then instead of just updating the MAC address, the administrator would likely prefer to clear the node status and any history associated with that node. To do this, delete and recreate the failed node:
scyld-nodectl -i n23 delete
scyld-nodectl create index=23 mac=44:22:33:44:55:66
Node Name Resolution¶
The scyld-install script installs the clusterware-dnsmasq package
which provides resolution services for head node names.
Similar to clusterware-iscdhcp,
this package depends on a standard OS provided service, but runs a
private instance of that service, configuring it through the templated
configuration file /opt/scyld/clusterware-dnsmasq/dnsmasq.conf.template.
Within that file, fields like "<DOMAIN>" are substituted with appropriate
values from the cluster network configuration, and the resulting file
is rewritten.
Specifically, the "domain" field (defaulting to .cluster.local) is appended
to compute node names (n0, n1, etc.) to produce a fully-qualified domain name.
That default value can be overridden in the
cluster configuration provided at installation time or loaded via the
scyld-cluster-conf command. Multiple domains can be defined in that
configuration file and are applied to any subsequently defined network
segments until a later line sets a new domain value.
Note that when changing this value on an established cluster, the
cluster administrator may want to only load the networking portion of
the cluster configuration instead of recreating already configured
compute nodes:
scyld-cluster-conf load --nets-only cluster.conf
sudo systemctl restart clusterware
By default, any hosts listed in the /etc/hosts file on the head
node will also resolve on the compute nodes through dnsmasq, as will
names added through the scyld-clusterctl hosts command. The
localise-queries keyword in the template file is provided because
head nodes commonly have multiple addresses on different networks and
dnsmasq should reply with the IP appropriate to the
requestor. Commenting out localise-queries will cause dnsmasq to
reply with all IPs for a queried name. To entirely prevent dnsmasq
from being populated with head node IPs, set leases.register_heads = False
in /opt/scyld/clusterware/conf/base.ini and restart the
ClusterWare service. These dnsmasq behaviors and many others can be
changed in the aforementioned configuration template.
An administrator may modify the template file to completely remove the domain
or to otherwise modify the dnsmasq configuration.
Please see the dnsmasq project documentation for
details of the options that service supports. Similarly, the dhcpd
configuration template is located at
/opt/scyld/clusterware-iscdhcp/dhcpd.conf.template, although as that
service is much more integral to the proper operation of ClusterWare,
changes should be kept to an absolute minimum. Administrators of more
complicated clusters may add additional "options" lines or similarly
remove the "option domain-name" line depending on their specific
network needs. Additional DNS servers can also be provided to compute
nodes through the "option domain-name-servers" lines. As with dnsmasq,
please see the ISC DHCP documentation for supported options.
During compute node boot, dracut configures the bootnet interface
of the node with the DNS servers and other network settings. These
settings may be changed by cluster administrators in startup scripts
as long as the head node(s) remain accessible to the compute nodes and
vice versa.
During initial installation, the scyld-install script attempts
to add the local dnsmasq instance (listening on the standard DNS
port 53) as the first DNS server for the head node. If this is
unsuccessful, DNS resolution will still work on compute nodes, although
the administrator may need to add local DNS resolution before ssh and
similar tools can reach the compute nodes. Please consult your Linux
distribution documentation for details. Note that DNS is not used for
compute node name resolution within the REST API or by the ClusterWare
administrative tools; rather, the database is referenced in order to map node
ids to IP addresses.
Executing Commands¶
A cluster administrator can execute commands on one or more compute
nodes using the scyld-nodectl tool. For example:
scyld-nodectl -i n0 exec ls -l /
passes the command, e.g. ls -l /, to the head node,
together with a list of target compute nodes. The head node will then ssh
to each compute node using the head node's SSH key, execute the
command, and return the output to the calling tool that will display
the results. Note that this relay through the REST API is done because
the ClusterWare tools may be installed on a machine that is not a head
node and is not able to directly access the compute nodes.
Note that even if DNS resolution of compute node names is not possible on
the local machine, scyld-nodectl exec will still work because it retrieves
the node IP addresses from the ClusterWare database via the head node.
Further, once an administrator has
appropriate keys on the compute nodes and has DNS resolution of
compute node names, they are encouraged to manage nodes either directly
using the ssh or pdsh commands or at a higher level with a
tool such as ansible.
Commands executed through scyld-nodectl exec are executed in
parallel across the selected nodes. By default 64 nodes are accessed
at a time, but this is adjustable by setting the ssh_runner.fanout
variable to a larger or smaller number. This variable can be set in an
administrator's ~/.scyldcw/settings.ini or can be set in
/opt/scyld/clusterware/conf/base.ini on a head node.
Setting the ssh_runner.fanout variable to a value less than or equal to 1
causes all commands to be executed serially across the nodes.
Some limited support is also provided for sending content to the
stdin of the remote command. That content can be provided in a
file via an option, e.g.:
scyld-nodectl -i n0 exec --stdin=@input.txt dd of=/root/output.txt
or the content can be provided directly:
scyld-nodectl -i n0 exec --stdin='Hello World' dd of=/root/output.txt
or the content can be piped to scyld-nodectl,
and this time optionally using redirection on the compute node to write
to the output file:
echo 'Hello world' | scyld-nodectl -i n0 exec cat '>' /root/output.txt
When a command is executed on a single node, the command's stdout
and stderr streams will be sent unmodified to the matching file
descriptor of the scyld-nodectl command. This allows an administrator
to include remote commands in a pipe much like ssh. For example:
echo 'Hello world' | scyld-nodectl -i n0 exec tr 'a-z' 'A-Z' > output.txt
will result in the local file output.txt containing the text
"HELLO WORLD". The scyld-nodectl exec exit code will also be set
to the exit code of the underlying command.
When a command is executed on multiple nodes, the
individual lines of the resulting output will be prefixed with the node names:
[admin@virthead]$ scyld-nodectl -in[0-1] exec ls -l
n0: total 4
n0: -rw-r--r--. 1 root root 13 Apr 5 20:39 output.txt
n1: total 0
When executing a command on multiple nodes, the exit code of the
scyld-nodectl exec command will only be 0 if the command exits
with a 0 on each node. Otherwise the tool return code will match the
non-zero status of the underlying command from one of the failing
instances.
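Because multi-node output is prefixed per line, the lines belonging to one node can be separated again with standard text tools. The following is a minimal sketch, assuming the prefix has exactly the "NODE: " form shown in the sample output above; the helper name lines_for_node is hypothetical.

```shell
# Hypothetical helper: read "NODE: text" lines on stdin and print only the
# text that belongs to the node named in $1, with the prefix removed.
lines_for_node() {
    awk -v node="$1" '
        index($0, node ":") == 1 {
            line = substr($0, length(node) + 2)   # drop the leading "NODE:"
            sub(/^ /, "", line)                   # drop one separating space
            print line
        }'
}

# Sample input copied from the multi-node listing above.
lines_for_node n0 <<'EOF'
n0: total 4
n0: -rw-r--r--. 1 root root 13 Apr 5 20:39 output.txt
n1: total 0
EOF
```

Matching on the node name followed by a colon keeps a name such as n1 from also matching n10.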
The mechanism for passing stdin should not be used to transfer large
amounts of data to the compute nodes, as the contents will be forwarded to the
head node, briefly cached, and copied to all compute nodes.
Further, if the data was passed as a stream either through piping to
the scyld-nodectl command or passing the path to a large file via the
--stdin=@/path/to/file mechanism, the nodes will be accessed
serially, not in parallel, so that the stream can be rewound between
executions. This is supported for convenience when passing small
payloads, but is not efficient in large clusters. A more
direct method such as scp or pdcp should be used when the
content is more than a few megabytes in size. Also note that even when
communicating with a single compute node, this is not truly interactive
because all of stdin must be available and sent to the head node before the
remote command is executed.
Node Attributes¶
The names and uses of the fields associated with each database object are fixed, although nodes may be augmented with attribute lists for more flexible management. These attribute lists are stored in the attributes field of a node and consist of names (ideally legal JavaScript variable names) and textual values. Attribute names prefixed with an underscore such as _boot_config or _boot_style are reserved for use by ClusterWare. These attributes may be referenced or modified by administrator defined scripting, but changing their values will modify the behavior of ClusterWare.
Beyond their internal use, e.g. for controlling boot details,
attributes are intended for use by cluster administrators to mark
nodes for specific purposes, record important hardware and networking
details, record physical rack locations, or whatever else the
administrator may find useful. All attributes for a given node are
available and periodically updated on the node in the file
/opt/scyld/clusterware-node/etc/attributes.ini.
The directory /opt/scyld/clusterware-node/etc/ is also
symlinked to /etc/clusterware.
Attributes can also be collected together into attribute groups that are stored separately from the node database objects. Administrators can then assign nodes to these groups and thereby change the attributes for a selection of nodes all at once.
Each node has a list of groups to which it belongs, and the order of this
list is important. Attribute groups appearing later in the list can
override attributes provided by groups earlier in the list. For any
given node there are two special groups: the global default group and
the node-specific group. The global default group, which is defined during the
installation process and initially named "DefaultAttribs", is always
applied first, and the node-specific group contained in the node
database object is always applied last. Any attribute group can be
assigned to be the default group through the scyld-clusterctl command, e.g.,
scyld-clusterctl --set-group GroupNameOrUID
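The override order described above (default group first, joined groups in list order, node-specific attributes last) amounts to "the last definition wins". The following is a minimal model of that rule, not ClusterWare code: the helper name resolve_attributes and the sample attribute values are hypothetical, and the input is assumed to be "name=value" lines ordered from lowest to highest precedence.

```shell
# Hypothetical model of the override order: read "name=value" lines on stdin,
# ordered from lowest to highest precedence, and print the final value of each
# attribute. Names are printed in the order they were first seen.
resolve_attributes() {
    awk '
        {
            pos = index($0, "=")
            if (pos < 2) next                       # skip lines without "name="
            name = substr($0, 1, pos - 1)
            if (!(name in final)) order[++n] = name
            final[name] = substr($0, pos + 1)       # later lines override earlier ones
        }
        END { for (i = 1; i <= n; i++) print order[i] "=" final[order[i]] }'
}

# Illustrative values only: default group, then one joined group, then the
# node-specific attributes.
{
    echo '_boot_config=DefaultBoot'   # from the global default group
    echo '_boot_config=TestBoot'      # from a group the node has joined
    echo 'rack=2'                     # from the node-specific group
} | resolve_attributes
```

In this model the joined group's _boot_config replaces the default, while rack is simply added because no earlier group defines it.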
An example should clarify how attributes are determined for a node. Immediately after installation the "DefaultAttribs" group contains a single value:
[example@head ~]$ scyld-attribctl ls -l
Attribute Groups
DefaultAttribs
attributes
_boot_config: DefaultBoot
Note that fields extraneous to this example have been trimmed from this output, although some are discussed further in the Reference Guide. Looking at two nodes on this same cluster:
[example@head ~]$ scyld-nodectl ls -l
Nodes
n0
attributes:
_boot_config: DefaultBoot
groups: []
n1
attributes:
_boot_config: DefaultBoot
groups: []
By default no attributes are defined at the node level, although all nodes
inherit the _boot_config value from the "DefaultAttribs" group.
If an administrator creates a new boot configuration (possibly by
using the scyld-add-boot-config script mentioned earlier) and
calls it "AlternateBoot", then she could assign a single node to
that configuration using the scyld-nodectl tool, e.g.,
scyld-nodectl -i n0 set _boot_config=AlternateBoot
Examining the same nodes after this change would show:
[example@head ~]$ scyld-nodectl ls -l
Nodes
n0
attributes:
_boot_config: AlternateBoot
groups: []
n1
attributes:
_boot_config: DefaultBoot
groups: []
Of course, managing nodes by changing their individual attributes on a per-node basis is cumbersome in larger clusters, so a savvy administrator can create a group and assign nodes to that group:
scyld-attribctl create name=AltAttribs
scyld-attribctl -i AltAttribs set _boot_config=ThirdBoot
Assigning additional nodes to that group is done by "joining" them to the attribute group:
scyld-nodectl -i n[11-20] join AltAttribs
After the above changes, node n0 is assigned to the "AlternateBoot" configuration, nodes n11 through n20 boot using the "ThirdBoot" configuration, and any other nodes in the system continue to use "DefaultBoot". This approach allows administrators to efficiently aggregate a set of nodes in anticipation of an action against the entire set, for example when testing new images, or when some nodes need specific configuration differences due to hardware differences such as containing GPU hardware.
For a more technical discussion of setting and clearing attributes as well as nodes joining and leaving groups, please see the appropriate section of the Reference Guide.
Node Names and Pools¶
By default all compute nodes are named nX, where X is a numeric
zero-based node index.
This pattern can be changed using "nodename" lines found in
a cluster configuration file. For example, a line "nodename compute{}"
early in such a file will change the default node naming to
computeX. This changes both the default node hostnames and
the names recognized by the scyld-nodectl command.
For homogeneous clusters where all compute nodes are essentially the
same, this is usually adequate, although in more complex environments there is
utility in quickly identifying core compute node capabilities reflected by
customized hostnames.
For example, high memory nodes and general purpose GPU
compute nodes could be named "hmX" and "gpgpuX". These names can
be assigned via the _hostname attribute as described in
Reserved Attributes, although the scyld-nodectl command will
still refer to them as "nX".
To support multiple name groupings within the scyld-*ctl tools, the
ClusterWare system includes the concept of a naming pool. These
pools are defined and modified through the scyld-clusterctl pools
command line interface. Once the appropriate pools are in place, then
compute nodes can be added to those pools. Continuing the example
described previously:
scyld-clusterctl pools create name=high_mem pattern=hm{} first_index=1
scyld-clusterctl pools create name=general_gpu pattern=gpgpu{} first_index=1
scyld-nodectl -in[37-40] update naming_pool=high_mem
scyld-nodectl -in[41,42] update naming_pool=general_gpu
After these changes the scyld-nodectl status and scyld-nodectl ls
output will include the specified nodes as "hm[1-4]" and
"gpgpu[1-2]". Any commands that previously used "nX" names will
then accept "hmX" or "gpgpuX" names to refer to those renamed
nodes. The first_index= field of the naming pool forces
node numbering to begin with a specific value, defaulting to 0. Any
nodes not explicitly attached to a naming pool will use the general
cluster naming pattern controlled through the
scyld-clusterctl --set-naming PATTERN command. This can be
considered the default naming pool.
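The relationship between a pool's pattern, its first_index, and the resulting names can be previewed with a small script. The following is a minimal sketch of the expansion only, assuming nodes receive consecutive indices starting at first_index; the helper name pool_names is hypothetical and does not query or modify the cluster.

```shell
# Hypothetical illustration: expand a naming pattern by replacing its "{}"
# placeholder with consecutive indices starting at FIRST.
# Usage: pool_names PATTERN FIRST COUNT
pool_names() {
    pattern=$1 first=$2 count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$pattern" | sed "s/{}/$((first + i))/"
        i=$((i + 1))
    done
}

# Four nodes in the high_mem pool from the example above: hm1 through hm4.
pool_names 'hm{}' 1 4
```

This mirrors the "hm[1-4]" result described above; which physical node receives which name is decided by ClusterWare, not by this sketch.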
Important
Please note that when moving multiple compute nodes from one naming pool to another, the node order may not be preserved. Moving them individually, or specifying their MAC addresses in a cluster configuration file, may be more predictable.
When moving a node from one naming pool to another via the
scyld-nodectl command, the node index will be reset to the next
available index in the destination pool. Using an explicit index=X
argument allows the cluster administrator to directly control the node
renumbering. Note that nodes in different naming pools may have the
same index, so in this configuration the index is no longer a unique
identifier for individual nodes. Further, the --up, --down, and
--all node selectors are not restricted to a single naming pool
and will affect nodes in all pools that match the selection
constraint. Nodes in scyld-nodectl output will be ordered by index
within their naming pool, although the order of the naming pools themselves
is not guaranteed. For example:
[admin@head clusterware]$ scyld-nodectl ls
Nodes
n1
n2
n3
n4
n5
login6
login7
login8
login9
Similarly, the nodes are grouped by naming pool in
scyld-cluster-conf save output with "nodename" lines and
explicit node indices inserted as needed:
[admin@head clusterware]$ scyld-cluster-conf save -
# Exported Scyld ClusterWare Configuration file
#
# This file contains the cluster configuration.
# Details of the syntax and semantics are covered in the
# Scyld ClusterWare Administrators Guide.
#
nodename n{}
# 10.10.24.0/24 network
domain cluster.local
1 10.10.24.101/24 10.10.24.115
node 1 00:00:00:00:00:01 # n1
node 00:00:00:00:00:02 # n2
node 00:00:00:00:00:03 # n3
node 00:00:00:00:00:04 # n4
node 00:00:00:00:00:05 # n5
nodename login{}
node 6 00:00:00:00:00:06 # login6
node 00:00:00:00:00:07 # login7
node 00:00:00:00:00:08 # login8
node 00:00:00:00:00:09 # login9
The organization of node naming pools is intentionally independent of node networking considerations. The cluster administrator may choose to combine these concepts by creating separate naming pools for each network segment, although this is not necessary.
Secondary DNS names can also be defined using "nodename":
nodename <pattern> <ip> [pool_name]
A "nodename" line containing an IP address (or IP offset such as "0.0.1.0") can define a name change at an offset within the IP space or define a secondary DNS name depending on whether the IP is within a defined network. For example:
iprange 10.10.124.100/24 10.10.124.250
node
node 08:00:27:F0:44:35 # n1 @ 10.10.124.101
nodename hello{}/5 10.10.124.105
node 08:00:27:A2:3F:C9 # hello5 @ 10.10.124.105
nodename world{}/10 10.10.124.155
node 12 08:00:27:E5:19:E5 # world12 @ 10.10.124.157
nodename n%N-ipmi 10.2.255.37 ipmi
# world12 maps to n2-ipmi @ 10.2.255.39
nodename world%N-ipmi/10 10.2.254.37 ipmi
# world12 maps to world12-ipmi @ 10.2.254.39
Note that the "<pattern>/X" syntax defines the lowest node index allowed within the naming pool.
Attribute Groups and Dynamic Groups¶
The scyld-install script creates a default attribute group called
DefaultAttribs.
That group can be modified or replaced,
although all nodes are always joined to the default group.
The cluster administrator can create additional attribute groups, e.g.:
scyld-attribctl create name=dept_geophysics
scyld-attribctl create name=dept_atmospherics
scyld-attribctl create name=gpu
and then assign one or more groups to specific nodes, or remove nodes from a group, e.g.:
scyld-nodectl -i n[0-7] join dept_geophysics
scyld-nodectl -i n[8-11] join dept_atmospherics
scyld-nodectl -i n[0-3,7-9] join gpu
scyld-nodectl -i n7 leave gpu
These group assignments can be viewed either by specific nodes:
scyld-nodectl -i n0 ls -l
scyld-nodectl -i n[4-7] ls -l
or as a table:
[admin]$ scyld-nodectl --fields groups --table ls -l
Nodes | groups
------+-----------------------------
n0 | ['dept_geophysics', 'gpu']
n1 | ['dept_geophysics', 'gpu']
n2 | ['dept_geophysics', 'gpu']
n3 | ['dept_geophysics', 'gpu']
n4 | ['dept_geophysics']
n5 | ['dept_geophysics']
n6 | ['dept_geophysics']
n7 | ['dept_geophysics']
n8 | ['dept_atmospherics', 'gpu']
n9 | ['dept_atmospherics', 'gpu']
n10 | ['dept_atmospherics']
n11 | ['dept_atmospherics']
n12 | []
n13 | []
n14 | []
n15 | []
Scyld commands that accept group lists can reference nodes by their group name(s) (expressed with a % prefix) instead of their node names, e.g.:
scyld-nodectl -i %dept_atmospherics
scyld-nodectl -i %gpu
scyld-nodectl -i %dept_geophysics status -L
Both the Kubernetes scyld-kube --init command (see Kubernetes)
and the Job Scheduler ${jobsched}-scyld.setup init, reconfigure,
and update-nodes actions accept --ids %<GROUP> as well as
--ids <NODES> (see Job Schedulers).
In addition to attribute groups, ClusterWare also supports admin-defined dynamic groups using a query language that allows for simple compound expressions. These expressions can reference individual attributes, group membership, hardware fields, or status fields. For example, suppose we define attribute groups "dc1" and "dc2":
scyld-attribctl create name=dc1 description='Data center located in rear of building 1'
scyld-attribctl create name=dc2 description='Data center in building 2'
and then add nodes to appropriate groups:
scyld-nodectl -i n[0-31] join dc1
scyld-nodectl -i n[32-63] join dc2
and for each node, identify its rack number in an attribute:
scyld-nodectl -i n[0-15] set rack=1
scyld-nodectl -i n[16-31] set rack=2
scyld-nodectl -i n[32-47] set rack=1
scyld-nodectl -i n[48-63] set rack=2
Note that all attribute values are saved as strings, not integers, so that subsequent selector expressions must enclose these values in double-quotes.
Now you can query a list of nodes in a particular rack of a particular
building using a --selector (or -s) expression, and perform
an action on the results of that selection:
scyld-nodectl -s 'in dc1 and attributes[rack] == "2"' status
# or use 'a' as the abbreviation of 'attributes'
scyld-nodectl -s 'in dc1 and a[rack] == "2"' set _boot_config=TestBoot
# Show the nodes that have 32 CPUs.
# These hardware cpu_count values are integers, not strings, and are
# not enclosed in double-quotes.
scyld-nodectl -s 'hardware[cpu_count] == 32' ls
# or use 'h' as the abbreviation of 'hardware'
scyld-nodectl -s 'h[cpu_count] == 32' ls
# Show the nodes that do not have 32 CPUs
scyld-nodectl -s 'h[cpu_count] != 32' ls
You can also create a dynamic group of a specific selector for later use:
scyld-clusterctl dyngroups create name=b1_rack1 selector='in dc1 and a[rack] == "1"'
scyld-clusterctl dyngroups create name=b1_rack2 selector='in dc1 and a[rack] == "2"'
# Show the nodes in building 1, rack 2
scyld-nodectl -i %b1_rack2 ls
# Show only those %b1_rack2 nodes with 32 CPUs
scyld-nodectl -i %b1_rack2 -s 'h[cpu_count] == 32' ls
You can list the dynamic groups using scyld-clusterctl:
# Show the list of dynamic groups
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups ls
Dynamic Groups
b1_rack1
b1_rack2
You can also show details of one or more dynamic groups. For example:
# Show the selector associated with a specific dynamic group
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups -i b1_rack1 ls -l
Dynamic Groups
b1_rack1
name: b1_rack1
selector: in dc1 and a[rack] == "1"
# Or show the selector associated with a specific dynamic group in full detail
[admin1@headnode1 ~]$ scyld-clusterctl dyngroups -i b1_rack1 ls -L
Dynamic Groups
b1_rack1
name: b1_rack1
parsed: ((in "dc1") and (attributes["rack"] == "1"))
selector: in dc1 and a[rack] == "1"
The parsed line in the above output can be useful when debugging queries to confirm how Scyld parsed the provided query text.
State maps¶
A common task for a cluster administrator is identifying specific nodes that are out of compliance in some way and executing actions to solve such issues. These actions often involve temporarily removing the node(s) from production while performing testing, reprovisioning, and requalification. As problem nodes are identified, new nodes are added, or nodes are transferred from one configuration to another, the cluster administrator must have some means to keep track of the progress of each node through these processes. After all, these processes involve multiple stages, likely spanning multiple reboots or even reimaging. ClusterWare node attributes can be leveraged for persistently storing this progress information.
For example, when a node health check detects a memory issue on a GPU, other tasks may dictate that power cycling the node for row remapping cannot occur immediately. Instead, the health checking code could set an attribute noting what was detected. Then, a separate process could see that attribute and initiate the steps of removing the node from production, rebooting it, triggering requalification tests, and moving it back into production if all goes well.
Of course, this simple detection and mitigation process only covers one type of failure and one possible resolution. The GPU or other hardware could fail in a myriad of ways, each requiring different mitigation strategies. This means that a node’s health check results or progress through requalification may be stored across multiple node attributes.
The ClusterWare node selection language allows a cluster administrator to identify nodes that match possibly complex criteria by matching attribute values, detected status, or hardware details using basic comparators and logical operators. See the section Attribute Groups and Dynamic Groups above for examples of node selectors.
Polling the ClusterWare service for attribute status frequently or across many nodes is inefficient and in extreme cases can impact the head node performance. To alleviate this, ClusterWare provides a 'scyld-nodectl waitfor' mechanism. One common use is to wait for a node to boot before proceeding with additional steps in an overall command, for example:
scyld-nodectl -in10 reboot then waitfor up then exec uptime
The 'up' is shorthand for a longer selector, specifically 'status[state] == "up"', and can be replaced with more complicated selectors if, for example, the administrator is not rebooting the node but executing a command that will modify a node attribute when it completes. This sort of command chaining with “then” allows for simple automation, but more complex automation will deal with multiple nodes at different stages. For that case, ClusterWare allows administrators to provide a set of selectors referred to as a state map.
Using a state map, a cluster administrator can track nodes through scenarios including:
* Error detection, handling, and requalification
* A rolling firmware update process
* Idle-time performance testing
State maps provide a general purpose mechanism to select groups of nodes based on their status and configuration, trigger actions, and observe the resulting changes. An example state map is provided as part of the 'clusterware-tools' package:
$ cat /opt/scyld/clusterware-tools/examples/node-states.ini
[status]
up = status[state] == "up"
down = status[state] == "down"
booting = status[state] == "booting"
This INI format defines a state map named “status” containing 3 states named “up”, “down”, and “booting”. The selector that defines each state is provided to the right of the name, after the equal sign. This file can also be written in JSON format as:
{
"name": "status",
"states": {
"up": "status[state] == \"up\"",
"down": "status[state] == \"down\"",
"booting": "status[state] == \"booting\""
}
}
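Since the two formats carry the same information, one can be derived from the other mechanically. The following is a minimal sketch of such a conversion, assuming a single [section] and plain "name = selector" lines; the function name statemap_ini_to_json is hypothetical and not a ClusterWare tool, and comment lines or multiple sections are not handled.

```shell
# Hypothetical converter (not a ClusterWare tool): read an INI state map on
# stdin and print the equivalent JSON shown above.
statemap_ini_to_json() {
    awk '
        # Escape backslashes and double quotes so the selector is a valid JSON string.
        function json_escape(s,    out, i, c) {
            out = ""
            for (i = 1; i <= length(s); i++) {
                c = substr(s, i, 1)
                if (c == "\"" || c == "\\") out = out "\\"
                out = out c
            }
            return out
        }
        # Remove leading and trailing spaces and tabs.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        /^\[.*\]$/ { name = substr($0, 2, length($0) - 2); next }   # map name
        /=/ {
            pos = index($0, "=")                 # split on the first "=" only
            keys[++n] = trim(substr($0, 1, pos - 1))
            vals[n]   = json_escape(trim(substr($0, pos + 1)))
        }
        END {
            printf "{\n  \"name\": \"%s\",\n  \"states\": {\n", name
            for (i = 1; i <= n; i++)
                printf "    \"%s\": \"%s\"%s\n", keys[i], vals[i], (i < n ? "," : "")
            printf "  }\n}\n"
        }'
}

# Convert the example state map from above.
statemap_ini_to_json <<'EOF'
[status]
up = status[state] == "up"
down = status[state] == "down"
booting = status[state] == "booting"
EOF
```

Splitting each line at the first "=" matters here because the selectors themselves contain "==".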
The cluster administrator can load the state map through the 'scyld-nodectl' command:
scyld-nodectl waitfor --load-only @node-status.json
Once the state map is loaded, the 'waitfor' command can also be used to see what nodes match what selectors by referencing the loaded map name, i.e. “status” in this example:
$ scyld-nodectl waitfor --name status
Nodes
n[5-8,10]: up
Additional arguments are available to allow for streaming state transitions or simplifying the output for easier parsing:
$ scyld-nodectl waitfor --stream --name status
n[1-4] in down
n[5-10] in up
n[5] left up
n[5] entered down
n[5] left down
n[5] entered booting
n[5] left booting
n[5] entered up
In the above example, a single node in a 10 node cluster was rebooted and state transitions were emitted as the node progressed from “up” to “down” to “booting” and back to “up”.
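Streamed transitions lend themselves to simple line-oriented filtering. The following is a minimal sketch, assuming the "NODES entered STATE" line format shown in the sample output above; the helper name nodes_entering is hypothetical.

```shell
# Hypothetical filter: read streamed transition lines on stdin and print the
# node specification each time something enters the state named in $1.
nodes_entering() {
    awk -v state="$1" '$2 == "entered" && $3 == state { print $1 }'
}

# Sample input copied from the streaming example above.
nodes_entering up <<'EOF'
n[1-4] in down
n[5-10] in up
n[5] left up
n[5] entered down
n[5] left down
n[5] entered booting
n[5] left booting
n[5] entered up
EOF
```

A filter like this could sit at the end of a pipe fed by the streaming command, so that a follow-up action is triggered only when nodes arrive in a state of interest; the initial "in" lines are deliberately ignored because they describe existing state rather than a transition.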