Managing Multiple Head Nodes

ClusterWare supports optional active-active configurations of two or more cooperating head nodes that share a single replicated database. Such multi-headnode configurations allow any head node to provide services for any compute node in the cluster. These services include cluster configuration using the scyld-* tools, compute node booting and power control, and compute node status collection.

The ClusterWare etcd database requires a minimum of three cooperating head nodes to support full High Availability ("HA") in the event of head node failures. The etcd HA works in a limited manner with just two head nodes. The command-line tools provided by ClusterWare for head node management are intended to cover the majority of common cases.

Adding A Head Node

After installing the first head node as described in Installation and Upgrade of Scyld ClusterWare, additional head nodes can be installed and joined with the other cooperating head nodes using the same scyld-install tool or using curl.

On an existing head node, view its database password:

sudo grep database.admin_pass /opt/scyld/clusterware/conf/base.ini

Join a non-ClusterWare server

A non-ClusterWare server can use scyld-install to join another head node (identified by its IP address IP_HEAD) that may itself already be joined to other head nodes. You can download scyld-install without needing a cluster ID from the Penguin repo at https://updates.penguincomputing.com/clusterware/12/installer/scyld-install, https://updates.penguincomputing.com/clusterware/12/installer-el8/scyld-install, or https://updates.penguincomputing.com/clusterware/12/installer-el9/scyld-install. Alternatively, if you already have a /etc/yum.repos.d/clusterware.repo installed, you can download the clusterware-installer package, which includes scyld-install. Then:

scyld-install --database-passwd <DBPASS> --join <IP_HEAD>

where DBPASS is IP_HEAD's database password, as described above. If no --database-passwd is provided as an argument, then scyld-install queries the administrator interactively for IP_HEAD's database password.

When performing a join, scyld-install installs ClusterWare using the same clusterware.repo and database type as the head node at IP_HEAD.

A cluster configuration file is not required when joining a server to a head node because those settings are obtained from the existing head node's cluster database after the join successfully completes.

Join a ClusterWare head node

A "solo" ClusterWare head node can use scyld-install to join another head node (identified by its IP address IP_HEAD) that may itself already be joined to other head nodes.

Important

The join action discards the "solo" head node's current images and boot configs, leaving that head node with access to only the cooperating head nodes' images and boot configs. If you want to save any images or configs, first use scyld-bootctl export (see Copying boot configurations between head nodes) or managedb save.
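
For example, a minimal sketch of preserving a single boot config before the join, assuming a hypothetical boot config named myBoot and using the export form shown later in Copying boot configurations between head nodes:

# Export the hypothetical boot config "myBoot" to the file myBoot.export,
# then copy that file off this head node before running the join.
scyld-bootctl -i myBoot export
scp myBoot.export backup-host:/path/to/backups/    # "backup-host" is a placeholder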

For example, to join a ClusterWare head node:

scyld-install --update --database-passwd <DBPASS> --join <IP_HEAD>

When a ClusterWare 12 head node joins another ClusterWare 12 head node, scyld-install performs a mandatory update of the current ClusterWare using IP_HEAD's clusterware.repo prior to joining that IP_HEAD. This ensures that the joining head node (and its scyld-install) are running ClusterWare software compatible with IP_HEAD.

However, the ClusterWare 11 version of scyld-install will not automatically perform this mandatory update from 11 to 12; it will only update the joining head node to the newest version of ClusterWare 11. Penguin Computing recommends first updating the ClusterWare 11 head node to 12 (following the guidance in Updating ClusterWare 11 to ClusterWare 12), and then using the ClusterWare 12 scyld-install to perform the join.

Just as when joining a non-ClusterWare server, if no --database-passwd is provided as an argument, then scyld-install queries the administrator interactively for IP_HEAD's database password.

After a Join

Important

Every head node must know the hostname and IP address of every other head node, either by having those hostnames in each head node's /etc/hosts or by having their common DNS server know all the hostnames. Additionally, if using head nodes as default routes for the compute nodes, as described in Configure IP Forwarding, then ensure that all head nodes are configured to forward IP traffic preferably over the same routes.
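
For example, a sketch of consistent /etc/hosts entries that every head node might carry; the hostnames and addresses below are hypothetical placeholders:

# Identical entries replicated into each head node's /etc/hosts (placeholder values)
10.54.0.1   head1.cluster.local   head1
10.54.0.2   head2.cluster.local   head2
10.54.0.3   head3.cluster.local   head3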

Important

Every head node should use a common network time-sync protocol. The RHEL default is chronyd (provided by the chrony package), although ntpd (provided by the ntp package) remains available.
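
For example, assuming the default chronyd is in use, standard chrony tooling can confirm that a head node's clock is synchronized:

# Verify that chronyd is running and check its synchronization status
sudo systemctl status chronyd
chronyc tracking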

Subsequent head node software updates are also accomplished by executing scyld-install -u. We recommend that all cooperating head nodes update to a common ClusterWare release. In rare circumstances a newer ClusterWare release on the head nodes also requires a compatible newer clusterware-node package in each compute node image. Such a coordinated update will be documented in the Release Notes and Changelog & Known Issues.

Cleaning up From Join Failures

The scyld-install --join <IP> command ensures that the database.admin_pass variable is properly set in /opt/scyld/clusterware/conf/base.ini and logs errors into ~/.scyldcw/logs/install_<TIMESTAMP>.log. If the join fails, check that log file for details and report any errors to Penguin. There are several approaches to retrying the join.

Before retrying the join, the cluster administrator should ensure that the existing cluster has not been negatively affected by the failed join. Appropriate recovery depends on the initial number of head nodes in the cluster.

For a single head node:

The HA mechanisms expect at least 3 head nodes, so if joining fails while adding the second head node to the cluster, then etcd on the initial head node may be left in a bad state. Check for this by running:

/opt/scyld/clusterware/bin/managedb --heads

If that command results in an error then the existing head node requires recovery via:

/opt/scyld/clusterware/bin/managedb recover

Once recovery has completed then managedb --heads will show a single head node and scyld-*ctl commands will work as expected. At this point the other head nodes can be joined.

When joining to a cluster with more than one head node:

The complete join process consists of two stages. In the first stage, the joining head node reaches out to the existing head node to retrieve the etcd peer URLs and adds itself as an etcd member. In the second stage, the joining head node reconfigures its local etcd instance to communicate with the existing etcd instances.

If a failure happens during the first stage, existing head nodes will not be modified and the join can be retried without additional intervention. If a failure occurs during the second stage, then running managedb --heads on the existing heads will show the joining head node, and that remnant needs to be ejected before proceeding. To do so, run the following on an existing head node:

/opt/scyld/clusterware/bin/managedb eject <JOINING_HEAD_IP>

After cleaning:

Once the appropriate steps are complete, the cluster administrator can retry the head node join process. Possible reasons for join failure include:

Networking issues

  • If the head nodes cannot reach each other at all then we expect a failure in stage one and the join can be retried once the issue is resolved.

  • If the head nodes can reach the API but not the etcd port (default 52380), usually due to a firewall configuration, then we expect a failure in the second stage, and cleanup will be required before retrying; see the firewall example after this list.

Duplicate head node UIDs

  • If a head node is created by cloning an existing VM and the cluster administrator does not replace the head.uid in the base.ini, we expect a failure in the second stage and cleanup is required before fixing the problem and retrying.
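
As a sketch of resolving the firewall case above, assuming the head nodes run firewalld and etcd uses its default port of 52380, the port can be opened on each head node before retrying the join:

# Assumes firewalld; open the default ClusterWare etcd peer port, then reload
sudo firewall-cmd --permanent --add-port=52380/tcp
sudo firewall-cmd --reload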

If the joining process fails multiple times and the joining head node was previously a member of this or some other cluster, it can be useful to provide a --purge argument to the join process. This causes the joining head node to completely remove any local database content before attempting to join the cluster:

/opt/scyld/clusterware/bin/managedb join --purge <IP>

Removing a Joined Head Node

A list of connected head nodes can be seen with:

sudo /opt/scyld/clusterware/bin/managedb --heads

Head nodes also store status information in the ClusterWare database. That content can be viewed via:

scyld-clusterctl heads ls -l

Sometimes a cluster administrator will need to temporarily or permanently remove a head node from the cluster during the course of an RMA, upgrade, or other maintenance task. For a cluster with three or more head nodes you can remove one of the head nodes by running managedb leave on that head node:

sudo /opt/scyld/clusterware/bin/managedb leave

Or if that head node is shut down or otherwise unavailable, then it can be ejected by another head node in the cluster by running:

sudo /opt/scyld/clusterware/bin/managedb eject <IP_HEAD_TO_REMOVE>

Note that this command will attempt to stop some services on the targeted head node, and that step may fail; such a failure does not necessarily mean that the eject itself has failed. The now-detached head node will no longer have access to the shared database and will be unable to execute any scyld-* command, as those require a database. Either re-join the previous cluster:

sudo /opt/scyld/clusterware/bin/managedb join <IP_HEAD>

or join another cluster by updating the local /opt/scyld/clusterware/conf/base.ini database.admin_pass to the other cluster's database password and then joining to a head node in that other cluster. After joining a cluster by directly using the managedb join command, the clusterware service should be restarted on the joining head node.
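
For example, using the restart command described later in Peer Downloads:

sudo systemctl restart clusterware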

The node can also be wiped and reinstalled:

scyld-install --clear-all --config <CLUSTER_CONFIG>

However, for a cluster with only two head nodes you cannot use managedb eject or managedb leave, and instead must execute managedb recover, thereby "ejecting" the other head node:

sudo /opt/scyld/clusterware/bin/managedb recover

This command reduces the head node cluster to a single head node, specifically the one where the command is executed, severing the connection to the shared database. If this command is executed on one of three or more head nodes, the remaining head nodes will continue to operate as an independent cluster.

Important

Keep in mind that if managedb recover is run on both head nodes then both head nodes will have their own copies of the now-severed database that manages the same set of compute nodes, which means that both will compete for "ownership" of the same booting compute nodes.

To avoid both head nodes competing for the same compute nodes, do one of the following:

  • Execute sudo systemctl stop clusterware on one of the head nodes.

  • Perform one of the steps described above to re-join this head node to the other head node that previously shared the same database.

  • Join this head node to another cluster.

  • Perform a fresh ClusterWare install.

Peer Downloads

Multi-head clusters replicate data between head nodes on local storage to preserve a copy of each uploaded or requested file. The storage location is defined in /opt/scyld/clusterware/conf/base.ini by the local_files.path variable, and it defaults to /opt/scyld/clusterware/storage/.
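
For example, to check the currently configured location (assuming the variable is set in base.ini rather than left at its built-in default):

sudo grep local_files.path /opt/scyld/clusterware/conf/base.ini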

Whenever a ClusterWare head node is asked for a file such as a kernel, the expected file size and checksum are retrieved from the database. If the file exists in local storage and has the correct size and checksum, then that local file will be provided. However, if the file is missing or incorrect, then the head node attempts to retrieve the correct file from a peer.

Note that local files whose checksums do not match will be renamed with a .old.NN extension, where NN starts at 00 and increases up to 99 with each successive bad file. This ensures that in the unlikely event that the checksum in the database is somehow corrupted, the original file can be manually restored.
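
For example, a quick way to spot such renamed files, assuming the default storage location:

# List any checksum-mismatch leftovers under the default storage directory
sudo find /opt/scyld/clusterware/storage/ -name '*.old.[0-9][0-9]'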

Peer downloading consists of the requesting head node retrieving the list of all head nodes from the database and contacting each of them in a random order. The first peer that confirms it has a file with the correct size provides that file to the requesting head node. The checksum is computed during the transfer, and the transferred file is discarded if that checksum is incorrect. Contacted peers will not themselves attempt to download the file from other peers; this avoids having a completely missing file trigger a cascade of downloads.

After a successful peer download, the original requester receives the file contents, delayed by the time needed for the peer transfer. If the file cannot be retrieved from any head node, then the original requester will receive an HTTP 404 error.

This peer download process can be bypassed by providing shared storage among head nodes. Such storage should either be mounted at the storage directory location prior to installation, or the /opt/scyld/clusterware/conf/base.ini should be updated with the non-default pathname immediately after installation of each head node. Remember to restart the clusterware service after modifying the base.ini file by executing sudo systemctl restart clusterware, and note that the systemd clusterware.service is currently an alias for the httpd.service.
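
For example, a minimal sketch of pointing a head node at shared storage; the mount point /shared/cw_storage is a hypothetical placeholder, and the [local_files] section name is inferred from the local_files.path variable naming used above, so confirm it against your installed base.ini:

# Hypothetical fragment of /opt/scyld/clusterware/conf/base.ini
[local_files]
path = /shared/cw_storage

# Then restart the clusterware service on this head node
sudo systemctl restart clusterware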

When a boot configuration or image is deleted from the cluster, the deleting head node will remove the underlying file(s) from its local storage. That head node will also temporarily move the file's database entry into a deleted files list that other head nodes periodically check and delete matching files from their own local storage. If the clusterware service is not running on a head node when a file is marked as deleted, then that head node will not be able to delete the local copy. When the service is later restarted, it will see its local file is now no longer referenced by the database and will rename it with the .old.NN extension described earlier. This is done to inform the administrator that these files are not being used and can be removed, although cautious administrators may wish to keep these renamed files until they confirm all node images and boot configurations are working as expected.

Booting With Multiple Head Nodes

Since all head nodes are connected to the same private cluster network, any compute node's DHCP request will receive offers from all the head nodes. All offers contain the same IP address because all head nodes share the same MAC-to-IP and node index information in the replicated database. The PXE client on the node accepts one of the DHCP offers, usually the first received, and proceeds to boot with the offering head node as its "parent head node". This parent head node provides the kernel and initramfs files during the PXE process, and provides the root file system for the booting node, all of which should also be replicated in /opt/scyld/clusterware/storage/ (or in the alternative non-default location specified in /opt/scyld/clusterware/conf/base.ini).

On a given head node you can determine the compute nodes for which it is the parent by examining the head node /var/log/clusterware/head_* or /var/log/clusterware/api_error_log* files for lines containing "Booting node". On a given compute node you can determine its parent by examining the node's /etc/hosts entry for parent-head-node.
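
For example, using the log files and /etc/hosts entry named above:

# On a head node: list the compute nodes for which this head is the parent
sudo grep "Booting node" /var/log/clusterware/head_* /var/log/clusterware/api_error_log*

# On a compute node (e.g., over ssh): show which head node is its parent
grep parent-head-node /etc/hosts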

Once a node boots, it asks its parent head node for a complete list of head nodes, and thereafter the node sends periodic status information to the parent head node at the top of that list. If at any point the parent head node does not respond to the compute node's status update, then the compute node chooses a new parent by rotating its list of available head nodes: the unresponsive parent moves to the bottom of the list, and the second entry moves to the top and becomes the new parent.

The administrator can force compute nodes to re-download the head node list by executing scyld-nodectl script fetch_hosts and specifying one or more compute nodes. The administrator can also refresh the SSH keys on the compute node using scyld-nodectl script update_keys.
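
For example, assuming the -i node selector used elsewhere in this document also applies to the script subcommand:

# Refresh the head node list on the selected compute nodes
scyld-nodectl -i <NODES> script fetch_hosts

# Refresh the SSH keys on the same compute nodes
scyld-nodectl -i <NODES> script update_keys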

Clusters of 100 nodes or more benefit from having each head node act as parent to roughly the same number of compute nodes. Each head node periodically computes the current mean number of nodes per head, and if a head node parents significantly more (e.g., >20%) nodes than the mean, then the head node triggers some of its nodes to use another head node. Care is taken to avoid unnecessary shuffling of compute nodes. The use of the _preferred_head attribute may create an imbalance that this rebalancing cannot remedy.

Copying boot configurations between head nodes

A multiple head node cluster contains cooperating head nodes that share a replicated database and transparent access to peer boot configurations, kernel images, and initramfs files. See Managing Multiple Head Nodes for details. There is no need to manually copy boot configs between these head nodes.

However, it may be useful to copy boot configurations from a head node that controls one cluster to another head node that controls a separate cluster, thereby allowing the same boot config to be employed by compute nodes in the target cluster. On the source head node the administrator "exports" a boot config to create a single all-inclusive self-contained file that can be copied to a target head node. On the target head node the administrator "imports" that file into the local cluster database, where it merges with the local head node's existing configs, images, and files.

Important

Prior to exporting/importing a boot configuration, you should determine if the boot config and kernel image names on the source cluster already exist on the target cluster. For example, for a boot configuration named xyzBoot, execute scyld-bootctl -i xyzBoot ls -l on the source head node to view the boot config name xyzBoot and note its image name, e.g., xyzImage. Then on the target head node execute scyld-bootctl ls -l | egrep "xyzBoot|xyzImage" to determine if duplicates exist.

If any name conflict exists, then either (1) on the source head node create or clone a new uniquely named boot config associated with a uniquely named image, then export that new boot config, or (2) on the target head node import the boot config using optional arguments, as needed, to assign a unique name or names.

To export the boot configuration xyzBoot:

scyld-bootctl -i xyzBoot export

which creates the file xyzBoot.export. If there are no name conflicts with the target cluster, then on the target head node import with:

scyld-bootctl import xyzBoot.export

If there is a name conflict with the image name, then perform the import with the additional argument to rename the imported image:

scyld-bootctl import xyzBoot.export --image uniqueImg

or import the boot config without importing its embedded image at all (and later associate a new image with this imported boot config):

scyld-bootctl import xyzBoot.export --no-recurse

If there is a name conflict with the boot config name itself, then perform the import with the additional argument to rename the imported boot config:

scyld-bootctl import xyzBoot.export --boot-config uniqueBoot

Associate a new image name to the imported boot config if desired, then associate the boot config with the desired compute node(s):

scyld-nodectl -i <NODES> set _boot_config=xyzBoot