Additional Software

Additional ClusterWare Software

scyld-install installs and updates the basic ClusterWare software. Additional software packages are available in the ClusterWare repository.

scyld-install manipulates the /etc/yum.repos.d/clusterware.repo file to automatically enable the scyld repos when the tool executes and disable the repos when finished. This is done to avoid inadvertent updating of ClusterWare packages when executing a simple yum update.

Note

If the cluster administrator has created multiple /etc/yum.repos.d/*.repo files that specify repos containing ClusterWare RPMs, then this protection against inadvertent updating is performed only for /etc/yum.repos.d/clusterware.repo, not for those additional repo files.
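For reference, a clusterware.repo stanza resembles the following sketch. The repo name, URL, and option values here are illustrative, not the actual contents written by scyld-install; the point is that scyld-install sets enabled=1 while it executes and restores enabled=0 when it finishes:

```
# /etc/yum.repos.d/clusterware.repo -- illustrative stanza only
[scyld]
name=Scyld ClusterWare
baseurl=https://example.com/clusterware/el7/
gpgcheck=1
enabled=0
```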

Accordingly, the --enablerepo=scyld* argument is required when using yum for listing, installing, and updating these optional ClusterWare packages on a head node. For example, these optional installable software packages can be viewed using yum list --enablerepo=scyld* | grep scyld. After installation, any available updates can be viewed using yum check-update --enablerepo=scyld* | grep scyld.

Specific install and configuration instructions for several of these packages, e.g., job managers and OpenMPI middleware, are detailed in this chapter.

Adding 3rd-party Software

An existing compute node image may need to contain additional software (e.g., a driver and perhaps the driver's associated software) that has been downloaded from a 3rd-party vendor in the form of an RPM or a tarball.

Suppose a tarball named driver-tarball.tgz has been downloaded into the head node /tmp/ directory, and you need to install its contents into an image. A cautious first step is to clone an existing image and add the new software to that clone, which leaves the existing image unmodified. For example, clone a new image:

scyld-imgctl -i DefaultImage clone name=UpdatedImage

Now enter the new UpdatedImage in a chroot environment:

scyld-modimg -i UpdatedImage --chroot

Suppose your ClusterWare administrator user name is admin1. Inside the chroot you are always user root. Copy the downloaded tarball from the head node into the chroot by executing a simple command from inside the chroot:

scp -r admin1@localhost:/tmp/driver-tarball.tgz /tmp

Unpack /tmp/driver-tarball.tgz and examine the contents, where you will likely find a script that manages the tarball-specific software installation.

Important

Carefully read the instructions provided by the 3rd-party software vendor before executing the script, and carefully read the output produced when executing the script.

There are several factors to keep in mind when executing the 3rd-party install script:

  • A 3rd-party installation that involves a new kernel module requires linking that module against the kernel in the chroot. This requires the presence of the kernel-devel package that matches that kernel. If that RPM is not currently installed in the chroot, then manually install it with yum inside the chroot, naming the specific kernel version, e.g.:

    yum install kernel-devel-3.10.0-957.27.2.el7.x86_64
    

    to match kernel-3.10.0-957.27.2.el7.x86_64.

  • Some 3rd-party install scripts use the uname command to determine the kernel against which to link a new kernel module. However, when uname executes inside a chroot, it actually reports the kernel version of the host system that executes the scyld-modimg --chroot command, not the kernel that has been installed inside the chroot. This uname behavior is correct for module linking only if the chroot contains exactly one kernel and that kernel matches the kernel on the server executing scyld-modimg --chroot. To specify an alternate kernel, either name that kernel as an optional argument to --chroot, e.g.:

    scyld-modimg -i NewImage --chroot 3.10.0-1160.45.1.el7.x86_64
    

    or as a KVER variable value using the --execute argument, e.g., for a script inside the image that initializes a software driver module and links that module to a specific kernel:

    scyld-modimg -i NewImage --execute 'KVER=3.10.0-1160.45.1.el7.x86_64 /path/to/script'
    

    Otherwise, the 3rd-party install script may itself support an optional argument that specifies the intended kernel version, such as:

    /path/to/install-script -k 3.10.0-1160.45.1.el7.x86_64
    
  • If the 3rd-party install script encounters a missing dependency RPM, then the script reports the missing package name(s) and fails. You must manually yum install those missing RPM(s) within the chroot and reexecute the script.

  • Some 3rd-party install scripts replace RPMs that were installed from the base distribution, e.g., InfiniBand, OFED. If any currently installed ClusterWare packages declare these base distribution packages as dependencies, then the install script's attempt to replace those packages fails. You must then uninstall the affected ClusterWare package(s) (e.g., openmpi3.1, openmpi3.1-intel), then retry the install script. In some cases the 3rd-party tarball contains packages that replace the ClusterWare package(s). In other cases you can reinstall the ClusterWare package(s) after the 3rd-party install script successfully completes.
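The kernel-devel matching described in the first bullet can be scripted. This sketch derives the matching kernel-devel package name from a kernel package name using shell parameter expansion; the version string is only an example:

```shell
# Example kernel package name (an assumption -- run `rpm -q kernel`
# inside the chroot to see what is actually installed):
KERNEL_PKG=kernel-3.10.0-957.27.2.el7.x86_64

# The matching kernel-devel package simply swaps the package prefix:
DEVEL_PKG="kernel-devel-${KERNEL_PKG#kernel-}"
echo "$DEVEL_PKG"    # kernel-devel-3.10.0-957.27.2.el7.x86_64
```

Inside the chroot, $DEVEL_PKG can then be passed to yum install.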

Finally, exit the chroot and specify to Keep changes, Replace local image, Upload image, and Replace remote image.

Job Schedulers

The default Scyld ClusterWare installation for RHEL/CentOS 7 includes support for the optional job scheduler packages Slurm and PBS TORQUE, and for RHEL/CentOS 8 includes support for the optional packages Slurm and OpenPBS. These optional packages can coexist on a scheduler server, which may or may not be a ClusterWare head node. However, if job schedulers are installed on the same server, then only one at a time should be enabled and executing on that given server.

All nodes in the job scheduler cluster must be able to resolve the hostnames of all other nodes as well as the scheduler server hostname. ClusterWare provides a DNS server in the clusterware-dnsmasq package, as discussed in Node Name Resolution. This dnsmasq resolves all compute node hostnames; the job scheduler's hostname should be added to /etc/hosts on the head node(s) so that dnsmasq can resolve it. Whenever /etc/hosts is edited, restart the clusterware-dnsmasq service with:

sudo systemctl restart clusterware-dnsmasq
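For example, if the job scheduler runs on a separate server, its /etc/hosts entry on the head node(s) might look like the following; the address and hostnames are illustrative:

```
# /etc/hosts on the head node(s) -- illustrative entry
10.54.0.100   jobsched.cluster.local   jobsched
```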

Installing and configuring a job scheduler requires making changes to the compute node software. When using image-based compute nodes, we suggest first cloning the DefaultImage or creating a new image, leaving untouched the DefaultImage as a basic known-functional pristine image.

For example, to set up nodes n0 through n3, you might first do:

scyld-imgctl -i DefaultImage clone name=jobschedImage
scyld-bootctl -i DefaultBoot clone name=jobschedBoot image=jobschedImage
scyld-nodectl -i n[0-3] set _boot_config=jobschedBoot

When these nodes reboot after all the setup steps are complete, they will use the jobschedBoot and jobschedImage.

See https://slurm.schedmd.com/rosetta.pdf for a discussion of the differences between PBS TORQUE and Slurm. See https://slurm.schedmd.com/faq.html#torque for useful information about how to transition from OpenPBS or PBS TORQUE to Slurm.

The following sections describe the installation and configuration of each job scheduler type.

Slurm

See Job Schedulers for general job scheduler information and configuration guidelines. See https://slurm.schedmd.com for Slurm documentation.

Note

As of ClusterWare 12, the default slurm-scyld configuration is Configless; see https://slurm.schedmd.com/configless_slurm.html for more information. This reduces the admin effort needed when updating the list of compute nodes.

First install Slurm software on the job scheduler server:

sudo yum install slurm-scyld --enablerepo=scyld*

Important

For RHEL/CentOS 8, install Slurm with an additional argument:

sudo yum install slurm-scyld --enablerepo=scyld* --enablerepo=powertools

For RHEL/Rocky 9, install Slurm with an additional argument:

sudo yum install slurm-scyld --enablerepo=scyld* --enablerepo=crb

Now use the helper script slurm-scyld.setup to complete the initialization and set up the job scheduler and the compute node image(s).

Note

The slurm-scyld.setup script performs the init, reconfigure, and update-nodes actions (described below) by default against all up nodes. Those actions optionally accept a node-specific argument using the syntax [--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>]. See Attribute Groups and Dynamic Groups for details.

slurm-scyld.setup init                        # default to all 'up' nodes

init first generates /etc/slurm/slurm.conf by installing slurm-scyld-node and running slurmd -C on 'up' nodes. By default, configless Slurm is enabled by "SlurmctldParameters=enable_configless" in /etc/slurm/slurm.conf, and a DNS SRV record called slurmctld_primary is created. To see the details of that SRV record, execute scyld-clusterctl hosts -i slurmctld_primary ls -l.
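An excerpt of a generated /etc/slurm/slurm.conf might resemble the following sketch; every value shown is illustrative, and the actual file is produced by init from the slurmd -C output:

```
# Illustrative /etc/slurm/slurm.conf excerpt -- values are examples only
ClusterName=cluster
SlurmctldHost=head1
SlurmctldParameters=enable_configless
NodeName=n[0-3] CPUs=8 RealMemory=15885 State=UNKNOWN
PartitionName=normal Nodes=ALL Default=YES State=UP
```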

Note

For clusters with a backup Slurm controller, create a slurmctld_backup DNS SRV record:

scyld-clusterctl --hidden hosts create name=slurmctld_backup port=6817 service=slurmctld domain=cluster.local target=backuphostname type=srvrec priority=20

However, if there are no 'up' nodes or the slurm-scyld-node installation fails for some reason, then no node is configured in slurm.conf during init. Later you can use reconfigure to create a new slurm.conf or update-nodes to update the nodes in an existing slurm.conf. init also generates /etc/slurm/cgroup.conf and /etc/slurm/slurmdbd.conf, starts munge, slurmctld, mariadb, and slurmdbd, and restarts slurmctld. Finally, init tries to start slurmd on the nodes. If the script succeeds in installing slurm-scyld-node on the compute nodes, then srun -N 1 hostname works after init.

The slurmd installation and configuration on 'up' nodes do not survive a node reboot, except on diskful compute nodes. To make a persistent Slurm image:

slurm-scyld.setup update-image slurmImage     # for permanence in the image

By default, update-image omits the Slurm config files from slurmImage when configless mode is enabled, and includes them otherwise. You can override this default behavior by appending an additional argument, --copy-configs or --remove-configs, after slurmImage in the above command.

Reboot the compute nodes to bring them into active management by Slurm. Check the Slurm status:

slurm-scyld.setup status

If any services on the controller (slurmctld, slurmdbd, and munge) or on the compute nodes (slurmd and munge) are not running, you can use systemctl to start an individual service, or use slurm-scyld.setup cluster-restart, slurm-scyld.setup restart, or slurm-scyld.setup start-nodes to restart Slurm cluster-wide, on the controller only, or on the nodes only, respectively.

Note

The above restart and start commands do not affect slurmImage.

The update-image is necessary for persistence across compute node reboots.

Generate new slurm-specific config files with:

slurm-scyld.setup reconfigure      # default to all 'up' nodes

Add nodes by executing:

slurm-scyld.setup update-nodes     # default to all 'up' nodes

or add or remove nodes by directly editing the /etc/slurm/slurm.conf config file.

Note

With Configless Slurm, the slurmImage does NOT need to be reconfigured after new nodes are added -- Slurm will automatically forward the new information to the slurmd daemons on the nodes.

Inject users into the compute node image using the sync-uids script. The administrator can inject all users, or a selected list of users, or a single user. For example, inject the single user janedoe:

/opt/scyld/clusterware-tools/bin/sync-uids \
              -i slurmImage --create-homes \
              --users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub

See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

To view the Slurm status on the server and compute nodes:

slurm-scyld.setup status

The Slurm service can also be started and stopped cluster-wide with:

slurm-scyld.setup cluster-stop
slurm-scyld.setup cluster-start

Slurm executable commands and libraries are installed in /opt/scyld/slurm/. The Slurm controller configuration can be found in /etc/slurm/slurm.conf, and each node caches a copy of that slurm.conf file in /var/spool/slurmd/conf-cache/. Each Slurm user must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the Slurm commands. This is done automatically, via the /etc/profile.d/scyld.slurm.sh script, for users who log in while Slurm is running. Alternatively, each Slurm user can manually execute module load slurm or can add that command line to (for example) the user's ~/.bash_profile or ~/.bashrc.
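The environment setup performed by the profile script or by module load slurm amounts to prepending the Slurm directories to the search paths. A manual equivalent, assuming the default /opt/scyld/slurm/ prefix (the bin and lib64 subdirectory names here are assumptions), is:

```shell
# Prepend the assumed Slurm install directories to the search paths:
export PATH=/opt/scyld/slurm/bin:$PATH
export LD_LIBRARY_PATH=/opt/scyld/slurm/lib64:$LD_LIBRARY_PATH
```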

For a traditional config-file-based Slurm deployment, the admin must push the new /etc/slurm/slurm.conf file out to the compute nodes and then restart slurmd. Alternatively, the admin can modify the boot image to include the new config file, and then reboot the nodes into that new image.

PBS TORQUE

PBS TORQUE is only available for RHEL/CentOS 7 clusters. See Job Schedulers for general job scheduler information and configuration guidelines. See https://www.adaptivecomputing.com/support/documentation-index/torque-resource-manager-documentation for PBS TORQUE documentation.

First install PBS TORQUE software on the job scheduler server:

sudo yum install torque-scyld --enablerepo=scyld*

Now use the helper script torque-scyld.setup to complete the initialization and set up the job scheduler and config file in the compute node image(s).

Note

The torque-scyld.setup script performs the init, reconfigure, and update-nodes actions (described below) by default against all up nodes. Those actions optionally accept a node-specific argument using the syntax [--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>]. See Attribute Groups and Dynamic Groups for details.

torque-scyld.setup init                       # default to all 'up' nodes
torque-scyld.setup update-image torqueImage   # for permanence in the image

Reboot the compute nodes to bring them into active management by TORQUE. Check the TORQUE status:

torque-scyld.setup status

# If the TORQUE daemon is not executing, then:
torque-scyld.setup cluster-restart

# And check the status again

This cluster-restart is a manual one-time setup that doesn't affect the torqueImage. The update-image is necessary for persistence across compute node reboots.

Generate new torque-specific config files with:

torque-scyld.setup reconfigure      # default to all 'up' nodes

Add nodes by executing:

torque-scyld.setup update-nodes     # default to all 'up' nodes

or add or remove nodes by directly editing the /var/spool/torque/server_priv/nodes config file. Any such changes must be added to torqueImage by reexecuting:

torque-scyld.setup update-image torqueImage

and then either reboot all the compute nodes with that updated image, or additionally execute:

torque-scyld.setup cluster-restart

to manually push the changes to the up nodes without requiring a reboot.
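The /var/spool/torque/server_priv/nodes file mentioned above uses a simple one-entry-per-line format; an illustrative excerpt (node names and np processor counts are examples):

```
# Illustrative /var/spool/torque/server_priv/nodes entries
n0 np=8
n1 np=8
n2 np=8
n3 np=8
```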

Inject users into the compute node image using the sync-uids script. The administrator can inject all users, or a selected list of users, or a single user. For example, inject the single user janedoe:

/opt/scyld/clusterware-tools/bin/sync-uids \
              -i torqueImage --create-homes \
              --users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub

See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

To view the TORQUE status on the server and compute nodes:

torque-scyld.setup status

The TORQUE service can also be started and stopped cluster-wide with:

torque-scyld.setup cluster-stop
torque-scyld.setup cluster-start

TORQUE executable commands are installed in /usr/sbin/ and /usr/bin/, and TORQUE libraries in /usr/lib64/, so all are accessible via the default search paths.

OpenPBS

OpenPBS is only available for RHEL/CentOS 8 clusters.

See Job Schedulers for general job scheduler information and configuration guidelines. See https://www.openpbs.org for OpenPBS documentation.

First install OpenPBS software on the job scheduler server:

sudo yum install openpbs-scyld --enablerepo=scyld*

Use the helper script openpbs-scyld.setup to complete the initialization and set up the job scheduler and config file in the compute node image(s).

Note

The openpbs-scyld.setup script performs the init, reconfigure, and update-nodes actions (described below) by default against all up nodes. Those actions optionally accept a node-specific argument using the syntax [--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>]. See Attribute Groups and Dynamic Groups for details.

openpbs-scyld.setup init                      # default to all 'up' nodes
openpbs-scyld.setup update-image openpbsImage # for permanence in the image

Reboot the compute nodes to bring them into active management by OpenPBS. Check the OpenPBS status:

openpbs-scyld.setup status

# If the OpenPBS daemon is not executing, then:
openpbs-scyld.setup cluster-restart

# And check the status again

This cluster-restart is a manual one-time setup that doesn't affect the openpbsImage. The update-image is necessary for persistence across compute node reboots.

Generate new openpbs-specific config files with:

openpbs-scyld.setup reconfigure      # default to all 'up' nodes

Add nodes by executing:

openpbs-scyld.setup update-nodes     # default to all 'up' nodes

or add or remove nodes by executing qmgr.
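For example, qmgr can add or remove a node with one-line directives; the node name below is illustrative, and the path assumes the OpenPBS install prefix described at the end of this section:

```shell
# Add a node to the OpenPBS server's node list, then remove it:
/opt/scyld/openpbs/bin/qmgr -c "create node n4"
/opt/scyld/openpbs/bin/qmgr -c "delete node n4"
```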

Any such changes must be added to openpbsImage by reexecuting:

openpbs-scyld.setup update-image openpbsImage

and then either reboot all the compute nodes with that updated image, or additionally execute:

openpbs-scyld.setup cluster-restart

to manually push the changes to the up nodes without requiring a reboot.

Inject users into the compute node image using the sync-uids script. The administrator can inject all users, or a selected list of users, or a single user. For example, inject the single user janedoe:

/opt/scyld/clusterware-tools/bin/sync-uids \
              -i openpbsImage --create-homes \
              --users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub

See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

To view the OpenPBS status on the server and compute nodes:

openpbs-scyld.setup status

The OpenPBS service can also be started and stopped cluster-wide with:

openpbs-scyld.setup cluster-stop
openpbs-scyld.setup cluster-start

OpenPBS executable commands and libraries are installed in /opt/scyld/openpbs/. Each OpenPBS user must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the OpenPBS commands. This is done automatically for users who login when OpenPBS is running via the /etc/profile.d/scyld.openpbs.sh script. Alternatively, each OpenPBS user can manually execute module load openpbs or can add that command line to (for example) the user's ~/.bash_profile or ~/.bashrc.

Kubernetes

ClusterWare administrators wanting to use Kubernetes as a container orchestration layer across their cluster can either choose to install Kubernetes manually following directions found online, or use scripts provided by the clusterware-kubeadm package. To use these scripts, first install the clusterware-kubeadm package on a server that is a Scyld ClusterWare head node, a locally installed ClusterWare compute node, or a separate non-ClusterWare server. Installing the control plane on a RAM-booted or otherwise ephemeral compute node is discouraged.

The provided scripts are based on the kubeadm tool and inherit both the benefits and limitations of that tool. If you prefer to use a different tool to install Kubernetes, please follow the appropriate directions available online from your chosen Kubernetes provider. The clusterware-kubeadm package is mandatory, and the clusterware-tools package is recommended:

sudo yum --enablerepo=scyld* install clusterware-kubeadm clusterware-tools

Important

For a server to function as a Kubernetes control plane, SELinux must be disabled (verify with getenforce) and swap must be turned off (verify with swapon -s, disable with swapoff -a -v).
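These checks can be run directly on the prospective control plane server:

```shell
getenforce          # should report "Disabled"
swapon -s           # no output means swap is already off
sudo swapoff -a -v  # turn swap off for the current boot if needed
```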

After installing the software, as a cluster administrator execute the scyld-kube tool to initialize the Kubernetes control plane. To initialize on a local server:

scyld-kube --init

Or to initialize on an existing booted ClusterWare compute node (e.g., node n0):

scyld-kube --init -i n0

Note that a ClusterWare cluster can have multiple control planes and can use them in a Kubernetes High Availability (HA) configuration. See Appendix: Using Kubernetes for detailed examples.

You can validate this initialization by executing:

kubectl get nodes

which should show the newly initialized control plane server.

Next, join one or more booted ClusterWare nodes (e.g., nodes n[1-3]) as worker nodes of this Kubernetes cluster. The full command syntax accomplishes this by explicitly identifying the control plane node by its IP address:

scyld-kube -i n[1-3] --join --cluster <CONTROL_PLANE_IP_ADDR>

However, if the control plane node is a ClusterWare compute node, then the scyld-kube --init process has defined Kubernetes-specific attributes, and a simpler syntax suffices:

scyld-kube -i n[1-3] --join

The simpler join command can find the control plane node without being told its IP address, as long as only one compute node is functioning as a Kubernetes control plane.

Note that scyld-kube --join also accepts admin-defined group names, e.g., for a collection of nodes joined to the kube_workers group:

scyld-kube -i %kube_workers --join --cluster <CONTROL_PLANE_IP_ADDR>

See Attribute Groups and Dynamic Groups for details.

For persistence across compute node reboots, modify a node image (e.g., kubeimg) that is used by Kubernetes worker nodes so that those nodes auto-join when booted. If multiple control planes are present, optionally specify the control plane by IP address:

scyld-kube --image kubeimg --join
# or
scyld-kube --image kubeimg --join --cluster <CONTROL_PLANE_IP_ADDR>

After rebooting these worker nodes, you can check Kubernetes status again on the control plane node and should now see the joined worker nodes:

kubectl get nodes

You can test Kubernetes by executing a simple job that calculates pi:

kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml

(ref: https://kubernetes.io/docs/concepts/workloads/controllers/job/)
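That manifest creates a Job named pi (the name used by the upstream example). Assuming that default name, you can wait for completion and read the computed digits from the job's log:

```shell
# Wait for the Job to finish, then print its output:
kubectl wait --for=condition=complete --timeout=120s job/pi
kubectl logs job/pi
```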

See Appendix: Using Kubernetes for detailed examples.

OpenMPI, MPICH, and/or MVAPICH

Scyld ClusterWare distributes several versions of OpenMPI, MPICH, and MVAPICH2, and other versions are available from 3rd-party providers. Different versions of the ClusterWare packages can coexist, and users can link applications to the desired libraries and execute the appropriate binary executables using module load commands. Typically one or more of these packages are installed in the compute node images for execution, as well as on any other server where OpenMPI (and similar) applications are built.

View the available ClusterWare versions using:

yum clean all     # just to ensure you'll see the latest versions
yum list --enablerepo=scyld* | egrep "openmpi|mpich|mvapich" | egrep scyld

The OpenMPI, MPICH, and MVAPICH packages are named by their major-minor version numbers, e.g., 4.0 and 4.1, and each has one or more available "point" releases, e.g., openmpi4.1-4.1.1 and openmpi4.1-4.1.4.

A simple yum install will install the latest "point" release for the specified major-minor version, e.g.:

sudo yum install openmpi4.1 --enablerepo=scyld*

installs the default GNU libraries, binary executables, buildable source code for various example programs, and man pages for openmpi4.1-4.1.4. The openmpi4.1-gnu packages are equivalent to openmpi4.1.

Alternatively or additionally:

sudo yum install openmpi4.1-intel --enablerepo=scyld*

installs those same packages built with the Intel oneAPI compiler suite. These compiler-specific packages can coexist with the base GNU package. Similarly, you can additionally install openmpi4.1-nvhpc for libraries and executables built with the NVIDIA HPC SDK suite, and openmpi-aocc for libraries and executables built with the AMD Optimizing C/C++ and Fortran Compilers. Additionally, the openmpi4.1-hpcx_cuda-${compiler} RPM sets are built against the NVIDIA HPC-X and CUDA software packages, with the gnu, intel, nvhpc, and aocc compilers.

Note

ClusterWare provides OpenMPI packages built with third-party software and compilers on a best-effort basis. If an OpenMPI RPM for a certain combination of compiler, software, OpenMPI version, and distro is missing, that is because that combination either failed to build or the resulting package failed to run. Also, the third-party software and compilers that are needed by those OpenMPI packages must be installed in addition to the ClusterWare installation.

Important

The ClusterWare yum repo includes various versions of openmpi* RPMs which were built with different sets of options by different compilers, each potentially having requirements for specific other 3rd-party packages. In general, avoid installing openmpi RPMs using a wildcard such as openmpi4*scyld and instead carefully install only specific RPMs from the ClusterWare yum repo together with their specific required 3rd-party packages.

Suppose openmpi4.1-4.1.1 is installed and you see a newer "point" release openmpi4.1-4.1.4 in the repo. If you do:

sudo yum update openmpi4.1 --enablerepo=scyld*

then 4.1.1 updates to 4.1.4, which removes 4.1.1. Suppose for some reason you want to retain 4.1.1, install the newer 4.1.4, and have both "point" releases coexist. To do that, download the 4.1.4 RPMs and install (not update) them using rpm, e.g.:

sudo rpm -iv openmpi4.1-4.1.4*

You can add OpenMPI (et al.) environment variables to a user's ~/.bash_profile or ~/.bashrc file, e.g., add module load openmpi/intel/4.1.4 so that OpenMPI commands default to a particular release and compiler suite. Commonly a cluster uses shared storage of some kind for /home directories, so changes made by the cluster administrator or by an individual user are transparently reflected across all nodes that access that same shared /home storage.

For OpenMPI, consistent user uid/gid values and passphrase-less key-based access are required for a multi-process application to communicate between processes executing on different nodes using ssh as a transport mechanism. The administrator can inject all users, a selected list of users, or a single user into the compute node image using the sync-uids script. See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.

To use OpenMPI (et al.) without installing either torque-scyld or slurm-scyld, you must configure the firewall that manages the private cluster network between the head node(s), server node(s), and compute nodes. See Firewall Configuration for details.