Additional Software¶
Additional ClusterWare Software¶
scyld-install installs and updates the basic ClusterWare software.
Additional software packages are available in the ClusterWare repository.
scyld-install manipulates the /etc/yum.repos.d/clusterware.repo file to
automatically enable the scyld repos when the tool executes and to disable
the repos when finished.
This is done to avoid inadvertently updating ClusterWare packages when
executing a simple yum update.
Note
If the cluster administrator has created multiple /etc/yum.repos.d/*.repo
files that specify repos containing ClusterWare RPMs, then this protection
against inadvertent updating is performed only for
/etc/yum.repos.d/clusterware.repo, not for those additional repo files.
Accordingly, the --enablerepo=scyld* argument is required when using yum
for listing, installing, and updating these optional ClusterWare packages
on a head node.
For example, these optional installable software packages can be viewed using:
yum list --enablerepo=scyld* | grep scyld
After installation, any available updates can be viewed using:
yum check-update --enablerepo=scyld* | grep scyld
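Installing one of these optional packages likewise requires the --enablerepo=scyld* argument, for example (the package name here is just a placeholder):
sudo yum install <package-name> --enablerepo=scyld*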
Specific installation and configuration instructions for several of these packages, e.g., job schedulers and OpenMPI middleware, are detailed in this chapter.
Adding 3rd-party Software¶
An existing compute node image may need to contain additional software (e.g., a driver and perhaps the driver's associated software) that has been downloaded from a 3rd-party vendor in the form of an RPM or a tarball.
Suppose a tarball named driver-tarball.tgz has been downloaded into the
head node /tmp/ directory, and you need to install its contents into an image.
A cautious first step is to clone an existing image and add the new software
to that clone, which leaves the existing image unmodified.
For example, clone a new image:
scyld-imgctl -i DefaultImage clone name=UpdatedImage
Now enter the new UpdatedImage in a chroot environment:
scyld-modimg -i UpdatedImage --chroot
Suppose your ClusterWare administrator user name is admin1. Inside the chroot you are always the user root. Copy the downloaded tarball from the head node into the chroot with a simple command executed inside the chroot:
scp -r admin1@localhost:/tmp/driver-tarball.tgz /tmp
Unpack /tmp/driver-tarball.tgz and examine the contents, where you will
likely find a script that manages the tarball-specific software installation.
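For example, a minimal sequence to inspect and unpack the tarball inside the chroot:
cd /tmp
tar tzf driver-tarball.tgz     # list the contents first
tar xzf driver-tarball.tgz     # then unpack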
Important
Carefully read the instructions provided by the 3rd-party software vendor before executing the script, and carefully read the output produced when executing the script.
There are several factors to keep in mind when executing the 3rd-party install script:
A 3rd-party installation that involves a new kernel module requires linking that module to the kernel in the chroot. This requires the presence of the kernel-devel package that matches that kernel. If that RPM is not currently installed in the chroot, then inside the chroot manually yum install it, naming the specific kernel version, e.g.:
yum install kernel-devel-3.10.0-957.27.2.el7.x86_64
to match kernel-3.10.0-957.27.2.el7.x86_64.
Some 3rd-party install scripts use the uname command to determine the kernel against which to link a new kernel module. However, when the uname command executes inside a chroot, it actually reports the kernel version of the host system that executes the scyld-modimg --chroot command, not the kernel that has been installed inside the chroot. This uname behavior only works properly for module linking purposes if the chroot contains only one kernel and if that kernel matches the kernel on the server executing scyld-modimg --chroot. To specify an alternate kernel, either name that kernel as an optional argument of the --chroot argument, e.g.:
scyld-modimg -i NewImage --chroot 3.10.0-1160.45.1.el7.x86_64
or pass it as a KVER variable value using the --exec argument, e.g., for a script inside the image that initializes a software driver module and links that module to a specific kernel:
scyld-modimg -i NewImage --execute 'KVER=3.10.0-1160.45.1.el7.x86_64 /path/to/script'
Otherwise, hopefully the 3rd-party install script supports an optional argument that specifies the intended kernel version, such as:
/path/to/install-script -k 3.10.0-1160.45.1.el7.x86_64
If the 3rd-party install script encounters a missing dependency RPM, then the script reports the missing package name(s) and fails. You must manually yum install those missing RPM(s) within the chroot and reexecute the script.
Some 3rd-party install scripts replace RPMs that were installed from the base distribution, e.g., Infiniband, OFED. If any currently installed ClusterWare packages declare these base distribution packages as dependencies, then the install script's attempt to replace those packages fails. You must then uninstall the specified ClusterWare package(s) (e.g., openmpi3.1, openmpi3.1-intel), then retry executing the install script (see the example below). In some cases the 3rd-party tarball contains packages that replace the ClusterWare package(s). In other cases you can reinstall these ClusterWare package(s) after the 3rd-party install script successfully completes.
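For example, a minimal sketch of that uninstall-and-retry sequence inside the chroot, assuming the conflicting packages are the openmpi3.1 RPMs named above and that the vendor script is /tmp/driver-tarball/install.sh (both names are placeholders):
yum remove openmpi3.1 openmpi3.1-intel
/tmp/driver-tarball/install.sh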
Finally, exit the chroot and specify to Keep changes, Replace local image, Upload image, and Replace remote image.
Job Schedulers¶
The default Scyld ClusterWare installation for RHEL/CentOS 7 includes support for the optional job scheduler packages Slurm and PBS TORQUE, and for RHEL/CentOS 8 includes support for the optional packages Slurm and OpenPBS. These optional packages can coexist on a scheduler server, which may or may not be a ClusterWare head node. However, if multiple job schedulers are installed on the same server, then only one at a time should be enabled and executing on that server.
All nodes in the job scheduler cluster must be able to resolve hostnames
of all other nodes as well as the scheduler server hostname.
ClusterWare provides a DNS server in the clusterware-dnsmasq package,
as discussed in Node Name Resolution.
This dnsmasq will resolve all compute node hostnames, and the job
scheduler's hostname should be added to /etc/hosts on the head node(s)
in order to be resolved by dnsmasq.
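For example, an /etc/hosts entry for a scheduler server (the IP address and hostname shown here are placeholders):
10.54.0.10   jobsched.cluster.local   jobsched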
Whenever /etc/hosts is edited, please restart the clusterware-dnsmasq service with:
sudo systemctl restart clusterware-dnsmasq
Installing and configuring a job scheduler requires making changes to the compute node software. When using image-based compute nodes, we suggest first cloning the DefaultImage or creating a new image, leaving untouched the DefaultImage as a basic known-functional pristine image.
For example, to set up nodes n0 through n3, you might first do:
scyld-imgctl -i DefaultImage clone name=jobschedImage
scyld-bootctl -i DefaultBoot clone name=jobschedBoot image=jobschedImage
scyld-nodectl -i n[0-3] set _boot_config=jobschedBoot
When these nodes reboot after all the setup steps are complete, they will use the jobschedBoot and jobschedImage.
See https://slurm.schedmd.com/rosetta.pdf for a discussion of the differences between PBS TORQUE and Slurm. See https://slurm.schedmd.com/faq.html#torque for useful information about how to transition from OpenPBS or PBS TORQUE to Slurm.
The following sections describe the installation and configuration of each job scheduler type.
Slurm¶
See Job Schedulers for general job scheduler information and configuration guidelines. See https://slurm.schedmd.com for Slurm documentation.
Note
As of ClusterWare 12, the default slurm-scyld configuration is Configless; see https://slurm.schedmd.com/configless_slurm.html for more information. This reduces the administrative effort needed when updating the list of compute nodes.
First install Slurm software on the job scheduler server:
sudo yum install slurm-scyld --enablerepo=scyld*
Important
For RHEL/CentOS 8, install Slurm with an additional argument:
sudo yum install slurm-scyld --enablerepo=scyld* --enablerepo=powertools
For RHEL/Rocky 9, install Slurm with an additional argument:
sudo yum install slurm-scyld --enablerepo=scyld* --enablerepo=crb
Now use the helper script slurm-scyld.setup to complete the initialization and set up the job scheduler and the compute node image(s).
Note
The slurm-scyld.setup script performs the init, reconfigure, and
update-nodes actions (described below) by default against all up nodes.
Those actions optionally accept a node-specific argument using the syntax
[--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>].
See Attribute Groups and Dynamic Groups for details.
slurm-scyld.setup init # default to all 'up' nodes
init first generates /etc/slurm/slurm.conf by trying to install slurm-scyld-node
and run slurmd -C on 'up' nodes. By default configless Slurm is enabled by
"SlurmctldParameters=enable_configless" in /etc/slurm/slurm.conf, and a DNS SRV
record called slurmctld_primary is created. To see the details of that SRV record:
scyld-clusterctl hosts -i slurmctld_primary ls -l
Note
For clusters with a backup Slurm controller, create a slurmctld_backup DNS SRV record:
scyld-clusterctl --hidden hosts create name=slurmctld_backup port=6817 service=slurmctld \
domain=cluster.local target=backuphostname type=srvrec priority=20
However, if there are no 'up' nodes or the slurm-scyld-node installation fails
for some reason, then no node is configured in slurm.conf during init.
Later you can use reconfigure to create a new slurm.conf or update-nodes
to update the nodes in an existing slurm.conf.
init also generates /etc/slurm/cgroup.conf and /etc/slurm/slurmdbd.conf,
starts munge, slurmctld, mariadb, and slurmdbd, and then restarts slurmctld.
Finally, init tries to start slurmd on the nodes.
In the ideal case, if the script succeeds in installing slurm-scyld-node on
the compute nodes, then srun -N 1 hostname works after init completes.
The slurmd installation and configuration on 'up' nodes do not survive a node reboot, except on diskful compute nodes. To make a persistent Slurm image:
slurm-scyld.setup update-image slurmImage # for permanence in the image
By default update-image does not copy the Slurm config files into slurmImage
if configless mode is enabled; otherwise it does copy the config files into
slurmImage. You can override this default behavior by appending an additional
argument, "--copy-configs" or "--remove-configs", after slurmImage in the
above command.
Reboot the compute nodes to bring them into active management by Slurm. Check the Slurm status:
slurm-scyld.setup status
If any services on the controller (slurmctld, slurmdbd, and munge) or on the
compute nodes (slurmd and munge) are not running, you can use systemctl to
start the individual services (see the example below), or use
slurm-scyld.setup cluster-restart, slurm-scyld.setup restart, or
slurm-scyld.setup start-nodes to restart Slurm cluster-wide, on the
controller only, or on the nodes only, respectively.
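For example, a minimal sketch of restarting individual services with systemctl; the first two commands run on the controller, the last on an affected compute node:
sudo systemctl restart munge slurmctld
sudo systemctl restart slurmdbd
sudo systemctl restart munge slurmd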
Note
The above restart or start actions do not affect slurmImage.
The update-image action is necessary for persistence across compute node reboots.
Generate new slurm-specific config files with:
slurm-scyld.setup reconfigure # default to all 'up' nodes
Add nodes by executing:
slurm-scyld.setup update-nodes # default to all 'up' nodes
or add or remove nodes by directly editing the /etc/slurm/slurm.conf config file.
Note
With Configless Slurm, the slurmImage does NOT need to be reconfigured after
new nodes are added -- Slurm will automatically forward the new information
to the slurmd daemons on the nodes.
Inject users into the compute node image using the sync-uids script.
The administrator can inject all users, or a selected list of users,
or a single user.
For example, inject the single user janedoe:
/opt/scyld/clusterware-tools/bin/sync-uids \
-i slurmImage --create-homes \
--users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub
See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.
To view the Slurm status on the server and compute nodes:
slurm-scyld.setup status
The Slurm service can also be started and stopped cluster-wide with:
slurm-scyld.setup cluster-stop
slurm-scyld.setup cluster-start
Slurm executable commands and libraries are installed in /opt/scyld/slurm/.
The Slurm controller configuration can be found in /etc/slurm/slurm.conf,
and each node caches a copy of that slurm.conf file in
/var/spool/slurmd/conf-cache/.
Each Slurm user must set up the PATH and LD_LIBRARY_PATH environment
variables to properly access the Slurm commands.
This is done automatically, via the /etc/profile.d/scyld.slurm.sh script,
for users who log in while Slurm is running.
Alternatively, each Slurm user can manually execute module load slurm,
or can add that command line to (for example) the user's ~/.bash_profile
or ~/.bashrc.
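For example, one way to load the module in future login sessions and then verify access to the scheduler (the single-node test is just an illustration):
echo 'module load slurm' >> ~/.bashrc
module load slurm
sinfo
srun -N 1 hostname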
For a traditional config-file-based Slurm deployment, the admin will have to
push the new /etc/slurm/slurm.conf file out to the compute nodes and then
restart slurmd. Alternatively, the admin can modify the boot image to
include the new config file, and then reboot the nodes into that new image.
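One minimal sketch of that push-and-restart approach, assuming nodes n0 through n3 and passwordless root ssh access to them (the node names and access method are assumptions, not ClusterWare requirements):
for n in n0 n1 n2 n3; do
    scp /etc/slurm/slurm.conf root@${n}:/etc/slurm/slurm.conf
    ssh root@${n} systemctl restart slurmd
done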
PBS TORQUE¶
PBS TORQUE is only available for RHEL/CentOS 7 clusters. See Job Schedulers for general job scheduler information and configuration guidelines. See https://www.adaptivecomputing.com/support/documentation-index/torque-resource-manager-documentation for PBS TORQUE documentation.
First install PBS TORQUE software on the job scheduler server:
sudo yum install torque-scyld --enablerepo=scyld*
Now use the helper script torque-scyld.setup to complete the initialization and set up the job scheduler and config file in the compute node image(s).
Note
The torque-scyld.setup script performs the init, reconfigure, and
update-nodes actions (described below) by default against all up nodes.
Those actions optionally accept a node-specific argument using the syntax
[--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>].
See Attribute Groups and Dynamic Groups for details.
torque-scyld.setup init # default to all 'up' nodes
torque-scyld.setup update-image torqueImage # for permanence in the image
Reboot the compute nodes to bring them into active management by TORQUE. Check the TORQUE status:
torque-scyld.setup status
# If the TORQUE daemon is not executing, then:
torque-scyld.setup cluster-restart
# And check the status again
This cluster-restart is a manual one-time setup that doesn't affect the torqueImage.
The update-image action is necessary for persistence across compute node reboots.
Generate new torque-specific config files with:
torque-scyld.setup reconfigure # default to all 'up' nodes
Add nodes by executing:
torque-scyld.setup update-nodes # default to all 'up' nodes
or add or remove nodes by directly editing the /var/spool/torque/server_priv/nodes config file.
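For example, a typical entry in that nodes file names a compute node and its processor count (the node names and np values here are placeholders):
n0 np=8
n1 np=8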
Any such changes must be added to torqueImage by reexecuting:
torque-scyld.setup update-image torqueImage
and then either reboot all the compute nodes with that updated image, or additionally execute:
torque-scyld.setup cluster-restart
to manually push the changes to the up nodes without requiring a reboot.
Inject users into the compute node image using the sync-uids script.
The administrator can inject all users, or a selected list of users,
or a single user.
For example, inject the single user janedoe:
/opt/scyld/clusterware-tools/bin/sync-uids \
-i torqueImage --create-homes \
--users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub
See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.
To view the TORQUE status on the server and compute nodes:
torque-scyld.setup status
The TORQUE service can also be started and stopped cluster-wide with:
torque-scyld.setup cluster-stop
torque-scyld.setup cluster-start
TORQUE executable commands are installed in /usr/sbin/ and /usr/bin/,
TORQUE libraries are installed in /usr/lib64/,
and both are therefore accessible by the default search rules.
OpenPBS¶
OpenPBS is only available for RHEL/CentOS 8 clusters.
See Job Schedulers for general job scheduler information and configuration guidelines. See https://www.openpbs.org for OpenPBS documentation.
First install OpenPBS software on the job scheduler server:
sudo yum install openpbs-scyld --enablerepo=scyld*
Use the helper script openpbs-scyld.setup to complete the initialization and set up the job scheduler and config file in the compute node image(s).
Note
The openpbs-scyld.setup script performs the init, reconfigure, and
update-nodes actions (described below) by default against all up nodes.
Those actions optionally accept a node-specific argument using the syntax
[--ids|-i <NODES>] or a group-specific argument using [--ids|-i %<GROUP>].
See Attribute Groups and Dynamic Groups for details.
openpbs-scyld.setup init # default to all 'up' nodes
openpbs-scyld.setup update-image openpbsImage # for permanence in the image
Reboot the compute nodes to bring them into active management by OpenPBS. Check the OpenPBS status:
openpbs-scyld.setup status
# If the OpenPBS daemon is not executing, then:
openpbs-scyld.setup cluster-restart
# And check the status again
This cluster-restart is a manual one-time setup that doesn't affect the openpbsImage.
The update-image action is necessary for persistence across compute node reboots.
Generate new openpbs-specific config files with:
openpbs-scyld.setup reconfigure # default to all 'up' nodes
Add nodes by executing:
openpbs-scyld.setup update-nodes # default to all 'up' nodes
or add or remove nodes by executing qmgr.
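For example, nodes can be added or removed with qmgr commands such as the following (the node name n4 is a placeholder):
qmgr -c "create node n4"
qmgr -c "delete node n4"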
Any such changes must be added to openpbsImage by reexecuting:
openpbs-scyld.setup update-image openpbsImage
and then either reboot all the compute nodes with that updated image, or additionally execute:
openpbs-scyld.setup cluster-restart
to manually push the changes to the up nodes without requiring a reboot.
Inject users into the compute node image using the sync-uids script.
The administrator can inject all users, or a selected list of users,
or a single user.
For example, inject the single user janedoe:
/opt/scyld/clusterware-tools/bin/sync-uids \
-i openpbsImage --create-homes \
--users janedoe --sync-key janedoe=/home/janedoe/.ssh/id_rsa.pub
See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h for details.
To view the OpenPBS status on the server and compute nodes:
openpbs-scyld.setup status
The OpenPBS service can also be started and stopped cluster-wide with:
openpbs-scyld.setup cluster-stop
openpbs-scyld.setup cluster-start
OpenPBS executable commands and libraries are installed in /opt/scyld/openpbs/.
Each OpenPBS user must set up the PATH and LD_LIBRARY_PATH environment
variables to properly access the OpenPBS commands.
This is done automatically, via the /etc/profile.d/scyld.openpbs.sh script,
for users who log in while OpenPBS is running.
Alternatively, each OpenPBS user can manually execute module load openpbs,
or can add that command line to (for example) the user's ~/.bash_profile
or ~/.bashrc.
Kubernetes¶
ClusterWare administrators wanting to use Kubernetes as a container orchestration layer across their cluster can either install Kubernetes manually following directions found online, or use scripts provided by the clusterware-kubeadm package. To use these scripts, first install the clusterware-kubeadm package on a server that is a Scyld ClusterWare head node, a locally installed ClusterWare compute node, or a separate non-ClusterWare server. Installing the control plane on a RAM-booted or otherwise ephemeral compute node is discouraged.
The provided scripts are based on the kubeadm tool and inherit both the
benefits and limitations of that tool. If you prefer to use a different
tool to install Kubernetes, please follow the appropriate directions
available online from your chosen Kubernetes provider.
The clusterware-kubeadm package is mandatory,
and the clusterware-tools package is recommended:
sudo yum --enablerepo=scyld* install clusterware-kubeadm clusterware-tools
Important
For a server to function as a Kubernetes control plane, SELinux must be
disabled (verify with getenforce) and swap must be turned off (verify with
swapon -s, disable with swapoff -a -v).
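For example, a quick pre-flight check using those commands:
getenforce          # should report "Disabled"
sudo swapoff -a -v  # turn off any active swap
swapon -s           # should now produce no output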
After installing the software, as a cluster administrator execute the
scyld-kube tool to initialize the Kubernetes control plane.
To initialize on a local server:
scyld-kube --init
Or to initialize on an existing booted ClusterWare compute node (e.g., node n0):
scyld-kube --init -i n0
Note that a ClusterWare cluster can have multiple control planes and can use multiple control planes in a Kubernetes High Availability (HA) configuration. See Appendix: Using Kubernetes for detailed examples.
You can validate this initialization by executing:
kubectl get nodes
which should show the newly initialized control plane server.
Next, join one or more booted ClusterWare nodes (e.g., nodes n[1-3]) as worker nodes of this Kubernetes cluster. The full command syntax accomplishes this by explicitly identifying the control plane node by its IP address:
scyld-kube -i n[1-3] --join --cluster <CONTROL_PLANE_IP_ADDR>
However, if the control plane node is a ClusterWare compute node, then
the scyld-kube --init process has already defined Kube-specific attributes,
and a simpler syntax suffices:
scyld-kube -i n[1-3] --join
The simpler join command can find the control plane node without needing to
be told its IP address as long as there is only one compute node functioning
as a Kubernetes control plane.
Note that scyld-kube --join also accepts admin-defined group names,
e.g., for a collection of nodes joined to the kube_workers group:
scyld-kube -i %kube_workers --join --cluster <CONTROL_PLANE_IP_ADDR>
See Attribute Groups and Dynamic Groups for details.
For persistence across compute node reboots, modify the node image (e.g., kubeimg) that is used by Kubernetes worker nodes so that these nodes auto-join when booted. If multiple control planes are present, optionally specify the control plane by IP address:
scyld-kube --image kubeimg --join
or
scyld-kube --image kubeimg --join --cluster CONTROL_PLANE_IP_ADDR
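A possible follow-up so the worker nodes actually boot with kubeimg, mirroring the boot configuration steps shown earlier for the job schedulers (the kubeBoot name and node list n[1-3] are placeholders):
scyld-bootctl -i DefaultBoot clone name=kubeBoot image=kubeimg
scyld-nodectl -i n[1-3] set _boot_config=kubeBoot
# then reboot the worker nodes into the new boot configuration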
After rebooting these worker nodes, you can check Kubernetes status again on the control plane node and should now see the joined worker nodes:
kubectl get nodes
You can test Kubernetes by executing a simple job that calculates pi:
kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
(ref: https://kubernetes.io/docs/concepts/workloads/controllers/job/)
See Appendix: Using Kubernetes for detailed examples.
OpenMPI, MPICH, and/or MVAPICH¶
Scyld ClusterWare distributes several versions of OpenMPI, MPICH, and MVAPICH2,
and other versions are available from 3rd-party providers.
Different versions of the ClusterWare packages can coexist,
and users can link applications to the desired libraries and execute the
appropriate binary executables using module load commands.
Typically one or more of these packages are installed in the compute node
images for execution, as well as on any other server where OpenMPI
(and similar) applications are built.
View the available ClusterWare versions using:
yum clean all # just to ensure you'll see the latest versions
yum list --enablerepo=scyld* | egrep "openmpi|mpich|mvapich" | egrep scyld
The OpenMPI, MPICH, and MVAPICH packages are named by their major-minor version numbers, e.g., 4.0, 4.1, and each has one or more available major-minor "point" releases, e.g., openmpi4.1-4.1.1 and openmpi4.1-4.1.4.
A simple yum install will install the latest "point" release for the
specified major-minor version, e.g.:
sudo yum install openmpi4.1 --enablerepo=scyld*
installs the default GNU libraries, binary executables, buildable source code for various example programs, and man pages for openmpi4.1-4.1.4. The openmpi4.1-gnu packages are equivalent to openmpi4.1.
Alternatively or additionally:
sudo yum install openmpi4.1-intel --enablerepo=scyld*
installs those same packages built using the Intel oneAPI compiler suite. These compiler-specific packages can coexist with the base GNU package. Similarly, you can additionally install openmpi4.1-nvhpc for libraries and executables built using the Nvidia HPC SDK suite, and openmpi-aocc for libraries and executables built using the AMD Optimizing C/C++ and Fortran Compilers (AOCC). Additionally, the openmpi4.1-hpcx_cuda-${compiler} RPM sets are built against the Nvidia HPC-X and CUDA software packages with the gnu, intel, nvhpc, and aocc compilers.
Note
ClusterWare provides openmpi packages that are built with 3rd-party software and compilers on a best-effort basis. If an openmpi RPM for a certain combination of compiler, software, OpenMPI version, and distro is missing, that is because that combination failed to build or the resulting package failed to run. Also, the 3rd-party software and compilers needed by those OpenMPI packages must be installed in addition to the ClusterWare installation.
Important
The ClusterWare yum repo includes various versions of openmpi* RPMs which were built with different sets of options by different compilers, each potentially having requirements for specific other 3rd-party packages. In general, avoid installing openmpi RPMs using a wildcard such as openmpi4*scyld and instead carefully install only specific RPMs from the ClusterWare yum repo together with their specific required 3rd-party packages.
Suppose openmpi4.1-4.1.1 is installed and you see a newer "point" release openmpi4.1-4.1.4 in the repo. If you do:
sudo yum update openmpi4.1 --enablerepo=scyld*
then 4.1.1 updates to 4.1.4 and removes 4.1.1.
Suppose for some reason you want to retain 4.1.1, install the newer 4.1.4,
and have both "point" releases coexist.
For that you need to download the 4.1.4 RPMs and install (not update) them
using rpm, e.g.:
sudo rpm -iv openmpi4.1-4.1.4*
You can add OpenMPI (et al) environment variables to a user's
~/.bash_profile or ~/.bashrc file, e.g., add module load openmpi/intel/4.1.4
to default a simple OpenMPI command to use a particular release and compiler
suite.
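For example, a minimal sketch of building and running an MPI program after loading such a module (the source file name, node names, and slot counts are placeholders):
module load openmpi/intel/4.1.4
mpicc -o hello hello.c
mpirun -np 4 --host n0:2,n1:2 ./hello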
Commonly a cluster uses shared storage of some kind for /home directories,
so changes made by the cluster administrator or by an individual user are
transparently reflected across all nodes that access that same shared /home
storage.
For OpenMPI, consistent user uid/gid values and passphrase-less key-based
access are required for a multi-threaded application to communicate between
threads executing on different nodes using ssh as a transport mechanism.
The administrator can inject all users, or a selected list of users,
or a single user into the compute node image using the sync-uids script.
See Configure Authentication and /opt/scyld/clusterware-tools/bin/sync-uids -h
for details.
To use OpenMPI (et al) without installing either torque-scyld or slurm-scyld, you must configure the firewall that manages the private cluster network between the head node(s), server node(s), and compute nodes. See Firewall Configuration for details.