Job Batching

The default Scyld ClusterWare installation includes both the TORQUE resource manager and the Slurm workload manager, each providing users with an intuitive interface for remotely initiating and managing batch jobs on the distributed compute nodes.

ClusterWare TORQUE is a customized redistribution of Open Source software that derives from Adaptive Computing Enterprises, Inc. (https://www.adaptivecomputing.com/products/opensource/torque). TORQUE is an Open Source tool based on standard OpenPBS. ClusterWare Slurm is a redistribution of Open Source software that derives from https://slurm.schedmd.com, and the associated Munge package derives from http://dun.github.io/munge/.

Both TORQUE and Slurm are installed by default, although only one job manager can be enabled at any one time. See Enabling TORQUE or Slurm below for details. See the User's Guide for general information about using TORQUE or Slurm. See Managing Multiple Master Nodes for details about how to configure TORQUE for high availability using multiple master nodes.
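
To see which of the two is currently enabled to start at boot, query each service with chkconfig (a quick check, not part of the formal procedure; the service names match those used in the commands below):

chkconfig --list torque
chkconfig --list slurm-scyld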

Scyld also redistributes the Scyld Maui job scheduler, likewise derived from Adaptive Computing, which functions in conjunction with the TORQUE job manager. The alternative Moab job scheduler is available from Adaptive Computing under a separate license, giving customers additional job scheduling, reporting, and monitoring capabilities.

In addition, Scyld provides support for most popular Open Source and commercial schedulers and resource managers, including SGE, LSF, and PBSPro. For the latest information, see the Penguin Computing Support Portal at https://www.penguincomputing.com/support.

Enabling TORQUE or Slurm

To enable TORQUE: after all compute nodes are up and running, disable Slurm (if it is currently enabled), then enable and configure TORQUE, and finally reboot all the compute nodes:

service slurm-scyld cluster-stop
chkconfig slurm-scyld off
beochkconfig 98slurm off
chkconfig torque on
beochkconfig 98torque on
service torque reconfigure
service torque start
bpctl -S all -R

and then after the compute nodes have rebooted, restart TORQUE cluster-wide:

service torque cluster-restart
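
As an optional check that TORQUE is operational, the standard TORQUE client commands can list the compute nodes and queues (exact output varies by cluster):

pbsnodes -a
qstat -q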

To enable Slurm: after all compute nodes are up and running, disable TORQUE (if it is currently enabled), then enable and configure Slurm, and finally reboot all the compute nodes:

service torque cluster-stop
chkconfig torque off
beochkconfig 98torque off
chkconfig slurm-scyld on
beochkconfig 98slurm on

Next, configure Slurm by generating /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf from Scyld-provided templates:

service slurm-scyld reconfigure
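
Optionally, inspect the generated files before starting the daemons. For example, the following grep shows the controller entry in /etc/slurm/slurm.conf (the parameter name varies by Slurm version, so adjust the pattern as needed):

grep -Ei 'ControlMachine|SlurmctldHost' /etc/slurm/slurm.conf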

Finally, start Slurm (and Munge and mysql) on the master node and reboot all compute nodes:

service slurm-scyld start
bpctl -S all -R

and then after the compute nodes have rebooted, restart Slurm cluster-wide:

service slurm-scyld cluster-restart
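
As an optional check that Slurm is operational, the standard Slurm client commands can list the compute nodes and run a trivial job (node names and partition layout vary by cluster):

sinfo
srun -N1 hostname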

Note: slurmdbd uses mysql to create a database defined by /etc/slurm/slurmdbd.conf, and expects mysql to be configured with no password.
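
To confirm that mysql accepts passwordless connections before starting slurmdbd, a query such as the following (run on the master node, assuming the database user configured in slurmdbd.conf is root; adjust the -u argument otherwise) should succeed without prompting for a password:

mysql -u root -e 'SHOW DATABASES;'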

Each Slurm user must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the Slurm commands. This is done automatically, via the /etc/profile.d/scyld.slurm.sh script, for users who log in while the slurm service is running and the pbs_server is not running. Alternatively, each Slurm user can manually execute module load slurm or can add that command line to (for example) the user's .bash_profile.
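
For example, a user can append the module command to a login script so that the Slurm commands are available in every new login shell:

echo 'module load slurm' >> ~/.bash_profile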