Running Programs¶
This chapter describes how to run both serial and parallel jobs with Scyld ClusterWare, and how to monitor the status of the cluster once your applications are running. It begins with a brief discussion of program execution concepts, including some examples. The discussion then covers running programs that aren’t parallelized, running parallel programs (including MPI-aware and PVM-aware programs), running serial programs in parallel, job batching, and file systems.
Program Execution Concepts¶
This section compares program execution on a stand-alone computer and a Scyld cluster. It also discusses the differences between running programs on a traditional Beowulf cluster and a Scyld cluster. Finally, it provides some examples of program execution on a Scyld cluster.
Stand-Alone Computer vs. Scyld Cluster¶
On a stand-alone computer running Linux, Unix, and most other operating
systems, executing a program is a very simple process. For example, to
generate a list of the files in the current working directory, you open
a terminal window and type the command ls
followed by the
[return]
key. Typing the [return]
key causes the command shell —
a program that listens to and interprets commands entered in the
terminal window — to start the ls
program (stored at /bin/ls
).
The output is captured and directed to the standard output stream, which
also appears in the same window where you typed the command.
A Scyld cluster isn’t simply a group of networked stand-alone computers.
Only the master node resembles the computing system with which you are
familiar. The compute nodes have only the minimal software components
necessary to support an application initiated from the master node. So
for instance, running the ls
command on the master node causes the
same series of actions as described above for a stand-alone computer,
and the output is for the master node only.
However, running ls
on a compute node involves a very different
series of actions. Remember that a Scyld cluster has no resident
applications on the compute nodes; applications reside only on the
master node. So for instance, to run the ls
command on compute node
1, you would enter the command bpsh 1 ls
on the master node. This
command sends ls
to compute node 1 via Scyld’s BProc
software,
and the output stream is directed to the terminal window on the master
node, where you typed the command.
Some brief examples of program execution are provided in the last
section of this chapter. Both BProc
and bpsh
are covered in more
detail in the Administrator’s Guide.
Traditional Beowulf Cluster vs. Scyld Cluster¶
A job on a Beowulf cluster is actually a collection of processes running on the compute nodes. In traditional clusters of computers, and even on earlier Beowulf clusters, getting these processes started and running together was a complicated task. Typically, the cluster administrator would need to do all of the following:
Ensure that the user had an account on all the target nodes, either manually or via a script.
Ensure that the user could spawn jobs on all the target nodes. This typically entailed configuring a hosts.allow file on each machine, creating a specialized PAM module (a Linux authentication mechanism), or creating a server daemon on each node to spawn jobs on the user’s behalf.
Copy the program binary to each node, either manually, with a script, or through a network file system.
Ensure that each node had available identical copies of all the dependencies (such as libraries) needed to run the program.
Provide knowledge of the state of the system to the application manually, through a configuration file, or through some add-on scheduling software.
With Scyld ClusterWare, most of these steps are removed. Jobs are started on the
master node and are migrated out to the compute nodes via BProc
. A
cluster architecture where jobs may be initiated only from the master
node via BProc
provides the following advantages:
Users no longer need accounts on remote nodes.
Users no longer need authorization to spawn jobs on remote nodes.
Neither binaries nor libraries need to be available on the remote nodes.
The
BProc
system provides a consistent view of all jobs running on the system.
With all these complications removed, program execution on the compute
nodes becomes a simple matter of letting BProc
know about your job
when you start it. The method for doing so depends on whether you are
launching a parallel program (for example, an MPI job or PVM job) or any
other kind of program. See the sections on running parallel programs and
running non-parallelized programs later in this chapter.
Program Execution Examples¶
This section provides a few examples of program execution with Scyld ClusterWare. Additional examples are provided in the sections on running parallel programs and running non-parallelized programs later in this chapter.
Example 1. Directed Execution with bpsh¶
In the directed execution mode, the user explicitly defines which node
(or nodes) will run a particular job. This mode is invoked using the
bpsh
command, the ClusterWare shell command analogous in
functionality to both the rsh
(remote shell) and ssh
(secure
shell) commands. Following are two examples of using bpsh
.
The first example runs hostname
on compute node 0 and writes the
output back from the node to the user’s screen:
[user@cluster user] $ bpsh 0 /bin/hostname
n0
If /bin
is in the user’s $PATH, then the bpsh
does not need the
full pathname:
[user@cluster user] $ bpsh 0 hostname
n0
The second example runs the /usr/bin/uptime
utility on node 1.
Assuming /usr/bin
is in the user’s $PATH:
[user@cluster user] $ bpsh 1 uptime
12:56:44 up 4:57, 5 users, load average: 0.06, 0.09, 0.03
Example 2. Dynamic Execution with beorun and mpprun¶
In the dynamic execution mode, Scyld decides which node is the most
capable of executing the job at that moment in time. Scyld includes two
parallel execution tools that dynamically select nodes: beorun
and
mpprun
. They differ only in that beorun
runs the job
concurrently on the selected nodes, while mpprun
runs the job
sequentially on one node at a time.
The following example shows the difference in the elapsed time to run a
command with beorun
vs. mpprun
:
[user@cluster user] $ date;beorun -np 8 sleep 1;date
Fri Aug 18 11:48:30 PDT 2006
Fri Aug 18 11:48:31 PDT 2006
[user@cluster user] $ date;mpprun -np 8 sleep 1;date
Fri Aug 18 11:48:46 PDT 2006
Fri Aug 18 11:48:54 PDT 2006
Example 3. Binary Pre-Staged on Compute Node¶
A needed binary can be “pre-staged” by copying it to a compute node
prior to execution of a shell script. In the following example, the
shell script is in a file called test.sh
:
######
#! /bin/bash
hostname.local
#######
[user@cluster user] $ bpsh 1 mkdir -p /usr/local/bin
[user@cluster user] $ bpcp /bin/hostname 1:/usr/local/bin/hostname.local
[user@cluster user] $ bpsh 1 ./test.sh
n1
This makes the hostname
binary available on compute node 1 as
/usr/local/bin/hostname.local
before the script is executed. The
shell’s $PATH contains /usr/local/bin
, so the compute node searches
locally for hostname.local
in $PATH, finds it, and executes it.
Note that copying files to a compute node generally puts the files into the RAM filesystem on the node, thus reducing main memory that might otherwise be available for programs, libraries, and data on the node.
Example 4. Binary Migrated to Compute Node¶
If a binary is not “pre-staged” on a compute node, the full path to the
binary must be included in the script in order to execute properly. In
the following example, the master node starts the process (in this case,
a shell) and moves it to node 1, then continues execution of the script.
However, when it comes to the hostname.local2
command, the process
fails:
######
#! /bin/bash
hostname.local2
#######
[user@cluster user] $ bpsh 1 ./test.sh
./test.sh: line 2: hostname.local2: command not found
Since the compute node does not have hostname.local2
locally, the
shell attempts to resolve the binary by asking for the binary from the
master. The problem is that the master has no idea which binary to give
back to the node, hence the failure.
Because there is no way for BProc
to know which binaries may be
needed by the shell, hostname.local2
is not migrated along with the
shell during the initial startup. Therefore, it is important to provide
the compute node with a full path to the binary:
######
#! /bin/bash
/tmp/hostname.local2
#######
[user@cluster user] $ cp /bin/hostname /tmp/hostname.local2
[user@cluster user] $ bpsh 1 ./test.sh
n1
With a full path to the binary, the compute node can construct a proper request for the master, and the master knows which exact binary to return to the compute node for proper execution.
Example 5. Process Data Files¶
Files that are opened by a process (including files on disk, sockets, or
named pipes) are not automatically migrated to compute nodes. Suppose
the application BOB needs the data file 1.dat
:
[user@cluster user] $ bpsh 1 /usr/local/BOB/bin/BOB 1.dat
1.dat
must be either pre-staged to the compute node, e.g., using
bpcp
to copy it there; or else the data files must be accessible on
an NFS-mounted file system. The file /etc/beowulf/fstab
(or a
node-specific fstab.
nodeNumber) specifies which filesystems are
NFS-mounted on each compute node by default.
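For instance, a minimal sketch of pre-staging with bpcp (the /usr/local/BOB path and the 1.dat file are the hypothetical names from the example above):
[user@cluster user] $ bpcp 1.dat 1:/tmp/1.dat
[user@cluster user] $ bpsh 1 /usr/local/BOB/bin/BOB /tmp/1.dat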
Example 6. Installing Commercial Applications¶
Through the course of its execution, the application BOB in the example
above does some work with the data file 1.dat
, and then later
attempts to call /usr/local/BOB/bin/BOB.helper.bin
and
/usr/local/BOB/bin/BOB.cleanup.bin
.
If these binaries are not in the memory space of the process during
migration, the calls to these binaries will fail. Therefore,
/usr/local/BOB
should be NFS-mounted to all of the compute nodes, or
the binaries should be pre-staged using bpcp
to copy them by hand to
the compute nodes. The binaries will stay on each compute node until
that node is rebooted.
Generally for commercial applications, the administrator should have
$APP_HOME
NFS-mounted on the compute nodes that will be involved in
execution. A common best practice is to mount a general directory such as /opt and install all of the applications into /opt.
Environment Modules¶
The RHEL/CentOS environment-modules package provides for the dynamic
modification of a user’s environment via modulefiles. Each modulefile
contains the information needed to configure the shell for an
application, allowing a user to easily switch between applications with
a simple module switch
command that resets environment variables
like PATH and LD_LIBRARY_PATH. A number of modules are already
installed that configure application builds and execution with OpenMPI,
MPICH2, and MVAPICH2. Execute the command module avail
to see a list
of available modules. See specific sections, below, for examples of how
to use modules.
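For example, a typical session looks like the following (the module names are illustrative; module avail shows what is installed on your cluster):
[user@cluster user] $ module avail
[user@cluster user] $ module load openmpi/gnu
[user@cluster user] $ module switch openmpi/gnu openmpi/intel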
For more information about creating your own modules, see
http://modules.sourceforge.net, or view the manpages man module
and
man modulefile
.
Running Programs That Are Not Parallelized¶
Starting and Migrating Programs to Compute Nodes (bpsh)¶
There are no executable programs (binaries) on the file system of the
compute nodes. This means that there is no getty
, no login
, nor
any shells on the compute nodes.
Instead of the remote shell (rsh
) and secure shell (ssh
)
commands that are available on networked stand-alone computers (each of
which has its own collection of binaries), Scyld ClusterWare has the bpsh
command.
The following example shows the standard ls
command running on node
2 using bpsh
:
[user@cluster user] $ bpsh 2 ls -FC /
bin/ dev/ home/ lib64/ proc/ sys/ usr/
bpfs/ etc/ lib/ opt/ sbin/ tmp/ var/
At startup time, by default Scyld ClusterWare exports various directories, e.g.,
/bin
and /usr/bin
, on the master node, and those directories are
NFS-mounted by compute nodes.
However, an NFS-accessible /bin/ls
is not a requirement for
bpsh 2 ls
to work. Note that the /sbin
directory also exists on
the compute node. It is not exported by the master node by default, and
thus it exists locally on a compute node in the RAM-based filesystem.
bpsh 2 ls /sbin
usually shows an empty directory. Nonetheless,
bpsh 2 modprobe bproc
executes successfully, even though
which modprobe
shows the command resides in /sbin/modprobe
and
bpsh 2 which modprobe
fails to find the command on the compute node
because its /sbin
does not contain modprobe
.
bpsh 2 modprobe bproc
works because bpsh
initiates a
modprobe
process on the master node, then forms a process memory
image that includes the command’s binary and references to all its
dynamically linked libraries. This process memory image is then copied
(migrated) to the compute node, and there the references to dynamic
libraries are remapped in the process address space. Only then does the
modprobe
command begin real execution.
bpsh
is not a special version of sh
, but a special way of
handling execution. This process works with any program. Be aware of the
following:
All three standard I/O streams — stdin, stdout, and stderr — are forwarded to the master node. Since some programs need to read standard input and will stop working if they’re run in the background, be sure to close standard input at invocation by using the bpsh -n flag when you run a program in the background on a compute node (see the sketch following this list).
Because shell scripts expect executables to be present, and because compute nodes don’t meet this requirement, shell scripts should be modified to include the bpsh commands required to affect the compute nodes, and should be run on the master node.
The dynamic libraries are cached separately from the process memory image, and are copied to the compute node only if they are not already there. This saves time and network bandwidth. After the process completes, the dynamic libraries are unloaded from memory, but they remain in the local cache on the compute node, so they won’t need to be copied if needed again.
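As a minimal sketch of the background case (the program path is hypothetical), the -n flag closes standard input so the remote process does not block trying to read from the terminal:
[user@cluster user] $ bpsh -n 3 /home/user/long_job &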
For additional information on the BProc
Distributed Process Space and how
processes are migrated to compute nodes, see the Administrator’s Guide.
Copying Information to Compute Nodes (bpcp)¶
Just as traditional Unix has copy (cp
), remote copy (rcp
), and
secure copy (scp
) to move files to and from networked machines, Scyld ClusterWare
has the bpcp
command.
Although the default sharing of the master node’s home directories via
NFS is useful for sharing small files, it is not a good solution for
large data files. Having the compute nodes read large data files served
via NFS from the master node will result in major network congestion, or
even an overload and shutdown of the NFS server. In these cases, staging
data files on compute nodes using the bpcp
command is an alternate
solution. Other solutions include using dedicated NFS servers or NAS
appliances, and using cluster file systems.
Following are some examples of using bpcp
.
This example shows the use of bpcp
to copy a data file named
foo2.dat
from the current directory to the /tmp
directory on
node 6:
[user@cluster user] $ bpcp foo2.dat 6:/tmp
The default directory on the compute node is the current directory on the master node. The current directory on the compute node may already be NFS-mounted from the master node, but it may not exist. The example above works, since /tmp exists on the compute node, but will fail if the destination does not exist. To avoid this problem, you can create the necessary destination directory on the compute node before copying the file, as shown in the next example.
In this example, we change to the /tmp/foo
directory on the master,
use bpsh
to create the same directory on the node 6, then copy
foo2.dat
to the node:
[user@cluster user] $ cd /tmp/foo
[user@cluster user] $ bpsh 6 mkdir /tmp/foo
[user@cluster user] $ bpcp foo2.dat 6:
This example copies foo2.dat
from node 2 to node 3 directly, without
the data being stored on the master node. As in the first example, this
works because /tmp
exists:
[user@cluster user] $ bpcp 2:/tmp/foo2.dat 3:/tmp
Running Parallel Programs¶
An Introduction to Parallel Programming APIs¶
Programmers are generally familiar with serial, or sequential, programs. Simple programs — like “Hello World” and the basic suite of searching and sorting programs — are typical of sequential programs. They have a beginning, an execution sequence, and an end; at any time during the run, the program is executing only at a single point.
A thread is similar to a sequential program, in that it also has a beginning, an execution sequence, and an end. At any time while a thread is running, there is a single point of execution. A thread differs in that it isn’t a stand-alone program; it runs within a program. The concept of threads becomes important when a program has multiple threads running at the same time and performing different tasks.
To run in parallel means that more than one thread of execution is running at the same time, often on different processors of one computer; in the case of a cluster, the threads are running on different computers. A few things are required to make parallelism work and be useful: The program must migrate to another computer or computers and get started; at some point, the data upon which the program is working must be exchanged between the processes.
The simplest case is when the same single-process program is run with different input parameters on all the nodes, and the results are gathered at the end of the run. Using a cluster to get faster results of the same non-parallel program with different inputs is called parametric execution.
A much more complicated example is a simulation, where each process represents some number of elements in the system. Every few time steps, all the elements need to exchange data across boundaries to synchronize the simulation. This situation requires a message passing interface or MPI.
To solve these two problems — program startup and message passing — you can develop your own code using POSIX interfaces. Alternatively, you could utilize an existing parallel application programming interface (API), such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). These are discussed in the sections that follow.
MPI¶
The Message Passing Interface (MPI) application programming interface is currently the most popular choice for writing parallel programs. The MPI standard leaves implementation details to the system vendors (like Scyld). This is useful because they can make appropriate implementation choices without adversely affecting the output of the program.
A program that uses MPI is automatically started a number of times and is allowed to ask two questions: How many of us (size) are there, and which one am I (rank)? Then a number of conditionals are evaluated to determine the actions of each process. Messages may be sent and received between processes.
The advantages of MPI are that the programmer:
Doesn’t have to worry about how the program gets started on all the machines
Has a simplified interface for inter-process messages
Doesn’t have to worry about mapping processes to nodes
Abstracts the network details, resulting in more portable hardware-agnostic software
Also see the section on running MPI-aware programs later in this chapter. Scyld ClusterWare includes several implementations of MPI:
MPICH.
Scyld ClusterWare 6 (and earlier releases) includes MPICH, a freely-available implementation of the MPI standard managed by Argonne National Laboratory. NOTE: MPICH is deprecated and removed from
ClusterWare 7 and later releases, and supplanted by MPICH2 and beyond.
Visit https://www.mpich.org for more information. Scyld MPICH is modified
to use BProc
and Scyld job mapping support; see the section on job
mapping later in this chapter.
MVAPICH.
MVAPICH is an implementation of MPICH for Infiniband interconnects.
NOTE: MVAPICH is deprecated and removed from ClusterWare 7 and later
releases, and supplanted by MVAPICH2 and beyond. Visit
http://mvapich.cse.ohio-state.edu/ for more information. Scyld MVAPICH
is modified to use BProc
and Scyld job mapping support; see the
section on job mapping later in this chapter.
MPICH2. Scyld ClusterWare includes MPICH2, a second generation MPICH. Visit https://www.mpich.org for more information. Scyld MPICH2 is customized to use environment modules. See MPICH2 Release Information for details.
MVAPICH2. MVAPICH2 is second generation MVAPICH. Visit http://mvapich.cse.ohio-state.edu/ for more information. Scyld MVAPICH2 is customized to use environment modules. See MVAPICH2 Release Information for details.
OpenMPI. OpenMPI is an open-source implementation of the Message Passing Interface 2 (MPI-2) specification. The OpenMPI implementation is an optimized combination of several other MPI implementations. Visit https://www.open-mpi.org for more information. Also see OpenMPI Release Information for details.
Other MPI Implementations. Various commercial MPI implementations run on Scyld ClusterWare. Visit the Penguin Computing Support Portal at https://www.penguincomputing.com/support for more information. You can also download and build your own version of MPI, and configure it to run on Scyld ClusterWare.
PVM¶
Parallel Virtual Machine (PVM) was an earlier parallel programming interface. Unlike MPI, it is not a specification but a single set of source code distributed on the Internet. PVM reveals much more about the details of starting your job on remote nodes. However, it fails to abstract implementation details as well as MPI does.
PVM is deprecated, but is still in use by legacy code. We generally advise against writing new programs in PVM, but some of the unique features of PVM may suggest its use.
Also see the section on running PVM-aware programs later in this chapter.
Custom APIs¶
As mentioned earlier, you can develop your own parallel API by using various Unix and TCP/IP standards. In terms of starting a remote program, there are programs written:
To use the rexec function call
To use the rexec or rsh program to invoke a sub-program
To use Remote Procedure Call (RPC)
To invoke another sub-program using the inetd super server
These solutions come with their own problems, particularly in the implementation details. What are the network addresses? What is the path to the program? What is the account name on each of the computers? How is one going to load-balance the cluster?
Scyld ClusterWare, which doesn’t have binaries installed on the cluster nodes, may not lend itself to these techniques. We recommend you write your parallel code in MPI. That said, Scyld has some experience with getting rexec() calls to work, and calls to rsh can simply be replaced with the more cluster-friendly bpsh.
Mapping Jobs to Compute Nodes¶
Running programs specifically designed to execute in parallel across a
cluster requires at least the knowledge of the number of processes to be
used. Scyld ClusterWare uses the NP
environment variable to determine this. The
following example will use 4 processes to run an MPI-aware program
called a.out
, which is located in the current directory.
[user@cluster user] $ NP=4 ./a.out
Note that each kind of shell has its own syntax for setting environment
variables; the example above uses the syntax of the Bourne shell
(/bin/sh
or /bin/bash
).
What the example above does not specify is which specific nodes will execute the processes; this is the job of the mapper. Mapping determines which node will execute each process. While this seems simple, it can get complex as various requirements are added. The mapper scans available resources at the time of job submission to decide which processors to use.
Scyld ClusterWare includes beomap
, a mapping API (documented in the Programmer’s Guide
with details for writing your own mapper). The mapper’s default behavior is
controlled by the following environment variables:
NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
ALL_LOCAL — Run every process on the master node; used for debugging purposes.
NO_LOCAL — Don’t run any processes on the master node.
EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
You can use the beomap
program to display the current mapping for
the current user in the current environment with the current resources
at the current time. See the Reference Guide for a detailed description of
beomap
and its options, as well as examples for using it.
Running MPICH and MVAPICH Programs¶
NOTE: MPICH and MVAPICH (version 1) are deprecated and removed from Scyld ClusterWare 7 and later releases.
MPI-aware programs are those written to the MPI specification and linked with Scyld MPI libraries. NOTE: MPICH and MVAPICH are deprecated and have been supplanted by MPICH2 and MVAPICH2 (and newer versions of those packages). Applications that use MPICH (Ethernet “p4”) or MVAPICH (Infiniband “vapi”) are compiled and linked with common MPICH/MVAPICH implementation libraries, plus specific compiler family (e.g., gnu, Intel, PGI) libraries. The same application binary can execute either in an Ethernet interconnection environment or an Infiniband interconnection environment that is specified at run time. This section discusses how to run these programs and how to set mapping parameters from within such programs.
For information on building MPICH/MVAPICH programs, see the Programmer’s Guide.
mpirun¶
Almost all implementations of MPI have an mpirun
program, which
shares the syntax of mpprun
, but which offers additional features
for MPI-aware programs.
In the Scyld implementation of mpirun
, all of the options available
via environment variables or flags through directed execution are
available as flags to mpirun
, and can be used with properly compiled
MPI jobs. For example, the command for running a hypothetical program
named my-mpi-prog
with 16 processes:
[user@cluster user] $ mpirun -np 16 my-mpi-prog arg1 arg2
is equivalent to running the following commands in the Bourne shell:
[user@cluster user] $ export NP=16
[user@cluster user] $ my-mpi-prog arg1 arg2
Setting Mapping Parameters from Within a Program¶
A program can be designed to set all the required parameters itself. This makes it possible to create programs in which the parallel execution is completely transparent. However, it should be noted that this will work only with Scyld ClusterWare, while the rest of your MPI program should work on any MPI platform.
Use of this feature differs from the command line approach, in that all options that need to be set on the command line can be set from within the program. This feature may be used only with programs specifically designed to take advantage of it, rather than any arbitrary MPI program. However, this option makes it possible to produce turn-key applications and parallel library functions in which the parallelism is completely hidden.
Following is a brief example of the necessary source code to invoke
mpirun
with the -np 16
option from within a program, to run the
program with 16 processes:
/* Standard MPI include file */
#include <mpi.h>
#include <stdlib.h>   /* for setenv() */

int main(int argc, char **argv) {
    setenv("NP", "16", 1);   /* set up mpirun env vars before MPI_Init */
    MPI_Init(&argc, &argv);
    /* ... parallel work ... */
    MPI_Finalize();
    return 0;
}
More details for setting mapping parameters within a program are provided in the Programmer’s Guide.
Examples¶
The examples in this section illustrate certain aspects of running a
hypothetical MPI-aware program named my-mpi-prog
.
Example 7. Specifying the Number of Processes¶
This example shows a cluster execution of a hypothetical program named
my-mpi-prog
run with 4 processes:
[user@cluster user] $ NP=4 ./my-mpi-prog
An alternative syntax is as follows:
[user@cluster user] $ NP=4
[user@cluster user] $ export NP
[user@cluster user] $ ./my-mpi-prog
Note that the user specified neither the nodes to be used nor a mechanism for migrating the program to the nodes. The mapper does these tasks, and jobs are run on the nodes with the lowest CPU utilization.
In addition to specifying the number of processes to create, you can
also exclude specific nodes as computing resources. In this example, we
run my-mpi-prog
again, but this time we not only specify the number
of processes to be used (NP=6), but we also exclude the master node
(NO_LOCAL=1) and some cluster nodes (EXCLUDE=2:4:5) as computing
resources.
[user@cluster user] $ NP=6 NO_LOCAL=1 EXCLUDE=2:4:5 ./my-mpi-prog
Running OpenMPI Programs¶
OpenMPI programs are those written to the MPI-2 specification. This section provides information needed to use programs with OpenMPI as implemented in Scyld ClusterWare.
Pre-Requisites to Running OpenMPI¶
A number of commands, such as mpirun
, are duplicated between OpenMPI
and other MPI implementations. The environment-modules package gives
users a convenient way to switch between the various implementations.
Each module bundles together various compiler-specific environment
variables to configure your shell for building and running your
application, and for accessing compiler-specific manpages. Be sure that
you are loading the proper module to match the compiler that built the
application you wish to run. For example, to load the OpenMPI module for
use with the Intel compiler, do the following:
[user@cluster user] $ module load openmpi/intel
Currently, there are modules for the GNU, Intel, and PGI compilers. To see a list of all of the available modules:
[user@cluster user] $ module avail openmpi
------------------------------- /opt/modulefiles -------------------------------
openmpi/gnu/1.5.3 openmpi/intel/1.5.3 openmpi/pgi/1.5.3
For more information about creating your own modules, see
http://modules.sourceforge.net and the manpages man module
and
man modulefile
.
Using OpenMPI¶
OpenMPI does not honor the Scyld ClusterWare job mapping environment variables. You must either specify the list of hosts on the command line or inside a hostfile. To specify the list of hosts on the command line, use the -H option. The argument following -H is a comma separated list of hostnames, not node numbers. For example, to run a two process job, with one process running on node 0 and one on node 1:
[user@cluster user] $ mpirun -H n0,n1 -np 2 ./mpiprog
Support for running jobs over Infiniband using the OpenIB transport is included with OpenMPI distributed with Scyld ClusterWare. Much like running a job with MPICH over Infiniband, one must specifically request the use of OpenIB. For example:
[user@cluster user] $ mpirun --mca btl openib,sm,self -H n0,n1 -np 2 ./myprog
Read the OpenMPI mpirun
man page for more information about using a hostfile and other tunable options available through mpirun
.
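Alternatively, a hostfile names one host per line, optionally with a slots count giving the number of processes to place there; a minimal sketch (hostnames, slot counts, and the program name are illustrative):
[user@cluster user] $ cat myhosts
n0 slots=2
n1 slots=2
[user@cluster user] $ mpirun --hostfile myhosts -np 4 ./mpiprog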
Running MPICH2 and MVAPICH2 Programs¶
MPICH2 and MVAPICH2 programs are those written to the MPI-2 specification. This section provides information needed to use programs with MPICH2 or MVAPICH2 as implemented in Scyld ClusterWare.
Pre-Requisites to Running MPICH2/MVAPICH2¶
As with Scyld OpenMPI, the Scyld MPICH2 and MVAPICH2 distributions are repackaged Open Source MPICH2 and MVAPICH2 that utilize environment modules to build and to execute applications. Each module bundles together various compiler-specific environment variables to configure your shell for building and running your application, and for accessing implementation- and compiler-specific manpages. You must use the same module to both build the application and to execute it. For example, to load the MPICH2 module for use with the Intel compiler, do the following:
[user@cluster user] $ module load mpich2/intel
Currently, there are modules for the GNU, Intel, and PGI compilers. To see a list of all of the available modules:
[user@cluster user] $ module avail mpich2 mvapich2
------------------------------- /opt/modulefiles -------------------------------
mpich2/gnu/1.3.2 mpich2/intel/1.3.2 mpich2/pgi/1.3.2
------------------------------- /opt/modulefiles -------------------------------
mvapich2/gnu/1.6 mvapich2/intel/1.6 mvapich2/pgi/1.6
For more information about creating your own modules, see
http://modules.sourceforge.net and the manpages man module
and
man modulefile
.
Using MPICH2¶
Unlike the Scyld ClusterWare MPICH implementation, MPICH2 does not honor
the Scyld ClusterWare job mapping environment variables. Use mpiexec
to execute
MPICH2 applications. After loading an mpich2 module, see the
man mpiexec
manpage for specifics, and visit https://www.mpich.org
for full documentation.
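For instance, a minimal sketch (the module name, host names, and program name are illustrative; consult man mpiexec for the options supported by your installed version):
[user@cluster user] $ module load mpich2/gnu
[user@cluster user] $ mpiexec -n 4 -hosts n0,n1 ./mpiprog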
Using MVAPICH2¶
MVAPICH2 does not honor the Scyld ClusterWare job mapping environment variables. Use
mpirun_rsh
to execute MVAPICH2 applications. After loading an
mvapich2 module, use mpirun_rsh --help
to see specifics, and visit
http://mvapich.cse.ohio-state.edu/ for full documentation.
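For instance, a minimal sketch (the module name, host names, and program name are illustrative); mpirun_rsh takes the process count followed by one host entry per process:
[user@cluster user] $ module load mvapich2/gnu
[user@cluster user] $ mpirun_rsh -np 4 n0 n0 n1 n1 ./mpiprog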
Running PVM-Aware Programs¶
Parallel Virtual Machine (PVM) is an application programming interface for writing parallel applications, enabling a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Scyld has developed the Scyld PVM library, specifically tailored to allow PVM to take advantage of the technologies used in Scyld ClusterWare. A PVM-aware program is one that has been written to the PVM specification and linked against the Scyld PVM library.
A complete discussion of cluster configuration for PVM is beyond the scope of this document. However, a brief introduction is provided here, with the assumption that the reader has some background knowledge on using PVM.
You can start the master PVM daemon on the master node using the PVM
console, pvm
. To add a compute node to the virtual machine, issue an add .# command, where # is replaced by a node’s assigned number in
the cluster.
Tip
You can generate a list of node numbers using the bpstat command.
Alternately, you can start the PVM console with a hostfile filename on
the command line. The hostfile should contain a .# for each compute node
you want as part of the virtual machine. As with standard PVM, this
method automatically spawns PVM slave daemons to the specified compute
nodes in the cluster. From within the PVM console, use the conf
command to list your virtual machine’s configuration; the output will
include a separate line for each node being used. Once your virtual
machine has been configured, you can run your PVM applications as you
normally would.
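For instance, a minimal sketch of such a console session (node numbers are illustrative):
[user@cluster user] $ pvm
pvm> add .0
pvm> add .1
pvm> conf
pvm> quit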
Porting Other Parallelized Programs¶
Programs written for use on other types of clusters may require various levels of change to function with Scyld ClusterWare. For instance:
Scripts or programs that invoke rsh can instead call bpsh (see the sketch after this list).
Scripts or programs that invoke rcp can instead call bpcp.
beomap can be used with any script to load balance programs that are to be dispatched to the compute nodes.
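A minimal sketch of such a substitution (paths, file names, and node numbers are hypothetical):
# Lines written for a traditional cluster:
#   rcp input.dat node1:/tmp/input.dat
#   rsh node1 /opt/tools/analyze /tmp/input.dat
# ClusterWare equivalents:
bpcp input.dat 1:/tmp/input.dat
bpsh 1 /opt/tools/analyze /tmp/input.dat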
For more information on porting applications, see the Programmer’s Guide.
Running Serial Programs in Parallel¶
For jobs that are not “MPI-aware” or “PVM-aware”, but need to be started
in parallel, Scyld ClusterWare provides the parallel execution utilities mpprun
and beorun
. These utilities are more sophisticated than bpsh
, in
that they can automatically select ranges of nodes on which to start
your program, run tasks on the master node, determine the number of CPUs
on a node, and start a copy on each CPU. Thus, mpprun
and beorun
provide you with true “dynamic execution” capabilities, whereas bpsh
provides “directed execution” only.
mpprun
and beorun
are very similar, and have similar parameters.
They differ only in that mpprun
runs jobs sequentially on the
selected processors, while beorun
runs jobs concurrently on the
selected processors.
mpprun¶
mpprun
is intended for applications rather than utilities, and runs
them sequentially on the selected nodes. The basic syntax of mpprun
is as follows:
[user@cluster user] $ mpprun [options] app arg1 arg2...
where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.
Options¶
mpprun
includes options for controlling various aspects of the job,
including the ability to:
Specify the number of processors on which to start copies of the program
Start one copy on each node in the cluster
Start one copy on each CPU in the cluster
Force all jobs to run on the master node
Prevent any jobs from running on the master node
The most interesting of the options is the --map
option, which lets
the user specify which nodes will run copies of a program; an example is
provided in the next section. This argument, if specified, overrides the
mapper’s selection of resources that it would otherwise use.
See the Reference Guide for a complete list of options for mpprun
.
Examples¶
Run 16 tasks of program app:
[user@cluster user] $ mpprun -np 16 app infile outfile
Run 16 tasks of program app on any available nodes except nodes 2 and 3:
[user@cluster user] $ mpprun -np 16 --exclude 2:3 app infile outfile
Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:
[user@cluster user] $ mpprun --map 4:2:1:5 app infile outfile
beorun¶
beorun
is intended for applications rather than utilities, and runs
them concurrently on the selected nodes. The basic syntax of beorun
is as follows:
[user@cluster user] $ beorun [options] app arg1 arg2...
where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.
Options¶
beorun
includes options for controlling various aspects of the job,
including the ability to:
Specify the number of processors on which to start copies of the program
Start one copy on each node in the cluster
Start one copy on each CPU in the cluster
Force all jobs to run on the master node
Prevent any jobs from running on the master node
The most interesting of the options is the --map
option, which lets
the user specify which nodes will run copies of a program; an example is
provided in the next section. This argument, if specified, overrides the
mapper’s selection of resources that it would otherwise use.
See the Reference Guide for a complete list of options for beorun
.
Examples¶
Run 16 tasks of program app:
[user@cluster user] $ beorun -np 16 app infile outfile
Run 16 tasks of program app on any available nodes except nodes 2 and 3:
[user@cluster user] $ beorun -np 16 --exclude 2:3 app infile outfile
Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:
[user@cluster user] $ beorun --map 4:2:1:5 app infile outfile
Job Batching¶
Job Batching Options for ClusterWare¶
For Scyld ClusterWare, the default installation includes both the TORQUE resource
manager and the Slurm workload manager, each providing users an intuitive interface for remotely initiating and managing batch jobs on
distributed compute nodes. TORQUE is an Open Source tool based on
standard OpenPBS. Slurm is another Open Source tool, employing the Open
Source Munge
for authentication and mysql
(for ClusterWare 6) or
mariadb
(for ClusterWare 7 and beyond) for managing a database. Basic
instructions for using TORQUE are provided in the next section. For more
general product information, see http://www.adaptivecomputing.com/ for
Adaptive Computing’s TORQUE information and https://slurm.schedmd.com
for Slurm information.
Only one job manager can be enabled at any one time. See the Scyld ClusterWare
Administrator’s Guide for details about how to enable either TORQUE or Slurm. If
Slurm is the chosen job manager, then users must set up the PATH and
LD_LIBRARY_PATH environment variables to properly access the Slurm
commands. This is done automatically for users who login when the
slurm service is running and the pbs_server is not running, via the
/etc/profile.d/scyld.slurm.sh
script. Alternatively, each Slurm user
can manually execute module load slurm
or can add that command line
to (for example) the user’s .bash_profile
.
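For example, a minimal sketch of preparing a shell for Slurm and submitting a simple job (the script name and options are illustrative):
[user@cluster user] $ module load slurm
[user@cluster user] $ sbatch --ntasks=4 myjob.sh
[user@cluster user] $ squeue -u $USER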
The https://slurm.schedmd.com Slurm website also provides an optional TORQUE wrapper to minimize the syntactic differences between TORQUE and Slurm commands and scripts. See https://slurm.schedmd.com/rosetta.pdf for a discussion of the differences between TORQUE and Slurm, and https://slurm.schedmd.com/faq.html#torque provides useful information about how to switch from PBS or TORQUE to Slurm.
Scyld also redistributes the Scyld Maui job scheduler, also derived from Adaptive Computing, that functions in conjunction with the TORQUE job manager. The alternative Moab job scheduler is also available from Adaptive Computing with a separate license, giving customers additional job scheduling, reporting, and monitoring capabilities.
In addition, Scyld provides support for most popular open source and commercial schedulers and resource managers, including SGE, LSF, and PBSPro. For the latest information, visit the Penguin Computing Support Portal at https://www.penguincomputing.com/support.
Job Batching with TORQUE¶
The default installation is configured as a simple job serializer with a single queue named batch.
You can use the TORQUE resource manager to run jobs, check job status, find out which nodes are running your job, and find job output.
Running a Job¶
To run a job with TORQUE, you can put the commands you would normally
use into a job script, and then submit the job script to the cluster
using qsub
. The qsub
program has a number of options that may be
supplied on the command line or as special directives inside the job
script. For the most part, these options should behave exactly the same
in a job script or via the command line, but job scripts make it easier
to manage your actions and their results.
Example 9. Starting a Job with a Job Script Using One Node¶
Following are some examples of running a job using qsub
. For more
detailed information on qsub
, see the qsub
man page.
The following script declares a job with the name “myjob”, to be run using one node. The script uses the PBS -N directive, launches the job, and finally sends the current date and working directory to standard output.
#!/bin/sh
## Set the job name
#PBS -N myjob
#PBS -l nodes=1
# Run my job
/path/to/myjob
echo Date: $(date)
echo Dir: $PWD
You would submit “myjob” as follows:
[bjosh@iceberg]$ qsub -l nodes=1 myjob
15.iceberg
Example 10. Starting a Job from the Command Line¶
This example provides the command line equivalent of the job run in the
example above. We enter all of the qsub
options on the initial
command line. Then qsub
reads the job commands line-by-line until we
type ^D, the end-of-file character. At that point, qsub
queues the
job and returns the Job ID.
[bjosh@iceberg]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
cd $PBS_O_WORKDIR
echo Date: $(date)
echo Dir: $PWD
^D
16.iceberg
Example 11. Starting an MPI Job with a Job Script¶
The following script declares an MPI job named “mpijob”. The script uses
the PBS -N
directive, prints out the nodes that will run the job,
launches the job using mpirun
, and finally prints out the current
date and working directory. When submitting MPI jobs using TORQUE, it is
recommended to simply call mpirun without any arguments. mpirun
will
detect that it is being launched from within TORQUE and ensure that the
job will be properly started on the nodes TORQUE has assigned to the
job. In this case, TORQUE will properly manage and track resources used
by the job.
## Set the job name
#PBS -N mpijob
# RUN my job
mpirun /path/to/mpijob
echo Date: $(date)
echo Dir: $PWD
To request 8 total processors to run “mpijob”, you would submit the job as follows:
[bjosh@iceberg]$ qsub -l nodes=8 mpijob
17.iceberg
To request 8 total processors, using 4 nodes, each with 2 processors per node, you would submit the job as follows:
[bjosh@iceberg]$ qsub -l nodes=4:ppn=2 mpijob
18.iceberg
Checking Job Status¶
You can check the status of your job using qstat
. The command line
option qstat -n
will display the status of queued jobs. To watch the
progression of events, use the watch
command to execute qstat -n
every 2 seconds by default; type [CTRL]-C
to interrupt watch
when needed.
Example 12. Checking Job Status¶
This example shows how to check the status of the job named “myjob”, which we ran on 1 node in the first example above, using the option to watch the progression of events.
[bjosh@iceberg]$ qsub myjob && watch qstat -n
iceberg:
JobID Username Queue Jobname SessID NDS TSK ReqdMemory ReqdTime S ElapTime
15.iceberg bjosh default myjob -- 1 -- -- 00:01 Q --
Table 1. Useful Job Status Commands¶
Command              | Purpose
---------------------|-----------------------------------------------------
ps -ef | bpstat -P   | Display all running jobs, with node number for each
qstat -Q             | Display status of all queues
qstat -n             | Display status of queued jobs
qstat -f JOBID       | Display very detailed information about Job ID
pbsnodes -a          | Display status of all nodes
Finding Out Which Nodes Are Running a Job¶
To find out which nodes are running your job, use the following commands:
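For instance, qstat -n lists the nodes assigned to each queued or running job, and piping a process listing through bpstat -P shows the node on which each process is running:
[bjosh@iceberg]$ qstat -n
[bjosh@iceberg]$ ps -ef | bpstat -P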
Finding Job Output¶
When your job terminates, TORQUE will store its output and error streams in files in the script’s working directory.
Default output file: <jobname>.o<jobID>. You can override the default using qsub with the -o option on the command line, or use the #PBS -o directive in your job script.
Default error file: <jobname>.e<jobID>. You can override the default using qsub with the -e option on the command line, or use the #PBS -e directive in your job script.
To join the output and error streams into a single file, use qsub with the -j oe option on the command line, or use the #PBS -j oe directive in your job script (see the example below).
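A minimal sketch of overriding these defaults on the command line (file names are illustrative):
[bjosh@iceberg]$ qsub -o myjob.out -e myjob.err myjob
[bjosh@iceberg]$ qsub -j oe myjob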
Job Batching with POD Tools¶
POD Tools
is a collection of tools for submitting TORQUE jobs to a
remote cluster and for monitoring them. POD Tools is useful for, but not
limited to, submitting and monitoring jobs to a remote Penguin On Demand
cluster. POD Tools executes on both Scyld and non-Scyld client machines,
and the Tools communicate with the beoweb
service that must be
executing on the target cluster.
The primary tool in POD Tools is POD Shell (podsh)
, which is a
command-line interface that allows for remote job submission and
monitoring. POD Shell is largely self-documented. Enter podsh --help
for a list of possible commands and their formats.
The general usage is podsh <action> [OPTIONS] [FILE/ID]. The action
specifies what type of action to perform, such as submit (for
submitting a new job) or status (for collecting status on all jobs or
a specific job).
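For instance (the script name and job ID are illustrative):
podsh submit myjob.sh
podsh status
podsh status 15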
POD Shell can upload a TORQUE job script to the target cluster, where it
will be added to the job queue. Additionally, POD Shell can be used to
stage data in and out of the target cluster. Staging data in (i.e.
copying data to the cluster) is performed across an unencrypted TCP
socket. Staging data out (i.e. from the cluster back to the client
machine) is performed using scp
from the cluster to the client. In
order for this transfer to be successful, password-less authentication
must be in place using SSH keys between the cluster’s master node and
the client.
POD Shell uses a configuration file that supports both site-wide and
user-local values. Site-wide values are stored in entries in
/etc/podtools.conf
. These settings can be overridden by values in a
user’s ~/.podtools/podtools.conf
file. These values can again be
overridden by command-line arguments passed to podsh
. The template
for podtools.conf
is found at
/opt/scyld/podtools/podtools.conf.template
.
Using Singularity¶
Scyld ClusterWare 7 distributes Singularity, a powerful Linux container platform designed by Lawrence Berkeley National Laboratory.
Singularity enables users to have full control of their environment, allowing a non-privileged user to “swap out” the operating system on the host by executing a lightweight Singularity container environment and an application that executes within that environment. For example, Singularity can provide a user with the ability to create an Ubuntu image of their application, and run the containerized application on a RHEL7 or CentOS7 ClusterWare system in its native Ubuntu environment.
Refer to the Singularity documentation at https://www.sylabs.io/docs/ for instructions on how to create and use Singularity containers.
When running MPI-enabled applications with Singularity on Scyld ClusterWare, follow these additional instructions:
Always compile MPI applications inside a container image with the same MPI implementation and version you plan to use on your Scyld ClusterWare system. Refer to the Singularity documentation for currently supported MPI implementations.
Be aware of the MPI transports which are compatible with your containerized binary, and ensure that you use the same MPI transport when executing MPI applications through Singularity. For example, Scyld ClusterWare’s OpenMPI packages support TCP, Verbs, PSM and PSM2 MPI transports, but not all operating systems will support this gamut of options. Adjust your
mpirun
accordingly on Scyld ClusterWare to use the MPI transport supported by your containerized application.
For example, after building a container image and an OpenMPI executable binary that was built for that image:
module load singularity
module load openmpi/gnu/2.0.2
mpirun -np 4 -H n0,n1,n2,n3 singularity exec <container.img> <container mpi binary>
File Systems¶
Data files used by the applications processed on the cluster may be stored in a variety of locations, including:
On the local disk of each node
On the master node’s disk, shared with the nodes through a network file system
On disks on multiple nodes, shared with all nodes through the use of a parallel file system
The simplest approach is to store all files on the master node, as with
the standard Network File System. Any files in your /home
directory
are shared via NFS with all the nodes in your cluster. This makes
management of the files very simple, but in larger clusters the
performance of NFS on the master node can become a bottleneck for
I/O-intensive applications. If you are planning a large cluster, you
should include disk drives that are separate from the system disk to
contain your shared files; for example, place /home
on a separate
pair of RAID1 disks in the master node. A more scalable solution is to
utilize a dedicated NFS server with a properly configured storage system
for all shared files and programs, or a high performance NAS appliance.
Storing files on the local disk of each node removes the performance problem, but makes it difficult to share data between tasks on different nodes. Input files for programs must be distributed manually to each of the nodes, and output files from the nodes must be manually collected back on the master node. This mode of operation can still be useful for temporary files created by a process and then later reused on that same node.