Running Programs

This chapter describes how to run both serial and parallel jobs with Scyld ClusterWare, and how to monitor the status of the cluster once your applications are running. It begins with a brief discussion of program execution concepts, including some examples. The discussion then covers running programs that aren't parallelized, running parallel programs (including MPI-aware and PVM-aware programs), running serial programs in parallel, job batching, and file systems.

Program Execution Concepts

This section compares program execution on a stand-alone computer and a Scyld cluster. It also discusses the differences between running programs on a traditional Beowulf cluster and a Scyld cluster. Finally, it provides some examples of program execution on a Scyld cluster.

Stand-Alone Computer vs. Scyld Cluster

On a stand-alone computer running Linux, Unix, and most other operating systems, executing a program is a very simple process. For example, to generate a list of the files in the current working directory, you open a terminal window and type the command ls followed by the [return] key. Typing the [return] key causes the command shell — a program that listens to and interprets commands entered in the terminal window — to start the ls program (stored at /bin/ls). The output is captured and directed to the standard output stream, which also appears in the same window where you typed the command.

A Scyld cluster isn't simply a group of networked stand-alone computers. Only the master node resembles the computing system with which you are familiar. The compute nodes have only the minimal software components necessary to support an application initiated from the master node. So for instance, running the ls command on the master node causes the same series of actions as described above for a stand-alone computer, and the output is for the master node only.

However, running ls on a compute node involves a very different series of actions. Remember that a Scyld cluster has no resident applications on the compute nodes; applications reside only on the master node. So for instance, to run the ls command on compute node 1, you would enter the command bpsh 1 ls on the master node. This command sends ls to compute node 1 via Scyld's BProc software, and the output stream is directed to the terminal window on the master node, where you typed the command.

Some brief examples of program execution are provided in the last section of this chapter. Both BProc and bpsh are covered in more detail in the Administrator's Guide.

Traditional Beowulf Cluster vs. Scyld Cluster

A job on a Beowulf cluster is actually a collection of processes running on the compute nodes. In traditional clusters of computers, and even on earlier Beowulf clusters, getting these processes started and running together was a complicated task. Typically, the cluster administrator would need to do all of the following:

  • Ensure that the user had an account on all the target nodes, either manually or via a script.
  • Ensure that the user could spawn jobs on all the target nodes. This typically entailed configuring a hosts.allow file on each machine, creating a specialized PAM module (a Linux authentication mechanism), or creating a server daemon on each node to spawn jobs on the user's behalf.
  • Copy the program binary to each node, either manually, with a script, or through a network file system.
  • Ensure that each node had available identical copies of all the dependencies (such as libraries) needed to run the program.
  • Provide knowledge of the state of the system to the application manually, through a configuration file, or through some add-on scheduling software.

With Scyld ClusterWare, most of these steps are removed. Jobs are started on the master node and are migrated out to the compute nodes via BProc. A cluster architecture where jobs may be initiated only from the master node via BProc provides the following advantages:

  • Users no longer need accounts on remote nodes.
  • Users no longer need authorization to spawn jobs on remote nodes.
  • Neither binaries nor libraries need to be available on the remote nodes.
  • The BProc system provides a consistent view of all jobs running on the system.

With all these complications removed, program execution on the compute nodes becomes a simple matter of letting BProc know about your job when you start it. The method for doing so depends on whether you are launching a parallel program (for example, an MPI job or PVM job) or any other kind of program. See the sections on running parallel programs and running non-parallelized programs later in this chapter.

Program Execution Examples

This section provides a few examples of program execution with Scyld ClusterWare. Additional examples are provided in the sections on running parallel programs and running non-parallelized programs later in this chapter.

Example 1. Directed Execution with bpsh

In the directed execution mode, the user explicitly defines which node (or nodes) will run a particular job. This mode is invoked using the bpsh command, the ClusterWare shell command analogous in functionality to both the rsh (remote shell) and ssh (secure shell) commands. Following are two examples of using bpsh.

The first example runs hostname on compute node 0 and writes the output back from the node to the user's screen:

[user@cluster user] $ bpsh 0 /bin/hostname
  n0

If /bin is in the user's $PATH, then bpsh does not need the full pathname:

[user@cluster user] $ bpsh 0 hostname
  n0

The second example runs the /usr/bin/uptime utility on node 1. Assuming /usr/bin is in the user's $PATH:

[user@cluster user] $ bpsh 1 uptime
  12:56:44 up  4:57,  5 users,  load average: 0.06, 0.09, 0.03

Example 2. Dynamic Execution with beorun and mpprun

In the dynamic execution mode, Scyld decides which node is the most capable of executing the job at that moment in time. Scyld includes two parallel execution tools that dynamically select nodes: beorun and mpprun. They differ only in that beorun runs the job concurrently on the selected nodes, while mpprun runs the job sequentially on one node at a time.

The following example shows the difference in the elapsed time to run a command with beorun vs. mpprun:

[user@cluster user] $ date;beorun -np 8 sleep 1;date
  Fri Aug 18 11:48:30 PDT 2006
  Fri Aug 18 11:48:31 PDT 2006
[user@cluster user] $ date;mpprun -np 8 sleep 1;date
  Fri Aug 18 11:48:46 PDT 2006
  Fri Aug 18 11:48:54 PDT 2006

Example 3. Binary Pre-Staged on Compute Node

A needed binary can be "pre-staged" by copying it to a compute node prior to execution of a shell script. In the following example, the shell script is in a file called test.sh:

######
#! /bin/bash
hostname.local
#######

[user@cluster user] $ bpsh 1 mkdir -p /usr/local/bin
[user@cluster user] $ bpcp /bin/hostname 1:/usr/local/bin/hostname.local
[user@cluster user] $ bpsh 1 ./test.sh
  n1

This makes the hostname binary available on compute node 1 as /usr/local/bin/hostname.local before the script is executed. The shell's $PATH contains /usr/local/bin, so the compute node searches locally for hostname.local in $PATH, finds it, and executes it.

Note that copying files to a compute node generally puts the files into the RAM filesystem on the node, thus reducing main memory that might otherwise be available for programs, libraries, and data on the node.

Example 4. Binary Migrated to Compute Node

If a binary is not "pre-staged" on a compute node, the full path to the binary must be included in the script in order to execute properly. In the following example, the master node starts the process (in this case, a shell) and moves it to node 1, then continues execution of the script. However, when it comes to the hostname.local2 command, the process fails:

######
#! /bin/bash
hostname.local2
#######

[user@cluster user] $ bpsh 1 ./test.sh
  ./test.sh:  line 2:  hostname.local2:  command not found

Since the compute node does not have hostname.local2 locally, the shell attempts to resolve the binary by requesting it from the master node. The problem is that the master has no idea which binary to return to the node, hence the failure.

Because there is no way for BProc to know which binaries the shell may need, hostname.local2 is not migrated along with the shell during the initial startup. Therefore, it is important to provide the compute node with a full path to the binary:

######
#! /bin/bash
/tmp/hostname.local2
#######

[user@cluster user] $ cp /bin/hostname /tmp/hostname.local2
[user@cluster user] $ bpsh 1 ./test.sh
  n1

With a full path to the binary, the compute node can construct a proper request for the master, and the master knows which exact binary to return to the compute node for proper execution.

Example 5. Process Data Files

Files that are opened by a process (including files on disk, sockets, or named pipes) are not automatically migrated to compute nodes. Suppose the application BOB needs the data file 1.dat:

[user@cluster user] $ bpsh 1 /usr/local/BOB/bin/BOB 1.dat

The file 1.dat must either be pre-staged to the compute node, e.g., copied there with bpcp, or be accessible on an NFS-mounted file system. The file /etc/beowulf/fstab (or a node-specific fstab.nodeNumber) specifies which file systems are NFS-mounted on each compute node by default.
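As a sketch of what such an entry might look like (the directory name, the option list, and the $MASTER variable expansion shown here are illustrative and should be verified against your own cluster's /etc/beowulf/fstab):

```
# Hypothetical /etc/beowulf/fstab entry (sketch only): NFS-mount the
# master's /data directory on each compute node. Check the option
# list against the defaults shipped with your release.
$MASTER:/data  /data  nfs  nolock  0 0
```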

Example 6. Installing Commercial Applications

Through the course of its execution, the application BOB in the example above does some work with the data file 1.dat, and then later attempts to call /usr/local/BOB/bin/BOB.helper.bin and /usr/local/BOB/bin/BOB.cleanup.bin.

If these binaries are not in the memory space of the process during migration, the calls to these binaries will fail. Therefore, /usr/local/BOB should be NFS-mounted to all of the compute nodes, or the binaries should be pre-staged using bpcp to copy them by hand to the compute nodes. The binaries will stay on each compute node until that node is rebooted.

Generally for commercial applications, the administrator should have $APP_HOME NFS-mounted on the compute nodes that will be involved in execution. A general best practice is to mount a general directory such as /opt, and install all of the applications into /opt.

Environment Modules

The RHEL/CentOS environment-modules package provides for the dynamic modification of a user's environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application, allowing a user to easily switch between applications with a simple module switch command that resets environment variables like PATH and LD_LIBRARY_PATH. A number of modules are already installed that configure application builds and execution with OpenMPI, MPICH2, and MVAPICH2. Execute the command module avail to see a list of available modules. See specific sections, below, for examples of how to use modules.

For more information about creating your own modules, see http://modules.sourceforge.net, or view the manpages man module and man modulefile.

Running Programs That Are Not Parallelized

Starting and Migrating Programs to Compute Nodes (bpsh)

There are no executable programs (binaries) on the file system of the compute nodes. This means that there is no getty, no login, and no shells on the compute nodes.

Instead of the remote shell (rsh) and secure shell (ssh) commands that are available on networked stand-alone computers (each of which has its own collection of binaries), Scyld ClusterWare has the bpsh command. The following example shows the standard ls command running on node 2 using bpsh:

[user@cluster user] $ bpsh 2 ls -FC /
  bin/   dev/  home/  lib64/  proc/  sys/  usr/
  bpfs/  etc/  lib/   opt/    sbin/  tmp/  var/

At startup time, by default Scyld ClusterWare exports various directories, e.g., /bin and /usr/bin, on the master node, and those directories are NFS-mounted by compute nodes.

However, an NFS-accessible /bin/ls is not a requirement for bpsh 2 ls to work. Note that the /sbin directory also exists on the compute node. It is not exported by the master node by default, so it exists locally on the compute node in the RAM-based filesystem, and bpsh 2 ls /sbin usually shows an empty directory. Nonetheless, bpsh 2 modprobe bproc executes successfully, even though which modprobe shows that the command resides in /sbin/modprobe, and even though bpsh 2 which modprobe fails to find the command because the compute node's /sbin does not contain modprobe.

bpsh 2 modprobe bproc works because bpsh initiates a modprobe process on the master node, then forms a process memory image that includes the command's binary and references to all of its dynamically linked libraries. This process memory image is copied (migrated) to the compute node, where the references to dynamic libraries are remapped in the process address space. Only then does the modprobe command begin real execution.

bpsh is not a special version of sh, but a special way of handling execution. This process works with any program. Be aware of the following:

  • All three standard I/O streams — stdin, stdout, and stderr — are forwarded to the master node. Since some programs need to read standard input and will stop working if they're run in the background, be sure to close standard input at invocation by using the bpsh -n flag when you run a program in the background on a compute node.
  • Shell scripts expect various executables to be present, and compute nodes do not meet this requirement. Such scripts should therefore run on the master node, modified so that the commands intended for the compute nodes are prefixed with bpsh.
  • The dynamic libraries are cached separately from the process memory image, and are copied to the compute node only if they are not already there. This saves time and network bandwidth. After the process completes, the dynamic libraries are unloaded from memory, but they remain in the local cache on the compute node, so they won't need to be copied if needed again.

For additional information on the BProc Distributed Process Space and how processes are migrated to compute nodes, see the Administrator's Guide.

Copying Information to Compute Nodes (bpcp)

Just as traditional Unix has copy (cp), remote copy (rcp), and secure copy (scp) to move files to and from networked machines, Scyld ClusterWare has the bpcp command.

Although the default sharing of the master node's home directories via NFS is useful for sharing small files, it is not a good solution for large data files. Having the compute nodes read large data files served via NFS from the master node will result in major network congestion, or even an overload and shutdown of the NFS server. In these cases, staging data files on compute nodes using the bpcp command is an alternate solution. Other solutions include using dedicated NFS servers or NAS appliances, and using cluster file systems.

Following are some examples of using bpcp.

This example shows the use of bpcp to copy a data file named foo2.dat from the current directory to the /tmp directory on node 6:

[user@cluster user] $ bpcp foo2.dat 6:/tmp

The default directory on the compute node is the current directory on the master node. The current directory on the compute node may already be NFS-mounted from the master node, but it may not exist. The example above works, since /tmp exists on the compute node, but will fail if the destination does not exist. To avoid this problem, you can create the necessary destination directory on the compute node before copying the file, as shown in the next example.

In this example, we change to the /tmp/foo directory on the master, use bpsh to create the same directory on the node 6, then copy foo2.dat to the node:

[user@cluster user] $ cd /tmp/foo
[user@cluster user] $ bpsh 6 mkdir /tmp/foo
[user@cluster user] $ bpcp foo2.dat 6:

This example copies foo2.dat from node 2 to node 3 directly, without the data being stored on the master node. As in the first example, this works because /tmp exists:

[user@cluster user] $ bpcp 2:/tmp/foo2.dat 3:/tmp

Running Parallel Programs

An Introduction to Parallel Programming APIs

Programmers are generally familiar with serial, or sequential, programs. Simple programs — like "Hello World" and the basic suite of searching and sorting programs — are typical of sequential programs. They have a beginning, an execution sequence, and an end; at any time during the run, the program is executing only at a single point.

A thread is similar to a sequential program, in that it also has a beginning, an execution sequence, and an end. At any time while a thread is running, there is a single point of execution. A thread differs in that it isn't a stand-alone program; it runs within a program. The concept of threads becomes important when a program has multiple threads running at the same time and performing different tasks.

To run in parallel means that more than one thread of execution is running at the same time, often on different processors of one computer; in the case of a cluster, the threads are running on different computers. A few things are required to make parallelism work and be useful: The program must migrate to another computer or computers and get started; at some point, the data upon which the program is working must be exchanged between the processes.

The simplest case is when the same single-process program is run with different input parameters on all the nodes, and the results are gathered at the end of the run. Using a cluster to get faster results of the same non-parallel program with different inputs is called parametric execution.

A much more complicated example is a simulation, where each process represents some number of elements in the system. Every few time steps, all the elements need to exchange data across boundaries to synchronize the simulation. This situation requires a message passing interface or MPI.

To solve these two problems — program startup and message passing — you can develop your own code using POSIX interfaces. Alternatively, you could utilize an existing parallel application programming interface (API), such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). These are discussed in the sections that follow.

MPI

The Message Passing Interface (MPI) application programming interface is currently the most popular choice for writing parallel programs. The MPI standard leaves implementation details to the system vendors (like Scyld). This is useful because they can make appropriate implementation choices without adversely affecting the output of the program.

A program that uses MPI is automatically started a number of times and is allowed to ask two questions: How many of us (size) are there, and which one am I (rank)? Then a number of conditionals are evaluated to determine the actions of each process. Messages may be sent and received between processes.

The advantages of MPI are that the programmer:

  • Doesn't have to worry about how the program gets started on all the machines
  • Has a simplified interface for inter-process messages
  • Doesn't have to worry about mapping processes to nodes
  • Abstracts the network details, resulting in more portable hardware-agnostic software

Also see the section on running MPI-aware programs later in this chapter. Scyld ClusterWare includes several implementations of MPI:

MPICH. Scyld ClusterWare 6 (and earlier releases) includes MPICH, a freely available implementation of the MPI standard and a project managed by Argonne National Laboratory. NOTE: MPICH is deprecated, removed from ClusterWare 7 and later releases, and supplanted by MPICH2 and beyond. Visit https://www.mpich.org for more information. Scyld MPICH is modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter.

MVAPICH. MVAPICH is an implementation of MPICH for Infiniband interconnects. NOTE: MVAPICH is deprecated and removed from ClusterWare 7 and later releases, and supplanted by MVAPICH2 and beyond. Visit http://mvapich.cse.ohio-state.edu/ for more information. Scyld MVAPICH is modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter.

MPICH2. Scyld ClusterWare includes MPICH2, a second generation MPICH. Visit https://www.mpich.org for more information. Scyld MPICH2 is customized to use environment modules. See MPICH2 Release Information for details.

MVAPICH2. MVAPICH2 is a second-generation MVAPICH. Visit http://mvapich.cse.ohio-state.edu/ for more information. Scyld MVAPICH2 is customized to use environment modules. See MVAPICH2 Release Information for details.

OpenMPI. OpenMPI is an open-source implementation of the Message Passing Interface 2 (MPI-2) specification. The OpenMPI implementation is an optimized combination of several other MPI implementations. Visit https://www.open-mpi.org for more information. Also see OpenMPI Release Information for details.

Other MPI Implementations. Various commercial MPI implementations run on Scyld ClusterWare. Visit the Penguin Computing Support Portal at http://www.penguincomputing.com/support for more information. You can also download and build your own version of MPI, and configure it to run on Scyld ClusterWare.

PVM

Parallel Virtual Machine (PVM) was an earlier parallel programming interface. Unlike MPI, it is not a specification but a single set of source code distributed on the Internet. PVM reveals much more about the details of starting your job on remote nodes. However, it fails to abstract implementation details as well as MPI does.

PVM is deprecated, but is still in use by legacy code. We generally advise against writing new programs in PVM, but some of the unique features of PVM may suggest its use.

Also see the section on running PVM-aware programs later in this chapter.

Custom APIs

As mentioned earlier, you can develop your own parallel API using various Unix and TCP/IP standards. In terms of starting a remote program, programs have been written:

  • Using the rexec function call
  • To use the rexec or rsh program to invoke a sub-program
  • To use Remote Procedure Call (RPC)
  • To invoke another sub-program using the inetd super server

These solutions come with their own problems, particularly in the implementation details. What are the network addresses? What is the path to the program? What is the account name on each of the computers? How is one going to load-balance the cluster?

Scyld ClusterWare, which doesn't have binaries installed on the cluster nodes, may not lend itself to these techniques, so we recommend writing your parallel code in MPI. That said, Scyld has some experience with getting rexec() calls to work, and calls to rsh can simply be replaced with the more cluster-friendly bpsh.

Mapping Jobs to Compute Nodes

Running programs specifically designed to execute in parallel across a cluster requires at least knowledge of the number of processes to be used. Scyld ClusterWare uses the NP environment variable to determine this. The following example will use 4 processes to run an MPI-aware program called a.out, which is located in the current directory.

[user@cluster user] $ NP=4 ./a.out

Note that each kind of shell has its own syntax for setting environment variables; the example above uses the syntax of the Bourne shell (/bin/sh or /bin/bash).
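As a quick illustration of that Bourne-shell syntax — using a throwaway local script (/tmp/shownp.sh, a hypothetical stand-in for a real MPI binary) rather than any ClusterWare command — a per-command assignment makes the variable visible only to that one invocation:

```shell
# A stand-in script that just reports NP; it plays the role of a
# real MPI binary such as a.out (the name and path are invented).
cat > /tmp/shownp.sh <<'EOF'
#!/bin/sh
echo "NP is ${NP:-unset}"
EOF
chmod +x /tmp/shownp.sh

NP=4 /tmp/shownp.sh     # the assignment applies to this command only
/tmp/shownp.sh          # in the parent shell itself, NP remains unset
```

The first invocation reports NP is 4; the second reports NP is unset, because the per-command assignment never modifies the parent shell's environment.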

What the example above does not specify is which specific nodes will execute the processes; this is the job of the mapper. Mapping determines which node will execute each process. While this seems simple, it can get complex as various requirements are added. The mapper scans available resources at the time of job submission to decide which processors to use.

Scyld ClusterWare includes beomap, a mapping API (documented in the Programmer's Guide with details for writing your own mapper). The mapper's default behavior is controlled by the following environment variables:

  • NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
  • ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
  • ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
  • ALL_LOCAL — Run every process on the master node; used for debugging purposes.
  • NO_LOCAL — Don't run any processes on the master node.
  • EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
  • BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
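The colon-delimited format used by EXCLUDE and BEOWULF_JOB_MAP can be split with standard shell tools. This sketch uses plain POSIX shell and made-up node numbers — no ClusterWare commands are involved — purely to illustrate the format:

```shell
# Sketch only: the node numbers are arbitrary examples, and this
# merely prints the rank-to-node assignment that a setting of
# BEOWULF_JOB_MAP would request.
BEOWULF_JOB_MAP="0:0:1:2"    # rank 0 -> node 0, rank 1 -> node 0, ...
rank=0
for node in $(echo "$BEOWULF_JOB_MAP" | tr ':' ' '); do
    echo "rank $rank -> node $node"
    rank=$((rank + 1))
done
```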

You can use the beomap program to display the current mapping for the current user in the current environment with the current resources at the current time. See the Reference Guide for a detailed description of beomap and its options, as well as examples for using it.

Running MPICH and MVAPICH Programs

NOTE: MPICH and MVAPICH (version 1) are deprecated and removed from Scyld ClusterWare.

MPI-aware programs are those written to the MPI specification and linked with Scyld MPI libraries. NOTE: MPICH and MVAPICH are deprecated and have been supplanted by MPICH2 and MVAPICH2 (and newer versions of those packages). Applications that use MPICH (Ethernet "p4") or MVAPICH (Infiniband "vapi") are compiled and linked with common MPICH/MVAPICH implementation libraries, plus specific compiler family (e.g., gnu, Intel, PGI) libraries. The same application binary can execute either in an Ethernet interconnection environment or an Infiniband interconnection environment that is specified at run time. This section discusses how to run these programs and how to set mapping parameters from within such programs.

For information on building MPICH/MVAPICH programs, see the Programmer's Guide.

mpirun

Almost all implementations of MPI have an mpirun program, which shares the syntax of mpprun but boasts additional features for MPI-aware programs.

In the Scyld implementation of mpirun, all of the options available via environment variables or flags through directed execution are available as flags to mpirun, and can be used with properly compiled MPI jobs. For example, the command for running a hypothetical program named my-mpi-prog with 16 processes:

[user@cluster user] $ mpirun -np 16 my-mpi-prog arg1 arg2

is equivalent to running the following commands in the Bourne shell:

[user@cluster user] $ export NP=16
[user@cluster user] $ my-mpi-prog arg1 arg2

Setting Mapping Parameters from Within a Program

A program can be designed to set all the required parameters itself. This makes it possible to create programs in which the parallel execution is completely transparent. However, it should be noted that this will work only with Scyld ClusterWare, while the rest of your MPI program should work on any MPI platform.

Use of this feature differs from the command line approach, in that all options that need to be set on the command line can be set from within the program. This feature may be used only with programs specifically designed to take advantage of it, rather than any arbitrary MPI program. However, this option makes it possible to produce turn-key application and parallel library functions in which the parallelism is completely hidden.

Following is a brief example of the necessary source code to invoke mpirun with the -np 16 option from within a program, to run the program with 16 processes:

/* Standard MPI include file */
#include <mpi.h>
#include <stdlib.h>     /* for setenv() */

int main(int argc, char **argv)
{
        setenv("NP", "16", 1);  /* set up mpirun env vars before MPI_Init */
        MPI_Init(&argc, &argv);
        /* ... parallel work here ... */
        MPI_Finalize();
        return 0;
}

More details for setting mapping parameters within a program are provided in the Programmer's Guide.

Examples

The examples in this section illustrate certain aspects of running a hypothetical MPI-aware program named my-mpi-prog.

Example 7. Specifying the Number of Processes

This example shows a cluster execution of a hypothetical program named my-mpi-prog run with 4 processes:

[user@cluster user] $ NP=4 ./my-mpi-prog

An alternative syntax is as follows:

[user@cluster user] $ NP=4
[user@cluster user] $ export NP
[user@cluster user] $ ./my-mpi-prog

Note that the user specified neither the nodes to be used nor a mechanism for migrating the program to the nodes. The mapper does these tasks, and jobs are run on the nodes with the lowest CPU utilization.

In addition to specifying the number of processes to create, you can also exclude specific nodes as computing resources. In this example, we run my-mpi-prog again, but this time we not only specify the number of processes to be used (NP=6), but we also exclude the master node (NO_LOCAL=1) and some cluster nodes (EXCLUDE=2:4:5) as computing resources.

[user@cluster user] $ NP=6 NO_LOCAL=1 EXCLUDE=2:4:5 ./my-mpi-prog

Running OpenMPI Programs

OpenMPI programs are those written to the MPI-2 specification. This section provides information needed to use programs with OpenMPI as implemented in Scyld ClusterWare.

Pre-Requisites to Running OpenMPI

A number of commands, such as mpirun, are duplicated between OpenMPI and other MPI implementations. The environment-modules package gives users a convenient way to switch between the various implementations. Each module bundles together various compiler-specific environment variables to configure your shell for building and running your application, and for accessing compiler-specific manpages. Be sure that you are loading the proper module to match the compiler that built the application you wish to run. For example, to load the OpenMPI module for use with the Intel compiler, do the following:

[user@cluster user] $ module load openmpi/intel

Currently, there are modules for the GNU, Intel, and PGI compilers. To see a list of all of the available modules:

[user@cluster user] $ module avail openmpi
------------------------------- /opt/modulefiles -------------------------------
openmpi/gnu/1.5.3   openmpi/intel/1.5.3 openmpi/pgi/1.5.3

For more information about creating your own modules, see http://modules.sourceforge.net and the manpages man module and man modulefile.

Using OpenMPI

OpenMPI does not honor the Scyld ClusterWare job mapping environment variables. You must either specify the list of hosts on the command line or in a hostfile. To specify the list of hosts on the command line, use the -H option. The argument following -H is a comma-separated list of hostnames, not node numbers. For example, to run a two-process job, with one process running on node 0 and one on node 1:

[user@cluster user] $ mpirun -H n0,n1 -np 2 ./mpiprog

Support for running jobs over Infiniband using the OpenIB transport is included with OpenMPI distributed with Scyld ClusterWare. Much like running a job with MPICH over Infiniband, one must specifically request the use of OpenIB. For example:

[user@cluster user] $ mpirun --mca btl openib,sm,self -H n0,n1 -np 2 ./myprog

Read the OpenMPI mpirun man page for more information about using a hostfile and other tunable options available through mpirun.
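As a sketch of what a hostfile might look like (the file name, node names, and slot counts here are illustrative, not taken from a real cluster):

```
# Hypothetical OpenMPI hostfile 'myhosts': one line per node, with
# the number of processes (slots) each node may run.
n0 slots=4
n1 slots=4
```

Such a file could then be passed to mpirun in place of the -H list, e.g., mpirun --hostfile myhosts -np 8 ./mpiprog.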

Running MPICH2 and MVAPICH2 Programs

MPICH2 and MVAPICH2 programs are those written to the MPI-2 specification. This section provides information needed to use programs with MPICH2 or MVAPICH2 as implemented in Scyld ClusterWare.

Pre-Requisites to Running MPICH2/MVAPICH2

As with Scyld OpenMPI, the Scyld MPICH2 and MVAPICH2 distributions are repackaged Open Source MPICH2 and MVAPICH2 that utilize environment modules to build and to execute applications. Each module bundles together various compiler-specific environment variables to configure your shell for building and running your application, and for accessing implementation- and compiler-specific manpages. You must use the same module to both build the application and to execute it. For example, to load the MPICH2 module for use with the Intel compiler, do the following:

[user@cluster user] $ module load mpich2/intel

Currently, there are modules for the GNU, Intel, and PGI compilers. To see a list of all of the available modules:

[user@cluster user] $ module avail mpich2 mvapich2
------------------------------- /opt/modulefiles -------------------------------
mpich2/gnu/1.3.2   mpich2/intel/1.3.2 mpich2/pgi/1.3.2

------------------------------- /opt/modulefiles -------------------------------
mvapich2/gnu/1.6   mvapich2/intel/1.6 mvapich2/pgi/1.6

For more information about creating your own modules, see http://modules.sourceforge.net and the manpages man module and man modulefile.

Using MPICH2

Unlike the Scyld ClusterWare MPICH implementation, MPICH2 does not honor the Scyld ClusterWare job mapping environment variables. Use mpiexec to execute MPICH2 applications. After loading an mpich2 module, see the man mpiexec manpage for specifics, and visit https://www.mpich.org for full documentation.
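As a minimal sketch, an MPICH2 launch with mpiexec names the hosts explicitly; the node names and program name below are placeholders:

```shell
module load mpich2/gnu

# -hosts takes a comma-separated list of hostnames; -n is the total process count
mpiexec -hosts n0,n1 -n 4 ./mpiprog
```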

Using MVAPICH2

MVAPICH2 does not honor the Scyld ClusterWare job mapping environment variables. Use mpirun_rsh to execute MVAPICH2 applications. After loading an mvapich2 module, use mpirun_rsh --help to see specifics, and visit http://mvapich.cse.ohio-state.edu/ for full documentation.
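As a sketch, mpirun_rsh takes the process count followed by one hostname per MPI rank; the node names and program name below are placeholders:

```shell
module load mvapich2/gnu

# One hostname per rank: two ranks on n0, two on n1
mpirun_rsh -np 4 n0 n0 n1 n1 ./mpiprog
```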

Running PVM-Aware Programs

Parallel Virtual Machine (PVM) is an application programming interface for writing parallel applications, enabling a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Scyld has developed the Scyld PVM library, specifically tailored to allow PVM to take advantage of the technologies used in Scyld ClusterWare. A PVM-aware program is one that has been written to the PVM specification and linked against the Scyld PVM library.

A complete discussion of cluster configuration for PVM is beyond the scope of this document. However, a brief introduction is provided here, with the assumption that the reader has some background knowledge on using PVM.

You can start the master PVM daemon on the master node using the PVM console, pvm. To add a compute node to the virtual machine, issue an add .# command, where # is replaced by a node's assigned number in the cluster.

Tip

You can generate a list of node numbers using the bpstat command.

Alternately, you can start the PVM console with a hostfile filename on the command line. The hostfile should contain a .# for each compute node you want as part of the virtual machine. As with standard PVM, this method automatically spawns PVM slave daemons to the specified compute nodes in the cluster. From within the PVM console, use the conf command to list your virtual machine's configuration; the output will include a separate line for each node being used. Once your virtual machine has been configured, you can run your PVM applications as you normally would.
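A typical console session following the steps above might look like this (a sketch for compute nodes 0 and 1; console output omitted):

```shell
pvm            # start the console and the master PVM daemon
pvm> add .0    # add compute node 0 to the virtual machine
pvm> add .1    # add compute node 1
pvm> conf      # list the virtual machine's configuration, one line per node
pvm> quit      # leave the console; the daemons keep running
```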

Porting Other Parallelized Programs

Programs written for use on other types of clusters may require various levels of change to function with Scyld ClusterWare. For instance:

  • Scripts or programs that invoke rsh can instead call bpsh.
  • Scripts or programs that invoke rcp can instead call bpcp.
  • beomap can be used with any script to load balance programs that are to be dispatched to the compute nodes.
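For example, a script line written for rsh or rcp maps directly onto the BProc equivalents (node 3 and the filename are arbitrary):

```shell
# Before (traditional cluster):
rsh n3 uname -r
rcp data.in n3:/tmp/

# After (Scyld ClusterWare):
bpsh 3 uname -r
bpcp data.in 3:/tmp/
```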

For more information on porting applications, see the Programmer's Guide.

Running Serial Programs in Parallel

For jobs that are not "MPI-aware" or "PVM-aware", but need to be started in parallel, Scyld ClusterWare provides the parallel execution utilities mpprun and beorun. These utilities are more sophisticated than bpsh, in that they can automatically select ranges of nodes on which to start your program, run tasks on the master node, determine the number of CPUs on a node, and start a copy on each CPU. Thus, mpprun and beorun provide you with true "dynamic execution" capabilities, whereas bpsh provides "directed execution" only.

mpprun and beorun are very similar, and have similar parameters. They differ only in that mpprun runs jobs sequentially on the selected processors, while beorun runs jobs concurrently on the selected processors.

mpprun

mpprun is intended for applications rather than utilities, and runs them sequentially on the selected nodes. The basic syntax of mpprun is as follows:

[user@cluster user] $ mpprun [options]  app arg1 arg2...

where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.

Options

mpprun includes options for controlling various aspects of the job, including the ability to:

  • Specify the number of processors on which to start copies of the program
  • Start one copy on each node in the cluster
  • Start one copy on each CPU in the cluster
  • Force all jobs to run on the master node
  • Prevent any jobs from running on the master node

The most interesting of the options is the --map option, which lets the user specify which nodes will run copies of a program; an example is provided in the next section. This argument, if specified, overrides the mapper's selection of resources that it would otherwise use.

See the Reference Guide for a complete list of options for mpprun.

Examples

Run 16 tasks of program app:

[user@cluster user] $ mpprun -np 16  app infile outfile

Run 16 tasks of program app on any available nodes except nodes 2 and 3:

[user@cluster user] $ mpprun -np 16 --exclude 2:3 app infile outfile

Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:

[user@cluster user] $ mpprun --map 4:2:1:5 app infile outfile

beorun

beorun is intended for applications rather than utilities, and runs them concurrently on the selected nodes. The basic syntax of beorun is as follows:

[user@cluster user] $ beorun [options]  app arg1 arg2...

where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.

Options

beorun includes options for controlling various aspects of the job, including the ability to:

  • Specify the number of processors on which to start copies of the program
  • Start one copy on each node in the cluster
  • Start one copy on each CPU in the cluster
  • Force all jobs to run on the master node
  • Prevent any jobs from running on the master node

The most interesting of the options is the --map option, which lets the user specify which nodes will run copies of a program; an example is provided in the next section. This argument, if specified, overrides the mapper's selection of resources that it would otherwise use.

See the Reference Guide for a complete list of options for beorun.

Examples

Run 16 tasks of program app:

[user@cluster user] $ beorun -np 16 app infile outfile

Run 16 tasks of program app on any available nodes except nodes 2 and 3:

[user@cluster user] $ beorun -np 16 --exclude 2:3 app infile outfile

Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:

[user@cluster user] $ beorun --map 4:2:1:5 app infile outfile

Job Batching

Job Batching Options for ClusterWare

For Scyld ClusterWare, the default installation includes both the TORQUE resource manager and the Slurm workload manager, each providing users an intuitive interface for remotely initiating and managing batch jobs on distributed compute nodes. TORQUE is an Open Source tool based on standard OpenPBS. Slurm is another Open Source tool, employing the Open Source Munge for authentication and MySQL (for ClusterWare 6) or MariaDB (for ClusterWare 7 and beyond) for managing its database. Basic instructions for using TORQUE are provided in the next section. For more general product information, see http://www.adaptivecomputing.com/ for Adaptive Computing's TORQUE information and https://slurm.schedmd.com for Slurm information.

Only one job manager can be enabled at any one time. See the Scyld ClusterWare Administrator's Guide for details about how to enable either TORQUE or Slurm. If Slurm is the chosen job manager, then users must set up the PATH and LD_LIBRARY_PATH environment variables to properly access the Slurm commands. This is done automatically for users who login when the slurm service is running and the pbs_server is not running, via the /etc/profile.d/scyld.slurm.sh script. Alternatively, each Slurm user can manually execute module load slurm or can add that command line to (for example) the user's .bash_profile.
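Once the Slurm environment is loaded, jobs are submitted with the standard Slurm commands; for example (a sketch; the script name and node count are placeholders):

```shell
module load slurm

# Submit a batch script on 2 nodes, then check your jobs in the queue
sbatch -N 2 myjob.sh
squeue -u $USER
```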

The https://slurm.schedmd.com Slurm website also provides an optional TORQUE wrapper to minimize the syntactic differences between TORQUE and Slurm commands and scripts. See https://slurm.schedmd.com/rosetta.pdf for a discussion of the differences between TORQUE and Slurm, and https://slurm.schedmd.com/faq.html#torque provides useful information about how to switch from PBS or TORQUE to Slurm.

Scyld also redistributes the Scyld Maui job scheduler, derived from Adaptive Computing, which functions in conjunction with the TORQUE job manager. The alternative Moab job scheduler is available from Adaptive Computing under a separate license, giving customers additional job scheduling, reporting, and monitoring capabilities.

In addition, Scyld provides support for most popular open source and commercial schedulers and resource managers, including SGE, LSF, and PBSPro. For the latest information, visit the Penguin Computing Support Portal at http://www.penguincomputing.com/support.

Job Batching with TORQUE

The default installation is configured as a simple job serializer with a single queue named batch.

You can use the TORQUE resource manager to run jobs, check job status, find out which nodes are running your job, and find job output.

Running a Job

To run a job with TORQUE, you can put the commands you would normally use into a job script, and then submit the job script to the cluster using qsub. The qsub program has a number of options that may be supplied on the command line or as special directives inside the job script. For the most part, these options should behave exactly the same in a job script or via the command line, but job scripts make it easier to manage your actions and their results.

Example 9. Starting a Job with a Job Script Using One Node

Following are some examples of running a job using qsub. For more detailed information on qsub, see the qsub man page.

The following script declares a job with the name "myjob", to be run using one node. The script uses the PBS -N directive, launches the job, and finally sends the current date and working directory to standard output.

#!/bin/sh

## Set the job name
#PBS -N myjob
#PBS -l nodes=1

# Run my job
/path/to/myjob

echo Date: $(date)
echo Dir:  $PWD

You would submit "myjob" as follows:

[bjosh@iceberg]$ qsub -l nodes=1 myjob
15.iceberg

Example 10. Starting a Job from the Command Line

This example provides the command line equivalent of the job run in the example above. We enter all of the qsub options on the initial command line. Then qsub reads the job commands line-by-line until we type ^D, the end-of-file character. At that point, qsub queues the job and returns the Job ID.

[bjosh@iceberg]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
cd $PBS_O_WORKDIR
echo Date: $(date)
echo Dir:  $PWD
^D
16.iceberg

Example 11. Starting an MPI Job with a Job Script

The following script declares an MPI job named "mpijob". The script uses the PBS -N directive, launches the job using mpirun, and finally prints out the current date and working directory. When submitting MPI jobs using TORQUE, it is recommended to call mpirun without any arguments. mpirun will detect that it is being launched from within TORQUE and ensure that the job is properly started on the nodes TORQUE has assigned to it. In this case, TORQUE will properly manage and track the resources used by the job.

#!/bin/sh

## Set the job name
#PBS -N mpijob

# Run my job
mpirun /path/to/mpijob

echo Date: $(date)
echo Dir:  $PWD

To request 8 total processors to run "mpijob", you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=8 mpijob
17.iceberg

To request 8 total processors, using 4 nodes, each with 2 processors per node, you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=4:ppn=2 mpijob
18.iceberg

Checking Job Status

You can check the status of your job using qstat. The command line option qstat -n will display the status of queued jobs. To watch the progression of events, use the watch command to execute qstat -n every 2 seconds by default; type [CTRL]-C to interrupt watch when needed.

Example 12. Checking Job Status

This example shows how to check the status of the job named "myjob", which we ran on 1 node in the first example above, using the option to watch the progression of events.

[bjosh@iceberg]$ qsub myjob && watch qstat -n
iceberg:

JobID       Username  Queue    Jobname  SessID  NDS  TSK  ReqdMemory  ReqdTime  S  ElapTime
15.iceberg  bjosh     default  myjob    --      1    --   --          00:01     Q  --

Table 1. Useful Job Status Commands

Command Purpose
ps -ef | bpstat -P Display all running jobs, with node number for each
qstat -Q Display status of all queues
qstat -n Display status of queued jobs
qstat -f JOBID Display very detailed information about Job ID
pbsnodes -a Display status of all nodes

Finding Out Which Nodes Are Running a Job

To find out which nodes are running your job, use the following commands:

  • To find your Job Ids: qstat -an

  • To find the Process IDs of your jobs: qstat -f JOBID

  • To find the number of the node running your job: ps -ef | bpstat -P | grep PID

    The number of the node running your job will be displayed in the first column of output.

Finding Job Output

When your job terminates, TORQUE will store its output and error streams in files in the script's working directory.

  • Default output file: JOBNAME.oJOBID

    You can override the default using qsub with the -o PATH option on the command line, or use the #PBS -o PATH directive in your job script.

  • Default error file: JOBNAME.eJOBID

    You can override the default using qsub with the -e PATH option on the command line, or use the #PBS -e PATH directive in your job script.

  • To join the output and error streams into a single file, use qsub with the -j oe option on the command line, or use the #PBS -j oe directive in your job script.

Job Batching with POD Tools

POD Tools is a collection of tools for submitting TORQUE jobs to a remote cluster and for monitoring them. POD Tools is useful for, but not limited to, submitting and monitoring jobs on a remote Penguin On Demand cluster. POD Tools executes on both Scyld and non-Scyld client machines, and the tools communicate with the beoweb service that must be running on the target cluster.

The primary tool in POD Tools is POD Shell (podsh), which is a command-line interface that allows for remote job submission and monitoring. POD Shell is largely self-documented. Enter podsh --help for a list of possible commands and their formats.

The general usage is podsh <action> [OPTIONS] [FILE/ID]. The action specifies what type of action to perform, such as submit (for submitting a new job) or status (for collecting status on all jobs or a specific job).

POD Shell can upload a TORQUE job script to the target cluster, where it will be added to the job queue. Additionally, POD Shell can be used to stage data in and out of the target cluster. Staging data in (i.e. copying data to the cluster) is performed across an unencrypted TCP socket. Staging data out (i.e. from the cluster back to the client machine) is performed using scp from the cluster to the client. In order for this transfer to be successful, password-less authentication must be in place using SSH keys between the cluster's master node and the client.
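For example, submitting a job script and then polling its status might look like the following (a sketch; the script name and job ID are placeholders, and podsh --help shows the exact command formats):

```shell
podsh submit myjob.sh   # upload the TORQUE job script to the remote queue
podsh status            # collect status on all of your remote jobs
podsh status 42         # collect status on one job by ID
```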

POD Shell uses a configuration file that supports both site-wide and user-local values. Site-wide values are stored in entries in /etc/podtools.conf. These settings can be overridden by values in a user's ~/.podtools/podtools.conf file. These values can again be overridden by command-line arguments passed to podsh. The template for podtools.conf is found at /opt/scyld/podtools/podtools.conf.template.
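To create a user-local configuration, copy the template into place and edit it (paths as given above):

```shell
# User-local settings override the site-wide /etc/podtools.conf
mkdir -p ~/.podtools
cp /opt/scyld/podtools/podtools.conf.template ~/.podtools/podtools.conf
```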

Using Singularity

Scyld ClusterWare 6 distributes Singularity, a powerful Linux container platform designed by Lawrence Berkeley National Laboratory.

Singularity enables users to have full control of their environment, allowing a non-privileged user to "swap out" the operating system on the host by executing a lightweight Singularity container environment and an application that executes within that environment. For example, Singularity can provide a user with the ability to create an Ubuntu image of their application, and run the containerized application on a RHEL6 or CentOS6 ClusterWare system in its native Ubuntu environment.

Refer to the Singularity documentation at https://www.sylabs.io/docs/ for instructions on how to create and use Singularity containers.

When running MPI-enabled applications with Singularity on Scyld ClusterWare, follow these additional instructions:

  • Always compile MPI applications inside a container image with the same MPI implementation and version you plan to use on your Scyld ClusterWare system. Refer to the Singularity documentation for currently supported MPI implementations.
  • Be aware of the MPI transports which are compatible with your containerized binary, and ensure that you use the same MPI transport when executing MPI applications through Singularity. For example, Scyld ClusterWare's OpenMPI packages support TCP, Verbs, PSM and PSM2 MPI transports, but not all operating systems will support this gamut of options. Adjust your mpirun accordingly on Scyld ClusterWare to use the MPI transport supported by your containerized application.

For example, after building a container image and an OpenMPI executable binary that was built for that image:

module load singularity
module load openmpi/gnu/2.0.2
mpirun -np 4 -H n0,n1,n2,n3 singularity exec <container.img> <container mpi binary>

File Systems

Data files used by the applications processed on the cluster may be stored in a variety of locations, including:

  • On the local disk of each node
  • On the master node's disk, shared with the nodes through a network file system
  • On disks on multiple nodes, shared with all nodes through the use of a parallel file system

The simplest approach is to store all files on the master node, as with the standard Network File System. Any files in your /home directory are shared via NFS with all the nodes in your cluster. This makes management of the files very simple, but in larger clusters the performance of NFS on the master node can become a bottleneck for I/O-intensive applications. If you are planning a large cluster, you should include disk drives that are separate from the system disk to contain your shared files; for example, place /home on a separate pair of RAID1 disks in the master node. A more scalable solution is to utilize a dedicated NFS server with a properly configured storage system for all shared files and programs, or a high performance NAS appliance.
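You can confirm that a directory is NFS-shared on a compute node by comparing mounts with bpsh; for example (node 0 here is arbitrary):

```shell
# Compare the /home mount on the master and on compute node 0
df -h /home
bpsh 0 df -h /home
```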

Storing files on the local disk of each node removes the performance problem, but makes it difficult to share data between tasks on different nodes. Input files for programs must be distributed manually to each of the nodes, and output files from the nodes must be manually collected back on the master node. This mode of operation can still be useful for temporary files created by a process and then later reused on that same node.
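The manual staging described above can be done with bpcp and bpsh; as a sketch (node 4 and all filenames are placeholders):

```shell
bpcp input.dat 4:/tmp/input.dat                       # distribute input to node 4
bpsh 4 /path/to/myjob /tmp/input.dat /tmp/output.dat  # run on node 4 against local disk
bpcp 4:/tmp/output.dat ./output.n4.dat                # collect the result on the master
```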