BProc Overview

The heart of the Scyld ClusterWare single system image is the BProc system. BProc consists of kernel modules and daemons that work together to create a unified process space and implement directed process migration. BProc is used to create a single cluster process space, with processes anywhere on the cluster appearing on the master as if they were running there. This allows users to monitor and control processes using standard, unmodified Unix tools.

The BProc system provides a library API that is used to create and manage processes on the compute nodes, as well as provide cluster membership state and limit node access. These calls are used by libraries such as MPI and PVM to create and manage tasks. They may also be used directly by the programmer to implement parallel software that is less complex and more functional than most other cluster programming interfaces.

For applications that have intense, complex communication patterns working on a single end goal, it is normally advisable to use message passing libraries such as MPI or PVM. The MPI and PVM interfaces are portable, provide a rich set of communication operations, and are widely understood among application programmers. The Scyld ClusterWare system provides specialized implementations of PVM and MPI that internally use the BProc system to provide cleaner semantics and higher performance.

There are many applications where it is not appropriate to use these message passing libraries. Directly using the native BProc API can often result in a much less complex implementation or provide unique functionality difficult to achieve with a message passing system. Most programming libraries, environments, tools, and services are better implemented directly with the BProc API.

A master node in a Scyld ClusterWare system manages a collection of processes that exist in a distinct process space. Each process in this process space has a unique PID, and can be controlled using Unix signals. New processes created by a process in a given process space remain in the process space as children of the process that created them. When processes terminate, they notify the parent process through a signal. BProc allows a process within a master node's process space to execute on a compute node. This is accomplished by starting a process on the compute node — which technically is outside the master node's process space — and binding it to a "ghost" process on the master node. This ghost process represents the process on the master node, and any operations performed on the ghost process are, in turn, performed on the remote process. Further, the remote node's process table must be annotated to indicate that the process is actually bound to the process space of the master. Thus, if the process performs a system call relative to its process space, such as creating a new process, or calling the kill() system call, that system call is performed in the context of the master's process space.

A process running on a compute node has a PID in the process space of the compute node, and an independent process running on the compute node can see that process with that PID. From the perspective of the process itself, it appears to be executing on the master node and has a PID within the master node's process space.

The BProc API is divided into three groups of functions. First, the machine information functions provide a means of discovering what compute nodes are available to the program, what the status of nodes is, what the current node is, etc. Second, the process migration functions provide for starting processes, moving processes, and running programs on compute nodes. Finally, a set of process management functions allows programs to manage how processes are run on compute nodes.

Starting a Process with BProc

The most fundamental task one performs with BProc is to start a new process. For traditional Linux programs the mechanism for starting a new process is fork(). Using fork(), a Linux process creates an exact duplicate of itself, right up to and including the contents of memory and the context of the program. The new process is a child of the original process and has a new process ID which is returned to the original process. The fork() call returns zero to the new process.

BProc provides a variation on the fork() system call:

int bproc_rfork(int node);

This function behaves like Linux fork(), except that the process is started on a remote node specified by the node argument, and has a ghost process created on the master node. The function returns the PID of the ghost process to the original process and zero to the new process.

Note that there are important differences between Linux fork() and bproc_rfork(). BProc has a limited ability to deal with file descriptors when a process forks. Under Scyld ClusterWare, files are opened directly on the compute node, thus a file opened on one node before a call to bproc_rfork() does not translate cleanly into an open file on the remote node. Similarly, open sockets do not move when processes move and should not be expected to.

A process started with bproc_rfork does have stdout and stderr connected to the stdout and stderr of the original process and IO is automatically forwarded back to those sockets. Variations on the bproc_rfork() call allow the programmer to control how IO is forwarded for stdout, stderr, and stdin, and can also control how shared libraries are moved for the process. See the BProc reference manual for more details.

The bproc_rfork() call is a powerful mechanism for starting tasks that will act as part of a parallel application. Most applications that are written for Beowulf class computers consist of a single program that is replicated to all of the tasks that will run as part of that job. Each task identifies itself with a unique number, and in doing so each selects part of the total computation to perform. The bproc_rfork() call works well in this scenario because a program can readily duplicate itself using the bproc_rfork() system call. In order to coordinate the tasks, it is necessary to establish communication between the tasks and between each task and the original process. Again, the semantics of bproc_rfork() are ideal because the original task can set up a communication endpoint (such as a socket) before it calls bproc_rfork() to start the tasks at which point each task will automatically know the address of the original process and can establish communication. Once this is accomplished, the original process can exchange information between the tasks so that they can communicate among themselves.

/* Create running processes on nodes 1, 5, 24, and 65. */
int target_nodes[] = { 1, 5, 24, 65};
for (i = 0; i * I'm the child */
        break;  /* only the parent does bproc_rfork(). */
}

Thus the bproc_rfork() call forms the core of parallel processing libraries such as MPI or PVM by providing a convenient means for starting tasks and establishing communication. This same facility can be used to start parallel tasks when a non-MPI (or non-PVM) program is desired. While this is not as portable and convenient — and thus not well suited to most applications — it may be appropriate for some system services, library development, and toolkits.

BProc does not limit the programmer to using fork() semantics for starting processes. Some situations call for starting a new program from an existing process much like the Linux system call execve(). For this, BProc provides two different calls:

int bproc_rexec(int node, char *cmd, char **argv, char **envp);
int bproc_execmove(int node, char *cmd, char **argv, char **envp);

These two functions are almost identical in that they cause the current process to be overwritten by the contents of the executable file specified by cmd which is then executed from the beginning with the arguments and environment given in argv and envp. In this the semantics are very similar to the Linux execve() system call. The difference is that the process moves to the node specified, leaving a ghost process on the master node. All of the same caveats that apply to bproc_rfork() apply to these calls as well, and there are other variations that allow additional control of IO forwarding and library migration.

The difference between these two calls is where the arguments are evaluated. With bproc_rexec(), the executable file is expected to reside on the compute node where the process will run. Thus with this call, the process first moves to the new node, and then loads the new program. With the bproc_execmove() call, the executable file is expected to reside on the node where the original process resides. Thus, this call loads the new program first, and then moves the process to the new node. In the case where the executable file resides on a network file system that is visible from all nodes, the effect of these two calls is essentially the same, though the bproc_execmove() call may be more efficient if the original process exists on the node where the executable file actually resides.

Finally, BProc provides a means to simply move a process from one node to another:

int bproc_move(int node);

This call does not create a new process, other than possibly creating a ghost process, but simply causes the original process to move to a different node. Again, all caveats of bproc_fork() apply. This call is probably of most value in load balancing mechanisms but may be useful in situations where each task in a program needs to perform some IO on the master node to load its memory before moving out to a remote node.

Getting Information About the Parallel Machine

In order to start processes on compute nodes with BProc the programmer needs to know the number of the compute node. When using a library like MPI, some kind of scheduler selects the nodes to run each task on, thus hiding this from the programmer. Using BProc, some mechanism must be used to select a compute node to start processes on. If a scheduler is available, it can be called to get node numbers to run processes on. Otherwise, the programmer must select nodes. Similarly, if the programmer's task is to write a scheduler — as may be the case if special scheduling is needed for the application — then the programmer must have a means of learning what nodes are available to the BProc system for running processes.

BProc includes a set of system calls that allow a program to learn what nodes are known to BProc, and what the status of those nodes is. Under BProc, all of the compute nodes that are known to the system are numbered from 0 to P-1, where P is the number of compute nodes known to BProc. In addition, there is the master node, which is numbered -1. Thus, the first call a program will make is often:

int bproc_numnodes(void);

This call returns the number of compute nodes known to BProc. Sometimes it is important for a program to know which node it is currently executing on. It may choose to perform some actions only if it is on a special node or it may use the information to identify itself to other processes. BProc returns this information with the call:

int bproc_currnode(void);

This call returns the number of the node the process is currently executing on. Its return values range from -1 to P-1.

A node known to BProc is not necessarily available for running user processes. Not all nodes may be operational at any given point in time. Nodes may be in various stages of booting (and thus might be available momentarily) or might have suffered an error (which might bear reporting) or be shut down for maintenance. In any case, BProc provides a mechanism to learn the state of each compute node, as far as BProc is currently aware. The master node is always assumed to be up. The call provided for this is:

int bproc_nodestatus(int node);

This function returns one of several values, which are defined in the header file bproc.h as follows:

bproc_node_down, bproc_node_unavailable, bproc_node_error,
bproc_node_up, bproc_node,reboot, bproc_node_halt,
bproc_node_pwroff, bproc_node_boot

BProc is a low level system utility and does not generally get involved in node usage policies. Thus, if a node is up and permissions are set correctly, BProc will allow any user process to start a process on a node. In many environments there is a need to manage how nodes are used. For example, it may be desirable to guarantee that only one user is using a particular node for a period of time. It may be important to assign a set of nodes to a user or group of users for exclusive use over a period of time. It may be important to require users to start processes via some intermediary like a batch queue system or accounting system and thus prevent processes from being started directly.

In order to accommodate these issues, BProc provides a permission and access mechanism that is based on the Unix file access model. Under this model, every node is owned by a user and by a group and has execute permission bits for user, group and all. In order for a user to start a process on a node, an execute permission bit must be set for a matching user, matching group, or all.

These functions need not be used at all in many systems, but allow system programs to enforce policies controlling which users can run processes on which nodes. For example, a batch queue scheduler could decide to allocate 16 nodes to a job, and change the user of those nodes to match the user who submitted the job, thus preventing other users from inadvertently running processes on those nodes.

A program that is making scheduling decisions needs to know not only what nodes are known, and what nodes are actually operational, but what nodes have permissions set appropriately to allow processes to be run. BProc provides a call to retrieve this information for each node known to BProc.

int bproc_nodeinfo(int node, struct bproc_node_info_t *info);

This function fills in the following struct:

struct bproc_node_info_t
{
    int      num_nodes;         /* Info about your state of the world */
    int      curr_node;
    int      node;              /* Info about a particular node */
    int      status;
    uint32_t addr;
    int      user;
    int      group;
};

Note that this call not only returns permission and node ownership information, but also includes the node status and network address. Thus this single call provides the information to decide if a process may be started on a node, along with the information needed to contact the node.

Permission to start a process on a node is distinct from the decision of whether that node has the resources, such as available memory and CPU cycles, to support the process. The scheduling and mapping policy is support by resource information that may be retrieved through the Beostat API. See the "Programming with Beostat API" section in this manual.

In order to implement schedulers, system programs must be able to control the ownership and access permission of compute nodes. BProc provides functions that set those attributes. These functions can only be used by root or a user with appropriate permissions.

int bproc_setnodestatus(int node, int status);

This function allows a node management program to force node status to a particular state.

int  bproc_chmod(int node, int mode);
int  bproc_chown(int node, int user);
int  bproc_chgrp(int node, int group);

These functions change the user and group that own a node, or the User/Group/Others execute permission for a node. They use the same semantics as Unix file ownership and execute permission.

#define BPROC_X_OK 1
int bproc_access(node, mode);

This function checks if the current user may start a process on the specified node. This is similar to the access() system call for files. The only useful value for mode is BPROC_X_OK.

The last set of node management functions is used to implement multiple master systems and redundant, robust, and available servers.

int bproc_slave_chroot(int node, char *path);

This function sets a new root directory on a compute node, similar to the chroot() system call on the master node (local machine). This is used internally to implement isolated multiple master clusters, and allows a node to be installed with more than one configuration with the specific configuration set based on the process to be started.

int bproc_detach(long code);

This function allows a process to detach itself from control by the master. The master sees the code as the apparent exit status of the process.