File Systems

File Systems on a Cluster

File systems on a cluster are of two types: local file systems and network file systems. The file /etc/fstab describes the file systems mounted on the master node, and the file /etc/beowulf/fstab describes the file systems mounted on each compute node. You may also create node-specific /etc/beowulf/fstab.N files, where N is a node number.

Local File Systems

Local file systems are the file systems that exist on the disks attached to each machine. In the Scyld ClusterWare setup, the master node has a local file system (typically ext3), and each compute node also has a local file system. Local file systems are used for storing data that is local to that machine.

Network/Cluster File Systems

Network file systems are used so that files can be shared across the cluster and every node in the cluster can see the exact same set of files. The default network file system for Scyld ClusterWare is NFS. NFS allows the contents of a directory on the server (by default the master node) to be accessed by the clients (the compute nodes). The default Scyld ClusterWare setup has the /home directory exported through NFS so that all the user home directories can be accessed on the compute nodes. Additionally, various other directories are mounted by default, as specified by /etc/beowulf/fstab or by a node-specific fstab.N.

Note that root's home directory is not in /home, so root cannot access its home directory from the compute nodes. This should not be a problem, as normal compute jobs should not be run as "root".

NFS

NFS is the standard way to have files stored on one machine, yet be able to access them from other machines on the network as if they were stored locally.

NFS on Clusters

On a cluster, NFS is typically used when all the nodes need access to the same file or set of files. With NFS, when a file changes, every node sees the change, and there is only one copy of the file that needs to be backed up.

Configuration of NFS

The Network File System (NFS) is what Scyld ClusterWare uses to allow users to access their home directories and other remote directories from compute nodes. (The User's Guide has a small discussion on good and bad ways to use NFS.) Two files control what directories are NFS mounted on the compute nodes. The first is /etc/exports. This tells the nfs daemon on the master node what directories it should allow to be mounted and who can access them. Scyld ClusterWare adds various commonly useful entries to /etc/exports. For example:

/home  @cluster(rw)

The /home says that /home can be NFS mounted, and @cluster(rw) says who can mount it and what forms of access are allowed. @cluster is a netgroup, a single name that represents several machines. cluster is a special netgroup, set up by beonss, that automatically maps to all of your compute nodes, which makes it easy to allow something to be mounted by every compute node. The (rw) part specifies what permissions the compute nodes have when they mount /home; in this case, all user processes on the compute nodes have read-write access to /home. More options can go here; they are detailed in man exports.
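For example, a hypothetical export of a shared /data directory (the directory name is used here only for illustration) might combine several of those standard options; sync forces the server to commit writes to disk before replying, and no_root_squash lets root on a compute node act as root within the export:

/data  @cluster(rw,sync,no_root_squash)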

The second file is /etc/beowulf/fstab. (Note that it is possible to set up an individual fstab.N for a node. For this discussion, we will assume that you are using a global fstab for all nodes.) For example, one line in the default /etc/beowulf/fstab is the following:

$MASTER:/home  /home  nfs  nolock,nonfatal  0 0

This is the line that tells the compute nodes to try to mount /home when they boot:

  • The $MASTER is a variable that is automatically expanded to the IP address of the master node.
  • The first /home is the directory location on the master node.
  • The second /home is where it should be mounted on the compute node.
  • The nfs specifies that this is an NFS file system.
  • The nolock specifies that locking should be turned off with this nfs mount. We turn off locking so that we don't have to run daemons on the compute nodes. (If you need locking, see File Locking Over NFS for details.)
  • The nonfatal tells ClusterWare's /usr/lib/beoboot/bin/setup_fs script to treat a mount failure as a nonfatal problem. Without this nonfatal option, any mount failure leaves the compute node in an error state, thus making it unavailable to users.
  • The two 0's at the end are the dump and fsck pass fields, present so that the entry follows the same format as the standard /etc/fstab.
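As noted earlier, a node-specific /etc/beowulf/fstab.N can be created for node N. As a purely illustrative sketch (the /scratch directory and the node number are hypothetical, and /scratch would also need to be exported via /etc/exports), you could copy the global /etc/beowulf/fstab to /etc/beowulf/fstab.4 and add an entry such as the following, so that only node 4 mounts /scratch:

$MASTER:/scratch  /scratch  nfs  nolock,nonfatal  0 0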

To add an nfs mount of /foo to all your compute nodes, first add the following line to the end of the /etc/exports file:

/foo  @cluster(rw)

Then execute exportfs -a as root. For the mount to take place the next time your compute nodes reboot, you must add the following line to the end of /etc/beowulf/fstab:

$MASTER:/foo  /foo  nfs  nolock  0 0

You can then reboot all your nodes to make the nfs mount happen. If you wish to mount the new exported filesystem without rebooting the compute nodes, you can issue the following two commands:

[root @cluster ~] # bpsh -a mkdir -p /foo
[root @cluster ~] # bpsh -a mount -t nfs -o nolock master:/foo /foo

Note that you will need to replace /foo with the directory you actually want to mount.
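To confirm that the mount succeeded on every node, you can list each node's mount table with bpsh and filter the combined output on the master; the grep pattern shown is just one way to do this:

[root @cluster ~] # bpsh -a mount | grep /foo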

If you wish to stop mounting a certain directory on the compute nodes, you can either remove the corresponding line from /etc/beowulf/fstab or comment it out by inserting a '#' at the beginning of the line. The entry for that filesystem in /etc/exports may be left in place or deleted, whichever you prefer.

If you wish to unmount that directory on all the compute nodes without rebooting them, you can then run the following:

[root @cluster ~] # bpsh -a umount /foo

where /foo is the directory you no longer wish to have NFS mounted.

Caution

On compute nodes, NFS directories must be mounted using either a specific IP address or the $MASTER keyword; the hostname cannot be used. This is because fstab is evaluated before node name resolution is available.

File Locking Over NFS

By default, the compute nodes mount NFSv3 filesystems with locking turned off. If you have a program that requires locking, first ensure that the nfslock service is enabled on the master node and is executing:

[root @cluster ~] # chkconfig nfslock on
[root @cluster ~] # service nfslock start

Next, edit /etc/beowulf/fstab to remove the nolock keyword from the NFS mount entries.
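For example, with locking enabled, the default /home entry shown earlier becomes:

$MASTER:/home  /home  nfs  nonfatal  0 0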

Finally, reboot the compute nodes so that the NFS filesystems are remounted with locking enabled.

NFSD Configuration

By default, when the master node reboots, the /etc/init.d/nfs script launches 8 NFS daemon threads to service client NFS requests. For large clusters this count may be insufficient. One symptom of this insufficiency is a syslog message, most commonly seen when you boot all the cluster nodes:

nfsd: too many open TCP sockets, consider increasing the number of nfsd threads

To increase the thread count (e.g., to 16):

[root @cluster ~] # echo 16 > /proc/fs/nfsd/threads
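The change takes effect immediately on the running server; you can read the same file back to confirm the new thread count:

[root @cluster ~] # cat /proc/fs/nfsd/threads
16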

Ideally, the chosen thread count should be sufficient to eliminate the syslog complaints, but not significantly higher, as that would unnecessarily consume system resources. To make the new value persistent across master node reboots, create the file /etc/sysconfig/nfs, if it does not already exist, and add to it an entry of the form:

RPCNFSDCOUNT=16
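If you prefer to append that entry from the command line, a minimal sketch (assuming /etc/sysconfig/nfs does not already contain an RPCNFSDCOUNT line) is:

[root @cluster ~] # echo "RPCNFSDCOUNT=16" >> /etc/sysconfig/nfs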

A value of 1.5 to 2 times the number of nodes is generally adequate, though it may be more than you actually need.

A more refined analysis starts with examining NFSD statistics:

[root @cluster ~] # grep th /proc/net/rpc/nfsd

which outputs thread statistics of the form:

th 16 10 26.774 5.801 0.035 0.000 0.019 0.008 0.003 0.011 0.000 0.040

From left to right, the 16 is the current number of NFSD threads, and the 10 is the number of times that all threads have been simultaneously busy. (Not every instance of all threads being busy results in that syslog message, but a high all-busy count does suggest that adding more threads may be beneficial.)

The remaining 10 numbers are histogram buckets that show how many accumulated seconds a given percentage of the total number of threads has been simultaneously busy. In this example, 0-10% of the threads were busy for 26.774 seconds, 10-20% of the threads were busy for 5.801 seconds, and 90-100% of the threads were busy for 0.040 seconds. High numbers in the final buckets indicate that most or all of the threads are simultaneously busy for significant periods of time, which suggests that adding more threads may be beneficial.
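If you want to watch just the thread count, the all-busy count, and the last two histogram buckets over time, a small sketch using awk (the field positions follow the format described above) is:

[root @cluster ~] # awk '/^th/ {print "threads:", $2, " all-busy:", $3, " 80-100% busy (s):", $(NF-1), $NF}' /proc/net/rpc/nfsd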

ROMIO

ROMIO is a high-performance, portable implementation of MPI-IO, the I/O chapter in MPI-2: Extensions to the Message Passing Interface, and is included in the Scyld ClusterWare distribution. ROMIO is optimized for noncontiguous access patterns, which are common in parallel applications. It has an optimized implementation of collective I/O, an important optimization in parallel I/O.

Reasons to Use ROMIO

ROMIO provides an abstraction layer on top of high-performance I/O. The underlying file system may be implemented in various ways, but ROMIO hides those details from you: your binary will run on an NFS file system here and a different file system there, without changing a line of code or recompiling. POSIX open(), read(), and related calls already provide this kind of abstraction, but the virtual file system code that implements it lives deep in the kernel.

You may need to use ROMIO to take advantage of new, specialized, or experimental file systems. It is easier and more portable to implement a ROMIO module for a new file system than a Linux-specific VFS kernel layer.

Since ROMIO is an abstraction layer, it has the freedom to be implemented arbitrarily. For example, it could be implemented on top of the POSIX Asynchronous and List I/O calls for real-time performance reasons. The end-user application is shielded from these details and benefits from careful optimization of the I/O path by experts.

Installation and Configuration of ROMIO

ROMIO Over NFS

To use ROMIO on NFS, file locking with fcntl must work correctly on the NFS installation. First, since file locking is turned off by default, you need to turn on NFSv3 locking; see File Locking Over NFS. Then, for the fcntl locks to work, you must mount the NFS file system with the noac option (no attribute caching). This is done by modifying the line for mounting /home in /etc/beowulf/fstab to look like the following:

$MASTER:/home  /home  nfs  noac,nonfatal  0 0

Turning off attribute caching may reduce performance, but it is necessary for correct behavior.

Other Cluster File Systems

There are a variety of network file systems that can be used on a cluster. If you have questions regarding the use of any particular cluster file system with Scyld ClusterWare, contact Scyld Customer Support for assistance.