File Systems¶
File Systems on a Cluster¶
File systems on a cluster are of two types: local file systems and network file systems. The file /etc/fstab describes the filesystems mounted on the master node, and the file /etc/beowulf/fstab describes the filesystems mounted on each compute node. You may also create node-specific /etc/beowulf/fstab.N files, where N is a node number.
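For example, a file named /etc/beowulf/fstab.5 would apply to node 5. A hypothetical entry for a node that has its own scratch disk might look like the following (the device name, filesystem type, and mount point here are illustrative, not ClusterWare defaults):
/dev/sdb1 /scratch ext3 defaults,nonfatal 0 0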
Local File Systems¶
Local file systems are the file systems that exist on each machine's own storage. In the Scyld ClusterWare setup, the master node has a local file system, typically ext3, and each compute node also has one. Local file systems are used for storing data that is local to a particular machine.
Network/Cluster File Systems¶
Network file systems are used so that files can be shared across the cluster and every node in the cluster can see the exact same set of files. The default network file system for Scyld ClusterWare is NFS. NFS allows the contents of a directory on the server (by default the master node) to be accessed by the clients (the compute nodes). The default Scyld ClusterWare setup has the /home directory exported through NFS so that all the user home directories can be accessed on the compute nodes. Additionally, various other directories are mounted by default, as specified by /etc/beowulf/fstab or by a node-specific fstab.N.
Note that root's home directory is not in /home, so root cannot access its home directory on the compute nodes. This should not be a problem, as normal compute jobs should not be run as root.
NFS¶
NFS is the standard way to have files stored on one machine, yet be able to access them from other machines on the network as if they were stored locally.
NFS on Clusters¶
NFS in clusters is typically used so that if all the nodes need the same file, or set of files, they can access the file(s) through NFS. This way, if one changes the file, every node sees the change, and there is only one copy of the file that needs to be backed up.
Configuration of NFS¶
The Network File System (NFS) is what Scyld ClusterWare uses to allow users to access
their home directories and other remote directories from compute nodes.
(The User’s Guide has a small discussion on good and bad ways to use NFS.)
Two files control what directories are NFS mounted on the compute nodes. The first is /etc/exports. This tells the NFS daemon on the master node what directories it should allow to be mounted and who can access them. Scyld ClusterWare adds various commonly useful entries to /etc/exports. For example:
/home @cluster(rw)
The /home says that /home can be NFS mounted, and @cluster(rw) says who can mount it and what forms of access are allowed. @cluster is a netgroup, a single name that represents several machines; in this case, it represents all of your compute nodes. cluster is a special netgroup, set up by beonss, that automatically maps to all of your compute nodes, which makes it easy to specify that something can be mounted by your compute nodes. The (rw) part specifies what permissions the compute node has when it mounts /home. In this case, all user processes on the compute nodes have read-write access to /home. More options can go here; they are detailed in man exports.
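For instance, man exports documents options such as ro and sync. A hypothetical read-only export of a shared data directory (the /data path is illustrative, not a ClusterWare default) could look like this:
/data @cluster(ro,sync)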
The second file is /etc/beowulf/fstab. (Note that it is possible to set up an individual fstab.N for a node. For this discussion, we will assume that you are using a global fstab for all nodes.) For example, one line in the default /etc/beowulf/fstab is the following:
$MASTER:/home /home nfs nolock,nonfatal 0 0
This is the line that tells the compute nodes to try to mount /home
when they boot:
The $MASTER is a variable that is automatically expanded to the IP address of the master node.
The first /home is the directory location on the master node.
The second /home is where it should be mounted on the compute node.
The nfs specifies that this is an NFS file system.
The nolock specifies that locking should be turned off for this NFS mount. We turn off locking so that we don't have to run daemons on the compute nodes. (If you need locking, see File Locking Over NFS for details.)
The nonfatal option tells ClusterWare's /usr/lib/beoboot/bin/setup_fs script to treat a mount failure as a nonfatal problem. Without this nonfatal option, any mount failure leaves the compute node in an error state, making it unavailable to users.
The two 0's on the end are there to make this fstab look like the standard fstab in /etc.
To add an NFS mount of /foo to all of your compute nodes, first add the following line to the end of the /etc/exports file:
/foo @cluster(rw)
Then execute exportfs -a as root. For the mount to take place the next time your compute nodes reboot, you must add the following line to the end of /etc/beowulf/fstab:
$MASTER:/foo /foo nfs nolock 0 0
You can then reboot all of your nodes to make the NFS mount happen. If you wish to mount the new exported filesystem without rebooting the compute nodes, you can issue the following two commands:
[root@cluster ~]# bpsh -a mkdir -p /foo
[root@cluster ~]# bpsh -a mount -t nfs -o nolock master:/foo /foo
Note that /foo will need to be adjusted for the directory you actually want.
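One way to confirm that the mount succeeded on every node is to query the nodes with bpsh (the output will vary by node and site):
[root@cluster ~]# bpsh -a df -h /foo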
If you wish to stop mounting a certain directory on the compute nodes, you can either remove the line from /etc/beowulf/fstab or just comment it out by inserting a '#' at the beginning of the line. You can leave the entry referring to the filesystem in /etc/exports untouched, or you can delete the reference, whichever you feel more comfortable with.
If you wish to unmount that directory on all the compute nodes without rebooting them, you can then run the following:
[root@cluster ~]# bpsh -a umount /foo
where /foo is the directory you no longer wish to have NFS mounted.
Caution
On compute nodes, NFS directories must be mounted using either a specific IP address or the $MASTER keyword; the hostname cannot be used. This is because fstab is evaluated before node name resolution is available.
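For example, a hypothetical /etc/beowulf/fstab entry that mounts a directory from an external NFS server would reference that server by IP address (the address and paths below are placeholders):
10.10.10.100:/export/data /data nfs nolock,nonfatal 0 0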
File Locking Over NFS¶
By default, the compute nodes mount NFSv3 filesystems with locking turned off. If you have a program that requires locking, first ensure that the nfslock service is enabled on the master node and is executing:
[root@cluster ~]# systemctl enable nfslock
[root@cluster ~]# systemctl start nfslock
Next, edit /etc/beowulf/fstab to remove the nolock keyword from the NFS mount entries.
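For example, the default /home entry shown earlier would become:
$MASTER:/home /home nfs nonfatal 0 0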
Finally, reboot the cluster nodes to remount the NFS filesystems with locking enabled.
NFSD Configuration¶
By default, when the master node reboots, the /etc/init.d/nfs script launches 8 NFS daemon threads to service client NFS requests. For large clusters this count may be insufficient. One symptom of an insufficiency is a syslog message, most commonly seen when you boot all the cluster nodes:
nfsd: too many open TCP sockets, consider increasing the number of nfsd threads
To increase the thread count (e.g., to 16):
[root@cluster ~]# echo 16 > /proc/fs/nfsd/threads
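To check the value currently in effect, you can read the same file:
[root@cluster ~]# cat /proc/fs/nfsd/threads
16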
Ideally, the chosen thread count should be sufficient to eliminate the syslog complaints, but not significantly higher, as that would unnecessarily consume system resources. To make the new value persistent across master node reboots, create the file /etc/sysconfig/nfs, if it does not already exist, and add to it an entry of the form:
RPCNFSDCOUNT=16
A value of 1.5 to 2 times the number of nodes is probably adequate, although perhaps more than necessary.
A more refined analysis starts with examining NFSD statistics:
[root@cluster ~]# grep th /proc/net/rpc/nfsd
which outputs thread statistics of the form:
th 16 10 26.774 5.801 0.035 0.000 0.019 0.008 0.003 0.011 0.000 0.040
From left to right, the 16 is the current number of NFSD threads, and the 10 is the number of times that all threads have been simultaneously busy. (Not every occurrence of all threads being busy results in that syslog message, but a high all-busy count does suggest that adding more threads may be beneficial.)
The remaining 10 numbers are histogram buckets that show how many accumulated seconds a given percentage of the total number of threads has been simultaneously busy. In this example, 0-10% of the threads were busy 26.774 seconds, 10-20% of the threads were busy 5.801 seconds, and 90-100% of the threads were busy 0.040 seconds. High numbers in the last buckets indicate that most or all of the threads are simultaneously busy for significant periods of time, which suggests that adding more threads may be beneficial.
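To see whether the all-busy count keeps climbing under normal load, one simple approach is to sample these statistics periodically, for example:
[root@cluster ~]# watch -n 60 'grep th /proc/net/rpc/nfsd'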
ROMIO¶
ROMIO is a high-performance, portable implementation of MPI-IO, the I/O chapter in MPI-2: Extensions to the Message Passing Interface, and is included in the Scyld ClusterWare distribution. ROMIO is optimized for noncontiguous access patterns, which are common in parallel applications. It has an optimized implementation of collective I/O, an important optimization in parallel I/O.
Reasons to Use ROMIO¶
ROMIO gives you an abstraction layer on top of high-performance input/output. The details of the underlying file system may be implemented in various ways, but ROMIO hides those details from your application. Your binary code will run on an NFS file system here and a different file system there, without changing a line or recompiling. Although POSIX open(), read(), and similar calls already provide this kind of abstraction, the virtual file system code that handles it is deep in the kernel.
You may need to use ROMIO to take advantage of new special and experimental file systems. It is easier and more portable to implement a ROMIO module for a new file system than a Linux-specific VFS kernel layer.
Since ROMIO is an abstraction layer, it has the freedom to be implemented arbitrarily. For example, it could be implemented on top of the POSIX Asynchronous and List I/O calls for real-time performance reasons. The end-user application is shielded from caring, and benefits from careful optimization of the I/O details by experts.
Installation and Configuration of ROMIO¶
ROMIO Over NFS¶
To use ROMIO on NFS, file locking with fcntl must work correctly on the NFS installation. First, since file locking is turned off by default, you need to turn on NFSv3 locking; see File Locking Over NFS. Now, to get the fcntl locks to work, you must mount the NFS file system with the noac option (no attribute caching). This is done by modifying the line for mounting /home in /etc/beowulf/fstab to look like the following:
$MASTER:/home /home nfs noac,nonfatal 0 0
Turning off attribute caching may reduce performance, but it is necessary for correct behavior.
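After the compute nodes remount /home, one way to confirm that the new mount options are in effect is to inspect the mounts from the master node (output will vary):
[root@cluster ~]# bpsh -a mount | grep /home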
Other Cluster File Systems¶
There are a variety of network file systems that can be used on a cluster. If you have questions regarding the use of any particular cluster file system with Scyld ClusterWare, contact Scyld Customer Support for assistance.