Compute Node Disk Partitioning

Architectural Overview

The Scyld ClusterWare system uses a "disk-less administration" model for compute nodes. This means that the compute nodes boot and operate without the need for mounting any file system, either on a local disk or a network file system. By using this approach, the cluster system does not depend on the storage details or potential misconfiguration of the compute nodes, instead putting all configuration information and initialization control on the master.

This does not mean that the cluster cannot or does not use local disk storage or network file systems. Instead it allows the storage to be tailored to the needs of the application rather than the underlying cluster system.

The first operational issue after installing a cluster is initializing and using compute node storage. While the concept and process are similar to configuring the master machine, the "disk-less administration" model makes it much easier to change the storage layout on the compute nodes.

Operational Overview

Compute node hard disks are used for three primary purposes:

  • Swap Space — Expands the Virtual Memory of the local machine.
  • Application File Storage — Provides scratch space and persistent storage for application output.
  • System Caching — Increases the size and count of executables and libraries cached by the local node.

In addition, a local disk may be used to hold a cluster file system (used when the node acts as a file server to other nodes). To make this possible, Scyld provides programs to create disk partitions, a system to automatically create and check file systems on those partitions, and a mechanism to mount file systems.

Disk Partitioning Procedures

Deciding on a partitioning scheme for the compute node disks is no easier than for the master node, but the scheme can be changed more easily.

Compute node hard disks may be remotely partitioned from the master using beofdisk. This command automates the partitioning process, allowing all compute node disks with a matching hard drive geometry (cylinders, heads, sectors) to be partitioned simultaneously.

If the compute node hard disks have not been previously partitioned, you can use beofdisk to generate default partition tables for them. The default partition table allocates three partitions, as follows:

  • A BeoBoot partition equal to 2 MB (currently unused)
  • A swap partition equal to 2 times the node's physical memory
  • A single root partition equal to the remainder of the disk
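
As a rough illustration of that arithmetic, the following shell sketch computes the three default sizes for a given disk and memory size. The function name default_layout and its megabyte arguments are assumptions made for this example; they are not part of beofdisk, which may size partitions differently in practice (for example, by rounding to geometry boundaries).

```shell
#!/bin/sh
# Sketch only: default_layout is a hypothetical helper, not a Scyld command.
# It mirrors the default layout described above: 2 MB BeoBoot, swap equal to
# twice physical memory, and a root partition that takes the remainder.
default_layout() {
    disk_mb=$1      # total disk size in MB (assumed unit)
    mem_mb=$2       # node physical memory in MB (assumed unit)
    beoboot_mb=2
    swap_mb=$((2 * mem_mb))
    root_mb=$((disk_mb - beoboot_mb - swap_mb))
    if [ "$root_mb" -le 0 ]; then
        echo "disk too small for the default layout" >&2
        return 1
    fi
    echo "beoboot=${beoboot_mb} swap=${swap_mb} root=${root_mb}"
}

# Example: a 40960 MB disk on a node with 4096 MB of memory.
default_layout 40960 4096
```

A node with a small disk and a large amount of memory leaves little or no room for the root partition, which is one reason to consider a user-specified partition table instead of the default.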

The partition table for each disk geometry is stored in the directory /etc/beowulf/fdisk on the master node. Each filename reflects the disk type, position, and geometry. Example filenames are hda:2495:255:63, hdb:3322:255:63, and sda:2495:255:63.
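
The example filenames suggest the pattern device:cylinders:heads:sectors. The sketch below splits such a name into its fields and estimates the disk capacity. The helper name parse_fdisk_name is hypothetical, and both the field order and the 512-byte sector size are assumptions inferred from the examples rather than documented behavior.

```shell
#!/bin/sh
# Sketch only: parse_fdisk_name is a hypothetical helper, not a Scyld command.
# Assumes names of the form device:cylinders:heads:sectors and 512-byte sectors.
parse_fdisk_name() {
    name=${1##*/}               # strip any leading directory
    old_ifs=$IFS
    IFS=:
    set -- $name                # split the name on ':' into $1..$4
    IFS=$old_ifs
    if [ "$#" -ne 4 ]; then
        echo "unexpected name: $name" >&2
        return 1
    fi
    total_sectors=$(($2 * $3 * $4))
    # 2048 sectors of 512 bytes make one MiB.
    echo "device=$1 cylinders=$2 heads=$3 sectors=$4 size_mib=$((total_sectors / 2048))"
}

parse_fdisk_name /etc/beowulf/fdisk/hda:2495:255:63
```

Under those assumptions, hda:2495:255:63 describes a disk of roughly 19571 MiB in the first IDE position.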

The beofdisk command may also be used to read an existing partition table on a compute node hard disk, as long as that disk is properly positioned in the cluster. The command captures the partition table of the first hard disk of its type and geometry (cylinders, heads, sectors) in each position on a compute node's controller (e.g., sda or hdb). The script sequentially queries the compute nodes numbered 0 through N-1, where N is the number of nodes currently in the cluster.

Typical Partitioning

While it is not possible to predict every configuration that might be desired, the typical procedure to partition node disks is as follows:

  1. From the master node, capture partition tables for the compute nodes:

    [root@cluster ~]# beofdisk -q

    With the -q parameter, beofdisk queries all compute nodes. For the first drive found with a specific geometry (cylinders, heads, sectors), it reads the partition table and records it in a file. If the compute node disk has no partition table, this command creates a default partition set and reports the activity to the console.

    If the partition table on the disk is empty or invalid, it is captured and recorded as described, but no default partition set is created. You must create a default partition using the beofdisk -d command; see Default Partitioning.

  2. Based on the specific geometry of each drive, write the appropriate partition table to each drive of each compute node:

    [root@cluster ~]# beofdisk -w

    This technique is useful, for example, when you boot a single compute node with a local hard disk that is already partitioned, and you want the same partitioning applied to all compute nodes. You would boot the prototypical compute node, capture its partition table, boot the remaining compute nodes, and write that prototypical partition table to all nodes.

  3. Reboot all compute nodes to make the partitioning effective.

  4. If needed, update the file /etc/beowulf/fstab on the master node to record the mapping of the partitions on the compute node disks to the file systems.
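
The capture and write steps above could be wrapped in a small script, sketched here with a dry-run default so that nothing is changed until you have reviewed it. The DRY_RUN variable and the run and partition_nodes functions are hypothetical conveniences for this example, not features of beofdisk; only the -q and -w options come from this section.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run, and partition_nodes are hypothetical names.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

partition_nodes() {
    run beofdisk -q || return 1   # step 1: capture partition tables
    run beofdisk -w || return 1   # step 2: write them to matching disks
    # Steps 3 and 4 are left to the administrator.
    echo "next: reboot the compute nodes, then review /etc/beowulf/fstab"
}

partition_nodes
```

In practice you would probably inspect the files captured in /etc/beowulf/fdisk between the two steps, since the write step modifies every compute node disk with a matching geometry.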

Default Partitioning

To apply the recommended default partitioning to each disk of each compute node, follow these steps:

  1. Generate default partition maps to /etc/beowulf/fdisk:

    [root@cluster ~]# beofdisk -d

  2. Write the partition maps out to the nodes:

    [root@cluster ~]# beofdisk -w

  3. You must reboot the compute nodes before the new partitions are usable.

Generalized, User-Specified Partitions

To create a unique partition table for each disk type/position/geometry triplet, follow these steps:

  1. Remotely run the fdisk command on each compute node where the disk resides:

    [root@cluster ~]# bpsh n fdisk device

    where n is the node number of the first compute node with the drive geometry you want to partition, and device is the device you wish to partition (e.g., /dev/sda, /dev/hdb).

  2. Once you have created the partition table and written it to the disk using fdisk, capture it with beofdisk -q (see Typical Partitioning), then write it to all disks with the same geometry using:

    [root@cluster ~]# beofdisk -w

  3. You must reboot the compute nodes before the new partitioning will be effective.

  4. You must then map file systems to partitions as described later in this chapter.

Unique Partitions

To generate a unique partition table for a particular disk, follow these steps:

  1. Partition your disks using either default partitioning or generalized partitions as described above.

  2. From the master node, remotely run the fdisk command on the appropriate compute node to re-create a unique partition table using:

    [root@cluster ~]# bpsh n fdisk device

    where n is the compute node number for which you wish to create a unique partition table and device is the device you wish to partition (e.g., /dev/sda).

  3. You must then map file systems to partitions as described below.

Mapping Compute Node Partitions

If your compute node hard disks are already partitioned, edit the file /etc/beowulf/fstab on the master node to record the mapping of the partitions on your compute node disks to your file systems. This file contains example lines (commented out) showing the mapping of file systems to drives; read the comments in the fstab file for guidance.

  1. Query the disks on the compute nodes to determine how they are partitioned:

    [root@cluster ~]# beofdisk -q

    This creates a partition file in /etc/beowulf/fdisk, with a name similar to sda:512:128:32 and containing lines similar to the following:

    [root@cluster ~]# cat /etc/beowulf/fdisk/sda:512:128:32
    /dev/sda1  :  start=       32,  size=     8160,  Id=89,  bootable
    /dev/sda2  :  start=     8192,  size=  1048576,  Id=82
    /dev/sda3  :  start=  1056768,  size=  1040384,  Id=83
    /dev/sda4  :  start=        0,  size=        0,  Id=0

  2. Read the comments in /etc/beowulf/fstab, then add lines to the file that use the devices named in the partition file:

    # This is the default setup from beofdisk
    #/dev/hda2        swap     swap   defaults     0 0
    #/dev/hda3        /        ext2   defaults     0 0
    /dev/sda1         /boot    ext3   defaults     0 0
    /dev/sda2         swap     swap   defaults     0 0
    /dev/sda3         /scratch ext3   defaults     0 0

  3. After saving fstab, you must reboot the compute nodes for the changes to take effect.
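
Before rebooting, it may help to confirm that every device named in fstab corresponds to a partition that actually exists in the captured partition file. The following sketch does that with standard tools. The helpers list_partitions and check_fstab are hypothetical, and the script assumes that the size= values are counts of 512-byte sectors, which is consistent with the sample file (the three partitions end exactly at 512 x 128 x 32 sectors) but is not stated in this guide.

```shell
#!/bin/sh
# Sketch only: list_partitions and check_fstab are hypothetical helpers.
# Assumption: "size=" values in the partition file are 512-byte sectors.

# Print "device size_kib" for every partition with a nonzero size.
list_partitions() {
    awk '/^\/dev\// && match($0, /size=[ \t]*[0-9]+/) {
        size = substr($0, RSTART, RLENGTH)
        sub(/size=[ \t]*/, "", size)
        dev = $1
        sub(/:$/, "", dev)              # tolerate "/dev/sda1:" with no space
        if (size + 0 > 0) printf "%s %d\n", dev, size / 2
    }' "$1"
}

# Report active fstab devices on this disk that have no usable partition.
check_fstab() {
    part_file=$1
    fstab=$2
    disk=${part_file##*/}               # e.g. sda:512:128:32
    disk=${disk%%:*}                    # e.g. sda
    status=0
    while read -r dev rest; do
        case $dev in
            /dev/"$disk"*) ;;           # an active entry on this disk
            *) continue ;;              # comments, blanks, other disks
        esac
        if ! list_partitions "$part_file" | grep -q "^$dev "; then
            echo "no usable partition for $dev"
            status=1
        fi
    done < "$fstab"
    return "$status"
}

# Demo using a shortened copy of the sample data, in a temporary directory.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/sda:512:128:32" <<'EOF'
/dev/sda1  :  start=    32,  size=  8160,   Id=89,  bootable
/dev/sda2  :  start=    8192,   size=    1048576,  Id=82
/dev/sda3  :  start=    1056768,    size=    1040384,  Id=83
/dev/sda4  :  start=    0, size=  0,  Id=0
EOF
cat > "$tmp/fstab" <<'EOF'
#/dev/hda2        swap     swap   defaults     0 0
/dev/sda2         swap     swap   defaults     0 0
/dev/sda3         /scratch ext3   defaults     0 0
EOF

list_partitions "$tmp/sda:512:128:32"
check_fstab "$tmp/sda:512:128:32" "$tmp/fstab" && echo "fstab matches partition file"
```

On the master node you would point the two functions at a file in /etc/beowulf/fdisk and at /etc/beowulf/fstab. The check only looks at device names; it does not verify file system types or mount points.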