Disk Partitioning

Partitioning allows disk storage space to be broken up into segments that are then accessible by the operating system. This chapter discusses disk partitioning concepts, the default partitioning used by Scyld ClusterWare, and some useful partitioning scenarios.

Scyld ClusterWare creates a RAM disk on the compute node by default during the initial boot process. This RAM disk is used to hold the final boot image downloaded from the master node. If you have diskless nodes, then this chapter does not pertain to you.

Disk Partitioning Concepts

Disk partitioning on a cluster is essentially no different than partitioning on any stand-alone computer, with a few exceptions.

On a stand-alone computer or server, the disk drive's storage is divided into different sections that are configured in ways and sizes to meet your particular needs. Each partition is a segment that can be accessed independently, like a separate disk drive. The partitions are defined by the partition table contained on each disk.

Each partition table entry contains information about the locations on the disk where the partition starts and ends, the state of the partition (active or not), and the partition's type. Many partition types exist, such as Linux native, AIX, and DOS. The cluster administrator determines the appropriate partition types for the system.

Disk partitioning on a cluster is very much determined by the cluster system hardware and the requirements of the application(s) that will be running on the cluster, for instance:

  • Some applications are very process intensive but not very data intensive. In such cases, the cluster may be best served by the default RAM disk configuration; the speed of RAM provides better performance, and omitting compute node hard drives provides some cost savings.
  • Some applications are very data intensive but not very process intensive. In these cases, a hard disk is either required outright (given the size of the data set the application works with) or is a far less expensive option than purchasing an equivalent amount of memory.

The hard drive partitioning scheme depends heavily on the application's needs, the other tools that will interface with the data, and the preferences of the end user.

Disk Partitioning with ClusterWare

This section briefly describes the disk partitioning process for the master node and compute nodes in a Scyld cluster.

Master Node

On the master node of a Scyld cluster, the disk partitioning administration is identical to that on any stand-alone Linux server. As part of installing Red Hat Linux, you are requested to select how you would like to partition the master node's hard disk. After installation, the disk partitioning can be modified, checked, and utilized via traditional Linux tools such as fdisk, sfdisk, cfdisk, mount, etc.
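
For example, to review the master node's current layout you might list its partition table with fdisk (the device name /dev/sda is illustrative; substitute the master node's actual disk device):

[root@cluster ~] # fdisk -l /dev/sda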

Compute Nodes

Disk partitioning on the compute nodes of a Scyld cluster is handled slightly differently than on a traditional, stand-alone Linux server. Each compute node hard disk needs to be partitioned and formatted to be useful to the applications running on the cluster. However, not too many people would enjoy partitioning 64 or more nodes manually.

To simplify this task, Scyld ClusterWare provides the beofdisk tool, which allows remote partitioning of the compute node hard disks. It is very similar in operation to fdisk, but allows many nodes to be partitioned at once. The use of beofdisk for compute node partitioning is covered in more detail in Partitioning Scenarios.

Default Partitioning

This section addresses the default partitioning schemes used by Scyld ClusterWare.

Master Node

The default Scyld partition table allocates four partitions:

  • /boot partition
  • /home partition
  • / partition
  • Swap partition = 2 times physical memory

Most administrators will want to change this to meet the requirements of their particular cluster.

Compute Nodes

The default partition table allocates three partitions for each compute node:

  • BeoBoot partition = 2 MB
  • Swap partition = half the compute node's physical memory or half the disk, whichever is smaller
  • Single root partition = remainder of disk

The default method of configuring the compute nodes at boot time is to run off a RAM disk. This "diskless" configuration is appropriate for many applications, but not all. Typical usage requires configuration and partitioning of the compute node hard disks, which is covered in the partitioning scenarios discussed in the following section.

Partitioning Scenarios

This section discusses how to implement two of the most common partitioning scenarios in Scyld ClusterWare:

  • Apply the default partitioning to all disks in the cluster
  • Specify your own manual but homogeneous partitioning to all disks in the cluster

The Scyld beofdisk tool can read an existing partition table on a compute node. It queries the compute nodes sequentially, beginning with node 0. For each unique device and drive geometry it finds, it looks for an existing partition table file in /etc/beowulf/fdisk; if none is present, it generates one using the default scheme. These files can then be modified by hand. Whether modified or left with the default values, the files can be written back to the compute node hard drives.
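
For example, after a query of a cluster whose compute nodes contain two distinct drive geometries, the directory would hold one file per geometry, named device:cylinders:heads:sectors. The listing below is illustrative and reuses the geometries from the example later in this chapter:

[root@cluster ~] # ls /etc/beowulf/fdisk
        hda:1222:255:63  hda:2495:255:63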

Caution

If you attempt to boot a node with an unpartitioned hard drive that is specified in /etc/beowulf/fstab (or a node-specific fstab.N for node N), then that node boots to an error state unless the fstab entry includes the "nonfatal" option. See the Reference Guide or man beowulf-fstab for details.
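
For example, a hypothetical fstab entry that uses the "nonfatal" option might look like this (device, mount point, and filesystem type are assumptions for illustration):

    /dev/hda3   /scratch   ext3   defaults,nonfatal   0 0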

Applying the Default Partitioning

To apply the default disk partitioning scheme (as recommended by the Scyld beofdisk tool) to the compute nodes, follow these steps:

Query all the hard drives on the compute nodes and write out partition table files containing the suggested partitioning:

[root@cluster ~] # beofdisk -d
        Creating a default partition table for hda:2495:255:63
        Creating a default partition table for hda:1222:255:63

Read the partition table files, and partition the hard drives on the compute nodes so that they match:

[root@cluster ~] # beofdisk -w

To use the new partitions you created, modify the /etc/beowulf/fstab file to specify how the partitions on the compute node should be mounted. The contents of /etc/beowulf/fstab should be in the standard fstab format.
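
For example, entries that enable the swap partition and mount the large data partition created by the default scheme might look like the following sketch (the device names assume the default beoboot/swap/root partition order, and the mount point and filesystem type are illustrative; adjust all of these to match your actual layout):

    /dev/hda2   swap       swap   defaults   0 0
    /dev/hda3   /scratch   ext3   defaults   0 0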

To format the disk(s) on reboot, change "mkfs never" to "mkfs always" in the cluster config file /etc/beowulf/config.
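
The relevant directive would then read as follows (an illustrative excerpt from /etc/beowulf/config):

    mkfs always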

To try out the new partitioning, reboot the compute nodes with the following:

[root@cluster ~] # bpctl -S all -R

Caution

To prevent disks from being reformatted on subsequent reboots, change "mkfs always" back to "mkfs never" in /etc/beowulf/config after the nodes have booted.

Specifying Manual Partitioning

You can manually apply your own homogeneous partitioning scheme to the partition tables, instead of taking the suggested defaults. There are two methods for doing this:

  • The recommended method involves running fdisk on the first node (node 0) of the cluster, and then on the first node of each additional unique hard disk type.
  • The other method is to manually edit the partition table text file retrieved by the beofdisk query.

For example, assume that your cluster has 6 compute nodes, and that all disks have 255 heads and 63 sectors (the most common geometry). Nodes 0, 1, and 5 have a single IDE hard disk with 2500 cylinders. Nodes 2, 3, and 4 have a first IDE disk with 2000 cylinders, and node 4 also has a SCSI disk with 5000 cylinders. This cluster could be partitioned as follows:

  1. Partition the disk on node 0:

    [root@cluster ~] # bpsh 0 fdisk /dev/hda
    

    Follow the steps through the standard fdisk method of partitioning the disk.
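
    A minimal sketch of such a session, creating the default three-partition layout, might proceed roughly as follows (exact prompts and values depend on your fdisk version and disk; partition numbering assumes the beoboot/swap/root order described earlier):

    Command (m for help): n    (create partition 1, the small beoboot partition)
    Command (m for help): n    (create partition 2, the swap partition)
    Command (m for help): t    (set partition 2's type to 82, Linux swap)
    Command (m for help): n    (create partition 3, the root partition, from the remaining space)
    Command (m for help): w    (write the partition table and exit)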

  2. Manually partition the disk on node 2 with fdisk:

    [root@cluster ~] # bpsh 2 fdisk /dev/hda
    

    Again, follow the steps through the standard fdisk method of partitioning the disk.

  3. Manually partition the SCSI disk on node 4 with fdisk:

    [root@cluster ~] # bpsh 4 fdisk /dev/sda
    

    Again, follow the steps through the standard fdisk method of partitioning the disk.

  4. Next, query the compute nodes to get all the partition table files written for their hard drives, using the command beofdisk -q:
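
    [root@cluster ~] # beofdisk -q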

    At this point, the three partition tables will be translated into text descriptions, and three files will be put in the directory /etc/beowulf/fdisk. The file names will be hda:2500:255:63, hda:2000:255:63, and sda:5000:255:63. These files describe how the compute node hard drives are currently partitioned.

    You have the option to skip the fdisk command and just edit these files manually. The danger is that there are lots of rules about what combinations of values are allowed, so it is easy to make an invalid partition table. Most of these rules are explained as comments at the top of the file.

  5. Now write out the partitioning scheme using the command beofdisk -w.

    When specifying unique partitioning for certain nodes, you must also specify a unique fstab for each node that has a unique partition table. To do this, create the file /etc/beowulf/fstab.<nodenumber>. If this file exists, the node_up script will use that as the fstab for the compute node; otherwise, it will default to /etc/beowulf/fstab. Each instance of /etc/beowulf/fstab.<nodenumber> should be in the same format as /etc/beowulf/fstab.
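
    In the example cluster above, node 4 is the only node with a second (SCSI) disk, so it needs its own file. A minimal sketch, assuming an illustrative mount point and filesystem type:

    [root@cluster ~] # cp /etc/beowulf/fstab /etc/beowulf/fstab.4

    Then add an entry such as the following to /etc/beowulf/fstab.4 for a partition on the SCSI disk:

    /dev/sda1   /data   ext3   defaults   0 0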

  6. To format the disk(s) on reboot, change "mkfs never" to "mkfs always" in the cluster config file /etc/beowulf/config.

  7. To try out the new partitioning, reboot the compute nodes with the following:

    [root@cluster ~] # bpctl -S all -R
    

    Caution

    To prevent disks from being reformatted on subsequent reboots, change "mkfs always" back to "mkfs never" in /etc/beowulf/config after the nodes have booted.