Troubleshooting ClusterWare¶
ClusterWare Log Files¶
The /var/log/clusterware/ folder contains several log files that may help diagnose problems. Additionally, the ClusterWare database service may have useful information in its logs. For etcd, see /var/log/clusterware/etcd.log.
On a typical head node the /var/log/clusterware/ folder contains api_access_log and api_error_log files. These are the Apache logs for the service providing the REST API. The log level used in the api_error_log is controlled by the Pyramid logging configuration in the /opt/scyld/clusterware/conf/pyramid.ini file. The Pyramid project documentation describes the pertinent variables: https://docs.pylonsproject.org/projects/pyramid/en/latest/narr/logging.html
A selection of log statements from the api_error_log are also logged to the ClusterWare database and then copied to the logging folder on each head node. A separate log file is created for each head node and is named based on the head node UID, e.g. head_293aafd3f635448e9aaa76fc998ebc0c.log. This should allow a cluster administrator to diagnose many problems without needing to contact every head node individually. The log level for this file is controlled by the logging.level variable in each head node's /opt/scyld/clusterware/conf/base.ini file. The default log level of WARNING should be useful but not overly verbose. The options from most terse to most verbose are AUDIT, ERROR, WARNING, INFO, DEBUG.
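For example, to increase verbosity a head node's base.ini might contain a line such as the following. This is a minimal illustration; the file's other settings and the exact placement of the key are not shown here and may differ in your installation, and restarting the clusterware service afterwards is likely required for the change to take effect.
# Fragment of /opt/scyld/clusterware/conf/base.ini -- placement is illustrative
logging.level = INFO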
The various /var/log/clusterware/* logfiles are periodically rotated, as directed by the /etc/logrotate.d/clusterware, /etc/logrotate.d/clusterware-dnsmasq, and /etc/logrotate.d/clusterware-iscdhcp configuration files that are distributed in the clusterware, clusterware-dnsmasq, and clusterware-iscdhcp RPMs, respectively.
Note
If the local cluster administrator modifies the /etc/logrotate.d/clusterware file, then a subsequent update of the clusterware RPM will install a new version as /etc/logrotate.d/clusterware.rpmnew. The cluster administrator should merge this clusterware.rpmnew into the locally customized /etc/logrotate.d/clusterware. Similar treatment of clusterware-dnsmasq and clusterware-iscdhcp is advised.
Command-Line Monitoring of Nodes¶
ClusterWare provides two primary methods to monitor cluster performance and health: the command-line scyld-nodectl status tool and a more extensive graphical user interface (see Monitoring Graphical Interface). More basic node status can be obtained through the scyld-nodectl command. For example, a cluster administrator can view the status of all nodes in the cluster:
# Terse status:
[admin@virthead]$ scyld-nodectl status
n[0] up
n[1] down
n[2] new
# Verbose status:
[admin@virthead]$ scyld-nodectl status --long
Nodes
n0
ip: 10.10.24.100
last_modified: 2019-04-16 05:02:26 UTC (0:00:02 ago)
state: up
uptime: 143729.68
n1
down_reason: boot timeout
ip: 10.10.42.102
last_modified: 2019-04-15 09:03:20 UTC (19:59:08 ago)
last_uptime: 59.61
state: down
n2: {}
From this sample output we can see that n0 is up and has recently (2 seconds earlier) sent status information back to the head node. This status information is sent by each compute node to its parent head node once every 10 seconds, although this period can be overridden with the _status_secs node attribute. The IP address shown here is the IP reported by the compute node and should match the IP provided in the node database object unless the database has been changed and the node has not yet been rebooted.
Compute node n1 is currently down because of a "boot timeout". This means that the node attempted to boot, but its initial "up" status message was never received by the head node. This could happen because of a boot failure such as a missing network driver, because of a networking failure preventing the node from communicating with the head node, or because the cw-status-updater service provided by the clusterware-node package is not running on the compute node. Other possible values for down_reason include "node stopped sending status" and "clean shutdown".
There is no status information about n2 because it was added to the system and has never been booted. Additional node status can be viewed with scyld-nodectl status -L (an abbreviation of --long-long), which includes the most recent full hostname, kernel command line, loaded modules, loadavg, free RAM, kernel release, and SELinux status. As with other scyld-*ctl commands, the output can also be provided as JSON to simplify parsing and scripting.
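A minimal scripting sketch might look like the following; the --json option name is an assumption here and should be confirmed with scyld-nodectl --help before relying on it:
# Assumes a --json output option; pretty-print using standard Python tooling
scyld-nodectl status --long --json | python3 -m json.tool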
For large clusters the --long (or -l) display can be unwieldy, so the status command defaults to a summary. Each row of output corresponds to a different node status and lists the nodes in a format that can then be passed to the --ids argument of scyld-nodectl. Passing an additional --refresh argument will cause the tool to start an ncurses application that will display the summary in the terminal and periodically refresh the display:
scyld-nodectl status --refresh
This mode can be useful when adding new nodes to the system by booting them one at a time as described in Node Creation with Unknown MAC address(es).
Failing PXE Network Boot¶
If a compute node fails to join the cluster when booted via PXE network boot, there are several places to look, as discussed below.
Rule out physical problems. Check for disconnected Ethernet cables, malfunctioning network equipment, etc.
Confirm the node's MAC is in the database. Search for the node by MAC address to confirm it is registered with the ClusterWare system:
scyld-nodectl -i 00:11:22:33:44:55 ls -l
Check the system logs. Specifically, look for the node's MAC address in the api_error_log and head_*.log files. These files will contain AUDIT statements whenever a compute node boots, e.g.,
Booting node (MAC=08:00:27:f0:44:35) as iscsi using boot config b7412619fe28424ebe1f7c5f3474009d.
Booting node (MAC=52:54:00:a6:f3:3c) as rwram using boot config f72edc4388964cd9919346dfeb21cd2c.
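A quick way to scan for these entries is an ordinary grep across the relevant logs, substituting the MAC address of the problem node (the address below is taken from the example above):
grep -i '08:00:27:f0:44:35' /var/log/clusterware/api_error_log /var/log/clusterware/head_*.log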
If there are no "Booting node" log statements, then the failure is most likely happening at the DHCP stage, and the head nodes' isc-dhcpd.log log files may contain useful information.
As a last resort, check if the head node is seeing the compute node's DHCP requests, or whether another server is answering, using the Linux tcpdump utility. The following example shows a correct dialog between compute node 0 (10.10.100.100) and the head node.
[root@cluster ~]# tcpdump -i eth1 -c 10
Listening on eth1, link-type EN10MB (Ethernet),
capture size 96 bytes
18:22:07.901571 IP master.bootpc > 255.255.255.255.bootps:
BOOTP/DHCP, Request from .0, length: 548
18:22:07.902579 IP .-1.bootps > 255.255.255.255.bootpc:
BOOTP/DHCP, Reply, length: 430
18:22:09.974536 IP master.bootpc > 255.255.255.255.bootps:
BOOTP/DHCP, Request from .0, length: 548
18:22:09.974882 IP .-1.bootps > 255.255.255.255.bootpc:
BOOTP/DHCP, Reply, length: 430
18:22:09.977268 arp who-has .-1 tell 10.10.100.100
18:22:09.977285 arp reply .-1 is-at 00:0c:29:3b:4e:50
18:22:09.977565 IP 10.10.100.100.2070 > .-1.tftp: 32 RRQ
"bootimg::loader" octet tsize 0
18:22:09.978299 IP .-1.32772 > 10.10.100.100.2070:
UDP, length 14
10 packets captured
32 packets received by filter
0 packets dropped by kernel
Verify that ClusterWare services are running. Check the status of ClusterWare services with the commands:
systemctl status clusterware
systemctl status clusterware-dhcpd
systemctl status clusterware-dnsmasq
Restart ClusterWare services from the command line using:
sudo systemctl restart clusterware
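If any of these services reports a failed state, its recent journal entries often explain why. journalctl is standard systemd tooling rather than a ClusterWare command:
journalctl -u clusterware --since "1 hour ago"
journalctl -u clusterware-dhcpd --since "1 hour ago"
journalctl -u clusterware-dnsmasq --since "1 hour ago"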
Check the switch configuration. If the compute nodes fail to boot immediately on power-up but successfully boot later, the problem may lie with the configuration of a managed switch.
Some Ethernet switches delay forwarding packets for approximately one minute after link is established, attempting to verify that no network loop has been created ("spanning tree"). This delay is longer than the PXE boot timeout on some servers.
Disable the spanning tree check on the switch. The parameter is typically named "fast link enable".
Kickstart Failing¶
If a node has been configured to kickstart using a boot configuration provided by a repo created from an ISO file but is failing, then check the console output for the node. If the node is entering the "Dracut Emergency Shell" from the dracut timeout scripts, then you will need to retry and see what messages were on screen prior to the "Warning: dracut-initqueue timeout" messages that flood the screen. One common error is "Warning: anaconda: failed to fetch stage2 from <URL>", where the URL points to a repo on the head node. If this message occurs, please check that you have uploaded the correct ISO into the system.
For CentOS and RHEL, the "boot" ISO files such as CentOS-8.1.1911-x86_64-boot.iso do contain the files necessary to initiate the kickstart process, but do not contain the full package repositories. The cluster administrator must either provide appropriate URLs in the kickstart file (see the sketch below) or upload a more complete ISO such as CentOS-8.1.1911-x86_64-dvd1.iso using the scyld-clusterctl command. For example, to replace the ISO originally uploaded into a newly created centos8_repo repo:
scyld-clusterctl repos -i centos8_repo update iso=@CentOS-8.1.1911-x86_64-dvd1.iso
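If the boot ISO is kept and external repositories are used instead, the kickstart file needs url and repo directives roughly like the following; the mirror URLs are placeholders for your own mirrors, not ClusterWare defaults:
url --url="http://mirror.example.com/centos/8/BaseOS/x86_64/os/"
repo --name="AppStream" --baseurl="http://mirror.example.com/centos/8/AppStream/x86_64/os/"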
Head Node Filesystem Is 100% Full¶
If a head node filesystem that contains ClusterWare data (typically the root filesystem) is 100% full, then the administrator cannot execute scyld-* commands and ClusterWare cluster operations will fail.
Remove unnecessary objects from the ClusterWare database¶
Remove any unnecessary objects in the database that may be lingering after an earlier aborted operation:
sudo systemctl stop clusterware
sudo rm /opt/scyld/clusterware/storage/*.old.00
sudo systemctl start clusterware
If that does not release enough space to allow the scyld-* commands to execute, then delete the entire local cache of database objects:
sudo systemctl stop clusterware
sudo rm -fr /opt/scyld/clusterware/workspace/*
sudo systemctl start clusterware
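If it is unclear whether enough space has been reclaimed, a standard df check on the ClusterWare directory reports the free space on the filesystem that contains it:
df -h /opt/scyld/clusterware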
Investigate InfluxDB retention of Telegraf data¶
If you continue to see influxdb messages in /var/log/messages that complain "no space left on device", or if the size of the /var/lib/influxdb/ directory is excessively large, then InfluxDB may be retaining too much Telegraf time series data (stored as shards). Examine with:
sudo systemctl restart influxdb
# View the summation of all the Telegraf shards
sudo du -sh /var/lib/influxdb/data/telegraf/autogen/
# View the space consumed by each Telegraf shard
sudo du -sh /var/lib/influxdb/data/telegraf/autogen/*
If the autogen directory or any particular autogen subdirectory shard consumes a suspiciously large amount of storage, then examine the retention policy with the influx tool:
sudo influx
and now within the interactive tool you can execute influx commands:
> show retention policies on telegraf
The current ClusterWare defaults are a duration of 168h0m0s (save seven shards of Telegraf data) and a shardGroupDuration of 24h0m0s (each spanning one 24-hour day). You can reduce the current retention policy, if that makes sense for your cluster, with a simple command. For example, reduce the above seven-shard duration to five days, thereby reducing the number of saved shards by two:
> alter retention policy "autogen" on "telegraf" duration 5d
You can also delete individual unneeded shards. View the shards and their timestamps:
> show shards
and selectively delete any unneeded shard using its id number, which is found in the show output's first column:
> drop shard <shard-id>
When finished, exit the influx tool with:
> exit
See https://docs.influxdata.com/influxdb/v1.8/ for more documentation.
Remove unnecessary PXEboot images, repos¶
If scyld-* commands can now execute, then view information for all images and repos, including their sizes:
scyld-imgctl ls -l
scyld-clusterctl repos ls -l
Consider selectively deleting unneeded images with:
scyld-imgctl -i <imageName> rm
and consider selectively deleting unneeded repos with:
scyld-clusterctl repos -i <repoName> rm
Otherwise¶
If scyld-* commands still cannot execute, and if your cluster really does need all its existing images, boot configs, telegraf history, and other non-ClusterWare filesystem data, then consider moving extraordinarily large directories (e.g., /opt/scyld/clusterware/workspace/, as specified in /opt/scyld/clusterware/conf/base.ini) to another filesystem or even to another server, and/or adding storage space to the appropriate filesystem(s).
Exceeding System Limit of Network Connections¶
Clusters with a large number of nodes (e.g., many hundreds or more) may observe a problem when executing a workload that attempts to communicate concurrently with many or most of the nodes, such as scyld-nodectl --up exec or mpirun executing a multi-threaded, multi-node application. The problem exhibits itself as an error message that refers to being unable to allocate a TCP/IP socket or network connection, or as the kernel's ARP cache reporting a "neighbor table overflow!" error.
A possible solution is to increase the number of available "neighbor" entries. These are managed by a coordinated increase of the gc_thresh1, gc_thresh2, and gc_thresh3 values. See https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt for the semantics of these variables.
See the current values with:
sysctl net.ipv4.neigh.default.gc_thresh1
sysctl net.ipv4.neigh.default.gc_thresh2
sysctl net.ipv4.neigh.default.gc_thresh3
Default CentOS/RHEL values are 128, 512, and 1024, respectively. Experiment with higher values until your workloads are all successful, e.g.:
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
See man sysctl.conf for how to make the successful values persistent across a reboot by putting them in a new /etc/sysctl.d/ file.
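For example, a drop-in file along the following lines applies the settings at every boot; the filename is arbitrary and the values are the experimental ones from above:
# /etc/sysctl.d/90-neigh-gc.conf (example filename)
net.ipv4.neigh.default.gc_thresh1 = 2048
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
Apply the file without rebooting using sudo sysctl --system.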
etcd Database Exceeds Size Limit¶
The etcd database has a hard limit of 2GB.
If exceeded, then all scyld-* commands fail, and /var/log/clusterware/api_error_log will commonly grow in size as each node's incoming status message cannot be serviced. The ClusterWare api_error_log may also contain the following text:
etcdserver: mvcc: database space exceeded
Normally a background thread on the head node triggers the discarding of database history (called compaction) and, if deemed necessary, database defragmentation (called defrag). In the rare event that this thread stops executing, the etcd database grows until its size limit is reached.
This problem can be solved with manual intervention by an administrator. Determine whether the etcd database really does exceed its limit. For example:
[admin@head1]$ sudo du -hs /opt/scyld/clusterware-etcd/
2.1G /opt/scyld/clusterware-etcd
shows a size larger than 2GB, so you can proceed with the manual intervention.
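The size that etcd itself reports can also be checked with the standard etcdctl endpoint status subcommand, invoked here through the same wrapper path used by the other commands in this section:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl endpoint status --write-out=table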
First determine the current database revision. For example:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl get --write-out=json does_not_exist
{"header":{"cluster_id":9938544529041691203,"member_id":10295069852257988966,"revision":4752785,"raft_term":7}}
Subtract two or three thousand from the revision value 4752785 and compact to that new value:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl compaction 4750000
compacted revision 4750000
and trigger a defragmentation to reclaim space:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl defrag
Finished defragmenting etcd member[http://localhost:52379]
Then clear the alarm and reload the clusterware service:
[admin@head1]$ sudo /opt/scyld/clusterware-etcd/bin/etcdctl alarm disarm
[admin@head1]$ sudo systemctl reload clusterware
This restarts the head node thread that executes in the background and checks the etcd database size. Everything should now function normally.
Failing To Boot From Local Storage¶
If a compute node is configured to boot from local storage, yet after booting it is actually using a RAM root filesystem, then the problem may be that the initramfs image does not contain a kernel module needed to mount the root filesystem on local storage. Examine /opt/scyld/clusterware-node/atboot/cw-dracut.log on the compute node to determine whether the mount failed and why.
If the problem is a missing kernel module, then add that module to the initramfs. For example, to add the virtio_blk module and rebuild the boot config:
scyld-mkramfs --update DefaultBoot --kver 3.10.0-957.27.2.el7.x86_64 --drivers virtio_blk
IP Forwarding¶
If IP forwarding is desired and is still not working, then search for the line containing "net.ipv4.ip_forward":
grep net.ipv4.ip_forward /etc/sysctl.conf
grep net.ipv4.ip_forward /etc/sysctl.d/*
If that line exists and the assigned value is set to zero, then IP forwarding will be disabled.
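To enable IP forwarding persistently, one approach is a sysctl drop-in file such as the following (the filename is arbitrary), applied with sysctl --system; also remove or correct any conflicting zero-valued line found by the grep commands above:
# /etc/sysctl.d/90-ip-forward.conf (example filename)
net.ipv4.ip_forward = 1
sudo sysctl --system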
Soft Power Control Failures¶
If the scyld-nodectl reboot or shutdown commands always fall back on hard power control, the shutdown process on the compute node may be taking too long. When this happens the scyld-nodectl reboot or shutdown commands will pause for several seconds waiting for the soft power change to take place before falling back to direct power control through the power_uri. A common cause for this is a network file system that is slow to unmount. The cluster administrator should address the problem delaying shutdown, but if it is unavoidable, then the reboot and shutdown commands accept options to adjust the timeout (--timeout <seconds>) or to use only the soft reboot (--soft) without falling back to direct power control.
Head Nodes Disagree About Compute Node State¶
If two linked head nodes disagree about the status of the compute nodes, this is usually due to clock skew between the head nodes. The appropriate fix is to ensure that all head nodes are using the same NTP / Chrony servers. The shared database includes the last time each compute node provided a status update. If that time is too far in the past, then a compute node is assumed to have stopped communicating and is marked as "down". This mark is not recorded in the database, but is instead applied as the data is returned to the calling process such as scyld-nodectl status.
Applications Report Excessive Interruptions and Jitter¶
In certain circumstances, most commonly with real-time, performance-sensitive multi-node applications, an application may occasionally suffer noticeable unwanted interruptions or "jitter" that affect its stability and predictability.
Some issues may be remedied by having the affected compute nodes execute in "busy mode", during which the node's cw-status-updater service severely reduces the scope of what information it periodically gathers and reports to the node's parent. That service's normal operation may exhibit an infrequent 1-2 second computation stall, which in a cluster with hundreds or thousands of nodes may affect a multi-node real-time application's otherwise rapid periodic synchronization.
"Busy mode" can be enabled in one of three ways:
Set the node's boolean _busy reserved attribute to True with a case-insensitive 1, "on", "y", "yes", "t", or "true". See the _busy attribute in Reserved Attributes for details. Turn off "busy mode" by setting _busy to False with 0, "off", "n", "no", "f", or "false", or by clearing the _busy attribute completely.
Execute touch /opt/scyld/clusterware-node/etc/busy.flag on the node in a job scheduler prologue to enable, and rm that file in an epilogue to disable (see the sketch after this list). This busy.flag method is ignored if the node's _busy attribute is explicitly set to True or False.
The node's cw-status-updater service may heuristically decide on its own to execute in "busy mode". This method is overridden by the presence of busy.flag or by an explicit _busy attribute setting.
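A minimal sketch of job scheduler prologue and epilogue scripts using this flag file follows; the script names and the way your scheduler runs them on the compute node are assumptions, while the flag path is the one documented above:
#!/bin/sh
# prologue (illustrative): runs on each allocated compute node before the job starts
touch /opt/scyld/clusterware-node/etc/busy.flag
#!/bin/sh
# epilogue (illustrative): runs on each allocated compute node after the job completes
rm -f /opt/scyld/clusterware-node/etc/busy.flag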
An additional approach is to employ cpusets to execute specific applications on specific node cores in order to minimize contention. See the _status_cpuset attribute in Reserved Attributes for details about how to do this for the cw-status-updater service, and consult your Linux distribution or job scheduler documentation for how to do this for your applications.
Managing Node Failures¶
In a large cluster the failure of individual compute nodes should be anticipated and planned for. Since many compute nodes are diskless, recovery should be relatively simple, consisting of rebooting the node once any hardware faults have been addressed. Disked nodes may require additional steps depending on the importance of the data on disk. Please refer to your operating system documentation for details.
A compute node failure can unexpectedly terminate a long running computation involving that node. We strongly encourage authors of such programs to use techniques such as application checkpointing to ensure that computations can be resumed with minimal loss.
Head Node Failure¶
To avoid issues like an Out-Of-Memory condition or similarly preventable failure, head nodes should generally not participate in the computations executing on the compute cluster. As a head node plays an important management role, its failure, although rare, has the potential to impact significantly more of the cluster than the failure of individual compute nodes. One common strategy for reducing the impact of a head node failure is to employ multiple head nodes in the cluster. See Managing Multiple Head Nodes for details.
Managing Large Clusters¶
Scyld ClusterWare head nodes generally scale well out of the box, at least from the software perspective: the compute nodes' demands on a head node occur primarily during node boot. Thereafter nodes generate regular, modest Telegraf networking traffic to the InfluxDB server to report node status, and sporadic networking traffic to whatever cluster filesystem(s) are employed for shared storage.
Very large clusters may exhibit scaling limitations due to hardware constraints of CPU counts, RAM sizes, and networking response time and throughput. Those limitations are visible to cluster administrators using well known monitoring tools.
Improve scaling of node booting¶
The clusterware service is a multi-threaded Python application started by the Apache web server. By default, each head node will spawn up to 16 worker threads to handle incoming requests, but for larger clusters (hundreds of nodes per head node) this number can be adjusted as needed by changing the thread=16 value in /opt/scyld/clusterware/conf/httpd_wsgi.conf and restarting the clusterware service.
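Purely as an illustration of where such a value typically lives, a mod_wsgi daemon-process definition that caps worker threads looks something like the following; the process-group name and the exact directive spelling in the shipped httpd_wsgi.conf may differ:
# Illustrative mod_wsgi directive; not a copy of the shipped httpd_wsgi.conf
WSGIDaemonProcess clusterware threads=16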
Finding Further Information¶
If you encounter a problem installing your Scyld cluster and find that this Installation & Administrator Guide cannot help you, the following are sources for more information:
The Changelog & Known Issues contains per-release specifics, and a Known Issues And Workarounds section.
The Reference Guide contains a technical reference to Scyld ClusterWare commands.
Contacting Penguin Computing Support¶
If you choose to contact Penguin Computing Support, you may be asked to submit a system information snapshot. Execute scyld-sysinfo --no-tar to view this snapshot locally, or execute scyld-sysinfo to produce the compressed tarball that can be emailed or otherwise communicated to Penguin Computing.