Appendix: Monitoring Scheduler Info

Important

This software is a TECHNOLOGY PREVIEW that is being rolled out with limited features and limited support. Customers are encouraged to contact Penguin with suggestions for improvements, or ways that the tool could be adapted to work better in their environments.

ClusterWare provides a separate daemon process that reads data from one or more inputs and pushes that data into one or more outputs. The supported inputs are the Slurm batch scheduler and ClusterWare itself. The supported outputs are ClusterWare, InfluxDB, and an archive file. By selecting appropriate inputs and outputs, one could read data from ClusterWare and write it into InfluxDB, or read from Slurm and write into ClusterWare.
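
For example, a minimal sched_watcher.conf fragment (the options are described under Config Settings below) that reads from Slurm and writes to both ClusterWare and InfluxDB:

[sched_watcher]
input = slurm
output = clusterware, influxdb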

The scheduler data will show up in the ClusterWare node attributes and can then be viewed with:

$ scyld-nodectl ls -l
Nodes
  n0
    attributes
      _boot_config: DefaultBoot
      _sched_state: idle
      _sched_extra: Node is idle
      _sched_full: { ... full JSON blob }

Loosely speaking, _sched_state is a one-word summary of the state of the node (allocated, idle, down); _sched_extra is a one-line summary, potentially giving basic info on why the node might be in that state (e.g. not responding to pings might lead to a "down" indicator); and _sched_full is a JSON dump of all the information the scheduler provided for that node.

With the data in the nodes' attributes, admins can then use those attributes to select groups of nodes and target them with an action. For example, to list all nodes where Slurm is idle:

$ scyld-nodectl -s "attributes[_sched_state]==idle" ls
Nodes
  n0
  n1
  n2

One could similarly reboot all "down" nodes, or remotely execute a command to restart slurmd on any problematic nodes.
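
For instance, assuming the reboot and exec actions of scyld-nodectl are available in your installation, one could reboot every "down" node, or restart slurmd on those nodes:

$ scyld-nodectl -s "attributes[_sched_state]==down" reboot
$ scyld-nodectl -s "attributes[_sched_state]==down" exec systemctl restart slurmd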

Note

At present, only the Slurm scheduler is supported, though other batch schedulers will likely be supported in the future.

sched_watcher Deployment

sched_watcher runs as a daemon process on a machine that has network access to both the batch system controller, e.g. slurmctld, and to the ClusterWare head node. While one could run sched_watcher directly on a head node, it is a better practice to run it on a ClusterWare management node to fully isolate any network or CPU load that it might generate.

On the sched_watcher server, the command-line tools for the batch scheduler must be installed, and it is helpful to have the ClusterWare tools as well:

yum install clusterware-tools slurm-scyld-node

Copy the files from a head node:

scp -r /opt/scyld/clusterware/src/sched_watcher/* \
    mywatcher:/path/to/dest

On the sched_watcher server, prepare the systemd service:

cp sched_watcher.service /etc/systemd/system/.
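
After copying in a new unit file, tell systemd to re-scan its unit files:

systemctl daemon-reload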

Modify the sched_watcher.conf file as needed (see below).

Create an authentication token using the scyld-adminctl tool:

scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token

The default config file assumes /tmp/cw_bearer_token, but any path and filename can be used. It is also possible to generate this token elsewhere and scp it to the sched_watcher server.
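
For example, if the token was generated on a head node, it could be copied over using the same mywatcher host name from the earlier copy step:

scp /tmp/cw_bearer_token mywatcher:/tmp/cw_bearer_token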

Enable and start the service:

systemctl enable sched_watcher
systemctl start sched_watcher
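
To confirm that the daemon started cleanly, the standard systemd tools can be used:

systemctl status sched_watcher
journalctl -u sched_watcher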

Verify Data

Once the sched_watcher tool is running, it should quickly push data to ClusterWare and InfluxDB. On a ClusterWare head or management node, try:

scyld-nodectl ls -l

and verify that the _sched_state and other attributes are now populated.
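
To narrow the output to just the scheduler attributes, a simple filter works:

scyld-nodectl ls -l | grep _sched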

The same data should also be visible in the monitoring GUI.

Note

By default, the update cycle is every 30 seconds.

Config Settings

The config file contains a main section for sched_watcher settings, plus one section for each of the input and output modules: clusterware, slurm, influxdb, and archive.

For sched_watcher, the following options are supported:

  • token_file_path = /tmp/cw_bearer_token

    • Path to the authentication token file.

  • token_duration = 1h

    • Since the auth-token has a lifespan embedded within it, sched_watcher periodically re-reads the file, assuming that it will be refreshed prior to expiration. token_duration sets how often the file is re-read.

  • polling_interval = 30

    • Sets the time in seconds between updates sent to ClusterWare. A longer interval can reduce the load on the system, but the data will be less current.

  • sched_type = slurm

    • Sets the "type" of batch scheduler to retrieve data from. At present, only slurm is supported.

  • debug_level = 1

    • Sets the level of debugging output.

  • input = slurm

    • A comma-separated list of input modules that will be used. At present, slurm and clusterware are supported.

  • output = clusterware, influxdb

    • A comma-separated list of output modules that should be used. It can include one or more of: clusterware, influxdb, or archive. If admins do not wish to use InfluxDB/Telegraf monitoring, influxdb can be removed from this list.

For ClusterWare, only one option is currently supported:

  • base_url = http://parent-head-node/api/v1/

    • Sets the base URL for ClusterWare. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will always point at a valid head node.

For Slurm, only one option is currently supported:

  • base_path = /opt/scyld/slurm/bin

    • Sets the base path to all of the Slurm command-line tools. This is where sched_watcher will look for the sinfo and squeue tools.

For InfluxDB, the options are:

  • base_url = udp://parent-head-node:8094

    • Sets the base URL for the InfluxDB service. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will always point at a valid head node.

  • include_sched = true

    • Controls whether the _sched_state information is pushed to InfluxDB; disabling it reduces the data size.

  • include_extra = true

    • By default, sched_watcher will only push the _sched_state information. Setting this to true will also push the _sched_extra (one line) summary into InfluxDB. At this time, there is no support for sending the full JSON data into InfluxDB.

  • include_cw_data = false

    • Indicates if the data from ClusterWare should be written to the InfluxDB endpoint. For example, one might want to archive the ClusterWare data, but not send it to InfluxDB.

  • drop_cw_fields = *

    • A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.

  • keep_cw_fields = a.*

    • A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.

For the archive file output, the options are:

  • output_file = /tmp/cw_archive

    • The full path to the archive file.

  • rotate_interval = 1d

    • How often the archive file should be rotated (can use h for hours, d for days).

  • zip_prog = /usr/bin/gzip

    • If given, the rotated (old) archive files will be compressed with the given tool to reduce storage requirements.

  • drop_cw_fields = *

    • A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.

  • keep_cw_fields = a.*

    • A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.

Notes

  • The code currently runs as root so that it can read the config file in /opt/scyld/clusterware and also read the admin-created token file.

  • The sched_watcher tool cannot refresh the auth-token that it's been given, so there must be some other (out-of-band) process to refresh that bearer token before it expires.

    • e.g. One could run a weekly cron job that executes the scyld-adminctl token command; a sketch of such a cron entry appears after this list.

  • Restart/Reload sched_watcher after any changes to the config file.

    • To reload the config and auth-token files: systemctl reload sched_watcher

    • To completely restart the service: systemctl restart sched_watcher

  • The archived data is in a straightforward key=value format. Each line includes time=<Unix timestamp> and host=<hostname> followed by the data fields for that node.

    • The raw data is "flattened" into a single-level set of dotted keys, using abbreviated prefixes: clusterware becomes c, attributes becomes a, status becomes s, and hardware becomes h. For example, clusterware.attributes.foo would become c.a.foo=value.

  • There is simple filtering available with some output modules: keep_cw_fields and drop_cw_fields.

    • Both can be comma-separated lists of fields that should be included or excluded, and both can include a trailing wildcard ( * ).

    • Fields matching keep_cw_fields are retained in the output even if there is a matching drop_cw_fields key.

    • drop_cw_fields=* and keep_cw_fields=a.*

      • drop all fields except for a.* fields (attributes)

    • drop_cw_fields= (empty) and keep_cw_fields=*

      • retain all fields
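
As referenced above, the following is a sketch of a weekly cron entry that refreshes the bearer token, assuming it runs on a machine where scyld-adminctl is installed and authorized (the file name /etc/cron.d/sched_watcher_token is hypothetical):

# /etc/cron.d/sched_watcher_token: regenerate the token every Sunday at 03:00
0 3 * * 0 root scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token

If sched_watcher runs on a different machine, the cron job would also need to copy the refreshed token there (e.g. with scp).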

Example Config

[sched_watcher]
token_file_path = /tmp/cw_bearer_token
token_duration = 1h
polling_interval = 30
sched_type = slurm
debug_level = 1
# available inputs = clusterware, slurm
input = slurm
# available outputs = archive, clusterware, influxdb
output = clusterware

[clusterware]
base_url = http://parent-head-node/api/v1/

[slurm]
base_path = /opt/scyld/slurm/bin

[influxdb]
base_url = udp://parent-head-node:8094
include_sched = true
include_extra = false
include_cw_data = false
# drop most fields ...
drop_cw_fields = *
# but keep these ...
keep_cw_fields = a.*

[archive]
output_file = /tmp/cw_archive
rotate_interval = 1d
zip_prog = /usr/bin/gzip
# drop_cw_fields = *
keep_cw_fields = *