Appendix: Monitoring Scheduler Info¶
Important
This software is a TECHNOLOGY PREVIEW that is being rolled out with limited features and limited support. Customers are encouraged to contact Penguin with suggestions for improvements, or ways that the tool could be adapted to work better in their environments.
Clusterware provides a separate daemon process that reads data from one or more inputs and pushes that data into one or more endpoints. The supported inputs are the Slurm batch scheduler and ClusterWare itself. The supported outputs are ClusterWare, InfluxDB, or an archive file. By selecting appropriate inputs and outputs, one could read data from ClusterWare and write it into InfluxDB, or read from Slurm and write into ClusterWare.
The scheduler data will show up in the ClusterWare node attributes and can then be viewed with:
$ scyld-nodectl ls -l
Nodes
n0
attributes
_boot_config: DefaultBoot
_sched_state: idle
_sched_extra: Node is idle
_sched_full: { ... full JSON blob }
Loosely speaking, _sched_state is a one-word summary of the state of the node (allocated, idle, down); _sched_extra is a one-line summary, potentially giving basic info on why the node might be in that state (e.g. not responding to pings might lead to a "down" indicator); and _sched_full is a JSON dump of all the information the scheduler provided for that node.
With the data in the nodes' attributes, admins can then use those attributes to select groups of nodes and target them with an action. For example, to list all nodes where Slurm is idle:
$ scyld-nodectl -s "attributes[_sched_state]==idle" ls
Nodes
n0
n1
n2
One could similarly reboot all "down" nodes, or remotely execute a command to restart slurmd on any problematic nodes.
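For example, assuming the reboot and exec actions of scyld-nodectl are available in your installation:
$ scyld-nodectl -s "attributes[_sched_state]==down" reboot
$ scyld-nodectl -s "attributes[_sched_state]==down" exec systemctl restart slurmd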
Note
At present, only the Slurm scheduler is supported, though other batch schedulers will likely be supported in the future.
sched_watcher Deployment¶
sched_watcher runs as a daemon process on a machine that has network access to both the batch system controller (e.g. slurmctld) and to the ClusterWare head node.
While one could run sched_watcher directly on a head node, it is a better practice to run it on a ClusterWare management node to fully isolate any network or CPU load that it might generate.
For the sched_watcher server, the command-line tools for the batch scheduler must be installed, and it will be helpful to have the ClusterWare tools as well:
yum install clusterware-tools slurm-scyld-node
Copy the files from a head node:
scp -r /opt/scyld/clusterware/src/sched_watcher/* \
mywatcher:/path/to/dest
On the sched_watcher server, prepare the systemd service:
cp sched_watcher.service /etc/systemd/system/.
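After copying in a new unit file, systemd generally needs to re-read its configuration before the service can be enabled:
systemctl daemon-reload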
Modify the sched_watcher.conf file as needed (see below).
Create an authentication token using the scyld-adminctl tool:
scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token
The default config file assumes /tmp/cw_bearer_token, but any filename and path could be used. It is also possible to generate this token elsewhere and scp it to the sched_watcher server.
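For example, to copy a token generated on a head node (reusing the illustrative mywatcher hostname from the scp step above):
scp /tmp/cw_bearer_token mywatcher:/tmp/cw_bearer_token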
Enable and start the service:
systemctl enable sched_watcher
systemctl start sched_watcher
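To confirm that the daemon came up cleanly, the standard systemd status check can be used:
systemctl status sched_watcher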
Verify Data¶
Once the sched_watcher tool is running, it should quickly push data to ClusterWare and InfluxDB. On a ClusterWare head or management node, try:
scyld-nodectl ls -l
and verify that the _sched_state and other attributes are now populated.
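To focus on just the scheduler attributes, the output can be filtered with ordinary shell tools, e.g.:
scyld-nodectl ls -l | grep _sched_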
Similarly, one can look in the monitoring GUI and the same data should be visible there.
Note
By default, the update cycle is every 30 seconds.
Config Settings¶
The config file contains a main section of sched_watcher settings, plus one section for each of the input and output modules (clusterware, slurm, influxdb, and archive).
For sched_watcher, the following options are supported:
token_file_path = /tmp/cw_bearer_token
Path to the authentication token file.
token_duration = 1h
Since the auth-token will have some lifespan embedded within it, sched_watcher will periodically re-read the file, assuming that it will be refreshed prior to expiration. token_duration sets the time-frame for re-reading the file.
polling_interval = 30
Sets the time, in seconds, between updates sent to ClusterWare. A longer interval can reduce the load on the system, but the data will be more out-of-date.
sched_type = slurm
Sets the "type" of batch scheduler to retrieve data from. At present, only
slurm
is supported.
debug_level = 1
Enables debugging output.
input = slurm
A comma-separated list of input modules that will be used. At present, slurm and clusterware are supported.
output = clusterware, influxdb
A comma-separated list of output modules that should be used. It can include one or more of: clusterware, influxdb, or archive. If admins do not wish to use InfluxDB/Telegraf monitoring, influxdb can be removed from this list.
For ClusterWare, only one option is currently supported:
base_url = http://parent-head-node/api/v1/
Sets the base URL for ClusterWare. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will always point at a valid head node.
For Slurm, only one option is currently supported:
base_path = /opt/scyld/slurm/bin
Sets the base path to all of the Slurm command-line tools. This is where
sched_watcher
will look for thesinfo
andsqueue
tools.
For InfluxDB, the options are:
base_url = udp://parent-head-node:8094
Sets the base URL for the InfluxDB service. Best practice would be to run sched_watcher on a ClusterWare management node, so parent-head-node will be kept up-to-date and will point at a valid head node.
include_sched = true
Controls whether sched_watcher sends the _sched_state information; disabling it reduces the data size.
include_sched_extra = true
By default, sched_watcher will only push the _sched_state information. Setting this to true will also push the _sched_extra (one-line) summary into InfluxDB. At this time, there is no support for sending the full JSON data into InfluxDB.
include_cw_data = false
Indicates if the data from ClusterWare should be written to the InfluxDB endpoint. For example, one might want to archive the ClusterWare data, but not send it to InfluxDB.
drop_cw_fields = *
A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.
keep_cw_fields = a.*
A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.
For the archive file output, the options are:
output_file = /tmp/cw_archive
The full path to the archive file.
rotate_interval = 1d
How often the archive file should be rotated (can use h for hours, d for days).
zip_prog = /usr/bin/gzip
If given, the rotated (old) archive files will be compressed with the given tool to reduce storage requirements.
drop_cw_fields = *
A simple filter system to allow some ClusterWare fields to be dropped, keeping all the others. The * is a wildcard that matches any number of any character.
keep_cw_fields = a.*
A simple filter system that will keep certain ClusterWare fields even if they were otherwise selected by the drop_cw_fields filter. The * is a wildcard that matches any number of any character.
Notes¶
The code currently runs as root so that it can read the config file in /opt/scyld/clusterware and also read the admin-created token file.
The sched_watcher tool cannot refresh the auth-token that it's been given, so there must be some other (out-of-band) process to refresh that bearer token before it expires. For example, one could run a weekly cron job that executes the scyld-adminctl token command, as sketched below.
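A minimal sketch of such a job, assuming the token is generated on a head node and copied to an illustrative mywatcher host:
#!/bin/bash
# Hypothetical /etc/cron.weekly/cw-token-refresh:
# re-create the bearer token, then copy it to the sched_watcher server.
scyld-adminctl token --lifetime 10d --outfile /tmp/cw_bearer_token
scp /tmp/cw_bearer_token mywatcher:/tmp/cw_bearer_token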
Restart or reload sched_watcher after any changes to the config file.
To reload the config and auth-token files:
systemctl reload sched_watcher
To completely restart the service:
systemctl restart sched_watcher
The archived data is in a straightforward key=value format. Each line includes time=<Unix timestamp> and host=<hostname>, followed by the data fields for that node.
The raw data is "flattened" into a single-level set of dotted keys, so clusterware.attributes.foo would become c.a.foo=value. In the same way, attributes becomes a, status becomes s, and hardware becomes h; see the example line below.
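For instance, an archived line for node n0 might look like the following (timestamp and values are illustrative):
time=1700000000 host=n0 c.a._boot_config=DefaultBoot c.a._sched_state=idle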
There is simple filtering available with some output modules: keep_cw_fields and drop_cw_fields.
Both can be comma-separated lists of fields that should be included or excluded, and both can include a trailing wildcard (*).
Fields listed in keep_cw_fields are retained in the output even if there is a matching drop_cw_fields key.
drop_cw_fields=* and keep_cw_fields=a.* drop all fields except for the a.* fields (attributes).
drop_cw_fields= (empty) and keep_cw_fields=* retain all fields.
Example Config
[sched_watcher]
token_file_path = /tmp/cw_bearer_token
token_duration = 1h
polling_interval = 30
sched_type = slurm
debug_level = 1
# available inputs = clusterware, slurm
input = slurm
# available outputs = archive, clusterware, influxdb
output = clusterware
[clusterware]
base_url = http://parent-head-node/api/v1/
[slurm]
base_path = /opt/scyld/slurm/bin
[influxdb]
base_url = udp://parent-head-node:8094
include_sched = true
include_sched_extra = false
include_cw_data = false
# drop most fields ...
drop_cw_fields = *
# but keep these ...
keep_cw_fields = a.*
[archive]
output_file = /tmp/cw_archive
rotate_interval = 1d
zip_prog = /usr/bin/gzip
# drop_cw_fields = *
keep_cw_fields = *