Desired topology and active services
SLURM super quick start guide
The guide is available here.
On CentOS 8 Stream, we first install MUNGE, then install SLURM from the bzip2 tarball. The status of MUNGE can be probed with
munge -n | unmunge
Installing the SLURM package is done manually:
wget https://download.schedmd.com/slurm/slurm-23.11.3.tar.bz2
We then extract, configure, and build SLURM.
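A typical sequence looks like this (a sketch; the --prefix and --sysconfdir choices are assumptions and may differ from what was actually used):

```bash
# extract, configure and build the downloaded release
tar -xjf slurm-23.11.3.tar.bz2
cd slurm-23.11.3
./configure --prefix=/usr --sysconfdir=/etc/slurm
make -j"$(nproc)"
sudo make install
```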
As the SLURM user "slurm", we then need to create the directories for
- log files
- PID files
- state save files
and make them writable. We take inspiration from an existing slurm.conf file.
So the directories will be
Function | Directory |
---|---|
PID files | — |
SlurmdSpoolDir | /var/spool/slurmd |
StateSaveLocation | /var/spool/slurmctld |
Beware: slurm is not in the sudoers group (and it must not be!). If the directories already exist, we must chown them and chmod them so that they are writable by slurm. This can only be done by a sudoer.
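A minimal sketch of these steps, run by a sudoer and assuming the paths from the table above plus /var/log/slurm for the log files:

```bash
# create the directories and hand them over to the slurm user
sudo mkdir -p /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
sudo chmod 755 /var/spool/slurmd /var/spool/slurmctld /var/log/slurm
```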
In order to use the SLURM API, we must make the libraries known to the dynamic linker:
ldconfig -n <library-location>
In our case <library-location> is inside the downloaded (and built) package.
A handy alternative: RPM
We can use rpmbuild to build packages that install the binaries, configure the directories, and link the libraries directly. We do this as the slurm user.
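The build itself is a single command run directly on the downloaded tarball (as in the SchedMD quick-start; the tarball is assumed to be in the current directory):

```bash
# build binary RPMs from the release tarball
rpmbuild -ta slurm-23.11.3.tar.bz2
```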
This method reveals a missing dependency
error: Failed build dependencies:
mariadb-devel >= 5.0.0 is needed by slurm-23.11.3-1.el8.x86_64
that we can easily solve by
sudo yum install mariadb-devel
A warning in the rpmbuild output suggests running libtool --finish /lib64/security. We do so, installing libtool first.
We still have the problem that the SLURM commands are not available, and neither are the SLURM services.
The RPM files are located in
~/rpmbuild/RPMS
so we need to run
rpm -i *.rpm
inside the RPMS folder.
slurm.conf
We use the generator tool to create the slurm.conf file.
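For reference, the entries relevant to this setup look roughly as follows (a sketch combining the hostnames used here with the directories chosen above; the node hardware values are placeholders, not measured ones):

```
# /etc/slurm/slurm.conf (excerpt)
ClusterName=slurph
SlurmctldHost=slurph
SlurmUser=slurm
AuthType=auth/munge
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=runner01 CPUs=2 RealMemory=2000 State=UNKNOWN
PartitionName=main Nodes=runner01 Default=YES MaxTime=INFINITE State=UP
```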
We can now enable the daemons
systemctl enable slurmctld
systemctl enable slurmdbd
systemctl enable slurmd
and activate them
systemctl start slurmctld
systemctl start slurmdbd
systemctl start slurmd
The order is important!
We have a problem starting the slurmdbd service:
Condition: start condition failed at Sun 2024-02-11 17:10:08 CET; 8min ago
└─ ConditionPathExists=/etc/slurm/slurmdbd.conf was not met
This is due to
slurmdbd: error: s_p_parse_file: unable to read "/etc/slurm/slurmdbd.conf": Permission denied
We also get the error
Feb 11 17:26:11 slurph.novalocal slurmdbd[7414]: slurmdbd: error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/lib/mys>
It seems we have problems with the MariaDB installation. Let's run
sudo dnf install mariadb-server
Finally, we are able to start the service:
systemctl start mariadb.service
Who is running SLURM?
Running
slurmdbd -Dvvv
we get errors like
slurmdbd.conf owned by 1000 not SlurmUser(1003)
It seems we have problems reading from /var/run/<PID files>.
One can create a folder inside /var/run and chown it, but since /var/run is a tmpfs it will be gone after a system reboot. The solution seems to be to run a mkdir + chown script at startup; this is still to be implemented.
The script (/home/centos/startup_pid.sh) looks like:
#! /bin/bash
mkdir /var/run/slurm
chown slurm /var/run/slurm
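One way to implement the "run at startup" part (a sketch; the unit name is hypothetical) is a oneshot systemd unit ordered before the SLURM daemons:

```
# /etc/systemd/system/slurm-rundir.service (hypothetical name)
[Unit]
Description=Create /var/run/slurm for the SLURM daemons
Before=slurmdbd.service slurmctld.service slurmd.service

[Service]
Type=oneshot
ExecStart=/home/centos/startup_pid.sh

[Install]
WantedBy=multi-user.target
```

The script must be executable (chmod +x) and the unit enabled with systemctl enable slurm-rundir.service. Alternatively, systemd's tmpfiles.d mechanism can recreate directories under /var/run at every boot.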
We still have problems connecting to the database. This is also a problem at the startup of slurmd.service. Obviously, we lack a DB configuration.
Database configuration
We follow the commands from the previous link.
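Concretely, this boils down to creating a database user for SLURM and granting it a database (a sketch; the password is a placeholder, and slurm_acct_db is the conventional database name, which must match slurmdbd.conf):

```bash
# create the accounting DB user in MariaDB (add -p if root has a password)
sudo mysql -u root <<'SQL'
CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY 'CHANGE_ME';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;
SQL
```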
Unable to determine this slurmd’s nodename
This problem is related to the impossibility of resolving runner01. Before running the slurmd daemon, we need to set up the compute node(s).
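A quick sanity check on the compute node, assuming the name mismatch is the cause: the name slurmd derives from the hostname must match a NodeName entry in slurm.conf, and slurmd -C prints the detected hardware in slurm.conf syntax.

```bash
hostname -s   # must match a NodeName entry in slurm.conf
slurmd -C     # prints NodeName=... CPUs=... RealMemory=... ready to copy into slurm.conf
```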
Now our slurmctld and slurmdbd are running.
Configuration of the compute node(s)
Follow the instructions from YouTube.
After the configuration, and with slurmd running on runner01, the slurmctld.log file on the slurPh machine indicates a problem:
[2024-02-12T17:57:04.216] error: Node runner01 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
and, most importantly, an authentication problem from MUNGE:
[2024-02-12T17:57:18.718] error: Munge decode failed: Unauthorized credential for client UID=1003 GID=1003
We should check
munge -n | unmunge
Both machines return a success code.
runner01:
STATUS: Success (0)
ENCODE_HOST: runner01.novalocal (10.64.37.114)
ENCODE_TIME: 2024-02-12 18:02:47 +0100 (1707757367)
DECODE_TIME: 2024-02-12 18:02:47 +0100 (1707757367)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: centos (1000)
GID: centos (1000)
LENGTH: 0
slurPh:
STATUS: Success (0)
ENCODE_HOST: runner01.novalocal (10.64.37.114)
ENCODE_TIME: 2024-02-12 18:02:47 +0100 (1707757367)
DECODE_TIME: 2024-02-12 18:02:47 +0100 (1707757367)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: centos (1000)
GID: centos (1000)
LENGTH: 0
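Note that the checks above only exercise local encode/decode; the standard cross-node test from the MUNGE documentation (assuming direct SSH access between the two VMs) is:

```bash
# encode on one machine, decode on the other
munge -n | ssh slurph unmunge
munge -n | ssh runner01 unmunge
```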
The problem seems to be that the UID and GID of the slurm user are not the same on the two machines.
runner01:
File: /etc/slurm
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fc01h/64513d Inode: 30103776 Links: 2
Access: (0777/drwxrwxrwx) Uid: ( 1001/ slurm) Gid: ( 1001/ slurm)
slurPh:
File: /etc/slurm
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: fc01h/64513d Inode: 25439750 Links: 2
Access: (0755/drwxr-xr-x) Uid: ( 1003/ slurm) Gid: ( 1003/ slurm)
The id munge command returns, strangely:
runner01:
uid=991(munge) gid=988(munge) groups=988(munge)
slurPh:
uid=991(munge) gid=988(munge) groups=988(munge)
We need to set the right UID, and change the permissions of /etc/munge/munge.key, on both machines. We run groupmod -g 1099 munge and usermod -u 1004 -g 1099 munge. In order not to conflict with other users, we list all users with cut -d: -f1,3 /etc/passwd; to list all groups with their GID: getent group.
With the changed UID/GID, the munge daemon fails to start. In the log file we also notice
2024-02-12 17:57:18 +0100 Info: Unauthorized credential for client UID=1003 GID=1003
i.e. previous attempts by the slurm user on runner01 (?) to authenticate. The change of user created ownership issues on the log files and on the PRNG seed folder. We fix them with
sudo chown munge:munge -R /etc/munge
sudo chown munge:munge -R /var/lib/munge
sudo chown munge:munge -R /var/log/munge
OK, we still have "unable to determine this node name".
scontrol ping
on runner01 still gives
Slurmctld(primary) at slurph is DOWN
There is more: every execution of the ping on runner01 triggers the writing of
2024-02-12 18:34:26 +0100 Info: Unauthorized credential for client UID=1003 GID=1003
in /var/log/munge/munged.log!
The error bursts for the corresponding event in /var/log/slurm/slurmctld.log look like
[2024-02-12T18:44:00.281] error: Munge decode failed: Unauthorized credential for client UID=1003 GID=1003
[2024-02-12T18:44:00.281] auth/munge: _print_cred: ENCODED: Thu Jan 01 01:00:00 1970
[2024-02-12T18:44:00.281] auth/munge: _print_cred: DECODED: Thu Jan 01 01:00:00 1970
[2024-02-12T18:44:00.281] error: slurm_unpack_received_msg: [[runner01]:50994] auth_g_verify: REQUEST_PING has authentication error: Unspecified error
[2024-02-12T18:44:00.281] error: slurm_unpack_received_msg: [[runner01]:50994] Protocol authentication error
[2024-02-12T18:44:00.291] error: slurm_receive_msg [10.64.37.114:50994]: Protocol authentication error
Note that "unauthorized credential" is very different from "invalid credential"!
Change to Ubuntu OS
Due to the dubious status of CentOS in relation to RHEL, we switch to Ubuntu. This should make a clean installation of SLURM seamless.
The connection aliases for the new instances are, in fish:
alias runner01="ssh -L2082:10.64.37.221:22 florenzi@gate.cloudveneto.it"
alias slurPh="ssh -L2081:10.64.37.17:22 florenzi@gate.cloudveneto.it"
alias connect-slurPh="ssh -p 2081 -i ~/.ssh/certificates/slurPh-key.pem ubuntu@localhost"
alias connect-runner01="ssh -p 2082 -i ~/.ssh/certificates/slurPh-key.pem ubuntu@localhost"
slurmdbd.conf
We need to insert the credentials for the MySQL database here.
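The storage-related part of the file looks roughly like this (a sketch; the values must match the MariaDB user created earlier, and the file must be readable only by SlurmUser):

```
# /etc/slurm/slurmdbd.conf (excerpt) -- chown slurm: and chmod 600
DbdHost=localhost
SlurmUser=slurm
AuthType=auth/munge
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=CHANGE_ME
StorageLoc=slurm_acct_db
```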
The MySQL database is then managed through SLURM using sacctmgr. For example, to add a new user:
sacctmgr create user name=<USERNAME> account=<GROUP>
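For the association to work, the cluster and the account usually have to exist first (a sketch; the names are examples):

```bash
sacctmgr add cluster slurph
sacctmgr add account physics Description="physics users" Organization="example"
```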
CGroups
When launching a job, the node is immediately set to the drain state, and the log file shows
[2024-02-20T16:06:16.263] error: slurmd error running JobId=50 on node(s)=runner01: Plugin initialization failed
This error is related to the cgroup plugins. On the runner, in /var/log/slurmd.log we have:
[2024-02-20T16:06:16.269] [50.batch] error: unable to mount memory cgroup namespace: Device or resource busy
[2024-02-20T16:06:16.269] [50.batch] error: unable to create memory cgroup namespace
[2024-02-20T16:06:16.269] [50.batch] error: Couldn't load specified plugin name for jobacct_gather/cgroup: Plugin init() callback failed
[2024-02-20T16:06:16.269] [50.batch] error: cannot create jobacct_gather context for jobacct_gather/cgroup
[2024-02-20T16:06:16.273] [50.batch] error: unable to mount cpuset cgroup namespace: Device or resource busy
[2024-02-20T16:06:16.273] [50.batch] error: unable to create cpuset cgroup namespace
[2024-02-20T16:06:16.273] [50.batch] error: unable to mount memory cgroup namespace: Device or resource busy
[2024-02-20T16:06:16.273] [50.batch] error: unable to create memory cgroup namespace
[2024-02-20T16:06:16.273] [50.batch] error: failure enabling memory enforcement: Unspecified error
[2024-02-20T16:06:16.273] [50.batch] error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
[2024-02-20T16:06:16.273] [50.batch] error: cannot create task context for task/cgroup
[2024-02-20T16:06:16.273] [50.batch] error: job_manager: exiting abnormally: Plugin initialization failed
[2024-02-20T16:06:16.273] [50.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:1011 status:0
[2024-02-20T16:06:16.275] [50.batch] done with job
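For context, these plugins are enabled in slurm.conf and tuned in /etc/slurm/cgroup.conf; a minimal sketch (an assumption, not the configuration actually in use here) is shown below. If the node boots with the unified cgroup v2 hierarchy (the default on recent Ubuntu releases), the cgroup/v1 code cannot mount the legacy controllers, which would be consistent with the "Device or resource busy" messages above.

```
# /etc/slurm/slurm.conf (cgroup-related excerpt)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# /etc/slurm/cgroup.conf
CgroupPlugin=autodetect
ConstrainCores=yes
ConstrainRAMSpace=yes
```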