Hotfixes / Known issues

Note

This page lists known issues per release and the manual workaround to apply on a running system, until the permanent fix lands in an update release.

For general installation problems see Installation troubleshooting; for operational problems after installation see Operational troubleshooting. New features and resolved bugs per release are listed in the Release Notes.

16 / 15.3 — DS389 runtime directories missing on a rebooted standby controller

Phase: operational / HA failover

Applies to TrinityX 15.3 and 16 on an HA controller pair using the ds389 LDAP backend. Resolved permanently in TRIX-1915.

After a standby controller reboots, promoting the Trinity stack onto it leaves ds389 stuck, and every resource ordered behind it (mariadb, slurmdbd, slurmctld, luna2-master, alertx-history) stays stopped:

# pcs status — ds389 on the rebooted standby
ds389  (systemd:dirsrv@local):  Stopped
ds389 start on <standby> could not be executed (Error: systemd start job for dirsrv@local.service failed with result 'failed')

# ns-slapd journal on that node
EMERG - main - Unable to access nsslapd-rundir: No such file or directory
EMERG - main - Ensure that user "dirsrv" has read and write permissions on /run/dirsrv

Cause: On the active controller, dscreate drops /etc/tmpfiles.d/dirsrv-local.conf, so systemd-tmpfiles recreates /run/dirsrv and /run/lock/dirsrv/slapd-local on every boot. On the standby, dscreate never runs, and the directories were only created once at install. Because /run is tmpfs, a reboot of the standby removes them and nothing recreates them — so the next failover of ds389 onto that node fails. The active controller is unaffected.

FIX: Install the same drop-in by hand on each standby controller. No reboot and no Ansible run are required, and the active controller needs no change.

# on each standby controller -- install the drop-in dscreate puts on the active
cat > /etc/tmpfiles.d/dirsrv-local.conf <<'EOF'
d /run/dirsrv 0770 dirsrv dirsrv
d /run/lock/dirsrv/ 0770 dirsrv dirsrv
d /run/lock/dirsrv/slapd-local 0770 dirsrv dirsrv
EOF

# materialise the directories now (no reboot needed)
systemd-tmpfiles --create /etc/tmpfiles.d/dirsrv-local.conf

If ds389 has already failed over to a standby and is stuck, clear the failed resource so pacemaker retries the start now that the runtime directory exists:

# from any cluster node
pcs resource cleanup ds389

To confirm, ls -ld /run/dirsrv /run/lock/dirsrv/slapd-local should show both owned by dirsrv:dirsrv. Thanks to the drop-in they now reappear automatically after any future reboot of the standby. The drop-in only declares tmpfs runtime directories and systemd-tmpfiles --create is safe to re-run; nothing here touches the LDAP database, DRBD or ZFS.
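
The ownership check above can also be scripted, for example to run it on several standby controllers in one pass. The following is only a minimal sketch: the function name check_runtime_dir is hypothetical and not part of TrinityX, and it assumes GNU stat (coreutils), as shipped on Rocky/Alma.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TrinityX): succeeds only if $1 is an
# existing directory owned by user $2 and group $3. Assumes GNU stat.
check_runtime_dir() {
    dir=$1
    want_user=$2
    want_group=$3
    # A missing directory is exactly the failure mode described above.
    [ -d "$dir" ] || { echo "MISSING  $dir"; return 1; }
    owner=$(stat -c '%U:%G' "$dir")
    if [ "$owner" = "$want_user:$want_group" ]; then
        echo "OK       $dir ($owner)"
    else
        echo "WRONG    $dir (owner $owner, expected $want_user:$want_group)"
        return 1
    fi
}

# Usage on a standby controller:
#   check_runtime_dir /run/dirsrv dirsrv dirsrv
#   check_runtime_dir /run/lock/dirsrv/slapd-local dirsrv dirsrv
```

A non-zero return for either directory suggests that the drop-in is missing or that systemd-tmpfiles --create has not yet been run on that node.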


15.3 — In HA setup, building of compute-default.yml fails: "This playbook should only be run on the active controller"

Phase: install-time

Applies to TrinityX 15.3. Resolved in 15.3u1.

TASK [gathering Facts] *******************************************************************************
ok: [controller1]
TASK [failed] ****************************************************************************************
fatal: [controller1]: FAILED! => {"changed": false, "msg": "This playbook should only be run on the active controller"}

FIX: If you are certain that the playbook is being run from the active controller, rerun the installation with -e primary=true.

Example:

# ansible-playbook compute-default.yml -e primary=true


Rocky/Alma 9.5 — DRBD kmod install error during HA shared-disk setup

Phase: install-time / HA shared disk

Applies to TrinityX HA installs on Rocky/Alma 9.5. The workaround is needed only while elrepo's kmod-drbd9x requires a newer kernel than the one available for 9.5; the issue disappears once the elrepo package and the running kernel are aligned again.

During the shared-disk part of the HA setup, the installation fails with the following error:

Error: 
 Problem: cannot install the best candidate for the job
  - nothing provides kernel >= 5.14.0-570.12.1.el9_6 needed by kmod-drbd9x-9.2.13-5.el9_6.elrepo.x86_64 from elrepo
  - ....
  - ...

Cause: The elrepo repository is running ahead of the Rocky/Alma 9.5 kernel versions. kmod-drbd9x-9.2.13 requires a later kernel (an el9_6 build, as the error shows) than the one installed.

FIX: Install kmod-drbd9x-9.1.23 manually, then restart the TrinityX installation:

# yum -y install kmod-drbd9x-9.1.23
# ansible-playbook controller.yml
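
To judge whether the workaround is still needed on a given controller, the running kernel can be compared with the kernel version named in the dnf error. The sketch below is only an illustration: kernel_at_least is a hypothetical helper name, and it uses GNU sort -V as an approximation of RPM version ordering, which may differ in unusual edge cases.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if kernel version $2 is at least version $1.
# Approximates RPM ordering with GNU `sort -V`; not an exact rpm comparison.
kernel_at_least() {
    required=$1
    running=$2
    # The smaller of the two versions sorts first; the requirement is met
    # when that smaller (or equal) version is the required one.
    lowest=$(printf '%s\n%s\n' "$required" "$running" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Usage: strip the architecture suffix from `uname -r` before comparing.
#   running=$(uname -r | sed 's/\.x86_64$//')
#   if kernel_at_least 5.14.0-570.12.1.el9_6 "$running"; then
#       echo "kernel is new enough for the kmod-drbd9x named in the error"
#   else
#       echo "kernel is older than required; the workaround still applies"
#   fi
```

The required version in the usage comment is the one quoted in the error above; substitute whatever version your own dnf output reports.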

See also: Installation troubleshooting · Operational troubleshooting · Release Notes