Managing High Availability

TrinityX High Availability (HA) keeps the cluster's central services available when a controller fails or is taken down for maintenance. It is built on Corosync and Pacemaker, which are operated through the pcs command. A typical HA setup has two controllers: one is active and runs the single-instance services and the shared (virtual) IP, while the other is a passive standby ready to take over.

For the design and the shared-storage side of HA, see Shared filesystem disks and the HA architecture in the installation manual. This page covers day-to-day operation.

What Pacemaker manages

The services that can only run on one controller at a time are managed by Pacemaker as cluster resources and are placed on the active controller. In a TrinityX cluster these are typically:

the shared (virtual) IP address that clients and nodes talk to;
the shared filesystems (for example the home and shared directories, and the HA state directory);
the database (MariaDB);
the directory service (OpenLDAP / 389 Directory Server);
luna2-master.service, which marks the controller it runs on as the Luna image master (see below).

Because these are Pacemaker resources, you do not start or stop them by hand; Pacemaker does. Starting the cluster brings up Corosync and Pacemaker, which elect an active controller and start the managed services there. Stopping a controller, or putting it into standby, makes Pacemaker move the services to the other controller.

TrinityX HA resources

The controller hostnames differ per installation, but the TrinityX resource names are stable — they derive from the shared-disk name (trinityx by default). Pacemaker arranges them in ordered groups so they start in sequence and move together:

Trinity — the floating (virtual) IP trinity-ip (and trinity-ip-external when an external floating IP is configured).
Trinity-zfs-trinityx — the shared storage: DRBD-trinityx (run as the promotable DRBD-trinityx-clone), wait-for-device-trinityx, trinity-zfs-trinityx and zfs-ready-trinityx.
Trinity-stack — the single-instance services: trinity-stack-ready, nfs-server, the directory service (openldap for OpenLDAP or ds389 for 389 Directory Server), mariadb, slurmdbd, slurmctld and luna2-master.

DRBD-trinityx-clone is a promotable resource: the controller shown as Promoted in pcs status is the active DRBD side, the Unpromoted controller is the passive side. The exact set of resources depends on the chosen storage carrier (DRBD, iSCSI or direct) and which services are installed; pcs status always shows the live list.
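As a minimal sketch of reading the DRBD role from a script, the helper below scans pcs status text for the Promoted line. The name promoted_node is hypothetical, and the "* Promoted: [ <node> ]" line layout is an assumption about the output of recent Pacemaker releases; verify it against your own pcs status output before relying on it.

```shell
# promoted_node: print the node name(s) listed on the first "Promoted:"
# line of `pcs status` text read from stdin.
# Hypothetical helper; the "* Promoted: [ node ]" layout is an assumption.
# The match is case-sensitive, so "Unpromoted:" lines are not picked up.
promoted_node() {
    awk '
        /Promoted:/ {
            start = index($0, "[")
            end = index($0, "]")
            if (start > 0 && end > start) {
                # Keep only the text between the square brackets.
                nodes = substr($0, start + 1, end - start - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", nodes)
                print nodes
                exit
            }
        }
    '
}
```

It would typically be fed from the live cluster, for example: pcs status | promoted_node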

Common pcs commands

Run these as root from any controller in the cluster.

Show the overall state — which controller is active, which resources run where, and any failures:

# pcs status

Start or stop the whole cluster (Corosync/Pacemaker) on every node, or just the local one:

# pcs cluster start --all
# pcs cluster stop --all
# pcs cluster start          # this node only

Put a controller into standby to move all of its resources to the other controller (for maintenance), and bring it back afterwards:

# pcs node standby controller1
# pcs node unstandby controller1

List resources and where they are running:

# pcs resource status

Move a single resource to a specific controller, then clear the temporary location constraint that move creates:

# pcs resource move <resource> controller2
# pcs resource clear <resource>

Clear failed actions so Pacemaker re-evaluates a resource (or all resources) after you have fixed the underlying problem:

# pcs resource cleanup
# pcs resource cleanup <resource>

Inspect a single resource, follow the cluster live during a failover, list recorded failures, or dump the full configuration:

# pcs resource config <resource>
# watch -n 2 pcs status
# pcs resource failcount show --full
# pcs config

Use pcs status to find the actual resource names in your cluster, then refer to them in the commands above.

Analysing a failed HA state

When the HA state looks wrong — a failed action, a resource on the wrong controller, a controller offline, or no clear Promoted/Unpromoted DRBD state — work top-down:

pcs status — shows the active controller, the floating IP, the DRBD role, any standby/offline controllers and recorded failures.
pcs resource config <resource> — inspect the failing resource's agent, attributes and operation settings.
pcs resource failcount show --full — use this when a resource keeps refusing to restart or keeps moving unexpectedly.
journalctl -u <service> -b — Pacemaker reports that something failed; the service journal tells you why. For example journalctl -u pacemaker -u corosync -b, journalctl -u slurmctld -b or journalctl -u mariadb -b.
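The top-down order above can be captured in one report. This is only a sketch: ha_triage is a hypothetical helper name, both arguments are placeholders for the real resource and unit names in your cluster, and the 50-line journal limit is an arbitrary choice.

```shell
# ha_triage: collect the top-down diagnostics for one resource and its
# systemd unit into a single report on stdout.
# Usage: ha_triage <resource> <service>
# Hypothetical helper; take the real names from `pcs status`.
ha_triage() (
    resource=$1
    service=$2
    if [ -z "$resource" ] || [ -z "$service" ]; then
        echo "usage: ha_triage <resource> <service>" >&2
        exit 2
    fi
    # Run the steps in the same order as the list; keep going when one
    # step fails so the report stays as complete as possible.
    echo "== pcs status =="
    pcs status || echo "(pcs status failed)"
    echo "== pcs resource config $resource =="
    pcs resource config "$resource" || echo "(pcs resource config failed)"
    echo "== pcs resource failcount show $resource --full =="
    pcs resource failcount show "$resource" --full || echo "(pcs resource failcount failed)"
    echo "== journalctl -u $service -b (last 50 lines) =="
    journalctl -u "$service" -b --no-pager -n 50 || echo "(journalctl failed)"
)
```

For example, ha_triage mariadb mariadb > /root/ha-triage.txt keeps a copy of the state before anything is cleaned up (the output path is illustrative).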

Testing failover

To verify that TrinityX can move the HA stack from the active controller to its peer, put the active controller into standby and watch the resources migrate:

# pcs status
# pcs node standby <active-controller>
# watch -n 2 pcs status

The Trinity, Trinity-zfs-trinityx and Trinity-stack groups should move to the peer controller. Return the controller to service afterwards:

# pcs node unstandby <active-controller>
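Instead of watching the screen, a script can poll until the controller in standby no longer runs anything. The sketch below is an illustration under assumptions: the helper names and the POLL_INTERVAL variable are hypothetical, and it relies on resource lines in pcs resource status ending in "Started <node>". Promotable clones report their role on separate lines and are not counted here.

```shell
# resources_started_on: count lines in `pcs resource status` text (stdin)
# that report "Started <node>". The wording is an assumed output format.
resources_started_on() {
    awk -v node="$1" '
        { for (i = 1; i < NF; i++) if ($i == "Started" && $(i + 1) == node) count++ }
        END { print count + 0 }
    '
}

# wait_until_drained: poll until no resource is reported as Started on
# <node>, or give up after <max_polls> attempts (default 60).
# Usage: wait_until_drained <node> [max_polls]
# POLL_INTERVAL (seconds, default 2) is an illustrative knob.
wait_until_drained() (
    node=$1
    max_polls=${2:-60}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        # Stop when pcs itself fails, so an empty answer is never
        # mistaken for "nothing left on the node".
        status=$(pcs resource status) || { echo "pcs resource status failed" >&2; exit 2; }
        remaining=$(printf '%s\n' "$status" | resources_started_on "$node")
        if [ "$remaining" -eq 0 ]; then
            echo "no resources left on $node"
            exit 0
        fi
        echo "$remaining resource(s) still on $node"
        polls=$((polls + 1))
        sleep "${POLL_INTERVAL:-2}"
    done
    echo "gave up waiting for $node to drain" >&2
    exit 1
)
```

A typical use is to run pcs node standby <active-controller> first, then wait_until_drained <active-controller>, and only then start the maintenance work.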

Recovering a failed service

When a single service has failed but the controller pair is still up, check the status, the resource definition, its failcount and the service journal, fix the underlying problem, then clear the failure so Pacemaker retries it:

# pcs status
# pcs resource config <resource>
# pcs resource failcount show <resource> --full
# journalctl -u <service> -b
#   ... fix the real issue ...
# pcs resource cleanup <resource>
# pcs status

Tuning operation timeouts

Pacemaker treats an operation as failed when it does not complete within its configured timeout. To change a timeout, remove the current operation entry and add it back with the new value — use the exact interval and timeout shown by pcs resource config <resource>:

# pcs resource op remove <resource> start interval=0s timeout=<old-timeout>
# pcs resource op add    <resource> start interval=0s timeout=<new-timeout>

The same applies to the stop and monitor operations. Increasing a timeout lets a slow resource finish cleanly; it does not fix the underlying cause of a real failure.
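The remove-then-add sequence can be wrapped so that a failed add does not leave the resource without the operation entry. This is a sketch, not a TrinityX tool: set_op_timeout is a hypothetical name, and the interval and old timeout must be passed exactly as pcs resource config <resource> prints them.

```shell
# set_op_timeout: replace one operation entry with a new timeout.
# Usage: set_op_timeout <resource> <operation> <interval> <old-timeout> <new-timeout>
# Hypothetical wrapper around `pcs resource op remove` and `op add`.
set_op_timeout() (
    if [ "$#" -ne 5 ]; then
        echo "usage: set_op_timeout <resource> <operation> <interval> <old-timeout> <new-timeout>" >&2
        exit 2
    fi
    resource=$1
    op=$2
    interval=$3
    old=$4
    new=$5
    # Remove the current entry first; stop here if that fails so nothing
    # else is changed.
    pcs resource op remove "$resource" "$op" "interval=$interval" "timeout=$old" || exit 1
    # Add the entry back with the new value; on failure, try to restore
    # the old value so the operation entry does not disappear.
    if ! pcs resource op add "$resource" "$op" "interval=$interval" "timeout=$new"; then
        echo "add failed, restoring timeout=$old" >&2
        pcs resource op add "$resource" "$op" "interval=$interval" "timeout=$old"
        exit 1
    fi
)
```

For example, set_op_timeout mariadb start 0s 60s 120s would raise the start timeout of the mariadb resource; the values 60s and 120s are made up for illustration.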

What not to touch in production

Do not run systemctl enable on services that Pacemaker manages; let Pacemaker start and stop them.
Do not casually disable DRBD-trinityx-clone, trinity-zfs-trinityx or trinity-ip while troubleshooting.
Do not put both controllers into standby at the same time.
Do not clear failures before you have checked the status, the failcounts and the logs.
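The first rule can be checked mechanically. The sketch below lists which of the given units systemd has enabled; enabled_managed_units is a hypothetical helper, and the unit names you pass are an assumption to confirm against your own installation.

```shell
# enabled_managed_units: print every unit from the argument list that
# systemd reports as "enabled". Pacemaker-managed services should not
# show up in the output.
# Usage: enabled_managed_units <unit>...
enabled_managed_units() (
    for unit in "$@"; do
        # is-enabled prints the enablement state and returns non-zero for
        # anything that is not enabled; only the printed state is used.
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        if [ "$state" = "enabled" ]; then
            echo "$unit"
        fi
    done
)
```

For example: enabled_managed_units mariadb slurmdbd slurmctld nfs-server luna2-master. Empty output means none of the listed units is enabled in systemd.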

The role of lmaster and the Luna master

The Luna daemon (luna2-daemon) runs on every controller, active and passive, and serves its API on all of them. The daemon is agnostic about which controller is the primary one — electing the active controller is Pacemaker's job, not Luna's.

Luna only needs one piece of HA information: which controller is the master, in the sense of being the source of truth for images. When images are synchronized between controllers, the master's images are the authoritative copy. This designation says nothing about which controller is "primary" for services; it only tells Luna whose images to trust when syncing.

That designation is handled by the lmaster utility:

# lmaster -w        # show which controller is currently the master
# lmaster -a        # show the HA values of all controllers
# lmaster -s        # set the controller this runs on as the master

lmaster -a reports, per controller, the enabled, master, insync, syncimages and overrule flags, which is the quickest way to confirm the HA picture from Luna's side.

You normally do not run lmaster -s by hand. Pacemaker manages luna2-master.service, whose only job is to run lmaster -s on the controller it starts on. So when Pacemaker makes a controller active, that controller automatically becomes Luna's image master — keeping the source of truth for images aligned with the active controller, while the Luna daemon itself stays indifferent to who that is.

Changing a local DRBD disk or target

When a local DRBD disk needs to be relocated, for example because it was wrongly assigned during installation or because a different disk is preferred, follow the steps below. Note that these operations are, in principle, destructive to the data on the targeted disk. Wipe the new target disk first so that no leftover signatures remain.
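One way to wipe the new target disk is wipefs from util-linux; that choice is an assumption, since this page does not name a tool. The hypothetical helper below only lists signatures by default and erases them when explicitly asked to.

```shell
# wipe_target_disk: list the signatures on <device>, and erase them only
# when the second argument is the literal word "erase".
# Usage: wipe_target_disk <device> [erase]
# Hypothetical helper; assumes wipefs (util-linux) is the wiping tool.
wipe_target_disk() (
    device=$1
    confirm=$2
    if [ -z "$device" ]; then
        echo "usage: wipe_target_disk <device> [erase]" >&2
        exit 2
    fi
    # Without options wipefs only lists signatures; nothing is changed.
    wipefs "$device" || exit 1
    if [ "$confirm" = "erase" ]; then
        # --all removes every signature wipefs finds: this is destructive.
        wipefs --all "$device"
    else
        echo "dry run: pass erase as second argument to wipe $device"
    fi
)
```

Run it once without the second argument to double-check that the device is the intended one, then again with erase.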

On the PASSIVE controller:

# pcs node standby <passive-controller>
# vi /etc/drbd.d/trinity.res      # change the disk setting to match the desired target disk
# systemctl start drbd
# drbdadm disconnect trinityx
# drbdadm detach trinityx
# drbdadm create-md trinityx
# drbdadm adjust trinityx
# drbdadm -- --discard-my-data connect trinityx

The new disk should now be re-initialized; check its state with drbdadm status.

The next steps are required to let Pacemaker control the DRBD disk again:

# systemctl stop drbd
# pcs node unstandby <passive-controller>
# pcs resource cleanup

The DRBD target should continue syncing with the primary; progress can be observed with drbdadm status.
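To check the result from a script rather than by eye, the local disk state can be extracted from the status text. This sketch assumes the DRBD 9 style output of drbdadm status, where the local state appears as a "disk:<state>" token and peers use "peer-disk:"; drbd_local_disk_state is a hypothetical helper name.

```shell
# drbd_local_disk_state: print the local disk state from `drbdadm status
# <resource>` text read from stdin; returns 1 when no state is found.
# Assumes the first "disk:<state>" token belongs to the local node.
drbd_local_disk_state() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                # "peer-disk:..." does not match because of the ^ anchor.
                if ($i ~ /^disk:/) {
                    sub(/^disk:/, "", $i)
                    print $i
                    found = 1
                    exit
                }
            }
        }
        END { if (!found) exit 1 }
    '
}
```

For example, drbdadm status trinityx | drbd_local_disk_state is expected to print Inconsistent while the resync is running and UpToDate once it has finished.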