Troubleshooting

Basic diagnostics

TrinityX comes with a diagnosis tool which will check the required services.

# trinity_diagnosis
Trinity Core
        chronyd: active (running) since Tue 2023-10-17 20:45:29 CEST; 17h ago
        named: active (running) since Tue 2023-10-17 20:55:07 CEST; 17h ago
        dhcpd: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago
        mariadb: active (running) since Tue 2023-10-17 20:46:20 CEST; 17h ago
        nfs-server: active (exited) since Tue 2023-10-17 20:44:38 CEST; 17h ago
        nginx: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago


Luna
        luna2-daemon: active (running) since Tue 2023-10-17 20:49:51 CEST; 17h ago
        aria2c: active (running) since Wed 2023-10-18 13:50:51 CEST; 13min ago


LDAP
        slapd: active (running) since Tue 2023-10-17 20:45:31 CEST; 17h ago
        sssd: active (running) since Tue 2023-10-17 20:46:17 CEST; 17h ago


Slurm
        slurmctld: active (running) since Tue 2023-10-17 20:55:13 CEST; 17h ago


Monitoring core
        influxdb: active (running) since Tue 2023-10-17 20:54:32 CEST; 17h ago
        telegraf: active (running) since Tue 2023-10-17 20:55:17 CEST; 17h ago
        grafana-server: active (running) since Tue 2023-10-17 20:55:38 CEST; 17h ago
        sensu-server: active (running) since Tue 2023-10-17 20:55:31 CEST; 17h ago
        sensu-api: active (running) since Tue 2023-10-17 20:55:32 CEST; 17h ago
        rabbitmq-server: active (running) since Wed 2023-10-18 13:31:14 CEST; 33min ago


Trinity OOD
        httpd: active (running) since Tue 2023-10-17 20:55:16 CEST; 17h ago

Other troubleshooting utilities

Bundled with TrinityX comes a graphical diagnostic application that helps troubleshooting Infiniband issues. It show the topology and highlights troubled links, including the host and port where the problems are seen.

Centralized syslog

Syslog is configured to collect all the logs from the services on the controller(s) and the nodes. Logs are commonly used to troubleshoot problems and are typically found in /var/log:

component location
services /var/log/messages or their respective location
prometheus /var/log/prometheus/*
luna daemon, utils and cli /var/log/luna/*
nodes /var/log/cluster-messages/*

for the services configured by TrinityX, rotates are setup to prevent /var/log from filling up.

Problems encountered using TrinityX

Nodes cannot boot

during the initial (i)PXE request, zero bytes are returned. Follow the following steps on the controller(s):

  • Ensure there are at least two files in /var/lib/tftpboot with a size, i.e. not zero bytes:
    • luna_ipxe.efi
    • luna_undionly.kpxe
  • Check the contents of /etc/dhcp/dhcpd.conf. there should be a block that looks like depicted below.
  • Verify dhcpd and/or dhcpd6 is running. dhcpd6 is normally only needed to boot using IPv6
  • Check the log /var/log/luna/luna2-daemon.log for any (suspicious) errors or warnings
  • Verify whether the firewall is blocking traffic

Example dhcpd.conf block:

subnet 10.141.0.0 netmask 255.255.0.0 {
    max-lease-time 28800;
    if exists user-class and option user-class = "iPXE" {
        filename "http://10.141.255.254:7050/boot";
    } else {
        if option client-architecture = 00:07 {
            filename "luna_ipxe.efi";
        } elsif option client-architecture = 00:0e {
        # OpenPower do not need binary to execute.
        # Petitboot will request for config
        } else {
            filename "luna_undionly.kpxe";
        }
    }
    next-server 10.141.255.254;
    range 10.141.10.0 10.141.255.253;
    option routers 10.141.255.254;
    option domain-name "cluster";
    ddns-domainname "cluster.";
    ddns-rev-domainname "in-addr.arpa.";
    update-static-leases on;
    option luna-id "lunaclient";
}

Firewall

If the controller and compute playbooks have completed, the dhcpd service is running but there is no boot possible, it could be that the firewall is not configured properly.

The default setup is that there is one public and one trusted interface in firewalld. If the group_vars/all.yml is not configured correctly, the interfaces are placed in public which is limited.

Misconfigured:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192 ens224
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
  [...]

Correct:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
  [...]

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: ens224

The interface can be switched using firewall-cmd (where ens224 is the internal interface):

firewall-cmd --remove-interface ens224 --zone=public
firewall-cmd --add-interface ens224 --zone=trusted --permanent
firewall-cmd --reload

Open OnDemand shows an internal server error

Upon accessing the URL for OOD, the below error is encountered.

Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator at root@localhost to inform them of the time this error occurred, and the actions you performed just before this error.

More information about this error may be available in the server error log.

Two main reasons:

  • The external FQDN (trix_external_fqdn) setting in the group_vars/all.yml config file, used during installation is not resolvable by the controller itself. Make sure the controller can resolve its own FQDN by pointing the forwarder to your own DNS server (that can resolve the external FQDN) or add an entry in /etc/hosts to match.
  • The certificate generated for the controller contains the FQDN as a valid host. The FQDN (trix_external_fqdn) is typically set in the group_vars/all.yml config file during installation time When this name does not match with how the external IP of the controller is resolved, the certificate is not valid.

Please refer to Open OnDemand section.

Luna Graphical applications show Internal Server Error

Due to changes on the luna daemon side, the graphical luna applications might render an Internal Server Error. This is most likely caused by updating luna, but not included the ood-apps. To solve this issue, update the ood-apps:

ansible-playbook controller.yml --tags=ood-apps,luna

Please also look at Updating using the cloned TrinityX branch.

Slurm Graphical Configurator application shows 'list index out of range' message

Starting the Slurm Graphical Configurator results in a white screen with the text
{"message":"list index out of range"}.
To solve this issue, update the config-manager and restart Open OnDemand services:

ansible-playbook controller.yml --tags=config-manager
systemctl restart htcacheclean httpd

After logging into the portal again, the application works as intended.

Please also look at Updating using the cloned TrinityX branch