Troubleshooting

TrinityX ships with a diagnosis tool, trinity_diagnosis, that reports the status of the required services.

# trinity_diagnosis
Trinity Core
        chronyd: active (running) since Tue 2023-10-17 20:45:29 CEST; 17h ago
        named: active (running) since Tue 2023-10-17 20:55:07 CEST; 17h ago
        dhcpd: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago
        mariadb: active (running) since Tue 2023-10-17 20:46:20 CEST; 17h ago
        nfs-server: active (exited) since Tue 2023-10-17 20:44:38 CEST; 17h ago
        nginx: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago


Luna
        luna2-daemon: active (running) since Tue 2023-10-17 20:49:51 CEST; 17h ago
        aria2c: active (running) since Wed 2023-10-18 13:50:51 CEST; 13min ago


LDAP
        slapd: active (running) since Tue 2023-10-17 20:45:31 CEST; 17h ago
        sssd: active (running) since Tue 2023-10-17 20:46:17 CEST; 17h ago


Slurm
        slurmctld: active (running) since Tue 2023-10-17 20:55:13 CEST; 17h ago


Monitoring core
        influxdb: active (running) since Tue 2023-10-17 20:54:32 CEST; 17h ago
        telegraf: active (running) since Tue 2023-10-17 20:55:17 CEST; 17h ago
        grafana-server: active (running) since Tue 2023-10-17 20:55:38 CEST; 17h ago
        sensu-server: active (running) since Tue 2023-10-17 20:55:31 CEST; 17h ago
        sensu-api: active (running) since Tue 2023-10-17 20:55:32 CEST; 17h ago
        rabbitmq-server: active (running) since Wed 2023-10-18 13:31:14 CEST; 33min ago


Trinity OOD
        httpd: active (running) since Tue 2023-10-17 20:55:16 CEST; 17h ago
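
When the report is long, it can help to filter it down to the entries that need attention. The following is a minimal sketch and not part of TrinityX: the function name failed_services is hypothetical, and it assumes that every service line keeps the "name: state (...)" shape shown above, with a state other than "active" (for example "inactive" or "failed") when a service is down.

```shell
# Hypothetical helper: print the service lines whose state is not "active".
# Reads a trinity_diagnosis-style report on stdin.
failed_services() {
    awk '
        # Service lines are indented and look like "name: state (detail) ...".
        /^[ \t]+[^ \t:]+: / {
            if ($2 != "active") {
                sub(/^[ \t]+/, "")   # drop the indentation for readability
                print
            }
        }
    '
}

# Usage (as root on the controller):
#   trinity_diagnosis | failed_services
```

An empty result means every listed service reported an active state.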

Nodes cannot boot

During the initial (i)PXE request, zero bytes are returned. Work through the following steps on the controller(s):

Ensure that both of the following files exist in /var/lib/tftpboot and are not empty (zero bytes):
- luna_ipxe.efi
- luna_undionly.kpxe
Check the contents of /etc/dhcp/dhcpd.conf. It should contain a block like the example below.
Verify that dhcpd and/or dhcpd6 is running. dhcpd6 is normally only needed when booting over IPv6.
Check the log /var/log/luna/luna2-daemon.log for errors or warnings.
Verify that the firewall is not blocking traffic.
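
The file check in the first step can be scripted. This is a small sketch rather than a TrinityX tool: check_boot_files is a hypothetical helper name, and its directory argument defaults to the /var/lib/tftpboot path named above.

```shell
# Hypothetical helper: verify that both iPXE boot files exist and are non-empty.
check_boot_files() {
    dir=${1:-/var/lib/tftpboot}
    status=0
    for f in luna_ipxe.efi luna_undionly.kpxe; do
        # -s is true only when the file exists and is larger than zero bytes.
        if [ -s "$dir/$f" ]; then
            echo "OK                $dir/$f"
        else
            echo "MISSING or EMPTY  $dir/$f"
            status=1
        fi
    done
    return "$status"
}

# Usage on the controller:
#   check_boot_files || echo "boot files need attention"
```

The function returns a non-zero status if either file is missing or empty, so it can be chained with other checks.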

Example dhcpd.conf block:

subnet 10.141.0.0 netmask 255.255.0.0 {
    max-lease-time 28800;
    if exists user-class and option user-class = "iPXE" {
        filename "http://10.141.255.254:7050/boot";
    } else {
        if option client-architecture = 00:07 {
            filename "luna_ipxe.efi";
        } elsif option client-architecture = 00:0e {
        # OpenPower do not need binary to execute.
        # Petitboot will request for config
        } else {
            filename "luna_undionly.kpxe";
        }
    }
    next-server 10.141.255.254;
    range 10.141.10.0 10.141.255.253;
    option routers 10.141.255.254;
    option domain-name "cluster";
    ddns-domainname "cluster.";
    ddns-rev-domainname "in-addr.arpa.";
    update-static-leases on;
    option luna-id "lunaclient";
}

Firewall

If the controller and compute playbooks have completed and the dhcpd service is running, but nodes still cannot boot, the firewall may not be configured properly.

By default, firewalld is set up with one interface in the public zone and one in the trusted zone. If group_vars/all.yml is not configured correctly, both interfaces end up in the public zone, which only allows a limited set of services and ports.

Misconfigured:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192 ens224
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
  [...]

Correct:

public (active)
  target: default
  icmp-block-inversion: no
  interfaces: ens192
  sources:
  services: cockpit dhcpv6-client ssh
  ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
  [...]

trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: ens224

The interface can be switched using firewall-cmd (where ens224 is the internal interface):

firewall-cmd --remove-interface ens224 --zone=public
firewall-cmd --add-interface ens224 --zone=trusted --permanent
firewall-cmd --reload
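
To confirm the result, the zone that holds a given interface can be read back from a firewalld zone listing. The sketch below is illustrative only: zone_of_interface is a hypothetical helper, and it assumes the listing format shown above, where an unindented zone name is followed by an indented "interfaces:" line.

```shell
# Hypothetical helper: print the zone whose "interfaces:" line lists the
# given interface. Reads a firewalld zone listing on stdin.
zone_of_interface() {
    awk -v iface="$1" '
        # An unindented, non-empty line starts a new zone, e.g. "public (active)".
        /^[^ \t]/ { zone = $1 }
        # Compare every name on the indented "interfaces:" line with iface.
        $1 == "interfaces:" {
            for (i = 2; i <= NF; i++)
                if ($i == iface) print zone
        }
    '
}

# Usage on the controller:
#   firewall-cmd --list-all-zones | zone_of_interface ens224
```

On a live system, firewall-cmd --get-zone-of-interface=ens224 answers the same question directly. The helper is mainly useful for checking saved output.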

Ansible

When following the main branch, variables may be added to or changed in group_vars/all.yml.example without being reflected in your group_vars/all.yml. In that case, a message such as the following may appear:

# ansible-playbook controller.yml

PLAY [controllers] *****************************************************************************************************

TASK [Gathering Facts] *************************************************************************************************
fatal: [controller1]: FAILED! => {"msg": "The field 'environment' has an invalid value, which includes an undefined variable. The error was: 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined"}

To see what needs to be adjusted, compare the two files with diff:

# diff group_vars/all.yml group_vars/all.yml.example
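
A plain diff also reports every value that was customised on purpose. To list only the variables that are missing, the top-level keys of both files can be compared. This is a rough sketch: missing_vars is a hypothetical helper, and it assumes variables are defined as unindented "name:" lines, so nested keys and commented-out entries are ignored.

```shell
# Hypothetical helper: print top-level keys found in the example file ($2)
# but absent from the local file ($1).
missing_vars() {
    awk '
        # Only unindented "name:" lines count as variable definitions.
        !/^[A-Za-z_][A-Za-z0-9_]*:/ { next }
        { key = $0; sub(/:.*/, "", key) }
        # First file: remember the keys that are already defined.
        FILENAME == ARGV[1] { seen[key] = 1; next }
        # Second file: report keys that were never seen.
        !(key in seen) { print key }
    ' "$1" "$2"
}

# Usage from the directory that contains group_vars:
#   missing_vars group_vars/all.yml group_vars/all.yml.example
```

Each name printed is a candidate to copy from the example file into your own group_vars/all.yml.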

Other troubleshooting utilities

TrinityX bundles a graphical diagnostic application that helps troubleshoot InfiniBand issues. It shows the topology and highlights troubled links, including the host and port where the problems are seen.

Centralized syslog

Syslog is configured to collect all the logs from the services on the controller(s) and the nodes. Logs are commonly used to troubleshoot problems and are typically found in /var/log:

component	location
services	/var/log/messages or their respective location
prometheus	/var/log/prometheus/*
luna daemon, utils and cli	/var/log/luna/*
nodes	/var/log/cluster-messages/*

For the services configured by TrinityX, log rotation is set up to prevent /var/log from filling up.
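
When scanning one of the logs listed above, a simple filter for error and warning lines is often a useful first pass. The following sketch is only an illustration: recent_problems is a hypothetical helper, and the error|warn|fail|critical pattern is an assumption about how problems are worded, not a format defined by TrinityX.

```shell
# Hypothetical helper: show the last N (default 20) lines of a log file
# that mention an error, warning, failure, or critical condition.
recent_problems() {
    logfile=$1
    count=${2:-20}
    # -i ignores case; -E enables the alternation in the pattern.
    grep -iE 'error|warn|fail|critical' "$logfile" | tail -n "$count"
}

# Usage on the controller:
#   recent_problems /var/log/luna/luna2-daemon.log
#   recent_problems /var/log/messages 50
```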