Troubleshooting
TrinityX comes with a diagnosis tool that checks the status of the required services:
# trinity_diagnosis
Trinity Core
chronyd: active (running) since Tue 2023-10-17 20:45:29 CEST; 17h ago
named: active (running) since Tue 2023-10-17 20:55:07 CEST; 17h ago
dhcpd: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago
mariadb: active (running) since Tue 2023-10-17 20:46:20 CEST; 17h ago
nfs-server: active (exited) since Tue 2023-10-17 20:44:38 CEST; 17h ago
nginx: active (running) since Tue 2023-10-17 20:49:52 CEST; 17h ago
Luna
luna2-daemon: active (running) since Tue 2023-10-17 20:49:51 CEST; 17h ago
aria2c: active (running) since Wed 2023-10-18 13:50:51 CEST; 13min ago
LDAP
slapd: active (running) since Tue 2023-10-17 20:45:31 CEST; 17h ago
sssd: active (running) since Tue 2023-10-17 20:46:17 CEST; 17h ago
Slurm
slurmctld: active (running) since Tue 2023-10-17 20:55:13 CEST; 17h ago
Monitoring core
influxdb: active (running) since Tue 2023-10-17 20:54:32 CEST; 17h ago
telegraf: active (running) since Tue 2023-10-17 20:55:17 CEST; 17h ago
grafana-server: active (running) since Tue 2023-10-17 20:55:38 CEST; 17h ago
sensu-server: active (running) since Tue 2023-10-17 20:55:31 CEST; 17h ago
sensu-api: active (running) since Tue 2023-10-17 20:55:32 CEST; 17h ago
rabbitmq-server: active (running) since Wed 2023-10-18 13:31:14 CEST; 33min ago
Trinity OOD
httpd: active (running) since Tue 2023-10-17 20:55:16 CEST; 17h ago
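When the list is long, it can help to show only the services that are not reported as active. The following is a minimal sketch, not part of TrinityX: `not_active` is a hypothetical helper name, and it assumes every service line has the `name: state ...` shape shown above, with section titles such as "Trinity Core" carrying no colon.

```shell
# Hypothetical helper: keep only service lines whose state is not "active".
# Assumes the "name: state ..." layout shown in the sample output above.
not_active() {
    grep -E '^[A-Za-z0-9_.@-]+: ' | grep -v ': active '
}

# Usage on the controller:
#   trinity_diagnosis | not_active
#
# A service reported here can then be inspected with the standard systemd tools,
# for example for dhcpd:
#   systemctl status dhcpd
#   journalctl -u dhcpd -b
```

Services that are `active (exited)`, such as nfs-server above, are treated as healthy by this filter.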
Firewall
If the controller and compute playbooks have completed and the dhcpd service is running, but nodes still cannot boot, the firewall may not be configured properly.
In the default setup, firewalld has one interface in the public zone and one in the trusted zone. If group_vars/all.yml is not configured correctly, all interfaces are placed in the public zone, which is restrictive.
Misconfigured:
public (active)
target: default
icmp-block-inversion: no
interfaces: ens192 ens224
sources:
services: cockpit dhcpv6-client ssh
ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
[...]
Correct:
public (active)
target: default
icmp-block-inversion: no
interfaces: ens192
sources:
services: cockpit dhcpv6-client ssh
ports: 22/tcp 443/tcp 8080/tcp 3001/tcp 3000/tcp
[...]
trusted (active)
target: ACCEPT
icmp-block-inversion: no
interfaces: ens224
The interface can be moved to the trusted zone using firewall-cmd (where ens224 is the internal interface):
firewall-cmd --remove-interface ens224 --zone=public
firewall-cmd --add-interface ens224 --zone=trusted --permanent
firewall-cmd --reload
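After the reload, it is worth confirming that the internal interface really ended up in the trusted zone. The sketch below assumes the usual `firewall-cmd --get-active-zones` layout (a zone name on an unindented line, followed by an indented `interfaces:` line); `zone_of_interface` is a hypothetical helper name, not a firewalld command.

```shell
# Hypothetical helper: print the zone that holds the given interface.
# Reads the output of `firewall-cmd --get-active-zones` on stdin.
zone_of_interface() {
    awk -v ifc="$1" '
        /^[^[:space:]]/ { zone = $1 }      # unindented line: zone name
        $1 == "interfaces:" {              # indented line: interfaces in that zone
            for (i = 2; i <= NF; i++) if ($i == ifc) print zone
        }
    '
}

# Usage on the controller (ens224 is the internal interface):
#   firewall-cmd --get-active-zones | zone_of_interface ens224
```

For a single interface, `firewall-cmd --get-zone-of-interface=ens224` gives the same answer directly; the helper is mainly useful when checking several interfaces against one captured output.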
Ansible
When following the main branch, variables may be added to or changed in group_vars/all.yml.example that are not yet reflected in your group_vars/all.yml. If this is the case, a message such as the following may appear:
# ansible-playbook controller.yml
PLAY [controllers] *****************************************************************************************************
TASK [Gathering Facts] *************************************************************************************************
fatal: [controller1]: FAILED! => {"msg": "The field 'environment' has an invalid value, which includes an undefined variable. The error was: 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined. 'trix_external_fqdn' is undefined"}
To see what needs to be adjusted, compare the two files with diff:
# diff group_vars/all.yml group_vars/all.yml.example
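A full diff also shows every value you changed on purpose. To list only the variables that are missing from your file, something like the following sketch may help. It assumes the variables are plain top-level `name: value` keys that are not commented out; `yaml_keys` and `missing_vars` are hypothetical helper names.

```shell
# Hypothetical helper: list the top-level keys of a YAML file, one per line.
yaml_keys() {
    sed -nE 's/^([A-Za-z_][A-Za-z0-9_]*):.*/\1/p' "$1" | sort -u
}

# Hypothetical helper: print keys present in the example ($2) but absent from your file ($1).
missing_vars() {
    mine=$(mktemp)
    yaml_keys "$1" > "$mine"
    # -x: whole-line match, -F: fixed strings, -v: keep lines not found in "$mine"
    yaml_keys "$2" | grep -vxFf "$mine"
    rm -f "$mine"
}

# Usage from the directory that contains group_vars:
#   missing_vars group_vars/all.yml group_vars/all.yml.example
```

Nested keys and commented-out defaults are ignored by this sketch, so the plain diff remains the authoritative comparison.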
Other troubleshooting utilities
TrinityX comes bundled with a graphical diagnostic application that helps troubleshoot InfiniBand issues. It shows the topology and highlights troubled links, including the host and port where the problems are seen.
Centralized syslog
Syslog is configured to collect all the logs from the services on the controller(s) and the nodes. Logs are commonly used to troubleshoot problems and are typically found in /var/log:
component | location |
---|---|
services | /var/log/messages or their respective location |
prometheus | /var/log/prometheus/* |
luna daemon, utils and cli | /var/log/luna/* |
nodes | /var/log/cluster-messages/* |
For the services configured by TrinityX, log rotation is set up to prevent /var/log from filling up.
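As a starting point when searching these locations, the sketch below prints the most recent lines that mention an error or failure under one of the directories in the table. `scan_logs` is a hypothetical helper name, and the keywords `error` and `fail` are an assumption about how the services word their messages.

```shell
# Hypothetical helper: show the last N lines (default 20) under a log directory
# that mention "error" or "fail", case-insensitively, prefixed with the file name.
scan_logs() {
    dir=$1
    n=${2:-20}
    grep -riE 'error|fail' "$dir" 2>/dev/null | tail -n "$n"
}

# Usage on the controller:
#   scan_logs /var/log/luna
#   scan_logs /var/log/cluster-messages 50
```

The result is grouped by file rather than sorted by time, so check the timestamps in the lines themselves when correlating events across nodes.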