AlertX — Architecture, Alerting & Node-Draining

Component: AlertX (TrinityX monitoring, alerting & NHC auto-draining)
Source: alertx (development), with the deployment role from trinityx-combined (development, site/roles/trinity/alertx) and the GUI from trinityx-ood (development, alertx/)
Introduced: TrinityX 15

Scope — How AlertX fits together: the alertx CLI and the AlertX OnDemand GUI, the luna2-daemon Prometheus-rule plugins that are the single source of truth, the Prometheus/Alertmanager monitoring stack, and the AlertX drainer that turns NHC alerts into Slurm node drains. For day-to-day usage, see the AlertX admin guide; this page is the developer/integrator architecture view.


1. Overview & design philosophy

AlertX is TrinityX's unified monitoring + alerting + node-health system. It wraps Prometheus, Alertmanager, a set of in-house exporters, and a rule engine, and adds NHC (Node Health Checking) that automatically drains unhealthy nodes from the Slurm queue.

Its guiding idea is a single source of truth for rules. Classic NHC tools keep their own health checks separate from the monitoring system, so the two drift: nodes get drained without an alert, or alerts fire without draining. AlertX defines each rule once — the same rule both raises an alert and (if flagged) drains the node. Because metrics are collected continuously, a node's problem is visible the moment it occurs, rather than discovered by a periodic trial-and-error health probe.

1.1 Design principles

  • One rule set, two outcomes. Every rule lives in Prometheus; a rule labelled nhc: true additionally drives draining. Alerting and remediation never disagree because they read the same rule.
  • The daemon owns the rules. Both the CLI and the GUI are thin front-ends; the luna2-daemon stores the rule configuration and renders the Prometheus rule files (§6). Edit anywhere, converge everywhere.
  • Three rule domains. Generic (OS-level), Services (required daemons), and Hardware (auto-generated per node) — see §3.
  • Pull, don't trust. The drainer polls Alertmanager for the current truth and reconciles Slurm state every cycle, tagging only its own drains so it never fights a manual drain (§8).
  • Opt-in remediation. Draining only runs when the workload manager is Slurm; NHC is a subset of rules, so an alert drains a node only if its rule explicitly enables NHC.

2. Architecture & components

Figure: AlertX component architecture

AlertX spans three planes:

  • Management plane — the alertx CLI (§4) and the AlertX OOD GUI (§5) edit rules through the luna2-daemon's import/export plugins (§6), which render the Prometheus rule files. This is the single source of truth.
  • Monitoring plane — exporters feed Prometheus, which scrapes targets (discovered via the daemon), evaluates the rules, and pushes firing alerts to Alertmanager (§7).
  • Remediation plane — the AlertX drainer reads firing NHC alerts from Alertmanager and drains/undrains nodes in Slurm (§8).

2.1 Components & ports

Component | Where | Listen / port | Notes
alertx CLI | controller | - | talks to daemon :7050, Prometheus :9090, Alertmanager :9093
AlertX OOD GUI | controller (OnDemand) | via OnDemand Passenger | Flask SPA; same backends as the CLI
luna2-daemon | controller | :7050 (REST, JWT) | hosts the prometheus_* import/export plugins + scrape http_sd
Prometheus server | controller | :9090 (HTTPS, TLS) | scrape + rule evaluation
Alertmanager | controller | :9093 (HTTPS, TLS + basic-auth) | alert routing; cluster peer :6783
node_exporter | every node | :14200 | base node metrics (note: not the default 9100)
IPMI / luna exporters | controller / nodes | - | hardware + cluster metrics
AlertX drainer | controller | no listen port | alertx-drainer.service; outbound poll every 60 s; Slurm only
alertx-hook | every node | - | alertx-hook.service, oneshot at first boot (§9)

Source note (development branch). Alertmanager still defines a legacy alertx-drainer webhook receiver pointing at :7150/listener, but no service listens there — the live mechanism is the drainer polling Alertmanager's /api/v2/alerts (§8). Treat the webhook wiring as vestigial.


3. Rule domains (Generic · Services · Hardware) and NHC

A rule is a Prometheus alerting rule plus AlertX metadata. AlertX organises them into three domains:

Domain | CLI | What it covers | Typical source
Generic | alertx generic | OS-level conditions (CPU/IPMI faults, memory, filesystem, load) | hand-configured
Services | alertx service | required daemons up (slurmd, sshd, sssd, rsyslogd …) | hand-configured
Hardware | alertx hardware | per-node hardware baseline (counts, models, sizes) | auto-generated on first boot and resettable after a hardware change

Two labels carried on the rendered rules drive behaviour:

  • nhc: true marks a rule as NHC-enabled — a firing alert for it will drain the affected node. Rules without it (or nhc: false, e.g. LunaDaemonProblemsOnControllers) only alert. NHC is a strict subset of the rules.
  • job=luna_controllers identifies a controller rule. From TrinityX 15.2 onward, a firing NHC controller rule drains every node in the cluster, on the reasoning that a controller-side problem (e.g. an exported /home filling up) affects all nodes (§8).
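
How the two labels combine can be sketched in a few lines of Python. This is an illustration only, not the drainer's code: the function name drain_scope and its return values are hypothetical, and it assumes label values arrive as strings, as Prometheus labels do.

```python
def drain_scope(labels):
    """Classify one firing alert by its labels: 'none', 'node' or 'cluster'.

    Hypothetical helper for illustration; the names are not from the alertx source.
    """
    # Only rules rendered with nhc: true take part in draining.
    if labels.get("nhc") != "true":
        return "none"
    # An NHC rule on the controller scrape job affects every node (15.2 onward).
    if labels.get("job") == "luna_controllers":
        return "cluster"
    # Any other NHC rule drains only the node named in the alert.
    return "node"


print(drain_scope({"nhc": "true", "hostname": "node001"}))
print(drain_scope({"nhc": "false", "job": "luna_controllers"}))
```

The second call models a rule such as LunaDaemonProblemsOnControllers: it is a controller rule, but because nhc is false it only alerts.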

4. The AlertX CLI

alertx (the alertx repo, development, v1.0) is the scriptable front-end. main.py builds an argparse tree from five classes; each is a sub-command:

Sub-command | Class | Purpose
alertx overview | Overview | dashboard summary of cluster + alert state
alertx alerts [silent] | Alerts | list/filter firing alerts; silence an alert
alertx generic [show/add/change/remove] | Generic | manage Generic rules
alertx service [show/add/change/remove] | Service | manage Services rules
alertx hardware [global/reset] | Hardware | list per-node hardware rules; edit global settings; reset a node's baseline

Cross-cutting flags: -e/--edit (open the rule as JSON, or YAML, in $EDITOR), -X/--sortby, -v/--verbose, -V/--version, plus per-command filters and bash completion (shtab).
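
The shape of that command tree can be approximated with plain argparse. The sketch below is an assumption about layout, not a copy of main.py: it uses one flat function instead of the five classes, and attaching the shared flags through a parent parser is a guess at how they are made cross-cutting.

```python
import argparse


def build_parser():
    # Flags shared by every sub-command (placing them on a parent parser is an assumption).
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-e", "--edit", action="store_true", help="open the rule in an editor")
    common.add_argument("-X", "--sortby", metavar="FIELD", help="sort output by FIELD")
    common.add_argument("-v", "--verbose", action="store_true")

    parser = argparse.ArgumentParser(prog="alertx")
    parser.add_argument("-V", "--version", action="version", version="%(prog)s 1.0")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("overview", parents=[common], help="cluster and alert summary")

    # Sub-command -> optional actions, as listed in the sub-command table.
    actions = {
        "alerts": ["silent"],
        "generic": ["show", "add", "change", "remove"],
        "service": ["show", "add", "change", "remove"],
        "hardware": ["global", "reset"],
    }
    for name, choices in actions.items():
        cmd = sub.add_parser(name, parents=[common])
        cmd.add_argument("action", nargs="?", choices=choices)
    return parser


args = build_parser().parse_args(["generic", "add", "-e"])
print(args.command, args.action, args.edit)  # generic add True
```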

Backends. alertx.utils.rest.Rest talks to three endpoints:

  • luna2-daemon (:7050, JWT) — rule CRUD via the prometheus_rules / prometheus_hw_rules / prometheus_rules_settings plugins, and /config/node for the node list (§6).
  • Prometheus (:9090) — GET /api/v1/query?query=max by(hostname)(up) for node up-state and GET /api/v1/rules for live rule status.
  • Alertmanager (:9093, basic-auth) — GET /api/v2/alerts for the firing alerts shown by alertx alerts.
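
As an illustration of the Prometheus call, the sketch below builds the query URL and turns a standard instant-query response into a per-node up map. The helper names and the base URL are hypothetical; the response layout is the regular Prometheus /api/v1/query vector format, and the sample values are invented.

```python
from urllib.parse import urlencode


def up_query_url(base):
    """Build the instant-query URL for the node up-state (base URL is a placeholder)."""
    return f"{base}/api/v1/query?" + urlencode({"query": "max by(hostname)(up)"})


def node_up_state(response):
    """Map hostname -> True/False from a Prometheus instant-query vector result."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    state = {}
    for sample in response["data"]["result"]:
        hostname = sample["metric"].get("hostname")
        if hostname is None:
            continue  # series without a hostname label cannot be mapped to a node
        _timestamp, value = sample["value"]  # value is a string such as "1" or "0"
        state[hostname] = float(value) == 1.0
    return state


# Invented sample response for illustration.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"hostname": "node001"}, "value": [1700000000.0, "1"]},
            {"metric": {"hostname": "node002"}, "value": [1700000000.0, "0"]},
        ],
    },
}
print(node_up_state(sample))  # {'node001': True, 'node002': False}
```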

Config & auth. Config lives under /trinity/local/alertx/cli/config/alertx.ini (logging), luna.ini (the [API] daemon endpoint + credentials + SECRET_KEY), and a cached token.txt. Auth is the same JWT scheme as the luna2 utilities: POST /token → cache the JWT → send x-access-tokens: <jwt> on every daemon call. Alertmanager basic-auth credentials are read from /etc/trinity/passwords/prometheus/ (filename = user, contents = password). Rest().daemon_validation() runs before every command.
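
The auth steps can be sketched as follows. Everything here is illustrative: the helper names are hypothetical, the JSON body of the token request (username/password) is an assumption that the text above does not spell out, and the basic-auth reader simply takes the first file it finds in the directory.

```python
import json
from pathlib import Path
from urllib.request import Request


def token_request(endpoint, username, password):
    """Prepare POST /token; the username/password body shape is an assumption."""
    body = json.dumps({"username": username, "password": password}).encode()
    return Request(
        f"{endpoint}/token",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def daemon_headers(jwt):
    """Header sent on every daemon call once the JWT is cached."""
    return {"x-access-tokens": jwt}


def read_basic_auth(directory):
    """Read Alertmanager credentials: filename = user, contents = password."""
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            return entry.name, entry.read_text().strip()
    raise FileNotFoundError(f"no credential file in {directory}")
```

A caller would send token_request(...) with urllib.request.urlopen, store the returned token (for example in token.txt), and pass daemon_headers(token) on later requests.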


5. The AlertX OOD GUI

The GUI (trinityx-ood, development, alertx/) is the point-and-click counterpart of the CLI: a Flask app served as an Open OnDemand Passenger app, fronting a compiled Vite single-page app. passenger_wsgi.py exposes app from app.py; OnDemand mounts it under its per-app prefix (tile AlertX, category Monitoring).

It uses the same luna2-daemon backend as the CLI (rest.py), so the GUI and CLI are interchangeable editors of one rule set:

Route | Backend call (luna2-daemon) | Purpose
/get_rules / /save_config | export / import prometheus_rules | generic + service rule editor
/get_nodes / /save_nodes | /config/node + export/import prometheus_hw_rules | per-node hardware health checks
/get_global / /set_global | export/import prometheus_rules_settings | global hardware-rule settings
/proxy, /proxy_post | (client-side) | CORS proxy to Prometheus :9090 / Alertmanager :9093 for the live view
/ | - | serves the SPA, injecting PROMQL_URL, APP_URL, ALERT_URL

Auth is the same JWT-from-luna.ini scheme (POST /token → x-access-tokens), here using the OnDemand luna.ini path. The live alerts/PromQL view is rendered client-side in the browser, reaching Prometheus/Alertmanager through the Flask /proxy endpoints (Alertmanager credentials injected from the passwords dir).


6. The luna2-daemon backend — single source of truth

Neither front-end writes rule files directly. Both call the luna2-daemon's import/export plugin endpoints; the daemon stores the configuration in the luna database and renders the Prometheus rule files (trix.rules, plus a status-detail file). Editing from the CLI on one controller and the GUI on another therefore converges through the daemon (and replicates across HA controllers like any other daemon state).

Figure: AlertX rule configuration flow (single source of truth)

Daemon endpoint | Used for
GET /export/prometheus_rules · POST /import/prometheus_rules | Generic + Services rule definitions
GET /export/prometheus_hw_rules · POST /import/prometheus_hw_rules | per-node Hardware rules
GET /export/prometheus_rules_settings · POST /import/prometheus_rules_settings | global hardware-rule settings
GET /config/node | node list (to attach hardware rules)
GET /export/prometheus | Prometheus scrape targets (http_sd) for the luna_nodes / luna_controllers jobs
POST /token | issue the JWT used by both front-ends

These plugins are the luna2-daemon plugins/export/prometheus* and plugins/import/prometheus* modules; they must be listed in the daemon's ALLOWED_EXPORTERS / ALLOWED_IMPORTERS. Because the scrape targets are served by /export/prometheus (http_sd), Prometheus always scrapes exactly the nodes luna knows about — node membership and monitoring stay in lock-step.
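
To make the http_sd idea concrete, the sketch below produces a payload in the generic format Prometheus expects from an HTTP service-discovery endpoint: a JSON list of objects with targets and labels. It is not the daemon's export plugin; the function name is hypothetical, and attaching a hostname label and using port 14200 for every target are assumptions for illustration.

```python
import json


def http_sd_payload(nodes, port=14200):
    """Return target groups in Prometheus HTTP SD format, one group per node."""
    return [
        {
            "targets": [f"{name}:{port}"],
            # Assumed label; the real plugin may attach different labels.
            "labels": {"hostname": name},
        }
        for name in sorted(nodes)
    ]


print(json.dumps(http_sd_payload(["node002", "node001"]), indent=2))
```

Because the list is regenerated from the node inventory on every request, removing a node from the inventory removes its scrape target on the next discovery refresh.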


7. The monitoring pipeline

   exporters            Prometheus (:9090)            Alertmanager (:9093)
   ─────────            ──────────────────            ────────────────────
   node_exporter :14200 ─┐
   IPMI exporter        ─┼─ scrape (TLS) ──▶ evaluate rules ──▶ firing alerts ──▶ route
   luna exporter        ─┘   targets via             (trix.rules,                 (TLS +
   (per node + ctrl)        http_sd  daemon :7050     nhc / job labels)            basic-auth)

  • Scraping. Prometheus discovers targets through the daemon's http_sd endpoint (/export/prometheus) for the luna_nodes and luna_controllers jobs, plus static/file-SD jobs. The job=luna_controllers label is just the controller scrape-job name, the same label the drainer keys its global drain on (§8).
  • Evaluation. Prometheus loads the rendered rule files and evaluates them; matching series raise alerts carrying the nhc and job labels and an annotations.description reason.
  • Routing. Firing alerts are pushed to Alertmanager over TLS with basic-auth. Alertmanager is the queryable source of currently active alerts for the CLI, the GUI live view, and the drainer.

8. NHC — automatic node draining

The drainer (alertx-drainer.service, from the trinityx-combined alertx role) is a Python loop — not a webhook receiver. It only runs when the workload manager is Slurm (alertx_drainer_enable). Every DRAIN_INTERVAL (default 60 s) it reconciles active NHC alerts against Slurm node state.

Figure: AlertX alert-to-drain decision flow

Each cycle:

  1. Pull alerts: GET https://localhost:9093/api/v2/alerts (basic-auth), keep only alerts with label nhc=="true", map each to a node by its hostname label, and take the reason from annotations.description.
  2. Read Slurm: scontrol -o show nodes; a node counts as drainer-drained only if its State contains DRAIN and its Reason contains the marker TRIX-DRAINER:. This is how the drainer avoids touching a manually drained node.
  3. Controller rule first — if any active NHC alert has job=="luna_controllers", drain every node (Reason=TRIX-DRAINER: Controller NHC: …) and skip per-node logic this cycle.
  4. Per node — not-drained + alert → drain; drainer-drained + no alert → resume (when AUTO_UNDRAIN); drained + changed reason → update reason.
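
The four steps can be sketched as a pure planning function that takes the Alertmanager alerts and the current Slurm node state and returns the actions to perform. This is a reading of the steps above, not the drainer's source: all names are hypothetical, reasons are returned without the marker prefix (it would be added when the command is issued), and skipping already-drained nodes during a controller drain is a choice made for this sketch.

```python
MARKER = "TRIX-DRAINER:"


def nhc_reasons(alerts):
    """Return ({hostname: reason}, controller_reason_or_None) for NHC alerts."""
    per_node, controller = {}, None
    for alert in alerts:
        labels = alert.get("labels", {})
        if labels.get("nhc") != "true":
            continue  # non-NHC alerts never drain
        reason = alert.get("annotations", {}).get("description", "")
        if labels.get("job") == "luna_controllers":
            controller = controller or reason
        elif "hostname" in labels:
            per_node.setdefault(labels["hostname"], reason)
    return per_node, controller


def plan(alerts, slurm_nodes, auto_undrain=True):
    """slurm_nodes maps name -> (State, Reason). Returns (action, node, reason) tuples."""
    per_node, controller = nhc_reasons(alerts)
    if controller is not None:
        # Controller rule first: drain everything, skip per-node logic this cycle.
        # Assumption: nodes already in a DRAIN state are left alone.
        return [("drain", n, f"Controller NHC: {controller}")
                for n in sorted(slurm_nodes) if "DRAIN" not in slurm_nodes[n][0]]
    actions = []
    for node in sorted(slurm_nodes):
        state, reason = slurm_nodes[node]
        drained = "DRAIN" in state
        ours = drained and MARKER in reason  # only marker-tagged drains are ours
        wanted = per_node.get(node)
        if wanted is not None:
            if not drained:
                actions.append(("drain", node, wanted))
            elif ours and f"{MARKER} {wanted}" not in reason:
                # Substring test tolerates text Slurm appends to the reason.
                actions.append(("update", node, wanted))
        elif ours and auto_undrain:
            actions.append(("resume", node, ""))
    return actions
```

A manually drained node (DRAIN without the marker) produces no action in any branch, which is the property the marker exists to guarantee.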

The exact Slurm commands (reason always marker-prefixed):

scontrol update NodeName=<node> State=DRAIN Reason="TRIX-DRAINER: <reason>"
scontrol update NodeName=<node> State=RESUME
scontrol update NodeName=<node> Reason="TRIX-DRAINER: <new reason>"
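
When these commands are launched from Python, passing them as an argument list avoids shell quoting problems with the space inside the reason. The builder below is a hypothetical sketch (the function name and action keywords are not from the drainer source); it only constructs the argument vector and prints a shell-safe rendering of it.

```python
import shlex

MARKER = "TRIX-DRAINER:"


def scontrol_argv(action, node, reason=""):
    """Build the scontrol argument list for one drainer action."""
    argv = ["scontrol", "update", f"NodeName={node}"]
    if action == "drain":
        argv += ["State=DRAIN", f"Reason={MARKER} {reason}"]
    elif action == "resume":
        argv += ["State=RESUME"]
    elif action == "update":
        argv += [f"Reason={MARKER} {reason}"]
    else:
        raise ValueError(f"unknown action: {action}")
    return argv


# The reason stays a single argument even though it contains a space.
print(shlex.join(scontrol_argv("drain", "node001", "CPU fault")))
# scontrol update NodeName=node001 State=DRAIN 'Reason=TRIX-DRAINER: CPU fault'
```

The list could then be handed to subprocess.run(argv, check=True) without going through a shell.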

Because the drainer pulls the truth and tags its own work, it is self-healing: an alert that clears auto-undrains the node next cycle; a controller alert that clears releases the cluster.


9. Hardware-rule seeding at first boot

Hardware rules are per-node and derive from the node's actual hardware, so they are generated the first time a node boots:

  1. The node runs the oneshot alertx-hook.service → /usr/local/sbin/alertx-hook.sh.
  2. The hook reads the node's luna.ini, fetches a TPM-bound JWT (POST {LUNA_URL}/tpm/<host> on :7050), then calls POST {LUNA_URL}/import/prometheus_hw_rules with [{"hostname": "<fqdn>", "force": false}] (retry/back-off).
  3. The luna2-daemon generates that node's hardware Prometheus rules; Prometheus picks them up, and from then on hardware faults on the node alert (and drain, if NHC-enabled).
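
The hook itself is a shell script; the Python below only illustrates the two ideas in step 2, the request body and the retry/back-off. The function names, the number of attempts and the doubling delay are assumptions, not values taken from alertx-hook.sh.

```python
import json
import time


def hw_rules_payload(fqdn, force=False):
    """Body for POST /import/prometheus_hw_rules, as described in step 2."""
    return json.dumps([{"hostname": fqdn, "force": force}])


def with_backoff(call, attempts=5, first_delay=1.0, sleep=time.sleep):
    """Run call(); on a connection-type error wait, double the delay and retry."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except OSError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay *= 2


print(hw_rules_payload("node001.cluster"))
# [{"hostname": "node001.cluster", "force": false}]
```

Sending force as false asks the daemon to seed rules for the node; when a deliberate re-baseline is wanted, that is what alertx hardware reset is for.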

alertx hardware reset (CLI) or the GUI re-baselines a node after an intentional hardware change, so the rules match the new configuration instead of alerting on the difference.


10. End-to-end example & reference

Adding a rule that drains on a CPU fault

  1. alertx generic add -e (or the GUI rule editor) → the rule is sent to POST /import/prometheus_rules, labelled nhc: true. (management plane)
  2. The luna2-daemon stores it and renders trix.rules; Prometheus reloads. (§6)
  3. A node's IPMI exporter reports a processor fault; Prometheus evaluates the rule and the alert fires with nhc=true, hostname=<node>. (§7)
  4. Alertmanager holds the alert; the drainer's next 60 s poll sees it. (§8)
  5. The drainer runs scontrol update NodeName=<node> State=DRAIN Reason=TRIX-DRAINER: <description>. The node stops accepting new jobs. (§8)
  6. The fault is fixed, the alert clears; the next cycle runs scontrol update NodeName=<node> State=RESUME. (§8)

Quick reference

I want to… | Look at
Add/change an alert rule | alertx generic/service or the GUI → daemon import/prometheus_rules (§4, §6)
Re-baseline a node's hardware rules | alertx hardware reset → import/prometheus_hw_rules (§9)
See what's firing | alertx alerts / GUI live view → Alertmanager /api/v2/alerts (§4)
Understand a drain | check the node's Reason for TRIX-DRAINER:; the drainer logs to /var/log/alertx (§8)
Enable/disable auto-draining | the alertx role's alertx_drainer_enable (Slurm only) + per-rule nhc flag (§8, §3)
Change scrape targets | they follow luna node membership via daemon /export/prometheus (§6, §7)

Source: alertx, trinityx-combined (site/roles/trinity/alertx) and trinityx-ood (alertx/), all on development.