AlertX — Architecture, Alerting & Node-Draining

Component: AlertX (TrinityX monitoring, alerting & NHC auto-draining)
Source: alertx (development), with the deployment role from trinityx-combined (development, site/roles/trinity/alertx) and the GUI from trinityx-ood (development, alertx/)
Introduced: TrinityX 15

Scope — How AlertX fits together: the alertx CLI and the AlertX OnDemand GUI, the luna2-daemon Prometheus-rule plugins that are the single source of truth, the Prometheus/Alertmanager monitoring stack, and the AlertX drainer that turns NHC alerts into Slurm node drains. For day-to-day usage, see the AlertX admin guide; this page is the developer/integrator architecture view.

1. Overview & design philosophy

AlertX is TrinityX's unified monitoring, alerting, and node-health system. It wraps Prometheus, Alertmanager, a set of in-house exporters, and a rule engine, and adds NHC (Node Health Checking), which automatically drains unhealthy nodes from the Slurm queue.

Its guiding idea is a single source of truth for rules. Classic NHC tools keep their own health checks separate from the monitoring system, so the two drift: nodes get drained without an alert, or alerts fire without draining. AlertX defines each rule once — the same rule both raises an alert and (if flagged) drains the node. Because metrics are collected continuously, a node's problem is visible the moment it occurs, rather than discovered by a periodic trial-and-error health probe.

1.1 Design principles

One rule set, two outcomes. Every rule lives in Prometheus; a rule labelled nhc: true additionally drives draining. Alerting and remediation never disagree because they read the same rule.
The daemon owns the rules. Both the CLI and the GUI are thin front-ends; the luna2-daemon stores the rule configuration and renders the Prometheus rule files (§6). Edit anywhere, converge everywhere.
Three rule domains. Generic (OS-level), Services (required daemons), and Hardware (auto-generated per node) — see §3.
Pull, don't trust. The drainer polls Alertmanager for the current truth and reconciles Slurm state every cycle, tagging only its own drains so it never fights a manual drain (§8).
Opt-in remediation. Draining only runs when the workload manager is Slurm; NHC is a subset of rules, so an alert drains a node only if its rule explicitly enables NHC.

2. Architecture & components

[Figure: AlertX component architecture]

AlertX spans three planes:

Management plane — the alertx CLI (§4) and the AlertX OOD GUI (§5) edit rules through the luna2-daemon's import/export plugins (§6), which render the Prometheus rule files. This is the single source of truth.
Monitoring plane — exporters feed Prometheus, which scrapes targets (discovered via the daemon), evaluates the rules, and pushes firing alerts to Alertmanager (§7).
Remediation plane — the AlertX drainer reads firing NHC alerts from Alertmanager and drains/undrains nodes in Slurm (§8).

2.1 Components & ports

Component	Where	Listen / port	Notes
`alertx` CLI	controller	—	talks to daemon `:7050`, Prometheus `:9090`, Alertmanager `:9093`
AlertX OOD GUI	controller (OnDemand)	via OnDemand Passenger	Flask SPA; same backends as the CLI
luna2-daemon	controller	`:7050` (REST, JWT)	hosts the `prometheus_*` import/export plugins + scrape `http_sd`
Prometheus server	controller	`:9090` (HTTPS, TLS)	scrape + rule evaluation
Alertmanager	controller	`:9093` (HTTPS, TLS + basic-auth)	alert routing; cluster peer `:6783`
node_exporter	every node	`:14200`	base node metrics (note: not the default 9100)
IPMI / luna exporters	controller / nodes	—	hardware + cluster metrics
AlertX drainer	controller	no listen port	`alertx-drainer.service`; outbound poll every 60 s; Slurm only
alertx-hook	every node	—	`alertx-hook.service`, oneshot at first boot (§9)

Source note (development branch). Alertmanager still defines a legacy alertx-drainer webhook receiver pointing at :7150/listener, but no service listens there — the live mechanism is the drainer polling Alertmanager's /api/v2/alerts (§8). Treat the webhook wiring as vestigial.

3. Rule domains (Generic · Services · Hardware) and NHC

A rule is a Prometheus alerting rule plus AlertX metadata. AlertX organises them into three domains:

Domain	CLI	What it covers	Typical source
Generic	`alertx generic`	OS-level conditions (CPU/IPMI faults, memory, filesystem, load)	hand-configured
Services	`alertx service`	required daemons up (`slurmd`, `sshd`, `sssd`, `rsyslogd` …)	hand-configured
Hardware	`alertx hardware`	per-node hardware baseline (counts, models, sizes)	auto-generated on first boot and resettable after a hardware change

Two labels carried on the rendered rules drive behaviour:

nhc: true marks a rule as NHC-enabled — a firing alert for it will drain the affected node. Rules without it (or nhc: false, e.g. LunaDaemonProblemsOnControllers) only alert. NHC is a strict subset of the rules.
job=luna_controllers identifies a controller rule. From TrinityX 15.2 onward, a firing NHC controller rule drains every node in the cluster, on the reasoning that a controller-side problem (e.g. an exported /home filling up) affects all nodes (§8).
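
The effect of the two labels can be sketched as a small decision function. This is only an illustration of the behaviour described above, not the drainer's actual code; the function name drain_scope and its return values are hypothetical. Prometheus label values are strings, so nhc: true arrives as the string "true".

```python
def drain_scope(labels: dict) -> str:
    """Classify one firing alert by its labels.

    Returns "none" (alert only), "node" (drain the affected node) or
    "cluster" (controller rule: drain every node). Names are illustrative.
    """
    # Rules without the label, or with nhc: false, only alert.
    if labels.get("nhc") != "true":
        return "none"
    # An NHC rule on the controller scrape job affects the whole cluster.
    if labels.get("job") == "luna_controllers":
        return "cluster"
    return "node"
```

For example, drain_scope({"nhc": "true", "hostname": "node001"}) yields "node", while the same call with job set to luna_controllers yields "cluster".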

4. The AlertX CLI

alertx (the alertx repo, development, v1.0) is the scriptable front-end. main.py builds an argparse tree from five classes; each is a sub-command:

Sub-command	Class	Purpose
`alertx overview`	`Overview`	dashboard summary of cluster + alert state
`alertx alerts [silent]`	`Alerts`	list/filter firing alerts; silence an alert
`alertx generic [show/add/change/remove]`	`Generic`	manage Generic rules
`alertx service [show/add/change/remove]`	`Service`	manage Services rules
`alertx hardware [global/reset]`	`Hardware`	list per-node hardware rules; edit global settings; reset a node's baseline

Cross-cutting flags: -e/--edit (open the rule as JSON, or YAML, in $EDITOR), -X/--sortby, -v/--verbose, -V/--version, plus per-command filters and bash completion (shtab).

Backends. alertx.utils.rest.Rest talks to three endpoints:

luna2-daemon (:7050, JWT) — rule CRUD via the prometheus_rules / prometheus_hw_rules / prometheus_rules_settings plugins, and /config/node for the node list (§6).
Prometheus (:9090) — GET /api/v1/query?query=max by(hostname)(up) for node up-state and GET /api/v1/rules for live rule status.
Alertmanager (:9093, basic-auth) — GET /api/v2/alerts for the firing alerts shown by alertx alerts.
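
As a rough illustration of the Prometheus call above, the following sketch builds the query URL and reduces a standard instant-query response to a per-host up-state. The helper names are hypothetical and this is not the code in alertx.utils.rest; the response layout (data.result[].metric / value) is the standard Prometheus HTTP API shape.

```python
from urllib.parse import urlencode


def up_query_url(base_url: str) -> str:
    """Build the instant-query URL for max by(hostname)(up)."""
    query = "max by(hostname)(up)"
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


def parse_up_state(response: dict) -> dict:
    """Map hostname -> True/False from an instant-vector response."""
    state = {}
    for sample in response["data"]["result"]:
        hostname = sample["metric"].get("hostname")
        if hostname is None:
            continue  # series without a hostname label cannot be mapped
        _timestamp, value = sample["value"]  # value is a string, e.g. "1"
        state[hostname] = float(value) == 1.0
    return state
```

The URL would then be fetched over HTTPS with whatever client the caller prefers; that part is left out because the TLS and certificate settings are site-specific.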

Config & auth. Config lives under /trinity/local/alertx/cli/config/ — alertx.ini (logging), luna.ini (the [API] daemon endpoint + credentials + SECRET_KEY), and a cached token.txt. Auth is the same JWT scheme as the luna2 utilities: POST /token → cache the JWT → send x-access-tokens: <jwt> on every daemon call. Alertmanager basic-auth credentials are read from /etc/trinity/passwords/prometheus/ (filename = user, contents = password). Rest().daemon_validation() runs before every command.
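
The two credential conventions above (JWT header for the daemon, file-per-user for Alertmanager) can be sketched as follows. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: that the first regular file in the directory is the one to use, and that surrounding whitespace in the password file should be stripped.

```python
from pathlib import Path


def read_basic_auth(password_dir: str) -> tuple:
    """Return (user, password): filename is the user, contents the password."""
    for entry in sorted(Path(password_dir).iterdir()):
        if entry.is_file():
            # Assumption: a trailing newline is not part of the password.
            return entry.name, entry.read_text().strip()
    raise FileNotFoundError(f"no credential file in {password_dir}")


def daemon_headers(jwt: str) -> dict:
    """Header sent on every luna2-daemon call after POST /token."""
    return {"x-access-tokens": jwt}
```

A caller would pass read_basic_auth("/etc/trinity/passwords/prometheus/") to its HTTP client's basic-auth option and daemon_headers(token) to every request against :7050.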

5. The AlertX OOD GUI

The GUI (trinityx-ood, development, alertx/) is the point-and-click counterpart of the CLI: a Flask app served as an Open OnDemand Passenger app, fronting a compiled Vite single-page app. passenger_wsgi.py exposes app from app.py; OnDemand mounts it under its per-app prefix (tile AlertX, category Monitoring).

It uses the same luna2-daemon backend as the CLI (rest.py), so the GUI and CLI are interchangeable editors of one rule set:

Route	Backend call (luna2-daemon)	Purpose
`/get_rules` / `/save_config`	`export` / `import` `prometheus_rules`	generic + service rule editor
`/get_nodes` / `/save_nodes`	`/config/node` + `export`/`import` `prometheus_hw_rules`	per-node hardware health checks
`/get_global` / `/set_global`	`export`/`import` `prometheus_rules_settings`	global hardware-rule settings
`/proxy`, `/proxy_post`	(client-side)	CORS proxy to Prometheus `:9090` / Alertmanager `:9093` for the live view
`/`	—	serves the SPA, injecting `PROMQL_URL`, `APP_URL`, `ALERT_URL`

Auth is the same JWT-from-luna.ini scheme (/token → x-access-tokens), here using the OnDemand luna.ini path. The live alerts/PromQL view is rendered client-side in the browser, reaching Prometheus/Alertmanager through the Flask /proxy endpoints (Alertmanager credentials injected from the passwords dir).

6. The luna2-daemon backend — single source of truth

Neither front-end writes rule files directly. Both call the luna2-daemon's import/export plugin endpoints; the daemon stores the configuration in the luna database and renders the Prometheus rule files (trix.rules, plus a status-detail file). Editing from the CLI on one controller and the GUI on another therefore converges through the daemon (and replicates across HA controllers like any other daemon state).

[Figure: AlertX rule configuration flow (single source of truth)]

Daemon endpoint	Used for
`GET /export/prometheus_rules` · `POST /import/prometheus_rules`	Generic + Services rule definitions
`GET /export/prometheus_hw_rules` · `POST /import/prometheus_hw_rules`	per-node Hardware rules
`GET /export/prometheus_rules_settings` · `POST /import/prometheus_rules_settings`	global hardware-rule settings
`GET /config/node`	node list (to attach hardware rules)
`GET /export/prometheus`	Prometheus scrape targets (`http_sd`) for the `luna_nodes` / `luna_controllers` jobs
`POST /token`	issue the JWT used by both front-ends

These plugins are the luna2-daemon plugins/export/prometheus* and plugins/import/prometheus* modules; they must be listed in the daemon's ALLOWED_EXPORTERS / ALLOWED_IMPORTERS. Because the scrape targets are served by /export/prometheus (http_sd), Prometheus always scrapes exactly the nodes luna knows about — node membership and monitoring stay in lock-step.
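
For readers unfamiliar with http_sd, the sketch below shows the general shape of a Prometheus HTTP service-discovery response: a JSON list of target groups, each with targets and optional labels. It is not the daemon's plugin code; the function name and the choice of a per-target hostname label are assumptions, and only the port (:14200) comes from the table in §2.1.

```python
import json


def http_sd_groups(hostnames, port: int = 14200) -> list:
    """Render node names as Prometheus http_sd target groups."""
    return [
        {"targets": [f"{name}:{port}"], "labels": {"hostname": name}}
        for name in sorted(hostnames)
    ]


def http_sd_body(hostnames, port: int = 14200) -> str:
    """JSON body an http_sd endpoint would return for these nodes."""
    return json.dumps(http_sd_groups(hostnames, port))
```

Because the list is regenerated from the node inventory on every request, removing a node from luna removes its scrape target on the next discovery refresh.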

7. The monitoring pipeline

   exporters            Prometheus (:9090)            Alertmanager (:9093)
   ─────────            ──────────────────            ────────────────────
   node_exporter :14200 ─┐
   IPMI exporter        ─┼─ scrape (TLS) ──▶ evaluate rules ──▶ firing alerts ──▶ route
   luna exporter        ─┘   targets via             (trix.rules,                 (TLS +
   (per node + ctrl)        http_sd  daemon :7050     nhc / job labels)            basic-auth)

Scraping. Prometheus discovers targets through the daemon's http_sd endpoint (/export/prometheus) for the luna_nodes and luna_controllers jobs, plus static/file-SD jobs. The job=luna_controllers label is just the controller scrape-job name — the same label the drainer keys its global-drain on (§8).
Evaluation. Prometheus loads the rendered rule files and evaluates them; matching series raise alerts carrying the nhc and job labels and an annotations.description reason.
Routing. Firing alerts are pushed to Alertmanager over TLS with basic-auth. Alertmanager is the queryable source of currently active alerts for the CLI, the GUI live view, and the drainer.
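
To make the evaluation step concrete, here is a sketch of the data an alerting rule carries, expressed as the Python structure that would serialise to a standard Prometheus rule file (groups, rules, alert, expr, for, labels, annotations). The helper names, the rule name, and the expression are illustrative only; the actual contents of trix.rules are rendered by the daemon and may differ.

```python
def render_rule(name: str, expr: str, description: str,
                nhc: bool = False, hold: str = "1m") -> dict:
    """One alerting rule in Prometheus rule-file shape."""
    return {
        "alert": name,
        "expr": expr,
        "for": hold,  # how long the condition must hold before firing
        "labels": {"nhc": "true" if nhc else "false"},
        "annotations": {"description": description},
    }


def render_group(group_name: str, rules: list) -> dict:
    """Wrap rules in the top-level 'groups' layout of a rule file."""
    return {"groups": [{"name": group_name, "rules": rules}]}


# Hypothetical example: root filesystem below 5 % free, NHC-enabled.
example = render_group("generic", [render_rule(
    "RootFilesystemAlmostFull",
    'node_filesystem_avail_bytes{mountpoint="/"}'
    ' / node_filesystem_size_bytes{mountpoint="/"} < 0.05',
    "Root filesystem has less than 5% free space",
    nhc=True,
)])
```

The nhc label and the description annotation are the two fields the drainer later reads from the firing alert (§8).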

8. NHC — automatic node draining

The drainer (alertx-drainer.service, from the trinityx-combined alertx role) is a Python loop — not a webhook receiver. It only runs when the workload manager is Slurm (alertx_drainer_enable). Every DRAIN_INTERVAL (default 60 s) it reconciles active NHC alerts against Slurm node state.

[Figure: AlertX alert-to-drain decision flow]

Each cycle:

Pull alerts — GET https://localhost:9093/api/v2/alerts (basic-auth), keep only alerts with label nhc=="true", map each to a node by its hostname label, take the reason from annotations.description.
Read Slurm — scontrol -o show nodes; a node counts as drainer-drained only if its State contains DRAIN and its Reason contains the marker TRIX-DRAINER:. This is how the drainer never touches a manually-drained node.
Controller rule first — if any active NHC alert has job=="luna_controllers", drain every node (Reason=TRIX-DRAINER: Controller NHC: …) and skip per-node logic this cycle.
Per node — not drained + alert → drain; drainer-drained + no alert → resume (when AUTO_UNDRAIN is enabled); drainer-drained + alert with a changed reason → update the reason. Nodes drained without the marker are left alone.
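
The per-node part of the cycle can be sketched as a pure function over the two inputs. This is a simplified illustration, not the drainer's source: the function names and the (action, node, reason) tuples are hypothetical, and the controller-rule shortcut is deliberately left out. The reason comparison uses a substring test because Slurm typically appends its own user and timestamp suffix to the stored Reason.

```python
MARKER = "TRIX-DRAINER:"


def nhc_alerts_by_node(alerts: list) -> dict:
    """Keep alerts labelled nhc=="true"; map hostname -> description."""
    reasons = {}
    for alert in alerts:
        labels = alert.get("labels", {})
        if labels.get("nhc") != "true":
            continue
        host = labels.get("hostname")
        if host:
            description = alert.get("annotations", {}).get("description", "")
            reasons.setdefault(host, description)  # first alert wins
    return reasons


def plan_actions(reasons: dict, slurm_nodes: dict,
                 auto_undrain: bool = True) -> list:
    """slurm_nodes maps hostname -> (State, Reason) as reported by Slurm."""
    actions = []
    for node, (state, reason) in sorted(slurm_nodes.items()):
        drained = "DRAIN" in state
        ours = drained and MARKER in reason  # only marker-tagged drains
        wanted = reasons.get(node)
        if wanted is not None:
            tagged = f"{MARKER} {wanted}"
            if not drained:
                actions.append(("drain", node, tagged))
            elif ours and tagged not in reason:
                actions.append(("update", node, tagged))
        elif ours and auto_undrain:
            actions.append(("resume", node, ""))
    return actions
```

Keeping the decision separate from the scontrol calls makes the logic easy to test: a manually drained node never appears in the returned plan, whether or not it has an active alert.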

The exact Slurm commands (reason always marker-prefixed):

scontrol update NodeName=<node> State=DRAIN Reason=TRIX-DRAINER: <reason>
scontrol update NodeName=<node> State=RESUME
scontrol update NodeName=<node> Reason=TRIX-DRAINER: <new reason>
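
A minimal sketch of how these commands and the drainer-drained test might look in Python follows. The helper names are hypothetical and the parsing is an assumption about the one-line scontrol -o output (State= and Reason= fields on each node's line), not the drainer's actual parser. Building each command as an argument list means the reason, which contains spaces, reaches scontrol as a single argument without shell quoting.

```python
import re

MARKER = "TRIX-DRAINER:"


def drain_cmd(node: str, reason: str) -> list:
    """argv for draining a node with a marker-prefixed reason."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={MARKER} {reason}"]


def resume_cmd(node: str) -> list:
    """argv for returning a node to service."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


def is_drainer_drained(line: str) -> bool:
    """True if one 'scontrol -o show nodes' line is a drainer-owned drain."""
    state = re.search(r"\bState=(\S+)", line)
    reason = re.search(r"\bReason=(.*)", line)
    return bool(
        state and "DRAIN" in state.group(1)
        and reason and MARKER in reason.group(1)
    )
```

The lists can be handed directly to subprocess.run(cmd, check=True); when typing the same command into a shell, the Reason value needs quotes.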

Because the drainer pulls the truth and tags its own work, it is self-healing: with AUTO_UNDRAIN enabled, a node whose alert clears is resumed on the next cycle, and a controller alert that clears releases the cluster.

9. Hardware-rule seeding at first boot

Hardware rules are per-node and derive from the node's actual hardware, so they are generated the first time a node boots:

The node runs the oneshot alertx-hook.service → /usr/local/sbin/alertx-hook.sh.
The hook reads the node's luna.ini, fetches a TPM-bound JWT (POST {LUNA_URL}/tpm/<host> on :7050), then calls POST {LUNA_URL}/import/prometheus_hw_rules with [{"hostname": "<fqdn>", "force": false}], retrying with back-off on failure.
The luna2-daemon generates that node's hardware Prometheus rules; Prometheus picks them up, and from then on hardware faults on the node alert (and drain, if NHC-enabled).
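
The real hook is a shell script; the Python sketch below only illustrates the request body and the retry/back-off pattern it describes. The function names, the attempt count, and the doubling delay are assumptions, since the source does not state the hook's actual retry parameters.

```python
import json
import time


def hw_rules_payload(fqdn: str, force: bool = False) -> str:
    """JSON body for POST /import/prometheus_hw_rules."""
    return json.dumps([{"hostname": fqdn, "force": force}])


def call_with_backoff(send, attempts: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep) -> int:
    """Call send() until it returns True, doubling the delay each failure.

    Returns the number of the successful attempt; raises after the last
    failed one. The sleep function is injectable so tests need not wait.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if send():
            return attempt
        if attempt < attempts:
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"gave up after {attempts} attempts")
```

Here send would wrap the authenticated POST and return True on success; a daemon that is not yet reachable at first boot is then retried rather than leaving the node without hardware rules.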

alertx hardware reset (CLI) or the GUI re-baselines a node after an intentional hardware change, so the rules match the new configuration instead of alerting on the difference.

10. End-to-end example & reference

Adding a rule that drains on a CPU fault

alertx generic add -e (or the GUI rule editor) → the rule is sent to POST /import/prometheus_rules, labelled nhc: true. (management plane)
The luna2-daemon stores it and renders trix.rules; Prometheus reloads. (§6)
A node's IPMI exporter reports a processor fault; Prometheus evaluates the rule and the alert fires with nhc=true, hostname=<node>. (§7)
Alertmanager holds the alert; the drainer's next 60 s poll sees it. (§8)
The drainer runs scontrol update NodeName=<node> State=DRAIN Reason=TRIX-DRAINER: <description>. The node stops accepting new jobs. (§8)
The fault is fixed and the alert clears; with AUTO_UNDRAIN enabled, the next cycle runs scontrol update NodeName=<node> State=RESUME. (§8)

Quick reference

I want to…	Look at
Add/change an alert rule	`alertx generic`/`service` or the GUI → daemon `import/prometheus_rules` (§4, §6)
Re-baseline a node's hardware rules	`alertx hardware reset` → `import/prometheus_hw_rules` (§9)
See what's firing	`alertx alerts` / GUI live view → Alertmanager `/api/v2/alerts` (§4)
Understand a drain	check the node's `Reason` for `TRIX-DRAINER:`; the drainer logs to `/var/log/alertx` (§8)
Enable/disable auto-draining	the alertx role's `alertx_drainer_enable` (Slurm only) + per-rule `nhc` flag (§8, §3)
Change scrape targets	they follow luna node membership via daemon `/export/prometheus` (§6, §7)

Source: alertx, trinityx-combined (site/roles/trinity/alertx) and trinityx-ood (alertx/), all on development.