General
TrinityX is a collection of well-known, established and widely adopted open-source components, complemented by in-house developed components and applications. Its power lies in the integration of these components, which is where most of the time is normally spent when configuring a cluster or system.
A subset of the components is listed below:
Name | Function |
---|---|
Luna | Provisioning |
OpenLDAP/SSSD | Authentication/Authorization |
Grafana | Monitoring dashboards |
Slurm | HPC scheduler |
Kubernetes | Container scheduler and deployment |
Prometheus | Metrics collection |
OOD / Open OnDemand | Graphical user interface with apps |
OpenHPC | HPC software stack |
Corosync/Pacemaker | Resource availability (HA) |
Luna, our provisioning engine
We believe one of the most powerful and flexible provisioners around is Luna, also referred to as Luna-2. It supports image-based and kickstart-based provisioning, as well as VLANs and IPv6. The Luna philosophy is to reduce the repetitiveness that is typical for larger sets of nodes or clusters. Nodes with a common configuration are typically part of a group; settings are inherited from the group by its nodes, and individual nodes can override settings where they differ from the group.
Luna is modular: plugins allow for tuning without touching the core engine. The default bundle of plugins offers functionality for diskless booting, RAID1 on nodes, network configuration, provisioning methods such as HTTP and torrent, image creation and operations such as image grab and push, roles, and more.
Luna, the daemon on the controllers, provides an open, secure API to which the command-line interface (CLI), the Open OnDemand based Luna applications and third-party applications talk. The Luna daemons typically run on the head nodes or controllers; the CLI tools, luna and luna-utils, can be installed on other nodes or servers if desired.
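To make the inheritance model concrete, the sketch below shows how a group and one of its nodes might be inspected and selectively overridden with the luna CLI. The sub-commands, options and object names used here are assumptions and may differ between Luna releases; `luna -h` on an installed system shows the actual syntax.

```
# Assumed luna CLI usage; verify the exact sub-commands with `luna -h`.
# List the defined groups and nodes.
luna group list
luna node list

# Show the configuration of a group and of one of its member nodes; most
# node settings are inherited from the group unless set on the node itself.
luna group show compute
luna node show node001

# Hypothetical override: give one node its own kernel options while the rest
# of the group keeps the inherited value.
luna node change node001 --kerneloptions "console=ttyS0,115200"
```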
High Availability philosophy
TrinityX comes with High Availability (HA) support out of the box. The goal of an HA setup is to provide (nearly) uninterrupted service availability by utilizing the HA capabilities of the individual services, even though not all services can handle redundancy natively. For services of which only a single instance can run at a time, typically on the active controller, TrinityX incorporates a flexible and customizable shared-disk mechanism, for example NFS, iSCSI, DAS, NAS or DRBD based. The shared disk(s) provide the data for these services: databases and state files, but also the home directories.
Luna has native HA capabilities built in and handles synchronization amongst multiple running instances itself. Other services that support multiple running instances are DHCP and DNS. Since Luna configures these services, an optimally load-balanced provisioning mechanism is achieved, utilizing all available controllers.
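For the single-instance services, Corosync/Pacemaker (listed in the component table above) decides on which controller they run, together with the shared disk that holds their data. The TrinityX playbooks configure this automatically; the commands below are only a minimal, illustrative sketch of such a setup, and the device path, mount point and resource names are placeholders.

```
# Illustrative Pacemaker setup: a filesystem on the shared disk plus a service
# that depends on it, grouped so they always run on the same controller.
pcs resource create trinity-fs ocf:heartbeat:Filesystem \
    device=/dev/disk/by-label/trinity directory=/trinity/shared fstype=xfs
pcs resource create trinity-db systemd:mariadb
pcs resource group add trinity-stack trinity-fs trinity-db

# Show on which controller the resources are currently running.
pcs status resources
```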
TrinityX clusters
TrinityX clusters are based on the Beowulf concept, but can be tailored to support infrastructures other than HPC.
A standard TrinityX installation comprises many hardware and software components: server hardware and network equipment on the one hand, and software such as the operating system and the workload manager on the other. All of these components must be present to provide the services required in a modern HPC system.
TrinityX clusters can contain more than just compute resources, for example login, storage and cloud nodes.
Controller node
The controller node is the most important node in the system. An additional controller node can be installed to provide high availability of services. The controller node is responsible for:
- providing the most basic network services such as DHCP, TFTP and DNS to install all the other nodes,
- providing configuration management to configure these nodes,
- providing the ability to (remotely) control the nodes by using the IPMI, iDRAC or KVM interface,
- providing the workload manager which in turn manages all the computational resources in the cluster,
- providing a centralized authentication system to manage and authenticate users,
- providing shared storage for the applications and home directories of users, although this task may be offloaded to a dedicated storage node or an external storage system,
- providing an environment where users can compile applications and submit jobs, although this task may be offloaded to a dedicated login node,
- providing a monitoring system to view and analyze the performance of the cluster and of individual systems.
With TrinityX, Ansible is used to install and configure the above components. Since the deployment is done through Ansible, the code can be altered and optimized to one's needs before the installation. Think of role delegation, i.e. having dedicated servers for certain tasks such as provisioning, monitoring, or login/bastion hosts.
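Because everything is plain Ansible, the standard Ansible tooling can be used to inspect a deployment before committing to it. A minimal sketch, assuming the playbooks are run from the directory that contains them and an inventory file named hosts (both assumptions; a dry run may not complete for every role):

```
# Standard ansible-playbook options: list what the controller playbook would
# do, then attempt a dry run before the real installation.
ansible-playbook -i hosts controller.yml --list-tasks
ansible-playbook -i hosts controller.yml --check
```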
Compute nodes
Compute nodes provide the majority of the computational power in the cluster. There can be several types of nodes in a cluster, depending on the expected workload. They can differ in CPU type, the amount of memory, or specific disk or SSD configurations. Additionally, they can be equipped with accelerators such as GPUs.
The operating system can be installed diskless (i.e. in-memory) or diskful. All the applications and storage required to run computations need to be provided externally (e.g. via the controller node or storage node).
Login nodes or bastion hosts
Login nodes are entry points to the cluster. In smaller installations the controller node also allows user logins. In larger clusters this is generally not recommended, as the load and the number of potential issues increase with the number of users. To offload this risk and load, one or more separate login nodes can be introduced.
Storage nodes
Storage nodes provide a place for users and applications to store, retrieve and process data. This can be a single node or a cluster of storage nodes, depending on the requirements. Multiple storage nodes may exist in the same cluster, each serving a different type of store (e.g. users' home directories and a scratch space).
Network considerations
The Beowulf concept generally has an internal cluster network (`cluster`) used for management, which may double as the computation network. Larger installations typically have a separate internal high-speed network running on InfiniBand or RoCE hardware for computations (e.g. MPI). Separate data networks may also be added to the individual nodes to keep data traffic off the computation network.
An optional management (`ipmi`) network may also be connected to separate the BMC interfaces for security reasons.
TrinityX generally manages the cluster network (i.e. DHCP is advertised on this network) to allow the nodes to boot.
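Networks are objects within Luna, so the cluster-facing networks can be inspected and adjusted after installation. The commands below are an assumed sketch; the exact sub-commands and network names depend on the Luna version and on what was configured during installation.

```
# Assumed luna CLI usage: list the networks Luna manages and inspect the
# internal cluster network that serves DHCP/PXE to the nodes.
luna network list
luna network show cluster
```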
Ansible Playbooks
Ansible is used to perform the integration and configuration steps, based on input set through variables. Typically these variables, such as network details, the workload manager choice and other tuning, are set in e.g. `group_vars/all.yml` before the installation (see the example runs after the table below). TrinityX comes bundled with a set of playbooks, one for each functionality:
Playbook | Role |
---|---|
controller.yml | playbook that sets up the controller(s) |
compute-default | playbook to create a compute image based on the OS release running on the controller, with scheduler support |
compute-ubuntu | playbook to create an Ubuntu based compute image with scheduler support |
compute-centos | playbook to create a CentOS 9 based compute image with scheduler support |
k3s-default | playbook to create a compute image based on the OS release running on the controller, with Kubernetes support |
k3s-ubuntu | playbook to create an Ubuntu based compute image with Kubernetes support |
k3s-centos | playbook to create a CentOS 9 based compute image with Kubernetes support |
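As referenced above, an installation consists of setting the variables and then running the relevant playbooks. The run below is only a sketch: the inventory file name, the working directory and the .yml extension on the compute playbook are assumptions and should be checked against the bundled playbooks.

```
# Deploy the controller(s) first, then build the default compute image.
ansible-playbook -i hosts controller.yml
ansible-playbook -i hosts compute-default.yml
```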
TrinityX has a mechanism we call Any-Image. This approach allows one to create the image types listed above, but also for OS releases other than the one running on the controller. This way it is possible to e.g. have a controller running Rocky 8 while creating a CentOS 9 or RHEL 9 compute image, using a 'base' or Docker image as the starting point.
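Purely as an illustration of the Any-Image idea: a stock container base image of the target OS can serve as the root of the new compute image, independent of the OS on the controller. The variable name below is hypothetical; the real mechanism and its variables are defined in the bundled playbooks.

```
# Hypothetical sketch: pull a base image of the target OS release and point
# the compute playbook at it. 'base_image' is an invented variable name.
podman pull quay.io/centos/centos:stream9
ansible-playbook -i hosts compute-default.yml -e base_image=quay.io/centos/centos:stream9
```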