- Who uses TOSS?
- TOSS Components
- Deploying TOSS
- TOSS Requirements
- TOSS Limitations
- The TOSS Release Cadence
- Supported Versions
- Comparison of TOSS with other HPC OSes
- Compliance Baseline
- History of TOSS
- The TOSS Team and Other Contributors
- Footnotes
- Citing / Referencing TOSS
The Tri-Lab Operating System Stack (TOSS) is a Red Hat-based, limited-release OS distribution for HPC systems. The goal of the TOSS project has been to increase efficiencies in the ASC Tri-Lab community with respect to both the utility and the cost of a common computing environment. The project delivers a fully functional cluster operating system, based on Red Hat Linux, capable of running MPI jobs at scale on hardware across the Tri-Lab complex. Today, practically all LLNL Livermore Computing systems run TOSS.[1]
TOSS integrates a commodity base operating system with cluster administration tools, a parallel filesystem, batch scheduler, and reference development environment. These components are then rigorously tested on hardware platforms and with codes that are relevant to the NNSA missions.[2]
Who uses TOSS?
TOSS is developed and maintained by Livermore Computing (LC) at Lawrence Livermore National Laboratory, explicitly to support the ASC Tri-Lab community, including the HPC centers at Los Alamos National Laboratory and Sandia National Laboratories. In addition to these US Department of Energy laboratories, TOSS is used by the Naval Nuclear Laboratory and NASA. Some sites choose to use TOSS as a turnkey solution, with minimal or no changes to the base distribution. Other sites adapt the distribution to meet the needs of their local communities.
While TOSS is a community effort, it is shared with sites that are able to fully support their own communities. The TOSS distribution does not come with any monetary exchange or support agreement.
TOSS Components
Red Hat Enterprise Linux (RHEL) is the base OS for TOSS and provides the majority of its packages (see Table 1 for a historical example). Additional packages are brought in from Fedora's Extra Packages for Enterprise Linux (EPEL) collection. Major versions of TOSS track major versions of RHEL: TOSS 4 is based on RHEL 8, and TOSS 5 on RHEL 9 (see Supported Versions below).
Building on a commercial Linux distribution enables TOSS developers to leverage Red Hat's extensive engineering resources to mitigate many non-HPC-specific bugs and security issues. Additionally, managing and working on a TOSS system remains similar to working on RHEL, and software built for RHEL generally works out of the box on TOSS. While nearly all RHEL packages are used unmodified, a small number are rebuilt. These modified versions often remain in place for only a few releases, providing additional diagnostics for root cause analysis, critical bug fixes, or new hardware support until RHEL's packages incorporate the necessary functionality.
The RHEL kernel is also minimally modified to add missing hardware drivers and critical bug fixes. TOSS thus further hardens RHEL for existing NNSA platforms and workloads, while rapidly supporting incoming platforms of interest.
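Because TOSS stays so close to stock RHEL, ordinary RPM tooling is enough to see which packages have been rebuilt locally. The following is a minimal sketch, not an official TOSS procedure, and the package names queried are hypothetical examples; the RPM Vendor tag is just one heuristic for spotting rebuilt packages.

```python
# spot_rebuilds.py: a rough heuristic for spotting locally rebuilt packages on a
# TOSS (or plain RHEL) node by inspecting RPM Vendor tags.
# The package names below are hypothetical examples; adjust for your system.
import subprocess

packages = ["kernel", "slurm", "lustre-client"]

for pkg in packages:
    result = subprocess.run(
        ["rpm", "-q", "--qf", "%{NAME}-%{VERSION}-%{RELEASE}  vendor=%{VENDOR}\n", pkg],
        capture_output=True,
        text=True,
    )
    output = result.stdout.strip()
    print(output if result.returncode == 0 else f"{pkg}: not installed")
```

The same query works on stock RHEL, which is part of the point: day-to-day administration skills carry over directly.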
TOSS integrates several additional components to create a robust HPC environment. These include:
- The Slurm batch scheduler and the Flux resource management framework
- Lustre client and server support, including support for ZFS
- Podman and Charliecloud for containerized workflows
- The Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED), provided as an optional replacement for RHEL's InfiniBand fabric stack
- A number of HPC-centric system management tools, many developed at LLNL or by other national laboratories, including:
  - conman
  - munge
  - pdsh
  - powerman
  - the LDMS metrics collection system
- Vendor-specific additions, such as certain NVIDIA drivers and AMD's ROCm
More on Resource Managers
TOSS ships with both Slurm and Flux, and it will probably continue to include Slurm even after Slurm is no longer widely used at LLNL.
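To give a sense of what the Flux side looks like in practice, here is a minimal sketch of submitting a job through Flux's Python bindings. It is illustrative rather than a TOSS-specific interface, and it assumes a node with the flux-core Python bindings installed and that the script runs from inside a Flux instance.

```python
# submit_hostname.py: submit a small job through Flux's Python bindings.
# Assumes flux-core (with Python bindings) is installed and that this script
# runs inside a Flux instance (e.g. under `flux start`).
import os

import flux
from flux.job import JobspecV1

handle = flux.Flux()  # connect to the enclosing Flux instance

# Build a jobspec that runs `hostname` with 4 tasks spread over 2 nodes
jobspec = JobspecV1.from_command(command=["hostname"], num_tasks=4, num_nodes=2)
jobspec.environment = dict(os.environ)  # propagate the current environment

jobid = flux.job.submit(handle, jobspec)
print(f"submitted job {jobid}")
```

The Slurm path is the familiar `sbatch`/`srun` workflow from the command line; both resource managers are present on a stock TOSS install.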
TOSS includes a small reference development environment (DE) with multiple compilers and GPU and MPI libraries. This environment is used for integration testing, verification, and cross-site compatibility.
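As an illustration of the kind of cross-site smoke test such an environment supports, a minimal MPI check might look like the following. This is a generic example, not part of TOSS or the reference DE itself, and it assumes mpi4py has been built against the system MPI.

```python
# mpi_smoke_test.py: minimal MPI sanity check. Assumes mpi4py is installed and
# built against the MPI library provided by the development environment.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# A trivial collective: summing the ranks exercises basic MPI communication
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank} of {size}: sum of ranks = {total}")
```

Launched with, for example, `srun -n 4 python mpi_smoke_test.py` under Slurm or `flux run -n 4 python mpi_smoke_test.py` under Flux, every rank should report the same sum.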
Deploying TOSS
Deploying TOSS is similar to deploying Red Hat onto HPC clusters. We expect that sites interested in using TOSS will have their own Red Hat support contract / RHEL license, along with the required expertise to support their local systems.
TOSS Requirements
Expertise
Sites interested in using TOSS on production systems must have their own RHEL license. They must also have the personnel and expertise to fully support their HPC systems; at a minimum, that means one Linux-familiar, HPC-trained system administrator.
For centers considering TOSS, a simple test: given a stock Red Hat DVD, could your staff get a cluster going? If not, then TOSS is not right for your center.
Hardware
Beyond Red Hat-supported hardware, TOSS supports some additional elements, such as HPE Slingshot.
Independence
LC provides guidance but does not offer paid commercial support for TOSS. When other facilities use TOSS, they may layer their own tools and packages on top of it, or they may ask TOSS developers to pull them in. Requests to add packages to TOSS are handled at the discretion of LC, based on the needs of NNSA centers.
TOSS Limitations
TOSS's focus on cross-platform compatibility and production stability imposes some limitations on the platforms and software features that it can easily support. For instance, system architectures that diverge considerably from commodity platforms would impose significant maintenance costs on TOSS. Similarly, TOSS is not an appropriate proving ground for experimental software features that may have unintended stability or performance consequences, such as invasive changes to the Linux kernel's scheduler or memory management.
While TOSS provides the necessary tools to operate an HPC cluster, it is not a turn-key solution. Sites retain great latitude in how TOSS runs on their systems, including choices in configuration management, provisioning, and monitoring. Consequently, sites running TOSS still require highly skilled HPC system architects and administrators, albeit in fewer numbers.
Rather than providing a complete and fast-moving development environment, TOSS aims to provide stable systems and ABIs that rich development environments (DEs) can be built upon. LLNL separately maintains a DE on TOSS systems, containing a broad range of compilers, debuggers, MPIs, and other development tools. This separation enables the DE to more rapidly incorporate new software and respond to user needs with few worries about overall system stability. Separate work is being done to expand this Tri-Lab Common Environment (TCE) for use at other computer centers with the next major version of TOSS. LLNL also supports Spack, which enables users to easily build and deploy custom development environments on TOSS systems.
In other words:
- TOSS isn't a turnkey solution like vendors provide; it's more a set of recipes than a fully cooked meal.
- TOSS isn't for everyone; it's much more attuned to big clusters and NNSA workloads, and it's not trying to be something that works for absolutely everyone.
- TOSS isn't available to everyone. Right now you have to be part of the extended NNSA community and be willing to accept that LC and Tri-Lab needs will take precedence. Other DOE centers, like NERSC, have chosen not to use it because of this caveat.
- TOSS isn't fully open source, though all of its tools are either open source or redistributable. RHEL is open source, but Livermore Computing pays for a support license, and anyone LC shares TOSS with is expected to get their own license, at least if they bring it beyond testing and into production.
- TOSS isn't ready to become open source, though the team is considering putting some packages on GitHub. TOSS cannot be fully open source because of some closer-to-proprietary elements and the RHEL license issue. One notable example of what TOSS doesn't include is the CrayPE compiler, since that is proprietary.
The TOSS Release Cadence
TOSS follows a regular release cadence, with monthly updates to address routine bug and security fixes as required by ESN order. Each update is integration tested using LLNL's Synthetic Workload to ensure that it is stable for production use. Minor releases include more significant component updates and occur every 6-9 months, generally in line with upstream vendor releases. New major releases occur every 3-5 years and involve major version changes to the base operating system. TOSS endeavors to provide a stable user ABI throughout the entire life cycle of a major release, limiting the need for users to recompile their code after system updates.
A major goal of TOSS is the reduction of costs across NNSA laboratories. Standardization on a single operating environment across platforms reduces the time application developers spend porting and optimizing codes and workflows, and enables system administrators to more easily work across clusters without requiring platform-specific training. Standardization across multiple sites enhances collaboration and reduces duplication of effort. Maintaining the project across platform lifecycles helps to ensure that previously encountered issues are not replicated on new platforms. Finally, TOSS reduces development and support costs by using off-the-shelf components when possible and performing local development only when necessary to fill gaps or ensure efficient code execution.[2]
Supported Versions
As TOSS is built on Red Hat Enterprise Linux, its versions (and support windows) are directly tied to the release and support of the "upstream" RHEL versions.
| TOSS Version | Upstream RHEL Version | Release Date | End of Support |
|---|---|---|---|
| 3.7-x | 7.9 | October 2020 | June 2024 |
| 4.5-x | 8.7 | November 2022 | May 2023 |
| 4.6-x | 8.8 | June 2023 | December 2023 |
| 4.7-x | 8.9 | December 2023 | May 2024 |
| 4.8-x | 8.10 | June 2024 | TBD |
| 5.2-x | 9.7 | November 2025 | TBD |
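On a running node, the installed TOSS and upstream RHEL versions can usually be read from the standard release files. Here is a minimal sketch, assuming the conventional /etc/toss-release and /etc/redhat-release files are present; a customized installation may keep this information elsewhere.

```python
# show_versions.py: report the TOSS and upstream RHEL versions on this node.
# Assumes the conventional /etc/toss-release and /etc/redhat-release files;
# customized installations may place this information elsewhere.
from pathlib import Path

def read_release(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "(file not found)"

print("TOSS:", read_release("/etc/toss-release"))
print("RHEL:", read_release("/etc/redhat-release"))
```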
Comparison of TOSS with other HPC OSes
As noted above, TOSS is not the best solution for every center. Here are some other HPC OS projects and how they compare to TOSS:
Scientific Linux
Scientific Linux shares a common goal with TOSS—providing a common, reliable OS platform for a particular group of scientific sites. Like TOSS, Scientific Linux is also based on RHEL, but re-compiles the RHEL source packages rather than including them directly. By doing so, Scientific Linux is unencumbered by Red Hat’s licensing and is freely distributable, but is not eligible for Red Hat support. Scientific Linux also does not directly target HPC, and as such does not integrate features like a batch scheduler or parallel filesystem.
Cray Linux Environment
The Cray Linux Environment (CLE) also extends a commodity Linux distribution (SUSE) to support HPC users. Unlike TOSS, it is highly optimized to run only on Cray's systems, and is not supported across multiple hardware vendors. Similarly to TOSS, CLE supports the Slurm resource manager, but also provides support for several additional batch systems. CLE also bundles custom provisioning, configuration management, and monitoring tools into a turn-key operating environment.[2]
OpenHPC Project
The OpenHPC project provides repositories of HPC-centric packages compiled for multiple operating systems. Unlike TOSS, OpenHPC does not provide a base operating system, and much of its focus is on the development environment rather than system software. The Tri-Lab Common Environment (TCE) provides similar functionality to OpenHPC but is better optimized for TOSS and NNSA platforms and workloads. As a community project, OpenHPC is not able to integrate commercial components, such as compilers, and support is only provided on an ad-hoc basis.
Compliance Baseline
The TOSS team partnered with DISA (the Defense Information Systems Agency) to create a STIG (Security Technical Implementation Guide); see the official DISA announcement for more details. The guide itself can be found and downloaded at: https://public.cyber.mil/stigs/downloads/
History of TOSS
Since 2007: The History of Evolving Compute Platforms is the History of TOSS
Commodity Technology Systems and TOSS
Capacity computing—the use of smaller and less expensive commercial-grade systems to run parallel problems with more modest computational requirements—allows the National Nuclear Security Administration’s (NNSA’s) more powerful supercomputers, or “capability” systems, to be dedicated to the larger, more complex calculations critical to stockpile stewardship. Reducing the total cost of ownership for robust and scalable HPC clusters is a significant challenge that impacts many programs at LLNL.
In 2007, LLNL successfully led a first-of-a-kind effort to build a common capacity hardware environment, called the Tri-Lab Linux Capacity Clusters (TLCC1), at the three NNSA laboratories—Lawrence Livermore, Los Alamos, and Sandia. The TLCC1 experience proved that deploying a common hardware environment at all three sites greatly reduced the time and cost to deploy each HPC cluster.
The Tri-Lab Operating System Stack (TOSS) was created to run on the TLCC systems from their inception, with the goal of increasing efficiencies in the ASC tri-lab community with respect to both the utility and the cost of a common computing environment.
TOSS provides a complete product with full lifecycle support. Well-defined processes for release management, packaging, quality assurance testing, configuration management, and bug tracking are used to ensure a production-quality software environment can be deployed across the tri-lab in a consistent and manageable fashion.
Building on TLCC1’s success, the second generation of tri-lab clusters (TLCC2) was deployed in 2011 and 2012. For TLCC2, LLNL and its laboratory and industry partners—Los Alamos, Sandia, Appro, Intel, QLogic, and Red Hat—sited 12 small- to large-scale commodity clusters for NNSA’s Advanced Simulation and Computing (ASC) and Stockpile Stewardship Program.
After their deployment, the TLCC2 clusters proved to be some of the most scalable, reliable, and cost-effective clusters that LLNL has ever brought into service. Users can run larger simulations with a higher job throughput than was previously possible on commodity systems. And with a consistent user environment, seamless software environment, and common operating system, the TLCC2 capacity computers made it easy for users to collaborate and resolve problems at any site.
Delivery of the third generation of tri-lab clusters, referred to as Commodity Technology Systems (CTS-1), began in April 2016. In support of this and subsequent large commodity system deployments, computer scientists at the three laboratories continued their partnership with Red Hat to support TOSS on systems of up to 10,000 nodes.
CTS-2, the fourth generation of tri-lab Linux clusters, began arriving in 2022 and continues to the present. These systems far surpass their predecessors in compute power; Dane, the largest system of this generation, has a peak performance of 10.6 petaflops as of 2025.
Advanced Technology Systems and TOSS
LLNL's most performant machines are called Advanced Technology Systems (ATSes). Prior to the CORAL-2 procurement, these systems ran proprietary vendor operating systems. The El Capitan RFP included a requirement that the system be able to run TOSS, but making that happen was not simple: even with the requirement in place, getting HPE to run TOSS on the El Capitan systems the way Livermore Computing wanted was a challenge. However, once HPE got their hands on TOSS, and Livermore Computing proved it knew what it was doing, they could see the advantage.
Many vendors don't have good reproducibility, but now that TOSS runs on HPE systems through El Capitan and the other CORAL-2 procurement machines, LC has already gone a long way toward reproducibility for future users.
The TOSS Team and Other Contributors
The TOSS team includes Trent D'Hooge, Jim Foraker, and Olaf Faaland. The ZFS, Flux, and Slurm teams are indirectly part of the TOSS team as well; pretty much every LC group has someone who contributes to TOSS.
In addition to the LLNL core team and contributors, developers from Sandia, LANL, and NASA have pushed software packages through the TOSS BuildFarm, and TOSS developers pull in tools from other institutions (CEA, for example), so other institutions can contribute indirectly. LANL's OpenCHAMI is something TOSS might pull in in the future.
Footnotes
Much of the above content, including but not limited to the marked footnotes, is pulled from these two documents:
Citing / Referencing TOSS
To reference TOSS, please cite the following paper:
Edgar A. León, Trent D’Hooge, Nathan Hanford, Ian Karlin, Ramesh Pankajakshan, Jim Foraker, Chris Chambreau, and Matthew L. Leininger. TOSS-2020: A Commodity Software Stack for HPC. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC’20. IEEE Computer Society, November 2020.
