Job Limits

Each LC platform is a shared resource. Users are expected to adhere to the following usage policies to ensure that the resources can be effectively and productively used by everyone. You can view the policies on each system itself by running:

news job.lim.MACHINENAME
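
For example, if the file is named after the machine's hostname (an assumption; check the login banner for the exact name), the Tioga version would be:

news job.lim.tioga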

Web Version of Tioga Job Limits

Tioga is a CORAL-2 early access system. There are 2 login nodes and 24 compute nodes in a pdebug partition. The compute nodes have 64 AMD EPYC cores per node, 4 AMD gfx90a GPUs per node, and 512 GB of memory per node. Tioga is running TOSS 4.

There is 1 scheduling pool:

  • pdebug: 1536 cores, 96 GPUs (24 nodes)

Scheduling

Tioga jobs are scheduled using Flux. Scheduling limits are not technically enforced, so users are expected to monitor their own usage and keep themselves within the current limits while following these policies (see the example after this list):

  • Users will not compile on the login nodes during daytime hours.
  • During the day, a user can have a maximum of 192 processors and 12 GPUs in the queue, with a runtime of up to 4 hours, with the following exception:
    • An occasional one-hour (max) job for debugging that uses 193-256 processors, as long as it is the user's only job in the queue.
  • Daytime is 0800-2000, Monday through Friday, not including holidays.
  • To prevent runaway jobs, there is a technical maximum of 12 hours per job.
  • No production runs are allowed; only development and debugging.
  • Users will not run computationally intensive work on the login nodes.
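
As a rough sketch of staying within the daytime limits, an allocation of three whole pdebug nodes (3 x 64 = 192 cores and 3 x 4 = 12 GPUs) for up to 4 hours could be requested through Flux; the exact options depend on the Flux version installed, and ./debug.sh is a placeholder script name:

flux alloc -N 3 -t 4h              # interactive allocation: 3 nodes = 192 cores, 12 GPUs, 4-hour limit
flux batch -N 3 -t 4h ./debug.sh   # or queue a batch script under the same limits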

We are all family and expect developers to play nice. However, if someone's jobs have taken over the machine:

  • Call them or send them an email.
  • Email ramblings-help@llnl.gov with a screenshot so we can take care of the situation by killing work that violates policy.

This approach will be revisited later and additional limits will be set if necessary. If someone monopolizes the machine, developers can always shift to other resources.

The queue can be viewed by typing "flux jobs -A" at the prompt.
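
To check only your own jobs against the limits above, the same command without -A lists the current user's jobs:

flux jobs        # your active jobs
flux jobs -a     # your jobs, including recently completed ones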

There are two Tioga nodes that are scheduled by Slurm for JupyterHub users. You can see the status of these nodes by running Slurm commands with fully qualified paths (e.g., /bin/sinfo, /bin/squeue, etc.).
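
For example (paths as given above; the options shown are standard Slurm and may vary with the local configuration):

/bin/sinfo               # state of the Slurm-scheduled nodes
/bin/squeue -u $USER     # your own Slurm jobs, if any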

Contact

Please call or send email to the LC Hotline if you have questions.

LC Hotline
Phone: 925-422-4531
Email: lc-hotline@llnl.gov

Zone: CZ
Vendor: HPE Cray

User-Available Nodes
  • Login Nodes*: 2
  • Batch Nodes: 0
  • Debug Nodes: 30
  • Total Nodes: 32

CPUs
  • CPU Architecture: AMD Trento
  • Cores/Node: 64
  • Total Cores: 2,048

GPUs
  • GPU Architecture: AMD MI-250X
  • Total GPUs: 128
  • GPUs per compute node: 4
  • GPU peak performance (TFLOP/s, double precision): 45.00
  • GPU global memory (GB): 128.00

Memory Total (GB): 12,288
CPU Memory/Node (GB): 512

Peak Performance
  • Peak TFLOPS (CPUs): 64.0
  • Peak TFLOPS (GPUs): 5,760.0
  • Peak TFLOPS (CPUs+GPUs): 5,824.00

Clock Speed (GHz): 2.0
OS: TOSS 4
Interconnect: HPE Slingshot 11
Parallel job type: multiple nodes per job
Program: ASC, M&IC
Class: ATS-4/EA, CORAL-2
Password Authentication: OTP, Kerberos, ssh keys
Year Commissioned: 2022
Compilers

See the Compilers page.
