Job Limits

Each LC platform is a shared resource. Users are expected to adhere to the following usage policies to ensure that the resources can be effectively and productively used by everyone. You can view the policies on each system itself by running:

news job.lim.MACHINENAME
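
For example, if the file is named after the machine's hostname (an assumption; check the login banner for the exact name), the Tioga version would be:

news job.lim.tioga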

Web Version of Tioga Job Limits

Tioga is a CORAL-2 early access system. There are 2 login nodes and 24 compute nodes in a pdebug partition. The compute nodes have 64 AMD EPYC cores per node, 4 AMD gfx90a GPUs per node, and 512 GB of memory per node. Tioga is running TOSS 4.

There is 1 scheduling pool:

  • pdebug: 1536 cores, 96 GPUs (24 nodes)

Scheduling

Tioga jobs are scheduled using Flux. Scheduling limits are not technically enforced, so users are expected to monitor their own usage and keep themselves within the current limits while following these policies (see the example after this list):

  • Users will not compile on the login nodes during daytime hours.
  • During the day, a user can have a maximum of 192 processors and 12 GPUs in the queue, with a runtime of up to 4 hours, with the following exception:
    • An occasional one-hour (max) job for debugging that uses 193-256 processors, as long as it is the user's only job in the queue.
  • Daytime is 0800-2000, Monday through Friday, not including holidays.
  • To prevent runaway jobs, there is a technical maximum of 12 hours per job.
  • No production runs are allowed; only development and debugging.
  • Users will not run computationally intensive work on the login nodes.
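
As a rough sketch of staying within the daytime limits, an allocation of three whole pdebug nodes (3 x 64 = 192 cores and 3 x 4 = 12 GPUs) for up to 4 hours could be requested through Flux; the exact options depend on the Flux version installed, and ./debug.sh is a placeholder script name:

flux alloc -N 3 -t 4h              # interactive allocation: 3 nodes = 192 cores, 12 GPUs, 4-hour limit
flux batch -N 3 -t 4h ./debug.sh   # or queue a batch script under the same limits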

We are all family and expect developers to play nice. However, if someone's jobs have taken over the machine:

  • Call them or send them an email.
  • Email ramblings-help@llnl.gov with a screenshot so we can take care of the situation by killing work that violates policy.

This approach will be revisited later and additional limits will be set if necessary. If someone monopolizes the machine, developers can always shift to other resources.

The queue can be viewed by typing "flux jobs -A" at the prompt.
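
To check only your own jobs against the limits above, the same command without -A lists the current user's jobs:

flux jobs        # your active jobs
flux jobs -a     # your jobs, including recently completed ones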

There are two Tioga nodes that are scheduled by Slurm for JupyterHub users. You can see the status of these nodes by running Slurm commands with fully qualified paths (e.g., /bin/sinfo, /bin/squeue, etc.).
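
For example (paths as given above; the options shown are standard Slurm and may vary with the local configuration):

/bin/sinfo               # state of the Slurm-scheduled nodes
/bin/squeue -u $USER     # your own Slurm jobs, if any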

Contact

Please call or send email to the LC Hotline if you have questions.

LC Hotline
Phone: 925-422-4531
Email: lc-hotline@llnl.gov

Zone: CZ
Vendor: HPE Cray

User-Available Nodes
  • Login Nodes*: 2
  • Batch Nodes: 0
  • Debug Nodes: 30
  • Total Nodes: 32

CPUs
  • CPU Architecture: AMD Trento
  • Cores/Node: 64
  • Total Cores: 2,048

GPUs
  • GPU Architecture: AMD MI-250X
  • Total GPUs: 128
  • GPUs per compute node: 4
  • GPU peak performance (TFLOP/s, double precision): 45.00
  • GPU global memory (GB): 128.00

Memory Total (GB): 12,288
CPU Memory/Node (GB): 512

Peak Performance
  • Peak TFLOPS (CPUs): 64.0
  • Peak TFLOPS (GPUs): 5,760.0
  • Peak TFLOPS (CPUs+GPUs): 5,824.00

Clock Speed (GHz): 2.0
OS: TOSS 4
Interconnect: HPE Slingshot 11
Parallel job type: multiple nodes per job
Program: ASC, M&IC
Class: ATS-4/EA, CORAL-2
Password Authentication: OTP, Kerberos, ssh keys
Year Commissioned: 2022
Compilers

See the Compilers page.
