NOTE: This system currently has limited availability.

Job Limits

Each LC platform is a shared resource. Users are expected to adhere to the following usage policies to ensure that the resources can be effectively and productively used by everyone. You can view the policies on a system itself by running:

news job.lim.MACHINENAME

Web Version of Matrix Job Limits

Hardware

Each Matrix node is based on the Intel Sapphire Rapids processor, with 56 cores per socket, 2 sockets per node, and 512 GB of DDR5 memory, as well as 4 NVIDIA H100 GPUs.

Scheduling

Matrix is GPU scheduled. This means that the minimum allocation includes 1 GPU, 28 cores, and 128 GB of memory, and all allocations are multiples of that unit. Please use the Slurm '-G', '--gpus', or '--gpus-per-task' flags to specify how many GPUs your job needs. If you do not specify a GPU option, your allocation may not have the resources you expect. You can also allocate whole nodes with '--exclusive'.
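As an illustrative sketch of these flags in a batch script (the time limit, job name, task count, and executable below are placeholders, not values from this page):

#!/bin/bash
#SBATCH -G 2                # request 2 GPUs; the 56 cores and 256 GB
                            # of memory tied to them come automatically
#SBATCH -t 00:30:00         # placeholder time limit
#SBATCH -J matrix-gpu-job   # placeholder job name

# Launch one task per allocated GPU (placeholder executable).
srun -n 2 ./my_gpu_app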

Additional scheduling examples can be found below.

Matrix Scheduling Examples

Matrix is a GPU-scheduled machine where resources are allocated based on GPUs rather than entire nodes. Below are the key principles of how resources are allocated:

  • Smallest Allocation: Includes 1 GPU and its 28 local CPU cores.
  • Larger Allocations: Include multiple GPUs and all CPU cores local to those GPUs.
  • Shared Nodes: Multiple users can share a node, each with dedicated GPUs, memory, and cores.
  • Multi-Node Jobs: A job requesting multiple GPUs may span multiple nodes, even if the request totals less than a full node.
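The unit-based arithmetic above can be sketched in shell. The per-GPU figures (28 cores, 128 GB) come from this page; the script itself is only an illustration, not an LC tool:

```shell
#!/bin/sh
# Per-GPU allocation unit on Matrix, per the scheduling policy above.
CORES_PER_GPU=28
MEM_GB_PER_GPU=128

# Resources implied by a 3-GPU request.
gpus=3
echo "cores=$(( gpus * CORES_PER_GPU ))"    # cores=84
echo "mem_gb=$(( gpus * MEM_GB_PER_GPU ))"  # mem_gb=384
```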

Flags and Their Behavior

  1.  -G (Number of GPUs)

    Specifies the number of GPUs needed for the job. Each GPU comes with its associated CPU cores and memory.
    •  Examples:
      • salloc -G 1 : Allocates 1 GPU and 28 CPU cores.
      • salloc -G 2 : Allocates 2 GPUs and 56 CPU cores.
    • Notes:
      1. If the requested GPUs are available on the same node, the job will remain on one node.
      2. If the requested GPUs are not available on the same node, the job may span multiple nodes.

  2. -n (Number of Tasks)

    Specifies the number of tasks (processes) required for the job, with one CPU core allocated per task.
    • Examples:
      • salloc -n 1 : Allocates 1 GPU and 28 CPU cores (smallest allocation).
      • salloc -n 28: Allocates 1 GPU and 28 CPU cores (fully utilizes the cores associated with the GPU).
      • salloc -n 29 : Allocates 2 GPUs and 56 CPU cores (spans multiple GPUs when more than 28 tasks are requested).
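The task-to-GPU rounding above is just ceiling division by 28 (one core per task, 28 cores per GPU); a quick shell sketch, illustrative only:

```shell
#!/bin/sh
# GPUs implied by a task count: ceiling division by the
# 28 cores that come with each GPU.
for tasks in 1 28 29; do
  gpus=$(( (tasks + 27) / 28 ))
  echo "$tasks tasks -> $gpus GPU(s)"
done
# Prints:
# 1 tasks -> 1 GPU(s)
# 28 tasks -> 1 GPU(s)
# 29 tasks -> 2 GPU(s)
```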

  3. -N (Number of Nodes)

    Specifies the number of nodes over which the allocation will be spread.
    • Examples:
      • salloc -N 1 : Allocates 1 GPU and 28 CPU cores (smallest possible allocation on one node).
      • salloc -N 2 : Allocates 2 GPUs and 56 CPU cores, distributed across two nodes.
    • Notes:
      • salloc -G 2 -N 1 : Allocates 2 GPUs on the same node.
      • salloc -G 2 -N 2 : Allocates 1 GPU on each of 2 nodes.

  4. --exclusive  (Exclusive Node Access)

    Allocates whole nodes for the entire request, ensuring no other jobs share those nodes.
    • Examples:
      • salloc -G 1 --exclusive : Allocates 4 GPUs and 112 CPU cores on the same node (entire node).
      • salloc -G 5 --exclusive : Allocates 8 GPUs and 224 CPU cores, split across 2 nodes.
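With --exclusive, the node count is the GPU request rounded up to whole 4-GPU, 112-core nodes (figures from the hardware description above); sketched in shell, illustrative only:

```shell
#!/bin/sh
# Whole-node arithmetic under --exclusive:
# nodes = ceil(requested_gpus / 4), with 4 GPUs and 112 cores per node.
requested_gpus=5
nodes=$(( (requested_gpus + 3) / 4 ))
echo "nodes=$nodes"               # nodes=2
echo "gpus=$(( nodes * 4 ))"      # gpus=8
echo "cores=$(( nodes * 112 ))"   # cores=224
```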

Additional Notes

  • Multi-GPU Allocation:
    • salloc -G 2 allocates 2 GPUs and 56 cores on the same node if possible. 
    • If the GPUs are not available on the same node, the job will be split across 2 nodes.
  • Node-Specific Allocation:
    • salloc -G 2 -N 1 : Ensures 2 GPUs are allocated from the same node.
    • salloc -G 2 -N 2 : Allocates 1 GPU on each of 2 nodes.


Zone: CZ
Vendor: Dell

User-Available Nodes
  • Login Nodes*: matrix[1,2]
  • Batch Nodes: 12
  • Debug Nodes: 2
  • Total Nodes: 16

CPUs
  • CPU Architecture: Intel(R) Xeon(R) Platinum 8480+
  • Cores/Node: 112
  • Total Cores: 1,792

GPUs
  • GPU Architecture: NVIDIA H100
  • Total GPUs: 56
  • GPUs per compute node: 4
  • GPU peak performance (TFLOP/s, double precision): 30.00
  • GPU global memory (GB): 320.00

Memory
  • Memory Total (GB): 8,064
  • CPU Memory/Node (GB): 504

Peak Performance
  • Peak TFLOPS (CPUs): 198.0
  • Peak TFLOPS (GPUs): 3,800.0
  • Peak TFLOPS (CPUs+GPUs): 4,000.00

Clock Speed (GHz): 3.7
OS: TOSS 4
Interconnect: IB
Parallel job type: multiple nodes per user
Recommended location for parallel file space:
Class: CTS-2
Password Authentication: OTP, Kerberos
Year Commissioned: 2025

Compilers

See Compilers page