Lab 3: GPU Partitioning with HAMi
This lab continues from Lab 1. You have one physical Tesla T4 with 15360 MiB of VRAM. In this lab you will run multiple Pods on that single card, each with its own enforced VRAM and compute limit, and verify that the isolation is real: a Pod that tries to allocate past its slice gets a CUDA OOM while its neighbors keep running.
Every command and output in this lab was captured from a live cluster built with Lab 1 (HAMi v2.9.0, GPU Operator v25.3.0, Kubernetes v1.34).
What You'll Learn
- The vGPU resource types HAMi adds to Kubernetes
- Running multiple Pods on one physical GPU
- How HAMi enforces VRAM ceilings inside containers
- Proving memory isolation with an OOM test
- Limiting compute with
gpucoresand the two utilization policies - Where to see the HAMi scheduler's decisions
Lab Overview
Prerequisites
- A cluster from Lab 1: Online Installation: HAMi installed, GPU node labeled
gpu=on - The manifests from
examples/03-gpu-partitioning/
Step 1: Understand HAMi Resource Types
HAMi extends Kubernetes with custom GPU resources. A Pod combines them to describe its slice:
| Resource | Meaning |
|---|---|
nvidia.com/gpu | Number of vGPUs the container requests |
nvidia.com/gpumem | VRAM per vGPU, in MiB |
nvidia.com/gpumem-percentage | VRAM per vGPU as a percentage of the card (alternative to gpumem) |
nvidia.com/gpucores | Compute capacity per vGPU, as a percentage of the card's SMs |
Check what the node offers:
NODE_NAME=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get node ${NODE_NAME} -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
10
One physical T4 is registered as 10 schedulable vGPUs (the
countfield of the registration annotation you verified in Lab 1). Each can carry its owngpumemandgpucoreslimits.
Step 2: Two Pods Sharing One GPU
Deploy two Pods that each request 1 vGPU with a 4000 MiB VRAM slice. Review the manifest first:
apiVersion: v1
kind: Pod
metadata:
name: gpumem-pod-a
spec:
restartPolicy: Never
containers:
- name: app
image: nvidia/cuda:12.4.1-base-ubuntu22.04
command: ["sleep", "infinity"]
resources:
limits:
nvidia.com/gpu: 1 # request 1 vGPU
nvidia.com/gpumem: 4000 # limit this Pod to 4000 MiB of VRAM
gpumem-pod-b.yaml is identical except for the name. Apply both:
kubectl apply -f gpumem-pod-a.yaml -f gpumem-pod-b.yaml
kubectl get pods gpumem-pod-a gpumem-pod-b -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
gpumem-pod-a 1/1 Running 0 27s 10.244.218.156 hami-workshop <none> <none>
gpumem-pod-b 1/1 Running 0 27s 10.244.218.157 hami-workshop <none> <none>
Both Pods are Running on a node that has only one physical GPU. Verify they actually landed on the same card by reading the allocation annotation the HAMi scheduler writes on each Pod:
kubectl get pod gpumem-pod-a -o jsonpath='{.metadata.annotations.hami\.io/vgpu-devices-allocated}'; echo
kubectl get pod gpumem-pod-b -o jsonpath='{.metadata.annotations.hami\.io/vgpu-devices-allocated}'; echo
GPU-859b872c-0ba2-97b0-10b4-8b7185c55039,NVIDIA,4000,0:;
GPU-859b872c-0ba2-97b0-10b4-8b7185c55039,NVIDIA,4000,0:;
Same device UUID, 4000 MiB each. Two Pods, one physical T4. The annotation format is
{UUID},{vendor},{VRAM MiB},{compute %}.
Step 3: The VRAM Ceiling
The critical question: what does the GPU look like from inside a Pod?
kubectl exec gpumem-pod-a -- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06 Driver Version: 570.124.06 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 |
| N/A 57C P8 16W / 70W | 0MiB / 4000MiB | 0% Default |
+-----------------------------------------+------------------------+----------------------+
0MiB / 4000MiB: the container sees a 4000 MiB GPU, not the physical 15360 MiB. This is HAMi-core at work, a library injected into the container viaLD_PRELOADthat intercepts CUDA and NVML calls and rewrites the memory numbers. The other Pod sees its own independent 4000 MiB ceiling on the same physical card.
Step 4: Prove Memory Isolation with an OOM Test
A ceiling in nvidia-smi output is nice, but does the limit actually hold? oom-test-pod.yaml allocates GPU memory in 512 MiB chunks until it fails:
apiVersion: v1
kind: Pod
metadata:
name: oom-test-pod
spec:
restartPolicy: Never
containers:
- name: app
image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime
command:
- python
- -c
- |
import torch
chunks = []
try:
while True:
# allocate 512 MiB per iteration
chunks.append(torch.empty(512, 1024, 1024, dtype=torch.uint8, device="cuda"))
print(f"Allocated {len(chunks) * 512} MiB", flush=True)
except RuntimeError as e:
print(f"Hit the limit after {len(chunks) * 512} MiB:", flush=True)
print(str(e).split(".")[0], flush=True)
resources:
limits:
nvidia.com/gpu: 1 # request 1 vGPU
nvidia.com/gpumem: 4000 # limit this Pod to 4000 MiB of VRAM
kubectl apply -f oom-test-pod.yaml
While the image pulls, watch the HAMi scheduler make its decision:
kubectl describe pod oom-test-pod | tail -3
Normal FilteringSucceed 96s hami-scheduler find fit node(hami-workshop), 0 nodes not fit, 1 nodes fit(hami-workshop:7.21)
Normal BindingSucceed 96s hami-scheduler Successfully binding node [hami-workshop] to default/oom-test-pod
Normal Pulling 95s kubelet Pulling image "pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime"
FilteringSucceedshows the scheduler scoring nodes (herehami-workshop:7.21), andBindingSucceedshows it binding the Pod. These events come from hami-scheduler, not the default scheduler. Note it found a fit even though two Pods already occupy the GPU: 8000 of 15360 MiB are reserved, so a third 4000 MiB slice still fits.
Wait for the Pod to complete, then read its logs:
kubectl logs oom-test-pod | tail -8
Allocated 2560 MiB
Allocated 3072 MiB
Allocated 3584 MiB
[HAMI-core ERROR (pid:1 thread=... allocator.c:52)]: Device 0 OOM 4399824896 / 4194304000
Hit the limit after 3584 MiB:
CUDA out of memory
The allocation loop runs fine up to 3584 MiB. The next 512 MiB chunk would put the total (plus CUDA context overhead) past the slice, and HAMi-core denies it: it tried to use 4399824896 bytes against the 4194304000-byte (4000 MiB) limit. The application receives a standard
CUDA out of memoryerror, exactly what it would get on a real 4 GB card.
Meanwhile, check the neighbors:
kubectl get pods gpumem-pod-a gpumem-pod-b
NAME READY STATUS RESTARTS AGE
gpumem-pod-a 1/1 Running 0 5m12s
gpumem-pod-b 1/1 Running 0 5m12s
The OOM was contained entirely inside the offending Pod. This is the isolation property that makes it safe to pack multiple workloads on one card.
Step 5: Limit Compute with gpucores
VRAM is one dimension; compute is the other. gpucores-pod.yaml requests 30% of the card's compute and runs an infinite matrix multiplication:
env:
# "force" caps utilization strictly at gpucores.
# The default policy allows bursting above the cap while the GPU is idle.
- name: GPU_CORE_UTILIZATION_POLICY
value: "force"
resources:
limits:
nvidia.com/gpu: 1 # request 1 vGPU
nvidia.com/gpumem: 4000 # limit this Pod to 4000 MiB of VRAM
nvidia.com/gpucores: 30 # limit this Pod to 30% of GPU compute
kubectl apply -f gpucores-pod.yaml
Check the environment HAMi injected into the container:
kubectl exec gpucores-pod -- env | grep -E 'CUDA_DEVICE|NVIDIA_VISIBLE'
NVIDIA_VISIBLE_DEVICES=GPU-859b872c-0ba2-97b0-10b4-8b7185c55039
CUDA_DEVICE_MEMORY_LIMIT_0=4000m
CUDA_DEVICE_SM_LIMIT=30
CUDA_DEVICE_MEMORY_SHARED_CACHE=/usr/local/vgpu/b08c450f-718b-4c25-ae0b-e02251097ed9.cache
CUDA_DEVICE_MEMORY_LIMIT_0andCUDA_DEVICE_SM_LIMITare read by HAMi-core to enforce the VRAM ceiling and the compute cap. The shared cache file coordinates usage accounting between Pods on the same card. Watch the GPU utilization from the host side (via the driver Pod) while the workload runs:
DRIVER_POD=$(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -1)
for i in $(seq 1 12); do
kubectl -n gpu-operator exec ${DRIVER_POD} -- nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
sleep 5
done
32 %
82 %
0 %
84 %
0 %
60 %
0 %
57 %
11 %
25 %
0 %
9 %
HAMi-core throttles by duty cycle: bursts of work followed by enforced idle periods. Individual samples jump around, but the average of these 12 samples is exactly 30%. Prometheus confirms it over a longer window:
kubectl exec -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus -- \
promtool query instant http://localhost:9090 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[3m])'
{..., UUID="GPU-859b872c-...", modelName="Tesla T4", ...} => 27.555555555555557
About 30%, as requested. Note the two policies:
- default: the Pod may burst above its
gpucoresshare while the GPU is otherwise idle. Good for utilization, looser isolation.- force (
GPU_CORE_UTILIZATION_POLICY=force): strict cap at all times, as measured above. Good for predictable performance isolation.Without the
forceenv, you would see this same workload run at 100% utilization on an idle card, which is intentional: HAMi gives idle capacity away rather than wasting it.
Step 6: Cleanup
kubectl delete pod gpumem-pod-a gpumem-pod-b oom-test-pod gpucores-pod --ignore-not-found
What This Lab Proved
| Claim | Evidence |
|---|---|
| Multiple Pods can share one physical GPU | Two Running Pods, same device UUID in their allocation annotations |
| VRAM limits are enforced, not cosmetic | nvidia-smi ceiling of 4000 MiB; allocation denied past the limit with CUDA OOM |
| A Pod hitting its limit does not disturb neighbors | OOM Pod Completed while both neighbors stayed Running |
| Compute limits hold | 30% requested; 12-sample average exactly 30%, DCGM 3-minute average 27.6% |
Next Steps
Read HAMi Cluster Architecture to map every component you just exercised, or continue experimenting: try nvidia.com/gpumem-percentage, run more than two Pods on the card, or fill the card's 10 vGPU slots and watch the scheduler refuse the 11th Pod.