Version: v2.5.1

Deploy HAMi using Helm

This guide will cover:

Configure the NVIDIA container runtime on each GPU node
Install HAMi using Helm
Launch a vGPU task
Check that the corresponding device resources are limited inside the container

Prerequisites

Helm version v3+
kubectl version v1.16+
CUDA version v10.2+
NVIDIA Driver v440+
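
The minimum versions above can be checked with a small comparison helper. The following is a minimal sketch, not part of HAMi: `version_ge` is a hypothetical helper name, the version strings passed to it are illustrative placeholders, and it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS).

```shell
# version_ge A B: succeed when version A >= version B (a leading "v" is ignored).
# Hypothetical helper for illustration; runs in a subshell so variables stay local.
version_ge() (
  have="${1#v}"
  want="${2#v}"
  # sort -V orders version strings; if the required minimum sorts first
  # (or both are equal), the detected version is new enough.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
)

# Illustrative values; substitute the versions reported on your own system.
version_ge "v3.12.0" "3.0.0"  && echo "helm ok"
version_ge "v1.16.8" "1.16.0" && echo "kubectl ok"
version_ge "535.104" "440"    && echo "driver ok"
```

The detected versions would typically come from `helm version`, `kubectl version`, and `nvidia-smi` on the node being checked.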

Installation

1. Configure nvidia-container-toolkit

Execute the following steps on all your GPU nodes.

This guide assumes that the NVIDIA drivers and the nvidia-container-toolkit are already installed, and that nvidia-container-runtime is configured as the default low-level runtime.

Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Example for Debian-based systems with `Docker` and `containerd`

Install the `nvidia-container-toolkit`

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

Configure `Docker`

When running Kubernetes with Docker, edit the configuration file, typically located at /etc/docker/daemon.json, to set up nvidia-container-runtime as the default low-level runtime:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

And then restart Docker:

sudo systemctl daemon-reload && sudo systemctl restart docker
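
After the restart, it can help to confirm that the setting took effect. The sketch below is an assumption-laden check rather than an official procedure: `has_nvidia_default` is a hypothetical helper that does a rough text match on the file (it is not a JSON parser), and `DAEMON_JSON` is an illustrative variable that defaults to the typical path mentioned above.

```shell
# has_nvidia_default FILE: rough text check that FILE sets
# "default-runtime" to "nvidia". Hypothetical helper, not a JSON parser.
has_nvidia_default() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Illustrative variable; defaults to the typical location of the Docker config.
DAEMON_JSON="${DAEMON_JSON:-/etc/docker/daemon.json}"
if [ -r "$DAEMON_JSON" ] && has_nvidia_default "$DAEMON_JSON"; then
  echo "daemon.json sets nvidia as the default runtime"
else
  echo "nvidia is not the default runtime in $DAEMON_JSON" >&2
fi

# Ask the running daemon as well, when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
  docker info --format '{{.DefaultRuntime}}' || echo "docker daemon not reachable" >&2
fi
```

The file check only shows what is configured; the `docker info` query shows what the running daemon actually uses, which is the value that matters after a restart.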

Configure `containerd`

When running Kubernetes with containerd, modify the configuration file, typically located at /etc/containerd/config.toml, to set up nvidia-container-runtime as the default low-level runtime:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

And then restart containerd:

sudo systemctl daemon-reload && sudo systemctl restart containerd
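
To see which default runtime containerd resolved after the restart, the merged configuration can be inspected. This is a minimal sketch: `default_runtime_from_toml` is a hypothetical helper that extracts the first `default_runtime_name` value from TOML text on standard input, and it assumes the key appears on a single line as in the example above.

```shell
# default_runtime_from_toml: read containerd TOML on stdin and print the
# value of the first default_runtime_name entry. Hypothetical helper.
default_runtime_from_toml() {
  sed -nE 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"([^"]*)".*/\1/p' | head -n1
}

if command -v containerd >/dev/null 2>&1; then
  # "containerd config dump" prints the merged configuration in effect.
  runtime="$(containerd config dump 2>/dev/null | default_runtime_from_toml)"
  echo "containerd default runtime: ${runtime:-<not set>}"
fi
```

If the printed value is not `nvidia`, the edit to config.toml was likely not picked up, for example because the file lives at a different path on that node.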

2. Label your nodes

Label your GPU nodes for scheduling with HAMi by adding the label "gpu=on". Without this label, the nodes cannot be managed by the HAMi scheduler.

kubectl label nodes <node-name> gpu=on
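
When several nodes need the label, the command can be wrapped in a loop. The sketch below is illustrative only: `label_gpu_nodes` and the `KUBECTL` override are hypothetical conveniences, and the node names in the usage comment are placeholders.

```shell
# KUBECTL can be overridden (for example KUBECTL="echo kubectl") to preview
# the commands without touching the cluster. Illustrative convention only.
KUBECTL="${KUBECTL:-kubectl}"

# label_gpu_nodes NODE...: apply the gpu=on label to each named node.
label_gpu_nodes() {
  for node in "$@"; do
    # --overwrite keeps the command idempotent if the label already exists.
    $KUBECTL label nodes "$node" gpu=on --overwrite
  done
}

# Usage with placeholder node names:
#   label_gpu_nodes gpu-node-1 gpu-node-2
# List the nodes that currently carry the label:
#   kubectl get nodes -l gpu=on
```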

3. Deploy HAMi using Helm

First, check your Kubernetes version with the following command:

kubectl version

Then add the HAMi repository to Helm:

helm repo add hami-charts https://project-hami.github.io/HAMi/
helm repo update

During installation, set the Kubernetes scheduler image version to match your Kubernetes server version. For instance, if your cluster server version is 1.16.8, use the following command for deployment:

helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=v1.16.8 -n kube-system
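
The image tag can also be derived from the cluster instead of typed by hand. The following sketch rests on several assumptions: `image_tag_from_git_version` is a hypothetical helper, `jq` is installed, and a scheduler image exists for the trimmed version. Managed distributions often report versions with vendor suffixes, which the helper strips.

```shell
# image_tag_from_git_version VERSION: drop vendor suffixes such as
# "-gke.100" or "+k3s1" from a Kubernetes gitVersion. Hypothetical helper.
image_tag_from_git_version() {
  printf '%s\n' "${1%%[-+]*}"
}

if command -v kubectl >/dev/null 2>&1 && command -v jq >/dev/null 2>&1; then
  # serverVersion is absent when the API server cannot be reached.
  server_version="$(kubectl version -o json 2>/dev/null | jq -r '.serverVersion.gitVersion // empty')"
  if [ -n "$server_version" ]; then
    tag="$(image_tag_from_git_version "$server_version")"
    # Print the command for review instead of running it directly.
    echo "helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag=$tag -n kube-system"
  fi
fi
```

Printing the command rather than executing it leaves room to confirm the tag before installing.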

If the installation succeeds, both the vgpu-device-plugin and vgpu-scheduler pods will be in the Running state.
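
One way to check the pod state is to filter the pod listing. This is a rough sketch: `not_running` is a hypothetical helper, and the name pattern is an assumption, since the pods may be prefixed `vgpu-` or `hami-` depending on the chart version and release name. Adjust the pattern to match what `kubectl get pods -n kube-system` shows on your cluster.

```shell
# not_running: read "kubectl get pods --no-headers" output on stdin and print
# the names of matching pods whose STATUS column is not "Running".
# The name pattern is an assumption; adjust it to your chart version.
not_running() {
  awk '$1 ~ /(vgpu|hami)-(device-plugin|scheduler)/ && $3 != "Running" { print $1 }'
}

if command -v kubectl >/dev/null 2>&1; then
  pending="$(kubectl get pods -n kube-system --no-headers 2>/dev/null | not_running)"
  if [ -n "$pending" ]; then
    echo "not running yet:"
    echo "$pending"
  else
    echo "no matching pods outside the Running state (or none matched the pattern)"
  fi
fi
```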

Demo

1. Submit demo task

Containers can now request NVIDIA vGPUs using the `nvidia.com/gpu` resource type.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1 # request 1 vGPU
          nvidia.com/gpumem: 10240 # each vGPU gets 10240 MiB of device memory (optional, integer)

Verify resource control in the container

Execute the following query command:

kubectl exec -it gpu-pod -- nvidia-smi

The output should resemble the following, with the total memory matching the 10240 MiB limit requested by the pod:

[HAMI-core Msg(28:140561996502848:libvgpu.c:836)]: Initializing.....
Wed Apr 10 09:28:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           On  |   00000000:3E:00.0 Off |                    0 |
| N/A   29C    P0             24W /  250W |       0MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(28:140561996502848:multiprocess_memory_limit.c:434)]: Calling exit handler 28
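
The memory limit can also be checked without reading the table by eye. The sketch below is illustrative: `mem_total_mib` is a hypothetical helper that pulls the total from the "used / total" column of the table layout shown above, and `EXPECTED_MIB` is an assumed variable mirroring the `nvidia.com/gpumem` value of the demo pod.

```shell
# mem_total_mib: read nvidia-smi table output on stdin and print the total
# from the first "<used>MiB / <total>MiB" field. Hypothetical helper.
mem_total_mib() {
  sed -nE 's#.*/ *([0-9]+)MiB.*#\1#p' | head -n1
}

EXPECTED_MIB=10240   # assumed to match nvidia.com/gpumem in the demo pod
if command -v kubectl >/dev/null 2>&1; then
  total="$(kubectl exec gpu-pod -- nvidia-smi 2>/dev/null | mem_total_mib)"
  if [ "$total" = "$EXPECTED_MIB" ]; then
    echo "memory limit applied: ${total}MiB"
  else
    echo "unexpected memory total: ${total:-unknown}" >&2
  fi
fi
```

A total equal to the physical memory of the card (32 GB for the V100 in the sample output) would suggest that the limit is not being enforced inside the container.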
