
GPU Utilization Metrics

Summary

HAMi supports dividing a single NVIDIA GPU card into several vGPU cards to efficiently utilize GPU capacity. However, when a vGPU is assigned to a Pod, HAMi does not currently expose per-Pod GPU utilization metrics. This makes it impossible for users to observe how much GPU each Pod is actually consuming.

This document describes the design for adding per-Pod vGPU utilization monitoring.

Motivation

Goals

  • Support monitoring of per-Pod vGPU utilization.

Non-Goals

  • Monitoring GPU utilization on non-NVIDIA GPUs is out of scope.

Design Details

Extend shared_region with a gpu_util Field

A new gpu_util field is added to shrreg_proc_slot_t to record per-process GPU utilization:

typedef struct {
    uint64_t dec_util;
    uint64_t enc_util;
    uint64_t sm_util;
} device_gpu_t;

typedef struct {
    int32_t pid;
    int32_t hostpid;
    device_memory_t used[CUDA_DEVICE_MAX_COUNT];
    uint64_t monitorused[CUDA_DEVICE_MAX_COUNT];
    int32_t status;
    device_gpu_t gpu_util[CUDA_DEVICE_MAX_COUNT]; // new field
} shrreg_proc_slot_t;


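A helper, set_gpu_device_gpu_monitor, stores the sampled SM utilization in the slot whose hostpid matches the given process:

int set_gpu_device_gpu_monitor(int32_t pid, int dev, unsigned int smUtil) {
    int i;
    ensure_initialized();
    lock_shrreg();
    // Find the slot registered for this host PID and update it under the lock.
    for (i = 0; i < region_info.shared_region->proc_num; i++) {
        if (region_info.shared_region->procs[i].hostpid == pid) {
            // The struct field is sm_util (see device_gpu_t above).
            region_info.shared_region->procs[i].gpu_util[dev].sm_util = smUtil;
            break;
        }
    }
    unlock_shrreg();
    return 1;
}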
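Because vGPUMonitor later reads this region from Go, the Go side needs a type whose layout matches device_gpu_t. The following is a minimal sketch of decoding one such record from raw bytes. The names deviceUtilization and decodeDeviceUtilization are hypothetical, and little-endian byte order is an assumption that holds on typical x86-64 and arm64 hosts; this is not HAMi's actual reader.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// deviceUtilization mirrors the C struct device_gpu_t: three uint64 fields
// in declaration order, 24 bytes in total with no padding.
type deviceUtilization struct {
	decUtil uint64
	encUtil uint64
	smUtil  uint64
}

const deviceUtilizationSize = 24

// decodeDeviceUtilization reads one device_gpu_t record from raw bytes.
// Little-endian byte order is an assumption, not something the design states.
func decodeDeviceUtilization(b []byte) (deviceUtilization, error) {
	if len(b) < deviceUtilizationSize {
		return deviceUtilization{}, fmt.Errorf("need %d bytes, got %d", deviceUtilizationSize, len(b))
	}
	return deviceUtilization{
		decUtil: binary.LittleEndian.Uint64(b[0:8]),
		encUtil: binary.LittleEndian.Uint64(b[8:16]),
		smUtil:  binary.LittleEndian.Uint64(b[16:24]),
	}, nil
}

func main() {
	raw := make([]byte, deviceUtilizationSize)
	binary.LittleEndian.PutUint64(raw[16:24], 37) // sm_util = 37
	u, err := decodeDeviceUtilization(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("dec=%d enc=%d sm=%d\n", u.decUtil, u.encUtil, u.smUtil)
}
```

Appending gpu_util as the last member of shrreg_proc_slot_t keeps the offsets of the existing fields unchanged, but the slot size grows, so any reader compiled against the old layout would need to be rebuilt.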

Update get_used_gpu_utilization

The get_used_gpu_utilization function is updated to record the SM utilization of each sampled process, alongside the existing memory-usage update:

int get_used_gpu_utilization(int *userutil, int *sysprocnum) {
    // ...
    for (i = 0; i < processes_num; i++) {
        set_gpu_device_memory_monitor(processes_sample[i].pid, cudadev, summonitor);
        set_gpu_device_gpu_monitor(processes_sample[i].pid, cudadev, processes_sample[i].smUtil); // new
    }
    // ...
    return 0;
}

Expose Metrics via vGPUMonitor

vGPUMonitor is updated to read sm_util from the shared region and expose it as a Prometheus metric:

ctrDeviceUtilizationdesc = prometheus.NewDesc(
	"Device_utilization_desc_of_container",
	"Container device utilization description",
	[]string{"podnamespace", "podname", "ctrname", "vdeviceid", "deviceuuid"}, nil,
)

func getTotalUtilization(usage podusage, vidx int) deviceUtilization {
	added := deviceUtilization{decUtil: 0, encUtil: 0, smUtil: 0}
	for _, val := range usage.sr.procs {
		added.decUtil += val.gpuUtil[vidx].decUtil
		added.encUtil += val.gpuUtil[vidx].encUtil
		added.smUtil += val.gpuUtil[vidx].smUtil
	}
	return added
}

utilization := getTotalUtilization(srPodList[sridx], i)

ch <- prometheus.MustNewConstMetric(
	ctrDeviceUtilizationdesc,
	prometheus.GaugeValue,
	float64(utilization.smUtil),
	val.Namespace, val.Name, ctrName, fmt.Sprint(i), uuid,
)
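The aggregation step can be tried in isolation. The sketch below is a self-contained approximation of getTotalUtilization that uses only the standard library. The procSlot type and the totalUtilization name are hypothetical stand-ins for the monitor's real types, and the bounds check on the vGPU index is an addition that the original snippet does not show.

```go
package main

import "fmt"

// deviceUtilization and procSlot are illustrative stand-ins, not HAMi types.
type deviceUtilization struct {
	decUtil, encUtil, smUtil uint64
}

type procSlot struct {
	hostpid int32
	gpuUtil []deviceUtilization // indexed by vGPU index
}

// totalUtilization sums per-process utilization for one vGPU index,
// skipping slots that do not cover that index.
func totalUtilization(procs []procSlot, vidx int) deviceUtilization {
	var total deviceUtilization
	for _, p := range procs {
		if vidx < 0 || vidx >= len(p.gpuUtil) {
			continue
		}
		total.decUtil += p.gpuUtil[vidx].decUtil
		total.encUtil += p.gpuUtil[vidx].encUtil
		total.smUtil += p.gpuUtil[vidx].smUtil
	}
	return total
}

func main() {
	// Two processes in the same container, both using vGPU index 0.
	procs := []procSlot{
		{hostpid: 1001, gpuUtil: []deviceUtilization{{smUtil: 20}}},
		{hostpid: 1002, gpuUtil: []deviceUtilization{{smUtil: 15}}},
	}
	fmt.Println(totalUtilization(procs, 0).smUtil) // prints 35
}
```

Summing across process slots yields one value per container and vGPU index, which matches the label set of the metric (one series per podnamespace, podname, ctrname, vdeviceid, and deviceuuid).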

Test Plan

Deploy multiple Pods that actively use GPU on the same node and verify that HAMi exposes accurate per-Pod GPU utilization rates via the Prometheus metrics endpoint.
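One way to check the result is to scrape the monitor's metrics endpoint and compare the value reported for each Pod against the load that Pod is known to generate. The sketch below parses Prometheus text-format output and sums Device_utilization_desc_of_container per podname label. The utilizationByPod helper is hypothetical, the sample payload is made up for illustration, and fetching the endpoint is left out because the design does not state its address.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

const metricName = "Device_utilization_desc_of_container"

// podLabel extracts the podname label value from a sample line.
var podLabel = regexp.MustCompile(`podname="([^"]*)"`)

// utilizationByPod scans Prometheus text-format output and returns the
// summed metric value for each podname label.
func utilizationByPod(body string) (map[string]float64, error) {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricName+"{") {
			continue // comments, blank lines, and other metrics
		}
		m := podLabel.FindStringSubmatch(line)
		end := strings.LastIndex(line, "}")
		if m == nil || end < 0 {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			return nil, fmt.Errorf("bad sample %q: %w", line, err)
		}
		out[m[1]] += v
	}
	return out, sc.Err()
}

func main() {
	// Made-up scrape output for two Pods sharing one physical GPU.
	body := `# TYPE Device_utilization_desc_of_container gauge
Device_utilization_desc_of_container{podnamespace="default",podname="train-a",ctrname="main",vdeviceid="0",deviceuuid="GPU-example"} 35
Device_utilization_desc_of_container{podnamespace="default",podname="train-b",ctrname="main",vdeviceid="0",deviceuuid="GPU-example"} 12
`
	got, err := utilizationByPod(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(got["train-a"], got["train-b"])
}
```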

HAMi is a CNCF Sandbox project.