Scheduler Policy
Summary
Currently, in a cluster with many GPU nodes, workloads are neither binpacked nor spread across nodes when scheduling decisions are made, nor are they binpacked or spread across GPU cards when using vGPU.
Proposal
We add a node-scheduler-policy and a gpu-scheduler-policy to the scheduler configuration. The scheduler then uses these policies to implement node binpack or spread, and GPU binpack, spread, or topology-aware scheduling. The topology-aware policy only takes effect with Nvidia GPUs.
Users can override these default policies per Pod by setting the hami.io/node-scheduler-policy and hami.io/gpu-scheduler-policy annotations, which overlay the scheduler configuration.
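As a rough illustration of the overlay behavior, a minimal Go sketch follows. Only the annotation keys come from this proposal; the SchedulerConfig struct and the effectivePolicies helper are hypothetical names used for illustration, not the actual HAMi implementation.

// Sketch: Pod annotations overlay the configured defaults.
package policy

const (
	nodePolicyAnnotation = "hami.io/node-scheduler-policy" // "binpack" or "spread"
	gpuPolicyAnnotation  = "hami.io/gpu-scheduler-policy"  // "binpack", "spread" or "topology-aware"
)

// SchedulerConfig holds the cluster-wide defaults from the scheduler config.
type SchedulerConfig struct {
	NodeSchedulerPolicy string
	GPUSchedulerPolicy  string
}

// effectivePolicies returns the policies to use for a given Pod: annotations,
// when present, take precedence over the configured defaults.
func effectivePolicies(cfg SchedulerConfig, podAnnotations map[string]string) (nodePolicy, gpuPolicy string) {
	nodePolicy, gpuPolicy = cfg.NodeSchedulerPolicy, cfg.GPUSchedulerPolicy
	if v := podAnnotations[nodePolicyAnnotation]; v != "" {
		nodePolicy = v
	}
	if v := podAnnotations[gpuPolicyAnnotation]; v != "" {
		gpuPolicy = v
	}
	return nodePolicy, gpuPolicy
}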
User Stories
The following user stories assume a GPU cluster with two nodes.

Story 1
Node binpack: use the GPU cards of a single node whenever possible, e.g.:
- cluster resources:
  - node1: 4 GPU devices
  - node2: 4 GPU devices
- request:
  - pod1: requests 1 GPU
  - pod2: requests 1 GPU
- scheduling result:
  - pod1: scheduled to node1
  - pod2: scheduled to node1
Story 2
Node spread: use GPU cards from different nodes as much as possible, e.g.:
- cluster resources:
  - node1: 4 GPU devices
  - node2: 4 GPU devices
- request:
  - pod1: requests 1 GPU
  - pod2: requests 1 GPU
- scheduling result:
  - pod1: scheduled to node1
  - pod2: scheduled to node2
Story 3
GPU binpack: use the same GPU card as much as possible, e.g.:
- cluster resources:
  - node1: 4 GPU devices, named GPU1, GPU2, GPU3, GPU4
- request:
  - pod1: requests 1 GPU with gpucore 20% and gpumem-percentage 20%
  - pod2: requests 1 GPU with gpucore 20% and gpumem-percentage 20%
- scheduling result:
  - pod1: scheduled to node1, on GPU1
  - pod2: scheduled to node1, on GPU1
Story 4
GPU spread: use different GPU cards when possible, e.g.:
- cluster resources:
  - node1: 4 GPU devices, named GPU1, GPU2, GPU3, GPU4
- request:
  - pod1: requests 1 GPU with gpucore 20% and gpumem-percentage 20%
  - pod2: requests 1 GPU with gpucore 20% and gpumem-percentage 20%
- scheduling result:
  - pod1: scheduled to node1, on GPU1
  - pod2: scheduled to node1, on GPU2
Design Details
Node-scheduler-policy

Binpack
Binpack mainly considers node resource usage: the fuller a node is, the higher its score, and the higher-scoring node is preferred.
score: ((request + used) / allocatable) * 10
Assume node1 already has 3 of its 4 GPUs allocated, node2 has 2 of its 4 allocated, and the pending pod requests 1 GPU.
- Binpack scoring for Node1:
  Node1 score: ((1 + 3) / 4) * 10 = 10
- Binpack scoring for Node2:
  Node2 score: ((1 + 2) / 4) * 10 = 7.5
So, under the Binpack policy we select Node1.
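A minimal sketch of this node score in Go; nodeScore is a hypothetical helper, not the actual HAMi implementation.

package policy

// nodeScore implements score = ((request + used) / allocatable) * 10,
// expressed in GPU device counts.
func nodeScore(request, used, allocatable float64) float64 {
	return (request + used) / allocatable * 10
}

// Example from the text: nodeScore(1, 3, 4) == 10 and nodeScore(1, 2, 4) == 7.5.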
Spread
Spread also considers node resource usage: the less a node is used, the lower its score, and the lower the score, the higher the priority.
score: ((request + used) / allocatable) * 10
Using the same usage assumption as above:
- Spread scoring for Node1:
  Node1 score: ((1 + 3) / 4) * 10 = 10
- Spread scoring for Node2:
  Node2 score: ((1 + 2) / 4) * 10 = 7.5
So, under the Spread policy we select Node2.
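Both node policies can reuse the same score and differ only in how they select: Binpack picks the highest-scoring node, Spread the lowest. A sketch, where pickNode and the candidate type are hypothetical names.

package policy

// candidate is a node together with its computed score.
type candidate struct {
	Name  string
	Score float64
}

// pickNode selects a node from the scored candidates according to the policy:
// "binpack" keeps the highest score, "spread" the lowest.
func pickNode(policy string, nodes []candidate) candidate {
	best := nodes[0]
	for _, n := range nodes[1:] {
		switch policy {
		case "binpack":
			if n.Score > best.Score {
				best = n
			}
		case "spread":
			if n.Score < best.Score {
				best = n
			}
		}
	}
	return best
}

// Example: with Node1 = 10 and Node2 = 7.5, "binpack" returns Node1 and
// "spread" returns Node2, matching the results above.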
GPU-scheduler-policy

Binpack
Binpack mainly considers the compute and GPU memory usage of each card: the more a card is used, the higher its score, and the higher-scoring card is preferred.
score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem) * 10
Assume each card has 100 units of core and 8000 MB of memory allocatable, GPU1 already uses 10 core and 2000 MB, GPU2 already uses 70 core and 6000 MB, and the pending container requests 20 core and 1000 MB.
- Binpack scoring for GPU1:
  GPU1 score: ((20 + 10) / 100 + (1000 + 2000) / 8000) * 10 = 6.75
- Binpack scoring for GPU2:
  GPU2 score: ((20 + 70) / 100 + (1000 + 6000) / 8000) * 10 = 17.75
So, under the Binpack policy we select GPU2.
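A minimal sketch of the per-card score, assuming core is a percentage and memory is in MB; gpuScore and gpuUsage are hypothetical names.

package policy

// gpuUsage describes one card's current usage and capacity.
type gpuUsage struct {
	UsedCore, AllocatableCore float64 // compute, e.g. percent of the card
	UsedMem, AllocatableMem   float64 // memory, e.g. MB
}

// gpuScore implements
//   score = ((request.core + used.core)/allocatable.core + (request.mem + used.mem)/allocatable.mem) * 10
func gpuScore(reqCore, reqMem float64, g gpuUsage) float64 {
	return ((reqCore+g.UsedCore)/g.AllocatableCore + (reqMem+g.UsedMem)/g.AllocatableMem) * 10
}

// Examples from the text:
//   gpuScore(20, 1000, gpuUsage{10, 100, 2000, 8000}) == 6.75  // GPU1
//   gpuScore(20, 1000, gpuUsage{70, 100, 6000, 8000}) == 17.75 // GPU2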
Spread
Spread also considers the compute and GPU memory usage of each card: the less a card is used, the lower its score, and the lower the score, the higher the priority.
score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem) * 10
Using the same usage assumption as above:
- Spread scoring for GPU1:
  GPU1 score: ((20 + 10) / 100 + (1000 + 2000) / 8000) * 10 = 6.75
- Spread scoring for GPU2:
  GPU2 score: ((20 + 70) / 100 + (1000 + 6000) / 8000) * 10 = 17.75
So, under the Spread policy we select GPU1.
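As with nodes, GPU Binpack selects the card with the highest score and GPU Spread the lowest. A minimal sketch, where pickGPU is a hypothetical helper.

package policy

// pickGPU returns the index of the selected card given per-card scores:
// "binpack" keeps the highest score, "spread" the lowest.
func pickGPU(policy string, scores []float64) int {
	best := 0
	for i, s := range scores {
		if (policy == "binpack" && s > scores[best]) || (policy == "spread" && s < scores[best]) {
			best = i
		}
	}
	return best
}

// Example: with scores {6.75, 17.75}, "binpack" selects GPU2 (index 1) and
// "spread" selects GPU1 (index 0), matching the results above.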
Topology-aware
Nvidia Topology-aware (Nvidia GPU Only)
Nvidia topology-aware scheduling primarily considers the topological relationships between GPUs (as reported by the nvidia-smi topo -m command). The hami-device-plugin calculates pairwise scores between GPUs based on these relationships: the higher the bandwidth between two GPUs, the higher the score. For example:
[
  {
    "uuid": "gpu0",
    "score": {
      "gpu1": "100",
      "gpu2": "100",
      "gpu3": "200"
    }
  },
  {
    "uuid": "gpu1",
    "score": {
      "gpu0": "100",
      "gpu2": "200",
      "gpu3": "100"
    }
  },
  {
    "uuid": "gpu2",
    "score": {
      "gpu0": "100",
      "gpu1": "200",
      "gpu3": "200"
    }
  },
  {
    "uuid": "gpu3",
    "score": {
      "gpu0": "200",
      "gpu1": "100",
      "gpu2": "200"
    }
  }
]
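A sketch of how the scheduler side might consume this per-device score table. The deviceScore type and parseTopologyScores helper are illustrative assumptions; note that the scores are JSON strings in the example above, so they are parsed into integers here.

package policy

import (
	"encoding/json"
	"strconv"
)

// deviceScore mirrors one entry of the JSON table above.
type deviceScore struct {
	UUID  string            `json:"uuid"`
	Score map[string]string `json:"score"` // peer GPU uuid -> link score
}

// parseTopologyScores turns the JSON table into a pairwise lookup:
// scores[a][b] is the link score between GPU a and GPU b.
func parseTopologyScores(raw []byte) (map[string]map[string]int, error) {
	var table []deviceScore
	if err := json.Unmarshal(raw, &table); err != nil {
		return nil, err
	}
	scores := make(map[string]map[string]int, len(table))
	for _, d := range table {
		scores[d.UUID] = make(map[string]int, len(d.Score))
		for peer, s := range d.Score {
			v, err := strconv.Atoi(s)
			if err != nil {
				return nil, err
			}
			scores[d.UUID][peer] = v
		}
	}
	return scores, nil
}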
One GPU
When a Pod requests only one GPU, the GPU with the worst aggregate communication performance with the other GPUs is preferred: the lower the total score, the higher the scheduling priority. This leaves well-connected GPUs available for multi-GPU workloads. For example:
- The sum of scores for gpu0 with other GPUs is as follows:
gpu0 score: 100 + 100 + 200 = 400
- The sum of scores for gpu1 with other GPUs is as follows:
gpu1 score: 100 + 200 + 100 = 400
- The sum of scores for gpu2 with other GPUs is as follows:
gpu2 score: 100 + 200 + 200 = 500
- The sum of scores for gpu3 with other GPUs is as follows:
gpu3 score: 200 + 100 + 200 = 500
Therefore, since gpu0 and gpu1 tie for the lowest total score (400), when a Pod requests only one GPU we randomly select either gpu0 or gpu1.
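A sketch of this single-GPU choice, building on the pairwise scores from the parseTopologyScores sketch above; pickSingleGPU is a hypothetical helper.

package policy

// pickSingleGPU returns the free GPU whose total link score to the other GPUs
// is lowest, i.e. the worst-connected card.
func pickSingleGPU(scores map[string]map[string]int, free []string) string {
	best := ""
	bestSum := -1
	for _, uuid := range free {
		sum := 0
		for _, s := range scores[uuid] {
			sum += s
		}
		// Lower total score means worse connectivity, which is preferred here.
		// Ties are broken by slice order in this sketch, whereas the proposal
		// picks one of the tied GPUs at random.
		if bestSum < 0 || sum < bestSum {
			best, bestSum = uuid, sum
		}
	}
	return best
}

// Example: with the table above, gpu0 and gpu1 both sum to 400 while gpu2 and
// gpu3 sum to 500, so gpu0 or gpu1 is chosen.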
More than one GPU
When a Pod requests more than one GPU, the combination with the highest total pairwise score is preferred: the higher the score, the higher the scheduling priority.
For example, if a Pod requests 3 GPUs, the score of the combination gpu0, gpu1, gpu2 is calculated as:
totalScore = score(gpu0, gpu1) + score(gpu0, gpu2) + score(gpu1, gpu2)
- The score for gpu0, gpu1, gpu2 is as follows:
(gpu0, gpu1, gpu2) totalScore: 100 + 100 + 200 = 400
- The score for gpu0, gpu1, gpu3 is as follows:
(gpu0, gpu1, gpu3) totalScore: 100 + 200 + 100 = 400
- The score for gpu1, gpu2, gpu3 is as follows:
(gpu1, gpu2, gpu3) totalScore: 200 + 100 + 200 = 500
Therefore, when a Pod requests 3 GPUs, gpu1, gpu2, and gpu3 are allocated.
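A sketch of the multi-GPU choice: enumerate combinations of the requested size and keep the one with the highest sum of pairwise link scores. combinationScore and pickGPUCombination are hypothetical helpers, not the actual HAMi implementation.

package policy

// combinationScore sums the pairwise link scores inside one combination,
// e.g. score(gpu0, gpu1) + score(gpu0, gpu2) + score(gpu1, gpu2).
func combinationScore(scores map[string]map[string]int, combo []string) int {
	total := 0
	for i := 0; i < len(combo); i++ {
		for j := i + 1; j < len(combo); j++ {
			total += scores[combo[i]][combo[j]]
		}
	}
	return total
}

// pickGPUCombination returns an n-GPU subset of free with the highest total score.
func pickGPUCombination(scores map[string]map[string]int, free []string, n int) []string {
	var best []string
	bestScore := -1
	var walk func(start int, combo []string)
	walk = func(start int, combo []string) {
		if len(combo) == n {
			if s := combinationScore(scores, combo); s > bestScore {
				best, bestScore = append([]string(nil), combo...), s
			}
			return
		}
		for i := start; i < len(free); i++ {
			walk(i+1, append(combo, free[i]))
		}
	}
	walk(0, nil)
	return best
}

// Note: several combinations may tie at the highest total (500 in the example
// above); this sketch keeps the first one found, so its tie-breaking may differ
// from the worked example.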