Version: v2.6.0

Scheduler Policy

Summary

Currently, in a cluster with many GPU nodes, the scheduler neither binpacks nor spreads Pods across nodes when making scheduling decisions, and GPU cards are neither binpacked nor spread when using vGPU.

Proposal

We add a node-scheduler-policy and a gpu-scheduler-policy to the scheduler configuration. The scheduler uses these policies to implement node binpack or spread, GPU binpack or spread, or topology-aware scheduling. The topology-aware policy only takes effect with Nvidia GPUs.

Users can override the default policies per Pod: the hami.io/node-scheduler-policy and hami.io/gpu-scheduler-policy annotations take precedence over the scheduler configuration.
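
For illustration, here is a minimal Go sketch of that override order, assuming a simple "annotation wins over config" rule; the effectivePolicy helper and the default values are hypothetical, not HAMi's actual code.

// Hypothetical sketch: a per-Pod annotation overrides the policy from the scheduler config.
package main

import "fmt"

const (
	// Annotation keys from this proposal.
	nodeSchedulerPolicyAnnotation = "hami.io/node-scheduler-policy"
	gpuSchedulerPolicyAnnotation  = "hami.io/gpu-scheduler-policy"
)

// effectivePolicy returns the Pod's annotation value when set,
// otherwise the default configured for the scheduler.
func effectivePolicy(podAnnotations map[string]string, key, configDefault string) string {
	if v, ok := podAnnotations[key]; ok && v != "" {
		return v
	}
	return configDefault
}

func main() {
	pod := map[string]string{nodeSchedulerPolicyAnnotation: "binpack"}

	fmt.Println(effectivePolicy(pod, nodeSchedulerPolicyAnnotation, "spread")) // binpack (annotation wins)
	fmt.Println(effectivePolicy(pod, gpuSchedulerPolicyAnnotation, "spread"))  // spread (config default)
}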

User Stories

Consider a GPU cluster with two nodes; the following stories take this cluster as a prerequisite.

scheduler-policy-story.png

Story 1

Node binpack: use the GPU cards of a single node whenever possible, e.g.:

  • cluster resources:

    • node1: 4 GPU devices
    • node2: 4 GPU devices
  • request:

    • pod1: requests 1 GPU
    • pod2: requests 1 GPU
  • scheduler result:

    • pod1: scheduled to node1
    • pod2: scheduled to node1

Story 2

Node spread: use GPU cards from different nodes as much as possible, e.g.:

  • cluster resources:

    • node1: 4 GPU devices
    • node2: 4 GPU devices
  • request:

    • pod1: requests 1 GPU
    • pod2: requests 1 GPU
  • scheduler result:

    • pod1: scheduled to node1
    • pod2: scheduled to node2

Story 3

GPU binpack: share the same GPU card as much as possible, e.g.:

  • cluster resources:

    • node1: 4 GPU devices, namely GPU1, GPU2, GPU3, GPU4
  • request:

    • pod1: requests 1 GPU with 20% gpucore and 20% gpumem-percentage
    • pod2: requests 1 GPU with 20% gpucore and 20% gpumem-percentage
  • scheduler result:

    • pod1: scheduled to node1, allocated GPU1
    • pod2: scheduled to node1, allocated GPU1

Story 4

GPU spread: use different GPU cards whenever possible, e.g.:

  • cluster resources:

    • node1: 4 GPU devices, namely GPU1, GPU2, GPU3, GPU4
  • request:

    • pod1: requests 1 GPU with 20% gpucore and 20% gpumem-percentage
    • pod2: requests 1 GPU with 20% gpucore and 20% gpumem-percentage
  • scheduler result:

    • pod1: scheduled to node1, allocated GPU1
    • pod2: scheduled to node1, allocated GPU2

Design Details

Node-scheduler-policy

node-shceduler-policy-demo.png

Binpack

Binpack mainly considers node resource usage: the higher the usage, the higher the score.

score: ((request + used) / allocatable) * 10
  1. Binpack scoring information for Node 1 is as follows:
Node1 score: ((1+3)/4) * 10 = 10
  2. Binpack scoring information for Node 2 is as follows:
Node2 score: ((1+2)/4) * 10 = 7.5

So, with the Binpack policy we select Node1. (A short code sketch covering both node policies follows the Spread subsection below.)

Spread

Spread mainly considers node resource usage: the lower the usage, the lower the score, and under Spread a lower score means a higher priority.

score: ((request + used) / allocatable) * 10
  1. Spread scoring information for Node 1 is as follows:
Node1 score: ((1+3)/4) * 10 = 10
  2. Spread scoring information for Node 2 is as follows:
Node2 score: ((1+2)/4) * 10 = 7.5

So, with the Spread policy we select Node2.
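
To make the node scoring concrete, the following Go sketch reproduces the Node1/Node2 numbers above and shows that Binpack and Spread share the same score and differ only in which end of the ranking they prefer; the nodeUsage type and nodeScore function are illustrative names, not HAMi's implementation.

// Illustrative sketch of the node score ((request + used) / allocatable) * 10.
package main

import "fmt"

type nodeUsage struct {
	name        string
	used        float64 // GPUs already allocated on the node
	allocatable float64 // total GPUs on the node
}

func nodeScore(n nodeUsage, request float64) float64 {
	return (request + n.used) / n.allocatable * 10
}

func main() {
	nodes := []nodeUsage{
		{name: "node1", used: 3, allocatable: 4},
		{name: "node2", used: 2, allocatable: 4},
	}
	request := 1.0

	binpackPick, spreadPick := nodes[0], nodes[0]
	for _, n := range nodes {
		fmt.Printf("%s score: %.1f\n", n.name, nodeScore(n, request)) // node1: 10.0, node2: 7.5
		if nodeScore(n, request) > nodeScore(binpackPick, request) {
			binpackPick = n // Binpack prefers the highest score
		}
		if nodeScore(n, request) < nodeScore(spreadPick, request) {
			spreadPick = n // Spread prefers the lowest score
		}
	}
	fmt.Println("Binpack selects:", binpackPick.name) // node1
	fmt.Println("Spread selects:", spreadPick.name)   // node2
}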

GPU-scheduler-policy

gpu-scheduler-policy-demo.png

Binpack

Binpack mainly focuses on the computing power and video memory usage of each card: the higher the usage, the higher the score.

score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem) * 10
  1. Binpack scoring information for GPU 1 is as follows:
GPU1 Score: ((20+10)/100 + (1000+2000)/8000) * 10 = 6.75
  2. Binpack scoring information for GPU 2 is as follows:
GPU2 Score: ((20+70)/100 + (1000+6000)/8000) * 10 = 17.75

So, with the Binpack policy we select GPU2. (A short code sketch covering both GPU policies follows the Spread subsection below.)

Spread

Spread mainly focuses on the computing power and video memory usage of each card: the lower the usage, the lower the score, and under Spread a lower score means a higher priority.

score: ((request.core + used.core) / allocatable.core + (request.mem + used.mem) / allocatable.mem) * 10
  1. Spread scoring information for GPU 1 is as follows:
GPU1 Score: ((20+10)/100 + (1000+2000)/8000) * 10 = 6.75
  2. Spread scoring information for GPU 2 is as follows:
GPU2 Score: ((20+70)/100 + (1000+6000)/8000) * 10 = 17.75

So, with the Spread policy we select GPU1.
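
The same pattern applies per GPU; this sketch reproduces the GPU1/GPU2 numbers above, again with illustrative names (gpuUsage, gpuScore) rather than HAMi's actual types.

// Illustrative sketch of the per-GPU score:
// ((request.core + used.core)/allocatable.core + (request.mem + used.mem)/allocatable.mem) * 10
package main

import "fmt"

type gpuUsage struct {
	id              string
	usedCore        float64 // already-allocated compute, in percent
	allocatableCore float64
	usedMem         float64 // already-allocated memory, in MB
	allocatableMem  float64
}

func gpuScore(g gpuUsage, reqCore, reqMem float64) float64 {
	return ((reqCore+g.usedCore)/g.allocatableCore + (reqMem+g.usedMem)/g.allocatableMem) * 10
}

func main() {
	gpus := []gpuUsage{
		{id: "GPU1", usedCore: 10, allocatableCore: 100, usedMem: 2000, allocatableMem: 8000},
		{id: "GPU2", usedCore: 70, allocatableCore: 100, usedMem: 6000, allocatableMem: 8000},
	}
	reqCore, reqMem := 20.0, 1000.0

	for _, g := range gpus {
		fmt.Printf("%s score: %.2f\n", g.id, gpuScore(g, reqCore, reqMem)) // GPU1: 6.75, GPU2: 17.75
	}
	// Binpack selects the highest-scoring card (GPU2);
	// Spread selects the lowest-scoring card (GPU1).
}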

Topology-aware

Nvidia Topology-aware (Nvidia GPU Only)

Nvidia Topology-aware primarily focuses on the topological relationships between the GPUs on a node (queried using the nvidia-smi topo -m command). The hami-device-plugin calculates scores between GPUs based on these relationships: the higher the bandwidth between two GPUs, the higher the score. For example:

[
  {
    "uuid": "gpu0",
    "score": {
      "gpu1": "100",
      "gpu2": "100",
      "gpu3": "200"
    }
  },
  {
    "uuid": "gpu1",
    "score": {
      "gpu0": "100",
      "gpu2": "200",
      "gpu3": "100"
    }
  },
  {
    "uuid": "gpu2",
    "score": {
      "gpu0": "100",
      "gpu1": "200",
      "gpu3": "200"
    }
  },
  {
    "uuid": "gpu3",
    "score": {
      "gpu0": "200",
      "gpu1": "100",
      "gpu2": "200"
    }
  }
]
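
As a rough sketch of how such a score list could be consumed, the snippet below parses it into a pairwise lookup table; the deviceScore type and buildTable function are assumptions for illustration and may not match HAMi's real data structures.

// Hypothetical sketch: turn the device-plugin score list into score[a][b] lookups.
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

type deviceScore struct {
	UUID  string            `json:"uuid"`
	Score map[string]string `json:"score"` // peer UUID -> score (a string in the JSON)
}

func buildTable(raw []byte) (map[string]map[string]int, error) {
	var devices []deviceScore
	if err := json.Unmarshal(raw, &devices); err != nil {
		return nil, err
	}
	table := make(map[string]map[string]int, len(devices))
	for _, d := range devices {
		table[d.UUID] = make(map[string]int, len(d.Score))
		for peer, s := range d.Score {
			v, err := strconv.Atoi(s)
			if err != nil {
				return nil, err
			}
			table[d.UUID][peer] = v
		}
	}
	return table, nil
}

func main() {
	raw := []byte(`[{"uuid":"gpu0","score":{"gpu1":"100","gpu2":"100","gpu3":"200"}}]`)
	table, err := buildTable(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(table["gpu0"]["gpu3"]) // 200
}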
One GPU

When a Pod requests only one GPU, the GPU with the worst communication performance with the other GPUs is prioritized: the lower the total score, the higher the scheduling priority. For example:

  1. The sum of scores for gpu0 with other GPUs is as follows:
gpu0 score: 100 + 100 + 200 = 400
  2. The sum of scores for gpu1 with other GPUs is as follows:
gpu1 score: 100 + 200 + 100 = 400
  3. The sum of scores for gpu2 with other GPUs is as follows:
gpu2 score: 100 + 200 + 200 = 500
  4. The sum of scores for gpu3 with other GPUs is as follows:
gpu3 score: 200 + 100 + 200 = 500

Therefore, when a Pod requests only one GPU, we randomly select either gpu0 or gpu1.
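
A short illustrative sketch of this single-GPU selection, using the example score table above; the code is not HAMi's implementation, and ties (gpu0 vs gpu1 here) are broken arbitrarily, matching the random selection described above.

// Illustrative sketch: sum each GPU's pairwise scores and prefer the lowest total.
package main

import "fmt"

func main() {
	// Pairwise scores from the example table above.
	score := map[string]map[string]int{
		"gpu0": {"gpu1": 100, "gpu2": 100, "gpu3": 200},
		"gpu1": {"gpu0": 100, "gpu2": 200, "gpu3": 100},
		"gpu2": {"gpu0": 100, "gpu1": 200, "gpu3": 200},
		"gpu3": {"gpu0": 200, "gpu1": 100, "gpu2": 200},
	}

	selected, selectedSum := "", 0
	for gpu, peers := range score {
		sum := 0
		for _, s := range peers {
			sum += s
		}
		fmt.Printf("%s total: %d\n", gpu, sum) // gpu0: 400, gpu1: 400, gpu2: 500, gpu3: 500
		if selected == "" || sum < selectedSum {
			selected, selectedSum = gpu, sum
		}
	}
	// gpu0 and gpu1 tie at 400; map iteration order is random, so either may be selected.
	fmt.Println("selected:", selected)
}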

More than one GPU

When a Pod requests more than one GPU, the combination with the highest total score is prioritized: the higher the score, the higher the scheduling priority.

For example, if a Pod requests 3 GPUs, take the combination gpu0, gpu1, gpu2. Its score is calculated as:
totalScore = score(gpu0, gpu1) + score(gpu0, gpu2) + score(gpu1, gpu2)

  1. The score for gpu0, gpu1, gpu2 is as follows:
(gpu0, gpu1, gpu2) totalScore: 100 + 100 + 200 = 400
  2. The score for gpu0, gpu1, gpu3 is as follows:
(gpu0, gpu1, gpu3) totalScore: 100 + 200 + 100 = 400
  3. The score for gpu0, gpu2, gpu3 is as follows:
(gpu0, gpu2, gpu3) totalScore: 100 + 200 + 200 = 500
  4. The score for gpu1, gpu2, gpu3 is as follows:
(gpu1, gpu2, gpu3) totalScore: 200 + 100 + 200 = 500

Therefore, when a Pod requests 3 GPUs, one of the highest-scoring combinations is allocated; here both (gpu0, gpu2, gpu3) and (gpu1, gpu2, gpu3) score 500, so a combination such as gpu1, gpu2, gpu3 is chosen.
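
And a matching illustrative sketch for the multi-GPU case, enumerating every 3-GPU combination from the example table and keeping a highest-scoring one; this is not HAMi's code, and tie-breaking between the two 500-point combinations is arbitrary here.

// Illustrative sketch: score each 3-GPU combination as the sum of its pairwise scores.
package main

import "fmt"

func main() {
	score := map[string]map[string]int{
		"gpu0": {"gpu1": 100, "gpu2": 100, "gpu3": 200},
		"gpu1": {"gpu0": 100, "gpu2": 200, "gpu3": 100},
		"gpu2": {"gpu0": 100, "gpu1": 200, "gpu3": 200},
		"gpu3": {"gpu0": 200, "gpu1": 100, "gpu2": 200},
	}
	combos := [][]string{
		{"gpu0", "gpu1", "gpu2"},
		{"gpu0", "gpu1", "gpu3"},
		{"gpu0", "gpu2", "gpu3"},
		{"gpu1", "gpu2", "gpu3"},
	}

	var best []string
	bestTotal := -1
	for _, c := range combos {
		total := 0
		for i := 0; i < len(c); i++ {
			for j := i + 1; j < len(c); j++ {
				total += score[c[i]][c[j]]
			}
		}
		fmt.Println(c, "totalScore:", total) // 400, 400, 500, 500
		if total > bestTotal {
			best, bestTotal = c, total
		}
	}
	// With this enumeration order the first 500-point combination, (gpu0, gpu2, gpu3), is kept;
	// (gpu1, gpu2, gpu3) scores equally well.
	fmt.Println("selected:", best)
}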