Version: v2.5.0

Enable Metax GPU sharing

Introduction

We support metax.com/gpu as follows:

  • Support metax.com/gpu by implementing most of the device-sharing features available for NVIDIA GPUs
  • Support metax.com/gpu by implementing topology awareness among Metax GPUs

Support metax.com/gpu by implementing most of the device-sharing features available for NVIDIA GPUs

Device-sharing features include the following:

GPU sharing: Each task can be allocated a portion of a GPU instead of a whole GPU card, so a single GPU can be shared among multiple tasks.

Device memory control: A task can be allocated a specific amount of device memory, and its usage is not allowed to exceed that limit.

Device compute core limitation: A task can be allocated a certain percentage of a device's compute cores (for example, 60 indicates that the container uses 60% of the compute cores of the device).

Prerequisites

  • Metax Driver >= 2.31.0
  • Metax GPU Operator >= 0.10.1
  • Kubernetes >= 1.23

Enabling GPU-sharing Support

  • Deploy Metax GPU Operator on Metax nodes (please consult your device provider to acquire its package and documentation)

  • Deploy HAMi according to README.md

Running Metax jobs

Metax GPUs can now be requested by a container using the metax-tech.com/sgpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1 # request 1 GPU
          metax-tech.com/vcore: 60 # each GPU uses 60% of its total compute cores
          metax-tech.com/vmemory: 4 # each GPU requires 4 GiB of device memory
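
To illustrate the GPU sharing feature, the following is a minimal sketch of two Pods whose combined requests could fit on one card. The Pod names gpu-pod-a and gpu-pod-b and the 60/40 core split are hypothetical values chosen for illustration. The image and resource names are taken from the example above. Whether both Pods actually land on the same physical GPU depends on the scheduler's policy and on the free capacity of each card, so treat this as an assumption rather than a guarantee.

```yaml
# Sketch only: two Pods that each request a slice of a Metax GPU.
# Pod names and the 60/40 split are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-a
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1     # one shared GPU
          metax-tech.com/vcore: 60   # 60% of that GPU's compute cores
          metax-tech.com/vmemory: 4  # 4 GiB of device memory
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-b
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1     # one shared GPU
          metax-tech.com/vcore: 40   # 40% of that GPU's compute cores
          metax-tech.com/vmemory: 4  # 4 GiB of device memory
```

Both manifests can be saved in one file and submitted with a single kubectl apply -f command. The core percentages sum to 100 and the memory requests total 8 GiB, so the two requests do not oversubscribe a single card that has at least that much free device memory.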

NOTICE1: You can find more examples in the examples/sgpu folder.