版本：v2.5.0

Enable Metax GPU sharing

Introduction

We support metax.com/gpu as follows:

support metax.com/gpu by implementing most device-sharing features as nvidia-GPU
support metax.com/gpu by implementing topo-awareness among metax GPUs

device-sharing features include the following:

GPU sharing: Each task can allocate a portion of GPU instead of a whole GPU card, thus GPU can be shared among multiple tasks.

Device Memory Control: GPUs can be allocated with certain device memory size and have made it that it does not exceed the boundary.

Device compute core limitation: GPUs can be allocated with certain percentage of device core(60 indicate this container uses 60% compute cores of this device)

Prerequisites

Metax Driver >= 2.31.0
Metax GPU Operator >= 0.10.1
Kubernetes >= 1.23

Deploy Metax GPU Operator on metax nodes (Please consult your device provider to aquire its package and document)
Deploy HAMi according to README.md

Running Metax jobs

Metax GPUs can now be requested by a container using the metax-tech.com/sgpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container
      image: cr.metax-tech.com/public-ai-release/c500/colossalai:2.24.0.5-py38-ubuntu20.04-amd64 
      imagePullPolicy: IfNotPresent
      command: ["sleep","infinity"]
      resources:
        limits:
          metax-tech.com/sgpu: 1 # requesting 1 GPU 
          metax-tech.com/vcore: 60 # each GPU use 60% of total compute cores
          metax-tech.com/vmemory: 4 # each GPU require 4 GiB device memory

NOTICE1: You can find more examples in examples/sgpu folder

Introduction​

support metax.com/gpu by implementing most device-sharing features as nvidia-GPU​

Prerequisites​

Enabling GPU-sharing Support​

Running Metax jobs​

Introduction

support metax.com/gpu by implementing most device-sharing features as nvidia-GPU

Prerequisites

Enabling GPU-sharing Support

Running Metax jobs