K8s GPU Compute Slicing with Volcano Priority Scheduling: A Practical Guide


Introduction

In large-model training, AI inference, and similar scenarios, GPU resources often suffer from "surplus compute" or "uneven allocation": a single powerful GPU running only one task sits underutilized, while multiple lightweight tasks may collide over resources and cause scheduling conflicts.

Volcano, a high-performance batch scheduling system in the Kubernetes ecosystem, combined with the HAMi community's volcano-vgpu-device-plugin, enables fine-grained GPU slicing and priority scheduling: it raises GPU utilization while guaranteeing resource priority for critical tasks.

This article walks through the deployment, configuration, and verification of the full solution.

Core Components

  • Volcano is a Kubernetes scheduler designed for batch workloads such as HPC and AI training. It supports gang scheduling, priority scheduling, and dynamic resource allocation, and its pluggable architecture extends to heterogeneous resources such as GPUs, providing the scheduling layer for GPU compute slicing.

  • volcano-vgpu-device-plugin, developed by the HAMi community, is the core component for GPU slicing. It splits a physical GPU into multiple virtual GPUs (vGPUs) along dimensions such as device memory and core share, and works with the Volcano scheduler to allocate vGPU resources on demand with priority awareness, matching the needs of different workloads.

Environment Setup

Installing nvidia-container-toolkit

nvidia-container-toolkit is the base dependency for running GPU containers and must be installed on every GPU node (see the NVIDIA documentation for the commands matching your OS version). See also: How to Run the Hugging Face StarCoder Large Model.
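
As a sketch for Ubuntu/Debian hosts (follow the NVIDIA docs for other distributions and for the current repository layout):

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit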

Configuring the Container Runtime

  1. Configure the containerd runtime, registering nvidia as a runtime:
sudo nvidia-ctk runtime configure --runtime=containerd
  2. Edit the containerd config file to set the default runtime to nvidia:
sudo vi /etc/containerd/config.toml

Locate the following hierarchy in the config file and add or modify the default_runtime_name field:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
  3. Restart the containerd service to apply the change:
sudo systemctl restart containerd
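
To confirm the change took effect, you can dump the merged containerd configuration and check the default runtime name; it should now read nvidia:

sudo containerd config dump | grep default_runtime_name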

Installing Volcano (via Helm)

  1. Add the Volcano Helm repository and update it:
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
  2. Run the install command, specifying the version, image registry, and core component images:
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace \
--version 1.13.0 \
--set basic.image_registry=swr.cn-east-3.myhuaweicloud.com \
--set basic.controller_image_name=lomtom-common/vc-controller-manager \
--set basic.scheduler_image_name=lomtom-common/vc-scheduler \
--set basic.admission_image_name=lomtom-common/vc-webhook-manager \
--set basic.agent_image_name=lomtom-common/vc-agent
  3. Verify the installation and make sure all components are running:
kubectl get all -n volcano-system
NAME                                       READY   STATUS              RESTARTS   AGE
pod/volcano-admission-5cdb6d487-ntqlc      1/1     Running             0          61s
pod/volcano-controllers-6667dcbfd5-tp7qq   1/1     Running             0          61s
pod/volcano-scheduler-59c4b58bdd-dmkm2     1/1     Running             0          61s

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/volcano-admission-service     ClusterIP   10.234.15.169   <none>        443/TCP    61s
service/volcano-controllers-service   ClusterIP   10.234.44.42    <none>        8081/TCP   61s
service/volcano-scheduler-service     ClusterIP   10.234.57.51    <none>        8080/TCP   61s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           61s
deployment.apps/volcano-controllers   1/1     1            0           61s
deployment.apps/volcano-scheduler     1/1     1            1           61s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5cdb6d487      1         1         1       61s
replicaset.apps/volcano-controllers-6667dcbfd5   1         1         1       61s
replicaset.apps/volcano-scheduler-59c4b58bdd     1         1         1       61s

The expected output shows the volcano-admission, volcano-controllers, and volcano-scheduler Pods in Running state, with their Deployment and Service resources ready.
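
You can also confirm that the Volcano CRDs (jobs.batch.volcano.sh, podgroups.scheduling.volcano.sh, queues.scheduling.volcano.sh, and so on) were registered:

kubectl get crd | grep volcano.sh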

Installing volcano-vgpu-device-plugin

Before deploying the plugin, make sure containerd's default runtime on every GPU node is set to nvidia, following the steps in "Configuring the Container Runtime" above.

Enabling vGPU Support in Volcano

Modify the Volcano scheduler configuration to enable the vGPU feature of the deviceshare plugin:

  1. Edit the Volcano scheduler's ConfigMap:
kubectl edit cm -n volcano-system volcano-scheduler-configmap
  2. In volcano-scheduler.conf, add deviceshare.VGPUEnable: true under the deviceshare plugin's arguments. The full configuration:
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.VGPUEnable: true # enable vGPU
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
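
The scheduler reads this ConfigMap on startup; if the new deviceshare settings do not appear to take effect, restarting the scheduler Deployment forces a reload (a precaution only; some Volcano versions hot-reload the file):

kubectl rollout restart deployment -n volcano-system volcano-scheduler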

Deploying volcano-vgpu-device-plugin

First label the GPU nodes, then create the plugin's configuration and RBAC resources, and finally deploy the plugin to all GPU nodes via a DaemonSet:

  1. Label the nodes that have GPUs (used by the plugin's node affinity):
kubectl label node <node-name> gpu=on  # replace <node-name> with your GPU node's name
  2. Deploy the plugin resources (device-config ConfigMap, node-config ConfigMap, RBAC, and DaemonSet); the complete YAML is below.

It includes:

  • Device-config ConfigMap: defines the slicing specs (memory, core share, split count) and vGPU resource parameters for GPU models such as the A30 and A100 40GB/80GB.
  • Node-config ConfigMap: defines the vGPU operating mode, memory scaling factor, and split count for specific nodes (e.g. aio-node67).
  • RBAC resources: a ServiceAccount, ClusterRole, and ClusterRoleBinding that grant the plugin access to nodes, Pods, ConfigMaps, and related resources.
  • DaemonSet: runs the plugin on every node labeled gpu=on, with a core plugin container (vGPU slicing and allocation) and a monitor container (vGPU state monitoring), mounting the required host directories and config files.
# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-vgpu-device-config
  namespace: kube-system
  labels:
    app.kubernetes.io/component: volcano-vgpu-device-plugin
data:
  device-config.yaml: |-
    nvidia:
      resourceCountName: volcano.sh/vgpu-number
      resourceMemoryName: volcano.sh/vgpu-memory
      resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
      resourceCoreName: volcano.sh/vgpu-cores
      overwriteEnv: false
      defaultMemory: 0
      defaultCores: 0
      defaultGPUNum: 1
      deviceSplitCount: 10
      deviceMemoryScaling: 1
      deviceCoreScaling: 1
      gpuMemoryFactor: 1
      knownMigGeometries:
      - models: [ "A30" ]
        allowedGeometries:
          - group: group1
            geometries: 
            - name: 1g.6gb
              memory: 6144
              count: 4
          - group: group2
            geometries: 
            - name: 2g.12gb
              memory: 12288
              count: 2
          - group: group3
            geometries: 
            - name: 4g.24gb
              memory: 24576
              count: 1
      - models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB", "A100-SXM4-40GB" ]
        allowedGeometries:
          - group: "group1" 
            geometries: 
            - name: 1g.5gb
              memory: 5120
              count: 7
          - group: "group2"
            geometries: 
            - name: 2g.10gb
              memory: 10240
              count: 3
            - name: 1g.5gb
              memory: 5120
              count: 1
          - group: "group3"
            geometries: 
            - name: 3g.20gb
              memory: 20480
              count: 2
          - group: "group4"
            geometries: 
            - name: 7g.40gb
              memory: 40960
              count: 1
      - models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"]
        allowedGeometries:
          - group: "group1" 
            geometries: 
            - name: 1g.10gb
              memory: 10240
              count: 7
          - group: "group2"
            geometries: 
            - name: 2g.20gb
              memory: 20480
              count: 3
            - name: 1g.10gb
              memory: 10240
              count: 1
          - group: "group3"
            geometries: 
            - name: 3g.40gb
              memory: 40960
              count: 2
          - group: "group4"
            geometries: 
            - name: 7g.79gb
              memory: 80896
              count: 1
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-vgpu-node-config
  namespace: kube-system
  labels:
    app.kubernetes.io/component: volcano-vgpu-node-plugin
data:
  config.json: |
    {
        "nodeconfig": [
            {
                "name": "aio-node67",
                "operatingmode": "hami-core",
                "devicememoryscaling": 1.8,
                "devicesplitcount": 10,
                "migstrategy":"none",
                "filterdevices": {
                  "uuid": [],
                  "index": []
                }
            }
        ]
    }
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: volcano-device-plugin
  namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: volcano-device-plugin
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["nodes/status"]
  verbs: ["patch"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "update", "patch", "watch"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: volcano-device-plugin
subjects:
- kind: ServiceAccount
  name: volcano-device-plugin
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: volcano-device-plugin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: volcano-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: volcano-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: volcano-device-plugin
    spec:
      priorityClassName: "system-node-critical"
      serviceAccount: volcano-device-plugin
      containers:
      - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/volcano-vgpu-device-plugin:v1.11.0
        args: ["--device-split-count=10"]
        lifecycle:
          postStart:
            exec:
              command: ["/bin/sh", "-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
        name: volcano-device-plugin
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: HOOK_PATH
          value: "/usr/local/vgpu"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "utility"
        securityContext:
          allowPrivilegeEscalation: true
          privileged: true
          capabilities:
            drop: ["ALL"]
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: deviceconfig
          mountPath: /config
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: lib
          mountPath: /usr/local/vgpu
        - name: hosttmp
          mountPath: /tmp
      - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/volcano-vgpu-device-plugin:v1.11.0
        name: monitor
        command:
        - /bin/bash
        - -c
        - volcano-vgpu-monitor
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: "all"
        - name: HOOK_PATH
          value: "/tmp/vgpu"
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          privileged: true
          allowPrivilegeEscalation: true
          capabilities:
            drop: ["ALL"]
            add: ["SYS_ADMIN"]
        volumeMounts:
        - name: dockers
          mountPath: /run/docker
        - name: containerds
          mountPath: /run/containerd
        - name: sysinfo
          mountPath: /sysinfo
        - name: hostvar
          mountPath: /hostvar
        - name: hosttmp
          mountPath: /tmp
      volumes:
      - name: deviceconfig
        configMap:
          name: volcano-vgpu-node-config
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: Directory
        name: device-plugin
      - hostPath:
          path: /usr/local/vgpu
          type: DirectoryOrCreate
        name: lib
      - name: hosttmp
        hostPath:
          path: /tmp
          type: DirectoryOrCreate
      - name: dockers
        hostPath:
          path: /run/docker
          type: DirectoryOrCreate
      - name: containerds
        hostPath:
          path: /run/containerd
          type: DirectoryOrCreate
      - name: usrbin
        hostPath:
          path: /usr/bin
          type: Directory
      - name: sysinfo
        hostPath:
          path: /sys
          type: Directory
      - name: hostvar
        hostPath:
          path: /var
          type: Directory
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: gpu
                    operator: In
                    values:
                      - "on" # quote it: bare "on" parses as a YAML boolean
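
Once the DaemonSet Pods are up, each GPU node should advertise the vGPU extended resources in its allocatable list. A quick check (quantities will vary with your GPU model and the scaling settings above):

kubectl get pods -n kube-system -l name=volcano-device-plugin
kubectl describe node <node-name> | grep volcano.sh/vgpu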

Verification

Priority Scheduling Test

  1. Create PriorityClasses with different priorities

Define high, medium, and low priority classes to rank task importance:

# High priority (a larger value means higher priority)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: "This priority class should be used for volcano job only."
---
# Medium priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: med-priority
value: 50
globalDefault: false
description: "This priority class should be used for volcano job only."
---
# Low priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
globalDefault: false
description: "This priority class should be used for volcano job only."
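
Apply the three PriorityClasses (assuming the manifest above is saved as priorityclass.yaml, a hypothetical filename):

kubectl apply -f priorityclass.yaml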
  2. Verify the PriorityClasses:
kubectl get priorityclasses.scheduling.k8s.io

Expected output:

NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             100          false            8s
low-priority              10           false            8s
med-priority              50           false            8s
  3. Create test Jobs with different priorities

Create high-, medium-, and low-priority Volcano Jobs to observe the priority scheduling behavior:

# High-priority Job (consumes all cluster resources):
apiVersion: batch.volcano.sh/v1alpha1  
kind: Job  
metadata:  
  name: priority-high  
spec:  
  schedulerName: volcano  
  minAvailable: 3  
  priorityClassName: high-priority  
  tasks:  
    - replicas: 3
      name: "test"
      template:  
        spec:  
          containers:  
            - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20  
              command: ["/bin/sh", "-c", "sleep 1000"]  
              imagePullPolicy: IfNotPresent  
              name: running  
              resources:  
                requests:  
                  cpu: "2"  
---
# Medium-priority Job
apiVersion: batch.volcano.sh/v1alpha1  
kind: Job  
metadata:  
  name: priority-med  
spec:  
  schedulerName: volcano  
  minAvailable: 3  
  priorityClassName: med-priority  
  tasks:  
    - replicas: 3
      name: "test"
      template:  
        spec:  
          containers:  
            - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20  
              command: ["/bin/sh", "-c", "sleep 1000"]  
              imagePullPolicy: IfNotPresent  
              name: running  
              resources:  
                requests:  
                  cpu: "2"  
---
# Low-priority Job
apiVersion: batch.volcano.sh/v1alpha1  
kind: Job  
metadata:  
  name: priority-low  
spec:  
  schedulerName: volcano  
  minAvailable: 3  
  priorityClassName: low-priority  
  tasks:  
    - replicas: 3
      name: "test"
      template:  
        spec:  
          containers:  
            - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20  
              command: ["/bin/sh", "-c", "sleep 1000"]  
              imagePullPolicy: IfNotPresent  
              name: running  
              resources:  
                requests:  
                  cpu: "2"  
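
Apply all three Jobs at once (again assuming the combined manifest is saved locally, here as priority-jobs.yaml):

kubectl apply -f priority-jobs.yaml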

Expected result: the high-priority Job's Pods run normally, while the medium- and low-priority Jobs stay Pending because the high-priority Job holds the resources:

kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
priority-high-test-0   1/1     Running   0          4m34s
priority-high-test-1   1/1     Running   0          4m34s
priority-high-test-2   1/1     Running   0          4m34s

kubectl get jobs.batch.volcano.sh 
NAME            STATUS    MINAVAILABLE   RUNNINGS   AGE
priority-high   Running   3              3          4m53s
priority-low    Pending   3                         3m20s
priority-med    Pending   3                         3m20s
  4. Delete the high-priority Job to release its resources:
kubectl delete jobs.batch.volcano.sh priority-high 
  5. Watch the Pod status changes in real time:
kubectl get po -w
NAME                   READY   STATUS    RESTARTS   AGE
priority-high-test-0   1/1     Running   0          6m1s
priority-high-test-1   1/1     Running   0          6m1s
priority-high-test-2   1/1     Running   0          6m1s
priority-high-test-0   1/1     Terminating   0          6m11s
priority-high-test-1   1/1     Terminating   0          6m11s
priority-high-test-2   1/1     Terminating   0          6m11s
priority-high-test-0   1/1     Terminating   0          6m41s
priority-high-test-1   1/1     Terminating   0          6m41s
priority-high-test-2   1/1     Terminating   0          6m41s
priority-high-test-1   0/1     Terminating   0          6m42s
priority-high-test-1   0/1     Terminating   0          6m42s
priority-high-test-1   0/1     Terminating   0          6m42s
priority-high-test-0   0/1     Terminating   0          6m42s
priority-high-test-0   0/1     Terminating   0          6m42s
priority-high-test-0   0/1     Terminating   0          6m42s
priority-high-test-2   0/1     Terminating   0          6m42s
priority-high-test-2   0/1     Terminating   0          6m42s
priority-high-test-2   0/1     Terminating   0          6m42s
priority-med-test-2    0/1     Pending       0          0s
priority-med-test-1    0/1     Pending       0          0s
priority-med-test-0    0/1     Pending       0          0s
priority-med-test-2    0/1     Pending       0          1s
priority-med-test-1    0/1     Pending       0          1s
priority-med-test-0    0/1     Pending       0          1s
priority-med-test-2    0/1     ContainerCreating   0          1s
priority-med-test-1    0/1     ContainerCreating   0          1s
priority-med-test-2    0/1     ContainerCreating   0          2s
priority-med-test-1    0/1     ContainerCreating   0          2s
priority-med-test-0    0/1     ContainerCreating   0          2s
priority-med-test-0    0/1     ContainerCreating   0          2s
priority-med-test-2    1/1     Running             0          3s
priority-med-test-0    1/1     Running             0          4s
priority-med-test-1    1/1     Running             0          4s

Expected result: after the high-priority Job's Pods terminate, the medium-priority Job's Pods are scheduled first while the low-priority Job remains Pending, confirming that priority scheduling works. (Because each Job sets minAvailable: 3, gang scheduling admits all three replicas of a Job together or not at all.)

kubectl get jobs.batch.volcano.sh 
NAME           STATUS    MINAVAILABLE   RUNNINGS   AGE
priority-low   Pending   3                         5m17s
priority-med   Running   3              3          5m17s

GPU Slicing Test

  1. Create a test Pod

Deploy a Pod that requests vGPU resources, using Volcano as the scheduler:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  annotations:
    volcano.sh/vgpu-mode: "hami-core" # (Optional, 'hami-core' or 'mig')
spec:
  schedulerName: volcano
  containers:
    - name: cuda-container
      image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/pytorch:2.1.2-cuda12.1-cudnn8-runtime-ubuntu22.04
      command: ["sleep"]
      args: ["100000"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1 # request 1 GPU
          volcano.sh/vgpu-memory: 1000 # (optional) each vGPU gets 1000 MiB of device memory
          volcano.sh/vgpu-cores: 10 # (optional) each vGPU gets 10% of the GPU cores
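
Apply the manifest (assuming it is saved as gpu-pod.yaml, a hypothetical filename) and wait for the Pod to reach Running:

kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod -w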
  2. Verify the GPU slicing result
  • On the host node, run nvidia-smi to check GPU usage:
# Output on the host
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:06.0 Off |                    0 |
| N/A   45C    P8             14W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Inside the Pod, run nvidia-smi to check the vGPU allocation:
# Output inside the Pod
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:00:06.0 Off |                    0 |
| N/A   46C    P8             14W /   70W |       0MiB /   1000MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The Pod sees only its allocated 1000 MiB of device memory (the vGPU slice), confirming that GPU compute slicing works as expected.
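
As an extra check beyond nvidia-smi, you can query the memory limit from inside the Pod with PyTorch (the image above ships PyTorch; torch.cuda.mem_get_info() goes through the CUDA memory-info call that the HAMi hook intercepts, so the reported total should be close to the 1000 MiB limit):

kubectl exec -it gpu-pod -- python -c "import torch; print(torch.cuda.mem_get_info())"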

References

  1. Volcano official documentation
  2. volcano-vgpu-device-plugin official repository

Title: K8s GPU Compute Slicing with Volcano Priority Scheduling: A Practical Guide

Author: lomtom

Link: https://lomtom.cn/dzdqls2tvt80f