K8s: A Practical Guide to GPU Compute Slicing with Volcano Priority Scheduling
- 11/10/2025
Introduction
In large-model training, AI inference, and similar workloads, GPU resources often suffer from "surplus compute" or "uneven allocation": a single powerful GPU running only one task sits underutilized, while multiple lightweight tasks may collide over resources and cause scheduling conflicts.
Volcano, a high-performance batch scheduling system in the Kubernetes ecosystem, combined with the HAMi community's volcano-vgpu-device-plugin, enables fine-grained GPU slicing and priority-based scheduling, improving GPU utilization while guaranteeing resources for critical tasks.
This article walks through the deployment, configuration, and verification of the complete solution.
Core Components
- Volcano is a Kubernetes scheduler designed for high-performance computing, AI training, and other batch workloads, supporting gang scheduling, priority scheduling, dynamic resource allocation, and more. Its flexible plugin architecture can be extended to schedule heterogeneous resources such as GPUs, providing the scheduling layer for GPU compute slicing.
- volcano-vgpu-device-plugin, developed by the HAMi community, is the core component for GPU compute slicing. It splits a physical GPU into multiple virtual GPUs (vGPUs) along dimensions such as device memory and core share, and works with the Volcano scheduler to allocate vGPU resources on demand with priority scheduling, matching the resource needs of different tasks.
Environment Setup
Install nvidia-container-toolkit
nvidia-container-toolkit is the base dependency for running GPU containers and must be installed on every node (refer to the official NVIDIA documentation for the exact commands for your OS version). See also the article "How to Run the Hugging Face Large Model StarCoder".
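For Debian/Ubuntu hosts, installation typically follows the pattern below (a sketch based on NVIDIA's repository setup; double-check the official guide for your distribution, as repository URLs can change):
# Register NVIDIA's package repository and signing key, then install the toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit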
Configure Container Runtime
- Configure the containerd runtime, registering nvidia as the runtime type:
sudo nvidia-ctk runtime configure --runtime=containerd
- Edit the containerd configuration file to set the default runtime to nvidia:
sudo vi /etc/containerd/config.toml
Find the following section in the config file and add or modify the default_runtime_name field:
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
- Restart the containerd service for the change to take effect:
sudo systemctl restart containerd
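To confirm the change took effect before moving on, you can inspect the rendered config (a simple check, assuming the default config path):
# The CRI plugin section should now select the nvidia runtime
grep default_runtime_name /etc/containerd/config.toml
# Expected: default_runtime_name = "nvidia"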
Install Volcano (via Helm)
- Add the Volcano Helm repository and update it:
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
- Run the install command, specifying the version, image registry, and core component images:
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace \
--version 1.13.0 \
--set basic.image_registry=swr.cn-east-3.myhuaweicloud.com \
--set basic.controller_image_name=lomtom-common/vc-controller-manager \
--set basic.scheduler_image_name=lomtom-common/vc-scheduler \
--set basic.admission_image_name=lomtom-common/vc-webhook-manager \
--set basic.agent_image_name=lomtom-common/vc-agent
- Verify the installation and confirm all components are running:
kubectl get all -n volcano-system
NAME                                       READY   STATUS    RESTARTS   AGE
pod/volcano-admission-5cdb6d487-ntqlc      1/1     Running   0          61s
pod/volcano-controllers-6667dcbfd5-tp7qq   1/1     Running   0          61s
pod/volcano-scheduler-59c4b58bdd-dmkm2     1/1     Running   0          61s

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/volcano-admission-service     ClusterIP   10.234.15.169   <none>        443/TCP    61s
service/volcano-controllers-service   ClusterIP   10.234.44.42    <none>        8081/TCP   61s
service/volcano-scheduler-service     ClusterIP   10.234.57.51    <none>        8080/TCP   61s

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           61s
deployment.apps/volcano-controllers   1/1     1            0           61s
deployment.apps/volcano-scheduler     1/1     1            1           61s

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5cdb6d487      1         1         1       61s
replicaset.apps/volcano-controllers-6667dcbfd5   1         1         1       61s
replicaset.apps/volcano-scheduler-59c4b58bdd     1         1         1       61s
The expected output shows the volcano-admission, volcano-controllers, and volcano-scheduler Pods in the Running state, with the Deployment and Service resources ready.
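You can additionally confirm that Volcano's CRDs were registered (a quick sanity check; the exact list varies slightly across versions):
kubectl get crd | grep volcano.sh
# Expect entries such as jobs.batch.volcano.sh, podgroups.scheduling.volcano.sh, queues.scheduling.volcano.sh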
Install volcano-vgpu-device-plugin
Configure Container Runtime
The containerd setup here is identical to the Configure Container Runtime steps in the Environment Setup section above: register nvidia as the runtime, set default_runtime_name = "nvidia" in /etc/containerd/config.toml, and restart containerd. If you have already done this, skip ahead.
Configure Volcano to Enable vGPU
Modify the Volcano scheduler configuration to enable the vGPU feature of the deviceshare plugin:
- Edit the Volcano scheduler's ConfigMap:
kubectl edit cm -n volcano-system volcano-scheduler-configmap
- Add deviceshare.VGPUEnable: true under volcano-scheduler.conf; the complete configuration looks like this:
kind: ConfigMap
apiVersion: v1
metadata:
name: volcano-scheduler-configmap
namespace: volcano-system
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
- name: gang
- name: conformance
- plugins:
- name: drf
- name: deviceshare
arguments:
          deviceshare.VGPUEnable: true # enable vGPU
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
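Depending on the Volcano version, the scheduler may reload this ConfigMap automatically; if the vGPU setting does not appear to take effect, restarting the scheduler guarantees the new configuration is loaded:
kubectl -n volcano-system rollout restart deployment volcano-scheduler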
Deploy volcano-vgpu-device-plugin
First label the GPU nodes, then create the plugin's configuration and RBAC resources, and finally deploy the plugin to all GPU nodes via a DaemonSet:
- Label each node that has a GPU (used by the plugin for placement):
kubectl label node <node-name> gpu=on  # replace <node-name> with the actual GPU node name
- Apply the plugin's resources. The complete YAML below contains:
  - Device config ConfigMap: defines the slicing specs for different GPU models (e.g., A30, A100-40GB/80GB) — memory, core share, number of slices — and other vGPU parameters.
  - Node config ConfigMap: defines the vGPU operating mode, device memory oversubscription ratio, and split count for target nodes (e.g., aio-node67).
  - RBAC resources: a ServiceAccount, ClusterRole, and ClusterRoleBinding granting the plugin permission to operate on nodes, Pods, ConfigMaps, and so on.
  - DaemonSet: runs the plugin on every node labeled gpu=on, with a core plugin container (vGPU slicing and resource allocation) and a monitor container (vGPU status monitoring), mounting the required host directories and config files.
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: v1
kind: ConfigMap
metadata:
name: volcano-vgpu-device-config
namespace: kube-system
labels:
app.kubernetes.io/component: volcano-vgpu-device-plugin
data:
device-config.yaml: |-
nvidia:
resourceCountName: volcano.sh/vgpu-number
resourceMemoryName: volcano.sh/vgpu-memory
resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
resourceCoreName: volcano.sh/vgpu-cores
overwriteEnv: false
defaultMemory: 0
defaultCores: 0
defaultGPUNum: 1
deviceSplitCount: 10
deviceMemoryScaling: 1
deviceCoreScaling: 1
gpuMemoryFactor: 1
knownMigGeometries:
- models: [ "A30" ]
allowedGeometries:
- group: group1
geometries:
- name: 1g.6gb
memory: 6144
count: 4
- group: group2
geometries:
- name: 2g.12gb
memory: 12288
count: 2
- group: group3
geometries:
- name: 4g.24gb
memory: 24576
count: 1
- models: [ "A100-SXM4-40GB", "A100-40GB-PCIe", "A100-PCIE-40GB", "A100-SXM4-40GB" ]
allowedGeometries:
- group: "group1"
geometries:
- name: 1g.5gb
memory: 5120
count: 7
- group: "group2"
geometries:
- name: 2g.10gb
memory: 10240
count: 3
- name: 1g.5gb
memory: 5120
count: 1
- group: "group3"
geometries:
- name: 3g.20gb
memory: 20480
count: 2
- group: "group4"
geometries:
- name: 7g.40gb
memory: 40960
count: 1
- models: [ "A100-SXM4-80GB", "A100-80GB-PCIe", "A100-PCIE-80GB"]
allowedGeometries:
- group: "group1"
geometries:
- name: 1g.10gb
memory: 10240
count: 7
- group: "group2"
geometries:
- name: 2g.20gb
memory: 20480
count: 3
- name: 1g.10gb
memory: 10240
count: 1
- group: "group3"
geometries:
- name: 3g.40gb
memory: 40960
count: 2
- group: "group4"
geometries:
- name: 7g.79gb
memory: 80896
count: 1
---
apiVersion: v1
kind: ConfigMap
metadata:
name: volcano-vgpu-node-config
namespace: kube-system
labels:
app.kubernetes.io/component: volcano-vgpu-node-plugin
data:
config.json: |
{
"nodeconfig": [
{
"name": "aio-node67",
"operatingmode": "hami-core",
"devicememoryscaling": 1.8,
"devicesplitcount": 10,
"migstrategy":"none",
"filterdevices": {
"uuid": [],
"index": []
}
}
]
}
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: volcano-device-plugin
namespace: kube-system
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: volcano-device-plugin
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
resources: ["nodes/status"]
verbs: ["patch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "update", "patch", "watch"]
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "create", "update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: volcano-device-plugin
subjects:
- kind: ServiceAccount
name: volcano-device-plugin
namespace: kube-system
roleRef:
kind: ClusterRole
name: volcano-device-plugin
apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: volcano-device-plugin
namespace: kube-system
spec:
selector:
matchLabels:
name: volcano-device-plugin
updateStrategy:
type: RollingUpdate
template:
metadata:
# This annotation is deprecated. Kept here for backward compatibility
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
annotations:
scheduler.alpha.kubernetes.io/critical-pod: ""
labels:
name: volcano-device-plugin
spec:
priorityClassName: "system-node-critical"
serviceAccount: volcano-device-plugin
containers:
- image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/volcano-vgpu-device-plugin:v1.11.0
args: ["--device-split-count=10"]
lifecycle:
postStart:
exec:
command: ["/bin/sh", "-c", "cp -f /k8s-vgpu/lib/nvidia/* /usr/local/vgpu/"]
name: volcano-device-plugin
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: HOOK_PATH
value: "/usr/local/vgpu"
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_MIG_MONITOR_DEVICES
value: "all"
- name: NVIDIA_DRIVER_CAPABILITIES
value: "utility"
securityContext:
allowPrivilegeEscalation: true
privileged: true
capabilities:
drop: ["ALL"]
add: ["SYS_ADMIN"]
volumeMounts:
- name: deviceconfig
mountPath: /config
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
- name: lib
mountPath: /usr/local/vgpu
- name: hosttmp
mountPath: /tmp
- image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/volcano-vgpu-device-plugin:v1.11.0
name: monitor
command:
- /bin/bash
- -c
- volcano-vgpu-monitor
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "all"
- name: NVIDIA_MIG_MONITOR_DEVICES
value: "all"
- name: HOOK_PATH
value: "/tmp/vgpu"
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
securityContext:
privileged: true
allowPrivilegeEscalation: true
capabilities:
drop: ["ALL"]
add: ["SYS_ADMIN"]
volumeMounts:
- name: dockers
mountPath: /run/docker
- name: containerds
mountPath: /run/containerd
- name: sysinfo
mountPath: /sysinfo
- name: hostvar
mountPath: /hostvar
- name: hosttmp
mountPath: /tmp
volumes:
- name: deviceconfig
configMap:
name: volcano-vgpu-node-config
- hostPath:
path: /var/lib/kubelet/device-plugins
type: Directory
name: device-plugin
- hostPath:
path: /usr/local/vgpu
type: DirectoryOrCreate
name: lib
- name: hosttmp
hostPath:
path: /tmp
type: DirectoryOrCreate
- name: dockers
hostPath:
path: /run/docker
type: DirectoryOrCreate
- name: containerds
hostPath:
path: /run/containerd
type: DirectoryOrCreate
- name: usrbin
hostPath:
path: /usr/bin
type: Directory
- name: sysinfo
hostPath:
path: /sys
type: Directory
- name: hostvar
hostPath:
path: /var
type: Directory
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu
                operator: In
                values:
                - "on" # quoted so YAML does not parse it as boolean true
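After applying the manifest, a quick check (a sketch; the resource names follow the device-config ConfigMap above) confirms the plugin is running and that GPU nodes now advertise vGPU resources:
# One ready plugin Pod is expected per node labeled gpu=on
kubectl -n kube-system get ds volcano-device-plugin
kubectl get nodes -l gpu=on
# Allocatable should now include volcano.sh/vgpu-number, volcano.sh/vgpu-memory, volcano.sh/vgpu-cores
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'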
Verification
Priority Scheduling Test
- Create PriorityClasses with different priorities
Define high, medium, and low priority classes to distinguish task importance:
# High priority (the higher the value, the higher the priority)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 100
globalDefault: false
description: "This priority class should be used for volcano job only."
---
# Medium priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: med-priority
value: 50
globalDefault: false
description: "This priority class should be used for volcano job only."
---
# Low priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: low-priority
value: 10
globalDefault: false
description: "This priority class should be used for volcano job only."
- Verify that the PriorityClasses were created
kubectl get priorityclasses.scheduling.k8s.io
Expected output:
NAME            VALUE   GLOBAL-DEFAULT   AGE
high-priority   100     false            8s
low-priority    10      false            8s
med-priority    50      false            8s
- Create test Jobs with different priorities
Create high-, medium-, and low-priority Volcano Jobs to verify the priority scheduling behavior:
# High-priority Job (consumes all cluster resources):
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: priority-high
spec:
schedulerName: volcano
minAvailable: 3
priorityClassName: high-priority
tasks:
- replicas: 3
name: "test"
template:
spec:
containers:
- image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20
command: ["/bin/sh", "-c", "sleep 1000"]
imagePullPolicy: IfNotPresent
name: running
resources:
requests:
cpu: "2"
---
# Medium-priority Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: priority-med
spec:
schedulerName: volcano
minAvailable: 3
priorityClassName: med-priority
tasks:
- replicas: 3
name: "test"
template:
spec:
containers:
- image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20
command: ["/bin/sh", "-c", "sleep 1000"]
imagePullPolicy: IfNotPresent
name: running
resources:
requests:
cpu: "2"
---
# Low-priority Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: priority-low
spec:
schedulerName: volcano
minAvailable: 3
priorityClassName: low-priority
tasks:
- replicas: 3
name: "test"
template:
spec:
containers:
- image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/alpine:3.20
command: ["/bin/sh", "-c", "sleep 1000"]
imagePullPolicy: IfNotPresent
name: running
resources:
requests:
cpu: "2"
Expected result: the high-priority Job's Pods run normally while the medium- and low-priority Jobs stay Pending, since their resources are held by the high-priority tasks:
kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
priority-high-test-0   1/1     Running   0          4m34s
priority-high-test-1   1/1     Running   0          4m34s
priority-high-test-2   1/1     Running   0          4m34s
kubectl get jobs.batch.volcano.sh
NAME            STATUS    MINAVAILABLE   RUNNINGS   AGE
priority-high   Running   3              3          4m53s
priority-low    Pending   3                         3m20s
priority-med    Pending   3                         3m20s
- Delete the high-priority Job to release its resources:
kubectl delete jobs.batch.volcano.sh priority-high
- Watch the Pod status changes in real time:
kubectl get po -w
NAME                   READY   STATUS              RESTARTS   AGE
priority-high-test-0   1/1     Running             0          6m1s
priority-high-test-1   1/1     Running             0          6m1s
priority-high-test-2   1/1     Running             0          6m1s
priority-high-test-0   1/1     Terminating         0          6m11s
priority-high-test-1   1/1     Terminating         0          6m11s
priority-high-test-2   1/1     Terminating         0          6m11s
priority-high-test-0   1/1     Terminating         0          6m41s
priority-high-test-1   1/1     Terminating         0          6m41s
priority-high-test-2   1/1     Terminating         0          6m41s
priority-high-test-1   0/1     Terminating         0          6m42s
priority-high-test-1   0/1     Terminating         0          6m42s
priority-high-test-1   0/1     Terminating         0          6m42s
priority-high-test-0   0/1     Terminating         0          6m42s
priority-high-test-0   0/1     Terminating         0          6m42s
priority-high-test-0   0/1     Terminating         0          6m42s
priority-high-test-2   0/1     Terminating         0          6m42s
priority-high-test-2   0/1     Terminating         0          6m42s
priority-high-test-2   0/1     Terminating         0          6m42s
priority-med-test-2    0/1     Pending             0          0s
priority-med-test-1    0/1     Pending             0          0s
priority-med-test-0    0/1     Pending             0          0s
priority-med-test-2    0/1     Pending             0          1s
priority-med-test-1    0/1     Pending             0          1s
priority-med-test-0    0/1     Pending             0          1s
priority-med-test-2    0/1     ContainerCreating   0          1s
priority-med-test-1    0/1     ContainerCreating   0          1s
priority-med-test-2    0/1     ContainerCreating   0          2s
priority-med-test-1    0/1     ContainerCreating   0          2s
priority-med-test-0    0/1     ContainerCreating   0          2s
priority-med-test-0    0/1     ContainerCreating   0          2s
priority-med-test-2    1/1     Running             0          3s
priority-med-test-0    1/1     Running             0          4s
priority-med-test-1    1/1     Running             0          4s
Expected result: once the high-priority Job's Pods terminate, the medium-priority Job's Pods are scheduled first while the low-priority Job remains Pending, confirming that priority scheduling works correctly.
kubectl get jobs.batch.volcano.sh
NAME            STATUS    MINAVAILABLE   RUNNINGS   AGE
priority-low    Pending   3                         5m17s
priority-med    Running   3              3          5m17s
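One caveat: with the actions configured earlier ("enqueue, allocate, backfill"), priority only determines the order in which pending jobs are scheduled; a newly submitted high-priority job will not evict Pods that are already running. Volcano also ships preempt and reclaim actions for active eviction; if that behavior is desired, they can be added to the actions line in the scheduler ConfigMap (a sketch; verify the semantics against your Volcano version):
actions: "enqueue, allocate, preempt, reclaim, backfill"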
GPU Slicing Test
- Create a test Pod
Deploy a Pod that requests vGPU resources, with Volcano as its scheduler, using the following YAML:
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
annotations:
volcano.sh/vgpu-mode: "hami-core" # (Optional, 'hami-core' or 'mig')
spec:
schedulerName: volcano
containers:
- name: cuda-container
image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/pytorch:2.1.2-cuda12.1-cudnn8-runtime-ubuntu22.04
command: ["sleep"]
args: ["100000"]
resources:
limits:
          volcano.sh/vgpu-number: 1 # request 1 GPU card
          volcano.sh/vgpu-memory: 1000 # (optional) each vGPU is limited to 1000 MiB of device memory
          volcano.sh/vgpu-cores: 10 # (optional) each vGPU is limited to 10% of the GPU cores
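Save the manifest (for example as gpu-pod.yaml, a file name chosen here for illustration), apply it, and confirm the Pod is scheduled by Volcano and running:
kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod -o wide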
- Verify the GPU slicing result
- Run nvidia-smi on the host node to check GPU usage:
# Output on the host
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 45C P8 14W / 70W | 3MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
- Run nvidia-smi inside the Pod to check the vGPU allocation:
# Output inside the Pod
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.03 Driver Version: 550.144.03 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 Off | 00000000:00:06.0 Off | 0 |
| N/A 46C P8 14W / 70W | 0MiB / 1000MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
The Pod sees only the 1000 MiB of device memory allocated to it (the vGPU slice), confirming that GPU compute slicing works correctly.
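As an optional functional check (a sketch assuming the PyTorch image used above), you can also confirm from inside the Pod that the framework sees the capped device memory rather than the full physical GPU:
# Inside the Pod: total_memory should reflect the vGPU cap, not the physical 15360MiB
python -c "import torch; p = torch.cuda.get_device_properties(0); print(torch.cuda.is_available(), p.total_memory)"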