K8s: A Practical Guide to NPU Compute Slicing with Volcano Priority Scheduling
Introduction
In AI model training, inference, and high-performance computing scenarios, Huawei Ascend NPUs (Neural Processing Units), as dedicated heterogeneous compute resources, often face a dual problem of idle compute and scheduling conflicts: a single NPU has ample compute but is underutilized when it serves only one task, while multiple lightweight tasks requesting it at the same time can block each other through resource contention.
Volcano, a high-performance batch scheduling system in the Kubernetes ecosystem, combined with the HAMi community's hami-ascend-device-plugin, enables fine-grained slicing and priority scheduling of NPU resources: it raises NPU utilization while guaranteeing preemption rights for critical tasks.
Based on Volcano v1.13.0, this article walks through the full deployment, configuration, and verification workflow for NPU compute slicing, providing a directly reusable operations guide.
Core Components
- Volcano: a Kubernetes scheduler built for batch computing workloads, with core support for gang scheduling, priority scheduling, and dynamic allocation of heterogeneous resources. This guide uses a temporary image built from the v1.13.0 branch (the new NPU scheduling features have not yet been merged into the main branch); once the upstream PR is merged, it can be replaced with an official release. Its plugin architecture makes it straightforward to extend scheduling support to heterogeneous resources such as NPUs.
- hami-ascend-device-plugin: developed by the HAMi community, this plugin is the core component for NPU compute slicing. It splits a physical NPU into multiple virtual NPUs (vNPUs) along the memory, AI Core, and AI CPU dimensions, and works with the Volcano scheduler to allocate vNPU resources on demand with priority scheduling. It supports mainstream NPU models such as the Ascend 310P and 910B.
- Ascend Docker Runtime: Huawei's dedicated container runtime for Ascend, which adapts containers to the underlying NPU hardware so that workloads inside containers can access NPU resources. It is a base dependency for containerized NPU scheduling.
Environment Setup
Install and Configure Volcano
- Install the Volcano base components
Follow A Practical Guide to GPU Compute Slicing with Volcano Priority Scheduling to deploy the base components, and confirm that the core components in the volcano-system namespace are running.
- Replace the Volcano scheduler image (to enable the NPU features)
kubectl set image -n volcano-system deployment.v1.apps/volcano-scheduler volcano-scheduler=swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-scheduler:v1.13.0-11-07
Because the NPU scheduling features have not yet been merged into the Volcano main branch, a temporary image is required for now; once the upstream PR 🔗 is merged, replace it with the corresponding official release image.
You can also build the images from source (optional):
# clone, then switch to the branch yourself
git clone https://github.com/lomtom/volcano.git
# build all
make images DOCKER_PLATFORMS="linux/arm64" BUILDX_OUTPUT_TYPE=registry IMAGE_PREFIX=hb.grgbanking.com/ouyanglongtong TAG=v1.13.0-11-07
# build one (replace the registry address with your own)
name=scheduler
docker buildx build -t "swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-$name:v1.13.0-11-07" . -f ./installer/dockerfile/$name/Dockerfile --output=type=registry --platform linux/arm64 --build-arg APK_MIRROR= --build-arg OPEN_EULER_IMAGE_TAG=22.03-lts-sp2
# set image (replace the registry address with your own)
kubectl set image -n volcano-system deployment.v1.apps/volcano-scheduler volcano-scheduler=swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-scheduler:v1.13.0-11-07
- Configure Volcano to enable NPU scheduling
Edit the Volcano scheduler ConfigMap to enable the NPU-related settings of the deviceshare plugin:
kubectl edit cm -n volcano-system volcano-scheduler-configmap
The full configuration is as follows:
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.ASCEND310PVNPUEnable: true # enable Ascend 310P vNPU support
          deviceshare.AscendVNPUEnable: true # enable vNPU support for the Ascend series
          deviceshare.KnownGeometriesCMNamespace: kube-system # namespace of the device ConfigMap
          deviceshare.KnownGeometriesCMName: hami-scheduler-device # name of the device ConfigMap
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
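The four `deviceshare.*` arguments above are what enable the NPU scheduling path. As a quick sanity check before restarting the scheduler, a small script can scan the conf text for them. This is only a sketch: `missing_deviceshare_args` is a hypothetical helper, and in practice you would feed it the ConfigMap contents fetched with `kubectl`.

```python
# Check that the deviceshare arguments required for NPU scheduling are
# present in a volcano-scheduler.conf snippet (plain substring scan;
# hypothetical helper for illustration only).

REQUIRED_ARGS = [
    "deviceshare.ASCEND310PVNPUEnable",
    "deviceshare.AscendVNPUEnable",
    "deviceshare.KnownGeometriesCMNamespace",
    "deviceshare.KnownGeometriesCMName",
]

def missing_deviceshare_args(conf: str) -> list[str]:
    """Return the required deviceshare arguments absent from the conf text."""
    return [arg for arg in REQUIRED_ARGS if arg not in conf]

conf = """
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: deviceshare
    arguments:
      deviceshare.ASCEND310PVNPUEnable: true
      deviceshare.AscendVNPUEnable: true
      deviceshare.KnownGeometriesCMNamespace: kube-system
      deviceshare.KnownGeometriesCMName: hami-scheduler-device
"""
print(missing_deviceshare_args(conf))  # → []
```

An empty list means all four arguments are present; anything else lists what still needs to be added.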
Install and Configure Ascend Docker Runtime
- Download and install the runtime
# download the Ascend Docker Runtime installer
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.2.RC1/Ascend-docker-runtime_7.2.RC1_linux-aarch64.run
# make the installer executable
chmod +x Ascend-docker-runtime_7.2.RC1_linux-aarch64.run
# check the installer (compatibility check only; no real verification logic)
./Ascend-docker-runtime_7.2.RC1_linux-aarch64.run --check
# run the installation
./Ascend-docker-runtime_7.2.RC1_linux-aarch64.run --install
After a successful installation, the runtime is deployed under /usr/local/Ascend/Ascend-Docker-Runtime/.
- Configure containerd to use the runtime (not needed in Docker environments)
Generate the default containerd config file (skip if one already exists):
containerd config default > /etc/containerd/config.toml
Edit the config file and set Ascend Docker Runtime as the container runtime:
vim /etc/containerd/config.toml
Locate the following section and modify it:
...
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runtime.v1.linux"
[plugins."io.containerd.runtime.v1.linux"]
  shim = "containerd-shim"
  runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
...
Restart containerd to apply the change:
systemctl daemon-reload && systemctl restart containerd
Deploy hami-ascend-device-plugin
- Label the target NPU node so the DaemonSet can be scheduled onto it precisely:
kubectl label node <node-name> ascend=on # replace <node-name> with the actual NPU node name
- Create the NPU device ConfigMap
Define the slicing specifications (memory, AI Core, AI CPU allocation) for NPU models such as the Ascend 310P3 and 910B4; adjust them to your actual hardware. The YAML is as follows:
apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
data:
  device-config.yaml: |-
    vnpus:
    # Ascend 910B4 NPU configuration
    - chipName: 910B4
      commonWord: Ascend910B4
      resourceName: huawei.com/Ascend910B4 # NPU resource name (used in Pod requests)
      resourceMemoryName: huawei.com/Ascend910B4-memory # NPU memory resource name
      memoryAllocatable: 32768 # total allocatable memory (MB)
      memoryCapacity: 32768 # total memory capacity (MB)
      aiCore: 20 # total AI Cores
      aiCPU: 7 # total AI CPUs
      templates: # vNPU slicing templates
      - name: vir05_1c_8g
        memory: 8192 # memory per vNPU (MB)
        aiCore: 5 # AI Cores per vNPU
        aiCPU: 1 # AI CPUs per vNPU
      - name: vir10_3c_16g
        memory: 16384 # memory per vNPU (MB)
        aiCore: 10 # AI Cores per vNPU
        aiCPU: 3 # AI CPUs per vNPU
    # Ascend 310P3 NPU configuration
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P # NPU resource name (used in Pod requests)
      resourceMemoryName: huawei.com/Ascend310P-memory # NPU memory resource name
      memoryAllocatable: 21527 # total allocatable memory (MB)
      memoryCapacity: 24576 # total memory capacity (MB)
      aiCore: 8 # total AI Cores
      aiCPU: 7 # total AI CPUs
      templates: # vNPU slicing templates
      - name: vir01
        memory: 3072 # memory per vNPU (MB)
        aiCore: 1 # AI Cores per vNPU
        aiCPU: 1 # AI CPUs per vNPU
      - name: vir02
        memory: 6144 # memory per vNPU (MB)
        aiCore: 2 # AI Cores per vNPU
        aiCPU: 2 # AI CPUs per vNPU
      - name: vir04
        memory: 12288 # memory per vNPU (MB)
        aiCore: 4 # AI Cores per vNPU
        aiCPU: 4 # AI CPUs per vNPU
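The template numbers above imply an upper bound on how many vNPUs of a given shape one physical chip can host: the minimum across the memory, AI Core, and AI CPU dimensions. A rough illustration of that arithmetic using the ConfigMap values (for intuition only; the plugin's actual placement logic may differ):

```python
# How many instances of a vNPU template fit on one physical chip?
# Take the floor division along each dimension and keep the minimum.

def max_instances(chip: dict, tpl: dict) -> int:
    return min(
        chip["memoryAllocatable"] // tpl["memory"],
        chip["aiCore"] // tpl["aiCore"],
        chip["aiCPU"] // tpl["aiCPU"],
    )

# Values copied from the device ConfigMap above.
ascend_310p3 = {"memoryAllocatable": 21527, "aiCore": 8, "aiCPU": 7}
vir01 = {"memory": 3072, "aiCore": 1, "aiCPU": 1}
vir04 = {"memory": 12288, "aiCore": 4, "aiCPU": 4}

print(max_instances(ascend_310p3, vir01))  # → 7 (memory and AI CPU both cap at 7)
print(max_instances(ascend_310p3, vir04))  # → 1
```

This also shows why memoryAllocatable matters separately from memoryCapacity: only the allocatable 21527 MB, not the full 24576 MB, is available for slicing.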
- Create the RBAC permissions and DaemonSet required by the plugin; the full YAML is as follows:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-ascend
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "update", "watch", "patch"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-ascend
subjects:
- kind: ServiceAccount
  name: hami-ascend
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: hami-ascend
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-ascend
  namespace: kube-system
  labels:
    app.kubernetes.io/component: "hami-ascend"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-ascend-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-ascend-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-ascend-device-plugin
      hami.io/webhook: ignore
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-ascend-device-plugin
        hami.io/webhook: ignore
    spec:
      priorityClassName: "system-node-critical"
      serviceAccountName: hami-ascend
      containers:
      - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/ascend-device-plugin:v1.1.0
        imagePullPolicy: IfNotPresent
        name: device-plugin
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        args:
        - --config_file
        - /device-config.yaml
        securityContext:
          privileged: true
          readOnlyRootFilesystem: false
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: pod-resource
          mountPath: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          mountPath: /usr/local/Ascend/driver
          readOnly: true
        - name: log-path
          mountPath: /var/log/mindx-dl/devicePlugin
        - name: tmp
          mountPath: /tmp
        - name: ascend-config
          mountPath: /device-config.yaml
          subPath: device-config.yaml
          readOnly: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: pod-resource
        hostPath:
          path: /var/lib/kubelet/pod-resources
      - name: hiai-driver
        hostPath:
          path: /usr/local/Ascend/driver
      - name: log-path
        hostPath:
          path: /var/log/mindx-dl/devicePlugin
          type: Directory
      - name: tmp
        hostPath:
          path: /tmp
      - name: ascend-config
        configMap:
          name: hami-scheduler-device
      nodeSelector:
        ascend: "on"
Verification
Ascend 310P3 NPU Slicing Test
- Create a Pod that requests vNPU resources:
kubectl apply -f -<<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-310p
spec:
  schedulerName: volcano # use the Volcano scheduler
  containers:
  - name: npu-container
    image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
    command: ["sleep"]
    args: ["100000"] # keep the Pod running
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend310P: "1" # request 1 Ascend310P vNPU
        huawei.com/Ascend310P-memory: "3072" # request 3 GB of memory (matches the vir01 template)
      requests:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend310P: "1"
        huawei.com/Ascend310P-memory: "3072"
EOF
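The memory request decides which slicing template the Pod lands on: conceptually, the smallest template whose memory covers the request, which is why `huawei.com/Ascend310P-memory: "3072"` maps to vir01. A minimal sketch of that matching idea (`pick_template` is illustrative, not the plugin's actual selection code):

```python
# Map a vNPU memory request (MB) to the smallest covering template.
# Template names and sizes copied from the 310P3 device ConfigMap.

TEMPLATES_310P = [
    ("vir01", 3072),
    ("vir02", 6144),
    ("vir04", 12288),
]

def pick_template(request_mb: int, templates=TEMPLATES_310P):
    for name, mem in sorted(templates, key=lambda t: t[1]):
        if mem >= request_mb:
            return name
    return None  # request exceeds every template

print(pick_template(3072))  # → vir01
print(pick_template(4000))  # → vir02 (3072 is too small, so it rounds up)
```

Note the rounding-up behavior: a request that falls between template sizes consumes the next larger template, so sizing requests to exact template boundaries avoids wasting capacity.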
- Verify the vNPU allocation:
# run on the host to view the overall NPU status
# npu-smi info
+-------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+-------------------------------+-----------------+-----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip Device | Bus-Id | AICore(%) Memory-Usage(MB) |
+===============================+=================+=====================================================+
| 32768 310Pvir01 | OK | NA 41 0 / 0 |
| 0 0 | 0000:81:00.0 | 0 225 / 2690 |
+===============================+=================+=====================================================+
+-------------------------------+-----------------+-----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===============================+=================+=====================================================+
| No running processes found in NPU 32768 |
+===============================+=================+=====================================================+
The output above shows that the vNPU has been created (template vir01) and is healthy, confirming that NPU slicing works on the Ascend 310P3.
# view vNPU details for a specific NPU (replace the number after -i with the actual NPU index)
# run on the host: npu-smi info -t info-vnpu -i 4 -c 0
+-------------------------------------------------------------------------------+
| NPU resource static info as follow: |
| Format:Free/Total NA: Currently, query is not supported. |
| AICORE Memory AICPU VPC VENC VDEC JPEGD JPEGE PNGD |
| GB |
|===============================================================================|
| 7/8 18/21 6/7 11/12 3/3 11/12 14/16 7/8 NA/NA |
+-------------------------------------------------------------------------------+
| Total number of vnpu: 1 |
+-------------------------------------------------------------------------------+
| Vnpu ID | Vgroup ID | Container ID | Status | Template Name |
+-------------------------------------------------------------------------------+
| 100 | 0 | ffffffffffff | 1 | vir01 |
+-------------------------------------------------------------------------------+
# view the vNPU config: cat /etc/vnpu.cfg
vnpu_config_recover:enable
[vnpu-config start]
0:100:npu-smi set -t create-vnpu -i 4 -c 0 -f vir01 -v 100 -g 0
[vnpu-config end]
Ascend 910B4 NPU Slicing Test
- Create a Pod that requests vNPU resources:
kubectl apply -f -<<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-910b
spec:
  schedulerName: volcano # use the Volcano scheduler
  containers:
  - name: npu-container
    image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
    command: ["sleep"]
    args: ["100000"] # keep the Pod running
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend910B4: "1" # request 1 Ascend910B4 vNPU
        huawei.com/Ascend910B4-memory: "8192" # request 8 GB of memory (matches the vir05_1c_8g template)
      requests:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend910B4: "1"
        huawei.com/Ascend910B4-memory: "8192"
EOF
- Verify the vNPU allocation:
# run on the host to view the overall NPU status
# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0 Version: 25.2.0 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 6 910B4vir05_1c_8g | OK | 83.2 39 0 / 0 |
| 0 | 0000:41:00.4 | 0 0 / 0 1111 / 8192 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
The output above shows that the vNPU has been created (template vir05_1c_8g) and is healthy, confirming that NPU slicing works on the Ascend 910B4.
# view vNPU details for a specific NPU (replace the number after -i with the actual NPU index)
# npu-smi info -t info-vnpu -i 7 -c 0
+-------------------------------------------------------------------------------+
| NPU resource static info as follow: |
| Format:Free/Total NA: Currently, query is not supported. |
| AICORE Memory AICPU VPC VENC VDEC JPEGD JPEGE PNGD |
| GB |
|===============================================================================|
| 15/20 21/32 6/7 7/9 0/0 2/2 18/24 3/4 NA/NA |
+-------------------------------------------------------------------------------+
| Total number of vnpu: 1 |
+-------------------------------------------------------------------------------+
| Vnpu ID | Vgroup ID | Container ID | Status | Template Name |
+-------------------------------------------------------------------------------+
| 212 | 0 | ffffffffffff | 1 | vir05_1c_8g |
+-------------------------------------------------------------------------------+
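To script this verification step, the Template Name column can be extracted from the `npu-smi info -t info-vnpu` output. A small sketch that assumes the pipe-separated table layout shown above (the exact layout may vary across npu-smi versions):

```python
# Pull the Template Name column out of an npu-smi vNPU table.
# Data rows are identified by a numeric Vnpu ID in the first column.

def vnpu_templates(output: str) -> list[str]:
    names = []
    for line in output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 5 and cells[0].isdigit():
            names.append(cells[4])
    return names

sample = """
| Vnpu ID   | Vgroup ID | Container ID | Status | Template Name |
+-----------+-----------+--------------+--------+---------------+
| 212       | 0         | ffffffffffff | 1      | vir05_1c_8g   |
"""
print(vnpu_templates(sample))  # → ['vir05_1c_8g']
```

Paired with `subprocess.run(["npu-smi", "info", "-t", "info-vnpu", "-i", "7", "-c", "0"], ...)` on the host, this turns the manual check into an assertable test.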