K8s: A Practical Guide to NPU Compute Slicing with Volcano Priority Scheduling

Author: lomtom

Introduction

In AI model training, inference, and high-performance computing scenarios, Huawei Ascend NPUs (Neural Processing Units), as dedicated heterogeneous compute resources, often face two problems at once: idle compute and scheduling conflicts. A single NPU has plenty of compute but sits underutilized when it serves only one task, while several lightweight tasks requesting it at the same time can block each other through resource contention.

Volcano, a high-performance batch scheduling system in the Kubernetes ecosystem, combined with the HAMi community's hami-ascend-device-plugin, enables fine-grained slicing and priority scheduling of NPU resources: it raises NPU utilization while guaranteeing preemption rights for critical tasks.

Based on Volcano v1.13.0, this article walks through the full deployment, configuration, and verification workflow of an NPU compute slicing solution, providing a directly reusable guide for real-world rollouts.

Core Components

  • Volcano: a Kubernetes scheduler designed for batch computing workloads, with core support for gang scheduling, priority scheduling, and dynamic allocation of heterogeneous resources. This article uses a temporary image built from the v1.13.0 branch (the new NPU scheduling features have not yet been merged into the main branch); once the upstream PR is merged, you can switch to an official release image. Its plugin architecture allows scheduling support for heterogeneous resources such as NPUs to be added flexibly.

  • hami-ascend-device-plugin: developed by the HAMi community, this plugin is the core component for NPU slicing. It splits a physical NPU into multiple virtual NPUs (vNPUs) along dimensions such as memory, AI Cores, and AI CPUs, and works with the Volcano scheduler to allocate vNPU resources on demand with priority scheduling. It supports mainstream NPU models such as the Ascend 310P and 910B.

  • Ascend Docker Runtime: Huawei's container runtime for Ascend, which handles the low-level adaptation between containers and NPU hardware so that workloads inside containers can access NPU resources. It is a foundational dependency for containerized NPU scheduling.

Environment Setup

Install and Configure Volcano

  1. Install the Volcano base components

    Follow the companion guide, "A Practical Guide to GPU Compute Slicing with Volcano Priority Scheduling", to deploy the base components, and confirm that the core components in the volcano-system namespace are running.

  2. Replace the Volcano scheduler image (to enable the NPU features)

kubectl set image -n volcano-system deployment.v1.apps/volcano-scheduler volcano-scheduler=swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-scheduler:v1.13.0-11-07

Because the NPU scheduling features have not yet been merged into the Volcano main branch, a temporary image is required for now; once the upstream PR 🔗 is merged, you can switch to the corresponding official release image.

You can also build the images from source (optional):

# Clone the repo, then switch to the appropriate branch
git clone https://github.com/lomtom/volcano.git

# build all (replace IMAGE_PREFIX with your own registry)
make images DOCKER_PLATFORMS="linux/arm64" BUILDX_OUTPUT_TYPE=registry IMAGE_PREFIX=hb.grgbanking.com/ouyanglongtong  TAG=v1.13.0-11-07

# build one (replace the image registry with your own)
name=scheduler
docker buildx build -t "swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-$name:v1.13.0-11-07" . -f ./installer/dockerfile/$name/Dockerfile --output=type=registry --platform linux/arm64 --build-arg APK_MIRROR= --build-arg OPEN_EULER_IMAGE_TAG=22.03-lts-sp2

# set image (replace the image registry with your own)
kubectl set image -n volcano-system deployment.v1.apps/volcano-scheduler volcano-scheduler=swr.cn-east-3.myhuaweicloud.com/lomtom-common/vc-scheduler:v1.13.0-11-07
  3. Configure Volcano to enable NPU scheduling

Edit the Volcano scheduler ConfigMap to enable the NPU-related settings of the deviceshare plugin:

kubectl edit cm -n volcano-system volcano-scheduler-configmap

The full configuration:

kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: deviceshare
        arguments:
          deviceshare.ASCEND310PVNPUEnable: true  # enable Ascend 310P vNPU support
          deviceshare.AscendVNPUEnable: true       # enable Ascend-series vNPU support
          deviceshare.KnownGeometriesCMNamespace: kube-system  # namespace of the device-geometry ConfigMap
          deviceshare.KnownGeometriesCMName: hami-scheduler-device  # name of the device-geometry ConfigMap
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
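With the flags above in place, a quick grep confirms that the vNPU switches are spelled correctly before the scheduler reloads the config. The sketch below inlines the relevant fragment for illustration; on a live cluster you would pipe in `kubectl get cm -n volcano-system volcano-scheduler-configmap -o yaml` instead:

```shell
# Sanity-check the deviceshare vNPU flags in the scheduler config.
# The config fragment is inlined here for illustration; on a cluster, fetch
# the real one with:
#   kubectl get cm -n volcano-system volcano-scheduler-configmap -o yaml
conf='
      - name: deviceshare
        arguments:
          deviceshare.ASCEND310PVNPUEnable: true
          deviceshare.AscendVNPUEnable: true
'

for flag in ASCEND310PVNPUEnable AscendVNPUEnable; do
  if echo "$conf" | grep -q "deviceshare.${flag}: true"; then
    echo "${flag}: enabled"
  else
    echo "${flag}: MISSING" >&2
  fi
done
```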

Install and Configure Ascend Docker Runtime

  1. Download and install the runtime
# Download the Ascend Docker Runtime installer
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.2.RC1/Ascend-docker-runtime_7.2.RC1_linux-aarch64.run
# Make the installer executable
chmod +x Ascend-docker-runtime_7.2.RC1_linux-aarch64.run
# Check the installer (compatibility check only; no real integrity verification)
./Ascend-docker-runtime_7.2.RC1_linux-aarch64.run --check
# Run the installation
./Ascend-docker-runtime_7.2.RC1_linux-aarch64.run --install

After a successful installation, the runtime is deployed under /usr/local/Ascend/Ascend-Docker-Runtime/.

  2. Configure containerd to use the runtime (not needed for Docker environments)

    Generate the default containerd config file (skip if one already exists):

containerd config default > /etc/containerd/config.toml

Edit the config file and set Ascend Docker Runtime as the container runtime:

vim /etc/containerd/config.toml

Locate the following section and modify it as shown:

...
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runtime.v1.linux"

    [plugins."io.containerd.runtime.v1.linux"] 
      shim = "containerd-shim" 
      runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"   
...

Restart containerd to apply the change:

systemctl daemon-reload && systemctl restart containerd
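After the restart, it is worth confirming that the config actually points at the Ascend runtime binary. A small sketch of that check; the TOML fragment is written to a temp file for illustration, whereas on a real node you would grep /etc/containerd/config.toml directly:

```shell
# Verify that the containerd config points at the Ascend runtime binary.
# A sample fragment is written to a temp file for illustration; on a node,
# run the same grep/sed against the real /etc/containerd/config.toml.
cfg=/tmp/containerd-config-check.toml
cat > "$cfg" <<'EOF'
[plugins."io.containerd.runtime.v1.linux"]
  shim = "containerd-shim"
  runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime"
EOF

# Extract the configured runtime path.
runtime=$(grep -E '^[[:space:]]*runtime[[:space:]]*=' "$cfg" | sed 's/.*= *"\(.*\)"/\1/')
echo "configured runtime: $runtime"
```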

Deploy hami-ascend-device-plugin

  1. Label the target NPU nodes so the DaemonSet can be scheduled onto them precisely:
kubectl label node <node-name> ascend=on  # replace <node-name> with the actual NPU node name
  2. Create the NPU device configuration ConfigMap

Define the slicing specs (memory, AI Core, and AI CPU allocation) for NPU models such as the Ascend 310P3 and 910B4; adjust the entries to match your actual hardware. The YAML:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hami-scheduler-device
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/name: hami
    app.kubernetes.io/instance: hami
data:
  device-config.yaml: |-
    vnpus:
    # Ascend 910B4 NPU configuration
    - chipName: 910B4
      commonWord: Ascend910B4
      resourceName: huawei.com/Ascend910B4  # NPU resource name (used in Pod requests)
      resourceMemoryName: huawei.com/Ascend910B4-memory  # NPU memory resource name
      memoryAllocatable: 32768  # allocatable memory (MB)
      memoryCapacity: 32768     # total memory capacity (MB)
      aiCore: 20                # total AI Cores
      aiCPU: 7                  # total AI CPUs
      templates:  # vNPU slicing templates
      - name: vir05_1c_8g
        memory: 8192  # per-vNPU memory (MB)
        aiCore: 5     # per-vNPU AI Cores
        aiCPU: 1      # per-vNPU AI CPUs
      - name: vir10_3c_16g
        memory: 16384  # per-vNPU memory (MB)
        aiCore: 10     # per-vNPU AI Cores
        aiCPU: 3       # per-vNPU AI CPUs
    # Ascend 310P3 NPU configuration
    - chipName: 310P3
      commonWord: Ascend310P
      resourceName: huawei.com/Ascend310P  # NPU resource name (used in Pod requests)
      resourceMemoryName: huawei.com/Ascend310P-memory  # NPU memory resource name
      memoryAllocatable: 21527  # allocatable memory (MB)
      memoryCapacity: 24576     # total memory capacity (MB)
      aiCore: 8                 # total AI Cores
      aiCPU: 7                  # total AI CPUs
      templates:  # vNPU slicing templates
      - name: vir01
        memory: 3072  # per-vNPU memory (MB)
        aiCore: 1     # per-vNPU AI Cores
        aiCPU: 1      # per-vNPU AI CPUs
      - name: vir02
        memory: 6144  # per-vNPU memory (MB)
        aiCore: 2     # per-vNPU AI Cores
        aiCPU: 2      # per-vNPU AI CPUs
      - name: vir04
        memory: 12288  # per-vNPU memory (MB)
        aiCore: 4      # per-vNPU AI Cores
        aiCPU: 4       # per-vNPU AI CPUs
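Each template consumes three dimensions at once, so the number of slices one chip can host is bounded by its scarcest dimension. The following is illustrative arithmetic only (the actual packing is decided by the device plugin and the chip firmware), using the 310P3 numbers from the ConfigMap:

```shell
# Illustrative per-dimension capacity bound for one 310P3 chip.
# The real placement is decided by the device plugin and chip firmware;
# this only shows how each template is limited by its scarcest dimension.
total_mem=21527; total_core=8; total_cpu=7   # memoryAllocatable / aiCore / aiCPU

fits() { # fits <template-name> <mem_mb> <ai_cores> <ai_cpus>
  local by_mem=$((total_mem / $2)) by_core=$((total_core / $3)) by_cpu=$((total_cpu / $4))
  local n=$by_mem
  [ "$by_core" -lt "$n" ] && n=$by_core
  [ "$by_cpu" -lt "$n" ] && n=$by_cpu
  echo "$1: at most $n per chip (mem=$by_mem core=$by_core cpu=$by_cpu)"
}

fits vir01 3072 1 1
fits vir02 6144 2 2
fits vir04 12288 4 4
```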
  3. Create the RBAC permissions and the DaemonSet required by the plugin. Full YAML:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hami-ascend
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "update", "watch", "patch"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hami-ascend
subjects:
  - kind: ServiceAccount
    name: hami-ascend
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: hami-ascend
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: hami-ascend
  namespace: kube-system
  labels:
    app.kubernetes.io/component: "hami-ascend"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-ascend-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-ascend-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-ascend-device-plugin
      hami.io/webhook: ignore
  template:
    metadata:
      labels:
        app.kubernetes.io/component: hami-ascend-device-plugin
        hami.io/webhook: ignore
    spec:
      priorityClassName: "system-node-critical"
      serviceAccountName: hami-ascend
      containers:
        - image: swr.cn-east-3.myhuaweicloud.com/lomtom-common/ascend-device-plugin:v1.1.0
          imagePullPolicy: IfNotPresent
          name: device-plugin
          resources:
            requests:
              memory: 500Mi
              cpu: 500m
            limits:
              memory: 500Mi
              cpu: 500m
          args:
            - --config_file
            - /device-config.yaml
          securityContext:
            privileged: true
            readOnlyRootFilesystem: false
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: pod-resource
              mountPath: /var/lib/kubelet/pod-resources
            - name: hiai-driver
              mountPath: /usr/local/Ascend/driver
              readOnly: true
            - name: log-path
              mountPath: /var/log/mindx-dl/devicePlugin
            - name: tmp
              mountPath: /tmp
            - name: ascend-config
              mountPath: /device-config.yaml
              subPath: device-config.yaml
              readOnly: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: pod-resource
          hostPath:
            path: /var/lib/kubelet/pod-resources
        - name: hiai-driver
          hostPath:
            path: /usr/local/Ascend/driver
        - name: log-path
          hostPath:
            path: /var/log/mindx-dl/devicePlugin
            type: Directory
        - name: tmp
          hostPath:
            path: /tmp
        - name: ascend-config
          configMap:
            name: hami-scheduler-device
      nodeSelector:
        ascend: "on"

Verification

Ascend 310P3 NPU Slicing Test

  1. Create a Pod that requests vNPU resources:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-310p
spec:
  schedulerName: volcano  # use the Volcano scheduler
  containers:
  - name: npu-container
    image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
    command: ["sleep"]
    args: ["100000"]  # keep the Pod running
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend310P: "1"  # request 1 Ascend310P vNPU
        huawei.com/Ascend310P-memory: "3072"  # request 3 GB of memory (matches the vir01 template)
      requests:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend310P: "1"
        huawei.com/Ascend310P-memory: "3072"
EOF
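Note how the memory request selects the slice size: presumably the plugin carves the vNPU from the smallest template whose memory covers the request, which is why 3072 MB maps to vir01. A pure-shell sketch of that assumed matching rule:

```shell
# Sketch of smallest-fit template matching, under the assumption that the
# plugin picks the smallest template whose memory covers the Pod's request.
pick_template() { # pick_template <request_mb>
  local req=$1 best="" best_mem=0
  # 310P3 templates from the ConfigMap, as name:memory(MB) pairs
  for t in vir01:3072 vir02:6144 vir04:12288; do
    local name=${t%%:*} mem=${t##*:}
    if [ "$mem" -ge "$req" ] && { [ -z "$best" ] || [ "$mem" -lt "$best_mem" ]; }; then
      best=$name; best_mem=$mem
    fi
  done
  echo "${best:-none}"   # "none" if no template is large enough
}

pick_template 3072   # the request from the Pod above
pick_template 5000   # a request that needs the next template up
```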
  2. Verify the vNPU allocation:
# Run on the host to check overall NPU status
# npu-smi info
+-------------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                                 Version: 24.1.rc2                                    |
+-------------------------------+-----------------+-----------------------------------------------------+
| NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page)|
| Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                       |
+===============================+=================+=====================================================+
| 32768   310Pvir01             | OK              | NA           41                0     / 0            |
| 0       0                     | 0000:81:00.0    | 0            225  / 2690                            |
+===============================+=================+=====================================================+
+-------------------------------+-----------------+-----------------------------------------------------+
| NPU     Chip                  | Process id      | Process name             | Process memory(MB)       |
+===============================+=================+=====================================================+
| No running processes found in NPU 32768                                                               |
+===============================+=================+=====================================================+

The output shows that the vNPU has been created (template name vir01) and is healthy, confirming that NPU slicing works on the Ascend 310P3.

# View vNPU details for a specific NPU (replace the index after -i with the actual NPU index)
# Run on the host: npu-smi info -t info-vnpu -i 4 -c 0
+-------------------------------------------------------------------------------+
| NPU resource static info as follow:                                           |
| Format:Free/Total                   NA: Currently, query is not supported.    |
| AICORE    Memory    AICPU    VPC    VENC    VDEC    JPEGD    JPEGE    PNGD    |
|            GB                                                                 |
|===============================================================================|
| 7/8       18/21     6/7      11/12  3/3     11/12   14/16    7/8      NA/NA   |
+-------------------------------------------------------------------------------+
| Total number of vnpu: 1                                                       |
+-------------------------------------------------------------------------------+
|  Vnpu ID  |  Vgroup ID     |  Container ID  |  Status  |  Template Name       |
+-------------------------------------------------------------------------------+
|  100      |  0             |  ffffffffffff  |  1       |  vir01               |
+-------------------------------------------------------------------------------+

# View the vNPU configuration: cat /etc/vnpu.cfg
vnpu_config_recover:enable
[vnpu-config start]
0:100:npu-smi set -t create-vnpu -i 4 -c 0 -f vir01 -v 100 -g 0
[vnpu-config end]

Ascend 910B4 NPU Slicing Test

  1. Create a Pod that requests vNPU resources:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: npu-pod-910b
spec:
  schedulerName: volcano  # use the Volcano scheduler
  containers:
  - name: npu-container
    image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
    command: ["sleep"]
    args: ["100000"]  # keep the Pod running
    resources:
      limits:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend910B4: "1"  # request 1 Ascend910B4 vNPU
        huawei.com/Ascend910B4-memory: "8192"  # request 8 GB of memory (matches the vir05_1c_8g template)
      requests:
        cpu: "1"
        memory: 1000Mi
        huawei.com/Ascend910B4: "1"
        huawei.com/Ascend910B4-memory: "8192"
EOF
  2. Verify the vNPU allocation:
# Run on the host to check overall NPU status
# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.2.0                   Version: 25.2.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 6     910B4vir05_1c_8g    | OK            | 83.2        39                0    / 0             |
| 0                         | 0000:41:00.4  | 0           0    / 0          1111 / 8192          |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+

The output shows that the vNPU has been created (template name vir05_1c_8g) and is healthy, confirming that NPU slicing works on the Ascend 910B4.

# View vNPU details for a specific NPU (replace the index after -i with the actual NPU index)
#  npu-smi info -t info-vnpu -i 7 -c 0
+-------------------------------------------------------------------------------+
| NPU resource static info as follow:                                           |
| Format:Free/Total                   NA: Currently, query is not supported.    |
| AICORE    Memory    AICPU    VPC    VENC    VDEC    JPEGD    JPEGE    PNGD    |
|            GB                                                                 |
|===============================================================================|
| 15/20     21/32     6/7      7/9    0/0     2/2     18/24    3/4      NA/NA   |
+-------------------------------------------------------------------------------+
| Total number of vnpu: 1                                                       |
+-------------------------------------------------------------------------------+
|  Vnpu ID  |  Vgroup ID     |  Container ID  |  Status  |  Template Name       |
+-------------------------------------------------------------------------------+
|  212      |  0             |  ffffffffffff  |  1       |  vir05_1c_8g         |
+-------------------------------------------------------------------------------+
References

  1. Volcano official documentation 🔗
  2. ascend-device-plugin official repository 🔗

Title: K8s: A Practical Guide to NPU Compute Slicing with Volcano Priority Scheduling

Author: lomtom

Link: https://lomtom.cn/d01w08bawclg4