
When using volcano-vgpu, HAMi itself does not need to be installed; the Volcano vgpu device plugin alone is sufficient. It provides a device-sharing mechanism for NVIDIA devices managed by Volcano. The plugin's source is based on the NVIDIA Device Plugin and uses HAMi-core to provide hard isolation of GPU cards. Volcano vgpu is only available in Volcano > 1.9.

Preparation

Image preparation

The Volcano scheduler already integrates HAMi vGPU support. Prepare the following image in the local cluster ahead of time:

docker.io/projecthami/volcano-vgpu-device-plugin:v1.11.0

After pulling it into the local Harbor registry, the new address is:

aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0

Node preparation

vGPU node label

Nodes that should run the vGPU components need a specific label, so that the vGPU feature can be enabled on only selected nodes in the cluster. The label is:

volcano.sh/vgpu.enabled: "true"
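The label can be applied with kubectl (the node name is a placeholder):

```shell
kubectl label node <node-name> volcano.sh/vgpu.enabled=true
```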

Uninstall nvidia-device-plugin on vGPU nodes

Background

Because volcano-vgpu-device-plugin and nvidia-device-plugin conflict over resource management, make sure that nvidia-device-plugin is not running on any node with the vGPU feature enabled.

When deploying nvidia-device-plugin, the GPU Operator decides whether to deploy the device plugin on a node based on the node label nvidia.com/gpu.deploy.device-plugin=true. This label is added to nodes automatically by the GPU Operator during installation, with a default value of true.

If the label value is false, the GPU Operator skips that node and does not deploy nvidia-device-plugin on it; if nvidia-device-plugin is already running there, it is uninstalled automatically. However, manually changing the label on the node does not work: the GPU Operator periodically re-syncs node labels (and also does so after a component restart), so a manually set value gets overwritten back to true.

Solution: use an NFD NodeFeatureRule

Use the NodeFeatureRule mechanism of the NFD (Node Feature Discovery) component to automatically set the nvidia.com/gpu.deploy.device-plugin=false label on vGPU-enabled nodes, achieving automated device-plugin management.

What is a NodeFeatureRule?

NodeFeatureRule (NFR) is a custom resource (CRD) provided by NFD that lets users define rules to automatically discover and label node features. NFD runs as a daemon in the Kubernetes cluster, continuously monitoring node hardware characteristics, kernel configuration and other system attributes, and adds, updates or removes node labels according to the rules defined in NodeFeatureRule resources.

Why does this prevent the GPU Operator from overwriting the label?

Both the GPU Operator and NFD manage node labels, but their priorities and scopes differ:

  1. NFD has higher priority: as the dedicated feature-discovery component, the labels it sets are treated by Kubernetes as "system-level" labels and carry more authority.
  2. Continuous monitoring and reconciliation: NFD keeps watching node state and label changes; when the GPU Operator tries to overwrite a label, NFD immediately resets it to the value required by the NodeFeatureRule.
  3. Condition-based label management: labels defined through a NodeFeatureRule are generated dynamically from node conditions (such as volcano.sh/vgpu.enabled=true), so as long as the condition holds, NFD keeps the label at the correct value.
  4. Conflict-avoiding design: when the GPU Operator detects labels managed by NFD, it generally respects their values instead of force-overwriting them, which is the cooperative best practice among components in the Kubernetes ecosystem.

Through this mechanism, even if the GPU Operator restarts or runs a sync, NFD ensures that the nvidia.com/gpu.deploy.device-plugin label stays false, giving stable and reliable device-plugin management.

Step 1: create a NodeFeatureRule

Create a NodeFeatureRule resource that automatically sets the label "nvidia.com/gpu.deploy.device-plugin": "false" on nodes matching the rule:

vgpu-node-feature-rule.yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gpu-device-plugin-control
spec:
  rules:
    # Rule 1: disable the device plugin on nodes with specific hostnames
    - name: "disable-device-plugin-on-specific-nodes"
      labels:
        "nvidia.com/gpu.deploy.device-plugin": "false"
      matchFeatures:
        - feature: system.name
          matchExpressions:
            nodename:
              op: In
              value:
                - "dev-app-2-150-master-1"

Step 2: apply the NodeFeatureRule

kubectl apply -f vgpu-node-feature-rule.yaml

Step 3: verify the label

Check the labels on the vGPU node and confirm that the nvidia.com/gpu.deploy.device-plugin label has been set correctly:

kubectl get node <vgpu-node-name> -o jsonpath='{.metadata.labels}' | grep nvidia.com/gpu.deploy.device-plugin

The expected output should contain:

"nvidia.com/gpu.deploy.device-plugin":"false"

Step 4: watch nvidia-device-plugin being uninstalled automatically

The GPU Operator watches for node label changes; once it detects nvidia.com/gpu.deploy.device-plugin=false, it automatically deletes the nvidia-device-plugin Pod on that node.
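One way to watch the uninstall happen; the namespace and label selector below are assumptions (the GPU Operator is commonly installed in the gpu-operator namespace and its device-plugin Pods carry the app=nvidia-device-plugin-daemonset label), so adjust them to your installation:

```shell
kubectl -n gpu-operator get pod -l app=nvidia-device-plugin-daemonset -o wide --watch
```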

Deployment

Deploy volcano-vgpu-device-plugin

Deployment file

Pay attention to the nodeSelector and tolerations of the DaemonSet in the deployment file: volcano-vgpu-device-plugin.yaml

Configuration

The default Volcano vGPU configuration is:

device-config.yaml
nvidia:
  resourceCountName: volcano.sh/vgpu-number
  resourceMemoryName: volcano.sh/vgpu-memory
  resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
  resourceCoreName: volcano.sh/vgpu-cores
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  deviceSplitCount: 10
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
  gpuMemoryFactor: 1
  knownMigGeometries: []

Key configuration items:

Config item | Description | Example
resourceCountName | Resource name for the vGPU count | volcano.sh/vgpu-number
resourceMemoryName | Resource name for the vGPU memory size | volcano.sh/vgpu-memory
resourceCoreName | Resource name for vGPU compute | volcano.sh/vgpu-cores
resourceMemoryPercentageName | Resource name for the vGPU memory percentage; only used in Pod resource requests | volcano.sh/vgpu-memory-percentage
deviceSplitCount | GPU split count: the maximum number of tasks that can run on one GPU at the same time | 10
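To inspect or change these defaults in a running cluster, you can edit the volcano-vgpu-device-config ConfigMap created by the deployment file (assuming it lives in the volcano-system namespace, as referenced in the scheduler configuration later in this document):

```shell
kubectl -n volcano-system edit configmap volcano-vgpu-device-config
```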

Result

$ kubectl apply -f volcano-vgpu-device-plugin.yaml
configmap/volcano-vgpu-device-config created
configmap/volcano-vgpu-node-config created
serviceaccount/volcano-device-plugin created
clusterrole.rbac.authorization.k8s.io/volcano-device-plugin created
clusterrolebinding.rbac.authorization.k8s.io/volcano-device-plugin created
Warning: spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
daemonset.apps/volcano-device-plugin created

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
volcano-admission-7dc9b78fc6-686tb 1/1 Running 0 20d
volcano-admission-7dc9b78fc6-d9vzk 1/1 Running 0 20d
volcano-admission-7dc9b78fc6-h2ssl 1/1 Running 0 20d
volcano-controllers-855c676dd4-4gpxp 1/1 Running 1 (13d ago) 20d
volcano-controllers-855c676dd4-pspzg 1/1 Running 0 20d
volcano-controllers-855c676dd4-zl8cd 1/1 Running 0 20d
volcano-device-plugin-7g6v2 2/2 Running 0 22s
volcano-scheduler-6645c59d6d-56xdc 1/1 Running 0 6m58s
volcano-scheduler-6645c59d6d-p549s 1/1 Running 0 6m58s
volcano-scheduler-6645c59d6d-pqt68 1/1 Running 0 6m58s

Inspect the node's vGPU resources: the nvidia.com/gpu resource previously registered by the NVIDIA device plugin has been cleared, and the new vGPU resources volcano.sh/vgpu-cores, volcano.sh/vgpu-memory and volcano.sh/vgpu-number have been generated.
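The node output below can be produced by describing the vGPU node (node name is a placeholder):

```shell
kubectl describe node <vgpu-node-name>
```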

# ...
Capacity:
  cpu: 128
  ephemeral-storage: 562291Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 263746296Ki
  nvidia.com/gpu: 0
  nvidia.com/gpu.shared: 0
  pods: 110
  volcano.sh/vgpu-cores: 800
  volcano.sh/vgpu-memory: 196512
  volcano.sh/vgpu-number: 80
Allocatable:
  cpu: 127600m
  ephemeral-storage: 562291Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 256048108548
  nvidia.com/gpu: 0
  nvidia.com/gpu.shared: 0
  pods: 110
  volcano.sh/vgpu-cores: 800
  volcano.sh/vgpu-memory: 196512
  volcano.sh/vgpu-number: 80
# ...

Automatically generated resource items:

Resource | Description | Example
volcano.sh/vgpu-cores | Total vGPU compute as a percentage: total cards on the node * 100 | 800
volcano.sh/vgpu-memory | Total vGPU memory in Mi: total cards * memory per card. A single RTX 4090 card has 24564 Mi, so the total here is 196512 Mi | 196512
volcano.sh/vgpu-number | Number of vGPUs: total cards * the deviceSplitCount setting | 80
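As a sanity check, the totals above are consistent with an 8-card RTX 4090 node and deviceSplitCount=10 (the card count of 8 is inferred from the totals, not stated explicitly):

```shell
cards=8             # inferred: 196512 / 24564 = 8
split=10            # deviceSplitCount
mem_per_card=24564  # MiB of memory on one RTX 4090
echo "volcano.sh/vgpu-cores:  $((cards * 100))"           # 800
echo "volcano.sh/vgpu-memory: $((cards * mem_per_card))"  # 196512
echo "volcano.sh/vgpu-number: $((cards * split))"         # 80
```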

Enable vGPU support in the Volcano scheduler

Edit volcano-scheduler-configmap and add support for the following plugin:

- name: deviceshare
  arguments:
    # enable the vgpu feature
    deviceshare.VGPUEnable: true
    # namespace of the volcano-vgpu-device-config ConfigMap,
    # so the scheduler can read its contents automatically
    deviceshare.KnownGeometriesCMNamespace: volcano-system

The modified content looks like this (for reference only; adjust the Volcano action and plugin configuration to your own needs):

actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
    enablePreemptable: false
  - name: deviceshare
    arguments:
      # enable the vgpu feature
      deviceshare.VGPUEnable: true
      # namespace of the volcano-vgpu-device-config ConfigMap,
      # so the scheduler can read its contents automatically
      deviceshare.KnownGeometriesCMNamespace: volcano-system
  - name: predicates
  - name: capacity-card
    arguments:
      cardUnlimitedCpuMemory: true
  - name: nodeorder
  - name: binpack

Restart volcano-scheduler after the change.
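The restart can be done with a rollout restart; this assumes the scheduler runs as the volcano-scheduler Deployment in the volcano-system namespace, matching the Pod names shown earlier:

```shell
kubectl -n volcano-system rollout restart deployment volcano-scheduler
```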

Testing

Basic vGPU usage

The test Pod uses the nvidia/cuda:12.2.0-base image; after mirroring into the local cluster's Harbor, its address is aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04.

test-vgpu.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-vgpu
spec:
  # must be scheduled by the volcano scheduler
  schedulerName: volcano
  # tolerate all taints, for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        volcano.sh/vgpu-number: 2 # (required) request 2 GPU cards
        volcano.sh/vgpu-memory: 3000 # (optional) 3G of memory per vGPU; capped at the full memory of one card
        volcano.sh/vgpu-cores: 50 # (optional) 50% of the cores per vGPU
      limits:
        volcano.sh/vgpu-number: 2
        volcano.sh/vgpu-memory: 3000
        volcano.sh/vgpu-cores: 50

Once it is running, check the Pod:

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-vgpu 1/1 Running 0 23s

Enter the Pod container and run nvidia-smi to inspect the vGPU resources:

kubectl exec -it test-vgpu -- bash

The vGPU resource information looks like this:

root@test-vgpu:/# nvidia-smi
[HAMI-core Msg(18:140441960732480:libvgpu.c:839)]: Initializing.....
Mon Nov 24 12:13:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:BA:00.0 Off | Off |
| 30% 35C P8 13W / 450W | 0MiB / 3000MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:BB:00.0 Off | Off |
| 30% 33C P8 24W / 450W | 0MiB / 3000MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(18:140441960732480:multiprocess_memory_limit.c:455)]: Calling exit handler 18
root@test-vgpu:/#

The lines starting with HAMI-core on stdout are debug messages produced by HAMi-core's interception of the CUDA API, showing that HAMi-core is actually in effect. For example, [HAMI-core Msg(18:140441960732480:multiprocess_memory_limit.c:455)]: Calling exit handler 18 is emitted by the HAMi-core component, which performs some resource cleanup at the end of the nvidia-smi run.

Using nvidia-device-plugin resources

Nodes that keep using the NVIDIA device plugin are not affected. The Pod YAML to deploy there is:

test-nvidia-device-plugin.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-nvidia-device-plugin
spec:
  # must be scheduled by the volcano scheduler
  schedulerName: volcano
  # tolerate all taints, for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        nvidia.com/gpu: 2 # resource registered by the nvidia device plugin
      limits:
        nvidia.com/gpu: 2

Making the vGPU resource name compatible with the NVIDIA resource name

Once vGPU is enabled on a node, Pods can only be scheduled onto it via the vGPU resource names; the original resource name can no longer be used to land on that node. Volcano vGPU also supports configuring resource names for compatibility between whole cards and vGPUs, for example making the vGPU resource name identical to NVIDIA's (nvidia.com/gpu). Let's run a compatibility test.

Configuration changes

Adjust the global vGPU resource-name configuration as follows (resourceCountName changes from volcano.sh/vgpu-number to nvidia.com/gpu):

nvidia:
  resourceCountName: nvidia.com/gpu
  resourceMemoryName: volcano.sh/vgpu-memory
  resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
  resourceCoreName: volcano.sh/vgpu-cores
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  deviceSplitCount: 10
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
  gpuMemoryFactor: 1
  knownMigGeometries: []

After restarting volcano-vgpu-device-plugin, the new resource name did not show up on the node, i.e. the change did not take effect. Reading the source of volcano-vgpu-device-plugin and of Volcano's deviceshare plugin revealed that:

  • The volcano-vgpu-device-config ConfigMap is consumed only by Volcano's deviceshare plugin.
  • The volcano-vgpu-device-plugin component ignores this setting in the ConfigMap and takes the resource names from command-line flags instead. The supported flags are:
    Flag | Description | Default
    resource-name | Resource name for the vGPU count, published on the node | volcano.sh/vgpu-number
    resource-memory-name | Resource name for the vGPU memory size, published on the node | volcano.sh/vgpu-memory
    resource-core-name | Resource name for vGPU compute, published on the node | volcano.sh/vgpu-cores
    debug | Enable debug mode | false
  • The corresponding settings of the two components must stay consistent; otherwise Pods cannot be deployed.

Change the command-line arguments from:

containers:
- image: aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0
  args: ["--device-split-count=10"]

to:

containers:
- image: aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0
  args: [
    "--device-split-count=10",
    "--resource-name=nvidia.com/gpu"
  ]

Deployment file example

This is a complete deployment file for the volcano-vgpu-device-plugin component, for reference only: volcano-vgpu-device-config.compatible.yaml

After applying it, the volcano-vgpu-device-plugin component restarts; also restart the volcano scheduler manually. Inspecting the vGPU node resources afterwards shows that the vGPU card resource name now matches NVIDIA's, nvidia.com/gpu:

# ...
Capacity:
  cpu: 128
  ephemeral-storage: 562291Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 263746296Ki
  nvidia.com/gpu: 80
  nvidia.com/gpu.shared: 0
  pods: 110
  volcano.sh/vgpu-cores: 800
  volcano.sh/vgpu-memory: 196512
  volcano.sh/vgpu-number: 80
Allocatable:
  cpu: 127600m
  ephemeral-storage: 562291Mi
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 256048108548
  nvidia.com/gpu: 80
  nvidia.com/gpu.shared: 0
  pods: 110
  volcano.sh/vgpu-cores: 800
  volcano.sh/vgpu-memory: 196512
  volcano.sh/vgpu-number: 0
# ...

Test file example

Run the following example to schedule a Pod onto the vGPU node:

test-vgpu-compatible.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-vgpu-compatible
spec:
  # must be scheduled by the volcano scheduler
  schedulerName: volcano
  # node selector added so the Pod lands on the vGPU node
  nodeSelector:
    name: dev-app-2-150-master-1
  # tolerate all taints, for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        nvidia.com/gpu: 2 # request 2 GPU cards
      limits:
        nvidia.com/gpu: 2

The Pod is scheduled and runs successfully. Entering the container shows that the requested compute and memory are allocated as whole cards; this is the default behavior of HAMi vGPU, for compatibility with the original NVIDIA device plugin:

$ kubectl exec -it test-vgpu-compatible bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@test-vgpu-compatible:/# nvidia-smi
[HAMI-core Msg(15:139748339885888:libvgpu.c:839)]: Initializing.....
Tue Nov 25 09:26:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:BA:00.0 Off | Off |
| 30% 34C P8 13W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:BB:00.0 Off | Off |
| 30% 32C P8 25W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(15:139748339885888:multiprocess_memory_limit.c:455)]: Calling exit handler 15
root@test-vgpu-compatible:/#

Metrics

Volcano vgpu metrics are exposed through the volcano scheduler. From any Pod in the cluster that has curl available, query the scheduler's endpoint, for example:

# 10.233.75.65 is the ClusterIP of the active volcano scheduler
curl 10.233.75.65:8080/metrics

The returned metrics are extensive; the vGPU-related ones are collected here: volcano-vgpu-metrics.txt
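A quick way to narrow the output down to the vGPU-related series, assuming their metric names contain "vgpu" (the IP is the scheduler ClusterIP from above):

```shell
curl -s 10.233.75.65:8080/metrics | grep -i vgpu
```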

FAQ

vGPU Pod fails to deploy with UnexpectedAdmissionError

After changing the resourceCountName item in the volcano-vgpu-device-config ConfigMap to a custom resource name, the Pod status becomes UnexpectedAdmissionError on deployment:

$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-vgpu-compatible 0/1 UnexpectedAdmissionError 0 75s

kubectl describe pod shows the following Events:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25s volcano Successfully assigned volcano-system/test-vgpu-compatible to dev-app-2-150-master-1
Warning UnexpectedAdmissionError 25s kubelet Allocate failed due to rpc error: code = Unknown desc = device request not found, which is unexpected

Digging through the volcano and volcano-vgpu-device-plugin sources showed the cause to be inconsistent configuration. When changing a resource name, three places must be kept correct and consistent. Taking a resourceCountName change to nvidia.com/gpu as an example, the following need to be adjusted:

  • volcano-vgpu-device-config: resourceCountName: nvidia.com/gpu
  • The volcano-vgpu-device-plugin command-line flag: --resource-name=nvidia.com/gpu
  • The deviceshare plugin in volcano-scheduler-configmap must point at the correct namespace:
    - name: deviceshare
      arguments:
        # enable the vgpu feature
        deviceshare.VGPUEnable: true
        # namespace of the volcano-vgpu-device-config ConfigMap,
        # so the scheduler can read its contents automatically
        deviceshare.KnownGeometriesCMNamespace: volcano-system
    You can check the scheduler logs to verify which device config it actually loaded:
    $ kubectl logs volcano-scheduler-6645c59d6d-bcw68 | grep "device config"
    I1125 09:11:57.408175 1 config.go:113] "Initializing volcano device config" device-configs={"NvidiaConfig":{"ResourceCountName":"nvidia.com/gpu","ResourceMemoryName":"volcano.sh/vgpu-memory","ResourceCoreName":"volcano.sh/vgpu-cores","ResourceMemoryPercentageName":"volcano.sh/vgpu-memory-percentage","ResourcePriority":"","OverwriteEnv":false,"DefaultMemory":0,"DefaultCores":0,"DefaultGPUNum":1,"DeviceSplitCount":10,"DeviceMemoryScaling":1,"DeviceCoreScaling":1,"DisableCoreLimit":false,"MigGeometriesList":[],"GPUMemoryFactor":1}}