When using volcano-vgpu, there is no need to install HAMi; the Volcano vgpu device-plugin alone is sufficient. It provides a device-sharing mechanism for NVIDIA devices managed by Volcano.
The plugin's source code is based on the NVIDIA Device Plugin and uses HAMi-core to provide hard isolation for GPU cards.
Volcano vgpu is only available in Volcano versions later than 1.9.
Preparation
Image preparation
The Volcano scheduler has built-in support for HAMi vGPU. The following image needs to be prepared in the local cluster in advance:
docker.io/projecthami/volcano-vgpu-device-plugin:v1.11.0
After downloading it into the local Harbor registry, the new address is:
aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0
Node preparation
vGPU node label
Nodes that will run the vGPU components need a specific label, so that the vGPU feature can be enabled on only part of the cluster.
The label is:
volcano.sh/vgpu.enabled: "true"
Uninstall nvidia-device-plugin on vGPU nodes
Background
Since volcano-vgpu-device-plugin and nvidia-device-plugin conflict over resource management, make sure that nvidia-device-plugin is not running on any node with the vGPU feature enabled.
When deploying nvidia-device-plugin, the GPU Operator uses the nvidia.com/gpu.deploy.device-plugin=true label on each node to decide whether to deploy the device plugin there. The GPU Operator adds this label automatically during installation, with a default value of true.
If the label value is false, the GPU Operator skips that node and does not deploy nvidia-device-plugin on it; if nvidia-device-plugin is already running there, it is automatically uninstalled.
However, manually changing the label value on a node does not work: the GPU Operator periodically re-syncs node labels (and also does so after component restarts), so a manually set value gets overwritten back to true.
Solution: use an NFD NodeFeatureRule
Use a NodeFeatureRule from the NFD (Node Feature Discovery) component to automatically set the nvidia.com/gpu.deploy.device-plugin=false label on vGPU-enabled nodes, giving fully automated device-plugin management.
What is a NodeFeatureRule?
NodeFeatureRule (NFR) is a custom resource (CRD) provided by NFD. It lets users define rules that automatically discover and label node features. NFD runs as a daemon in the Kubernetes cluster, continuously monitoring hardware characteristics, kernel configuration, and other system attributes, and adds, updates, or removes node labels according to the rules defined in NodeFeatureRule resources.
Why does this prevent the GPU Operator from overwriting the label?
Both the GPU Operator and NFD manage node labels, but with different priorities and scopes:
- NFD has higher priority: as the dedicated feature-discovery component, the labels NFD sets are treated as "system-level" labels in Kubernetes and carry more authority
- Continuous monitoring and synchronization: NFD continuously watches node state and label changes; when the GPU Operator tries to overwrite a label, NFD immediately resets it to the value dictated by the NodeFeatureRule
- Condition-based label management: labels defined through a NodeFeatureRule are generated dynamically from node conditions (such as volcano.sh/vgpu.enabled=true); as long as the condition holds, NFD keeps the label at the correct value
- Conflict-avoiding design: when the GPU Operator detects a label managed by NFD, it generally respects its value rather than forcing an overwrite, which is the established pattern for component cooperation in the Kubernetes ecosystem
With this mechanism, even if the GPU Operator restarts or runs a sync, NFD keeps the nvidia.com/gpu.deploy.device-plugin label at false, giving stable and reliable device-plugin management.
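The reconciliation behavior described above can be sketched as a toy loop; the data structures here are illustrative stand-ins, not NFD's actual implementation:

```python
# Toy model of NFD-style label reconciliation (illustrative only).
def reconcile(node_labels, rule_labels):
    """Re-apply every rule-derived label, overriding any drift."""
    for key, value in rule_labels.items():
        node_labels[key] = value
    return node_labels

# Label dictated by the NodeFeatureRule for vGPU nodes.
rule_labels = {"nvidia.com/gpu.deploy.device-plugin": "false"}

# The GPU Operator's sync has just reset the label back to "true"...
labels = {"nvidia.com/gpu.deploy.device-plugin": "true"}

# ...but the next NFD pass restores the rule-defined value.
labels = reconcile(labels, rule_labels)
print(labels["nvidia.com/gpu.deploy.device-plugin"])  # false
```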
Step 1: create the NodeFeatureRule
Create a NodeFeatureRule resource that automatically sets the label "nvidia.com/gpu.deploy.device-plugin": "false" on nodes matching the condition:
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: gpu-device-plugin-control
spec:
  rules:
    # Rule 1: disable the device plugin on nodes with specific hostnames
    - name: "disable-device-plugin-on-specific-nodes"
      labels:
        "nvidia.com/gpu.deploy.device-plugin": "false"
      matchFeatures:
        - feature: system.name
          matchExpressions:
            nodename:
              op: In
              value:
                - "dev-app-2-150-master-1"
Step 2: apply the NodeFeatureRule
kubectl apply -f vgpu-node-feature-rule.yaml
Step 3: verify the label
Check the labels on the vGPU node and confirm that the nvidia.com/gpu.deploy.device-plugin label has been set correctly:
kubectl get node <vgpu-node-name> -o jsonpath='{.metadata.labels}' | grep nvidia.com/gpu.deploy.device-plugin
The expected output should contain:
"nvidia.com/gpu.deploy.device-plugin":"false"
Step 4: watch nvidia-device-plugin being uninstalled automatically
The GPU Operator watches for node label changes; when it sees nvidia.com/gpu.deploy.device-plugin=false, it automatically deletes the nvidia-device-plugin Pod on that node.
Deployment
Deploy volcano-vgpu-device-plugin
Deployment file
Note the nodeSelector and tolerations in the DaemonSet of the deployment file: volcano-vgpu-device-plugin.yaml
Configuration
The default Volcano vGPU configuration is:
nvidia:
  resourceCountName: volcano.sh/vgpu-number
  resourceMemoryName: volcano.sh/vgpu-memory
  resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
  resourceCoreName: volcano.sh/vgpu-cores
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  deviceSplitCount: 10
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
  gpuMemoryFactor: 1
  knownMigGeometries: []
Key configuration options:
| Option | Description | Example |
|---|---|---|
| resourceCountName | Resource name for the number of vGPUs | volcano.sh/vgpu-number |
| resourceMemoryName | Resource name for vGPU memory size | volcano.sh/vgpu-memory |
| resourceCoreName | Resource name for vGPU compute capacity | volcano.sh/vgpu-cores |
| resourceMemoryPercentageName | Resource name for vGPU memory percentage; used only in Pod resource requests | volcano.sh/vgpu-memory-percentage |
| deviceSplitCount | GPU split count: the maximum number of tasks that can run concurrently on each GPU | 10 |
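Besides the options in the table, the default configuration also includes deviceMemoryScaling (memory oversubscription factor) and gpuMemoryFactor (granularity of one memory unit). A rough sketch of how these plausibly combine when the plugin computes the per-card memory it advertises; the exact formula is an assumption here, not taken from the plugin source:

```python
def advertised_memory_units(physical_mib, device_memory_scaling=1.0, gpu_memory_factor=1):
    """Memory units advertised for one card: the physical size, optionally
    oversubscribed by deviceMemoryScaling, divided into gpuMemoryFactor-MiB
    units. (Assumed formula, for illustration.)"""
    return int(physical_mib * device_memory_scaling) // gpu_memory_factor

print(advertised_memory_units(24564))       # 24564 -> defaults: one unit per MiB
print(advertised_memory_units(24564, 1.5))  # 36846 -> 1.5x oversubscription
```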
Deployment results
$ kubectl apply -f volcano-vgpu-device-plugin.yaml
configmap/volcano-vgpu-device-config created
configmap/volcano-vgpu-node-config created
serviceaccount/volcano-device-plugin created
clusterrole.rbac.authorization.k8s.io/volcano-device-plugin created
clusterrolebinding.rbac.authorization.k8s.io/volcano-device-plugin created
Warning: spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead
daemonset.apps/volcano-device-plugin created
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
volcano-admission-7dc9b78fc6-686tb 1/1 Running 0 20d
volcano-admission-7dc9b78fc6-d9vzk 1/1 Running 0 20d
volcano-admission-7dc9b78fc6-h2ssl 1/1 Running 0 20d
volcano-controllers-855c676dd4-4gpxp 1/1 Running 1 (13d ago) 20d
volcano-controllers-855c676dd4-pspzg 1/1 Running 0 20d
volcano-controllers-855c676dd4-zl8cd 1/1 Running 0 20d
volcano-device-plugin-7g6v2 2/2 Running 0 22s
volcano-scheduler-6645c59d6d-56xdc 1/1 Running 0 6m58s
volcano-scheduler-6645c59d6d-p549s 1/1 Running 0 6m58s
volcano-scheduler-6645c59d6d-pqt68 1/1 Running 0 6m58s
Looking at the node's vGPU resources, you can see that the nvidia.com/gpu resource originally registered by the NVIDIA device plugin has been zeroed out, and the new vGPU-related resources volcano.sh/vgpu-cores, volcano.sh/vgpu-memory, and volcano.sh/vgpu-number have been generated.
# ...
Capacity:
  cpu:                     128
  ephemeral-storage:       562291Mi
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  263746296Ki
  nvidia.com/gpu:          0
  nvidia.com/gpu.shared:   0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  196512
  volcano.sh/vgpu-number:  80
Allocatable:
  cpu:                     127600m
  ephemeral-storage:       562291Mi
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  256048108548
  nvidia.com/gpu:          0
  nvidia.com/gpu.shared:   0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  196512
  volcano.sh/vgpu-number:  80
# ...
Automatically generated resources:
| Resource | Description | Example |
|---|---|---|
| volcano.sh/vgpu-cores | Total vGPU compute capacity as a percentage: number of cards on the node * 100 | 800 |
| volcano.sh/vgpu-memory | Total vGPU memory in MiB: number of cards * per-card memory. A 4090 has 24564 MiB per card, so the total here is 196512 MiB | 196512 |
| volcano.sh/vgpu-number | Number of vGPUs: number of cards * deviceSplitCount | 80 |
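The three totals follow directly from the card count and the configuration; for this node (8 x RTX 4090 at 24564 MiB each, deviceSplitCount=10) the arithmetic can be sketched as:

```python
def node_vgpu_resources(card_count, per_card_memory_mib, device_split_count):
    """Node-level vGPU resource totals, mirroring the rules in the table above."""
    return {
        "volcano.sh/vgpu-cores": card_count * 100,                   # 100% per card
        "volcano.sh/vgpu-memory": card_count * per_card_memory_mib,  # MiB
        "volcano.sh/vgpu-number": card_count * device_split_count,
    }

res = node_vgpu_resources(card_count=8, per_card_memory_mib=24564, device_split_count=10)
print(res)
# {'volcano.sh/vgpu-cores': 800, 'volcano.sh/vgpu-memory': 196512, 'volcano.sh/vgpu-number': 80}
```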
Enable vGPU support in the Volcano scheduler
Edit volcano-scheduler-configmap and add the following plugin:
- name: deviceshare
  arguments:
    # whether to enable the vgpu feature
    deviceshare.VGPUEnable: true
    # namespace of the volcano-vgpu-device-config ConfigMap,
    # so the scheduler can read its contents automatically
    deviceshare.KnownGeometriesCMNamespace: volcano-system
The result looks like this (for reference only; adjust the Volcano actions and plugins to your own needs):
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
- plugins:
  - name: drf
    enablePreemptable: false
  - name: deviceshare
    arguments:
      # whether to enable the vgpu feature
      deviceshare.VGPUEnable: true
      # namespace of the volcano-vgpu-device-config ConfigMap,
      # so the scheduler can read its contents automatically
      deviceshare.KnownGeometriesCMNamespace: volcano-system
  - name: predicates
  - name: capacity-card
    arguments:
      cardUnlimitedCpuMemory: true
  - name: nodeorder
  - name: binpack
After editing, restart volcano-scheduler.
Testing
Basic vGPU usage
The test Pod uses the nvidia/cuda:12.2.0-base image, mirrored into the local cluster's Harbor registry at aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04:
apiVersion: v1
kind: Pod
metadata:
  name: test-vgpu
spec:
  # must use the volcano scheduler
  schedulerName: volcano
  # tolerate all taints; for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        volcano.sh/vgpu-number: 2     # (required) request 2 GPU cards
        volcano.sh/vgpu-memory: 3000  # (optional) each vGPU gets 3000 MiB of memory; requests above the per-card memory are clamped to the card maximum
        volcano.sh/vgpu-cores: 50     # (optional) each vGPU gets 50% of a card's compute
      limits:
        volcano.sh/vgpu-number: 2
        volcano.sh/vgpu-memory: 3000
        volcano.sh/vgpu-cores: 50
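The clamping noted in the vgpu-memory comment (a request above the per-card memory falls back to the full card) can be sketched as follows; the 24564 MiB figure is the 4090's per-card memory from the node output above:

```python
def effective_vgpu_memory(requested_mib, per_card_memory_mib=24564):
    """Per-vGPU memory actually granted: requests above the physical card
    size are clamped to the card maximum."""
    return min(requested_mib, per_card_memory_mib)

print(effective_vgpu_memory(3000))   # 3000  -> shows up as 3000MiB in nvidia-smi
print(effective_vgpu_memory(99999))  # 24564 -> clamped to the card's full memory
```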
After it starts, check the Pod:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-vgpu 1/1 Running 0 23s
Exec into the Pod container and run nvidia-smi to view the vGPU resource information:
kubectl exec -it test-vgpu bash
The vGPU resource information looks like this:
root@test-vgpu:/# nvidia-smi
[HAMI-core Msg(18:140441960732480:libvgpu.c:839)]: Initializing.....
Mon Nov 24 12:13:31 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:BA:00.0 Off | Off |
| 30% 35C P8 13W / 450W | 0MiB / 3000MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:BB:00.0 Off | Off |
| 30% 33C P8 24W / 450W | 0MiB / 3000MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(18:140441960732480:multiprocess_memory_limit.c:455)]: Calling exit handler 18
root@test-vgpu:/#
The lines in the output that start with HAMI-core are debug messages from HAMi-core's CUDA API interception, showing that HAMi-core is actually in effect. For example, [HAMI-core Msg(18:140441960732480:multiprocess_memory_limit.c:455)]: Calling exit handler 18 is emitted by the HAMi-core component, which performs some resource cleanup at the end of the nvidia-smi run.
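The HAMi-core message format can be split apart with a small regex; the field meanings (pid, thread id, source file, line number) are inferred from the samples above, not documented behavior:

```python
import re

# Inferred format: [HAMI-core Msg(<pid>:<tid>:<file>:<line>)]: <text>
LOG_RE = re.compile(r"\[HAMI-core Msg\((\d+):(\d+):([\w.]+):(\d+)\)\]: (.*)")

line = "[HAMI-core Msg(18:140441960732480:libvgpu.c:839)]: Initializing....."
pid, tid, src, lineno, msg = LOG_RE.match(line).groups()
print(src, lineno)  # libvgpu.c 839
```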
Using nvidia-device-plugin resources
Nodes still using the NVIDIA device plugin are unaffected. The Pod YAML is:
apiVersion: v1
kind: Pod
metadata:
  name: test-nvidia-device-plugin
spec:
  # must use the volcano scheduler
  schedulerName: volcano
  # tolerate all taints; for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        nvidia.com/gpu: 2  # use the resource registered by the nvidia device plugin
      limits:
        nvidia.com/gpu: 2
Making vGPU resource names compatible with NVIDIA's
Once vGPU is enabled on a node, Pods can only be scheduled to that node by requesting the vGPU resource names; the original resource names can no longer be used to land on it.
Volcano vGPU also supports configuring compatible resource names for whole cards and vGPUs, for example making the vGPU resource name identical to NVIDIA's (nvidia.com/gpu). Let's run a compatibility test.
Configuration changes
Adjust the global vGPU resource names as follows (resourceCountName changes from volcano.sh/vgpu-number to nvidia.com/gpu):
nvidia:
  resourceCountName: nvidia.com/gpu
  resourceMemoryName: volcano.sh/vgpu-memory
  resourceMemoryPercentageName: volcano.sh/vgpu-memory-percentage
  resourceCoreName: volcano.sh/vgpu-cores
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  deviceSplitCount: 10
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
  gpuMemoryFactor: 1
  knownMigGeometries: []
After restarting volcano-vgpu-device-plugin, however, the new resource name did not show up on the node; the change had no effect. Reading the source of volcano-vgpu-device-plugin and Volcano's deviceshare plugin revealed why:
- The volcano-vgpu-device-config ConfigMap is consumed only by Volcano's deviceshare plugin. The volcano-vgpu-device-plugin component ignores this ConfigMap and instead takes its resource names from command-line flags:
| Flag | Description | Default |
|---|---|---|
| resource-name | Resource name for the number of vGPUs, published on the node | volcano.sh/vgpu-number |
| resource-memory-name | Resource name for vGPU memory size, published on the node | volcano.sh/vgpu-memory |
| resource-core-name | Resource name for vGPU compute capacity, published on the node | volcano.sh/vgpu-cores |
| debug | Enable debug mode | false |
- The corresponding settings of the two components must stay consistent, otherwise Pods cannot be deployed.
Change the command-line arguments from:
containers:
- image: aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0
  args: ["--device-split-count=10"]
to:
containers:
- image: aiharbor.msxf.local/test/projecthami/volcano-vgpu-device-plugin:v1.11.0
  args: [
    "--device-split-count=10",
    "--resource-name=nvidia.com/gpu"
  ]
Deployment file example
The complete volcano-vgpu-device-plugin deployment file, for reference: volcano-vgpu-device-config.compatible.yaml
After applying it, volcano-vgpu-device-plugin restarts; also restart the Volcano scheduler manually. The vGPU node's resources then look like this: the vGPU card resource name matches NVIDIA's, nvidia.com/gpu:
# ...
Capacity:
  cpu:                     128
  ephemeral-storage:       562291Mi
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  263746296Ki
  nvidia.com/gpu:          80
  nvidia.com/gpu.shared:   0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  196512
  volcano.sh/vgpu-number:  80
Allocatable:
  cpu:                     127600m
  ephemeral-storage:       562291Mi
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  256048108548
  nvidia.com/gpu:          80
  nvidia.com/gpu.shared:   0
  pods:                    110
  volcano.sh/vgpu-cores:   800
  volcano.sh/vgpu-memory:  196512
  volcano.sh/vgpu-number:  0
# ...
Test file example
Run the following example to schedule a Pod onto the vGPU node:
apiVersion: v1
kind: Pod
metadata:
  name: test-vgpu-compatible
spec:
  # must use the volcano scheduler
  schedulerName: volcano
  # add a node selector to run on the vGPU node
  nodeSelector:
    name: dev-app-2-150-master-1
  # tolerate all taints; for testing only
  tolerations:
  - key: volcano.sh/vgpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    operator: Exists
    effect: NoSchedule
  - key: special.accelerate.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/node.usage
    operator: Exists
    effect: NoSchedule
  - key: maip.msxf.io/ib.present
    operator: Exists
    effect: NoSchedule
  containers:
  - name: cuda-container
    image: aiharbor.msxf.local/test/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep"]
    args: ["100000"]
    resources:
      requests:
        nvidia.com/gpu: 2  # request 2 GPU cards
      limits:
        nvidia.com/gpu: 2
After applying it, the Pod is scheduled and runs successfully. Exec into the Pod container and check the resources: the requested compute and memory are allocated as whole cards, which is HAMi vGPU's default behavior here, for compatibility with the original NVIDIA device plugin:
$ kubectl exec -it test-vgpu-compatible bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
root@test-vgpu-compatible:/# nvidia-smi
[HAMI-core Msg(15:139748339885888:libvgpu.c:839)]: Initializing.....
Tue Nov 25 09:26:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:BA:00.0 Off | Off |
| 30% 34C P8 13W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 On | 00000000:BB:00.0 Off | Off |
| 30% 32C P8 25W / 450W | 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
[HAMI-core Msg(15:139748339885888:multiprocess_memory_limit.c:455)]: Calling exit handler 15
root@test-vgpu-compatible:/#
Monitoring metrics
Volcano vgpu metrics are exposed through the Volcano scheduler. From any Pod in the cluster that has curl available, query the scheduler's metrics endpoint, for example:
# 10.233.75.65 is the ClusterIP of the active volcano scheduler
curl 10.233.75.65:8080/metrics
The response is fairly large; the vGPU-related metrics are collected in: volcano-vgpu-metrics.txt
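Since the /metrics payload is large, a simple line filter helps isolate the vGPU series; the metric names in the sample below are made up for illustration (the real names are in the metrics file referenced above):

```python
def vgpu_metrics(metrics_text):
    """Keep only non-comment lines whose metric name mentions vgpu."""
    return [line for line in metrics_text.splitlines()
            if line and not line.startswith("#") and "vgpu" in line.lower()]

# Hypothetical excerpt of a /metrics response.
sample = """# HELP volcano_vgpu_device_allocated_memory hypothetical help text
volcano_vgpu_device_allocated_memory{node="n1"} 3000
scheduler_e2e_scheduling_duration_seconds 0.1"""
print(vgpu_metrics(sample))  # ['volcano_vgpu_device_allocated_memory{node="n1"} 3000']
```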
FAQ
vGPU Pod fails with UnexpectedAdmissionError
After changing the resourceCountName option in the volcano-vgpu-device-config ConfigMap to a custom resource name, the Pod ends up in UnexpectedAdmissionError:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-vgpu-compatible 0/1 UnexpectedAdmissionError 0 75s
kubectl describe pod shows the following Events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 25s volcano Successfully assigned volcano-system/test-vgpu-compatible to dev-app-2-150-master-1
Warning UnexpectedAdmissionError 25s kubelet Allocate failed due to rpc error: code = Unknown desc = device request not found, which is unexpected
Digging through the volcano and volcano-vgpu-device-plugin source shows that this is caused by inconsistent configuration. When changing a resource name, the configuration must be correct and consistent in three places. Taking resourceCountName changed to nvidia.com/gpu as an example, adjust the following:
- resourceCountName: nvidia.com/gpu in the volcano-vgpu-device-config ConfigMap
- the --resource-name=nvidia.com/gpu command-line flag of volcano-vgpu-device-plugin
- the deviceshare plugin in volcano-scheduler-configmap must point at the correct namespace:
- name: deviceshare
  arguments:
    # whether to enable the vgpu feature
    deviceshare.VGPUEnable: true
    # namespace of the volcano-vgpu-device-config ConfigMap,
    # so the scheduler can read its contents automatically
    deviceshare.KnownGeometriesCMNamespace: volcano-system
You can check which device config the scheduler actually loaded from its logs:
$ kubectl logs volcano-scheduler-6645c59d6d-bcw68 | grep "device config"
I1125 09:11:57.408175 1 config.go:113] "Initializing volcano device config" device-configs={"NvidiaConfig":{"ResourceCountName":"nvidia.com/gpu","ResourceMemoryName":"volcano.sh/vgpu-memory","ResourceCoreName":"volcano.sh/vgpu-cores","ResourceMemoryPercentageName":"volcano.sh/vgpu-memory-percentage","ResourcePriority":"","OverwriteEnv":false,"DefaultMemory":0,"DefaultCores":0,"DefaultGPUNum":1,"DeviceSplitCount":10,"DeviceMemoryScaling":1,"DeviceCoreScaling":1,"DisableCoreLimit":false,"MigGeometriesList":[],"GPUMemoryFactor":1}}
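The three-way consistency requirement can also be checked mechanically before redeploying; a minimal sketch, assuming you have already extracted the ConfigMap's resourceCountName and the DaemonSet's args (the scheduler-side namespace check is left out):

```python
def resource_names_consistent(configmap_count_name, plugin_args):
    """The deviceshare plugin reads resourceCountName from the ConfigMap, while
    the device plugin takes --resource-name on its command line; both must agree."""
    cli_name = "volcano.sh/vgpu-number"  # plugin default when the flag is absent
    for arg in plugin_args:
        if arg.startswith("--resource-name="):
            cli_name = arg.split("=", 1)[1]
    return configmap_count_name == cli_name

print(resource_names_consistent(
    "nvidia.com/gpu",
    ["--device-split-count=10", "--resource-name=nvidia.com/gpu"]))  # True
print(resource_names_consistent(
    "nvidia.com/gpu",
    ["--device-split-count=10"]))  # False -> leads to UnexpectedAdmissionError
```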