Overview
In large AI model training, model parallelism splits a model across multiple nodes, and those nodes must exchange large volumes of data frequently during training. Inter-node network performance then often becomes the bottleneck and significantly hurts training efficiency.
Volcano's Network Topology Aware Scheduling feature uses a unified network topology API and intelligent scheduling policies to place workloads into the best-performing domain with the highest throughput and lowest latency, minimizing cross-switch communication to speed up data exchange and improve training efficiency.
This article walks through deploying a Kind cluster, creating HyperNodes, and running test jobs to verify Volcano's network topology aware scheduling in practice.
Test Environment Topology
We will build the following network topology:
tier3                           s6
                          /           \
tier2              s4                       s5
                 /     \                 /     \
tier1        s0          s1          s2          s3
           /   \       /   \       /   \       /   \
        node0 node1 node2 node3 node4 node5 node6 node7
This topology models a typical data center network:
- tier1 (leaf layer): 4 HyperNodes (s0-s3), each containing 2 physical nodes
- tier2 (aggregation layer): 2 HyperNodes (s4-s5), each managing 2 tier1 HyperNodes
- tier3 (core layer): 1 HyperNode (s6), managing all tier2 HyperNodes
Communication efficiency between nodes follows these rules:
- Nodes within the same tier1 HyperNode communicate most efficiently
- Nodes in different tier1 HyperNodes but within the same tier2 come next
- Nodes in different tier2 HyperNodes must traverse tier3 and communicate least efficiently
Environment Setup
Create a Kind cluster with 8 worker nodes to simulate our network topology.
Kind cluster configuration file
The node label here must not be topology.kubernetes.io/switch. Kind uses the kindnet CNI plugin by default, which recognizes the topology.kubernetes.io/switch label and tries to:
- configure the same subnet/routes for nodes that share a switch label
- generate switch-based network policy rules
Because the node names are set manually (node0-node7), kindnet then hits a "node name does not match topology label" error while configuring the network, which blocks kubelet network initialization and shows up as connection refused on port 10248.
We therefore use the custom label switch instead of topology.kubernetes.io/switch to avoid this problem.
Create the file kind-cluster.yaml:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: volcano-topology-test
nodes:
# Control plane node
- role: control-plane
image: kindest/node:v1.27.3
# Worker nodes - simulating the network topology
# tier1 - s0
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s0
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node0
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s0
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node1
# tier1 - s1
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s1
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node2
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s1
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node3
# tier1 - s2
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s2
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node4
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s2
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node5
# tier1 - s3
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s3
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node6
- role: worker
image: kindest/node:v1.27.3
labels:
switch: s3
kubeadmConfigPatches:
- |
kind: JoinConfiguration
nodeRegistration:
name: node7
Create the cluster
# Create the cluster
kind create cluster --config kind-cluster.yaml
# Verify the nodes
kubectl get nodes
Expected output:
NAME STATUS ROLES AGE VERSION
node0 Ready <none> 10s v1.27.3
node1 Ready <none> 10s v1.27.3
node2 Ready <none> 10s v1.27.3
node3 Ready <none> 10s v1.27.3
node4 Ready <none> 10s v1.27.3
node5 Ready <none> 9s v1.27.3
node6 Ready <none> 10s v1.27.3
node7 Ready <none> 10s v1.27.3
volcano-topology-test-control-plane Ready control-plane 30s v1.27.3
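You can also confirm that the custom switch labels were applied; the -L flag adds a column showing the given label for each node:
# Each worker node should show its switch label (s0-s3)
kubectl get nodes -L switch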
Install Volcano
Install a Volcano release that supports network topology aware scheduling with Helm:
# Add the Volcano Helm repository
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
# Install Volcano (pick the version you need)
helm install volcano volcano-sh/volcano \
-n volcano-system \
--create-namespace \
--version 1.13.0
# Wait for the Volcano components to become ready; this usually takes a few minutes
Verify the installation:
kubectl get pods -n volcano-system
Expected output:
NAME READY STATUS RESTARTS AGE
volcano-admission-b84bbd89-9k55v 1/1 Running 0 105s
volcano-controllers-7b97b6455c-q2jf9 1/1 Running 0 105s
volcano-scheduler-65d4d4645b-k6nmk 1/1 Running 0 105s
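Network topology aware scheduling relies on the HyperNode CRD, which should be installed along with the Helm chart. A quick sanity check (the same command is used again in the troubleshooting section):
# Confirm the HyperNode CRD is present
kubectl get crd hypernodes.topology.volcano.sh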
Configure the Volcano Scheduler
To enable network topology aware scheduling, update the Volcano scheduler configuration and enable the network-topology-aware plugin:
apiVersion: v1
kind: ConfigMap
metadata:
name: volcano-scheduler-configmap
namespace: volcano-system
data:
volcano-scheduler.conf: |
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
- name: priority
- name: gang
enablePreemptable: false
- name: conformance
- plugins:
- name: overcommit
- name: drf
enablePreemptable: false
- name: predicates
- name: proportion
- name: nodeorder
- name: binpack
# Enable the network topology aware scheduling plugin
- name: network-topology-aware
Apply the configuration:
kubectl apply -f volcano-scheduler-configmap.yaml
Restart the scheduler so the new configuration takes effect (a rolling restart keeps this graceful):
kubectl rollout restart deployment volcano-scheduler -n volcano-system
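To double-check that the restarted scheduler picked up the new configuration, wait for the rollout to finish and look for the plugin name in its logs (the same log check appears in the troubleshooting section below):
# Wait for the restarted scheduler Pod to become ready
kubectl rollout status deployment volcano-scheduler -n volcano-system
# The plugin name should show up in the scheduler logs once it is loaded
kubectl logs -n volcano-system -l app=volcano-scheduler --tail=200 | grep -i "network-topology-aware"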
Create HyperNode Resources
Create the HyperNode resources according to the network topology we designed.
Create the file hypernodes.yaml:
# Tier1 - leaf HyperNodes
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s0
spec:
tier: 1
members:
- type: Node
selector:
labelMatch:
matchLabels:
switch: s0
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s1
spec:
tier: 1
members:
- type: Node
selector:
labelMatch:
matchLabels:
switch: s1
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s2
spec:
tier: 1
members:
- type: Node
selector:
labelMatch:
matchLabels:
switch: s2
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s3
spec:
tier: 1
members:
- type: Node
selector:
labelMatch:
matchLabels:
switch: s3
---
# Tier2 - aggregation layer HyperNodes
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s4
spec:
tier: 2
members:
- type: HyperNode
selector:
exactMatch:
name: "s0"
- type: HyperNode
selector:
exactMatch:
name: "s1"
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s5
spec:
tier: 2
members:
- type: HyperNode
selector:
exactMatch:
name: "s2"
- type: HyperNode
selector:
exactMatch:
name: "s3"
---
# Tier3 - core layer HyperNode
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s6
spec:
tier: 3
members:
- type: HyperNode
selector:
exactMatch:
name: "s4"
- type: HyperNode
selector:
exactMatch:
name: "s5"
Apply the HyperNode configuration:
# Create the HyperNode resources
kubectl apply -f hypernodes.yaml
# List HyperNodes (this CRD is cluster-scoped)
kubectl get hypernodes
Expected output:
NAME TIER NODECOUNT AGE
s0 1 2 10s
s1 1 2 10s
s2 1 2 10s
s3 1 2 10s
s4 2 2 10s
s5 2 2 10s
s6 3 2 10s
Inspect HyperNode details:
# Inspect tier1 s0
kubectl get hypernode s0 -o yaml
# Inspect tier2 s4
kubectl get hypernode s4 -o yaml
# Inspect tier3 s6
kubectl get hypernode s6 -o yaml
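Since the leaf HyperNodes select nodes through the switch label, you can also cross-check which physical nodes each of them should match:
# s0 should match node0/node1, s1 should match node2/node3, and so on
kubectl get nodes -l switch=s0
kubectl get nodes -l switch=s1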
Testing Network Topology Aware Scheduling
Hard mode - Tier1 constraint
Create a job that can only be scheduled within a single tier1 HyperNode.
Create a simple test file topology-test-1.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: topology-test-1
spec:
minAvailable: 2
schedulerName: volcano
queue: default
# Network topology constraint: schedule only within tier1
networkTopology:
mode: hard
highestTierAllowed: 1
tasks:
- replicas: 2
name: worker
template:
spec:
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Run the test:
# Create the job
kubectl apply -f topology-test-1.yaml
# Check the job status
kubectl get vcjob topology-test-1
# Check where the Pods were scheduled
kubectl get pods -o wide
Scheduling result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-1-worker-0 1/1 Running 0 11s 10.244.7.2 node0 <none> <none>
topology-test-1-worker-1 1/1 Running 0 11s 10.244.2.3 node1 <none> <none>
As expected:
- Both Pods are scheduled into the same tier1 HyperNode (i.e. the same one of s0/s1/s2/s3)
- In this run they landed on node0 and node1 (s0)
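A quick way to verify this is to map each Pod back to its node's switch label. The sketch below relies on the volcano.sh/job-name label that Volcano adds to a job's Pods (the same label is used later in this article):
# Print the switch label of every node that received a Pod from this job;
# all lines should show the same switch, i.e. the same tier1 HyperNode
for node in $(kubectl get pods -l volcano.sh/job-name=topology-test-1 -o jsonpath='{.items[*].spec.nodeName}'); do
  kubectl get node "$node" -L switch --no-headers
done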
Clean up:
kubectl delete -f topology-test-1.yaml
Hard mode - Tier2 constraint
Create a job that may span tier1 HyperNodes but must stay within a single tier2 HyperNode.
Create the file topology-test-2.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: topology-test-2
spec:
minAvailable: 4
schedulerName: volcano
queue: default
# Network topology constraint: may cross tier1, but must stay within tier2
networkTopology:
mode: hard
highestTierAllowed: 2
tasks:
- replicas: 4
name: worker
template:
spec:
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Run the test:
# Create the job
kubectl apply -f topology-test-2.yaml
# Check the job status
kubectl get vcjob topology-test-2
# Check where the Pods were scheduled
kubectl get pods -o wide
Scheduling result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-2-worker-0 1/1 Running 0 9s 10.244.3.37 node3 <none> <none>
topology-test-2-worker-1 1/1 Running 0 9s 10.244.1.46 node2 <none> <none>
topology-test-2-worker-2 1/1 Running 0 9s 10.244.3.38 node3 <none> <none>
topology-test-2-worker-3 1/1 Running 0 9s 10.244.3.39 node3 <none> <none>
As expected:
- All 4 Pods are scheduled within the same tier2 HyperNode
- They may span 2 tier1 HyperNodes, but all stay inside s4 (s0-s1) or s5 (s2-s3)
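For the result shown above, you can confirm that node2 and node3 share the same switch label (s1), i.e. the Pods stayed within one tier1 group and therefore within one tier2 group:
# Both nodes should show switch=s1
kubectl get nodes node2 node3 -L switch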
Clean up:
kubectl delete -f topology-test-2.yaml
Hard mode - Tier2 constraint + anti-affinity
Building on test 2, add an anti-affinity constraint so the Pods spread across different nodes, which makes the test result easier to interpret.
Create the file topology-test-3.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: topology-test-3
spec:
minAvailable: 4
schedulerName: volcano
queue: default
# Network topology constraint: may cross tier1, but must stay within tier2
networkTopology:
mode: hard
highestTierAllowed: 2
tasks:
- replicas: 4
name: worker
template:
metadata:
labels:
# Used by the anti-affinity rule to spread the Pods across different nodes
app: exclusive-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- exclusive-app
topologyKey: kubernetes.io/hostname
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Configuration notes:
- Anti-affinity: podAntiAffinity.requiredDuringSchedulingIgnoredDuringExecution ensures Pods with the same app=exclusive-app label are never scheduled onto the same node
- Topology key: topologyKey: kubernetes.io/hostname makes the exclusion apply per node hostname
- Combined effect: the 4 Pods must spread across 4 different nodes, and those 4 nodes must belong to the same tier2 HyperNode
Run the test:
# Create the job
kubectl apply -f topology-test-3.yaml
# Check the job status
kubectl get vcjob topology-test-3
# Check where the Pods were scheduled
kubectl get pods -o wide
Expected scheduling result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-3-worker-0 1/1 Running 0 10s 10.244.7.3 node0 <none> <none>
topology-test-3-worker-1 1/1 Running 0 10s 10.244.2.4 node1 <none> <none>
topology-test-3-worker-2 1/1 Running 0 10s 10.244.1.5 node2 <none> <none>
topology-test-3-worker-3 1/1 Running 0 10s 10.244.3.6 node3 <none> <none>
As expected:
- The 4 Pods are scheduled onto 4 different nodes (anti-affinity satisfied)
- Those 4 nodes all belong to the same tier2 HyperNode, e.g. s4 (node0-node3) or s5 (node4-node7)
- The network topology and anti-affinity constraints are satisfied at the same time
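An optional check that the Pods really ended up on 4 distinct nodes (relying on the volcano.sh/job-name label on the job's Pods):
# Each node name should appear exactly once
kubectl get pods -l volcano.sh/job-name=topology-test-3 -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort | uniq -c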
Clean up:
kubectl delete -f topology-test-3.yaml
Hard mode - Deployment workloads
The examples above all use the Volcano Job workload, but network topology aware scheduling also supports native Deployments. Adding specific annotations to the Deployment and its Pod template achieves the same topology constraints.
Create the file topology-test-6.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: topology-test-6
annotations:
# Gang scheduling: at least 4 Pods must be scheduled together
scheduling.volcano.sh/group-min-member: "4"
spec:
replicas: 4
selector:
matchLabels:
app: exclusive-app
template:
metadata:
labels:
# Used by the anti-affinity rule to spread the Pods across different nodes
app: exclusive-app
annotations:
# Queue name
scheduling.volcano.sh/queue-name: default
# Treat the network topology constraint as hard
volcano.sh/network-topology-mode: "hard"
# Allow scheduling up to network tier 2 at most
volcano.sh/network-topology-highest-tier: "2"
spec:
schedulerName: volcano
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- exclusive-app
topologyKey: kubernetes.io/hostname
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Configuration notes:
- Deployment-level annotations:
  - scheduling.volcano.sh/group-min-member: "4": enables gang scheduling and requires at least 4 Pods to be scheduled together
- Pod template annotations:
  - scheduling.volcano.sh/queue-name: default: selects the Volcano queue
  - volcano.sh/network-topology-mode: "hard": treats the topology constraint as hard
  - volcano.sh/network-topology-highest-tier: "2": restricts scheduling to within tier2
- Scheduler: schedulerName: volcano selects the Volcano scheduler
- Anti-affinity: spreads the 4 Pods across 4 different nodes
Run the test:
# Create the Deployment
kubectl apply -f topology-test-6.yaml
# Check the Deployment status
kubectl get deployment topology-test-6
# Check where the Pods were scheduled
kubectl get pods -o wide -l app=exclusive-app
Expected scheduling result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-6-7b8c9d5f6-2x4pk 1/1 Running 0 15s 10.244.7.5 node0 <none> <none>
topology-test-6-7b8c9d5f6-5h9km 1/1 Running 0 15s 10.244.2.6 node1 <none> <none>
topology-test-6-7b8c9d5f6-8n3qr 1/1 Running 0 15s 10.244.1.7 node2 <none> <none>
topology-test-6-7b8c9d5f6-9w5vp 1/1 Running 0 15s 10.244.3.8 node3 <none> <none>
As expected:
- All 4 Pods are scheduled and running (gang scheduling satisfied)
- The 4 Pods land on 4 different nodes (anti-affinity satisfied)
- Those 4 nodes all belong to the same tier2 HyperNode, e.g. s4 (node0-node3) or s5 (node4-node7)
- The network topology and anti-affinity constraints are satisfied at the same time
Verify the PodGroup (Volcano creates it automatically for the Deployment):
# List the automatically created PodGroup
kubectl get podgroup
# Inspect the PodGroup details
kubectl get podgroup <podgroup-name> -o yaml
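As a hedged extra check, you can look for the topology settings in the generated PodGroup; this assumes the controller mirrors the annotations into the PodGroup spec, and the exact field layout may differ between Volcano versions:
# Look for the network topology settings carried by the PodGroup
kubectl get podgroup -o yaml | grep -i -A3 "networktopology"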
Clean up:
kubectl delete -f topology-test-6.yaml
Scheduling failure scenario - crossing the Tier2 boundary
Test the case where a job would need to cross the tier2 boundary and therefore cannot be scheduled in hard mode.
Create the file topology-test-4.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: topology-test-4
spec:
minAvailable: 5
schedulerName: volcano
queue: default
# Network topology constraint: must stay within tier2
networkTopology:
mode: hard
highestTierAllowed: 2
tasks:
- replicas: 5
name: worker
template:
metadata:
labels:
# Used by the anti-affinity rule to spread the Pods across different nodes
app: exclusive-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- exclusive-app
topologyKey: kubernetes.io/hostname
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Configuration notes:
- Pod count: 5 Pods are required, each on a different node (anti-affinity)
- Topology constraint: highestTierAllowed: 2 requires all Pods to stay within a single tier2 HyperNode
- Resource conflict: each tier2 HyperNode (s4 or s5) contains only 4 nodes, so 5 distinct nodes cannot be found
Run the test:
# Create the job
kubectl apply -f topology-test-4.yaml
# Check the job status (it should be Pending)
kubectl get vcjob topology-test-4
# Check the Pod status
kubectl get pods -l volcano.sh/job-name=topology-test-4
# Check scheduling events
kubectl describe vcjob topology-test-4
Expected result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-4-worker-0 0/1 Pending 0 2m17s <none> <none> <none> <none>
topology-test-4-worker-1 0/1 Pending 0 2m17s <none> <none> <none> <none>
topology-test-4-worker-2 0/1 Pending 0 2m17s <none> <none> <none> <none>
topology-test-4-worker-3 0/1 Pending 0 2m17s <none> <none> <none> <none>
topology-test-4-worker-4 0/1 Pending 0 2m17s <none> <none> <none> <none>
As expected:
- All 5 Pods remain Pending and cannot be scheduled
- Reason: each tier2 HyperNode has only 4 nodes, so the following constraints cannot be satisfied at the same time:
  - the 5 Pods must land on 5 different nodes (anti-affinity, required by this test)
  - the nodes must stay within a single tier2 HyperNode (network topology constraint), i.e. only the 4 nodes of either s4 or s5 may be used, and the Pods cannot be split across both
Inspect the scheduling details:
# Check the scheduler logs
kubectl logs -n volcano-system -l app=volcano-scheduler --tail=50 | grep -i "topology-test-4"
# Check the PodGroup status
kubectl get podgroup topology-test-4 -o yaml
Clean up:
kubectl delete -f topology-test-4.yaml
Soft mode - scheduling across Tier2
Run the example from test 4 in soft mode. This allows scheduling across tier2, for example placing some Pods under s4 and some under s5. In real workloads, communication that crosses this many topology tiers is very inefficient, so this example is for testing and reference only.
Create the file topology-test-5.yaml:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
name: topology-test-5
spec:
minAvailable: 5
schedulerName: volcano
queue: default
# Network topology constraint: stay within tier2 as far as possible
networkTopology:
mode: soft
highestTierAllowed: 2
tasks:
- replicas: 5
name: worker
template:
metadata:
labels:
# Used by the anti-affinity rule to spread the Pods across different nodes
app: exclusive-app
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- exclusive-app
topologyKey: kubernetes.io/hostname
containers:
- name: busybox
image: busybox:latest
command: ["sh", "-c", "sleep 3600"]
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
Configuration notes:
- Scheduling mode: soft (best-effort constraint); the scheduler tries its best to honor the topology constraint but is allowed to fall back
- Pod count: 5 Pods, each on a different node (anti-affinity)
- Topology constraint: highestTierAllowed: 2, schedule within tier2 whenever possible
- Fallback: when a single tier2 HyperNode cannot fit the job, scheduling may cross the tier2 boundary
Run the test:
# Create the job
kubectl apply -f topology-test-5.yaml
# Check the job status
kubectl get vcjob topology-test-5
# Check where the Pods were scheduled
kubectl get pods -o wide -l volcano.sh/job-name=topology-test-5
Expected scheduling result:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
topology-test-5-worker-0 1/1 Running 0 13m 10.244.7.44 node0 <none> <none>
topology-test-5-worker-1 1/1 Running 0 13m 10.244.2.55 node1 <none> <none>
topology-test-5-worker-2 1/1 Running 0 13m 10.244.1.49 node2 <none> <none>
topology-test-5-worker-3 1/1 Running 0 13m 10.244.3.42 node3 <none> <none>
topology-test-5-worker-4 1/1 Running 0 13m 10.244.8.16 node4 <none> <none>
As expected:
- All 5 Pods are scheduled and running (in contrast to the Pending state in test 4)
- The Pods are spread across 5 different nodes (anti-affinity satisfied)
- Because a single tier2 HyperNode has only 4 nodes, the scheduler is allowed to cross the tier2 boundary
- A typical distribution: 4 Pods under s4 (node0-node3) and 1 Pod under s5 (e.g. node4)
- Communication between node0-node3 and node4 has to cross tier2 and go through tier3 (s6), so it is less efficient
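With the distribution shown above, the spillover becomes visible by listing the switch labels of the scheduled nodes:
# node0-node3 map to s0/s1 (under s4), while node4 maps to s2 (under s5)
kubectl get nodes node0 node1 node2 node3 node4 -L switch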
Comparison with test 4:
- Test 4 (hard mode): cannot be scheduled; all Pods stay Pending
- Test 5 (soft mode): scheduled successfully, but may cross the topology boundary at the cost of some network performance
Clean up:
kubectl delete -f topology-test-5.yaml
Common Issues
HyperNode creation fails
Possible causes:
- The CRD is not installed correctly
- Node names do not match
- Malformed YAML
- A non-leaf HyperNode uses an unsupported selector type (only exactMatch is supported)
How to investigate:
# Check whether the CRD exists
kubectl get crd hypernodes.topology.volcano.sh
# Verify node names and labels
kubectl get nodes --show-labels
# Inspect the HyperNode details
kubectl describe hypernode <hypernode-name>
About selector restrictions:
- Leaf HyperNodes (members of type Node) support three selectors:
  - exactMatch: match node names exactly
  - regexMatch: match node names with a regular expression
  - labelMatch: match nodes by labels
- Non-leaf HyperNodes (members of type HyperNode) support only exactMatch:
  - child HyperNode names must be specified exactly with exactMatch
  - regexMatch and labelMatch are not supported
Incorrect example (using labelMatch on a non-leaf HyperNode fails):
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s4
spec:
tier: 2
members:
- type: HyperNode
selector:
labelMatch: # ❌ Wrong: non-leaf HyperNodes do not support labelMatch
matchLabels:
tier: "1"
Correct example:
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
name: s4
spec:
tier: 2
members:
- type: HyperNode
selector:
exactMatch: # ✅ Correct: use exactMatch
name: "s0"
- type: HyperNode
selector:
exactMatch:
name: "s1"
The scheduler does not apply network topology awareness
Possible causes:
- The scheduler configuration was not updated correctly
- The plugin is not enabled
- The workload does not set resource requests and limits and is therefore treated as BestEffort
How to investigate:
# Check the scheduler configuration
kubectl get cm volcano-scheduler-configmap -n volcano-system -o yaml
# Restart the scheduler
kubectl rollout restart deployment volcano-scheduler -n volcano-system
# Verify the plugin is loaded
kubectl logs -n volcano-system -l app=volcano-scheduler | grep "network-topology-aware"
About BestEffort workloads:
Network topology aware scheduling relies on a workload's resource requests to make scheduling decisions. If a Pod does not set resources.requests and resources.limits, it is assigned the BestEffort QoS class. For BestEffort workloads, the network topology plugin does not take effect even when it is enabled.
Why:
- BestEffort workloads have no explicit resource demand, so the scheduler cannot accurately evaluate their impact on the network topology
- Network topology scheduling computes node scores and topology fit based on resource requests
- To preserve scheduling quality, the plugin skips BestEffort workloads
How to check:
# Check the Pod's QoS class
kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'
# It should return Guaranteed or Burstable, not BestEffort
Fix:
Always set resource requests and limits in the workload definition:
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "100m"
memory: "128Mi"
HyperNode NODECOUNT does not match expectations
Symptom:
When listing HyperNodes, the NODECOUNT of s4, s5, and s6 all show 2:
NAME TIER NODECOUNT AGE
s0 1 2 10s
s1 1 2 10s
s2 1 2 10s
s3 1 2 10s
s4 2 2 10s # contains s0 and s1 - why not 4?
s5 2 2 10s # contains s2 and s3 - why not 4?
s6 3 2 10s # contains s4 and s5 - why not 8?
Reason:
This is expected behavior. The NODECOUNT field counts a HyperNode's direct child members, not the recursive total of leaf nodes.
Details:
- s0-s3 (tier1): directly contain 2 members of type Node, so NODECOUNT = 2
- s4-s5 (tier2): directly contain 2 members of type HyperNode (e.g. s4 contains s0 and s1), so NODECOUNT = 2 (HyperNode members are counted, not leaf nodes)
- s6 (tier3): directly contains 2 members of type HyperNode (s4 and s5), so NODECOUNT = 2 (again HyperNode members, not leaf nodes)
How to verify:
Inspecting the HyperNode configuration confirms the member types:
# Inspect s4's members
kubectl get hypernode s4 -o yaml
# The output shows two members of type HyperNode
spec:
tier: 2
members:
- type: HyperNode
selector:
exactMatch:
name: "s0"
- type: HyperNode
selector:
exactMatch:
name: "s1"
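If you want the total number of physical nodes under a tier2 HyperNode rather than its direct member count, a rough sketch is to resolve the exactMatch member names and count the nodes behind them. This relies on the convention used in this walkthrough that each leaf HyperNode's name equals the nodes' switch label:
# Count the physical nodes reachable through s4's members (expected: 4)
for leaf in $(kubectl get hypernode s4 -o jsonpath='{.spec.members[*].selector.exactMatch.name}'); do
  kubectl get nodes -l switch="$leaf" --no-headers
done | wc -l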
In short: NODECOUNT reflects the number of direct children in the tree structure, not the total number of leaf nodes. This design keeps the HyperNode hierarchy easy to understand.