Design Considerations for Node Affinity and Taint Tolerations in Hybrid Scheduling

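This chapter covers only the hybrid (co-located) scheduling scenario. Resource preemption under hybrid deployment will be covered in a later chapter.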

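1. Background

Kubernetes serves as the underlying resource scheduling platform.

When a node joins the cluster, it must be given the appropriate labels and taints so that resources can be managed and scheduling can be controlled.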

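1.1 Labels

The key labels are designed as follows:

Node usage

Use {company domain prefix}/node.usage (for example, maip.mxsf.io/node.usage) as the node usage label:
- Inference: inference
- Training: training
- Hybrid: hybrid

Accelerator card information

Follow the format of the labels that the GPU feature discovery component applies automatically:
- Product name: {vendor domain prefix}/{card type}.product (for example, nvidia.com/gpu.product)
- Count: {vendor domain prefix}/{card type}.count (for example, nvidia.com/gpu.count)
- Memory: {vendor domain prefix}/{card type}.memory (for example, nvidia.com/gpu.memory)
- Driver: {vendor domain prefix}/{card type}.driver-version (for example, nvidia.com/gpu.driver-version)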
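
As an illustration, the labels above could appear on a training node roughly as follows. This is a sketch only: the node name gpu-node-01 and the label values for card count, memory size, and driver version are made-up placeholders, not values taken from a real cluster.

```yaml
# Hypothetical training node carrying the labels described above.
# Label values must be strings, so numeric values are quoted.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01                              # placeholder node name
  labels:
    maip.mxsf.io/node.usage: training            # one of: inference, training, hybrid
    nvidia.com/gpu.product: NVIDIA-H20           # card model
    nvidia.com/gpu.count: "8"                    # placeholder card count
    nvidia.com/gpu.memory: "98304"               # placeholder memory size
    nvidia.com/gpu.driver-version: "550.54.15"   # placeholder driver version
```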

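1.2 Taints

The key taints are designed as follows:

Control-plane nodes

Use {company domain prefix}/control-plane (for example, maip.mxsf.io/control-plane) as the taint on control-plane nodes. It keeps business containers from being scheduled onto control-plane nodes.

Card model taint

Use {vendor domain prefix}/{card type}.product (for example, nvidia.com/gpu.product) as a taint on business nodes. An inference or training task must specify the card model when it is created. This keeps CPU-only tasks off GPU nodes and keeps small-model tasks off nodes with large cards.

Node usage taint

Use {company domain prefix}/node.usage (for example, maip.mxsf.io/node.usage) as the node usage taint. An inference or training task must specify its task type when it is created, which prevents, for example, an inference task from being scheduled onto a node reserved for training.

Allowed values:
- Inference: inference
- Training: training
- Hybrid: hybrid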
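
The taints above could be declared in a node's spec.taints along these lines. This is a sketch under assumptions: the node names are placeholders, the NoSchedule effect is assumed because it matches the tolerations used in the examples in section 2, and showing the control-plane taint without a value is also an assumption.

```yaml
# Hypothetical control-plane node: business pods without a matching
# toleration are not scheduled here.
apiVersion: v1
kind: Node
metadata:
  name: control-plane-01              # placeholder node name
spec:
  taints:
  - key: maip.mxsf.io/control-plane
    effect: NoSchedule                # assumed effect
---
# Hypothetical hybrid business node with usage and card model taints.
# The usage label and the usage taint share the same key and value.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-02                   # placeholder node name
  labels:
    maip.mxsf.io/node.usage: hybrid
    nvidia.com/gpu.product: NVIDIA-H20
spec:
  taints:
  - key: maip.mxsf.io/node.usage
    value: hybrid
    effect: NoSchedule
  - key: nvidia.com/gpu.product
    value: NVIDIA-H20
    effect: NoSchedule
```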

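1.3 Hybrid scenario

Under normal conditions, training tasks are scheduled onto training nodes and inference tasks onto inference nodes. When resources are tight, however, either kind of task may be scheduled onto hybrid nodes.

The node resource pools must therefore be divided into training, inference, and hybrid pools. When marking a node's usage, use {company domain prefix}/node.usage as both the usage label and the usage taint, with the values inference, training, and hybrid.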

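2. Approach

Use required node affinity to limit a task to the node pools it may use, preferred (soft) node affinity with a weight to express which pool it should favor, and enough tolerations to let the task past the taints on those nodes.

A few examples illustrate this.

To keep the scheduling examples simple, the deployment YAML below uses the nginx image.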

2.1 Training task

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: training-job
spec:
  schedulerName: volcano
  minAvailable: 1
  policies:
    - action: CompleteJob
      event: TaskCompleted  
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              # May run on training or hybrid nodes
              - matchExpressions:
                - key: maip.mxsf.io/node.usage
                  operator: In
                  values:
                  - training
                  - hybrid
            preferredDuringSchedulingIgnoredDuringExecution:
              # Prefer training nodes
              - weight: 100
                preference:
                  matchExpressions:
                  - key: maip.mxsf.io/node.usage
                    operator: In
                    values:
                    - training
        tolerations:
          # Tolerate the training-node taint
          - key: maip.mxsf.io/node.usage
            operator: Equal
            value: training
            effect: NoSchedule
          # Tolerate the hybrid-node taint
          - key: maip.mxsf.io/node.usage
            operator: Equal
            value: hybrid
            effect: NoSchedule
          # Tolerate the taint of the specified GPU model
          - key: nvidia.com/gpu.product
            operator: Equal
            value: NVIDIA-H20
            effect: NoSchedule
        containers:
        - name: main
          image: nginx:latest
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 1
              memory: 1Gi
              # nvidia.com/gpu: 1
            limits:
              cpu: 1
              memory: 1Gi
              # nvidia.com/gpu: 1
        restartPolicy: Never

2.2 Inference task

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
  labels:
    app: inference-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            # May run on inference or hybrid nodes
            - matchExpressions:
              - key: maip.mxsf.io/node.usage
                operator: In
                values:
                - inference
                - hybrid
          preferredDuringSchedulingIgnoredDuringExecution:
            # Prefer inference nodes
            - weight: 100
              preference:
                matchExpressions:
                - key: maip.mxsf.io/node.usage
                  operator: In
                  values:
                  - inference
      tolerations:
        # Tolerate the inference-node taint
        - key: maip.mxsf.io/node.usage
          operator: Equal
          value: inference
          effect: NoSchedule
        # Tolerate the hybrid-node taint
        - key: maip.mxsf.io/node.usage
          operator: Equal
          value: hybrid
          effect: NoSchedule
        # Tolerate the taint of the specified GPU model
        - key: nvidia.com/gpu.product
          operator: Equal
          value: NVIDIA-H20
          effect: NoSchedule
      containers:
      - name: main
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            cpu: 1
            memory: 1Gi
            # nvidia.com/gpu: 1
          limits:
            cpu: 1
            memory: 1Gi
            # nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: inference-service
  labels:
    app: inference-service
spec:
  selector:
    app: inference-service
  ports:
  - port: 80
    targetPort: 80
    name: http
  type: ClusterIP
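
Note that a toleration only permits a pod to land on a tainted node; it does not steer the pod toward that node. If a task should run only on one particular card model, one possible approach (an assumption, not part of the design described above) is to add the card model label to the required node affinity. Expressions inside a single matchExpressions list are ANDed, so the fragment below limits the pod to inference or hybrid nodes that also carry the NVIDIA-H20 product label.

```yaml
# Pod spec fragment: a possible replacement for the required node
# affinity block in the inference example.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Usage: inference or hybrid nodes only
        - key: maip.mxsf.io/node.usage
          operator: In
          values:
          - inference
          - hybrid
        # Card model: only nodes labeled as NVIDIA-H20
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-H20
```

The preferred affinity and the tolerations stay as they are; the tolerations are still needed to get past the node taints.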
