Description
What would you like to be added
Support training workloads with multi-layer topology constraints (e.g., GB200/GB300 clusters).
kubernetes-sigs/kueue#6554 (comment) cleanly illustrates the need.
Real-world data center topologies often have multiple physical levels (e.g., data center → zone → block → row → rack); to get the best performance, workloads increasingly need to express constraints at more than two of those levels simultaneously.
Current situation and gap
TAS balanced placement (#6851, v0.15) is a best-effort primitive that balances pod placement across the domains of a specified topology level. However, in real-world GB200 clusters we need to guarantee that a training Job's Pods are co-located at the block, rack, NVSwitch domain, and other topology levels according to specific, layered criteria.
Even if we changed the semantics of TAS balanced placement from best-effort to guaranteed, it would still lack the flexibility to express multi-layer topology constraints. For example, a 64-replica Job may require:
- all 64 Pods land on the same block
- groups of 8 Pods land within the same row
- groups of 4 Pods land within the same rack
- (or even finer granularity)
Closest existing primitive
The closest mechanism is KEP-2724: Two-level Topology Aware Scheduling (#5449 / #5596), which introduced the podset-slice-required-topology / podset-slice-size annotations. Although designed primarily for JobSet, the mechanism is generic enough to support arbitrary PodSet-based workloads.
However, it is limited to exactly 2 constraint tiers: one for the entire PodSet and one for slices — close to what we need, but not sufficient.
Proposed change
Extend TAS to support multi-layer topology constraints: up to N constraint levels in total (one podset level plus N−1 slice layers). Users specify additional slice layers that recursively subdivide slices into smaller groups, each constrained to a finer topology level.
Example: 4-layer constraints on 64 pods:
```yaml
# Existing "kueue.x-k8s.io/podset-slice-*" annotations are kept as-is for backward compatibility
annotations:
  # Layer 0 (podset): all 64 pods in the same block
  kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-block"
  # Layer 1 (existing slice): groups of 16 in the same rack
  kueue.x-k8s.io/podset-slice-required-topology: "cloud.provider.com/topology-rack"
  kueue.x-k8s.io/podset-slice-size: "16"
  # Layer 2 (new): groups of 4 on the same NVSwitch domain
  kueue.x-k8s.io/podset-slice-required-topology-1: "cloud.provider.com/topology-nvdomain"
  kueue.x-k8s.io/podset-slice-size-1: "4"
  # Layer 3 (new): groups of 2 on the same host (e.g., NVL36)
  kueue.x-k8s.io/podset-slice-required-topology-2: "kubernetes.io/hostname"
  kueue.x-k8s.io/podset-slice-size-2: "2"
```
Result: 64 pods → 4 racks of 16 → 4 NVSwitch domains of 4 per rack → 2 hosts of 2 per domain.
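The example also relies on a nesting rule: each layer's size evenly divides its parent's (16 into 64, 4 into 16, 2 into 4). Treating that divisibility as a validation requirement is an assumption of this sketch; a hypothetical check (not existing Kueue code) could look like:

```go
package main

import "fmt"

// validateLayerSizes checks that slice-layer sizes nest cleanly: each layer's
// size must evenly divide the size of the layer above it. Hypothetical helper
// for illustration only, not existing Kueue validation code.
func validateLayerSizes(podSetCount int32, sizes []int32) error {
	parent := podSetCount
	for i, s := range sizes {
		if s <= 0 || parent%s != 0 {
			return fmt.Errorf("layer %d: size %d does not evenly divide parent size %d", i+1, s, parent)
		}
		parent = s
	}
	return nil
}

func main() {
	// Mirrors the example above: a 64-pod podset with slice sizes 16, 4, 2.
	fmt.Println(validateLayerSizes(64, []int32{16, 4, 2})) // prints <nil>
}
```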
The concrete value of N is open for discussion. We may start with a conservative cap of 2 additional layers (3 slice layers in total):
```diff
diff --git a/apis/kueue/v1beta2/workload_types.go b/apis/kueue/v1beta2/workload_types.go
index ce630751e..79a0d56b8 100644
--- a/apis/kueue/v1beta2/workload_types.go
+++ b/apis/kueue/v1beta2/workload_types.go
@@ -223,6 +223,33 @@ type PodSetTopologyRequest struct {
 	//
 	// +optional
 	PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"`
+
+	// additionalSliceLayers defines additional layers of recursive slice
+	// subdivision beyond the first slice layer (podSetSliceRequiredTopology /
+	// podSetSliceSize). Each layer further subdivides the parent layer's
+	// groups into smaller groups constrained to a finer topology domain.
+	// At most 2 additional layers are supported (for a total of 3 slice layers).
+	//
+	// +optional
+	// +listType=atomic
+	// +kubebuilder:validation:MaxItems=2
+	AdditionalSliceLayers []SliceLayer `json:"additionalSliceLayers,omitempty"`
+}
+
+// SliceLayer defines a single additional slice subdivision layer.
+type SliceLayer struct {
+	// topology indicates the topology level required for this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:MinLength=1
+	// +kubebuilder:validation:MaxLength=63
+	Topology string `json:"topology"`
+
+	// size indicates the number of pods in each group at this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:Minimum=1
+	Size int32 `json:"size"`
 }
 
 type Admission struct {
```
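For illustration, the annotation example above would map onto the extended API roughly as follows. This is a sketch against the proposed types, not working code: the Required and PodSetSliceRequiredTopology field names are assumed from the existing PodSetTopologyRequest, and k8s.io/utils/ptr is used for pointer construction.

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta2"
)

func main() {
	// Sketch only: the 64-pod example expressed with the proposed API.
	// AdditionalSliceLayers and SliceLayer are the new additions from the
	// diff above; the other field names are assumed from the existing type.
	tr := &kueue.PodSetTopologyRequest{
		// Layer 0 (podset): all 64 pods in the same block.
		Required: ptr.To("cloud.provider.com/topology-block"),
		// Layer 1 (existing slice): groups of 16 in the same rack.
		PodSetSliceRequiredTopology: ptr.To("cloud.provider.com/topology-rack"),
		PodSetSliceSize:             ptr.To[int32](16),
		// Layers 2 and 3 (new): recursive subdivision of each rack slice.
		AdditionalSliceLayers: []kueue.SliceLayer{
			{Topology: "cloud.provider.com/topology-nvdomain", Size: 4},
			{Topology: "kubernetes.io/hostname", Size: 2},
		},
	}
	fmt.Printf("%+v\n", tr)
}
```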
Why is this needed
1. Real data center topologies are deeper than 2 levels
Modern GPU clusters have hierarchies like data center → block → rack → NVSwitch domain → host. AI/ML training workloads need different communication granularities at each level: all-reduce within a host, ring-reduce within a rack, hierarchical all-reduce across racks within a block. Being limited to 2 constraint tiers means users can only optimize for 2 of these boundaries, leaving performance on the table.
2. Natural extension of existing design
The slice mechanism already supports recursive subdivision conceptually: a slice is just a fixed-size group constrained to a topology domain. Multi-layer slicing is "slices within slices," following the same pattern. The existing TAS algorithm's 2-phase traversal (greedy above the slice level, parent-constrained at and below it) generalizes naturally to N phases; see the sketch at the end of this section.
3. Avoids workarounds
Without this, users must work around the limitation by:
- Manually splitting workloads into smaller Jobs and coordinating placement externally
- Using multiple JobSets with separate topology constraints, losing the "all in one block" guarantee
- Relying on the Kubernetes scheduler's topology spread constraints, which lack TAS's capacity-aware admission
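As referenced in point 2 above, recursive subdivision generalizes mechanically. The sketch below is illustrative Go, not the actual TAS scheduler: it only shows how a pod count splits into nested groups per layer, while the real implementation would also have to verify capacity in each candidate topology domain before admitting the workload.

```go
package main

import "fmt"

// layer describes one constraint layer: a topology level and the group size
// at that level (illustrative types, not Kueue API types).
type layer struct {
	topology string
	size     int
}

// subdivide prints how podCount pods split into nested groups, one level per
// layer, assuming each layer's size evenly divides the parent group size.
func subdivide(podCount int, layers []layer, indent string) {
	if len(layers) == 0 {
		return
	}
	l := layers[0]
	fmt.Printf("%s%d group(s) of %d pods, each within one %s\n",
		indent, podCount/l.size, l.size, l.topology)
	// All sibling groups are shaped identically, so recursing into one group
	// describes the whole layer below it.
	subdivide(l.size, layers[1:], indent+"  ")
}

func main() {
	// Mirrors the 64-pod example: block → rack → NVSwitch domain → host.
	subdivide(64, []layer{
		{"cloud.provider.com/topology-block", 64},
		{"cloud.provider.com/topology-rack", 16},
		{"cloud.provider.com/topology-nvdomain", 4},
		{"kubernetes.io/hostname", 2},
	}, "")
}
```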
Completion requirements
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.