
TAS: Support Multi-Layer Topology Constraints #9046

@Huang-Wei

Description

What would you like to be added

Support training workloads with multi-layer topology constraints (e.g., GB200/GB300 clusters).

kubernetes-sigs/kueue#6554 (comment) cleanly illustrates the need.
Real-world data center topologies often have multiple physical levels (e.g., data center → zone → block → row → rack); to get the best performance, workloads increasingly need to express constraints at more than two of those levels simultaneously.

Current situation and gap

TAS balanced placement (#6851, v0.15) is a best-effort primitive that balances pod placement across a specified topology level. However, in real-world GB200 clusters we need to guarantee that a training Job's Pods are co-located within block, rack, NVSwitch-domain, and other topology levels according to specific, layered criteria.

Even if we changed the semantics of TAS balanced placement from best-effort to guaranteed, it would still lack the flexibility to express multi-layer topology constraints. For example, a 64-replica Job may require:

  • all 64 Pods land on the same block
  • groups of 8 Pods land within the same row
  • groups of 4 Pods land within the same rack
  • (or even finer granularity)

Closest existing primitive

The closest mechanism is KEP-2724: Two-level Topology Aware Scheduling (#5449 / #5596), which introduced the podset-slice-required-topology / podset-slice-size annotations. Although designed primarily for JobSet, the mechanism is generic enough to support arbitrary PodSet-based workloads.

However, it is limited to exactly 2 constraint tiers: one for the entire PodSet and one for slices — close to what we need, but not sufficient.

Proposed change

Extend TAS to support multi-layer topology constraints: up to N constraint levels in total (1 podset level + N−1 slice layers). Users specify additional slice layers that recursively subdivide slices into smaller groups, each constrained to a finer topology level.

Example: 4-layer constraints on 64 pods:

# Existing "kueue.x-k8s.io/podset-slice-*" annotations are kept as-is for backward compatibility
annotations:
  # Layer 0 (podset): all 64 pods in the same block
  kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-block"
  # Layer 1 (existing slice): groups of 16 in the same rack
  kueue.x-k8s.io/podset-slice-required-topology: "cloud.provider.com/topology-rack"
  kueue.x-k8s.io/podset-slice-size: "16"
  # Layer 2 (new): groups of 4 on the same NVSwitch domain
  kueue.x-k8s.io/podset-slice-required-topology-1: "cloud.provider.com/topology-nvdomain"
  kueue.x-k8s.io/podset-slice-size-1: "4"
  # Layer 3 (new): groups of 2 on the same host (e.g., NVL36)
  kueue.x-k8s.io/podset-slice-required-topology-2: "kubernetes.io/hostname"
  kueue.x-k8s.io/podset-slice-size-2: "2"

Result: 64 pods → 4 racks of 16 → 4 NVSwitch domains of 4 per rack → 2 hosts of 2 per domain.

The concrete value of N is open for discussion. We may start with a conservative cap of 2 additional layers (3 slice layers total):

diff --git a/apis/kueue/v1beta2/workload_types.go b/apis/kueue/v1beta2/workload_types.go
index ce630751e..79a0d56b8 100644
--- a/apis/kueue/v1beta2/workload_types.go
+++ b/apis/kueue/v1beta2/workload_types.go
@@ -223,6 +223,33 @@ type PodSetTopologyRequest struct {
 	//
 	// +optional
 	PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"`
+
+	// additionalSliceLayers defines additional layers of recursive slice
+	// subdivision beyond the first slice layer (podSetSliceRequiredTopology /
+	// podSetSliceSize). Each layer further subdivides the parent layer's
+	// groups into smaller groups constrained to a finer topology domain.
+	// At most 2 additional layers are supported (for a total of 3 slice layers).
+	//
+	// +optional
+	// +listType=atomic
+	// +kubebuilder:validation:MaxItems=2
+	AdditionalSliceLayers []SliceLayer `json:"additionalSliceLayers,omitempty"`
+}
+
+// SliceLayer defines a single additional slice subdivision layer.
+type SliceLayer struct {
+	// topology indicates the topology level required for this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:MinLength=1
+	// +kubebuilder:validation:MaxLength=63
+	Topology string `json:"topology"`
+
+	// size indicates the number of pods in each group at this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:Minimum=1
+	Size int32 `json:"size"`
 }

 type Admission struct {
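
For illustration, here is a minimal Go sketch of how the 4-layer annotation example above could map onto the proposed API. It assumes the existing Required / PodSetSliceRequiredTopology / PodSetSliceSize fields of PodSetTopologyRequest and the ptr helper from k8s.io/utils; the exact shape is open to discussion, and presumably each layer's size would have to evenly divide its parent's (64 → 16 → 4 → 2).

package tassketch

import (
	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta2"
)

// exampleTopologyRequest mirrors the 4-layer annotation example above using
// the proposed AdditionalSliceLayers field (sketch only; not final API shape).
func exampleTopologyRequest() kueue.PodSetTopologyRequest {
	return kueue.PodSetTopologyRequest{
		// Layer 0 (podset): all 64 pods in the same block.
		Required: ptr.To("cloud.provider.com/topology-block"),
		// Layer 1 (existing slice): groups of 16 in the same rack.
		PodSetSliceRequiredTopology: ptr.To("cloud.provider.com/topology-rack"),
		PodSetSliceSize:             ptr.To[int32](16),
		// Layers 2-3 (new, per this proposal): groups of 4 per NVSwitch
		// domain, then groups of 2 per host.
		AdditionalSliceLayers: []kueue.SliceLayer{
			{Topology: "cloud.provider.com/topology-nvdomain", Size: 4},
			{Topology: "kubernetes.io/hostname", Size: 2},
		},
	}
}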

Why is this needed

1. Real data center topologies are deeper than 2 levels

Modern GPU clusters have hierarchies like data center → block → rack → NVSwitch domain → host. AI/ML training workloads need different communication granularities at each level: all-reduce within a host, ring-reduce within a rack, hierarchical all-reduce across racks within a block. Being limited to 2 constraint tiers means users can only optimize for 2 of these boundaries, leaving performance on the table.

2. Natural extension of existing design

The slice mechanism already supports recursive subdivision conceptually: a slice is just a fixed-size group constrained to a topology domain. Multi-layer slicing is "slices within slices," following the same pattern. The existing TAS algorithm's 2-phase traversal (greedy above the slice level, parent-constrained at/below it) generalizes naturally to N phases, as the sketch below illustrates.
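
To make that generalization concrete, here is a minimal, hypothetical sketch (not Kueue's actual TAS code): the sliceLayer type and the findDomain capacity callback are illustrative stand-ins, and a real implementation would also need capacity bookkeeping and backtracking across candidate domains.

package tassketch

// sliceLayer is an illustrative stand-in for one constraint layer
// (topology level + group size); it is not Kueue's internal representation.
type sliceLayer struct {
	level string // topology level, e.g. "cloud.provider.com/topology-rack"
	size  int32  // pods per group at this layer
}

// assignRecursive sketches how the 2-phase idea could generalize to N layers:
// pick a domain at the current layer with enough free capacity inside the
// parent domain, then recurse into the next, finer layer for each group.
// It assumes podCount is an exact multiple of each layer's size.
func assignRecursive(layers []sliceLayer, parentDomain string, podCount int32,
	findDomain func(level, parent string, need int32) (string, bool)) bool {

	if len(layers) == 0 {
		return true // finest layer reached; pods can be placed here
	}
	layer := layers[0]
	groups := podCount / layer.size
	for g := int32(0); g < groups; g++ {
		// Choose a domain at this layer, constrained to the parent's domain.
		domain, ok := findDomain(layer.level, parentDomain, layer.size)
		if !ok {
			return false // no domain with enough capacity; admission fails
		}
		// Recurse: subdivide this group using the remaining, finer layers.
		if !assignRecursive(layers[1:], domain, layer.size, findDomain) {
			return false
		}
	}
	return true
}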

3. Avoids workarounds

Without this, users must work around the limitation by:

  • Manually splitting workloads into smaller Jobs and coordinating placement externally
  • Using multiple JobSets with separate topology constraints, losing the "all in one block" guarantee
  • Relying on the Kubernetes scheduler's topology spread constraints, which lack TAS's capacity-aware admission

Completion requirements

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.
