Description
What would you like to be added
Support training workloads with multi-layer topology constraints (e.g., GB200/GB300 clusters).
kubernetes-sigs/kueue#6554 (comment) cleanly illustrates the need.
Real-world data center topologies often have multiple physical levels (e.g., data center → zone → block → row → rack); to get the best performance, workloads increasingly need to express constraints at more than two of those levels simultaneously.
Current situation and gap
TAS balanced placement (#6851, v0.15) is a best-effort primitive that balances pod placement across the domains of a specified topology level. However, in real-world GB200 clusters we need to guarantee that a training Job's Pods are co-located at the block, rack, NVSwitch domain, and other topology levels according to specific, layered criteria.
Even if we changed the semantics of TAS balanced placement from best-effort to guaranteed, it would still lack the flexibility to express multi-layer topology constraints. For example, a 64-replica Job may require:
- all 64 Pods land on the same block
- groups of 8 Pods land within the same row
- groups of 4 Pods land within the same rack
- (or even finer granularity)
Closest existing primitive
The closest mechanism is KEP-2724: Two-level Topology Aware Scheduling (#5449 / #5596), which introduced the podset-slice-required-topology / podset-slice-size annotations. Although designed primarily for JobSet, the mechanism is generic enough to support arbitrary PodSet-based workloads.
However, it is limited to exactly 2 constraint tiers: one for the entire PodSet and one for slices — close to what we need, but not sufficient.
Proposed change
Extend TAS to support multi-layer topology constraints: up to N constraint levels in total (one podset level plus N−1 slice layers). Users specify additional slice layers that recursively subdivide slices into smaller groups, each constrained to a finer topology level.
Example: 4-layer constraints on 64 pods:
```yaml
# Existing "kueue.x-k8s.io/podset-slice-*" annotations are kept as-is for backward compatibility
annotations:
  # Layer 0 (podset): all 64 pods in the same block
  kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-block"
  # Layer 1 (existing slice): groups of 16 in the same rack
  kueue.x-k8s.io/podset-slice-required-topology: "cloud.provider.com/topology-rack"
  kueue.x-k8s.io/podset-slice-size: "16"
  # Layer 2 (new): groups of 4 on the same NVSwitch domain
  kueue.x-k8s.io/podset-slice-required-topology-1: "cloud.provider.com/topology-nvdomain"
  kueue.x-k8s.io/podset-slice-size-1: "4"
  # Layer 3 (new): groups of 2 on the same host (e.g., NVL36)
  kueue.x-k8s.io/podset-slice-required-topology-2: "kubernetes.io/hostname"
  kueue.x-k8s.io/podset-slice-size-2: "2"
```
Result: 64 pods → 4 racks of 16 → 4 NVSwitch domains of 4 per rack → 2 hosts of 2 per domain.
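The example also relies on a nesting rule: each layer's size evenly divides its parent's (16 into 64, 4 into 16, 2 into 4). Treating that divisibility as a validation requirement is an assumption of this sketch; a hypothetical check (not existing Kueue code) could look like:

```go
package main

import "fmt"

// validateLayerSizes checks that slice-layer sizes nest cleanly: each layer's
// size must evenly divide the size of the layer above it. Hypothetical helper
// for illustration only, not existing Kueue validation code.
func validateLayerSizes(podSetCount int32, sizes []int32) error {
	parent := podSetCount
	for i, s := range sizes {
		if s <= 0 || parent%s != 0 {
			return fmt.Errorf("layer %d: size %d does not evenly divide parent size %d", i+1, s, parent)
		}
		parent = s
	}
	return nil
}

func main() {
	// Mirrors the example above: a 64-pod podset with slice sizes 16, 4, 2.
	fmt.Println(validateLayerSizes(64, []int32{16, 4, 2})) // prints <nil>
}
```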
The concrete value of N is open for discussion. We may start with a conservative cap of 2 additional layers (3 slice layers in total):
```diff
diff --git a/apis/kueue/v1beta2/workload_types.go b/apis/kueue/v1beta2/workload_types.go
index ce630751e..79a0d56b8 100644
--- a/apis/kueue/v1beta2/workload_types.go
+++ b/apis/kueue/v1beta2/workload_types.go
@@ -223,6 +223,33 @@ type PodSetTopologyRequest struct {
 	//
 	// +optional
 	PodSetSliceSize *int32 `json:"podSetSliceSize,omitempty"`
+
+	// additionalSliceLayers defines additional layers of recursive slice
+	// subdivision beyond the first slice layer (podSetSliceRequiredTopology /
+	// podSetSliceSize). Each layer further subdivides the parent layer's
+	// groups into smaller groups constrained to a finer topology domain.
+	// At most 2 additional layers are supported (for a total of 3 slice layers).
+	//
+	// +optional
+	// +listType=atomic
+	// +kubebuilder:validation:MaxItems=2
+	AdditionalSliceLayers []SliceLayer `json:"additionalSliceLayers,omitempty"`
+}
+
+// SliceLayer defines a single additional slice subdivision layer.
+type SliceLayer struct {
+	// topology indicates the topology level required for this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:MinLength=1
+	// +kubebuilder:validation:MaxLength=63
+	Topology string `json:"topology"`
+
+	// size indicates the number of pods in each group at this slice layer.
+	//
+	// +required
+	// +kubebuilder:validation:Minimum=1
+	Size int32 `json:"size"`
 }
 
 type Admission struct {
```
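For illustration, the annotation example above would map onto the extended API roughly as follows. This is a sketch against the proposed types, not working code: the Required and PodSetSliceRequiredTopology field names are assumed from the existing PodSetTopologyRequest, and k8s.io/utils/ptr is used for pointer construction.

```go
package main

import (
	"fmt"

	"k8s.io/utils/ptr"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta2"
)

func main() {
	// Sketch only: the 64-pod example expressed with the proposed API.
	// AdditionalSliceLayers and SliceLayer are the new additions from the
	// diff above; the other field names are assumed from the existing type.
	tr := &kueue.PodSetTopologyRequest{
		// Layer 0 (podset): all 64 pods in the same block.
		Required: ptr.To("cloud.provider.com/topology-block"),
		// Layer 1 (existing slice): groups of 16 in the same rack.
		PodSetSliceRequiredTopology: ptr.To("cloud.provider.com/topology-rack"),
		PodSetSliceSize:             ptr.To[int32](16),
		// Layers 2 and 3 (new): recursive subdivision of each rack slice.
		AdditionalSliceLayers: []kueue.SliceLayer{
			{Topology: "cloud.provider.com/topology-nvdomain", Size: 4},
			{Topology: "kubernetes.io/hostname", Size: 2},
		},
	}
	fmt.Printf("%+v\n", tr)
}
```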
Why is this needed
1. Real data center topologies are deeper than 2 levels
Modern GPU clusters have hierarchies like data center → block → rack → NVSwitch domain → host. AI/ML training workloads need different communication granularities at each level: all-reduce within a host, ring-reduce within a rack, hierarchical all-reduce across racks within a block. Being limited to 2 constraint tiers means users can only optimize for 2 of these boundaries, leaving performance on the table.
2. Natural extension of existing design
The slice mechanism already supports recursive subdivision conceptually: a slice is just a fixed-size group constrained to a topology domain. Multi-layer slicing is "slices within slices," following the same pattern. The existing TAS algorithm's 2-phase traversal (greedy above the slice level, parent-constrained at and below it) generalizes naturally to N phases; see the sketch at the end of this section.
3. Avoids workarounds
Without this, users must work around the limitation by:
- Manually splitting workloads into smaller Jobs and coordinating placement externally
- Using multiple JobSets with separate topology constraints, losing the "all in one block" guarantee
- Relying on the Kubernetes scheduler's topology spread constraints, which lack TAS's capacity-aware admission
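As referenced in point 2 above, recursive subdivision generalizes mechanically. The sketch below is illustrative Go, not the actual TAS scheduler: it only shows how a pod count splits into nested groups per layer, while the real implementation would also have to verify capacity in each candidate topology domain before admitting the workload.

```go
package main

import "fmt"

// layer describes one constraint layer: a topology level and the group size
// at that level (illustrative types, not Kueue API types).
type layer struct {
	topology string
	size     int
}

// subdivide prints how podCount pods split into nested groups, one level per
// layer, assuming each layer's size evenly divides the parent group size.
func subdivide(podCount int, layers []layer, indent string) {
	if len(layers) == 0 {
		return
	}
	l := layers[0]
	fmt.Printf("%s%d group(s) of %d pods, each within one %s\n",
		indent, podCount/l.size, l.size, l.topology)
	// All sibling groups are shaped identically, so recursing into one group
	// describes the whole layer below it.
	subdivide(l.size, layers[1:], indent+"  ")
}

func main() {
	// Mirrors the 64-pod example: block → rack → NVSwitch domain → host.
	subdivide(64, []layer{
		{"cloud.provider.com/topology-block", 64},
		{"cloud.provider.com/topology-rack", 16},
		{"cloud.provider.com/topology-nvdomain", 4},
		{"kubernetes.io/hostname", 2},
	}, "")
}
```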
Completion requirements
This enhancement requires the following artifacts:
- Design doc
- API change
- Docs update
The artifacts should be linked in subsequent comments.