Conversation
Pull request overview
Adds a new AKS website blog post explaining how to use DRANET + Dynamic Resource Allocation (DRA) for NUMA-aware GPU/NIC alignment (RDMA performance) and includes accompanying control-plane/data-plane diagrams (Mermaid sources + exported SVGs).
Changes:
- New blog post: RDMA/NUMA scheduling problem statement, DRANET architecture, DRA ResourceClaimTemplate examples, and NCCL benchmark walkthrough/results.
- Added control-plane and data-plane diagrams as both `.mmd` sources and rendered `.svg` assets.
Reviewed changes
Copilot reviewed 3 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md | New blog post content + configuration examples + benchmark walkthrough |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/control-plane-diagram.mmd | Mermaid source for control-plane diagram |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/control-plane-diagram.svg | Rendered control-plane diagram used by the post |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/data-plane-diagram.mmd | Mermaid source for data-plane diagram |
| website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/data-plane-diagram.svg | Rendered data-plane diagram used by the post |
website/blog/2026-04-01-dranet-rdma-optimization-for-ai-on-aks/index.md
> Large-scale AI training and inferencing on Kubernetes depends on high-throughput, low-latency GPU-to-GPU communication. [DRANET](https://github.com/kubernetes-sigs/dranet) is an open-source DRA network driver that discovers RDMA capable devices, exposes their topology as Kubernetes DRA attributes, and injects only desired devices into each container. Combined with the [NVIDIA GPU DRA driver](https://github.com/kubernetes-purgatory/nvidia-dra-driver-gpu), it enables topology-aware co-scheduling of GPUs and NICs to deliver high-performance networking for demanding applications in Kubernetes.
>
> In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.

**Comment:** Grammar: "In previous post" reads like a missing article. Consider changing to "In a previous post, we covered..." (or similar) for correct English.

Suggested change:

> In a previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.
> In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.

**Comment:** The Microsoft Learn URL uses a locale-specific path (`/en-us/`). Repo blog guidance recommends using locale-agnostic Learn links (no `/en-us/`) to avoid unnecessary redirects and keep links consistent.

Suggested change:

> In previous post, we covered [fundamental DRA concepts](/2025/11/17/dra-devices-and-drivers-on-kubernetes). In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [ND GB300-v6](https://learn.microsoft.com/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-uniform memory access) alignment scenarios, and show the benchmark results.
> | | Resource | Count | Detail |
> | |---|---|---|
> | | GPU | 4x NVIDIA GB300 | 288 GB HBM3E each, NVLink-18 all-to-all |
> | | NIC | 4x Mellanox ConnectX | 800 Gb/s InfiniBand each |
> | | NUMA nodes | 2 | 2 GPUs + 2 NICs per NUMA node |

**Comment:** Several Markdown tables start with `||` (double leading pipe), which renders as an extra empty first column in CommonMark/Docusaurus tables. Removing the extra leading `|` (use `| ... | ... |`) will make the tables render as intended.
```yaml
      - name: nic
        exactly:
          deviceClassName: dranet.net
          count: 1
          selectors:
          - cel:
              expression: >-
                device.attributes["dra.net"]["rdmaDevice"] == "mlx5_2"
```

**Comment:** Same inconsistency here as in the earlier templates: `deviceClassName: dranet.net` should match the driver/DeviceClass identifier used elsewhere in the post (e.g., the `dra.net` driver/attribute namespace shown above).
Force-pushed from 9842ea5 to ff109d6.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```yaml
    dra.net/numaNode:
      int: 0
    dra.net/pciAddress:
      string: "0101:00:00.0"
    dra.net/rdma:
      bool: true
    dra.net/rdmaDevice:
      string: mlx5_0
    dra.net/pciVendor:
      string: Mellanox Technologies
```

**Comment:** The ResourceSlice example defines attribute keys like `dra.net/numaNode` / `dra.net/rdmaDevice` with typed values (`int`, `string`, `bool`), but the CEL selectors later access them as `device.attributes["dra.net"]["numaNode"]` / `["rdmaDevice"]` and compare directly to primitives. Please make the attribute schema and the selector syntax consistent (either update the ResourceSlice example to match the selector structure, or update selectors to reference the exact attribute keys/types shown in the ResourceSlice example).

Suggested change:

```yaml
    dra.net:
      numaNode:
        int: 0
      pciAddress:
        string: "0101:00:00.0"
      rdma:
        bool: true
      rdmaDevice:
        string: mlx5_0
      pciVendor:
        string: Mellanox Technologies
```
```yaml
              expression: >-
                device.attributes["dra.net"]["rdmaDevice"] == "mlx5_0"
```

**Comment:** The CEL selector examples appear to assume attributes are nested under `device.attributes["dra.net"]` and directly comparable (e.g., `== 0` / `== true`). If the published attributes follow the `dra.net/<key>: {int|bool|string: ...}` pattern shown earlier, these selectors won't match as written. Please update the selector examples to the correct attribute access pattern so readers can copy/paste them successfully.
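One possible resolution, for reference: the upstream Kubernetes DRA documentation shows CEL selectors reading a domain-qualified attribute such as `dra.net/rdmaDevice` through a per-domain map (`device.attributes["dra.net"]`) and comparing the native value directly, so the flat `dra.net/<key>` keys in the ResourceSlice and the nested selector access may both be valid as published. A sketch along those lines (the `mlx5_0` and NUMA values are the example identifiers used above, and this is not copied from the post):

```yaml
# Selector sketch consistent with upstream DRA CEL conventions:
# the qualified attribute "dra.net/rdmaDevice" is read via the
# "dra.net" domain map and compared as a native string/int value.
selectors:
- cel:
    expression: >-
      device.attributes["dra.net"].rdmaDevice == "mlx5_0" &&
      device.attributes["dra.net"].numaNode == 0
```

Whichever convention the post adopts, stating it once next to the ResourceSlice example would let readers copy/paste the selectors with confidence.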
> | | Resource | Count | Detail |
> | |---|---|---|
> | | GPU | 4x NVIDIA GB300 | 288 GB HBM3E each, NVLink-18 all-to-all |
> | | NIC | 4x Mellanox ConnectX | 800 Gb/s InfiniBand each |
> | | NUMA nodes | 2 | 2 GPUs + 2 NICs per NUMA node |

**Comment:** The hardware-topology table is written with a leading `||` on each row (for example `|| Resource | Count | Detail |`), which renders as an extra empty first column in Markdown. Convert these rows to standard table syntax (single leading `|`, or `| |` only when you intentionally need a blank header cell) so the table renders as intended.
> ## ResourceClaimTemplates for topology-aware allocation
>
> With both drivers publishing ResourceSlices, we can write ResourceClaimTemplates that use CEL selectors to express precise GPU-NIC co-location constraints. Each template creates a per-pod ResourceClaim that requests devices from both the `gpu.nvidia.com` and `dranet.net` DeviceClasses, filtered by attributes like NUMA node or PCI address. We define three templates to demonstrate different NUMA placement strategies.

**Comment:** This section says the NIC devices come from the `dranet.net` DeviceClass, but elsewhere in the post the NIC driver/namespace is `dra.net` (for example `driver: dra.net` and `dra.net/*` attributes). Please clarify the intended identifiers (DeviceClass vs driver name) and make the examples consistent so readers can copy/paste them reliably.

Suggested change:

> With both drivers publishing ResourceSlices, we can write ResourceClaimTemplates that use CEL selectors to express precise GPU-NIC co-location constraints. Each template creates a per-pod ResourceClaim that requests devices from both the `gpu.nvidia.com` DeviceClass and the DRANET NIC DeviceClass, `dranet.net`, filtered by attributes like NUMA node or PCI address. In these examples, `dranet.net` is the DeviceClass name, while `dra.net` is the DRANET driver name and attribute namespace used in the published device attributes and CEL selectors. We define three templates to demonstrate different NUMA placement strategies.
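To make the naming split concrete, a NUMA-aligned claim template might look like the following sketch. It assumes the `gpu.nvidia.com` and `dranet.net` DeviceClass names and the `dra.net` attribute namespace discussed in this review; the template name and the NUMA-0 constraint are illustrative, not copied from the post:

```yaml
# Hypothetical ResourceClaimTemplate: requests one GPU and one NIC,
# with the NIC pinned to NUMA node 0 via a CEL attribute selector.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: gpu-nic-numa0   # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com   # DeviceClass name
          count: 1
      - name: nic
        exactly:
          deviceClassName: dranet.net      # DeviceClass name
          count: 1
          selectors:
          - cel:
              # "dra.net" is the driver's attribute namespace
              expression: device.attributes["dra.net"].numaNode == 0
```

A pod then references the template via `resourceClaims`, and the scheduler allocates a GPU/NIC pair satisfying the constraint on one node.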
> For a deeper walkthrough of DRA concepts and a hands-on tutorial with the NVIDIA GPU DRA driver, see our previous post on [DRA devices and drivers on Kubernetes](/2025/11/17/dra-devices-and-drivers-on-kubernetes).
> :::
>
> In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [Azure ND GB300-v6](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-Uniform Memory Access) alignment scenarios, show and compare the RDMA benchmark results.

**Comment:** The Microsoft Learn URL uses a locale-specific path (`/en-us/`). Repo guidance prefers locale-agnostic Learn links (for example, `https://learn.microsoft.com/azure/...`) to avoid unnecessary redirects and keep links consistent.

Suggested change:

> In this post, we walk through how DRANET works on [AKS 1.34](https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/) with [Azure ND GB300-v6](https://learn.microsoft.com/azure/virtual-machines/sizes/gpu-accelerated/nd-gb300-v6-series?tabs=sizebasic) nodes, demonstrate three NUMA (Non-Uniform Memory Access) alignment scenarios, show and compare the RDMA benchmark results.
> On an Azure ND GB300-v6 node, there are four NVIDIA GB300 GPUs and four [Nvidia ConnectX-5](https://www.nvidia.com/en-sg/networking/ethernet/connectx-5/) NICs with [InfiniBand](https://www.nvidia.com/en-us/networking/products/infiniband/) spread across two NUMA domains. The hardware topology looks like this:
>
> | Resource | Count | Detail |
> |---|---|---|
> | GPU | 4x NVIDIA GB300 | 288 GB HBM3E each |
> | NIC | 4x Mellanox ConnectX | 100 GB/s InfiniBand each |

**Comment:** Branding/capitalization is inconsistent: the paragraph uses "Nvidia ConnectX-5" but later the table uses "Mellanox ConnectX". Consider standardizing to the current vendor name (for example, "NVIDIA ConnectX-5" / "NVIDIA (Mellanox) ConnectX-5") for clarity and consistency.

Suggested change:

> On an Azure ND GB300-v6 node, there are four NVIDIA GB300 GPUs and four [NVIDIA ConnectX-5](https://www.nvidia.com/en-sg/networking/ethernet/connectx-5/) NICs with [InfiniBand](https://www.nvidia.com/en-us/networking/products/infiniband/) spread across two NUMA domains. The hardware topology looks like this:
>
> | Resource | Count | Detail |
> |---|---|---|
> | GPU | 4x NVIDIA GB300 | 288 GB HBM3E each |
> | NIC | 4x NVIDIA ConnectX-5 | 100 GB/s InfiniBand each |
> GPUs 0-1 and NICs 0-1 share NUMA node 0. GPUs 2-3 and NICs 2-3 share NUMA node 1. A **NODE** relationship means the GPU and NIC share a direct PCIe root complex, enabling GPU-Direct RDMA (GDR). A **SYS** relationship means data must cross the QPI/UPI interconnect between NUMA domains, disabling GDR and adding latency.
>
> Without topology-aware scheduling, Kubernetes has no way to co-locate a GPU and its NUMA-local NICs in the same ResourceClaim. Scheduling a workload onto a GPU with a  wrong NIC on a different NUMA node, can silently result in slower data paths and degrade RDMA performance.

**Comment:** This sentence has a double space ("a  wrong NIC") and awkward phrasing/punctuation (", can"). Consider revising to "the wrong NIC" and removing the extra space/comma to improve readability.

Suggested change:

> Without topology-aware scheduling, Kubernetes has no way to co-locate a GPU and its NUMA-local NICs in the same ResourceClaim. Scheduling a workload onto a GPU with the wrong NIC on a different NUMA node can silently result in slower data paths and degrade RDMA performance.
> ## How does DRANET work

**Comment:** Section heading reads like a question but is missing the question mark. Either add "?" or change the heading to a statement like "How DRANET works".

Suggested change:

> ## How does DRANET work?