dra-driver-cpu

Kubernetes Dynamic Resource Allocation (DRA) driver for CPU resources.

This repository implements a DRA driver that enables Kubernetes clusters to manage and assign CPU resources to workloads using the DRA framework.

Configuration

The driver can be configured with the following command-line flags:

  • --cpu-device-mode: Sets the mode for exposing CPU devices.
    • "individual": Exposes each allocatable CPU as a separate device in the ResourceSlice. This mode provides fine-grained control as it exposes granular information specific to each CPU as device attributes in the ResourceSlice.
    • "grouped" (default): Exposes a single device representing a group of CPUs. This mode treats CPUs as a consumable capacity within the group, improving scalability by reducing the number of API objects.
  • --cpu-device-group-by: When --cpu-device-mode is set to "grouped", this flag determines the grouping strategy.
    • "numanode" (default): Groups CPUs by NUMA node.
    • "socket": Groups CPUs by socket.
  • --reserved-cpus: Specifies a set of CPUs reserved for system and kubelet processes. These CPUs are not allocatable by the DRA driver and are excluded from the ResourceSlice. The value is a cpuset, e.g., 0-1. The semantics match those the kubelet applies with its static CPU Manager policy when the strict-cpu-reservation option is enabled and the reserved CPUs are specified via reservedSystemCPUs. For correct CPU accounting, the number of CPUs reserved with this flag should match the sum of the kubelet's kubeReserved and systemReserved settings; this ensures the kubelet subtracts the correct number of CPUs from Node.Status.Allocatable.
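
As a hedged illustration, these flags would typically be passed as container args in the driver DaemonSet. This is a minimal sketch, not an excerpt of the actual install.yaml; the container name and image reference are hypothetical:

# Illustrative excerpt of a DaemonSet container spec; the real manifest may differ.
containers:
- name: dra-driver-cpu
  image: example.registry/dra-driver-cpu:latest  # hypothetical image reference
  args:
  - --cpu-device-mode=grouped       # the default mode
  - --cpu-device-group-by=socket    # group CPUs by socket instead of NUMA node
  - --reserved-cpus=0-1             # keep CPUs 0-1 for system and kubelet processes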

How it Works

The driver is deployed as a DaemonSet which contains two core components:

  • DRA driver: This component is the main control loop and handles the interaction with the Kubernetes API server for Dynamic Resource Allocation.

    • Topology Discovery: It discovers the node's CPU topology, including details like sockets, NUMA nodes, cores, SMT siblings, Last-Level Cache (LLC), and core types (e.g., Performance-cores, Efficiency-cores). This is done by parsing /proc/cpuinfo and reading sysfs files.
    • ResourceSlice Publication: Based on the --cpu-device-mode flag, it publishes ResourceSlice objects to the API server:
      • In individual mode, each allocatable CPU becomes a device in the ResourceSlice, with attributes detailing its topology.
      • In grouped mode, devices represent larger CPU aggregates (like NUMA nodes or sockets). These devices support consumable capacity, indicating the number of available CPUs within that group.
    • Claim Allocation: When a ResourceClaim is assigned to the node, the DRA driver handles the allocation:
      • In individual mode, the scheduler has already selected specific CPU devices. The driver enforces this selection through CDI and NRI.
      • In grouped mode, the claim requests a quantity of CPUs from the group device. The driver then uses topology-aware allocation logic (imported from the kubelet's CPU Manager) to select the physical CPUs within the group. Strict compatibility with the kubelet CPU Manager's allocation behavior is not a goal of this driver; this decision will be revisited in future releases.
    • CDI Spec Generation: Upon successful allocation, the driver generates a CDI (Container Device Interface) specification.
  • CDI (Container Device Interface): The driver uses CDI to communicate the allocated CPU set to the container runtime.

    • A CDI JSON spec file is created or updated for the allocated claim.
    • This spec instructs the runtime to inject an environment variable (e.g., DRA_CPUSET_<claimUID>=<cpuset>) into the container; a sketch of such a spec follows this list.
    • The driver includes mechanisms for thread-safe and atomic updates to the CDI spec files.
  • NRI Plugin: This component integrates with the container runtime via the Node Resource Interface (NRI).

    • For containers with guaranteed CPUs (those with a DRA ResourceClaim), the plugin reads the environment variable injected via CDI and pins the container to its exclusive CPU set using the cgroup cpuset controller.
    • For all other containers, it confines them to a shared pool of CPUs, which consists of all allocatable CPUs not exclusively assigned to any guaranteed container.
    • It dynamically updates the shared pool cpuset for all shared containers whenever guaranteed allocations change (containers are created or removed).
    • On restart, the NRI plugin can synchronize its state by inspecting existing containers and their environment variables to rebuild the current CPU allocations.
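
For illustration, here is a minimal sketch of what a generated CDI spec could look like, assuming a YAML rendering (CDI accepts both JSON and YAML spec files). The kind and device name below are assumptions, not the driver's actual output; only the environment variable form comes from the description above:

cdiVersion: "0.6.0"
kind: dra.cpu/cpu              # hypothetical CDI kind for this driver
devices:
- name: claim-<claimUID>       # hypothetical per-claim device name
  containerEdits:
    env:
    # The NRI plugin reads this variable to pin the container's cpuset.
    - DRA_CPUSET_<claimUID>=4-7,36-39   # illustrative allocated cpuset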

Feature Support

Currently Supported

  • Exclusive CPU Allocation: Pods that request CPUs via a ResourceClaim are allocated exclusive CPUs based on the chosen mode and topology.
  • Shared CPU Pool Management: All other containers without a ResourceClaim are confined to a shared pool of CPUs that are not reserved.
  • Topology Awareness: The driver discovers detailed CPU topology including sockets, NUMA nodes, cores, SMT siblings, L3 cache (UncoreCache), and core types (Performance/Efficiency).
  • Advanced CPU Allocation Strategies: When in "grouped" mode, the driver utilizes allocation logic adapted from the Kubelet's CPU Manager, including:
    • NUMA aware best-fit allocation.
    • Packing or spreading CPUs across cores.
    • Preference for aligning allocations to UncoreCache boundaries.
  • CDI Integration: Manages CDI spec files to inject environment variables containing the allocated cpuset into the container.
  • State Synchronization: On restart, the driver synchronizes with all existing pods on the node to rebuild its state of CPU allocations from environment variables injected by CDI.
  • Multiple Device Exposure Modes:
    • Individual Mode: Each CPU is a device, allowing selection based on attributes like CPU ID, core type, NUMA node, etc. (see the sketch after this list). This mode is ideal for workloads requiring fine-grained control over CPU placement, common in HPC or performance-critical applications.
    • Grouped Mode: CPUs are grouped (e.g., by NUMA node or socket) and treated as a consumable capacity within that group. This helps in reducing the number of devices exposed to the API server, especially on systems with a large number of CPUs, thus improving scalability. This mode is suitable for workloads needing alignment with other DRA resources within the same group (e.g., NUMA node) or where the exact CPU IDs are less critical than the quantity.
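
As a hedged example of attribute-based selection in individual mode, the following ResourceClaim sketch requests four CPU devices whose coreType attribute equals "standard". The attribute names match the ResourceSlice example below; the claim name and count are illustrative:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: claim-four-standard-cpus   # illustrative name
spec:
  devices:
    requests:
    - name: cpus
      exactly:
        deviceClassName: dra.cpu
        allocationMode: ExactCount
        count: 4
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].coreType == "standard"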

Not Supported

  • This driver currently only manages CPU resources. Memory allocation and management are not supported.
  • While the driver is topology-aware, the grouped mode currently abstracts some of the fine-grained details within the group. Future enhancements may explore combining consumable capacity with partitionable devices for more hierarchical control.

Sharing resource claims

This driver strictly enforces a 1-to-1 mapping between claims and containers. It does not support sharing a single ResourceClaim among multiple containers or multiple pods if that claim includes a resource (dra.cpu) managed by this driver. Attempting to share a claim among containers or pods will cause all but the first pod consuming the claim to fail to start with a CreateContainerError and remain in Pending state.

The rationale for disallowing sharing is that a shared claim confuses resource accounting, which is currently fragile because of the lack of integration between classic resource accounting and DRA-managed core resources.

This gap is meant to be addressed by KEP-5517 (Native Resource Management). Until that KEP progresses and gains traction, the safest approach for this driver is to prevent any resource claim sharing.
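
Given the 1-to-1 mapping, a ResourceClaimTemplate can be used so each pod gets its own claim instead of sharing one. A minimal sketch, with illustrative names and CPU count:

apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: cpu-claim-template   # illustrative name
spec:
  spec:
    devices:
      requests:
      - name: cpus
        exactly:
          deviceClassName: dra.cpu
          capacity:
            requests:
              dra.cpu/cpu: "4"   # illustrative CPU count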

Matching CPU Manager functionality

The kubelet CPU Manager supports options to fine-tune CPU allocation behavior. This DRA driver aims to implement feature parity with the kubelet CPU Manager. The following table summarizes how you can achieve the functionality controlled by each CPU Manager policy option. Reference: Kubernetes 1.35.0.

CPU Manager Option        | Maturity | Kubelet development status | Driver equivalent functionality                             | Notes
--------------------------|----------|----------------------------|-------------------------------------------------------------|----------------------
AlignBySocket             | alpha    | inactive                   | --cpu-device-mode=grouped with --cpu-device-group-by=socket |
DistributeCPUsAcrossCores | alpha    | inactive                   | none yet; postponed till the k8s feature graduates to beta  |
DistributeCPUsAcrossNUMA  | beta     | active                     | see issue: #46                                              | see below for details
PreferAlignByUnCoreCache  | beta     | active                     | builtin; enabled by default                                 |
FullPCPUsOnly             | GA       | N/A                        | see issue: #45                                              |
StrictCPUReservation      | GA       | N/A                        | builtin; enabled by default                                 |

Distributing CPUs across NUMA nodes

It is currently possible to encode a split of CPUs such that the allocator picks them from different NUMA nodes. Example:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: claim-cpu-capacity-20
spec:
  devices:
    requests:
    - name: numa0-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID == 0
    - name: numa1-cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"
        selectors:
        - cel:
            expression: device.attributes["dra.cpu"].numaNodeID == 1

However, this is only a partial replacement for the corresponding CPU Manager option. The main problem with this approach is that it leaks assumptions about machine properties: we hardcode the NUMA split and, unlike the CPU Manager feature, it won't automatically adapt if the same claim is handled by a 1-NUMA, 2-NUMA, or 4-NUMA machine; the claim would need to be updated or recreated manually.

Getting Started

Installation

  • If needed, create a kind cluster. The repo provides a configuration that can be deployed as follows:
    • make kind-cluster
  • Deploy the driver and all necessary RBAC configurations using the provided manifest:
    • kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/dra-driver-cpu/refs/heads/main/install.yaml

Example Usage

The driver supports two modes of operation. Each mode has a complete example manifest that includes both the ResourceClaim(s) and a sample Pod. The ResourceClaim requests a specific number of exclusive CPUs from the driver, and is referenced in the Pod spec to receive the allocated CPUs.

Grouped Mode (default)

In grouped mode, CPUs are requested as a consumable capacity from a device group (e.g., NUMA node or socket). This example requests 10 CPUs.

  • kubectl apply -f hack/examples/pod_with_resource_claim_grouped_mode.yaml
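
A hedged sketch of the kind of manifest that file contains; the names and image below are illustrative, and the file in the repo is the authoritative version:

apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: cpu-claim-grouped   # illustrative name
spec:
  devices:
    requests:
    - name: cpus
      exactly:
        deviceClassName: dra.cpu
        capacity:
          requests:
            dra.cpu/cpu: "10"   # 10 CPUs from one group device
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-grouped   # illustrative name
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      claims:
      - name: exclusive-cpus
  resourceClaims:
  - name: exclusive-cpus
    resourceClaimName: cpu-claim-grouped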

Individual Mode

In individual mode, specific CPU devices are requested by count, allowing for fine-grained control over CPU selection. This example includes two ResourceClaims requesting 4 and 6 CPUs respectively, used by a Pod with multiple containers.

  • kubectl apply -f hack/examples/pod_with_resource_claim_individual_mode.yaml
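
A hedged sketch of the shape of that manifest, assuming two count-based claims consumed by separate containers, consistent with the 1-to-1 claim-to-container mapping described above (names and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: pod-individual   # illustrative name
spec:
  containers:
  - name: app-four-cpus
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      claims:
      - name: cpus-4
  - name: app-six-cpus
    image: registry.k8s.io/pause:3.10   # placeholder image
    resources:
      claims:
      - name: cpus-6
  resourceClaims:
  - name: cpus-4
    resourceClaimName: claim-cpu-4   # illustrative claim name
  - name: cpus-6
    resourceClaimName: claim-cpu-6   # illustrative claim name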

Example ResourceSlices

Here's how the ResourceSlice objects might look for the different modes:

Individual Mode

Each CPU is listed as a separate device with detailed attributes.

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: dra-driver-cpu-worker-dra.cpu-qskwf
  # ... other metadata
spec:
  driver: dra.cpu
  nodeName: dra-driver-cpu-worker
  pool:
    generation: 1
    name: dra-driver-cpu-worker
    resourceSliceCount: 1
  devices:
  - attributes:
      dra.cpu/cacheL3ID:
        int: 0
      dra.cpu/coreID:
        int: 1
      dra.cpu/coreType:
        string: standard
      dra.cpu/cpuID:
        int: 1
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    name: cpudev0
  - attributes:
      dra.cpu/cacheL3ID:
        int: 0
      dra.cpu/coreID:
        int: 1
      dra.cpu/coreType:
        string: standard
      dra.cpu/cpuID:
        int: 33
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    name: cpudev1
  # ... other CPU devices

Grouped Mode (e.g., by NUMA node)

CPUs are grouped, and the device entry shows consumable capacity.

apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: dra-driver-cpu-worker-dra.cpu-tp869
  # ... other metadata
spec:
  driver: dra.cpu
  nodeName: dra-driver-cpu-worker
  pool:
    generation: 1
    name: dra-driver-cpu-worker
    resourceSliceCount: 1
  devices:
  - allowMultipleAllocations: true
    attributes:
      dra.cpu/smtEnabled:
        bool: true
      dra.cpu/numCPUs:
        int: 64
      dra.cpu/numaNodeID:
        int: 0
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 0
    capacity:
      dra.cpu/cpu:
        value: "64"
    name: cpudevnuma0
  - allowMultipleAllocations: true
    attributes:
      dra.cpu/smtEnabled:
        bool: true
      dra.cpu/numCPUs:
        int: 64
      dra.cpu/numaNodeID:
        int: 1
      dra.cpu/socketID:
        int: 0
      dra.net/numaNode:
        int: 1
    capacity:
      dra.cpu/cpu:
        value: "64"
    name: cpudevnuma1

Running the tests

To run the e2e tests against a custom-setup, special-purpose kind cluster, just run

make test-e2e-kind

Set the environment variable DRACPU_E2E_VERBOSE to 1 for more output:

DRACPU_E2E_VERBOSE=1 make test-e2e-kind

In some cases, you may need to explicitly set the KUBECONFIG path:

KUBECONFIG=${HOME}/.kube/config DRACPU_E2E_VERBOSE=1 make test-e2e-kind

The test-e2e-kind target exercises the same flows that run in the project CI. Full documentation for all supported environment variables can be found in the tests README.

NOTE: the custom-setup kind cluster is not automatically torn down once the tests terminate. NOTE: to run the tests again, just use make test-e2e. Please see make help for more details.

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page. Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

You can reach the maintainers of this project at:
