45 changes: 8 additions & 37 deletions website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/index.md
@@ -6,54 +6,25 @@ authors:
- datta-rajpure
tags: ["aks-arc", "ai", "ai-inference"]
---
For many edge and on-premises environments, sending data to the cloud for AI inferencing isn't an option, as latency, data residency, and compliance make it a non-starter. With Azure Kubernetes Service (AKS) enabled by Azure Arc managing your Kubernetes clusters, you can run AI inferencing locally on the hardware you already have. This blog series shows you how, with hands-on tutorials covering deployment of generative and predictive AI workloads using CPUs, GPUs, and NPUs.
For many edge and on-premises environments, sending data to the cloud for AI inferencing isn't an option, as latency, data residency, and compliance make it a non-starter. With Azure Kubernetes Service (AKS) enabled by Azure Arc managing your Kubernetes clusters, you can run AI inferencing locally on the hardware you already have. This blog series shows you how, with hands-on tutorials for **experimenting** with generative and predictive AI workloads using CPUs, GPUs, and NPUs.

<!-- truncate -->

![AI Inference on AKS enabled by Azure Arc: Running AI Inference at the Edge and On‑Prem](./hero-image.png)

## Introduction

Whether you are processing sensor data on a factory floor, analyzing medical images in a hospital, or running in-store retail analytics, your AI models need to run where the data lives. AKS enabled by Azure Arc extends Azure’s management capabilities to distributed Kubernetes environments so you can deploy and operate AI workloads across data centers, branch offices, and edge locations. In this series, you learn how to run and validate generative and predictive AI inference using the hardware and infrastructure you already have.
Whether you are processing sensor data on a factory floor, analyzing medical images in a hospital, or running in-store retail analytics, your AI models need to run where the data lives. With AKS enabled by Azure Arc, you can extend Azure’s Kubernetes management to on‑prem and edge infrastructure and run AI inference without changing how you operate your clusters. You use the same Kubernetes APIs, deployment patterns, and lifecycle workflows across cloud and non‑cloud environments, while keeping inference close to where data is generated. This allows you to operate AI workloads consistently across highly distributed environments.

## Why AI inferencing on AKS enabled by Azure Arc matters

Running AI inference on AKS enabled by Azure Arc addresses several urgent customer needs and industry trends:
- **Low latency and data residency:** Inference runs locally, meeting real-time and compliance requirements for factory automation, medical imaging, and retail analytics.
- **Existing hardware utilization:** Use your current infrastructure with flexibility to add GPUs or accelerators later.
- **Hybrid and disconnected operations:** Manage workloads centrally from Azure while local execution continues during network outages.
- **Industry alignment:** Support the shift toward edge AI driven by data gravity, regulatory compliance, and real-time requirements.

- **Low latency and data residency –**
Inference workloads can run locally on-premises or at the edge, ensuring real-time responsiveness and compliance with data sovereignty requirements. This is essential for scenarios like factory automation, medical imaging, or retail analytics, where data must remain on-site and latency is a key constraint.

- **Existing hardware utilization –**
This lets you use existing infrastructure while keeping the flexibility to scale with GPUs or other accelerators later.

- **Hybrid and disconnected operations –**
AKS enabled by Azure Arc provides a consistent deployment and governance experience across connected and disconnected environments. Customers can centrally manage AI workloads from Azure while ensuring local execution continues even during network outages.

- **Aligned with industry trends –**
The shift toward hybrid and edge AI is driven by trends like data gravity, regulatory compliance, and the need for real-time insights. AKS enabled by Azure Arc aligns with these trends by enabling scalable, secure, and flexible AI deployments across industries such as manufacturing, healthcare, retail, and logistics.

## A platform for distributed AI operations

AKS enabled by Azure Arc enables you to bring your own AI runtimes and models to Kubernetes clusters running in hybrid environments. It provides:

- A consistent DevOps experience for deploying and managing AI models across environments
- Centralized governance, monitoring, and security via Azure
- Integration with Azure ML and Microsoft Foundry for model lifecycle management
- Support for diverse hardware configurations, including CPUs, GPUs, and NPUs

By managing Kubernetes clusters across hybrid and edge environments, AKS enabled by Azure Arc helps you operationalize AI workloads using the tools and runtimes that best fit your infrastructure and use cases.

## Explore AI inference with step-by-step tutorials

To help you explore and validate AI inference on AKS enabled by Azure Arc, we’ve created a series of scenario-driven tutorials that show how to run both generative and predictive workloads in hybrid and edge environments. The series walks through concrete examples step by step, using open-source tools and real models to demonstrate hybrid AI capabilities in action. Each tutorial highlights a different inference pattern and technology stack, reflecting the diverse options available for edge inferencing:

- Deploy open-source large language models (LLMs) using GPU-accelerated inference engines
- Serve predictive models like ResNet-50 using a unified model server
- Configure and validate inference workloads across different hardware types
- Manage and monitor inference services using Azure-native tools

These tutorials help you build confidence running AI at the edge using your existing Kubernetes skills and AKS enabled by Azure Arc infrastructure. The examples rely on off-the-shelf assets such as open-source models and containers to highlight an open and flexible approach. You can bring your own models and select the inference engine best suited to the task, whether that is a lightweight CPU-friendly runtime or a vendor-optimized GPU server.
Together, these capabilities make AKS enabled by Azure Arc a strong foundation for AI inference across edge and on‑prem environments, enabling you to choose and operate inference engines and models directly on your Kubernetes clusters, including bring‑your‑own models, while integrating with Microsoft’s broader AI stack such as [Microsoft Foundry](https://learn.microsoft.com/azure/foundry/what-is-foundry), [Microsoft Foundry Local](https://learn.microsoft.com/azure/foundry-local/what-is-foundry-local), and [KAITO](https://learn.microsoft.com/azure/aks/ai-toolchain-operator) where appropriate.

## Get started

To get started, follow the tutorial series: [AI Inference on AKS Arc: Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2). By the end, you'll have hands-on experience running AI models across hybrid cloud and edge environments on Azure Arc.
This series walks you through experimenting with generative and predictive AI workloads step by step, using open-source tools and real models on your AKS enabled by Azure Arc clusters. For the full list of topics, prerequisites, and hands-on tutorials, head to the [Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2).
93 changes: 21 additions & 72 deletions website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/index.md
@@ -6,103 +6,52 @@ authors:
- datta-rajpure
tags: ["aks-arc", "ai", "ai-inference"]
---
This series gives you **practical, step-by-step guidance** for running generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on **repeatable, production-ready validation** rather than abstract examples.
This series gives you **practical, step-by-step guidance** for experimenting with generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on **repeatable, hands-on experimentation** rather than abstract examples.

<!-- truncate -->

![AI inference on AKS enabled by Azure Arc series overview showing generative and predictive AI patterns across hybrid environments](./hero-image.png)

## Introduction

This series explores emerging patterns for running generative and predictive AI inference workloads on AKS enabled by Azure Arc clusters in on-premises and edge environments. If you're looking to deploy AI closer to where your data is generated (on factory floors, in retail stores, across manufacturing lines, and within infrastructure monitoring systems), you face unique challenges: limited connectivity, diverse hardware, and constrained resources.
High-end GPUs may not always be available or practical in these environments due to cost, power, or space limitations. That's why you may be exploring how to leverage your existing infrastructure—such as CPU-based clusters—or evaluating new accelerators like NPUs to enable scalable, low-latency inference at the edge.
The series focuses on scenario-driven experimentation with AI inference on AKS enabled by Azure Arc, validating real-world deployments that go beyond traditional cloud-centric patterns. From deploying open-source LLM servers like **Ollama** and **vLLM** to integrating **NVIDIA Triton** with custom backends, each entry provides a structured approach to evaluating feasibility, performance, and operational readiness. Our goal is to equip you with practical insights and repeatable strategies for enabling AI inference in hybrid and edge-native environments.
[Part 1](/2026/04/07/ai-inference-on-aks-arc-part-1) covered why running AI inference at the edge matters. This post defines the series scope, ground rules, and shared prerequisites so each tutorial can focus on the hands-on walkthrough.

## Audience and assumptions
## Scope and expectations

This series assumes:

- You are already familiar with Kubernetes concepts such as pods, deployments, services, and node scheduling.
- You are operating, or plan to operate, AKS enabled by Azure Arc on Azure Local or a comparable on‑premises / edge environment.
- You are comfortable using command‑line tools such as kubectl, Azure CLI, and Helm.
- You are evaluating AI inference workloads (LLMs or predictive models) from an infrastructure and platform perspective, not from a data science or model‑training perspective.

### Explicit Non‑Goals

To keep this series focused and actionable, the following topics are intentionally **not** covered:

- **Kubernetes fundamentals or onboarding:**
Readers new to Kubernetes should complete foundational material first:
- [Introduction to Kubernetes (Microsoft Learn)](https://learn.microsoft.com/training/modules/intro-to-kubernetes/)
- [Kubernetes Basics Tutorial (Upstream)](https://kubernetes.io/docs/tutorials/kubernetes-basics/)

- **AKS enabled by Azure Arc conceptual overview or onboarding:**
This series assumes you already understand what Azure Arc provides and how AKS enabled by Azure Arc works:
- [AKS enabled by Azure Arc Kubernetes overview](https://learn.microsoft.com/azure/azure-arc/kubernetes/overview)
- [AKS enabled by Azure Arc documentation](https://learn.microsoft.com/azure/aks/aksarc/)

- **Model training, fine‑tuning, or data preparation:**
All scenarios assume models are already trained and packaged in formats supported by the selected inference engine.

- **Deep internals of inference engines:**
Engine-specific internals are referenced only where required for deployment or configuration. For deeper learning:
- [NVIDIA Triton Inference Server documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/)
- [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)

If you’re looking for conceptual comparisons, performance benchmarks, or model‑level optimizations, those topics are intentionally out of scope for this series.

## Series ground rules (What this series guarantees)

Here is the set of baseline guarantees and assumptions that apply to all subsequent parts of the series:

- All scenarios use the same AKS enabled by Azure Arc environment unless explicitly noted otherwise.
- AKS enabled by Azure Arc is used as the management and control plane only; inference execution always occurs locally on the cluster.
- No managed Azure AI services are used to execute inference.
- Each scenario follows a consistent, repeatable structure so results can be compared across inference engines and hardware types.
:::warning
These tutorials are designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional security, reliability, and performance hardening, and without following your standard operational practices.
:::

### Standard workflow
This series assumes familiarity with Kubernetes fundamentals, proficiency with kubectl, Azure CLI, and Helm, and experience using AKS enabled by Azure Arc on Azure Local. The focus is not on model development or training.

Each scenario follows the same high-level workflow:
All scenarios use the same AKS enabled by Azure Arc environment and follow a consistent structure. Inference execution always occurs locally on the cluster. No managed Azure AI services are used. Each scenario follows the same steps: **connect and verify** cluster access, **prepare the accelerator** if required, **deploy the inference workload**, **validate inference** with a test request, and **clean up resources**.

- **Connect and verify:**
Log in to Azure and get cluster credentials. Inspect available compute resources (CPU, GPU, NPU) and node labels/capabilities.
For reference:

- **Prepare the accelerator (if required):**
Install or validate the required accelerator enablement based on the scenario.
- GPU: NVIDIA GPU Operator
- NPU: Vendor‑specific enablement (future)
- CPU: No accelerator setup required
- **Deploy the inference workload:**
- Deploy the model server or inference pipeline (LLM server, Triton, or other engine)
- Configure runtime parameters appropriate to the selected hardware
- **Validate inference:**
- Send a test request (prompt, image, or payload)
- Confirm functional and expected inference output
- **Clean up resources:**
- Remove deployed workloads
- Release cluster resources (compute, storage, accelerator allocations)
- [Kubernetes fundamentals](https://learn.microsoft.com/training/modules/intro-to-kubernetes/)
- [AKS enabled by Azure Arc overview](https://learn.microsoft.com/azure/azure-arc/kubernetes/overview)
- [AKS enabled by Azure Arc documentation](https://learn.microsoft.com/azure/aks/aksarc/)
- [GPU Operator internals](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
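To make the validate-inference step of the standard workflow concrete, here is a minimal Python sketch that builds, sends, and parses a request against an OpenAI-compatible chat completions endpoint, such as the one engines like vLLM expose. The base URL, model name, and helper function names are illustrative assumptions, not values prescribed by the series.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> bytes:
    """Serialize a minimal OpenAI-compatible chat completion payload."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return json.dumps(payload).encode("utf-8")


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of a chat completion response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def send_test_prompt(base_url: str, model: str, prompt: str) -> str:
    """POST one prompt to the server's /v1/chat/completions route."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_reply(resp.read().decode("utf-8"))
```

After port-forwarding the in-cluster service (for example, `kubectl port-forward svc/<your-service> 8000:8000`), a call like `send_test_prompt("http://localhost:8000", "<your-model>", "Hello")` confirms the server returns a functional completion.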

## Series outline

In this series, we walk you through a range of AI inference patterns on AKS enabled by Azure Arc clusters, spanning both generative and predictive AI workloads. The series is designed to evolve over time, and we'll continue adding topics as we validate new scenarios, runtimes, and architectures.
The series is designed to evolve. New topics will be added as additional scenarios, runtimes, and architectures are explored.

### Topics covered in this series

| Topic | Type | Status |
| ----- | ---- | ------ |
| [**Ollama** — open-source LLM server](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available |
| [**vLLM** — high-throughput LLM engine](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available |
| [**Triton + ONNX** — ResNet‑50 image classification](/2026/04/07/ai-inference-on-aks-arc-part-4) | Predictive | ✅ Available |
| **Triton + TensorRT‑LLM** — optimized large-model inference | Generative | 🔜 Coming soon |
| **Triton + vLLM backend** — vision-language model serving | Generative | 🔜 Coming soon |

This series will continue to grow as we introduce new inference engines, hardware configurations, and real‑world deployment patterns across edge, on‑premises, and hybrid environments.
| [**Ollama**: open-source LLM server](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available |
| [**vLLM**: high-throughput LLM engine](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available |
| [**Triton + ONNX**: ResNet‑50 image classification](/2026/04/07/ai-inference-on-aks-arc-part-4) | Predictive | ✅ Available |
| [**Triton + TensorRT‑LLM**: optimized large-model inference](/2026/04/09/ai-inference-on-aks-arc-part-5) | Generative | ✅ Available |
| **Triton + vLLM backend**: vision-language model serving | Generative | 🔜 Coming soon |

## Prerequisites

All scenarios in this series run on a common AKS enabled by Azure Arc clusters environment. Before you begin, make sure you have the following in place:
All scenarios in this series run on an AKS enabled by Azure Arc cluster deployed on Azure Local. Before you begin, make sure you have the following in place:

- **AKS enabled by Azure Arc clusters with a GPU node:** A Azure Local clusters with at least one GPU node and appropriate NVIDIA drivers installed. The GPU node needs the NVIDIA device plugin (via the NVIDIA GPU Operator) running so pods can access nvidia.com/gpu resources.
- **AKS enabled by Azure Arc clusters with a GPU node:** An Azure Local cluster with at least one GPU node and appropriate NVIDIA drivers installed. The GPU node needs the NVIDIA device plugin (via the NVIDIA GPU Operator) running so pods can access nvidia.com/gpu resources.

- **Azure CLI with Azure Arc extensions:** The [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) installed on your admin machine, with the `connectedk8s` extension (for Azure Arc-enabled Kubernetes). Use `az extension list -o table` to confirm it is installed.
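To illustrate what the GPU prerequisite enables, here is a minimal, hypothetical Deployment manifest that schedules onto a GPU node by requesting the `nvidia.com/gpu` resource exposed by the NVIDIA device plugin. The image and names are placeholders, not assets used in this series.

```yaml
# Hypothetical sketch: one replica of an inference server claiming one GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-inference
  template:
    metadata:
      labels:
        app: example-inference
    spec:
      containers:
        - name: server
          image: example.io/inference-server:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # claims one GPU via the device plugin
```

If the GPU Operator and device plugin are healthy, `kubectl describe node <gpu-node-name>` lists `nvidia.com/gpu` under the node's allocatable resources; otherwise a pod with this request stays `Pending`.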
