diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/hero-image.png b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/hero-image.png index 35e865cf1..150006ba7 100644 Binary files a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/hero-image.png and b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/hero-image.png differ diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/index.md b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/index.md index 655ace552..5aa9a4a4a 100644 --- a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/index.md +++ b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-1/index.md @@ -6,7 +6,7 @@ authors: - datta-rajpure tags: ["aks-arc", "ai", "ai-inference"] --- -For many edge and on-premises environments, sending data to the cloud for AI inferencing isn't an option, as latency, data residency, and compliance make it a non-starter. With Azure Kubernetes Service (AKS) enabled by Azure Arc managing your Kubernetes clusters, you can run AI inferencing locally on the hardware you already have. This blog series shows you how, with hands-on tutorials covering deployment of generative and predictive AI workloads using CPUs, GPUs, and NPUs. +For many edge and on-premises environments, sending data to the cloud for AI inferencing isn't an option, as latency, data residency, and compliance make it a non-starter. With Azure Kubernetes Service (AKS) enabled by Azure Arc managing your Kubernetes clusters, you can run AI inferencing locally on the hardware you already have. This blog series shows you how, with hands-on tutorials for **experimenting** with generative and predictive AI workloads using CPUs, GPUs, and NPUs. 
@@ -14,46 +14,17 @@ For many edge and on-premises environments, sending data to the cloud for AI inf ## Introduction -Whether you are processing sensor data on a factory floor, analyzing medical images in a hospital, or running in store retail analytics, your AI models need to run where the data lives. AKS enabled by Azure Arc extends Azure’s management capabilities to distributed Kubernetes environments so you can deploy and operate AI workloads across data centers, branch offices, and edge locations. In this series, you learn how to run and validate generative and predictive AI inference using the hardware and infrastructure you already have. +Whether you are processing sensor data on a factory floor, analyzing medical images in a hospital, or running in-store retail analytics, your AI models need to run where the data lives. With AKS enabled by Azure Arc, you can extend Azure’s Kubernetes management to on‑prem and edge infrastructure and run AI inference without changing how you operate your clusters. You use the same Kubernetes APIs, deployment patterns, and lifecycle workflows across cloud and non‑cloud environments, while keeping inference close to where data is generated. This allows you to operate AI workloads consistently across highly distributed environments. ## Why AI inferencing on AKS enabled by Azure Arc matters -Running AI inference on AKS enabled by Azure Arc addresses several urgent customer needs and industry trends: +- **Low latency and data residency:** Inference runs locally, meeting real-time and compliance requirements for factory automation, medical imaging, and retail analytics. +- **Existing hardware utilization:** Use your current infrastructure with flexibility to add GPUs or accelerators later. +- **Hybrid and disconnected operations:** Manage workloads centrally from Azure while local execution continues during network outages. 
+- **Industry alignment:** Support the shift toward edge AI driven by data gravity, regulatory compliance, and real-time requirements. -- **Low latency and data residency –** -Inference workloads can run locally on-premises or at the edge, ensuring real-time responsiveness and compliance with data sovereignty requirements. This is essential for scenarios like factory automation, medical imaging, or retail analytics, where data must remain on-site and latency is a key constraint. - -- **Existing hardware utilization –** -This lets you use existing infrastructure while keeping the flexibility to scale with GPUs or other accelerators later. - -- **Hybrid and disconnected operations –** -AKS enabled by Azure Arc provides a consistent deployment and governance experience across connected and disconnected environments. Customers can centrally manage AI workloads from Azure while ensuring local execution continues even during network outages. - -- **Aligned with industry trends –** -The shift toward hybrid and edge AI is driven by trends like data gravity, regulatory compliance, and the need for real-time insights. AKS enabled by Azure Arc aligns with these trends by enabling scalable, secure, and flexible AI deployments across industries such as manufacturing, healthcare, retail, and logistics. - -## A platform for distributed AI operations - -AKS enabled by Azure Arc enables you to bring your own AI runtimes and models to Kubernetes clusters running in hybrid environments. 
It provides: - -- A consistent DevOps experience for deploying and managing AI models across environments -- Centralized governance, monitoring, and security via Azure -- Integration with Azure ML and Microsoft Foundry for model lifecycle management -- Support for diverse hardware configurations, including CPUs, GPUs, and NPUs - -By managing Kubernetes clusters across hybrid and edge environments, AKS enabled by Azure Arc helps you operationalize AI workloads using the tools and runtimes that best fit your infrastructure and use cases. - -## Explore AI inference with step-by-step tutorials - -To help you explore and validate AI inference on AKS enabled by Azure Arc, we’ve created a series of scenario-driven tutorials that show how to run both generative and predictive workloads in hybrid and edge environments. The series walks through concrete examples step by step, using open-source tools and real models to demonstrate hybrid AI capabilities in action. Each tutorial highlights a different inference pattern and technology stack, reflecting the diverse options available for edge inferencing: - -- Deploy open-source large language models (LLMs) using GPU-accelerated inference engines -- Serve predictive models like ResNet-50 using a unified model server -- Configure and validate inference workloads across different hardware types -- Manage and monitor inference services using Azure-native tools - -These tutorials help you build confidence running AI at the edge using your existing Kubernetes skills and AKS enabled by Azure Arc infrastructure. The examples rely on off the shelf assets such as open source models and containers to highlight an open and flexible approach. You can bring your own models and select the inference engine best suited to the task whether that is a lightweight CPU friendly runtime or a vendor optimized GPU server. 
+Together, these capabilities make AKS enabled by Azure Arc a strong foundation for AI inference across edge and on‑prem environments. You can choose and operate inference engines and models, including bring‑your‑own models, directly on your Kubernetes clusters, and integrate with Microsoft’s broader AI stack, such as [Microsoft Foundry](https://learn.microsoft.com/azure/foundry/what-is-foundry), [Microsoft Foundry Local](https://learn.microsoft.com/azure/foundry-local/what-is-foundry-local), and [KAITO](https://learn.microsoft.com/azure/aks/ai-toolchain-operator), where appropriate. ## Get started -To get started, follow the tutorial series: [AI Inference on AKS Arc: Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2). By the end, you'll have hands-on experience running AI models across hybrid cloud and edge environments on Azure Arc. +This series walks you through experimenting with generative and predictive AI workloads step by step, using open-source tools and real models on your AKS enabled by Azure Arc clusters. For the full list of topics, prerequisites, and hands-on tutorials, head to the [Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2). 
diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/hero-image.png b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/hero-image.png index ba6339640..55979e584 100644 Binary files a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/hero-image.png and b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/hero-image.png differ diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/index.md b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/index.md index ce3cbbca3..aece19c3a 100644 --- a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/index.md +++ b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-2/index.md @@ -6,7 +6,7 @@ authors: - datta-rajpure tags: ["aks-arc", "ai", "ai-inference"] --- -This series gives you **practical, step-by-step guidance** for running generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on **repeatable, production-ready validation** rather than abstract examples. +This series gives you **practical, step-by-step guidance** for experimenting with generative and predictive AI inference workloads on Azure Kubernetes Service (AKS) enabled by Azure Arc clusters, using CPUs, GPUs, and neural processing units (NPUs). The scenarios target on‑premises and edge environments, specifically Azure Local, with a focus on **repeatable, hands-on experimentation** rather than abstract examples. @@ -14,95 +14,44 @@ This series gives you **practical, step-by-step guidance** for running generativ ## Introduction -This series explores emerging patterns for running generative and predictive AI inference workloads on AKS enabled by Azure Arc clusters in on-premises and edge environments. 
If you're looking to deploy AI closer to where your data is generated on factory floors, in retail stores, across manufacturing lines, and within infrastructure monitoring systems—they face unique challenges: limited connectivity, diverse hardware, and constrained resources. -High-end GPUs may not always be available or practical in these environments due to cost, power, or space limitations. That's why you may be exploring how to leverage your existing infrastructure—such as CPU-based clusters—or exploring new accelerators like NPUs to enable scalable, low-latency inference at the edge. -The series focuses on scenario-driven experimentation with AI inference on AKS enabled by Azure Arc, validating real-world deployments that go beyond traditional cloud-centric patterns. From deploying open-source LLM servers like **Ollama** and **vLLM** to integrating **NVIDIA Triton** with custom backends, each entry provides a structured approach to evaluating feasibility, performance, and operational readiness. Our goal is to equip you with practical insights and repeatable strategies for enabling AI inference in hybrid and edge-native environments. +[Part 1](/2026/04/07/ai-inference-on-aks-arc-part-1) covered why running AI inference at the edge matters. This post defines the series scope, ground rules, and shared prerequisites so each tutorial can focus on the hands-on walkthrough. -## Audience and assumptions +## Scope and expectations -This series assumes: - -- You are already familiar with Kubernetes concepts such as pods, deployments, services, and node scheduling. -- You are operating, or plan to operate, AKS enabled by Azure Arc on Azure Local or a comparable on‑premises / edge environment. -- You are comfortable using command‑line tools such as kubectl, Azure CLI, and Helm. -- You are evaluating AI inference workloads (LLMs or predictive models) from an infrastructure and platform perspective, not from a data science or model‑training perspective. 
- -### Explicit Non‑Goals - -To keep this series focused and actionable, the following topics are intentionally **not** covered: - -- **Kubernetes fundamentals or onboarding:** - Readers new to Kubernetes should complete foundational material first: - - [Introduction to Kubernetes (Microsoft Learn)](https://learn.microsoft.com/training/modules/intro-to-kubernetes/) - - [Kubernetes Basics Tutorial (Upstream)](https://kubernetes.io/docs/tutorials/kubernetes-basics/) - -- **AKS enabled by Azure Arc conceptual overview or onboarding:** - This series assumes you already understand what Azure Arc provides and how AKS enabled by Azure Arc works: - - [AKS enabled by Azure Arc Kubernetes overview](https://learn.microsoft.com/azure/azure-arc/kubernetes/overview) - - [AKS enabled by Azure Arc documentation](https://learn.microsoft.com/azure/aks/aksarc/) - -- **Model training, fine‑tuning, or data preparation:** - All scenarios assume models are already trained and packaged in formats supported by the selected inference engine. - -- **Deep internals of inference engines:** - Engine-specific internals are referenced only where required for deployment or configuration. For deeper learning: - - [NVIDIA Triton Inference Server documentation](https://docs.nvidia.com/deeplearning/triton-inference-server/) - - [NVIDIA GPU Operator documentation](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) - -If you’re looking for conceptual comparisons, performance benchmarks, or model‑level optimizations, those topics are intentionally out of scope for this series. - -## Series ground rules (What this series guarantees) - -Here are the set of baseline guarantees and assumptions that apply to all subsequent parts of the series: - -- All scenarios use the same AKS enabled by Azure Arc environment unless explicitly noted otherwise. 
-AKS enabled by Azure Arc is used as the management and control plane only; inference execution always occurs locally on the cluster. -No managed Azure AI services are used to execute inference. -Each scenario follows a consistent, repeatable structure so results can be compared across inference engines and hardware types. +:::warning +These tutorials are designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional security, reliability, and performance hardening, and without following standard production practices. +::: -### Standard workflow +This series assumes familiarity with Kubernetes fundamentals, proficiency with kubectl, Azure CLI, and Helm, and experience using AKS enabled by Azure Arc on Azure Local. The focus is not on model development or training. -Each scenario follows the same high-level workflow: +All scenarios use the same AKS enabled by Azure Arc environment and follow a consistent structure. Inference execution always occurs locally on the cluster. No managed Azure AI services are used. Each scenario follows the same steps: **connect and verify** cluster access, **prepare the accelerator** if required, **deploy the inference workload**, **validate inference** with a test request, and **clean up resources**. -- **Connect and verify:** - Log in to Azure and get cluster credentials. Inspect available compute resources (CPU, GPU, NPU) and node labels/capabilities +For reference: -- **Prepare the accelerator (If Required):** - Install or validate the required accelerator enablement based on the scenario. 
- - GPU: NVIDIA GPU Operator - - NPU: Vendor‑specific enablement (future) - - CPU: No accelerator setup required -- **Deploy the inference Workload:** - - Deploy the model server or inference pipeline (LLM server, Triton, or other engine) - - Configure runtime parameters appropriate to the selected hardware -- **Validate inference:** - - Send a test request (prompt, image, or payload) - - Confirm functional and expected inference output -- **Cleanup resources:** - - Remove deployed workloads - - Release cluster resources (compute, storage, accelerator allocations) +- [Kubernetes fundamentals](https://learn.microsoft.com/training/modules/intro-to-kubernetes/) +- [AKS enabled by Azure Arc overview](https://learn.microsoft.com/azure/azure-arc/kubernetes/overview) +- [AKS enabled by Azure Arc documentation](https://learn.microsoft.com/azure/aks/aksarc/) +- [GPU Operator internals](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) ## Series outline -In this series, we walk you through a range of AI inference patterns on AKS enabled by Azure Arc clusters, spanning both generative and predictive AI workloads. The series is designed to evolve over time, and we'll continue adding topics as we validate new scenarios, runtimes, and architectures. +The series is designed to evolve. New topics will be added as additional scenarios, runtimes, and architectures are explored. 
### Topics covered in this series | Topic | Type | Status | | ----- | ---- | ------ | -| [**Ollama** — open-source LLM server](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available | -| [**vLLM** — high-throughput LLM engine](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available | -| [**Triton + ONNX** — ResNet‑50 image classification](/2026/04/07/ai-inference-on-aks-arc-part-4) | Predictive | ✅ Available | -| **Triton + TensorRT‑LLM** — optimized large-model inference | Generative | 🔜 Coming soon | -| **Triton + vLLM backend** — vision-language model serving | Generative | 🔜 Coming soon | - -This series will continue to grow as we introduce new inference engines, hardware configurations, and real‑world deployment patterns across edge, on‑premises, and hybrid environments. +| [**Ollama**: open-source LLM server](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available | +| [**vLLM**: high-throughput LLM engine](/2026/04/07/ai-inference-on-aks-arc-part-3) | Generative | ✅ Available | +| [**Triton + ONNX**: ResNet‑50 image classification](/2026/04/07/ai-inference-on-aks-arc-part-4) | Predictive | ✅ Available | +| [**Triton + TensorRT‑LLM**: optimized large-model inference](/2026/04/09/ai-inference-on-aks-arc-part-5) | Generative | ✅ Available | +| **Triton + vLLM backend**: vision-language model serving | Generative | 🔜 Coming soon | ## Prerequisites -All scenarios in this series run on a common AKS enabled by Azure Arc clusters environment. Before you begin, make sure you have the following in place: +All scenarios in this series run on an AKS enabled by Azure Arc cluster deployed on Azure Local. Before you begin, make sure you have the following in place: -- **AKS enabled by Azure Arc clusters with a GPU node:** A Azure Local clusters with at least one GPU node and appropriate NVIDIA drivers installed. 
The GPU node needs the NVIDIA device plugin (via the NVIDIA GPU Operator) running so pods can access nvidia.com/gpu resources. +- **AKS enabled by Azure Arc cluster with a GPU node:** An Azure Local cluster with at least one GPU node and appropriate NVIDIA drivers installed. The GPU node needs the NVIDIA device plugin (via the NVIDIA GPU Operator) running so pods can access nvidia.com/gpu resources. - **Azure CLI with Azure Arc extensions:** The [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) installed on your admin machine and `connectedk8s` extensions (for Azure Arc-enabled Kubernetes). Use `az extension list -o table` to confirm these are installed. diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/hero-image.png b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/hero-image.png index 507ff54a9..80af4bb90 100644 Binary files a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/hero-image.png and b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/hero-image.png differ diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/index.md b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/index.md index 53cda2bdc..c6a8144d4 100644 --- a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/index.md +++ b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-3/index.md @@ -7,7 +7,7 @@ authors: tags: ["aks-arc", "ai", "ai-inference"] --- -In this post, you'll explore how to deploy and run generative AI inference workloads using open-source large language model servers on Azure Kubernetes Service (AKS) enabled by Azure Arc. You'll focus on running these workloads locally, on-premises or at the edge, using GPU acceleration with centralized management. This approach is especially useful when cloud-based AI services are not viable due to data sovereignty, latency, cost, or limited connectivity. 
+In this post, you'll explore how to deploy and run generative AI inference workloads using open-source large language model servers on Azure Kubernetes Service (AKS) enabled by Azure Arc. You'll focus on running these workloads locally, on-premises or at the edge, using GPU acceleration with centralized management. @@ -15,21 +15,21 @@ In this post, you'll explore how to deploy and run generative AI inference workl ## Introduction -Rather than using managed AI services, you'll deploy Ollama and vLLM as standalone Kubernetes workloads directly on your cluster. This keeps things transparent — you'll see exactly how model serving, GPU scheduling, and inference requests work inside your Arc-managed environment. Performance tuning and benchmarks are out of scope here; the focus is a clear, repeatable, and diagnosable foundation for GPU-accelerated inference. These fundamentals set you up for the more advanced architectures covered in later posts. +Rather than using managed AI services, you'll deploy Ollama and vLLM as standalone Kubernetes workloads directly on your cluster. This keeps things transparent. You'll see exactly how model serving, GPU scheduling, and inference requests work inside your Arc-managed environment. Performance tuning and benchmarks are out of scope here; the focus is a clear, repeatable, and diagnosable starting point for GPU-accelerated inference. These fundamentals set you up for the more advanced architectures covered in later posts. -:::note -Before you begin, ensure the prerequisites described in [AI Inference on AKS enabled by Azure Arc: Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2) are fully met. -You should have an AKS enabled by Azure Arc cluster (on Azure Local or similar) with a **GPU node** available and configured for **nvidia.com/gpu**. -The cluster nodes must have **internet access** to download the model. If restricted, you must manually provide the model files via a Persistent Volume. 
**Expect a delay** during the initial deployment while the **pod downloads** and caches the large model files. +:::note[PREREQUISITES AND SCOPE] +Before you begin, ensure the [Part 2 prerequisites](/2026/04/07/ai-inference-on-aks-arc-part-2) are met, including a GPU node configured for **nvidia.com/gpu**. Cluster nodes need **internet access** to download models. **Expect a delay** during initial deployment while the pod downloads and caches model files. + +This tutorial is designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional security, reliability, and performance hardening, and without following standard production practices. ::: ## AI inference with Ollama -Now that your environment is ready, you'll deploy the Ollama model server. You'll use Ollama's official container image to serve a large language model — specifically **Phi-3 Mini** with 4-bit quantization (~2.2 GB), which fits comfortably on a single **16 GB GPU**. Once deployed, you'll have a single endpoint that supports both Ollama's native REST API and an OpenAI-compatible interface. +Now that your environment is ready, you'll deploy the Ollama model server. You'll use Ollama's official container image to serve a large language model, specifically **Phi-3 Mini** with 4-bit quantization (~2.2 GB), which fits comfortably on a single **16 GB GPU**. Once deployed, you'll have a single endpoint that supports both Ollama's native REST API and an OpenAI-compatible interface. ### Deploying the Ollama model server -First, ensure you have connected to your Arc-enabled cluster (see Prerequisites) and that it has a GPU node with the NVIDIA device plugin ready (the GPU Operator should be installed). If your cluster has multiple GPU nodes, apply the accelerator=nvidia-gpu label to a node to ensure the Ollama pod schedules on your target hardware. 
+If your cluster has multiple GPU nodes, apply the `accelerator=nvidia-gpu` label to a node to ensure the Ollama pod schedules on your target hardware. ```powershell # 1. FIND THE GPU NODE NAME @@ -126,7 +126,7 @@ spec: targetPort: 11434 # The port the Ollama application is listening on inside the pod. ``` -This defines a **Deployment** running one instance of the ollama/ollama:0.18.3 container image, exposing the server on port **11434**, and requesting 1 GPU (nvidia.com/gpu: 1) so it runs on your GPU node. A LoadBalancer Service on port 11434 forwards requests to the pod; on Azure Local, if no external load balancer is available, you can use port-forwarding to access the service. Apply the manifest to start the Ollama server: +Apply the manifest to start the Ollama server. On Azure Local, if no external load balancer is available, use port-forwarding to access the service. ```powershell kubectl apply -f ollama-deployment.yaml # apply deployment yaml @@ -192,11 +192,11 @@ kubectl label node $nodeName accelerator- ## AI inference with vLLM -Before starting this step, make sure you have completed the cleanup and any required prerequisites. You will then serve a local large language model using the vLLM inference engine on an AKS enabled by Azure Arc cluster. With its optimized memory management approach, vLLM enables efficient text generation for large models. In this step, you will deploy a sample Mistral 7B model (quantized to about 4 GB) using vLLM’s OpenAI‑compatible API, then submit a prompt to validate the response. +Make sure you have completed the Ollama cleanup above. In this step, you will deploy a **Mistral 7B** model (AWQ quantized, ~4 GB) using vLLM's OpenAI‑compatible API, then submit a prompt to validate the response. ### Deploying the vLLM model server -After connecting to your Arc-enabled cluster (see Prerequisites), confirm the cluster’s GPU node is ready and run the NVIDIA GPU Operator if not already installed (to provide the device plugin). 
+Label the GPU node for scheduling affinity, the same as in the Ollama section above. ```powershell # 1. FIND THE GPU NODE NAME @@ -300,8 +300,6 @@ spec: targetPort: 8000 # The port the vLLM container is actually listening on ``` -This Deployment launches one vllm/vllm-openai:v0.18.0 container that runs vLLM’s OpenAI-compatible API server for the Mistral-7B model (TheBloke/Mistral-7B-v0.1-AWQ from Hugging Face). The container is configured with a 4096 token context, uses 80% of GPU memory (--gpu-memory-utilization 0.80), and employs AWQ 4-bit quantized weights (to fit in a ~16 GB GPU). It requests 1 GPU, and mounts a 2 GiB emptyDir at /dev/shm for fast memory access. A Service vllm-service is used to forward port 80 to the container’s port 8000 (the API) as a LoadBalancer. - Apply the manifest to start the vLLM server: ```powershell @@ -309,7 +307,7 @@ kubectl apply -f vllm-deployment.yaml # apply deployment y kubectl get pods -l app=vllm-mistral -n vllm-inference -w # wait for vllm-mistral pod to run ``` -Kubernetes will pull the container image and start the server. Wait for the vllm-mistral pod to reach Running. Once running, if no external IP address is assigned to vllm-service, open a terminal and port-forward it (e.g. `kubectl port-forward svc/vllm-service -n vllm-inference 8080:80`) to access the API at `http://localhost:8080`. +Wait for the vllm-mistral pod to reach Running. If no external IP is assigned, port-forward with `kubectl port-forward svc/vllm-service -n vllm-inference 8080:80` to access the API at `http://localhost:8080`. ### Testing the LLM endpoint @@ -339,6 +337,6 @@ $nodeName = (kubectl get nodes -l accelerator=nvidia-gpu -o jsonpath='{.items[0] kubectl label node $nodeName accelerator- ``` -This removes the vllm-mistral Deployment (stopping the pod) and the Service. If no more GPU inference is needed, you may also remove the GPU Operator (`helm uninstall `) to reclaim cluster resources. 
+If no more GPU inference is needed, you can also remove the GPU Operator (`helm uninstall `) to reclaim cluster resources. ### Next up: [Predictive AI using Triton and ResNet-50](/2026/04/07/ai-inference-on-aks-arc-part-4) diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/hero-image.png b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/hero-image.png index 027375214..315f0b716 100644 Binary files a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/hero-image.png and b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/hero-image.png differ diff --git a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/index.md b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/index.md index 84961b254..724c64a8f 100644 --- a/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/index.md +++ b/website/blog/2026-04-07-ai-inference-on-aks-arc-part-4/index.md @@ -15,41 +15,47 @@ In this post, you'll deploy NVIDIA Triton Inference Server on your Azure Kuberne ## Introduction -The previous post covered deploying LLM servers for text generation. This post shifts to predictive AI, specifically computer vision. You'll use NVIDIA Triton Inference Server with the ONNX runtime backend to serve ResNet-50, a classic "hello world" for image classification. Triton is an enterprise-grade, multi-framework model server, and deploying it here gives you a reusable pattern for serving any ONNX model on your AKS enabled by Azure Arc managed cluster. +The previous post covered deploying LLM servers for text generation. This post shifts to predictive AI, specifically computer vision. You'll use NVIDIA Triton Inference Server with the ONNX runtime backend to serve ResNet-50, a classic "hello world" for image classification. Triton is an enterprise-grade, multi-framework model server, and deploying it here gives you a starting pattern for serving any ONNX model on your AKS enabled by Azure Arc managed cluster. 
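+To ground what "serving any ONNX model" involves, it helps to know the layout Triton expects for its model repository: one directory per model, numbered version subdirectories, and the model file itself. The sketch below is illustrative and matches the `resnet50/1` directory created later in this tutorial; the `config.pbtxt` is optional for ONNX models when Triton's configuration auto-complete is enabled.
+
+```text
+models/repository/          # repository root passed to tritonserver via --model-repository
+└── resnet50/               # one directory per model
+    ├── config.pbtxt        # optional for ONNX: Triton can auto-complete the config
+    └── 1/                  # numeric version directory
+        └── model.onnx      # default filename the ONNX Runtime backend looks for
+```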
-:::note -Before you begin, ensure the prerequisites described in [AI Inference on AKS Arc: Series Introduction and Scope](/2026/04/07/ai-inference-on-aks-arc-part-2) are fully met. -You should have an AKS enabled by Azure Arc cluster (on Azure Local or similar) with a **GPU node** available and configured for **nvidia.com/gpu**. -The cluster nodes must have **internet access** to download the model. If restricted, you must manually provide the model files via a Persistent Volume. **Expect a delay** during the initial deployment while the **pod downloads** and caches the large model files. +:::note[PREREQUISITES AND SCOPE] +Before you begin, ensure the [Part 2 prerequisites](/2026/04/07/ai-inference-on-aks-arc-part-2) are met, including a GPU node configured for **nvidia.com/gpu**. Cluster nodes need **internet access** to download models. **Expect a delay** during initial deployment while the pod downloads and caches model files. + +This tutorial is designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional security, reliability, and performance hardening, and without following standard production practices. ::: ## AI Inference with Triton (ONNX) -With the environment in place, you'll deploy NVIDIA Triton Inference Server using the ONNX Runtime backend to serve a ResNet‑50 model for image classification. You'll then send a sample image to validate the deployment and confirm that the inference pipeline is working as expected. - ### Preparing storage for the model repository In addition to the prerequisites, you'll need persistent **storage** for model files. First, create a **triton-inference** namespace and a **PersistentVolumeClaim** (PVC) for the model repository. -Save the following YAML as triton-pvc.yaml: +Save the following YAML as `triton-pvc.yaml`: ```yaml +# 1. THE NAMESPACE +# Creates an isolated logical boundary for your Triton resources. 
+# All subsequent resources must reference this namespace to communicate. apiVersion: v1 kind: Namespace metadata: name: triton-inference --- +# 2. THE STORAGE (PVC) +# Requests a persistent disk from the cluster to store your model weights. +# This ensures that if the Pod restarts, your downloaded models remain intact. apiVersion: v1 kind: PersistentVolumeClaim metadata: name: triton-model-repository-pvc - namespace: triton-inference + namespace: triton-inference # Ensures the storage is available within your namespace spec: + # ReadWriteOnce (RWO) allows the volume to be mounted as read-write by a single node. accessModes: - - ReadWriteOnce - storageClassName: default + - ReadWriteOnce + # Omit storageClassName to use the cluster's default StorageClass. resources: requests: + # Allocates 10GB of space. Ensure your underlying disk provider supports this size. storage: 10Gi ``` @@ -62,28 +68,48 @@ kubectl get pvc -n triton-inference ### Deploy a helper pod to download the model -Next, you'll spin up a temporary pod to download the model into storage. Save the following YAML as model-download-pod.yaml: +:::tip +This tutorial uses a helper pod for interactive model downloading, which makes it easy to troubleshoot and retry. For automated workflows, consider using a Kubernetes [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/) instead, which handles retries and completion tracking natively. +::: + +Next, you'll spin up a temporary pod to download the model into storage. 
Save the following YAML as `model-download-pod.yaml`: ```yaml apiVersion: v1 kind: Pod metadata: + # Name of the helper pod used to stage or download models name: model-downloader + + # Namespace where Triton and related components are deployed namespace: triton-inference + labels: + # Label to identify this pod for management or cleanup app: model-downloader + spec: volumes: - - name: model-storage - persistentVolumeClaim: - claimName: triton-model-repository-pvc - containers: - - name: helper - image: python:3.11-slim - command: ["/bin/sh","-c","tail -f /dev/null"] - volumeMounts: - name: model-storage - mountPath: /models + # Persistent volume claim backing the Triton model repository + # Models downloaded by this pod will be visible to Triton + persistentVolumeClaim: + claimName: triton-model-repository-pvc + + containers: + - name: helper + # Lightweight Python image used as a utility container + image: python:3.11-slim + + # Keep the container running so you can exec into it + # and download or manage model files interactively + command: ["/bin/sh", "-c", "tail -f /dev/null"] + + volumeMounts: + - name: model-storage + # Mount path where models will be placed + # This should match Triton's model repository path + mountPath: /models ``` Apply the manifest and wait for the model-downloader pod to reach Running (you can stop watching with Ctrl+C once it’s running). @@ -110,11 +136,6 @@ print('Download Complete') " ``` -:::note -Ensure the cluster nodes have internet connectivity to download the model file. If internet access from the cluster is restricted, you may have to manually provide the model file to the volume via an alternate method. -::: -This one-line command creates the required directory (/models/repository/resnet50/1) and downloads the **ResNet-50 ONNX** model file into it, outputting “Download Complete” when finished. 
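For a stronger integrity check than the file-size listing, you can hash the downloaded model and compare it to a published checksum. The following is a minimal stdlib-only sketch; the expected digest is a placeholder, and the model path assumes the ONNX backend's default `model.onnx` file name inside the repository directory created above:

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large models never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Placeholder: substitute the checksum published for the exact model file you downloaded.
    expected = "<published-sha256-digest>"
    actual = sha256_of("/models/repository/resnet50/1/model.onnx")
    print("OK" if actual == expected else f"MISMATCH: {actual}")
```

You can run this inside the helper pod, for example with `kubectl exec -i -n triton-inference model-downloader -- python3 - < check-model.py`.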
- #### Confirm the model file (~98 MB) exists ```powershell @@ -123,58 +144,81 @@ kubectl exec -n triton-inference model-downloader -- ls -lh /models/repository/r ### Deploying Triton inference server (ONNX) -With the model in place, you'll deploy Triton. Save this as triton-deployment.yaml: +With the model in place, you'll deploy Triton. Save this as `triton-deployment.yaml`: ```yaml apiVersion: apps/v1 kind: Deployment metadata: + # Name of the Triton Inference Server deployment name: triton-server + + # Namespace where Triton and related resources are deployed namespace: triton-inference + spec: + # Number of Triton pods to run replicas: 1 + selector: matchLabels: + # Label selector used by the Deployment to manage pods app: triton-server + template: metadata: labels: + # Labels applied to the Triton pod app: triton-server + spec: containers: - name: triton + # Official NVIDIA Triton Inference Server image image: nvcr.io/nvidia/tritonserver:24.08-py3 + + # Start Triton with the model repository mounted at /models/repository + # --strict-model-config=false allows Triton to infer model configs args: ["tritonserver", "--model-repository=/models/repository", "--strict-model-config=false"] + ports: + # HTTP endpoint for inference and health checks - containerPort: 8000 name: http + + # gRPC endpoint for inference - containerPort: 8001 name: grpc - # READINESS PROBE: Tells Kubernetes when this pod is ready to accept traffic. - # Triton exposes /v2/health/ready, which returns HTTP 200 only after the - # server is fully initialized and all models in the repository are loaded. - # The Service will not route inference requests to this pod until this probe succeeds. + + # READINESS PROBE: + # Checks whether Triton is ready to accept inference requests. + # Triton exposes /v2/health/ready and returns HTTP 200 only after + # the server is initialized and all models are loaded. 
readinessProbe: httpGet: path: /v2/health/ready port: 8000 - initialDelaySeconds: 60 # Wait 60s after container start before first check. - periodSeconds: 10 # Check every 10 seconds after that. - failureThreshold: 6 # Mark unready after 6 consecutive failures. + initialDelaySeconds: 60 # Wait 60 seconds before the first probe + periodSeconds: 10 # Probe every 10 seconds + failureThreshold: 6 # Mark pod unready after 6 failures + resources: limits: + # Request exclusive access to one NVIDIA GPU nvidia.com/gpu: 1 + volumeMounts: - name: model-vol + # Mount the persistent volume containing the Triton model repository mountPath: /models + volumes: - name: model-vol + # Persistent volume claim backing the Triton model repository persistentVolumeClaim: claimName: triton-model-repository-pvc ``` -This **Deployment** runs a Triton Inference Server container (image nvcr.io/nvidia/tritonserver:24.08-py3) pointing to the model repository at /models/repository (our PVC). This exposes Triton's **HTTP API** (port 8000) and **gRPC API** (port 8001) and requests **1 GPU** (nvidia.com/gpu: 1) to run on a GPU node. The volumeMount attaches the triton-model-repository-pvc at **/models** inside the container so Triton can access the model file. - Apply the Triton deployment manifest and verify Triton server is running ```powershell @@ -200,7 +244,7 @@ Now you'll test it end-to-end. You'll send an image to Triton and confirm you ge pip install numpy pillow requests ``` -4. Run an inference request using a script: Save the following script as ImageClassificationSample.ps1 and run it: +4. Run an inference request using a script. Save the following script as `ImageClassificationSample.ps1`, then execute it to submit a request to Triton: :::note ResNet‑50 predicts one of 1,000 ImageNet classes. This sample applies a demo‑only mapping that groups some animal‑related labels for simplicity. You can extend the sample to handle additional classes or custom mappings. 
This is provided for demonstration purposes only. @@ -295,10 +339,9 @@ Now you'll test it end-to-end. You'll send an image to Triton and confirm you ge } ``` -The script prompts you for an image path, sends it to Triton's endpoint, and prints the prediction — animal type, breed, and confidence score. Here's an example: +Example output: ```output -Example Output: PS D:\Dynamo-Triton> powershell.exe -ExecutionPolicy Bypass -File .\ImageClassificationSample.ps1 Enter the full path to your local image (e.g., C:\pics\animal.jpg): "D:\Dynamo-Triton\susanneedele-quarter-horse.jpg" Processing Inference... @@ -309,7 +352,7 @@ SPECIFIC BREED : sorrel SCORE : 12.5520 ``` -The model correctly identified a horse (sorrel breed) — confirming Triton is serving ResNet-50 predictions. The script maps ImageNet labels to broad animal categories for readability; your actual output will include the full class name. +The script maps ImageNet labels to broad animal categories for readability; your actual output will include the full class name. ### Cleanup @@ -319,7 +362,7 @@ To free resources, delete the triton-inference namespace and all its contents: kubectl delete namespace triton-inference ``` -This removes the Triton server Deployment, model-downloader Pod, and PVC. If you installed the GPU Operator specifically for this test, you can also uninstall it via Helm to release cluster resources. +If you installed the GPU Operator specifically for this test, you can also uninstall it via Helm to release cluster resources. 
### Troubleshooting diff --git a/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/hero-image.png b/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/hero-image.png new file mode 100644 index 000000000..f333f8dfb Binary files /dev/null and b/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/hero-image.png differ diff --git a/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/index.md b/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/index.md new file mode 100644 index 000000000..d296ed834 --- /dev/null +++ b/website/blog/2026-04-09-ai-inference-on-aks-arc-part-5/index.md @@ -0,0 +1,556 @@ +--- +title: "AI Inference on AKS enabled by Azure Arc: Generative AI using Triton and TensorRT‑LLM" +date: 2026-04-09 +description: "Run local edge generative AI by deploying Triton with the TensorRT-LLM backend on AKS enabled by Azure Arc to serve Qwen2.5 7B Instruct with GPU acceleration." +authors: +- datta-rajpure +tags: ["aks-arc", "ai", "ai-inference"] +--- + + + +In this post, you’ll deploy NVIDIA Triton Inference Server on your Azure Kubernetes Service (AKS) enabled by Azure Arc cluster to serve a Qwen‑based generative model using the TensorRT‑LLM backend. By the end, you’ll have a working generative AI inference pipeline running locally on your on‑premises GPU hardware. + + + +![Deploying generative AI inference with open-source inference servers on AKS enabled by Azure Arc](./hero-image.png) + +## Introduction + +Earlier parts of this series used runtimes such as Ollama and vLLM to quickly enable local LLM inference and iterate on model behavior. In this post, you move to a more infrastructure‑centric deployment model using **NVIDIA Triton Inference Server with the TensorRT‑LLM backend** and an instruction‑tuned model. 
This approach prioritizes **predictable performance, efficient GPU utilization, and long‑running inference services**, making it well suited for edge and on‑prem environments where capacity is fixed and inference must operate reliably as part of the platform. + +:::note[PREREQUISITES AND SCOPE] +Before you begin, ensure the [Part 2 prerequisites](/2026/04/07/ai-inference-on-aks-arc-part-2) are met, including a GPU node configured for **nvidia.com/gpu**. Cluster nodes need **internet access** to download models. **Expect a delay** during initial deployment while the pod downloads and caches model files. + +This tutorial is designed for experimentation and learning. The configurations shown are not production-ready and should not be deployed to production environments without additional security, reliability, and performance hardening, and without following standard operational practices. +::: + +## TensorRT‑LLM deployment pipeline on Triton + +Unlike Ollama, vLLM, or ONNX-based serving, where the inference engine initializes at runtime, TensorRT-LLM follows an explicit build-then-serve pipeline. The model is compiled into a GPU-specific engine ahead of time, and Triton uses this engine for runtime inference. Once a TensorRT‑LLM engine is built, it can be deployed to any node as long as the hardware and software environment match the conditions it was built for. At a high level, the pipeline consists of three phases: + +| Phase | Purpose | Key activities | +| ------ | -------- | ---------------- | +| **Preparation (Provisioner)** | Transform a standard model into a high‑performance, GPU‑specific executable | Model weights are converted from Hugging Face format into a TensorRT‑LLM checkpoint, optional quantization is applied to reduce memory footprint, and a GPU‑specific TensorRT engine is compiled with CUDA kernel fusion for maximum throughput.
| +| **Configuration (Template)** | Align the model’s physical characteristics with Triton’s execution requirements | Template‑based Triton configurations are populated with concrete values such as engine paths, batching limits, and instance counts, ensuring preprocessing, engine, and postprocessing stages agree on shapes and runtime behavior. | +| **Inference (Runtime)** | Orchestrate live request execution and model serving | Triton Inference Server loads the prebuilt TensorRT‑LLM engine, orchestrates the ensemble pipeline end‑to‑end, manages concurrency and in‑flight batching, and exposes HTTP and gRPC endpoints for inference. | + +### For reference + +- [TensorRT‑LLM overview](https://docs.nvidia.com/tensorrt-llm) +- [TensorRT‑LLM engine build workflow](https://github.com/NVIDIA/TensorRT-LLM) +- [Triton model repository and ensemble models](https://docs.nvidia.com/deeplearning/triton-inference-server) +- [TensorRT‑LLM Triton backend](https://github.com/triton-inference-server/tensorrtllm_backend) +- [Triton Inference Server architecture](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs) +- [Performance and batching concepts](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html) + +### Why this pipeline matters + +This compile-then-serve model differentiates TensorRT-LLM from earlier approaches in this series. By shifting optimization and hardware specialization to build time, TensorRT-LLM delivers deterministic performance and efficient GPU utilization, which are key requirements for edge and on-prem AI deployments. + +### TensorRT-LLM build and deployment directory structure + +The following directory structure represents how artifacts flow through the build and deployment phases when using TensorRT-LLM with Triton. This structure helps visualize how model assets, hardware-specific artifacts, and runtime configuration are separated. 
+ +```text +/models +├── qwen-raw +│ ├── model.safetensors +│ ├── tokenizer.json +│ ├── tokenizer_config.json +│ └── config.json +│ +├── model_repository +│ ├── tllm_checkpoint_qwen_7b_int41 +│ │ ├── config.json +│ │ ├── *.ckpt +│ │ └── rank0.safetensors +│ └── tllm_engine_qwen_7b_int41 +│ ├── config.json +│ └── rank0.engine +│ +├── engines +│ └── qwen_7b_int4 +│ ├── config.json +│ └── rank0.engine +│ +├── tokenizer +│ └── qwen_7b_int4 +│ ├── tokenizer.json +│ └── tokenizer_config.json +│ +└── triton_model_repo + ├── ensemble + │ ├── 1 + │ └── config.pbtxt + ├── preprocessing + │ ├── 1 + │ └── config.pbtxt + ├── tensorrt_llm + │ ├── 1 + │ └── config.pbtxt + ├── postprocessing + │ ├── 1 + │ └── config.pbtxt + └── tensorrt_llm_bls + ├── 1 + └── config.pbtxt +``` + +## Phase 1: Preparation + +### Preparing storage for the model repository + +In addition to the prerequisites, you'll need persistent **storage** for model files. First, create a **triton-inference** namespace and a **PersistentVolumeClaim** (PVC) for the model repository. + +Save the following YAML as `triton-pvc.yaml`: + +```yaml +# THE NAMESPACE +# Creates an isolated logical boundary for your Triton resources. +# All subsequent resources must reference this namespace to communicate. +apiVersion: v1 +kind: Namespace +metadata: + name: triton-inference +--- +# THE STORAGE (PVC) +# Requests a persistent disk from the cluster to store your model weights. +# This ensures that if the Pod restarts, your downloaded models remain intact. +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: triton-model-repository-pvc + namespace: triton-inference # Ensures the storage is available within your namespace +spec: + # ReadWriteOnce (RWO) allows the volume to be mounted as read-write by a single node. + accessModes: + - ReadWriteOnce + + # Omit storageClassName to use the cluster's default StorageClass. + resources: + requests: + # Allocates 100GB of space. 
Ensure your underlying disk provider + # supports this size. + storage: 100Gi +``` + +Apply the manifest and verify the PVC status is **Bound** (storage provisioned): + +```powershell +kubectl apply -f triton-pvc.yaml +kubectl get pvc -n triton-inference +``` + +### Deploy a helper pod to download the model + +:::tip +This tutorial uses a helper pod for downloading models, installing tools, and converting models to engines, making it easier to troubleshoot and iterate. For automated workflows, consider using a Kubernetes [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/) instead, which handles retries and completion tracking natively. +::: + +Next, you'll deploy a temporary helper pod that acts as a provisioner. You'll use this pod to download the model into persistent storage, convert the raw weights into TensorRT-LLM checkpoints, and build a GPU-specific TensorRT engine. This pod **requires a GPU** during the preparation stage. Save the following YAML as `triton-provisioner.yaml`. + +```yaml +# THE PROVISIONER POD +# This pod serves as your "workspace" to prepare the environment. +# Its purpose is to download raw models, convert them into checkpoints, +# and compile the optimized TensorRT engines for deployment. +apiVersion: v1 +kind: Pod +metadata: + name: triton-provisioner + namespace: triton-inference +spec: + containers: + - name: provisioner + # TRT-LLM Image: Contains the necessary NVIDIA libraries (CUDA, TensorRT-LLM) + # and the Triton server binary required for engine building and testing. + image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 + + # Keep the pod alive indefinitely so you can 'kubectl exec' into it + # and run your download/build commands manually or via scripts. + command: ["/bin/sh", "-c", "sleep infinity"] + + resources: + limits: + # A GPU is required for the 'trtllm-build' step to compile + # the engine for the specific GPU architecture in your cluster. 
+ nvidia.com/gpu: 1 + + volumeMounts: + # Mount the PVC to /models inside the container. + # All downloads and engine files saved here will persist across pod restarts. + - name: model-storage + mountPath: /models + volumes: + - name: model-storage + # Connects the internal /models path to the 100Gi PVC defined earlier. + persistentVolumeClaim: + claimName: triton-model-repository-pvc +``` + +Apply the manifest and wait for the triton-provisioner pod to reach Running (you can stop watching with Ctrl+C once it’s running). + +:::note[EXPECT A WAIT] +The provisioning pod may take ten minutes or longer to reach the Running state because it pulls a large Triton image that includes CUDA and TensorRT-LLM libraries. +::: + +```powershell +kubectl apply -f triton-provisioner.yaml +kubectl get pods -n triton-inference -w +``` + +### TensorRT-LLM Model Preparation and Triton Repository Assembly + +Save the following script as a `prepare-trtllm.sh` file. This script prepares a Hugging Face LLM for inference with Triton by downloading the model, converting and quantizing it into a GPU‑specific TensorRT‑LLM engine, assembling the Triton model repository, and generating all required configuration files for end‑to‑end inference. + +```bash +#!/usr/bin/env bash +set -euo pipefail + +# Print a helpful message if any command fails +trap 'echo "FAILED at line $LINENO: $BASH_COMMAND" >&2' ERR + +echo "Starting TensorRT-LLM preparation pipeline..." 
+ +# Paths +MODELS_ROOT="/models" +RAW_MODEL_DIR="${MODELS_ROOT}/qwen-raw" +CHECKPOINT_DIR="${MODELS_ROOT}/model_repository/tllm_checkpoint_qwen_7b_int41" +ENGINE_BUILD_DIR="${MODELS_ROOT}/model_repository/tllm_engine_qwen_7b_int41" +ENGINES_DIR="${MODELS_ROOT}/engines/qwen_7b_int4" +TOKENIZER_DIR="${MODELS_ROOT}/tokenizer/qwen_7b_int4" +TRITON_REPO_DIR="${MODELS_ROOT}/triton_model_repo" +FILL_TEMPLATE_SCRIPT="/app/tools/fill_template.py" +MODEL_ID="Qwen/Qwen2.5-7B-Instruct" + +# Small helper to print what is running +run() { + echo "" + echo "RUN: $*" + "$@" +} + +# Validate environment +if [[ ! -d "${MODELS_ROOT}" ]]; then + echo "FAILED: ${MODELS_ROOT} does not exist. Ensure the PVC is mounted at /models and run inside the provisioner pod." >&2 + exit 1 +fi + +if ! command -v python3 >/dev/null 2>&1; then + echo "FAILED: python3 not found. Run inside the TRT LLM provisioner image." >&2 + exit 1 +fi + +if ! command -v trtllm-build >/dev/null 2>&1; then + echo "FAILED: trtllm-build not found. Run inside the TRT LLM provisioner image." >&2 + exit 1 +fi + +# Enter persistent directory +run cd "${MODELS_ROOT}" + +# Create required directories (idempotent) +run mkdir -p "${RAW_MODEL_DIR}" +run mkdir -p "${CHECKPOINT_DIR}" +run mkdir -p "${ENGINE_BUILD_DIR}" + +echo "" +echo "Downloading model ${MODEL_ID} into ${RAW_MODEL_DIR} ..." 
+ +# Download raw model +# Prefer huggingface-cli when available, otherwise fall back to python module invocation +if command -v huggingface-cli >/dev/null 2>&1; then + run huggingface-cli download "${MODEL_ID}" --local-dir "${RAW_MODEL_DIR}" --exclude "*.bin" "*.pth" +else + run python3 -m huggingface_hub.cli download "${MODEL_ID}" --local-dir "${RAW_MODEL_DIR}" --exclude "*.bin" "*.pth" +fi + +# Convert to quantized checkpoint +run python3 /app/examples/qwen/convert_checkpoint.py \ + --model_dir "${RAW_MODEL_DIR}" \ + --output_dir "${CHECKPOINT_DIR}" \ + --dtype float16 \ + --use_weight_only \ + --weight_only_precision int4 + +# Build the compressed engine +run trtllm-build \ + --checkpoint_dir "${CHECKPOINT_DIR}" \ + --output_dir "${ENGINE_BUILD_DIR}" \ + --gemm_plugin float16 \ + --max_batch_size 4 \ + --max_input_len 2048 \ + --max_seq_len 3072 + +echo "" +echo "Finalizing Triton model repository..." + +# Create folder layout + copy engine files +run mkdir -p "${ENGINES_DIR}" +run cp -f "${ENGINE_BUILD_DIR}/"* "${ENGINES_DIR}/" + +# Copy Triton model repository templates (overwrite allowed) +run mkdir -p "${TRITON_REPO_DIR}" +run cp -rf /app/all_models/inflight_batcher_llm/* "${TRITON_REPO_DIR}/" + +# Verify files copied +run ls -la "${TRITON_REPO_DIR}" + +# Copy tokenizer files +run mkdir -p "${TOKENIZER_DIR}" +run cp -f "${RAW_MODEL_DIR}/"tokenizer* "${TOKENIZER_DIR}/" + +# Fill in model configs with fill_template.py +if [[ ! 
-f "${FILL_TEMPLATE_SCRIPT}" ]]; then + echo "FAILED: fill_template.py not found at ${FILL_TEMPLATE_SCRIPT}" >&2 + exit 1 +fi + +export ENGINE_DIR="${ENGINES_DIR}" +export TOKENIZER_DIR="${TOKENIZER_DIR}" +export MODEL_FOLDER="${TRITON_REPO_DIR}" +export TRITON_MAX_BATCH_SIZE="1" +export INSTANCE_COUNT="1" +export MAX_QUEUE_DELAY_MS="0" +export MAX_QUEUE_SIZE="0" +export FILL_TEMPLATE_SCRIPT="${FILL_TEMPLATE_SCRIPT}" +export DECOUPLED_MODE="false" +export LOGITS_DATATYPE="TYPE_FP32" + +# Basic validation that expected config files exist before patching +for f in \ + "${MODEL_FOLDER}/ensemble/config.pbtxt" \ + "${MODEL_FOLDER}/preprocessing/config.pbtxt" \ + "${MODEL_FOLDER}/tensorrt_llm/config.pbtxt" \ + "${MODEL_FOLDER}/postprocessing/config.pbtxt" \ + "${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt" +do + if [[ ! -f "$f" ]]; then + echo "FAILED: expected config file missing: $f" >&2 + exit 1 + fi +done + +# Update ensemble config +run python3 "${FILL_TEMPLATE_SCRIPT}" -i "${MODEL_FOLDER}/ensemble/config.pbtxt" \ + "triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}" + +# Update preprocessing config +run python3 "${FILL_TEMPLATE_SCRIPT}" -i "${MODEL_FOLDER}/preprocessing/config.pbtxt" \ + "tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}" + +# Update tensorrt_llm config +run python3 "${FILL_TEMPLATE_SCRIPT}" -i "${MODEL_FOLDER}/tensorrt_llm/config.pbtxt" \ + "triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}" + +# Update postprocessing config +run python3 "${FILL_TEMPLATE_SCRIPT}" -i "${MODEL_FOLDER}/postprocessing/config.pbtxt" \ + 
"tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}" + +# Update BLS config +run python3 "${FILL_TEMPLATE_SCRIPT}" -i "${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt" \ + "triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}" + +echo "" +echo "Triton model repository templates have been successfully updated." +echo "TensorRT-LLM preparation completed successfully." +echo "Stop the provisioner pod after this to release the GPU for the Triton server." +``` + +Run the `prepare-trtllm.sh` script inside provisioner POD using the following sequence of commands: + +:::note[LONG BUILD TIME] +The engine build step may take 30 minutes or more depending on your GPU. Do not interrupt the process. +::: + +```powershell +# Copy the prepare-trtllm.sh script to the provisioner pod +kubectl cp ./prepare-trtllm.sh triton-inference/triton-provisioner:/models/prepare-trtllm.sh + +# Start a bash session inside the pod +kubectl exec -it triton-provisioner -n triton-inference -- /bin/bash +``` + +Run the following commands inside bash session: + +```bash +# Fix potential Windows line-ending issues (safe no-op on Linux files) +sed -i 's/\r$//' /models/prepare-trtllm.sh + +# Make the script executable +chmod +x /models/prepare-trtllm.sh + +# Run the script +/models/prepare-trtllm.sh +``` + +### Deploying Triton inference server + +With the engine in place, you'll deploy Triton. Save this as `triton-deployment.yaml`: + +```yaml +# THE SERVICE (The "Phone Number" for your Model) +# This exposes the Triton Inference Server to users outside or inside the cluster. +--- +apiVersion: v1 +kind: Namespace +metadata: + name: triton-inference +--- +apiVersion: v1 +kind: Service +metadata: + name: triton-server + namespace: triton-inference +spec: + # LoadBalancer provides a public IP (on cloud providers like AKS/EKS/GKE). 
+ # Change to 'ClusterIP' if you only want internal access. + type: LoadBalancer + ports: + # HTTP Endpoint: standard REST requests (works with most web apps) + - name: http + port: 8000 + targetPort: 8000 + # gRPC Endpoint: High-performance, low-latency streaming (best for LLMs) + - name: grpc + port: 8001 + targetPort: 8001 + # Metrics Endpoint: For Prometheus/Grafana monitoring (GPU usage, throughput) + - name: metrics + port: 8002 + targetPort: 8002 + selector: + # Connects this Service to any Pod labeled 'app: triton-server' + app: triton-server +--- +# THE DEPLOYMENT (The Inference Server) +# This is the actual "Running Process" that serves the model to users. +apiVersion: apps/v1 +kind: Deployment +metadata: + name: triton-server + namespace: triton-inference +spec: + # Set replicas to 1 or 0 to start and stop the Triton server during troubleshooting. + replicas: 1 + selector: + matchLabels: + app: triton-server + template: + metadata: + labels: + app: triton-server + spec: + containers: + - name: triton-server + # Must match the version used in your Provisioner to ensure Engine compatibility. + image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 + args: + - "tritonserver" + # Point to the specific folder where your 'ensemble', 'tensorrt_llm', etc., are stored. + - "--model-repository=/models/triton_model_repo" + # Prevents Triton from guessing configurations; ensures it uses your exact .pbtxt files. + - "--disable-auto-complete-config" + # Increases timeout for the Python backend (needed for heavy LLM tokenizers/preprocessing). + - "--backend-config=python,stub-timeout-seconds=120" + # Verbose logging (Level 3) is great for debugging but very "noisy." + # In production, change --log-verbose to 0 or 1.
+ - "--log-info=true" + - "--log-warning=true" + - "--log-error=true" + - "--log-verbose=3" + ports: + - containerPort: 8000 + - containerPort: 8001 + - containerPort: 8002 + + resources: + limits: + # Triton requires at least one GPU to load the TensorRT-LLM backend. + nvidia.com/gpu: 1 + volumeMounts: + # Mount the SAME PVC that the Provisioner used to access the built engine files. + - name: model-storage + mountPath: /models + volumes: + - name: model-storage + persistentVolumeClaim: + claimName: triton-model-repository-pvc +``` + +Apply the Triton deployment manifest and verify Triton server is running: + +:::warning[GPU RESOURCE CONFLICT] +The provisioner pod holds the GPU. You must delete it before deploying the Triton server, otherwise the Triton pod will stay in Pending state due to insufficient GPU resources. +::: + +```powershell +# Stop Provisioner POD +kubectl delete pod triton-provisioner -n triton-inference +# wait until provisioner pod is stopped. +kubectl get pods -n triton-inference + +# Deploy Triton Server +kubectl apply -f triton-deployment.yaml +# verify triton server is running. +kubectl get pods -n triton-inference -w + +# Optionally check the Triton Server logs to ensure it starts correctly +kubectl logs -f deployment/triton-server -n triton-inference +# Replace "triton-server-5cb57f9bf-fjk6x" with the pod name from "kubectl get pods -n triton-inference" +kubectl logs triton-server-5cb57f9bf-fjk6x -n triton-inference --previous + +``` + +### Validating the TensorRT-LLM inference + +To validate the setup end-to-end, send a request to Triton and verify that TensorRT-LLM returns a valid response. + +1. Expose the Triton service for testing: you'll use port-forwarding. In a separate terminal window (or a new background process), run: + + ```powershell + kubectl port-forward -n triton-inference deploy/triton-server 8000:8000 + ``` + +2. 
Run an inference request + + ```powershell + # Send a query to Triton from your client PowerShell + $ModelName = "ensemble" + $Uri = "http://localhost:8000/v2/models/$ModelName/generate" + $Payload = @{ + text_input = "What is Azure?" + max_tokens = 64 + bad_words = "" + stop_words = "" + } + $response = Invoke-RestMethod -Uri $Uri -Method Post -Body ($Payload | ConvertTo-Json -Compress) -ContentType "application/json" + $response + ``` + +Example output: + +```output +model_name : ensemble +model_version : 1 +sequence_end : False +sequence_id : 0 +sequence_start : False +text_output : What is Azure? + + Azure is Microsoft's cloud computing platform that enables businesses to build, deploy, and manage + applications and services through a global network of data centers. It offers a wide range of cloud + services including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a + Service (SaaS). Azure +``` + +### Cleanup + +To free resources, delete the triton-inference namespace and all its contents: + +```powershell +kubectl delete namespace triton-inference +``` + +If you installed the GPU Operator specifically for this test, you can also uninstall it via Helm to release cluster resources.
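The exact uninstall commands depend on the release and namespace names chosen when the GPU Operator was installed. The names below follow the common convention and are assumptions, so confirm yours first with `helm list -A`:

```powershell
# Find the GPU Operator release name and its namespace
helm list -A

# Uninstall the release (assumed names; adjust to match your install)
helm uninstall gpu-operator -n gpu-operator

# Optionally remove the now-empty namespace
kubectl delete namespace gpu-operator
```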