Author: Igor Khozhanov
Contact: khozhanov@gmail.com
Copyright: © 2026 Igor Khozhanov. All Rights Reserved.
Processing input with YOLOv8m 1024x1024 INT8 at ~165 FPS on an RTX 3060 Ti and ~38 FPS on a Jetson Orin Nano.
The previous phases, Phase 3 (Integration) and Phase 4 (Functional Inference), are complete. The pipeline now supports full end-to-end detection and tracking using the TensorRT and ONNX backends, with mathematically verified kernels.
Note for Reviewers: This repository is currently under active development. The pipeline is being implemented in stages to ensure memory safety and zero-host-copy verification.
| Module / Stage | Status | Notes |
|---|---|---|
| FFMpeg Source | ✅ Stable | Handles stream connection and packet extraction. |
| Stub Detector | ✅ Stable | Pass-through module, validated for pipeline latency profiling. |
| Output / NVJpeg | ✅ Stable | Saves frames from GPU memory to disk as separate *.jpg images. |
| Inference Pipeline | ✅ Stable | Connects all the stages together. |
| ONNX Detector | ✅ Optimized | FP32 inference optimized (~38 FPS). |
| TensorRT Detector | ✅ Stable | Engine builder & enqueueV3 implemented. |
| Object Tracker | ✅ Stable | Kernels for position prediction, IoU matching, velocity filtering. |
| Post-Processing | ✅ Stable | Custom CUDA kernels for YOLOv8 output decoding & NMS. |
| Jetson Port | ✅ Stable | Native CUDA/TRT pipeline operational. Integrated Jetson Multimedia API (MMAPI) for hardware-accelerated JPEG decoding/encoding. |
| Windows Port | 🚧 WIP | Adapting CMake & CUDA build. |
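The Object Tracker's IoU matching and the Post-Processing NMS kernel both rest on the same box-overlap computation. As a point of reference, here is a minimal CPU sketch of IoU and greedy NMS, illustrative only; the project implements these as CUDA kernels operating on device memory:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Axis-aligned box in pixel coordinates, plus a confidence score.
struct Box { float x1, y1, x2, y2, score; };

// Intersection-over-Union of two boxes; 0 when they do not overlap.
float iou(const Box& a, const Box& b) {
    float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
    float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
    float inter = std::max(0.0f, ix2 - ix1) * std::max(0.0f, iy2 - iy1);
    float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    float uni = areaA + areaB - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: keep the highest-scoring box, drop overlaps above the threshold.
std::vector<Box> nms(std::vector<Box> boxes, float iouThresh) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& cand : boxes) {
        bool suppressed = false;
        for (const Box& k : kept)
            if (iou(cand, k) > iouThresh) { suppressed = true; break; }
        if (!suppressed) kept.push_back(cand);
    }
    return kept;
}
```

The GPU versions parallelize these loops across detections, but the arithmetic being verified is the same.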
This project implements a high-performance video inference pipeline designed to minimize CPU-GPU bandwidth usage. Unlike standard OpenCV implementations, this pipeline keeps data entirely on the VRAM (Zero-Host-Copy) from decoding to inference.
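As an illustration of that contract, the sketch below (hypothetical types, not the project's actual classes) shows stages exchanging a small descriptor of GPU memory rather than pixel data, so the host only shuffles metadata and never touches the frame bytes:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration of the zero-host-copy contract: stages hand each
// other a descriptor of GPU memory, never the pixels themselves. (A real
// implementation would wrap a cudaMalloc'd / NVDEC-owned device pointer.)
struct GpuFrame {
    std::uintptr_t devicePtr;  // opaque device address, never dereferenced on host
    int width, height, pitch;  // layout metadata the next stage needs
};

// The decoder publishes the device address of its output surface.
GpuFrame decode_stage(std::uintptr_t nvdecOutput) {
    return GpuFrame{nvdecOutput, 2560, 1440, 2560 * 4};
}

// The detector reads devicePtr from a CUDA kernel; the host only passes along
// the small descriptor, not the multi-megabyte frame.
GpuFrame infer_stage(const GpuFrame& in) {
    return in;
}
```

A host copy would only appear if a stage called `cudaMemcpyDeviceToHost`; in a zero-host-copy pipeline, none do until the final sink writes to disk.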
This repository contains the Inference Engine (MIT Licensed). It does not include pre-trained model weights.
To reproduce the demo results (Crop & Weed Detection), you must download the pre-trained YOLOv8 model separately.
The model is hosted in the research repository (AGPL-3.0):
- Download: `best_int8.onnx`
- License: AGPL-3.0 (derived from Ultralytics YOLOv8)
- ✅ Linux x64 (Verified on Ubuntu 24.04 / RTX 3060 Ti)
- ✅ Nvidia Jetson Orin Nano (Verified on Ubuntu 22.04 / JetPack 6.0)
- 🚧 Windows 10/11 (Build scripts implemented, pending validation)
Note: Jetson currently requires passing a directory of images (-i ./frames/) instead of an .mp4 file.
- CMake 3.19+
- CUDA Toolkit (12.x)
- TensorRT 10.x+
- FFmpeg: Required.
  - Linux Users: Install via package manager or build from source with `--enable-shared`.
- NVIDIA cuDNN: Required by ONNX Runtime CUDA provider.
  - Note: Ensure `libcudnn.so` is in your `LD_LIBRARY_PATH` or installed system-wide.
```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir build
mv ~/Downloads/best_int8.onnx ./build/
cd build
cmake ..
make -j$(nproc)
```

Run inference on the demo video:

```bash
./ZeroCopyInference -i ../video/Moving.mp4 --backend trt --model best_int8.onnx -b 16 -o Moving
```

Run the test suite:

```bash
./ZeroCopyInferenceTests --model best_int8.onnx
```

Building natively on the Jetson Orin ensures the compiler has direct access to the specialized L4T hardware headers (nvjpeg.h) and TensorRT libraries.
```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir build
mv ~/Downloads/best_int8.onnx ./build/
cd build
cmake ..
make -j$(nproc)
```

Note: Jetson requires a directory of frames (fetched automatically by cmake) as input instead of an .mp4:

```bash
./ZeroCopyInference -i ../frames/ --backend trt --model best_int8.onnx -b 16 -o Moving
```

Run the test suite:

```bash
./ZeroCopyInferenceTests --model best_int8.onnx
```

No C++ compilation required. Requires NVIDIA Container Toolkit.
```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir models
mv ~/Downloads/best_int8.onnx ./models/
bash download_frames_data.sh
docker run --rm --gpus all \
  -v $(pwd)/video:/app/video \
  -v $(pwd)/models:/app/models \
  ghcr.io/igkho/zerohostcopyinference:main \
  -i /app/video/Moving.mp4 \
  --backend trt \
  --model /app/models/best_int8.onnx \
  -b 16 \
  -o /app/video/output
```

Run the test suite in Docker:

```bash
git clone https://github.com/Igkho/ZeroHostCopyInference.git
cd ZeroHostCopyInference
mkdir models
mv ~/Downloads/best_int8.onnx ./models/
docker run --rm --gpus all \
  -v $(pwd)/video:/app/video \
  -v $(pwd)/models:/app/models \
  --entrypoint /app/build/ZeroCopyInferenceTests \
  ghcr.io/igkho/zerohostcopyinference:main \
  --model /app/models/best_int8.onnx
```

Benchmarks performed on an NVIDIA RTX 3060 Ti and an NVIDIA Jetson Orin Nano.
Input: 1440p video stream or a directory of JPEG images.
Model: YOLOv8 Medium (YOLOv8m) @ 1024x1024 Resolution.
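Feeding a 1440p frame into a square 1024x1024 network input implies a resize step. A common policy for YOLO-family models is aspect-preserving letterboxing; the geometry below is an illustrative assumption, not confirmed detail of this pipeline's preprocessing:

```cpp
#include <algorithm>
#include <cassert>

// Letterbox geometry: scale the frame to fit the square network input while
// preserving aspect ratio, then pad the remainder symmetrically.
// NOTE: this resize policy is an assumption made for illustration.
struct Letterbox { int newW, newH, padX, padY; };

Letterbox letterbox(int srcW, int srcH, int dst) {
    float scale = std::min(float(dst) / srcW, float(dst) / srcH);
    int newW = int(srcW * scale + 0.5f);  // scaled content size
    int newH = int(srcH * scale + 0.5f);
    return Letterbox{newW, newH, (dst - newW) / 2, (dst - newH) / 2};
}
```

For a 2560x1440 frame and a 1024x1024 input, this gives 1024x576 of content with 224-pixel bands above and below.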
To measure the raw overhead of the pipeline architecture (I/O latency), a pass-through (Stub) detector should be used.
| Metric | Result | Notes |
|---|---|---|
| Throughput | ~300 FPS | Maximum theoretical speed without an AI model. |
| Latency | 3.3 ms | Combined Decoding + Memory Management overhead. |
Running YOLOv8m (Explicitly Quantized INT8) with full object tracking and NVJpeg output.
| Metric | Result | Notes |
|---|---|---|
| Total Throughput | ~165 FPS | Wall time (End-to-End). >2.5x Real-Time. |
| Bottleneck Shift | Decoding | Inference is now so fast (3.36ms) that Video Decoding (4.83ms) has become the primary bottleneck. |
Workload Distribution (Active Work):
- Decoding (Source): ~4.83 ms/frame (46.91% load)
- Inference (Detector): ~3.36 ms/frame (32.66% load)
- Storage (Sink): ~2.10 ms/frame (20.43% load)
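The load percentages above follow directly from the per-frame stage timings (share = stage time / total active time):

```cpp
#include <cassert>
#include <cmath>

// Per-frame stage timings reported above, in milliseconds.
constexpr double kDecodeMs = 4.83, kInferMs = 3.36, kStoreMs = 2.10;

// Share of total active work for one stage, in percent.
double loadShare(double stageMs) {
    return 100.0 * stageMs / (kDecodeMs + kInferMs + kStoreMs);  // total ~10.29 ms
}
```

Since the stages overlap in a pipeline, end-to-end throughput is bounded by the slowest stage (1000 / 4.83 ≈ 207 FPS ideal), which is consistent with decoding being the bottleneck at the measured ~165 FPS.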
Both TensorRT (Highly Optimized) and ONNX Runtime (Generic Compatibility) are supported.
| Backend / Precision | FPS | Latency (Inference) | Speedup Factor | Notes |
|---|---|---|---|---|
| TensorRT (INT8) | ~164.8 | ~3.36 ms | 1.4x | Maximum performance. Recommended. |
| TensorRT (FP16) | ~118.1 | ~5.58 ms | 1.0x (Ref) | Baseline hardware acceleration. |
| ONNX Runtime (FP32) | ~38.4 | ~24.2 ms | 0.33x | Generic execution. Useful for testing new models. |
Scenario: 1440x1440 Input Resolution (Directory of JPEG frames).
Model: YOLOv8 Medium (YOLOv8m) 1024x1024.
ONNX Runtime is not supported on Jetson.
| Metric | Result, FPS | Notes |
|---|---|---|
| Infrastructure Ceiling (Stub) | ~160 | Maximum pipeline speed utilizing dedicated MMAPI hardware engine (no model). |
| TensorRT (FP16) + NVJpeg | ~22.2 | Baseline hardware acceleration (CUDA NVJPEG). |
| TensorRT (INT8) + NVJpeg | ~31.4 | Fast inference, with decode/encode tasks using CUDA SMs. |
| TensorRT (INT8) + MMAPI | ~38.0 | New Peak. I/O offloaded to dedicated ASICs. |
The source code of this project is licensed under the MIT License. You are free to use, modify, and distribute this infrastructure code for any purpose, including commercial applications.
While the code is MIT-licensed, the assets and models used in this repository are subject to different terms. Please review them carefully before redistributing:
- Files: Content located in the `video/` directory (e.g., `Moving.mp4`, `Moving_annotated.gif`).
- Source: Generated using KlingAI (Free Tier).
- Terms: These assets are provided for demonstration and educational purposes only. They are strictly non-commercial. You may not use these specific video files in any commercial product or service.
- Attribution: The watermarks on these videos must remain intact as per the platform's Terms of Service.
- Example: If you use YOLOv8 (Ultralytics) with this pipeline, be aware that YOLOv8 is licensed under AGPL-3.0.
- Implication: Integrating an AGPL-3.0 model may legally require your entire combined application to comply with AGPL-3.0 terms (i.e., open-sourcing your entire project).
User Responsibility: This repository provides the execution engine only. No models are bundled. You are responsible for verifying and complying with the license of any specific ONNX/TensorRT model you choose to load.
