
Add DINOv3 vision foundation models (ViT + ConvNeXt, 21M-6.7B)#116

Open
jimburtoft wants to merge 3 commits into aws-neuron:main from jimburtoft:contrib/dinov3

Conversation

@jimburtoft
Contributor

Description

An NxDI contrib implementation of the Meta DINOv3 self-supervised vision foundation models. It supports 7 model variants from 21M to 6.7B parameters across two architectures (ViT and ConvNeXt), using torch_neuronx.trace() for models up to 840M parameters and tensor parallelism via the neuronx-distributed ModelBuilder for ViT-7B (6.7B parameters).

Key highlights:

  • First encoder-only vision model with tensor parallelism on Neuron (ViT-7B, TP=4)
  • Two compilation paths: torch_neuronx.trace() for standard models, neuronx-distributed TP for 6.7B
  • GPU comparison included: Neuron DP=4 beats A10G on ViT (1.16x), GPU wins on ConvNeXt (3.2x)

Model Information

Model Name: DINOv3 (ViT-S/B/L/H+, ConvNeXt-T/B, ViT-7B)

Model Architecture: Encoder-only vision transformer / ConvNeXt backbone

Purpose: Dense feature extraction (self-supervised vision embeddings)

HuggingFace / Source: https://github.com/facebookresearch/dinov3

License: DINOv3 License (not Apache/MIT -- review before redistribution)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • 15 integration tests covering smoke tests, accuracy validation, DataParallel scaling, and performance benchmarks
    • Accuracy validated via cosine similarity between Neuron (matmult bf16) and CPU FP32 outputs
    • ViT models: cosine >= 0.9999, ConvNeXt: cosine >= 0.9998
    • All 15 tests PASSED on trn2.3xlarge (SDK 2.28)
  • README.md with the following sections:

    • Usage Example: Python API for trace, TP, DataParallel, and validation
    • Compatibility Matrix: SDK 2.28, inf2.xlarge / trn2.3xlarge
    • Example Checkpoints: DINOv3 repository (pretrained weights via dinov3 package)
    • Testing Instructions: pytest and standalone runner commands
    • GPU Comparison: A10G vs trn2.3xlarge benchmark results
  • Source Code (src/)

    • modeling_dinov3.py (801 lines): Model loading, trace compilation, TP ViT-7B definition, accuracy validation, benchmarking utilities
    • Follows MoLFormer contrib pattern (trace-based, no NxDI base classes for standard models)

Optional Components

  • Unit Tests -- Integration tests cover both CPU validation and Neuron execution
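
The accuracy check described in the checklist (cosine similarity between Neuron matmult-bf16 output and a CPU FP32 reference) can be sketched as follows. Both tensors are simulated here; in the real test one comes from the compiled model and one from the CPU model, and the helper name is illustrative.

```python
# Sketch of the cosine-similarity accuracy check; the bf16 round-trip
# simulates the precision loss of the Neuron matmult-bf16 path.
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Flatten both outputs and return a single scalar cosine similarity."""
    a, b = a.flatten().float(), b.flatten().float()
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

cpu_out = torch.randn(1, 768)                      # CPU FP32 reference
neuron_out = cpu_out.to(torch.bfloat16).float()    # simulated Neuron output

sim = cosine_similarity(neuron_out, cpu_out)
assert sim >= 0.9998, f"cosine {sim:.6f} below threshold"
```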

Folder Structure

contrib/models/DINOv3/
├── README.md                           # Model docs, results, GPU comparison, usage
├── src/
│   ├── __init__.py
│   └── modeling_dinov3.py              # Trace helpers + TP ViT-7B model definition
└── test/
    ├── __init__.py
    ├── unit/
    │   └── __init__.py
    └── integration/
        ├── __init__.py
        └── test_model.py              # 15 tests: smoke, accuracy, DP, performance

Testing

Instance: trn2.3xlarge (ap-southeast-4), SDK 2.28, DLAMI 20260227

Test command:

source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
git clone https://github.com/facebookresearch/dinov3.git /mnt/models/dinov3
python -m pytest contrib/models/DINOv3/test/integration/test_model.py -v

Test Results: 15/15 PASSED (129 seconds)

test_vit_small_smoke ..................... PASSED
test_vit_base_smoke ...................... PASSED
test_convnext_tiny_smoke ................. PASSED
test_convnext_base_smoke ................. PASSED
test_vit_small_accuracy .................. PASSED  (cosine: 1.000000)
test_vit_base_accuracy ................... PASSED  (cosine: 1.000000)
test_vit_large_accuracy .................. PASSED  (cosine: 1.000000)
test_vit_huge_accuracy ................... PASSED  (cosine: 1.000000)
test_convnext_tiny_accuracy .............. PASSED  (cosine: 0.999989)
test_convnext_base_accuracy .............. PASSED  (cosine: 0.999989)
test_vit_base_dataparallel ............... PASSED  (DP=4: 438.7 img/s)
test_convnext_tiny_dataparallel .......... PASSED  (DP=4: 363.3 img/s)
test_vit_base_performance ................ PASSED  (222.3 img/s)
test_convnext_tiny_performance ........... PASSED  (184.2 img/s)
test_vit7b_tp4 ........................... PASSED  (38.9 img/s, 25.69ms)

Standalone validation:

ViT-B/16:      cosine 1.000000, 222.3 img/s (1-core), 440.6 img/s (DP=4)
ConvNeXt-Tiny: cosine 0.999991, 184.2 img/s (1-core), 364.5 img/s (DP=4)
ViT-7B (TP=4): 38.9 img/s, 25.69ms latency, deterministic, all finite

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): trn2.3xlarge (TP models + DP benchmark), inf2.xlarge (small models)
  • torch-neuronx: 2.9.0.2.11
  • neuronx-cc: 2.22.12471
  • neuronx-distributed: 0.16.25997 (ViT-7B TP only)
  • Python Version: 3.12
  • DLAMI: Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
| Instance | Models | Status |
|---|---|---|
| inf2.xlarge | ViT-S, ViT-B, ConvNeXt-T, ConvNeXt-B | PASS |
| trn2.3xlarge | All models including ViT-7B (TP=4) | PASS |

Additional Information

Benchmark Results (trn2.3xlarge, LNC=2, DP=4)

| Model | NEFF Size | 1-Core (img/s) | DP=4 Peak (img/s) |
|---|---|---|---|
| ViT-S/16 (21M) | 68 MB | 367 | 722.8 |
| ViT-B/16 (86M) | 264 MB | 222 | 438.7 |
| ViT-L/16 (303M) | 931 MB | 87.6 | 174.7 |
| ViT-H+/16 (841M) | 2,595 MB | 5.2 | 10.5 |
| ViT-7B (6.7B) | TP=4 NEFF | -- | 38.8 (TP=4) |
| ConvNeXt-T (28M) | 90 MB | 183 | 363.3 |
| ConvNeXt-B (88M) | 275 MB | 130 | 257.8 |

GPU Comparison (A10G g5.xlarge vs trn2.3xlarge)

| Model | Neuron Best (DP=4) | GPU Best (torch.compile, BS=16) | Winner |
|---|---|---|---|
| ViT-B/16 | 440.6 img/s | 380.0 img/s | Neuron 1.16x |
| ConvNeXt-Tiny | 364.5 img/s | 1,156.2 img/s | GPU 3.2x |

Neuron excels on ViT (transformer ops optimized), GPU excels on ConvNeXt (conv ops heavily optimized on CUDA).

Key Design Decisions

  1. Two compilation paths: torch_neuronx.trace() for models up to 840M, neuronx-distributed TP for ViT-7B
  2. --auto-cast=matmult critical: 50-60% speedup for FP32 models with matmult bf16 autocast
  3. pretrained=False for tests: Random weights for architecture validation (avoids large downloads in CI)
  4. ViT-7B TP=4: First encoder-only vision model with tensor parallelism on Neuron; 20.1 GB NEFF exceeds single-core HBM

Known Limitations

  • ViT-H+ is HBM-bandwidth limited (2.5 GB NEFF, only 10.5 img/s DP=4)
  • ViT-7B requires TP=4 on trn2.3xlarge (single-core OOM)
  • DINOv3 pretrained weights require cloning the DINOv3 repository
  • DINOv3 License is not Apache/MIT -- review before redistribution

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Onboard Meta DINOv3 ViT and ConvNeXt backbones (21M-6.7B params) to Neuron.
Two compilation paths: torch_neuronx.trace() for models up to 840M, and
neuronx-distributed TP=4 for ViT-7B (first encoder-only vision TP on Neuron).
ViT cosine sim 1.000000, ConvNeXt 0.999989, peak DP=4: 722.8 img/s (ViT-S),
ViT-7B TP=4: 38.8 img/s at 25.77ms latency.
pretrained=False gives different random weights on each load_dinov3_model call, so compile_and_cache must receive the same CPU model instance used for the accuracy comparison rather than loading a new one internally.
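
The caveat above can be illustrated with a plain nn.Linear standing in for load_dinov3_model: two separate loads with random init disagree, so the compiled model and the CPU reference must be the same instance. The loader and the commented-out compile call are hypothetical stand-ins.

```python
# Illustration of the pretrained=False pitfall: every "load" reinitializes
# weights, so comparing outputs across two loads would fail accuracy checks.
import torch

def load_model_like(pretrained: bool = False) -> torch.nn.Module:
    # Stand-in loader: random init on each call, like pretrained=False.
    return torch.nn.Linear(16, 16)

m1 = load_model_like()
m2 = load_model_like()
assert not torch.equal(m1.weight, m2.weight)  # two loads disagree

# Correct pattern: load once, reuse the same instance for both the CPU
# FP32 reference and compilation.
cpu_model = load_model_like()
x = torch.randn(1, 16)
reference = cpu_model(x)                 # CPU FP32 reference
# compile_and_cache(cpu_model, ...)      # pass this SAME instance (hypothetical)
```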