
Add DINOv3 vision foundation models (ViT + ConvNeXt, 21M-6.7B)#116

Open
jimburtoft wants to merge 3 commits into aws-neuron:main from jimburtoft:contrib/dinov3

Conversation

@jimburtoft
Contributor

Description

An NxDI contrib implementation of the Meta DINOv3 self-supervised vision foundation models. It supports 7 model variants from 21M to 6.7B parameters across two architectures (ViT and ConvNeXt), using torch_neuronx.trace() for models up to 840M parameters and tensor parallelism via the neuronx-distributed ModelBuilder for ViT-7B (6.7B parameters).

Key highlights:

  • First encoder-only vision model with tensor parallelism on Neuron (ViT-7B, TP=4)
  • Two compilation paths: torch_neuronx.trace() for standard models, neuronx-distributed TP for 6.7B
  • GPU comparison included: Neuron DP=4 beats A10G on ViT (1.16x), GPU wins on ConvNeXt (3.2x)

Model Information

Model Name: DINOv3 (ViT-S/B/L/H+, ConvNeXt-T/B, ViT-7B)

Model Architecture: Encoder-only vision transformer / ConvNeXt backbone

Purpose: Dense feature extraction (self-supervised vision embeddings)

HuggingFace / Source: https://github.com/facebookresearch/dinov3

License: DINOv3 License (not Apache/MIT -- review before redistribution)

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • 15 integration tests covering smoke tests, accuracy validation, DataParallel scaling, and performance benchmarks
    • Accuracy validated via cosine similarity between Neuron (matmult bf16) and CPU FP32 outputs
    • ViT models: cosine >= 0.9999, ConvNeXt: cosine >= 0.9998
    • All 15 tests PASSED on trn2.3xlarge (SDK 2.28)
  • README.md with the following sections:

    • Usage Example: Python API for trace, TP, DataParallel, and validation
    • Compatibility Matrix: SDK 2.28, inf2.xlarge / trn2.3xlarge
    • Example Checkpoints: DINOv3 repository (pretrained weights via dinov3 package)
    • Testing Instructions: pytest and standalone runner commands
    • GPU Comparison: A10G vs trn2.3xlarge benchmark results
  • Source Code (src/)

    • modeling_dinov3.py (801 lines): Model loading, trace compilation, TP ViT-7B definition, accuracy validation, benchmarking utilities
    • Follows MoLFormer contrib pattern (trace-based, no NxDI base classes for standard models)

Optional Components

  • Unit Tests -- Integration tests cover both CPU validation and Neuron execution
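
The accuracy check described in the checklist (cosine similarity between Neuron matmult-bf16 output and a CPU FP32 reference) can be sketched as follows. Both tensors are simulated here; in the real test one comes from the compiled model and one from the CPU model, and the helper name is illustrative.

```python
# Sketch of the cosine-similarity accuracy check; the bf16 round-trip
# simulates the precision loss of the Neuron matmult-bf16 path.
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Flatten both outputs and return a single scalar cosine similarity."""
    a, b = a.flatten().float(), b.flatten().float()
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

cpu_out = torch.randn(1, 768)                      # CPU FP32 reference
neuron_out = cpu_out.to(torch.bfloat16).float()    # simulated Neuron output

sim = cosine_similarity(neuron_out, cpu_out)
assert sim >= 0.9998, f"cosine {sim:.6f} below threshold"
```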

Folder Structure

contrib/models/DINOv3/
├── README.md                           # Model docs, results, GPU comparison, usage
├── src/
│   ├── __init__.py
│   └── modeling_dinov3.py              # Trace helpers + TP ViT-7B model definition
└── test/
    ├── __init__.py
    ├── unit/
    │   └── __init__.py
    └── integration/
        ├── __init__.py
        └── test_model.py              # 15 tests: smoke, accuracy, DP, performance

Testing

Instance: trn2.3xlarge (ap-southeast-4), SDK 2.28, DLAMI 20260227

Test command:

source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/bin/activate
git clone https://github.com/facebookresearch/dinov3.git /mnt/models/dinov3
python -m pytest contrib/models/DINOv3/test/integration/test_model.py -v

Test Results: 15/15 PASSED (129 seconds)

test_vit_small_smoke ..................... PASSED
test_vit_base_smoke ...................... PASSED
test_convnext_tiny_smoke ................. PASSED
test_convnext_base_smoke ................. PASSED
test_vit_small_accuracy .................. PASSED  (cosine: 1.000000)
test_vit_base_accuracy ................... PASSED  (cosine: 1.000000)
test_vit_large_accuracy .................. PASSED  (cosine: 1.000000)
test_vit_huge_accuracy ................... PASSED  (cosine: 1.000000)
test_convnext_tiny_accuracy .............. PASSED  (cosine: 0.999989)
test_convnext_base_accuracy .............. PASSED  (cosine: 0.999989)
test_vit_base_dataparallel ............... PASSED  (DP=4: 438.7 img/s)
test_convnext_tiny_dataparallel .......... PASSED  (DP=4: 363.3 img/s)
test_vit_base_performance ................ PASSED  (222.3 img/s)
test_convnext_tiny_performance ........... PASSED  (184.2 img/s)
test_vit7b_tp4 ........................... PASSED  (38.9 img/s, 25.69ms)

Standalone validation:

ViT-B/16:      cosine 1.000000, 222.3 img/s (1-core), 440.6 img/s (DP=4)
ConvNeXt-Tiny: cosine 0.999991, 184.2 img/s (1-core), 364.5 img/s (DP=4)
ViT-7B (TP=4): 38.9 img/s, 25.69ms latency, deterministic, all finite

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): trn2.3xlarge (TP models + DP benchmark), inf2.xlarge (small models)
  • torch-neuronx: 2.9.0.2.11
  • neuronx-cc: 2.22.12471
  • neuronx-distributed: 0.16.25997 (ViT-7B TP only)
  • Python Version: 3.12
  • DLAMI: Deep Learning AMI Neuron (Ubuntu 24.04) 20260227
| Instance | Models | Status |
|---|---|---|
| inf2.xlarge | ViT-S, ViT-B, ConvNeXt-T, ConvNeXt-B | PASS |
| trn2.3xlarge | All models including ViT-7B (TP=4) | PASS |

Additional Information

Benchmark Results (trn2.3xlarge, LNC=2, DP=4)

| Model | NEFF Size | 1-Core (img/s) | DP=4 Peak (img/s) |
|---|---|---|---|
| ViT-S/16 (21M) | 68 MB | 367 | 722.8 |
| ViT-B/16 (86M) | 264 MB | 222 | 438.7 |
| ViT-L/16 (303M) | 931 MB | 87.6 | 174.7 |
| ViT-H+/16 (841M) | 2,595 MB | 5.2 | 10.5 |
| ViT-7B (6.7B) | TP=4 NEFF | -- | 38.8 (TP=4) |
| ConvNeXt-T (28M) | 90 MB | 183 | 363.3 |
| ConvNeXt-B (88M) | 275 MB | 130 | 257.8 |

GPU Comparison (A10G g5.xlarge vs trn2.3xlarge)

| Model | Neuron Best (DP=4) | GPU Best (torch.compile, BS=16) | Winner |
|---|---|---|---|
| ViT-B/16 | 440.6 img/s | 380.0 img/s | Neuron 1.16x |
| ConvNeXt-Tiny | 364.5 img/s | 1,156.2 img/s | GPU 3.2x |

Neuron excels on ViT (transformer ops optimized), GPU excels on ConvNeXt (conv ops heavily optimized on CUDA).

Key Design Decisions

  1. Two compilation paths: torch_neuronx.trace() for models up to 840M, neuronx-distributed TP for ViT-7B
  2. --auto-cast=matmult critical: 50-60% speedup for FP32 models with matmult bf16 autocast
  3. pretrained=False for tests: Random weights for architecture validation (avoids large downloads in CI)
  4. ViT-7B TP=4: First encoder-only vision model with tensor parallelism on Neuron; 20.1 GB NEFF exceeds single-core HBM

Known Limitations

  • ViT-H+ is HBM-bandwidth limited (2.5 GB NEFF, only 10.5 img/s DP=4)
  • ViT-7B requires TP=4 on trn2.3xlarge (single-core OOM)
  • DINOv3 pretrained weights require cloning the DINOv3 repository
  • DINOv3 License is not Apache/MIT -- review before redistribution

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Onboard Meta DINOv3 ViT and ConvNeXt backbones (21M-6.7B params) to Neuron.
Two compilation paths: torch_neuronx.trace() for models up to 840M, and
neuronx-distributed TP=4 for ViT-7B (first encoder-only vision TP on Neuron).
ViT cosine sim 1.000000, ConvNeXt 0.999989, peak DP=4: 722.8 img/s (ViT-S),
ViT-7B TP=4: 38.8 img/s at 25.77ms latency.
pretrained=False gives different random weights on each load_dinov3_model call, so compile_and_cache must receive the same CPU model instance used for the accuracy comparison rather than loading a new one internally.
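
The caveat above can be illustrated with a plain nn.Linear standing in for load_dinov3_model: two separate loads with random init disagree, so the compiled model and the CPU reference must be the same instance. The loader and the commented-out compile call are hypothetical stand-ins.

```python
# Illustration of the pretrained=False pitfall: every "load" reinitializes
# weights, so comparing outputs across two loads would fail accuracy checks.
import torch

def load_model_like(pretrained: bool = False) -> torch.nn.Module:
    # Stand-in loader: random init on each call, like pretrained=False.
    return torch.nn.Linear(16, 16)

m1 = load_model_like()
m2 = load_model_like()
assert not torch.equal(m1.weight, m2.weight)  # two loads disagree

# Correct pattern: load once, reuse the same instance for both the CPU
# FP32 reference and compilation.
cpu_model = load_model_like()
x = torch.randn(1, 16)
reference = cpu_model(x)                 # CPU FP32 reference
# compile_and_cache(cpu_model, ...)      # pass this SAME instance (hypothetical)
```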