
Add Qwen2.5-VL-7B-Instruct full vision-language contrib model #110

Open

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen2.5-vl-7b-upstream

Conversation


@jimburtoft jimburtoft commented Apr 7, 2026

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Full vision-language implementation of Qwen2.5-VL-7B-Instruct on NxD Inference. Unlike existing Qwen2.5-VL contrib entries (3B, 32B) which only support the text backbone, this implementation provides complete vision-language inference including the vision encoder with windowed attention.

NxDI has built-in support for qwen2_vl and qwen3_vl but skips the qwen2_5_vl generation entirely.

Key highlights:

  • Text backbone reuses ~98% of Qwen2-VL code (identical architecture)
  • Vision encoder implements unique Qwen2.5-VL features: RMSNorm, Gated SwiGLU MLP with bias, hybrid windowed/global attention
  • Multi-bucket CTE optimization: 4.8x TTFT improvement for short inputs (38ms vs 183ms)
  • Validated on all 3 model sizes: 3B (104.3 tok/s), 7B (86.4 tok/s), 72B (44.3 tok/s)
  • vLLM-neuron integration validated on both 0.4.1 and 0.5.0
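The gated SwiGLU MLP with bias called out above can be sketched as a small PyTorch module. This is an illustrative sketch, not the contrib source; the class name and the toy dimensions are hypothetical.

```python
# Illustrative sketch (not the contrib module's actual code) of a gated
# SwiGLU MLP with bias, as used in the Qwen2.5-VL vision encoder.
import torch
import torch.nn as nn

class GatedSwiGLUMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Unlike the text backbone's MLP, the vision MLP carries bias terms.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: the SiLU-activated gate multiplies the up projection
        # elementwise before the down projection.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

mlp = GatedSwiGLUMLP(hidden_size=8, intermediate_size=16)
out = mlp(torch.randn(2, 8))
print(tuple(out.shape))  # (2, 8)
```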

Model Information

Model Name: Qwen2.5-VL-7B-Instruct

Model Architecture: Vision-Language model with ViT vision encoder + decoder-only transformer text backbone. GQA (28Q/4KV heads), M-RoPE [16,24,24], SwiGLU MLP. Vision encoder uses hybrid windowed (28 layers) + global (4 layers) attention with RMSNorm and Gated SwiGLU MLP.

Purpose: Vision-language inference (image understanding, image-to-text generation)
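As a sanity check on the M-RoPE layout above: for the 7B model, head_dim = 3584 / 28 = 128, and the rotary half-dimension of 64 splits exactly into the [16, 24, 24] temporal/height/width sections. A minimal sketch, where the helper `section_slices` is hypothetical rather than an NxDI API:

```python
# Hedged sketch: how mrope_section [16, 24, 24] partitions the rotary
# half-dimension of the 7B text backbone. `section_slices` is a
# hypothetical helper, not an NxDI API.
hidden_size = 3584
num_attention_heads = 28
head_dim = hidden_size // num_attention_heads   # 128
mrope_section = [16, 24, 24]
assert head_dim // 2 == sum(mrope_section)      # 64 = 16 + 24 + 24

def section_slices(sections):
    # Each 3D position id (temporal, height, width) drives its own
    # contiguous slice of the rotary frequency dimension.
    out, start = [], 0
    for s in sections:
        out.append((start, start + s))
        start += s
    return out

print(section_slices(mrope_section))  # [(0, 16), (16, 40), (40, 64)]
```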

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • At least one integration test that validates model accuracy
    • Uses logit validation or equivalent accuracy verification
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types (Trn1/Trn2/Inf2)
    • Example Checkpoints: Links to compatible model checkpoints (e.g., HuggingFace Hub)
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)

    • Modeling code following NxD Inference patterns
    • Properly structured in the contrib folder hierarchy

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Tests for individual modeling components
    • Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen2.5-VL-7B-Instruct/
  README.md
  patch_vllm_qwen25vl.py
  patch_vllm_050_qwen25vl.py
  /src
    __init__.py
    modeling_qwen2_5_vl.py
    modeling_qwen2_5_vl_text.py
    modeling_qwen2_5_vl_vision.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All 7 integration tests were run on trn2.3xlarge (TP=4, LNC=2) with Neuron SDK 2.28. The 72B model was tested on trn2.48xlarge (TP=32). Tests include:

  1. Smoke test: Model loads from compiled artifacts
  2. Text-only generation: Greedy generation produces "The capital of France is Paris." (exact CPU match)
  3. Logit validation: Uses logit_validation() from neuronx_distributed_inference.experimental.core.accuracy.logit_validation -- all 8 tokens matched, max K5 error 0.0070 (threshold 0.01)
  4. VL generation: Correctly identifies shapes/colors in synthetic test images
  5. Multi-resolution VL: Validates 224x224, 448x448, 672x672, 640x480 inputs
  6. vllm-neuron API: 6/6 OpenAI-compatible API tests passed (0.4.1 and 0.5.0)
  7. Multi-bucket CTE: All tests pass with optimized bucketing [512, 1024, 2048, 4096]
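The multi-bucket CTE behavior in item 7 can be illustrated with a small bucket-selection sketch. `pick_bucket` is a hypothetical helper, not an NxDI API; only the bucket list [512, 1024, 2048, 4096] comes from this PR.

```python
# Illustrative bucket selection for multi-bucket context encoding (CTE).
# `pick_bucket` is hypothetical; only the bucket list comes from this PR.
CTE_BUCKETS = [512, 1024, 2048, 4096]

def pick_bucket(seq_len: int, buckets=CTE_BUCKETS) -> int:
    # Choose the smallest compiled bucket that fits the prompt, so a short
    # input pads to 512 instead of the 4096 maximum (the TTFT win above).
    for b in sorted(buckets):
        if seq_len <= b:
            return b
    raise ValueError(f"prompt of {seq_len} tokens exceeds max bucket {max(buckets)}")

print(pick_bucket(200), pick_bucket(1500))  # 512 2048
```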

Test Results:

test/integration/test_model.py::test_smoke_load PASSED
test/integration/test_model.py::test_text_generation PASSED
test/integration/test_model.py::test_logit_validation PASSED
test/integration/test_model.py::test_vl_generation PASSED
test/integration/test_model.py::test_vl_multi_resolution PASSED
test/integration/test_model.py::test_vllm_api PASSED
test/integration/test_model.py::test_multi_bucket_cte PASSED
========================= 7 passed =========================

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): trn2.3xlarge (7B, 3B), trn2.48xlarge (72B)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

Performance (TP=4, trn2.3xlarge, optimized config)

| Metric             | Text-only      | Vision-Language        |
|--------------------|----------------|------------------------|
| Token Generation   | 86.4 tok/s     | 86.7 tok/s             |
| TPOT               | 11.57 ms       | 11.57 ms               |
| TTFT (short input) | 38.2 ms        | ~70 ms                 |
| HBM per Core       | 4.2 GB         | 4.2 GB                 |
| Compile Time       | ~82s (5 NEFFs) | ~112s (text + vision)  |

Multi-size validation

| Model | Instance      | TP | TKG tok/s |
|-------|---------------|----|-----------|
| 3B    | trn2.3xlarge  | 4  | 104.3     |
| 7B    | trn2.3xlarge  | 4  | 86.4      |
| 72B   | trn2.48xlarge | 32 | 44.3      |

NKI Kernel Compatibility (7B text decoder)

| Kernel                          | Status                                 |
|---------------------------------|----------------------------------------|
| qkv_kernel_enabled              | PASS                                   |
| attn_kernel_enabled             | PASS                                   |
| attn_tkg_nki_kernel_enabled     | PASS                                   |
| mlp_kernel_enabled              | FAIL (SBUF OOM: intermediate/TP=4736)  |
| attn_tkg_builtin_kernel_enabled | FAIL (M-RoPE incompatible)             |
| out_proj_kernel_enabled         | FAIL (hidden_size=3584 % 1024 != 0)    |

Known Limitations

  1. Batch size > 1 requires the VLM batch>1 fix from branch fix/qwen3-vl-batch-size-gt1-v2
  2. MLP, builtin TKG, and out_proj NKI kernels not compatible (see kernel matrix above)
  3. Vision qkv_kernel not compatible (fused RMSNorm+QKV ISA kernel eps type mismatch)
  4. Multi-bucket CTE minimum bucket must be >= 512 (TKG NKI kernel assertion)
  5. Video input not tested (architecturally supported)

Related Issues

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

Validated on both vllm-neuron 0.4.1 and 0.5.0 (6/6 API tests passed on each). Patch scripts included (patch_vllm_qwen25vl.py for 0.4.1, patch_vllm_050_qwen25vl.py for 0.5.0). Patches add Qwen2.5-VL to 4 files: constants.py, model_loader.py, model_runner.py, and NxDI constants.py.
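Purely as an illustration of what the patch scripts do (the real edits touch constants.py, model_loader.py, model_runner.py, and NxDI's constants.py), registration amounts to adding the Qwen2.5-VL architecture string to the model tables. The dict name and module path below are hypothetical and do not reflect vllm-neuron's actual internals:

```python
# Hypothetical sketch of the registration the patch scripts perform.
# Neither the dict name nor the module path is vllm-neuron's real
# internal layout; see the included patch scripts for the actual edits.
SUPPORTED_VL_MODELS = {
    "Qwen2VLForConditionalGeneration": "modeling_qwen2_vl",
}

# The patches add an entry mapping the Qwen2.5-VL architecture string
# (from the checkpoint's config.json) to the contrib modeling module.
SUPPORTED_VL_MODELS["Qwen2_5_VLForConditionalGeneration"] = "modeling_qwen2_5_vl"

print("Qwen2_5_VLForConditionalGeneration" in SUPPORTED_VL_MODELS)  # True
```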

For vLLM integration details, see: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html#nxdi-onboarding-models-vllm


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

The existing Qwen2.5-VL-3B and VL-32B contrib src/ directories had issues:
- VL-3B: missing source files (mrope.py, config_qwen2vl.py), used 1D RoPE
  instead of M-RoPE, no vision encoder, 67% token match
- VL-32B: used 1D RoPE instead of M-RoPE, no vision encoder, 0% token match

The unified VL-7B implementation already supports all sizes (3B, 7B, 32B, 72B)
via config-driven parameterization. Validated: 3B at 104.3 tok/s, 72B at 44.3 tok/s.

Replace broken src/test with redirect READMEs that provide size-specific
TP guidance and point to the unified implementation.
@jimburtoft (Contributor, Author) commented:

I've added a commit that cleans up the existing Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-32B-Instruct contrib directories. Since the unified VL-7B implementation already supports all Qwen2.5-VL sizes (validated on 3B, 7B, and 72B), having separate broken implementations causes confusion.

What changed:

  • Removed src/ and test/ from Qwen2.5-VL-3B-Instruct/ -- the existing code was missing required files (mrope.py, config_qwen2vl.py), used standard 1D RoPE instead of M-RoPE, had no vision encoder, and achieved only 67% token match.
  • Removed src/ and test/ from Qwen2.5-VL-32B-Instruct/ -- same issues (1D RoPE, no vision encoder, 0% token match, boilerplate test template).
  • Replaced READMEs in both directories with redirect pages that point to the unified VL-7B implementation and provide size-specific guidance (model dimensions, recommended TP degree, performance numbers).

Why: All Qwen2.5-VL sizes share the identical architecture -- the only differences are numeric config parameters (hidden_size, num_layers, etc.) that are read from the HuggingFace config.json. Having separate broken stubs risks users accidentally importing them instead of the working unified implementation.
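The config-driven point above can be sketched concretely: the same modeling code sizes itself from whatever config.json ships with the checkpoint. The fields below are an illustrative excerpt with the 7B model's values, not a complete config; the other sizes differ only in these numbers.

```python
# Sketch of config-driven parameterization: one implementation, sized by
# the checkpoint's config.json. Excerpted fields only; values shown are
# the 7B model's.
import json

config_json = """
{
  "hidden_size": 3584,
  "num_hidden_layers": 28,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "rope_scaling": {"mrope_section": [16, 24, 24]}
}
"""
cfg = json.loads(config_json)

# Derived quantities the modeling code computes from the config alone.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
gqa_ratio = cfg["num_attention_heads"] // cfg["num_key_value_heads"]
print(head_dim, gqa_ratio)  # 128 7
```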

If you'd prefer to keep this cleanup as a separate PR, I'm happy to split it out. Just let me know.
