
Add Qwen2.5-VL-7B-Instruct full vision-language contrib model #110

Open

jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/qwen2.5-vl-7b-upstream

Conversation


@jimburtoft jimburtoft commented Apr 7, 2026

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Full vision-language implementation of Qwen2.5-VL-7B-Instruct on NxD Inference. Unlike existing Qwen2.5-VL contrib entries (3B, 32B) which only support the text backbone, this implementation provides complete vision-language inference including the vision encoder with windowed attention.

NxDI has built-in support for qwen2_vl and qwen3_vl but skips the qwen2_5_vl generation entirely.

Key highlights:

  • Text backbone reuses ~98% of Qwen2-VL code (identical architecture)
  • Vision encoder implements unique Qwen2.5-VL features: RMSNorm, Gated SwiGLU MLP with bias, hybrid windowed/global attention
  • Multi-bucket CTE optimization: 4.8x TTFT improvement for short inputs (38ms vs 183ms)
  • Validated on all 3 model sizes: 3B (104.3 tok/s), 7B (86.4 tok/s), 72B (44.3 tok/s)
  • vLLM-neuron integration validated on both 0.4.1 and 0.5.0
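The gated SwiGLU MLP with bias called out above can be sketched as a small PyTorch module. This is an illustrative sketch, not the contrib source; the class name and the toy dimensions are hypothetical.

```python
# Illustrative sketch (not the contrib module's actual code) of a gated
# SwiGLU MLP with bias, as used in the Qwen2.5-VL vision encoder.
import torch
import torch.nn as nn

class GatedSwiGLUMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Unlike the text backbone's MLP, the vision MLP carries bias terms.
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=True)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: the SiLU-activated gate multiplies the up projection
        # elementwise before the down projection.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

mlp = GatedSwiGLUMLP(hidden_size=8, intermediate_size=16)
out = mlp(torch.randn(2, 8))
print(tuple(out.shape))  # (2, 8)
```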

Model Information

Model Name: Qwen2.5-VL-7B-Instruct

Model Architecture: Vision-Language model with ViT vision encoder + decoder-only transformer text backbone. GQA (28Q/4KV heads), M-RoPE [16,24,24], SwiGLU MLP. Vision encoder uses hybrid windowed (28 layers) + global (4 layers) attention with RMSNorm and Gated SwiGLU MLP.

Purpose: Vision-language inference (image understanding, image-to-text generation)
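As a sanity check on the M-RoPE layout above: for the 7B model, head_dim = 3584 / 28 = 128, and the rotary half-dimension of 64 splits exactly into the [16, 24, 24] temporal/height/width sections. A minimal sketch, where the helper `section_slices` is hypothetical rather than an NxDI API:

```python
# Hedged sketch: how mrope_section [16, 24, 24] partitions the rotary
# half-dimension of the 7B text backbone. `section_slices` is a
# hypothetical helper, not an NxDI API.
hidden_size = 3584
num_attention_heads = 28
head_dim = hidden_size // num_attention_heads   # 128
mrope_section = [16, 24, 24]
assert head_dim // 2 == sum(mrope_section)      # 64 = 16 + 24 + 24

def section_slices(sections):
    # Each 3D position id (temporal, height, width) drives its own
    # contiguous slice of the rotary frequency dimension.
    out, start = [], 0
    for s in sections:
        out.append((start, start + s))
        start += s
    return out

print(section_slices(mrope_section))  # [(0, 16), (16, 40), (40, 64)]
```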

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • At least one integration test that validates model accuracy
    • Uses logit validation or equivalent accuracy verification
    • Test can compile and run the model on Neuron
  • README.md with the following sections:

    • Usage Example: Clear code example showing how to use the model
    • Compatibility Matrix: Table showing tested Neuron SDK versions and instance types (Trn1/Trn2/Inf2)
    • Example Checkpoints: Links to compatible model checkpoints (e.g., HuggingFace Hub)
    • Testing Instructions: Command to run the test suite for the model
  • Source Code (src/)

    • Modeling code following NxD Inference patterns
    • Properly structured in the contrib folder hierarchy

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Tests for individual modeling components
    • Located in test/unit/ directory

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/Qwen2.5-VL-7B-Instruct/
  README.md
  patch_vllm_qwen25vl.py
  patch_vllm_050_qwen25vl.py
  /src
    __init__.py
    modeling_qwen2_5_vl.py
    modeling_qwen2_5_vl_text.py
    modeling_qwen2_5_vl_vision.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All 7 integration tests were run on trn2.3xlarge (TP=4, LNC=2) with Neuron SDK 2.28. The 72B model was tested on trn2.48xlarge (TP=32). Tests include:

  1. Smoke test: Model loads from compiled artifacts
  2. Text-only generation: Greedy generation produces "The capital of France is Paris." (exact CPU match)
  3. Logit validation: Uses logit_validation() from neuronx_distributed_inference.experimental.core.accuracy.logit_validation -- all 8 tokens matched, max K5 error 0.0070 (threshold 0.01)
  4. VL generation: Correctly identifies shapes/colors in synthetic test images
  5. Multi-resolution VL: Validates 224x224, 448x448, 672x672, 640x480 inputs
  6. vllm-neuron API: 6/6 OpenAI-compatible API tests passed (0.4.1 and 0.5.0)
  7. Multi-bucket CTE: All tests pass with optimized bucketing [512, 1024, 2048, 4096]
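The multi-bucket CTE behavior in item 7 can be illustrated with a small bucket-selection sketch. `pick_bucket` is a hypothetical helper, not an NxDI API; only the bucket list [512, 1024, 2048, 4096] comes from this PR.

```python
# Illustrative bucket selection for multi-bucket context encoding (CTE).
# `pick_bucket` is hypothetical; only the bucket list comes from this PR.
CTE_BUCKETS = [512, 1024, 2048, 4096]

def pick_bucket(seq_len: int, buckets=CTE_BUCKETS) -> int:
    # Choose the smallest compiled bucket that fits the prompt, so a short
    # input pads to 512 instead of the 4096 maximum (the TTFT win above).
    for b in sorted(buckets):
        if seq_len <= b:
            return b
    raise ValueError(f"prompt of {seq_len} tokens exceeds max bucket {max(buckets)}")

print(pick_bucket(200), pick_bucket(1500))  # 512 2048
```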

Test Results:

test/integration/test_model.py::test_smoke_load PASSED
test/integration/test_model.py::test_text_generation PASSED
test/integration/test_model.py::test_logit_validation PASSED
test/integration/test_model.py::test_vl_generation PASSED
test/integration/test_model.py::test_vl_multi_resolution PASSED
test/integration/test_model.py::test_vllm_api PASSED
test/integration/test_model.py::test_multi_bucket_cte PASSED
========================= 7 passed =========================

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): trn2.3xlarge (7B, 3B), trn2.48xlarge (72B)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

Performance (TP=4, trn2.3xlarge, optimized config)

| Metric             | Text-only      | Vision-Language        |
|--------------------|----------------|------------------------|
| Token Generation   | 86.4 tok/s     | 86.7 tok/s             |
| TPOT               | 11.57 ms       | 11.57 ms               |
| TTFT (short input) | 38.2 ms        | ~70 ms                 |
| HBM per Core       | 4.2 GB         | 4.2 GB                 |
| Compile Time       | ~82s (5 NEFFs) | ~112s (text + vision)  |

Multi-size validation

| Model | Instance      | TP | TKG tok/s |
|-------|---------------|----|-----------|
| 3B    | trn2.3xlarge  | 4  | 104.3     |
| 7B    | trn2.3xlarge  | 4  | 86.4      |
| 72B   | trn2.48xlarge | 32 | 44.3      |

NKI Kernel Compatibility (7B text decoder)

| Kernel                          | Status                                 |
|---------------------------------|----------------------------------------|
| qkv_kernel_enabled              | PASS                                   |
| attn_kernel_enabled             | PASS                                   |
| attn_tkg_nki_kernel_enabled     | PASS                                   |
| mlp_kernel_enabled              | FAIL (SBUF OOM: intermediate/TP=4736)  |
| attn_tkg_builtin_kernel_enabled | FAIL (M-RoPE incompatible)             |
| out_proj_kernel_enabled         | FAIL (hidden_size=3584 % 1024 != 0)    |

Known Limitations

  1. Batch size > 1 requires the VLM batch>1 fix from branch fix/qwen3-vl-batch-size-gt1-v2
  2. MLP, builtin TKG, and out_proj NKI kernels not compatible (see kernel matrix above)
  3. Vision qkv_kernel not compatible (fused RMSNorm+QKV ISA kernel eps type mismatch)
  4. Multi-bucket CTE minimum bucket must be >= 512 (TKG NKI kernel assertion)
  5. Video input not tested (architecturally supported)

Related Issues

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

Validated on both vllm-neuron 0.4.1 and 0.5.0 (6/6 API tests passed on each). Patch scripts included (patch_vllm_qwen25vl.py for 0.4.1, patch_vllm_050_qwen25vl.py for 0.5.0). Patches add Qwen2.5-VL to 4 files: constants.py, model_loader.py, model_runner.py, and NxDI constants.py.
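Purely as an illustration of what the patch scripts do (the real edits touch constants.py, model_loader.py, model_runner.py, and NxDI's constants.py), registration amounts to adding the Qwen2.5-VL architecture string to the model tables. The dict name and module path below are hypothetical and do not reflect vllm-neuron's actual internals:

```python
# Hypothetical sketch of the registration the patch scripts perform.
# Neither the dict name nor the module path is vllm-neuron's real
# internal layout; see the included patch scripts for the actual edits.
SUPPORTED_VL_MODELS = {
    "Qwen2VLForConditionalGeneration": "modeling_qwen2_vl",
}

# The patches add an entry mapping the Qwen2.5-VL architecture string
# (from the checkpoint's config.json) to the contrib modeling module.
SUPPORTED_VL_MODELS["Qwen2_5_VLForConditionalGeneration"] = "modeling_qwen2_5_vl"

print("Qwen2_5_VLForConditionalGeneration" in SUPPORTED_VL_MODELS)  # True
```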

For vLLM integration details, see: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/onboarding-models.html#nxdi-onboarding-models-vllm


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

The existing Qwen2.5-VL-3B and VL-32B contrib src/ directories had issues:
- VL-3B: missing source files (mrope.py, config_qwen2vl.py), used 1D RoPE
  instead of M-RoPE, no vision encoder, 67% token match
- VL-32B: used 1D RoPE instead of M-RoPE, no vision encoder, 0% token match

The unified VL-7B implementation already supports all sizes (3B, 7B, 32B, 72B)
via config-driven parameterization. Validated: 3B at 104.3 tok/s, 72B at 44.3 tok/s.

Replace broken src/test with redirect READMEs that provide size-specific
TP guidance and point to the unified implementation.
@jimburtoft (Contributor, Author) commented:

I've added a commit that cleans up the existing Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-32B-Instruct contrib directories. Since the unified VL-7B implementation already supports all Qwen2.5-VL sizes (validated on 3B, 7B, and 72B), having separate broken implementations causes confusion.

What changed:

  • Removed src/ and test/ from Qwen2.5-VL-3B-Instruct/ -- the existing code was missing required files (mrope.py, config_qwen2vl.py), used standard 1D RoPE instead of M-RoPE, had no vision encoder, and achieved only 67% token match.
  • Removed src/ and test/ from Qwen2.5-VL-32B-Instruct/ -- same issues (1D RoPE, no vision encoder, 0% token match, boilerplate test template).
  • Replaced READMEs in both directories with redirect pages that point to the unified VL-7B implementation and provide size-specific guidance (model dimensions, recommended TP degree, performance numbers).

Why: All Qwen2.5-VL sizes share the identical architecture -- the only differences are numeric config parameters (hidden_size, num_layers, etc.) that are read from the HuggingFace config.json. Having separate broken stubs risks users accidentally importing them instead of the working unified implementation.
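The config-driven point above can be sketched concretely: the same modeling code sizes itself from whatever config.json ships with the checkpoint. The fields below are an illustrative excerpt with the 7B model's values, not a complete config; the other sizes differ only in these numbers.

```python
# Sketch of config-driven parameterization: one implementation, sized by
# the checkpoint's config.json. Excerpted fields only; values shown are
# the 7B model's.
import json

config_json = """
{
  "hidden_size": 3584,
  "num_hidden_layers": 28,
  "num_attention_heads": 28,
  "num_key_value_heads": 4,
  "rope_scaling": {"mrope_section": [16, 24, 24]}
}
"""
cfg = json.loads(config_json)

# Derived quantities the modeling code computes from the config alone.
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
gqa_ratio = cfg["num_attention_heads"] // cfg["num_key_value_heads"]
print(head_dim, gqa_ratio)  # 128 7
```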

If you'd prefer to keep this cleanup as a separate PR, I'm happy to split it out. Just let me know.
