
Add Qwen2.5-Omni-7B multimodal support (TP=4, verified on Trn2)#122

Draft
whn09 wants to merge 14 commits into aws-neuron:main from whn09:feature/qwen25-omni-support

Conversation


whn09 commented Apr 10, 2026

Description

Add NxDI model adapter for Qwen2.5-Omni-7B with full multimodal support: text-only Thinker, vision encoder, audio encoder, Talker (speech codec generation), and Token2Wav (waveform synthesis).

All Neuron components run at TP=4: Thinker (28 heads → 7/rank), Vision (16 → 4/rank), Audio (20 → 5/rank). This enables running on trn2.xlarge (4 NeuronCores) as well as larger instances.
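As a quick illustration of this divisibility constraint, the per-rank head counts above can be checked with a few lines of Python (an illustrative sketch; the names are ours, not project code):

```python
# Sanity check that every component's attention head count divides evenly
# by the tensor-parallel degree, as TP sharding requires.
TP_DEGREE = 4

HEAD_COUNTS = {
    "thinker": 28,  # -> 7 heads per rank
    "vision": 16,   # -> 4 heads per rank
    "audio": 20,    # -> 5 heads per rank
}

for name, heads in HEAD_COUNTS.items():
    assert heads % TP_DEGREE == 0, f"{name}: {heads} heads not divisible by TP={TP_DEGREE}"
    print(f"{name}: {heads} heads -> {heads // TP_DEGREE} per rank")
```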

The Thinker's text backbone is architecturally identical to Qwen2.5-7B, so the text model reuses NeuronQwen2ForCausalLM with state-dict prefix remapping. The vision encoder uses SwiGLU MLP, RMSNorm, and separate QKV projections, compiled and run on Neuron. The audio encoder uses a hybrid CPU+Neuron architecture: Conv1d frontend + chunking on CPU, 32 transformer layers on Neuron, AvgPool + projection on CPU. The Talker stays on CPU (non-standard head_dim, 3D mRoPE, per-step thinker injection). Token2Wav runs on CPU in float32.
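The prefix remapping described here can be sketched as follows (the function and key names are illustrative assumptions, not the actual NxDI code):

```python
# Illustrative sketch: the Thinker's text weights live under "thinker.model."
# in the HF checkpoint, but the reused Qwen2 text model expects unprefixed
# keys. Non-text components are dropped for text-only inference.
def remap_thinker_state_dict(state_dict: dict) -> dict:
    remapped = {}
    prefix = "thinker.model."
    for key, tensor in state_dict.items():
        if key.startswith(prefix):
            remapped[key[len(prefix):]] = tensor
        # talker.*, token2wav.*, thinker.visual.*, thinker.audio_tower.*
        # fall through and are filtered out here.
    return remapped

sd = {"thinker.model.layers.0.self_attn.q_proj.weight": "W",
      "talker.model.layers.0.mlp.up_proj.weight": "X"}
print(remap_thinker_state_dict(sd))
# -> {'layers.0.self_attn.q_proj.weight': 'W'}
```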

Model Information

Model Name: Qwen2.5-Omni-7B

Model Architecture: Multimodal encoder-decoder (Thinker + Vision + Audio + Talker + Token2Wav)

Purpose: Text generation, image-to-text, audio-to-text, text-to-speech (full omni-modal)

Checklist

Required Components

  • Accuracy Test — All 6 tests pass on trn2.48xlarge with real Qwen2.5-Omni-7B weights
  • README.md — Full multimodal documentation with architecture details, usage examples, vLLM serving instructions, and performance benchmarks
  • Source Code (src/)
    • modeling_qwen25_omni.py — Text-only + multimodal orchestration, config handling, state dict conversion
    • modeling_qwen25_omni_vision.py — Vision encoder (SwiGLU, RMSNorm, separate QKV, PatchMerger)
    • modeling_qwen25_omni_audio.py — Audio encoder (hybrid CPU+Neuron, chunked attention)
    • modeling_qwen25_omni_talker.py — Talker (HF wrapper, codec token generation, CPU)
    • modeling_qwen25_omni_token2wav.py — Token2Wav (DiT + BigVGAN, waveform synthesis, CPU)

Optional Components

  • Unit Tests — Integration tests on Trn2 with real weights (see Testing section)

Folder Structure

src/neuronx_distributed_inference/models/qwen25_omni/
  __init__.py
  modeling_qwen25_omni.py          # Text-only + multimodal orchestration
  modeling_qwen25_omni_vision.py   # Vision encoder (Neuron, TP=4)
  modeling_qwen25_omni_audio.py    # Audio encoder (CPU+Neuron hybrid, TP=4)
  modeling_qwen25_omni_talker.py   # Talker codec generation (CPU)
  modeling_qwen25_omni_token2wav.py # Token2Wav waveform synthesis (CPU)

contrib/models/Qwen2.5-Omni-7B/
  README.md                        # Full multimodal documentation
  test/integration/test_model.py   # 6 integration tests

perf_test/
  3_bench_qwen25_omni_7b.sh       # vLLM benchmark (BS=1/4, TP=4)
  apply_vllm_neuron_patch_qwen25omni.py  # vllm-neuron patch script

Testing

How did you test this change?

All 6 tests pass on trn2.48xlarge with Qwen2.5-Omni-7B weights (Neuron SDK 2.23, PyTorch 2.9):

  1. Imports: All 7 module groups import correctly
  2. Config: TP=4 head divisibility verified (Thinker 7Q/1KV, Audio 5, Vision 4 per rank)
  3. State dict: All 2448 keys convert correctly (text=339, audio=489, vision=518, talker=293, token2wav=809)
  4. Audio CPU: Frontend + postprocessor latency 20-34ms (1-30s audio)
  5. Talker CPU: 293 keys loaded, codec tokens verified (bos=8293, eos=8294, pad=8292)
  6. Text generation: Compile + load + generate on Neuron, TPOT ~11-13ms, correct outputs

Generation Results (greedy, TP=4):

Prompt                                        Tokens  Time   TPOT
"What is 2+3?"                                     1  0.10s  —
"Write a haiku about the ocean."                  20  0.26s  ~12.8ms
"Explain quantum computing in one sentence."      30  0.34s  ~11.5ms
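As a rough cross-check, dividing total wall time by token count bounds the reported TPOT from above (the quotient still includes first-token overhead, so the true per-output-token time is slightly lower):

```python
# Upper-bound estimate of TPOT from the generation results above:
# total time / tokens includes time-to-first-token, so it slightly
# overstates the steady-state per-token latency.
runs = [("haiku", 20, 0.26), ("quantum", 30, 0.34)]
for name, tokens, total_s in runs:
    print(f"{name}: {total_s / tokens * 1000:.1f} ms/token (upper bound)")
# -> haiku: 13.0 ms/token, quantum: 11.3 ms/token,
#    consistent with the reported ~12.8 ms and ~11.5 ms.
```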

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.23
  • Instance Type(s): Trn2 (verified on trn2.48xlarge; expected to work on trn2.xlarge, 4 NeuronCores)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

Architecture details:

Component       Runtime      TP   Key dims
Thinker (text)  Neuron       4    hidden=3584, heads=28, kv_heads=4, layers=28
Vision encoder  Neuron       4    embed=1280, heads=16, depth=32, SwiGLU
Audio encoder   CPU+Neuron   4    d_model=1280, heads=20, layers=32, chunked attn
Talker          CPU          N/A  hidden=896, heads=12, kv_heads=4, layers=24, vocab=8448
Token2Wav       CPU (fp32)   N/A  DiT: dim=1024, 22 blocks; BigVGAN: 6 upsample stages

Design decisions:

  • TP=4: All attention head counts (28, 16, 20) are divisible by 4, enabling single trn2.xlarge
  • Audio hybrid CPU+Neuron: Conv1d frontend + chunking on CPU (variable length), transformer on Neuron (TP=4), postprocessor on CPU
  • Talker on CPU: Non-standard head_dim (128 ≠ 896/12), 3D mRoPE with per-step thinker-state injection, only ~690M params
  • Token2Wav on CPU: ODE sampling (Runge-Kutta 4) requires float32 precision
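To illustrate the last point, a generic fourth-order Runge-Kutta update of the kind an ODE sampler performs looks like this (a sketch only; the PR's Token2Wav wraps HF's Qwen2_5OmniToken2WavModel rather than implementing the solver by hand):

```python
# Generic RK4 step. Accumulated truncation error over many ODE steps is
# why lower-precision arithmetic degrades the sampled output, motivating
# the float32 requirement noted above.
def rk4_step(f, x, t, dt):
    k1 = f(x, t)
    k2 = f(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = f(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = f(x + dt * k3, t + dt)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: dx/dt = -x has exact solution exp(-t); integrate t from 0 to 1.
x = 1.0
for i in range(100):
    x = rk4_step(lambda x, t: -x, x, i * 0.01, 0.01)
print(round(x, 5))  # -> 0.36788 (exp(-1) ~ 0.367879)
```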

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • vLLM-neuron patch included in perf_test/apply_vllm_neuron_patch_qwen25omni.py

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

whn09 and others added 5 commits April 10, 2026 16:38
The Thinker's text backbone is architecturally identical to Qwen2.5,
so we reuse the Qwen2 NxDI implementation with state-dict remapping
to handle the nested weight prefixes (thinker.model.* -> *).

Non-text components (Talker, Token2Wav, audio/vision encoders) are
filtered out during weight loading for text-only inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When loading Qwen2.5-Omni from a saved compiled model path, thinker_config
is a plain dict (from JSON) rather than a SimpleNamespace. Convert it before
attribute access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PretrainedConfig defaults tie_word_embeddings=True, which overwrites the
correct False value from thinker_config.text_config. This caused lm_head
weights to be replaced by embed_tokens, producing only special tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With tie_word_embeddings correctly set to False from text_config, this
method is never called, so the override is dead code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…token2wav)

Extends the text-only Thinker support with all multimodal components:

- Vision encoder (Neuron): SwiGLU MLP, RMSNorm, separate QKV projections,
  PatchMerger with spatial downsampling. Compiled and runs on Neuron cores.

- Audio encoder (CPU): Whisper-style Conv1d frontend, 32 transformer layers
  with chunked attention (n_window=100), AvgPool1d downsampling. Runs on CPU
  because its 20 attention heads are not evenly divisible by a TP degree of 32.

- Talker (CPU): Small Qwen2 decoder (24 layers, hidden=896, 12 heads, 4
  kv_heads) that converts Thinker hidden states into codec tokens. Wraps HF's
  Qwen2_5OmniTalkerForConditionalGeneration for autoregressive generation.

- Token2Wav (CPU, float32): DiT model (22 blocks) with ECAPA-TDNN speaker
  encoder for mel spectrogram generation, plus BigVGAN vocoder for waveform
  synthesis. Wraps HF's Qwen2_5OmniToken2WavModel.

- Multimodal config: Extracts text/vision/audio/talker/token2wav configs from
  the nested HF config structure.

- State dict conversion: Full 2448-key pipeline handling all component prefixes
  (thinker.model.*, thinker.visual.*, thinker.audio_tower.*, talker.*, token2wav.*).

All components tested on Trn2 with real Qwen2.5-Omni-7B weights:
  Text: 339 keys, Vision: 518, Audio: 489, Talker: 293, Token2Wav: 809
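The CPU-side chunking mentioned for the audio encoder can be sketched as a simple fixed-window split (an illustration built on assumptions; only the window size n_window=100 comes from the commit message):

```python
# Split a variable-length audio feature sequence into fixed-size windows,
# so the downstream transformer can operate on uniform chunks; the final
# window may be shorter.
def chunk_sequence(seq_len: int, n_window: int = 100):
    """Return (start, end) index pairs covering seq_len in windows."""
    return [(s, min(s + n_window, seq_len)) for s in range(0, seq_len, n_window)]

print(chunk_sequence(250))  # -> [(0, 100), (100, 200), (200, 250)]
```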

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 force-pushed the feature/qwen25-omni-support branch from 569f752 to 8a91d55 on April 10, 2026 at 08:42
whn09 changed the title from "Add Qwen2.5-Omni-7B text-only (Thinker) inference support" to "Add Qwen2.5-Omni-7B full multimodal inference support" on Apr 10, 2026
whn09 and others added 2 commits April 10, 2026 16:59
…ructions

Replace auto-generated README with comprehensive documentation covering all
5 components (Thinker, Vision, Audio, Talker, Token2Wav), vLLM serving setup,
performance benchmarks, and architecture details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eneration

- Migrate all Neuron components from TP=32 to TP=4 (Thinker 7Q/1KV per rank,
  Vision 4/rank, Audio 5/rank)
- Refactor audio encoder to hybrid CPU+Neuron: Conv1d frontend + chunking on
  CPU, 32 transformer layers on Neuron with TP=4, AvgPool + projection on CPU
- Add compile_audio_encoder() and load_audio_encoder() orchestration methods
- Implement multimodal forward() combining vision + audio embeddings via
  scatter_by_index_put
- Document Talker CPU rationale (non-standard head_dim, 3D mRoPE, per-step
  thinker injection, small model)
- Update benchmark script and README for TP=4 configs
- All 6 tests pass on trn2.48xlarge: imports, config, state_dict (2448 keys),
  audio CPU (20ms/1s), talker (293 keys), text generation (TPOT ~11-13ms)
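The embedding merge in the multimodal forward() can be illustrated with a pure-Python stand-in (the PR uses a tensor-level scatter_by_index_put; this list version only shows the placeholder-replacement idea):

```python
# Replace placeholder entries in the text embedding sequence with the
# corresponding modality (vision/audio) embeddings at the given positions.
def merge_embeddings(text_embeds, positions, modality_embeds):
    merged = list(text_embeds)
    for pos, emb in zip(positions, modality_embeds):
        merged[pos] = emb
    return merged

# Tokens 1 and 2 are image placeholders replaced by vision features.
print(merge_embeddings(["t0", "img", "img", "t3"], [1, 2], ["v0", "v1"]))
# -> ['t0', 'v0', 'v1', 't3']
```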

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 changed the title from "Add Qwen2.5-Omni-7B full multimodal inference support" to "Add Qwen2.5-Omni-7B multimodal support (TP=4, verified on Trn2)" on Apr 10, 2026
whn09 and others added 3 commits April 10, 2026 18:31
Rewrite test_model.py with 8 working tests:
1. Import validation (all 7 module groups)
2. Config TP=4 head divisibility check
3. State dict conversion (2448 keys, all components)
4. Audio encoder CPU frontend/postprocessor (1s-30s synthetic audio)
5. Talker CPU model (weight loading + codec tokens)
6. Text-only Thinker compile + load + generate (3 prompts, chat template)
7. Image understanding preprocessing (Qwen official demo.jpeg)
8. Audio understanding preprocessing (Qwen official cough.wav)

Supports --test, --quick flags, and env var overrides for paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Supports 4 modes: text-only, image+text, audio+text, full multimodal.
Uses Qwen official test assets (demo.jpeg, cough.wav).

Usage:
  python3 examples/generate_qwen25_omni.py --mode text   # text only
  python3 examples/generate_qwen25_omni.py --mode image  # image understanding
  python3 examples/generate_qwen25_omni.py --mode audio  # audio understanding
  python3 examples/generate_qwen25_omni.py --mode full   # all modalities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass model_path and compiled_path as function parameters instead of
using global statement after referencing module-level constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 and others added 4 commits April 13, 2026 10:26
The upstream merge added tensor_capture_hook to hf_adapter's
prepare_inputs_for_generation but didn't extract it from kwargs,
and didn't add it to NeuronBaseForCausalLM.forward() signature.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement Neuron/Trainium support for the Talker (690M, 24 layers) and
Token2Wav DiT (85M, 22 blocks) components, enabling full speech synthesis
on Neuron hardware alongside the existing Thinker.

Talker (modeling_qwen25_omni_talker.py):
- NeuronQwen25OmniTalkerForCausalLM with NeuronBaseForCausalLM + ImageToTextModelWrapper
- Explicit head_dim=128 (not hidden_size/num_heads=74.67), TP=4 recommended
- Fused embedding: embed_tokens(8448,3584) + proj(3584,896) → (8448,896)
- mRoPE support with mrope_section=[16,24,24] for 3D position_ids
- ThinkerToTalkerProjection for CPU-side context encoding (3584→896)
- State dict conversion with QKV fusion, codec_head→lm_head mapping
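The QKV fusion and codec_head→lm_head mapping can be sketched as follows (key names follow the commit message; the concatenation layout is an assumption, shown with nested lists standing in for weight tensors):

```python
# Fuse separate q/k/v projection weights into one qkv weight (concatenated
# along the output dimension), and rename codec_head to lm_head. List
# concatenation stands in for a row-wise tensor concat.
def fuse_qkv(state_dict: dict) -> dict:
    out = {}
    for key, w in state_dict.items():
        if key.endswith("q_proj.weight"):
            base = key[: -len("q_proj.weight")]
            out[base + "qkv_proj.weight"] = (
                state_dict[base + "q_proj.weight"]
                + state_dict[base + "k_proj.weight"]
                + state_dict[base + "v_proj.weight"]
            )
        elif key.endswith(("k_proj.weight", "v_proj.weight")):
            continue  # absorbed into the fused qkv key
        elif key.startswith("codec_head."):
            out["lm_head." + key[len("codec_head."):]] = w
        else:
            out[key] = w
    return out

sd = {"attn.q_proj.weight": [[1]], "attn.k_proj.weight": [[2]],
      "attn.v_proj.weight": [[3]], "codec_head.weight": [[4]]}
print(fuse_qkv(sd))
# -> {'attn.qkv_proj.weight': [[1], [2], [3]], 'lm_head.weight': [[4]]}
```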

Token2Wav (modeling_qwen25_omni_token2wav.py):
- NeuronQwen25OmniToken2WavWithNeuronDiT compiles DiT via torch_neuronx.trace()
- ODE solver loop + BigVGAN vocoder stay on CPU
- Monkeypatches DiT forward to redirect to Neuron during inference

Orchestration (modeling_qwen25_omni.py):
- enable_talker/compile_talker/load_talker for CPU and Neuron modes
- enable_token2wav/compile_token2wav_dit/load_token2wav_dit
- Factory methods for Neuron talker, config, projection, and token2wav classes

Tests:
- test_talker_neuron.py: 9 unit tests with auto-mock import hook for CPU testing
- test_e2e_qwen25_omni.py: 4 end-to-end tests (text/image/audio→text, text→speech)
  validated on trn2.48xlarge with Qwen2.5-Omni-7B

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Measured per-component CPU timing: Talker 103s (41%), Token2Wav 118s (47%),
Thinker 31s (12%) for 14.1s audio output (17.9x RTF on trn2.48xlarge).
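As a sanity check on the reported numbers, the component times sum to the total and reproduce the stated real-time factor:

```python
# Cross-check the timing breakdown: component times should sum to the
# total, and total / audio duration should match the reported 17.9x RTF.
talker_s, token2wav_s, thinker_s = 103, 118, 31
total_s = talker_s + token2wav_s + thinker_s
audio_s = 14.1
print(total_s)                       # -> 252
print(round(total_s / audio_s, 1))   # -> 17.9
```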

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
whn09 marked this pull request as draft on April 14, 2026 02:34