Contrib: Add Qwen3.5-27B with hybrid DeltaNet + GQA architecture#128

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3.5-27b

Conversation

@jimburtoft
Contributor

Summary

  • Adds an NxDI contrib implementation of Qwen3.5-27B, a 27B-parameter dense model with a hybrid DeltaNet + GQA attention architecture
  • First NxDI implementation of linear recurrent attention (DeltaNet), built on custom NKI kernels
  • Includes text decoder, vision encoder (CPU), VL pipeline, 3 NKI kernels, 42 unit tests, and 8 integration tests
  • All 50 tests pass on trn2.3xlarge with SDK 2.29

Architecture

| Feature | Value |
|---|---|
| Parameters | 27B dense |
| Layers | 64: [3 DeltaNet + 1 GQA] × 16 |
| Hidden size | 5120 |
| MLP | SwiGLU (5120 → 17408 → 5120) |
| GQA | 24 heads, 4 KV heads, head_dim=256 |
| DeltaNet | 48 value heads, 16 key heads, k_dim=v_dim=128 |
| Position encoding | Partial RoPE (25% of head_dim) |
| Vocabulary | 248,320 |
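The "64: [3 DeltaNet + 1 GQA] × 16" row can be made concrete with a small sketch. This is an illustrative reconstruction of the repeating layer pattern, not code from the PR; the function name and string labels are assumptions:

```python
def layer_types(num_layers: int = 64, period: int = 4) -> list:
    """Assign a type to each layer index: within every block of `period`
    layers, the last one is GQA and the rest are DeltaNet."""
    return ["gqa" if (i + 1) % period == 0 else "deltanet"
            for i in range(num_layers)]

types = layer_types()
assert len(types) == 64
assert types.count("deltanet") == 48  # 3 per block of 4, 16 blocks
assert types.count("gqa") == 16       # 1 per block of 4, 16 blocks
```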

Key architectural features

  • Hybrid DeltaNet + GQA: 48/64 layers use gated DeltaNet (linear recurrent attention with delta rule), 16 layers use standard GQA with KV cache
  • Custom NKI kernels: Three kernels for DeltaNet forward — recurrent (TKG), per-chunk, and fused chunked (CTE) with Neumann series intra-chunk correction
  • GQA output gate: Sigmoid gate split from interleaved q_proj weights
  • +1 RMSNorm convention: norm(x) * (1 + weight) with selective +1 addition during weight conversion
  • Vision-Language support: Optional ViT encoder on CPU (HBM fully consumed by 27B text decoder)
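To illustrate the "linear recurrent attention with delta rule" mentioned above: a gated DeltaNet layer maintains a fixed-size state matrix instead of a growing KV cache. The following is a minimal single-head, single-step sketch of the gated delta-rule recurrence, not the NKI kernel itself; all names and the exact gating form are assumptions for illustration:

```python
import numpy as np

def deltanet_step(S, q, k, v, beta, g):
    """One recurrent step of a gated delta rule (illustrative sketch).
    S: (d_v, d_k) state matrix; q, k: (d_k,); v: (d_v,);
    beta: write strength in [0, 1]; g: scalar decay gate in [0, 1]."""
    S = g * S                            # gated decay of the state
    S = S - beta * np.outer(S @ k, k)    # delta rule: erase old value at key k
    S = S + beta * np.outer(v, k)        # write the new value at key k
    return S @ q, S                      # read out with the query

d_k, d_v = 4, 6
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])       # unit-norm key
v = np.arange(6.0)
o, S = deltanet_step(S, q=k, k=k, v=v, beta=1.0, g=1.0)
# with beta=1 and a unit-norm key, the state now stores v exactly at k,
# so reading with q=k returns v
```

The state `S` has constant size regardless of sequence length, which is what makes the recurrent (TKG) kernel cheap per token; the chunked and fused kernels in this PR parallelize the same recurrence over chunks for context encoding.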

Test Results

Unit Tests (42/42 PASS, CPU only)

| Module | Tests |
|---|---|
| test_config.py | 26/26 |
| test_weight_conversion.py | 16/16 |

Integration Tests (8/8 PASS, trn2.3xlarge, TP=4, SDK 2.29)

| Test | Status |
|---|---|
| Model loads | PASS |
| Model generates | PASS |
| Output coherence | PASS |
| Top token valid | PASS |
| Capital of France → "Paris" | PASS |
| TTFT performance | PASS |
| Throughput | PASS |
| Multi-prompt generation (4 prompts) | PASS |

Performance (trn2.3xlarge, TP=4, LNC=2, BF16, seq_len=128, bs=1)

| Metric | Value |
|---|---|
| TTFT (P50) | 576 ms |
| TPOT (P50) | 53 ms |
| Throughput | 18.9 tok/s |
| Compilation | ~13 min |
| HBM usage | 23.57 / 24 GB |
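As a quick sanity check (my own arithmetic, not from the PR), the reported throughput is consistent with the decode latency: at batch size 1, steady-state throughput is roughly the reciprocal of TPOT.

```python
# Relate the two reported decode metrics:
tpot_s = 0.053                 # 53 ms per output token (P50)
throughput = 1.0 / tpot_s      # tokens/s at bs=1, ignoring prefill
# 1 / 0.053 ≈ 18.87 tok/s, consistent with the reported 18.9 tok/s
```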

Files (15 files, ~6600 lines)

contrib/models/Qwen3.5-27B/
├── README.md                           # 283 lines
├── src/
│   ├── __init__.py                     # Public exports
│   ├── modeling_qwen35.py              # 2493 lines (text decoder)
│   ├── modeling_qwen35_vision.py       # 818 lines (vision encoder)
│   ├── modeling_qwen35_vl.py           # 662 lines (VL pipeline)
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py             # 334 lines (recurrent kernel)
│       ├── nki_deltanet_chunked.py     # 320 lines (per-chunk kernel)
│       └── nki_deltanet_fused.py       # 574 lines (fused chunked kernel)
└── test/
    ├── unit/
    │   ├── test_config.py              # 200 lines (26 tests)
    │   └── test_weight_conversion.py   # 434 lines (16 tests)
    └── integration/
        └── test_model.py               # 469 lines (8 tests)

Contrib-only

This PR only adds files under contrib/models/Qwen3.5-27B/. No changes to NxDI src/.

Checklist

  • Contrib-only (no changes to NxDI src/)
  • Unit tests (42/42 pass)
  • Integration tests (8/8 pass on trn2.3xlarge, SDK 2.29)
  • README with architecture details, benchmarks, and usage
  • Apache 2.0 license headers
  • SDK 2.29+ / NKI 0.3.0 required

First NxDI implementation of a model using linear recurrent attention
(DeltaNet) with custom NKI kernels for Neuron.

Architecture: 27B dense, 64 layers [3 DeltaNet + 1 GQA] x 16,
hidden_size=5120, SwiGLU MLP, partial RoPE, attention output gate.

Includes:
- Text decoder (modeling_qwen35.py) with DeltaNet state persistence
- Vision encoder (modeling_qwen35_vision.py) for VL pipeline
- VL pipeline (modeling_qwen35_vl.py) with CPU vision + Neuron text
- 3 NKI kernels: recurrent (TKG), chunked, fused chunked (CTE)
- 42 unit tests (config + weight conversion)
- 8 integration tests (all passing on trn2.3xlarge, SDK 2.29)
- Comprehensive README with benchmarks and usage examples

Tested on trn2.3xlarge (TP=4, LNC=2, SDK 2.29, NKI 0.3.0):
- TTFT: 576ms, TPOT: 53ms, throughput: 18.9 tok/s
- HBM usage: 23.57/24 GB per NeuronCore pair
- 50/50 tests pass (42 unit + 8 integration)
