Contrib: Add Qwen3.5-27B with hybrid DeltaNet + GQA architecture#128

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3.5-27b

Conversation

@jimburtoft
Contributor

Summary

  • Adds an NxDI contrib implementation of Qwen3.5-27B, a 27B-parameter dense model with a hybrid DeltaNet + GQA attention architecture
  • First NxDI implementation of linear recurrent attention (DeltaNet), built on custom NKI kernels
  • Includes text decoder, vision encoder (CPU), VL pipeline, 3 NKI kernels, 42 unit tests, and 8 integration tests
  • All 50 tests pass on trn2.3xlarge with SDK 2.29

Architecture

| Feature | Value |
|---|---|
| Parameters | 27B dense |
| Layers | 64: [3 DeltaNet + 1 GQA] × 16 |
| Hidden size | 5120 |
| MLP | SwiGLU (5120 → 17408 → 5120) |
| GQA | 24 heads, 4 KV heads, head_dim=256 |
| DeltaNet | 48 value heads, 16 key heads, k_dim=v_dim=128 |
| Position encoding | Partial RoPE (25% of head_dim) |
| Vocabulary | 248,320 |
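The "64: [3 DeltaNet + 1 GQA] × 16" row can be made concrete with a small sketch. This is an illustrative reconstruction of the repeating layer pattern, not code from the PR; the function name and string labels are assumptions:

```python
def layer_types(num_layers: int = 64, period: int = 4) -> list:
    """Assign a type to each layer index: within every block of `period`
    layers, the last one is GQA and the rest are DeltaNet."""
    return ["gqa" if (i + 1) % period == 0 else "deltanet"
            for i in range(num_layers)]

types = layer_types()
assert len(types) == 64
assert types.count("deltanet") == 48  # 3 per block of 4, 16 blocks
assert types.count("gqa") == 16       # 1 per block of 4, 16 blocks
```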

Key architectural features

  • Hybrid DeltaNet + GQA: 48/64 layers use gated DeltaNet (linear recurrent attention with delta rule), 16 layers use standard GQA with KV cache
  • Custom NKI kernels: Three kernels for DeltaNet forward — recurrent (TKG), per-chunk, and fused chunked (CTE) with Neumann series intra-chunk correction
  • GQA output gate: Sigmoid gate split from interleaved q_proj weights
  • +1 RMSNorm convention: norm(x) * (1 + weight) with selective +1 addition during weight conversion
  • Vision-Language support: Optional ViT encoder on CPU (HBM fully consumed by 27B text decoder)
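To illustrate the "linear recurrent attention with delta rule" mentioned above: a gated DeltaNet layer maintains a fixed-size state matrix instead of a growing KV cache. The following is a minimal single-head, single-step sketch of the gated delta-rule recurrence, not the NKI kernel itself; all names and the exact gating form are assumptions for illustration:

```python
import numpy as np

def deltanet_step(S, q, k, v, beta, g):
    """One recurrent step of a gated delta rule (illustrative sketch).
    S: (d_v, d_k) state matrix; q, k: (d_k,); v: (d_v,);
    beta: write strength in [0, 1]; g: scalar decay gate in [0, 1]."""
    S = g * S                            # gated decay of the state
    S = S - beta * np.outer(S @ k, k)    # delta rule: erase old value at key k
    S = S + beta * np.outer(v, k)        # write the new value at key k
    return S @ q, S                      # read out with the query

d_k, d_v = 4, 6
S = np.zeros((d_v, d_k))
k = np.array([1.0, 0.0, 0.0, 0.0])       # unit-norm key
v = np.arange(6.0)
o, S = deltanet_step(S, q=k, k=k, v=v, beta=1.0, g=1.0)
# with beta=1 and a unit-norm key, the state now stores v exactly at k,
# so reading with q=k returns v
```

The state `S` has constant size regardless of sequence length, which is what makes the recurrent (TKG) kernel cheap per token; the chunked and fused kernels in this PR parallelize the same recurrence over chunks for context encoding.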

Test Results

Unit Tests (42/42 PASS, CPU only)

| Module | Tests |
|---|---|
| test_config.py | 26/26 |
| test_weight_conversion.py | 16/16 |

Integration Tests (8/8 PASS, trn2.3xlarge, TP=4, SDK 2.29)

| Test | Status |
|---|---|
| Model loads | PASS |
| Model generates | PASS |
| Output coherence | PASS |
| Top token valid | PASS |
| Capital of France → "Paris" | PASS |
| TTFT performance | PASS |
| Throughput | PASS |
| Multi-prompt generation (4 prompts) | PASS |

Performance (trn2.3xlarge, TP=4, LNC=2, BF16, seq_len=128, bs=1)

| Metric | Value |
|---|---|
| TTFT (P50) | 576 ms |
| TPOT (P50) | 53 ms |
| Throughput | 18.9 tok/s |
| Compilation | ~13 min |
| HBM usage | 23.57 / 24 GB |
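As a quick sanity check (my own arithmetic, not from the PR), the reported throughput is consistent with the decode latency: at batch size 1, steady-state throughput is roughly the reciprocal of TPOT.

```python
# Relate the two reported decode metrics:
tpot_s = 0.053                 # 53 ms per output token (P50)
throughput = 1.0 / tpot_s      # tokens/s at bs=1, ignoring prefill
# 1 / 0.053 ≈ 18.87 tok/s, consistent with the reported 18.9 tok/s
```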

Files (15 files, ~6600 lines)

contrib/models/Qwen3.5-27B/
├── README.md                           # 283 lines
├── src/
│   ├── __init__.py                     # Public exports
│   ├── modeling_qwen35.py              # 2493 lines (text decoder)
│   ├── modeling_qwen35_vision.py       # 818 lines (vision encoder)
│   ├── modeling_qwen35_vl.py           # 662 lines (VL pipeline)
│   └── nki_kernels/
│       ├── __init__.py
│       ├── nki_deltanet.py             # 334 lines (recurrent kernel)
│       ├── nki_deltanet_chunked.py     # 320 lines (per-chunk kernel)
│       └── nki_deltanet_fused.py       # 574 lines (fused chunked kernel)
└── test/
    ├── unit/
    │   ├── test_config.py              # 200 lines (26 tests)
    │   └── test_weight_conversion.py   # 434 lines (16 tests)
    └── integration/
        └── test_model.py               # 469 lines (8 tests)

Contrib-only

This PR only adds files under contrib/models/Qwen3.5-27B/. No changes to NxDI src/.

Checklist

  • Contrib-only (no changes to NxDI src/)
  • Unit tests (42/42 pass)
  • Integration tests (8/8 pass on trn2.3xlarge, SDK 2.29)
  • README with architecture details, benchmarks, and usage
  • Apache 2.0 license headers
  • SDK 2.29+ / NKI 0.3.0 required

First NxDI implementation of a model using linear recurrent attention
(DeltaNet) with custom NKI kernels for Neuron.

Architecture: 27B dense, 64 layers [3 DeltaNet + 1 GQA] x 16,
hidden_size=5120, SwiGLU MLP, partial RoPE, attention output gate.

Includes:
- Text decoder (modeling_qwen35.py) with DeltaNet state persistence
- Vision encoder (modeling_qwen35_vision.py) for VL pipeline
- VL pipeline (modeling_qwen35_vl.py) with CPU vision + Neuron text
- 3 NKI kernels: recurrent (TKG), chunked, fused chunked (CTE)
- 42 unit tests (config + weight conversion)
- 8 integration tests (all passing on trn2.3xlarge, SDK 2.29)
- Comprehensive README with benchmarks and usage examples

Tested on trn2.3xlarge (TP=4, LNC=2, SDK 2.29, NKI 0.3.0):
- TTFT: 576ms, TPOT: 53ms, throughput: 18.9 tok/s
- HBM usage: 23.57/24 GB per NeuronCore pair
- 50/50 tests pass (42 unit + 8 integration)
