
Add MiniMax-M2 and MiMo-V2-Flash model support#119

Open
whn09 wants to merge 11 commits into aws-neuron:main from whn09:contrib/llm-models

Conversation


@whn09 whn09 commented Apr 9, 2026

Description

Add NxDI contrib model adapters for MiniMax-M2 and MiMo-V2-Flash, two large-scale Mixture-of-Experts (MoE) models. Both require a trn2.48xlarge instance with 64 logical cores (LNC=2) and feature custom MoE routing, partial RoPE, and architecture-specific attention mechanisms. The PR also includes vLLM-neuron integration, benchmark scripts, and performance results.

Model Information

Model Name: MiniMax-M2, MiMo-V2-Flash

Model Architecture: Decoder-only MoE transformer

Purpose: Text generation

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • MiniMax-M2: 3/3 integration tests passed (import, config, neuron config)
    • MiMo-V2-Flash: 4/4 integration tests passed (import, config, neuron config, state dict converter)
  • README.md with the following sections:

    • Usage Example: Code examples showing model compilation and inference
    • Compatibility Matrix: Tested on Trn2 (trn2.48xlarge) with Neuron SDK 2.22+
    • Example Checkpoints: Links to HuggingFace model hubs
    • Testing Instructions: pytest commands for integration tests
  • Source Code (src/)

    • models/minimax_m2/ — MiniMax-M2: NKI attention kernel, Neuron-native QK norm, sigmoid router with e_score_correction_bias, fused_qkv, partial RoPE
    • models/mimo_v2/ — MiMo-V2-Flash: hybrid attention (full + sliding window), asymmetric head dims, attention sink bias
    • models/mimo_v2/conversion_script/ — FP8 to BF16 weight converter
    • utils/constants.py — Register both models in MODEL_TYPES for vllm-neuron discovery
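
For reference, the registry hook can be sketched roughly as below. This is a hypothetical minimal shape, not the actual contents of `utils/constants.py`; the MiMo class name `NeuronMiMoV2ForCausalLM` and the `resolve_model_class` helper are assumptions for illustration.

```python
# Hypothetical sketch of the MODEL_TYPES registry pattern described above.
# The real entries live in src/neuronx_distributed_inference/utils/constants.py
# and may use a different value shape.
MODEL_TYPES = {
    # model_type (from the HF config) -> NxDI causal-LM class path
    "minimax_m2": "neuronx_distributed_inference.models.minimax_m2."
                  "modeling_minimax_m2.NeuronMiniMaxM2ForCausalLM",
    "mimo_v2": "neuronx_distributed_inference.models.mimo_v2."
               "modeling_mimo_v2.NeuronMiMoV2ForCausalLM",
}

def resolve_model_class(model_type: str) -> str:
    """Look up the implementation class path registered for an HF model_type."""
    if model_type not in MODEL_TYPES:
        raise KeyError(f"model_type {model_type!r} is not registered in MODEL_TYPES")
    return MODEL_TYPES[model_type]
```

vllm-neuron discovery then only needs the HF `model_type` string to find the NxDI implementation.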

Optional Components

  • Unit Tests — Integration tests in test/integration/

Folder Structure

contrib/models/MiniMax-M2/
  README.md
  src/
    modeling_minimax_m2.py          # Re-exports from src/ (thin wrapper)
  test/
    integration/
      test_model.py

contrib/models/MiMo-V2-Flash/
  README.md
  src/
    modeling_mimo_v2.py             # Re-exports from src/ (thin wrapper)
  test/
    integration/
      test_model.py

src/neuronx_distributed_inference/
  models/
    minimax_m2/
      __init__.py
      config.json                     # Bundled HF config for unit tests
      configuration_minimax_m2.py     # Custom config class
      modeling_minimax_m2.py          # Full implementation (~1400 lines)
        - NKI attention kernel (nki-library attention_block_tkg with partial RoPE)
        - Neuron-native QK norm (RmsNorm.apply, per-rank, no all-reduce)
        - RouterTopKWithBias (e_score_correction_bias as nn.Parameter)
        - Fused QKV support with correct state dict key ordering
    mimo_v2/
      __init__.py
      modeling_mimo_v2.py             # Full implementation (~1333 lines)
        - Hybrid attention (9 full + 39 sliding window layers)
        - Asymmetric head dims (Q/K=192, V=128)
        - Attention sink bias
      conversion_script/
        preprocess_mimo_v2_fp8.py     # FP8 -> BF16 converter
  utils/
    constants.py                      # MODEL_TYPES entries for both models

perf_test/
  README.md                           # Benchmark plan and instructions
  0_setup.sh                          # vllm-neuron install + model download
  1_bench_mimo_v2_flash.sh            # MiMo benchmark (BS=1/32/128)
  2_bench_minimax_m2.sh               # MiniMax benchmark (BS=1/256)
  vllm-neuron-mimo-minimax.patch      # vllm-neuron patch for MiMo/MiniMax

Testing

How did you test this change?

Tested on trn2.48xlarge with TP=64, LNC=2, Neuron SDK 2.22+.

MiniMax-M2:

  • 3/3 integration tests passed
  • Architecture: 62 layers, 256 experts (top-8), hidden=3072, Q=48/KV=8 heads, partial RoPE (50%)
  • Correctness verified: 4/4 sanity prompts answered correctly (math, knowledge, reasoning, generation)
| Config | Concurrency | Throughput (tok/s) | TPOT (ms) | TTFT (ms) |
|---|---|---|---|---|
| BS=1, TP=64/EP=1, fused_qkv | 1 | 40.56 | 12.89 | 1075 |
| BS=256, TP=64/EP=64, CB | 1 | 5.76 | 173.83 | 165 |
| BS=256, TP=64/EP=64, CB | 16 | 54.69 | 287.09 | 513 |
| BS=256, TP=64/EP=64, CB | 32 | 75.85 | 408.66 | 1066 |
| BS=256, TP=64/EP=64, CB | 128 | 106.72 | 1158.08 | 3950 |
| BS=256, TP=64/EP=64, CB | 256 | 128.94 | 1860.69 | 11263 |

MiMo-V2-Flash:

  • 4/4 integration tests passed
  • Architecture: 48 layers, 256 experts (top-8), hidden=4096, Q=64 heads, asymmetric Q/K=192 + V=128

Standalone NxDI (BF16, TP=64, EP=64):

| Batch Size | Throughput (tok/s) |
|---|---|
| 1 | 29.92 |
| 8 | 215.94 |
| 32 | 649.14 |

vLLM Serving (BS=32, TP=64/EP=64, CB, 900/90 tokens):

| Concurrency | Throughput (tok/s) | TPOT (ms) | TTFT (ms) |
|---|---|---|---|
| 1 | 27.98 | 33.65 | 222 |
| 16 | 224.57 | 64.95 | 570 |
| 32 | 302.61 | 90.23 | 1351 |

Note: Large MoE models require extended engine startup time (~30-50 min for compile+load). Set VLLM_ENGINE_READY_TIMEOUT_S=3600 before launching the vLLM server.
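
A minimal way to set this from Python before the server process starts (the variable name is as documented in this PR; whether vLLM reads it at import time or engine-start time depends on the vllm-neuron version):

```python
import os

# Raise vLLM's engine-ready timeout to 1 hour so the 30-50 min
# compile+load phase of a large MoE model does not trip the readiness
# check. Must be set in the environment before the server launches.
os.environ["VLLM_ENGINE_READY_TIMEOUT_S"] = "3600"
```

Exporting the variable in the shell that launches `vllm serve` works the same way.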

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22+
  • Instance Type(s): Trn2 (trn2.48xlarge)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

MiniMax-M2 key features:

  • NKI attention kernel: nki-library attention_block_tkg with partial RoPE support (rotary_dim=64, head_dim=128) and flat QK RMSNorm fused into the kernel
  • Neuron-native QK norm: uses RmsNorm.apply (the AwsNeuronRmsNorm custom call) instead of hand-rolled PyTorch ops, which previously compiled into different HLO in the CE vs. TG NEFFs
  • Router bias: e_score_correction_bias preserved as nn.Parameter (not register_buffer) with non-uniform init to prevent XLA optimization from eliminating the add operation
  • Fused QKV: convert_state_dict_to_fused_qkv runs before key rename to match expected state dict key paths
  • Partial RoPE (50% of head_dim=128)
  • Sigmoid router with learnable e_score_correction_bias
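
The routing scheme above can be sketched in plain PyTorch. This is a simplified illustration, not the NxDI implementation: the class name, init choices, and output-weighting details here are assumptions; only the sigmoid scoring, the bias-as-`nn.Parameter` decision, and top-8 selection come from the PR.

```python
import torch
import torch.nn as nn

class RouterTopKWithBiasSketch(nn.Module):
    """Illustrative sigmoid router with a score-correction bias.

    The bias is an nn.Parameter (not a buffer) with non-uniform init so the
    compiler cannot prove the add is a no-op and eliminate it. The PR notes
    that dropping the bias misroutes ~75% of tokens, because bias values of
    ~8-9 dominate sigmoid scores in (0, 1).
    """

    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.e_score_correction_bias = nn.Parameter(torch.randn(num_experts))

    def forward(self, x: torch.Tensor):
        scores = torch.sigmoid(self.gate(x))            # [tokens, experts], in (0, 1)
        biased = scores + self.e_score_correction_bias  # bias shifts expert ranking
        # Select experts by biased score; weight outputs by the raw scores.
        _, expert_idx = torch.topk(biased, self.top_k, dim=-1)
        weights = torch.gather(scores, -1, expert_idx)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return expert_idx, weights
```

With MiniMax-M2's numbers this would be `RouterTopKWithBiasSketch(3072, 256, 8)`, selecting 8 of 256 experts per token.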

MiMo-V2-Flash key features:

  • Hybrid attention: 9 full attention + 39 sliding window layers
  • Asymmetric head dims (Q/K=192, V=128) — fused_qkv not supported
  • Attention sink bias in sliding window layers
  • EP hybrid sharding (EP=64 for prefill, EP=1 for token generation)
  • Requires FP8→BF16 conversion (OCP block-wise FP8 not compatible with Neuron)
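
The FP8-to-BF16 step can be illustrated with a minimal block-wise dequantization sketch. The tile layout and function name here are assumptions; the actual converter is `conversion_script/preprocess_mimo_v2_fp8.py` and may differ in layout and dtype handling.

```python
import torch

def dequantize_blockwise_fp8(q: torch.Tensor, scales: torch.Tensor,
                             block: int = 128) -> torch.Tensor:
    """Sketch of block-wise FP8 -> BF16 dequantization.

    Assumes `q` holds the FP8 values already upcast to a float dtype and
    `scales` holds one scale per (block x block) tile, so each tile of the
    weight matrix is multiplied by its own scale before casting to BF16.
    """
    out = q.to(torch.float32).clone()
    for bi in range(scales.shape[0]):
        for bj in range(scales.shape[1]):
            r0, c0 = bi * block, bj * block
            out[r0:r0 + block, c0:c0 + block] *= scales[bi, bj]
    return out.to(torch.bfloat16)
```

The point of the per-tile scale is that OCP block-wise FP8 checkpoints cannot be loaded directly on Neuron, so the scales must be folded into the weights once, offline.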

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions (see READMEs and perf_test/ directory)
  • vllm-neuron patch included (perf_test/vllm-neuron-mimo-minimax.patch)

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Ubuntu and others added 2 commits April 9, 2026 06:13
- MiniMax M2: Custom MoE (62 layers, 256 experts, top-8, sigmoid router,
  QK norm, partial RoPE, fused_qkv). TP=64 on trn2.48xlarge.
- MiMo-V2-Flash: Custom MoE (48 layers, 256 experts, top-8, hybrid
  attention with full + sliding window, asymmetric head dims Q/K=192 V=128,
  attention sink bias). TP=64, EP=64 on trn2.48xlarge.

Both models include:
- Model implementations in src/neuronx_distributed_inference/models/
- Contrib wrappers following standard NxDI pattern
- Integration tests
- READMEs with architecture details and usage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MiMoV2InferenceConfig requires 27 attributes at init time.
Test now checks get_required_attributes() without instantiating config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

whn09 commented Apr 9, 2026

End-to-End Test Results on trn2.48xlarge

Both models have been compiled and tested on a trn2.48xlarge instance (32 NeuronCores, logical_nc_config=2 → 64 logical cores) using NxDI 2.22+ (PyTorch 2.9).


MiMo-V2-Flash (XiaomiMiMo/MiMo-V2-Flash)

Configuration:

  • TP=64, EP=64 (hybrid sharding: CTE EP=64/TP=1, TKG EP=1/TP=64)
  • logical_nc_config=2, fused_qkv=False (asymmetric Q/K=192, V=128)
  • save_sharded_checkpoint=True (64 shards of ~19.3 GB each)
  • BF16 weights (dequantized from FP8 original)

Result: ✅ Passed — generates coherent text output


MiniMax-M2 (MiniMax/MiniMax-M2)

Configuration:

  • TP=64, EP=64 (hybrid sharding: CTE EP=64/TP=1, TKG EP=1/TP=64)
  • logical_nc_config=2, fused_qkv=True, sigmoid router
  • save_sharded_checkpoint=True (64 shards, 857GB total)
  • BF16 weights from s3://datalab/minimax/model_hf/MiniMax-M2-BF16/

Important: Must NOT enable use_shard_on_intermediate_dynamic_while in blockwise_matmul_config — intermediate_size=1536 with TP=64 yields I_TP=24, which gets padded to 256 (10.7× inflation), causing compiler OOM.
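
The inflation factor in that warning checks out arithmetically (the padding target of 256 is taken from the note above):

```python
# Per-rank intermediate size under TP=64, and the padding inflation the
# note above warns about when use_shard_on_intermediate_dynamic_while
# pads the per-rank dimension up to 256.
intermediate_size = 1536
tp_degree = 64

i_tp = intermediate_size // tp_degree   # 24 per rank
padded = 256                            # padding target from the note
inflation = padded / i_tp               # ~10.7x

print(i_tp, round(inflation, 1))        # -> 24 10.7
```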

Result: ✅ Passed — generates correct, coherent output with chat template

Sample outputs:

Prompt: "What is 2+3? Answer briefly."
Response: <think>The user asks: "What is 2+3? Answer briefly."
This is a simple arithmetic question. The answer is 5.</think>
5

Prompt: "Explain what a neural network is in one sentence."
Response: A neural network is a computational model made up of interconnected
nodes (neurons) that process inputs via weighted connections and activation
functions to learn patterns and generate outputs such as predictions or
classifications.

Prompt: "Write a haiku about the ocean."
Response: Moonlit tide whispers / silver over midnight waves / shells hold ancient breath

Test Environment

  • Instance: trn2.48xlarge (2TB RAM, 1.7TB NVMe)
  • Software: NxDI from this branch (contrib/llm-models), installed via pip install -e .
  • Compiler: neuronx-cc (NeuronSDK 2.22+)


whn09 commented Apr 10, 2026

Benchmark Results: MiMo-V2-Flash & MiniMax-M2 on trn2.48xlarge

Environment

  • Instance: trn2.48xlarge (32 NeuronCores, logical_nc_config=2 → 64 logical cores)
  • neuronx-cc: 2.24.5133.0+58f8de22
  • torch-neuronx: 2.9.0.2.13.24727+8e870898
  • vllm-neuron: release-0.5.0 (with hf_config passthrough + snapshot_download patch)
  • NxDI: this PR branch

Configuration

  • TP=64, EP=1, non-continuous-batching, BS=1
  • Input: ~900 tokens, Output: ~90 tokens, 16 requests, concurrency=1
  • MiMo: fused_qkv=false, async_mode=false, moe_mask_padded_tokens=true, blockwise_matmul_config={use_shard_on_intermediate_dynamic_while: false, skip_dma_token: true}
  • MiniMax: fused_qkv=true, async_mode=false, moe_mask_padded_tokens=true, blockwise_matmul_config={use_shard_on_intermediate_dynamic_while: false, skip_dma_token: true}

Results

| Metric | MiMo-V2-Flash | MiniMax-M2 |
|---|---|---|
| Successful requests | 16 | 16 |
| Failed requests | 0 | 0 |
| Output throughput (tok/s) | 46.17 | 39.35 |
| Peak output throughput (tok/s) | 91.00 | 74.00 |
| Total token throughput (tok/s) | 506.43 | 431.61 |
| Mean TTFT (ms) | 1027.72 | 1079.53 |
| Median TTFT (ms) | 1026.56 | 1079.03 |
| P99 TTFT (ms) | 1038.51 | 1084.67 |
| Mean TPOT (ms) | 10.39 | 13.61 |
| Median TPOT (ms) | 10.36 | 13.59 |
| P99 TPOT (ms) | 10.57 | 13.77 |
| Mean ITL (ms) | 10.39 | 13.61 |
| P99 ITL (ms) | 12.15 | 15.93 |

Notes

  • MiMo-V2-Flash delivers ~17% higher output throughput and ~24% lower TPOT than MiniMax-M2
  • Both models load and serve successfully on trn2.48xlarge with TP=64/EP=1
  • EP=64 (Expert Parallel) configurations are currently blocked — both models hit RuntimeError: Cannot concatenate arrays with different element types: S32 vs S64 in blockwise matmul during compilation. This prevents testing continuous batching / high-throughput (large batch size) configs. Likely a limitation in neuronx-cc 2.24 with these MoE architectures.
  • Model weight loading takes ~10-20 min (MiMo ~570GB, MiniMax ~442GB); first-time compilation adds ~50 min per model

whn09 and others added 9 commits April 10, 2026 16:54
- Add vllm-neuron patch for MiMo/MiniMax architecture support
- Add benchmark scripts for MiMo-V2-Flash and MiniMax-M2 (multiple BS/EP configs)
- Add setup script for vllm-neuron installation and model weight download
- Update READMEs with vLLM serving instructions and patch documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The upstream merge added required attribute validation in InferenceConfig.__init__.
The test now provides a proper HF config via load_pretrained_config(hf_config=...).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add vLLM serving performance tables from trn2.48xlarge benchmarks:
- MiMo-V2-Flash: BS=32/EP=64 with c=1/16/32 (up to 302.61 tok/s)
- MiniMax-M2: Config 1 (BS=1/EP=1) and Config 2 (BS=256/EP=64) with c=1/16/32/128/256
- Add note about VLLM_ENGINE_READY_TIMEOUT_S=3600 for large MoE models

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…and router bias

Key improvements over previous v3 implementation:
- Add nki-library attention_block_tkg kernel override with partial RoPE (rotary_dim=64)
- Fix QK normalization: use Neuron-native RmsNorm per-rank instead of hand-rolled
  all-reduce (which compiled differently in CE vs TG NEFFs)
- Preserve e_score_correction_bias as nn.Parameter in RouterTopKWithBias (dropping
  the bias causes ~75% wrong expert selection since bias values ~8-9 dominate
  sigmoid scores 0-1)
- Fix fused QKV state dict ordering: convert_state_dict_to_fused_qkv must run
  BEFORE the qkv_proj key rename, not after
- Add KV cache rank-3 to rank-4 reshape in NKI kernel override
- Fix NKI grid syntax for SDK 2.29 (plain int instead of nc() tuple)
- Drop V3 suffix from class names (MiniMaxM2InferenceConfig, NeuronMiniMaxM2ForCausalLM)
- Simplify contrib wrapper to direct re-export
Pass per-rank QK norm weights to the nki-library kernel's new
rmsnorm_QK_flat_* parameters, which normalize Q and K across all
heads concatenated (before head split) rather than per-head.

The per-rank weight slice is extracted via torch.index_select on
rank_util.rank for SPMD-compatible tracing, matching the approach
used in the non-NKI code path.

This should fix the NKI kernel output quality degradation -- QK norm
was previously skipped entirely in the NKI path because the kernel
only supported per-head norm.
- Fix _helper_concat_and_delete_qkv to use 'self_attn.qkv_proj.Wqkv' key path
  instead of 'self_attn.Wqkv' to match NxDI model parameter hierarchy
- Add --enable-nki-attention flag to inference script for toggling NKI kernel
- Set top_k=1 in GenerationConfig to match OnDeviceSamplingConfig global_topk=1
- Add attn_block_tkg_nki_kernel_cache_update flag for in-kernel KV cache update
MiniMax-M2: NKI attention kernel, correct QK norm, and router bias
Add imports and MODEL_TYPES entries for MiMo-V2 and MiniMax-M2 to enable
vllm-neuron model discovery. Fixes ImportError after PR #7 renamed
NeuronMiniMaxM2ForCausalLMV3 to NeuronMiniMaxM2ForCausalLM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>