
Contrib: Solar Open 100B (upstage/Solar-Open-100B) on Trainium2#107

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/solar-open-100b

Conversation

@jimburtoft
Contributor

Adds NeuronX Distributed Inference support for upstage/Solar-Open-100B, a 102.6B-parameter Mixture-of-Experts model (128 routed + 1 shared expert per layer, top-8 sigmoid routing). The implementation includes custom weight loading for the HuggingFace safetensors format, a corrected YaRN RoPE implementation, and an optimized attention NKI kernel configuration that delivers a 34% TKG improvement over baseline.

Model Information

Model Name: Solar Open 100B
Model Architecture: Decoder-only MoE transformer (48 layers, 128+1 experts/layer, GQA 64Q/8KV, hidden_size=4096)
Purpose: Text generation (102.6B total params, 12B active per token)
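The routing scheme described above (sigmoid scoring over 128 routed experts, keep the top 8, plus one always-active shared expert) can be sketched as follows. Shapes and the per-token weight normalization are illustrative assumptions based on the description, not the PR's actual implementation:

```python
import torch

def route_tokens(hidden, router_w, num_top=8):
    """Sigmoid top-k routing sketch: hidden [T, H], router_w [H, E]."""
    scores = torch.sigmoid(hidden @ router_w)            # [T, E] per-expert affinity
    top_vals, top_idx = torch.topk(scores, num_top, -1)  # keep top-8 experts per token
    # Normalize the kept weights so they sum to 1 per token (a common
    # convention; the real model may normalize differently).
    weights = top_vals / top_vals.sum(-1, keepdim=True)
    return top_idx, weights

T, H, E = 4, 4096, 128
idx, w = route_tokens(torch.randn(T, H), torch.randn(H, E) * 0.02)
# Each token activates 8 of 128 routed experts; the shared expert's output
# is added unconditionally on top of the routed mixture.
```

This is why only ~12B of the 102.6B parameters are active per token: 8 routed experts plus the shared expert, out of 129 per layer.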

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)
    • 4 integration tests: smoke test, logit accuracy (16 tokens, teacher forcing), CTE performance, TKG performance
    • Uses check_accuracy_logits_v2 with multi-tiered tolerances (top-5/50/1000/all)
    • Zero token divergence vs CPU HuggingFace reference
  • README.md with the following sections:
    • Usage Example: Complete code example with kernel-optimized configuration
    • Compatibility Matrix: Tested on trn2.48xlarge with SDK 2.28
    • Example Checkpoints: Link to HuggingFace model
    • Testing Instructions: Full instructions including CPU reference logit generation
  • Source Code (src/)
    • modeling_solar_open.py (1095 lines): Full inference implementation based on GPT-OSS/DeepSeek-V3 architecture
    • Custom weight loading for HF per-expert safetensors to fused format
    • Corrected YaRN RoPE implementation (ramp boundaries + interpolation formula)
    • GLU activation fix (glu_type="glu" not "swiglu")
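The per-expert-to-fused weight conversion called out above amounts to stacking each expert's gate and up projections into a single [E, H, 2*I] tensor. A hedged sketch; the checkpoint key names (`mlp.experts.{e}.gate_proj.weight`, etc.) are illustrative, not necessarily the model's actual layout:

```python
import torch

def fuse_expert_weights(state_dict, num_experts, hidden, inter):
    """Stack per-expert gate/up projections into a fused [E, H, 2*I] tensor."""
    fused = torch.empty(num_experts, hidden, 2 * inter)
    for e in range(num_experts):
        # HF linear layers store weights as [out, in] = [I, H];
        # transpose to [H, I] before concatenating along the last dim.
        gate = state_dict[f"mlp.experts.{e}.gate_proj.weight"].T
        up = state_dict[f"mlp.experts.{e}.up_proj.weight"].T
        fused[e] = torch.cat([gate, up], dim=-1)  # [H, 2*I]
    return fused
```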

Optional Components

  • Unit Tests (not included in this submission)

Folder Structure

/contrib/models/Solar-Open-100B/
  README.md
  /src
    __init__.py
    modeling_solar_open.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?
Tested on trn2.48xlarge (us-east-2) with SDK 2.28 (torch-neuronx 2.9.0.2.12, NxDI 0.8.16251). All 4 tests pass with optimized kernel configuration (fused_qkv=True, qkv_kernel_enabled=True, qkv_nki_kernel_enabled=True).
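The settings behind these results can be collected in one place like the fragment below. The flag names come from this PR; how they map onto NxDI's config objects in a given SDK release is an assumption, so consult the PR's README for the exact usage:

```python
# Configuration used for the reported numbers (flag names from the PR;
# the exact NeuronConfig plumbing may differ across NxDI versions).
neuron_kwargs = {
    "tp_degree": 64,             # trn2.48xlarge
    "torch_dtype": "bfloat16",
    "seq_len": 4096,             # default raised from 128 in this PR
    "fused_qkv": True,
    "qkv_kernel_enabled": True,
    "qkv_nki_kernel_enabled": True,
}
```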
Test Results:

================================================================================
Solar Open 100B Integration Tests
================================================================================
1. Smoke Test...
PASS: Smoke test - Model loaded successfully
2. Logit Accuracy Test (CPU reference vs Neuron)...
  Reference logits shape: torch.Size([16, 1, 196608])
  Prompt: 'The capital of France is'
PASS: Logit accuracy validated (16 tokens)
3. CTE Performance Test...
  CTE latency: 1565.2 ms (avg of 5 runs)
PASS: CTE latency 1565.2 ms
4. TKG Performance Test...
  TKG latency: 12.0 ms (83.4 tok/s)
PASS: TKG latency 12.0 ms (83.4 tok/s)
================================================================================
All tests passed!
================================================================================

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28 (neuronxcc 2.23.6484)
  • Instance Type(s): trn2.48xlarge (tp=64)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

Performance Highlights

  • TKG: 11.83 ms / 84.5 tok/s (with attention NKI kernels)
  • CTE: 1,565 ms at seq_len=4096
  • 34% TKG improvement from attention NKI kernels vs baseline
  • Maximum seq_len: 32,768; Maximum batch: 4 at seq_len=4096
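The tok/s figures follow directly from the per-token decode (TKG) latency; a quick sanity check of the arithmetic:

```python
def tokens_per_second(tkg_latency_ms):
    # Single-stream decode throughput: one token per TKG step.
    return 1000.0 / tkg_latency_ms

print(round(tokens_per_second(11.83), 1))  # 84.5, matching the NKI-kernel figure
print(round(tokens_per_second(12.0), 1))   # 83.3; the test log's 83.4 comes from
                                           # the unrounded measured latency
```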

Known Limitations

  • MoE NKI kernels cannot be used (intermediate_size/tp = 1280/64 = 20, requires % 128 == 0)
  • CPU reference logits require transformers >= 5.0 (separate venv from NxDI)
  • seq_len > 32768 and batch > 4 at seq_len=4096 fail with serialization error
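The first limitation is a simple divisibility constraint on the per-core expert intermediate size; a sketch, with the multiple-of-128 requirement taken from the limitation as stated:

```python
def moe_kernel_supported(intermediate_size, tp_degree, multiple=128):
    """MoE NKI kernels need the sharded intermediate size to be a multiple of 128."""
    per_core = intermediate_size // tp_degree
    return per_core % multiple == 0

# Solar Open 100B at TP=64: 1280 / 64 = 20, not a multiple of 128
print(moe_kernel_supported(1280, 64))   # False
# By contrast, an intermediate size of 16384 at TP=64 gives 256, which qualifies
print(moe_kernel_supported(16384, 64))  # True
```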

Issues Fixed During Onboarding (5 total)

  1. hidden_act override: Config incorrectly defaulted to "sigmoid" instead of "silu"
  2. HF weight format: Per-expert safetensors needed conversion to fused [E, H, 2*I] format
  3. YaRN RoPE: Inverted ramp boundaries + wrong interpolation formula
  4. glu_type mismatch: Solar Open requires "glu" not "swiglu" (despite SiLU activation)
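For context on fix 3: YaRN blends position-interpolated and original (extrapolated) inverse frequencies using a linear ramp between two boundary indices, so swapping the boundaries or inverting the blend gives exactly the kind of bug described. A sketch following the widely used formulation (as in HuggingFace Transformers), not necessarily this PR's exact code:

```python
import torch

def linear_ramp(low, high, n):
    """0 below `low`, 1 above `high`, linear in between (boundary order matters)."""
    if low == high:
        high += 1e-3  # avoid division by zero
    ramp = (torch.arange(n, dtype=torch.float32) - low) / (high - low)
    return torch.clamp(ramp, 0.0, 1.0)

def yarn_inv_freq(base, dim, scale, low, high):
    freqs = base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    inv_extrap = 1.0 / freqs            # original RoPE frequencies
    inv_interp = 1.0 / (scale * freqs)  # position-interpolated frequencies
    mask = 1.0 - linear_ramp(low, high, dim // 2)
    # High-frequency dims (index < low) keep extrapolation; low-frequency
    # dims (index > high) use interpolation; the ramp blends in between.
    return inv_interp * (1.0 - mask) + inv_extrap * mask
```

With the boundaries inverted (`high < low`) the ramp runs backwards, which silently degrades long-context accuracy rather than failing loudly.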

Related Issues

N/A - New model contribution

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Commits

1. NeuronX Distributed Inference implementation of upstage/Solar-Open-100B,
   a 102.6B MoE model (128 routed + 1 shared expert, top-8 sigmoid routing).

   - TP=64 on trn2.48xlarge, BF16
   - Logit validation passes (check_accuracy_logits_v2, 16 tokens)
   - CTE: 341.7 ms, TKG: 10.2 ms (98 tok/s)
   - 5 architecture issues fixed during onboarding (hidden_act, weight format,
     YaRN RoPE, glu_type)

2. …G improvement)

   Enable fused_qkv, qkv_kernel_enabled, and qkv_nki_kernel_enabled for
   84.5 tok/s TKG (up from 55.8 baseline). Update README with kernel
   benchmark results, seq_len/batch sweep tables, and revised performance
   metrics. Increase default seq_len from 128 to 4096. All 4 integration
   tests pass on trn2.48xlarge with the new config.
