Add YuE Music Generation contrib model (full-song from lyrics)#126

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/yue-music-generation

Conversation

@jimburtoft
Contributor

Summary

  • Adds YuE (M-A-P/HKUST) full-song music generation from lyrics to contrib
  • Two-stage NxDI pipeline: S1 (7B LLaMA, TP=2) generates coarse codec tokens with classifier-free guidance, S2 (1B LLaMA, TP=1) refines via teacher-forcing
  • Includes NKI MLP kernel optimization (20% speedup) and S2 batching
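A minimal orchestration sketch of the two-stage flow, assuming a JSON-over-stdout handoff between the parent and the stage workers (the real workers are `src/yue_stage1_worker.py` and `src/yue_stage2_worker.py`; the payload keys here are illustrative):

```python
# Illustrative sketch: each stage runs in its own process because NxDI
# models with different tensor-parallel degrees cannot coexist in one.
# Payload/result keys are hypothetical, not the PR's actual protocol.
import json
import subprocess
import sys

def run_stage(script, payload):
    """Launch a stage worker in a fresh process and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, script, json.dumps(payload)],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)

def generate_song(lyrics_path, genre_path):
    # Stage 1 (7B, TP=2): coarse codec tokens with classifier-free guidance.
    s1 = run_stage("src/yue_stage1_worker.py",
                   {"lyrics": lyrics_path, "genre": genre_path})
    # Stage 2 (1B, TP=1): refine the coarse tokens via teacher-forcing.
    s2 = run_stage("src/yue_stage2_worker.py",
                   {"coarse_tokens": s1["tokens"]})
    return s2["refined_tokens"]
```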

Model Details

  • Source: YuE (M-A-P/HKUST, Apache 2.0)
  • Architecture: S1 = 7B LLaMA (vocab ~84K), S2 = 1B LLaMA (vocab ~84K), xcodec_mini audio decoder
  • Instance: trn2.3xlarge

Performance

| Config | Pipeline Total | RTF | vs GPU (L4) |
| --- | --- | --- | --- |
| Baseline | 514 s | 17.6x | 2.8x faster |
| + KV-cache + NKI + bs=2 | ~440 s | ~15x | 3.3x faster |
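A quick consistency check on the table, assuming RTF here means wall-clock generation time divided by generated audio duration (the table does not state the definition or the clip length):

```python
# Sanity-check of the reported numbers under the assumed RTF definition:
# RTF = wall-clock seconds / seconds of audio produced.
def rtf(wall_clock_s, audio_s):
    return wall_clock_s / audio_s

audio_s = 514 / 17.6                 # implied clip length, about 29.2 s
assert abs(rtf(514, audio_s) - 17.6) < 1e-9
# The optimized config over the same clip length lands near the quoted ~15x:
print(round(rtf(440, audio_s), 1))   # -> 15.1
```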

S1: 34.5 tok/s (3.5x faster than L4 GPU). S2: 4.3x faster than L4 GPU with KV-cache.

Key Technical Contributions

  • Subprocess architecture: S1 and S2 run in separate processes (NxDI models with different TP cannot coexist)
  • Custom CFG generation loop: Maintains synchronized KV caches for conditional/unconditional rows
  • NKI MLP TKG kernel: Fuses RMSNorm + Gate/Up + SiLU + Down, eliminates 5-6 HBM round-trips per layer
  • S2 KV-cache optimization: 3.3x speedup over naive teacher-forcing loop
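The CFG loop's synchronization requirement can be sketched in a few lines: the conditional and unconditional rows are decoded as one batch of two, their logits are combined with a guidance scale, and the same sampled token is appended to both rows so their KV caches advance in lockstep. This is a schematic in plain NumPy with a caller-supplied step function, not the PR's implementation:

```python
# Schematic CFG decoding loop (illustrative, not the PR's code).
# step_fn(token_batch, cache) -> (last-position logits [2, vocab], cache)
# Row 0 is the conditional prompt, row 1 the unconditional one; equal
# prompt lengths are assumed so the two rows batch into one array.
import numpy as np

def cfg_generate(step_fn, prompt_cond, prompt_uncond, max_new, scale=1.5):
    out = list(prompt_cond)
    batch = np.array([prompt_cond, prompt_uncond])
    cache = None  # opaque incremental KV cache threaded through step_fn
    for _ in range(max_new):
        logits, cache = step_fn(batch, cache)
        # CFG combine: push logits away from the unconditional distribution.
        guided = logits[1] + scale * (logits[0] - logits[1])
        tok = int(np.argmax(guided))      # greedy for simplicity
        out.append(tok)
        # Feed the SAME new token to both rows so both caches stay aligned.
        batch = np.array([[tok], [tok]])
    return out
```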

Testing

  • 10 files: README.md, genre.txt, lyrics.txt, setup.sh, src/{__init__,nki_mlp_patch,yue_e2e_neuron,yue_stage1_worker,yue_stage2_worker}.py, test/{__init__,integration/{__init__,test_model},unit/__init__}.py
  • Integration tests: lyrics parsing, file format validation, E2E pipeline (with and without NKI kernels)
  • Validated on trn2.3xlarge with SDK 2.28

Two-stage LLaMA pipeline (S1 7B + S2 1B) with CFG, KV-cache-aware
teacher-forcing, NKI MLP TKG fused kernels (20% speedup), and S2
batching. 3.3x faster than L4 GPU end-to-end, 3.5x on S1 token gen.
Produces vocals + instrumentals + mix from lyrics on trn2.3xlarge.