Note: The below template includes items meant for model contributions only. For other contributions such as bug fixes, features, etc., only fill out the relevant portions of the form.
Description
Two-stage pipeline for song structure parsing and lyrics transcription with timestamps. SongPrep-7B converts audio waveforms into structured lyrics with section labels (`[verse]`, `[chorus]`, etc.) and timestamps, using a MuCodec audio encoder (329.5M params, FP32 Wav2Vec2-Conformer + RVQ) followed by a Qwen2 7B decoder (BF16).

Key implementation details:

- `torch_neuronx.trace()` with `--auto-cast=matmult`
- `on_device_sampling_config=None` (extended vocabulary of 168,040 tokens exceeds the on-device sampling NKI kernel limit)

Model Information
Model Name: SongPrep-7B
Model Architecture: MuCodec audio encoder (Wav2Vec2-Conformer + 1-RVQ) + Qwen2 decoder (GQA, RoPE, SiLU)
Purpose: Audio-to-text: song structure parsing and lyrics transcription with timestamps
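To illustrate the kind of structured output described above (section labels plus timestamps), here is a minimal parser sketch. The line format used here (`[label] start end lyrics`) is an assumption for illustration only, not SongPrep's actual serialization.

```python
import re

# Hypothetical serialization: "[verse] 0.0 14.8 some lyric line".
# The real SongPrep output format may differ.
LINE_RE = re.compile(
    r"\[(?P<label>\w+)\]\s+(?P<start>\d+(?:\.\d+)?)\s+(?P<end>\d+(?:\.\d+)?)\s+(?P<text>.*)"
)

def parse_structured_lyrics(output: str):
    """Parse structured lyrics into (label, start, end, text) tuples."""
    sections = []
    for line in output.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            sections.append((m["label"], float(m["start"]), float(m["end"]), m["text"]))
    return sections

example = """
[verse] 0.0 14.8 first verse line
[chorus] 14.8 29.1 chorus hook
"""
print(parse_structured_lyrics(example))
```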
Checklist
Please ensure your PR includes the following items. Refer to `contrib/CONTRIBUTING.md` for detailed guidelines.
Required Components
- Accuracy Test (ex. `test/integration/test_model.py`)
- README.md with the following sections:
- Source Code (`src/`)
  - `modeling_songprep.py`: MuCodec tracing, Qwen2 NxDI config, `SongPrepPipeline` class

Optional Components
Folder Structure
Confirm your contribution follows this structure:
Testing
How did you test this change?
All tests were run on a trn2.3xlarge instance (LNC=2, 4 logical cores) in sa-east-1 with Neuron SDK 2.27 and the Deep Learning AMI Neuron (Ubuntu 24.04).
Test Results:
Compatibility
Tested with:
Additional Information
This is the first contrib model to use `torch_neuronx.trace()` for a component (MuCodec encoder) alongside NxD Inference for the decoder. The split pipeline pattern (CPU preprocessing + Neuron backbone) may be useful for other audio/speech models with non-traceable preprocessing stages.

Known limitation: the SongPrep source repository must be cloned separately for the MuCodec model definitions (it is not packaged as a pip-installable library).
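The split pipeline pattern described above can be sketched in plain Python. The class and stage names below are hypothetical stand-ins (the real `SongPrepPipeline` wraps a `torch_neuronx`-traced MuCodec encoder and an NxD Inference Qwen2 decoder); the callables here are dummies so the control flow can be shown without Neuron hardware.

```python
from typing import Callable, List

class SplitPipeline:
    """CPU preprocessing + Neuron backbone, as three swappable stages.

    `encode` stands in for a traced encoder (would run on Neuron cores)
    and `decode` for an NxD Inference decoder; here both are plain
    callables purely for illustration.
    """

    def __init__(self, preprocess: Callable, encode: Callable, decode: Callable):
        self.preprocess = preprocess  # non-traceable CPU stage
        self.encode = encode          # encoder stage (traced in the real pipeline)
        self.decode = decode          # autoregressive decoder stage

    def __call__(self, waveform: List[float]) -> str:
        features = self.preprocess(waveform)
        codes = self.encode(features)
        return self.decode(codes)

# Dummy stages demonstrating the data flow.
pipe = SplitPipeline(
    preprocess=lambda w: [x * 2 for x in w],
    encode=lambda f: sum(f),
    decode=lambda c: f"[verse] tokens={c}",
)
print(pipe([0.5, 1.0]))  # -> "[verse] tokens=3.0"
```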
Related Issues
None
vLLM Integration
vLLM integration is blocked by the extended vocabulary exceeding the on-device sampling NKI kernel limit. NxD Inference direct mode is used instead.
By submitting this PR, I confirm that: