
Add SongPrep-7B contrib model#118

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/songprep-7b

Conversation

@jimburtoft
Contributor

Note: the template below is meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Two-stage pipeline for song structure parsing and lyrics transcription with timestamps. SongPrep-7B converts audio waveforms into structured lyrics with section labels ([verse], [chorus], etc.) and timestamps using a MuCodec audio encoder (329.5M params, FP32 Wav2Vec2-Conformer + RVQ) followed by a Qwen2 7B decoder (BF16).

Key implementation details:

  • MuCodec encoder uses a split pipeline: CPU MelSTFT preprocessing (torch.stft is not traceable because of its overlapping window strides) + the Neuron Conformer+RVQ backbone compiled via torch_neuronx.trace() with --auto-cast=matmult
  • Qwen2 decoder compiled via NxD Inference with on_device_sampling_config=None (extended vocabulary of 168,040 tokens exceeds on-device sampling NKI kernel limit)
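The split can be sketched roughly as follows. The parameter values, function names, and the commented trace call are illustrative assumptions, not the actual constants or code in modeling_songprep.py:

```python
import torch

N_FFT, HOP = 1024, 256  # assumed values; the real ones live in modeling_songprep.py

def mel_stft_cpu(waveform: torch.Tensor) -> torch.Tensor:
    """CPU-side STFT stage: torch.stft's overlapping window strides
    prevent Neuron tracing, so this part runs eagerly on the host."""
    window = torch.hann_window(N_FFT)
    spec = torch.stft(waveform, N_FFT, hop_length=HOP,
                      window=window, return_complex=True)
    return spec.abs()  # (n_fft // 2 + 1, n_frames) magnitude spectrogram

# The Conformer+RVQ backbone is then traced once with a fixed example
# input and handed to the Neuron compiler, along the lines of:
#   neuron_encoder = torch_neuronx.trace(
#       conformer_rvq, example_spectrogram,
#       compiler_args=["--auto-cast=matmult"],  # cast matmuls to BF16, keep the rest FP32
#   )
```

At run time the host spectrogram is fed to the traced backbone, giving the CPU-preprocessing + Neuron-backbone pattern described above.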

Model Information

Model Name: SongPrep-7B

Model Architecture: MuCodec audio encoder (Wav2Vec2-Conformer + 1-RVQ) + Qwen2 decoder (GQA, RoPE, SiLU)

Purpose: Audio-to-text: song structure parsing and lyrics transcription with timestamps

Checklist

Please ensure your PR includes the following items. Refer to contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • MuCodec encoder: codec token match rate (Neuron vs CPU)
    • Qwen2 decoder: token-level match with greedy decoding
    • End-to-end pipeline: structural validity and timing
    • Tests run on trn2.3xlarge
  • README.md with the following sections:

    • Usage Example: Step-by-step trace, compile, and run pipeline
    • Compatibility Matrix: trn2.3xlarge validated with SDK 2.27
    • Example Checkpoints: tencent/SongPrep-7B on HuggingFace
    • Testing Instructions: pytest command with environment variables
  • Source Code (src/)

    • modeling_songprep.py: MuCodec tracing, Qwen2 NxDI config, SongPrepPipeline class
    • Follows contrib folder hierarchy
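The end-to-end structural-validity check listed above could look roughly like this minimal sketch; the tag set, timestamp format, and regexes are illustrative assumptions, not the PR's actual test code:

```python
import re

# Assumed formats: section labels like [verse]/[chorus] plus
# [mm:ss.s]-style timestamps, per the PR description.
SECTION_TAG = re.compile(r"\[(intro|verse|chorus|bridge|outro)\]")
TIMESTAMP = re.compile(r"\[\d{2}:\d{2}(?:\.\d+)?\]")

def is_structurally_valid(output: str) -> bool:
    """A decoded transcript passes if it contains at least one
    section label and at least one timestamp."""
    return bool(SECTION_TAG.search(output)) and bool(TIMESTAMP.search(output))
```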

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Not included (unit/ directory created but empty)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/SongPrep-7B/
  README.md
  /src
    __init__.py
    modeling_songprep.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests were run on a trn2.3xlarge instance (LNC=2, 4 logical cores) in sa-east-1 with Neuron SDK 2.27 and the Deep Learning AMI Neuron (Ubuntu 24.04).

Test Results:

  • MuCodec encoder: 96.8% codec token match (Neuron vs CPU, 250 tokens from 10s audio)
  • Qwen2 decoder: 100% token match (first 200 tokens, greedy decoding, Neuron vs CPU BF16)
  • MuCodec latency: 89-244ms for 10-60s audio (112-246x realtime)
  • Qwen2 throughput: 21-26 tok/s
  • End-to-end: structurally valid output with section tags and timestamps
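The two headline metrics above reduce to simple ratios; this sketch shows how they are computed (the helper names are ours, not the test suite's):

```python
def token_match_rate(neuron_tokens, cpu_tokens):
    """Fraction of positions where the Neuron output equals the CPU reference."""
    assert len(neuron_tokens) == len(cpu_tokens)
    return sum(a == b for a, b in zip(neuron_tokens, cpu_tokens)) / len(cpu_tokens)

def realtime_factor(audio_seconds, latency_ms):
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / (latency_ms / 1000.0)

# 242 of 250 matching codec tokens -> 0.968 (the 96.8% above);
# 10 s of audio in 89 ms -> ~112x realtime, 60 s in 244 ms -> ~246x.
```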

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.27
  • Instance Type(s): trn2.3xlarge
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

This is the first contrib model to use torch_neuronx.trace() for a component (MuCodec encoder) alongside NxD Inference for the decoder. The split pipeline pattern (CPU preprocessing + Neuron backbone) may be useful for other audio/speech models with non-traceable preprocessing stages.

Known limitation: the SongPrep source repository must be cloned separately for MuCodec model definitions (not packaged as a pip-installable library).

Related Issues

None

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

vLLM integration is blocked by the extended vocabulary exceeding the on-device sampling NKI kernel limit. NxD Inference direct mode is used instead.
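Under NxD Inference direct mode, disabling on-device sampling is a single config choice. A sketch, assuming the current NxDI config module path and parameter names (treat both as assumptions; the actual config lives in modeling_songprep.py):

```python
import torch
# Assumed import path per recent NxD Inference releases.
from neuronx_distributed_inference.models.config import NeuronConfig

# With the 168,040-token extended vocabulary past the on-device sampling
# NKI kernel limit, leaving on_device_sampling_config unset falls back
# to host-side sampling.
neuron_config = NeuronConfig(
    torch_dtype=torch.bfloat16,      # decoder runs in BF16
    on_device_sampling_config=None,  # sample on the host instead of the NKI kernel
)
```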


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

