Add Voxtral Mini 3B contrib model (audio-language) #125
Open
jimburtoft wants to merge 2 commits into aws-neuron:main from
Mistral AI's Voxtral Mini 3B audio-language model on Neuron (Trainium2/Inferentia2) using a decomposed pipeline:

- Audio encoder: torch_neuronx.trace() with inline_weights_to_neff=False
- Projector: CPU (25M params)
- LLM backbone: NxDI ImageToTextModelWrapper (Llama 3.3B, TP=1)

Supports text-only generation, audio transcription, and audio understanding. Validated at 58.5 tok/s on trn2.3xlarge and 28.4 tok/s on inf2.xlarge.
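The hand-off from the CPU projector to the LLM backbone can be pictured as an index-based scatter, in the spirit of the scatter_by_index_put item listed under Key Technical Contributions. The following is only a minimal plain-Python sketch of that idea, not the PR's implementation: the function names, the list-of-lists embedding layout, and `AUDIO_TOKEN_ID` are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence

# Hypothetical placeholder id for audio positions; not taken from the PR.
AUDIO_TOKEN_ID = 24


def audio_positions(input_ids: Sequence[int],
                    audio_token_id: int = AUDIO_TOKEN_ID) -> List[int]:
    """Return the sequence positions that hold the audio placeholder token."""
    return [i for i, tok in enumerate(input_ids) if tok == audio_token_id]


def scatter_audio_embeddings(text_embeds: Sequence[Sequence[float]],
                             positions: Sequence[int],
                             audio_embeds: Sequence[Sequence[float]]) -> List[List[float]]:
    """Overwrite the rows at `positions` with the projected audio embeddings.

    Returns a new list of rows and leaves `text_embeds` untouched.
    """
    if len(positions) != len(audio_embeds):
        raise ValueError("need exactly one audio embedding per placeholder position")
    merged = [list(row) for row in text_embeds]
    for pos, emb in zip(positions, audio_embeds):
        merged[pos] = list(emb)
    return merged
```

With real tensors the same step would typically be a single indexed write into the embedding tensor. The sketch keeps the index computation separate from the write so that the positions can be prepared ahead of the merge.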
- Add -O1, tensorizer options, and --lnc=2 to audio encoder trace (matched to decoder optimization level)
- Fix neuron-ls trn2 detection: use plain 'neuron-ls' instead of '--json-output', which does not contain the instance type string
- Add benchmark_encoder.py for component-level latency measurement

Benchmark results (SDK 2.28, trn2.3xlarge): encoder 224ms with both minimal and optimized flags -- SDK 2.28 already fully optimizes the encoder trace. Projector trace to Neuron saves 2ms (3ms to 1ms).
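The neuron-ls detection fix described above could look roughly like the sketch below. It assumes only what the commit message states, namely that the plain `neuron-ls` output contains the instance type string somewhere. The helper names and the regular expression are hypothetical and are not taken from the PR.

```python
import re
import subprocess
from typing import Optional

# Matches strings such as "trn2.3xlarge" or "inf2.xlarge" and captures the family.
_INSTANCE_RE = re.compile(r"\b(trn\d+|inf\d+)[a-z]*\.[0-9a-z]+")


def instance_family(neuron_ls_text: str) -> Optional[str]:
    """Extract the instance family (e.g. "trn2") from plain neuron-ls output."""
    match = _INSTANCE_RE.search(neuron_ls_text)
    return match.group(1) if match else None


def detect_instance_family() -> Optional[str]:
    """Run plain `neuron-ls` (no --json-output) and parse its text output.

    Returns None when the tool is missing or no instance type is found.
    """
    try:
        result = subprocess.run(["neuron-ls"], capture_output=True,
                                text=True, check=False)
    except FileNotFoundError:
        return None
    return instance_family(result.stdout)
```

A caller could then enable trn2-specific flags such as --lnc=2 only when `detect_instance_family()` returns `"trn2"`, and fall back to defaults otherwise.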
Summary
torch_neuronx.trace(), Llama LLM via NxDI ImageToTextModelWrapper
Model Details
Performance
Key Technical Contributions
- VoxtralTextModel extends NeuronLlamaModel -- no custom attention/MLP
- scatter_by_index_put for audio token embedding
- --lnc=2 flag
Testing