
voxtral-cpp

Local Voxtral speech-to-text using llama.cpp's mtmd library with GPU acceleration (Metal/CUDA/Vulkan).

Features

  • File transcription (MP3, WAV, FLAC)
  • Microphone recording with --rec mode (loops for multiple recordings)
  • Audio enhancement enabled by default (use --raw to disable)
  • Interactive model selection (Q4, Q8, BF16 presets)
  • GPU acceleration: Metal (macOS), CUDA (Nvidia), Vulkan (cross-platform)
  • Faster than real-time transcription (~2x on M1)
  • JSON output format

Build

mkdir build && cd build
cmake ..
make voxtral-cli -j8

The build auto-detects your GPU: Metal on macOS, CUDA if available on Linux/Windows, otherwise CPU.

To force a specific backend:

cmake .. -DGGML_CUDA=ON    # Nvidia CUDA
cmake .. -DGGML_VULKAN=ON  # Vulkan (AMD, Intel, Nvidia)
cmake .. -DGGML_METAL=OFF  # CPU only on macOS

The build automatically applies patches to llama.cpp (see Patches section).

Download Models

Download the Q4 model (recommended, 2.3GB):

mkdir -p models && cd models
curl -LO https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/resolve/main/Voxtral-Mini-3B-2507-Q4_K_M.gguf
curl -LO https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/resolve/main/mmproj-Voxtral-Mini-3B-2507-Q8_0.gguf

For higher quality, download the Q8 or BF16 weights from bartowski's Hugging Face page.

For the best quality, try Voxtral Small (24B) from bartowski; it requires roughly 16GB+ of VRAM.

Usage

# Transcribe audio file
./build/voxtral-cli audio.mp3

# Record from microphone
./build/voxtral-cli --rec

# Show help
./build/voxtral-cli --help

Output is JSON:

{"transcription": "Hello world, this is a test."}

Options

  • --rec - Record from microphone (Enter to start/stop, loops after each transcription)
  • --raw - Disable audio enhancement (enhancement is ON by default)
  • -m, --model <preset|path> - Model preset or path to model file
  • --mmproj <path> - Path to multimodal projector (auto-detected for presets)
  • --debug - Save raw and enhanced audio to debug/ folder
  • -ngl <n> - Number of layers to offload to GPU (default: 99)
  • --no-gpu - Disable GPU acceleration

Available presets:

Preset     Model          Size    Notes
q4         Mini Q4_K_M    2.3GB   Fast, good quality (default)
q8         Mini Q8_0      4.0GB   Better quality
bf16       Mini BF16      7.5GB   Full precision
small-q4   Small Q4_K_M   14GB    Best quality, needs ~16GB VRAM
small-q8   Small Q8_0     25GB    Maximum quality, needs ~32GB VRAM

If no model is specified, an interactive menu lets you choose between available presets.

Audio Enhancement

Audio enhancement is enabled by default (use --raw to disable):

  1. DC offset removal - Centers the waveform
  2. High-pass filter (80Hz) - Removes low-frequency rumble and hum
  3. Noise gate - Attenuates background noise below estimated noise floor
  4. Normalization - Scales audio to -1dB peak

The pipeline uses Apple Accelerate on macOS and pure C++ on other platforms. Enhancement significantly improves transcription quality for microphone recordings.

Patches

This project applies patches to llama.cpp at build time via CMake's PATCH_COMMAND.

voxtral-begin-audio.patch

Fixes missing [BEGIN_AUDIO] tag for Voxtral in mtmd.cpp. Without this patch, transcription can be flaky.

See: ggml-org/llama.cpp#17868

The patch adds PROJECTOR_TYPE_VOXTRAL to the condition that sets the audio begin tag (already done for Ultravox).
