
voxtral-cpp

Local Voxtral speech-to-text using llama.cpp's mtmd library with GPU acceleration (Metal/CUDA/Vulkan).

Features

  • File transcription (MP3, WAV, FLAC)
  • Microphone recording with --rec mode (loops for multiple recordings)
  • Audio enhancement enabled by default (use --raw to disable)
  • Interactive model selection (Q4, Q8, BF16 presets)
  • GPU acceleration: Metal (macOS), CUDA (Nvidia), Vulkan (cross-platform)
  • Faster than real-time transcription (~2x on M1)
  • JSON output format

Build

mkdir build && cd build
cmake ..
make voxtral-cli -j8

The build auto-detects your GPU: Metal on macOS, CUDA if available on Linux/Windows, otherwise CPU.

To force a specific backend:

cmake .. -DGGML_CUDA=ON    # Nvidia CUDA
cmake .. -DGGML_VULKAN=ON  # Vulkan (AMD, Intel, Nvidia)
cmake .. -DGGML_METAL=OFF  # CPU only on macOS

The build automatically applies patches to llama.cpp (see Patches section).

Download Models

Download the Q4 model (recommended, 2.3GB):

mkdir -p models && cd models
curl -LO https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/resolve/main/Voxtral-Mini-3B-2507-Q4_K_M.gguf
curl -LO https://huggingface.co/ggml-org/Voxtral-Mini-3B-2507-GGUF/resolve/main/mmproj-Voxtral-Mini-3B-2507-Q8_0.gguf

For higher quality, download the Q8 or BF16 weights from bartowski's Hugging Face page.

For the best quality, try Voxtral Small (24B) from bartowski; it requires roughly 16GB+ of VRAM.

Usage

# Transcribe audio file
./build/voxtral-cli audio.mp3

# Record from microphone
./build/voxtral-cli --rec

# Show help
./build/voxtral-cli --help

Output is JSON:

{"transcription": "Hello world, this is a test."}

Options

  • --rec - Record from microphone (Enter to start/stop, loops after each transcription)
  • --raw - Disable audio enhancement (enhancement is ON by default)
  • -m, --model <preset|path> - Model preset or path to model file
  • --mmproj <path> - Path to multimodal projector (auto-detected for presets)
  • --debug - Save raw and enhanced audio to debug/ folder
  • -ngl <n> - Number of layers to offload to GPU (default: 99)
  • --no-gpu - Disable GPU acceleration

Available presets:

Preset     Model          Size    Notes
q4         Mini Q4_K_M    2.3GB   Fast, good quality (default)
q8         Mini Q8_0      4.0GB   Better quality
bf16       Mini BF16      7.5GB   Full precision
small-q4   Small Q4_K_M   14GB    Best quality, needs ~16GB VRAM
small-q8   Small Q8_0     25GB    Maximum quality, needs ~32GB VRAM

If no model is specified, an interactive menu lets you choose between available presets.

Audio Enhancement

Audio enhancement is enabled by default (use --raw to disable):

  1. DC offset removal - Centers the waveform
  2. High-pass filter (80Hz) - Removes low-frequency rumble and hum
  3. Noise gate - Attenuates background noise below estimated noise floor
  4. Normalization - Scales audio to -1dB peak

The pipeline uses Apple Accelerate on macOS and pure C++ on other platforms. Enhancement significantly improves transcription quality for microphone recordings.

Patches

This project applies patches to llama.cpp at build time via CMake's PATCH_COMMAND.

voxtral-begin-audio.patch

Fixes missing [BEGIN_AUDIO] tag for Voxtral in mtmd.cpp. Without this patch, transcription can be flaky.

See: ggml-org/llama.cpp#17868

The patch adds PROJECTOR_TYPE_VOXTRAL to the condition that sets the audio begin tag (already done for Ultravox).
