📸 Test Feedback: Omni-Modal / Multimodal Models (Vision / Audio / Video)

Hi all,
I am collecting real-world test results and edge-case reports for our newly overhauled Multimodal architecture (covering Vision, Audio, Omni models, and future Video support).
Whether you are running multi-turn interactive chats or stateless single-turn inferences (like ComfyUI nodes), your feedback on KV cache alignment, memory shifts, and M-RoPE position folding is extremely valuable.
⚠️ CRITICAL REQUIREMENT for Logs
To help trace the underlying C++ chunk evaluations and temporal position (`n_tokens`, `n_pos`) math, you MUST set `verbose=True` in both your `Llama` and `ChatHandler` initializations. (Logs without verbose output lack the C++ backend traces needed to debug multimodal issues.)
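As a minimal sketch, this is what the verbose setup looks like. The model paths are placeholders, and the handler class shown is the stable `Llava15ChatHandler` from `llama_cpp.llama_chat_format` — the import path for the new MTMD handler may differ, so treat the commented constructor calls as an assumption, not the definitive API:

```python
# Constructor arguments for the multimodal chat handler and the Llama model.
# Both get verbose=True so the C++ backend traces show up in the runtime logs.
handler_kwargs = {
    "clip_model_path": "mmproj-BF16.gguf",  # placeholder projector path
    "verbose": True,                        # required: enables C++ traces
}
llama_kwargs = {
    "model_path": "Qwen3.5-VL-7B.gguf",     # placeholder main LLM path
    "n_ctx": 8192,
    "verbose": True,                        # required: enables C++ traces
}

# With real model files on disk, the initialization would look like:
# from llama_cpp import Llama
# from llama_cpp.llama_chat_format import Llava15ChatHandler
# handler = Llava15ChatHandler(**handler_kwargs)
# llm = Llama(chat_handler=handler, **llama_kwargs)
```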
📝 What to include in your reply:
If you are testing any multimodal models (e.g., Qwen-VL, LLaVA, MiniCPM-V), please reply with:
- Models & Repos: The main LLM path (e.g., `Qwen3.5-VL-7B.gguf`) and the multimodal projector path (e.g., `mmproj-BF16.gguf`).
- Environment: OS, hardware setup, VRAM capacity, and multi-GPU config (if any).
- Version: Your llama-cpp-python commit hash or branch.
- Media Details: What are you passing to the model? (e.g., a single 1080p image, 3-12 concurrent images, a 5-second MP3/WAV audio file, or video in the future.)
- Important Flags & Configs: Context size (`n_ctx`), batch size (`n_batch`), and checkpoint settings (`ctx_checkpoints`).
- Performance: Prompt eval time (crucial for heavy images), generation tokens/s, and VRAM usage.
- The Verbose Logs: Paste the full runtime logs (especially the lines starting with `Llama.generate:` and `MTMDChatHandler(__call__):`).
- Feedback: Any crashes (like negative token errors), infinite loops, unexpected OOMs, quality degradation, or surprisingly good results!
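For the performance numbers, a hedged sketch of how to derive generation tokens/s: the field names follow the OpenAI-style response dict that llama-cpp-python's chat completion returns, but the response values below are stand-ins for illustration, not real model output:

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generation speed to report alongside your logs."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

# Stand-in values; in practice, time the create_chat_completion call and
# read the token counts from the response's "usage" field.
fake_response = {"usage": {"prompt_tokens": 1234, "completion_tokens": 256}}
elapsed = 8.0  # wall-clock seconds measured around the call
speed = tokens_per_second(fake_response["usage"]["completion_tokens"], elapsed)
print(f"{speed:.1f} tokens/s")  # 32.0 tokens/s
```

Prompt eval time (the heavy part for large images) is printed separately in the verbose C++ timing logs, so paste those lines verbatim rather than recomputing them.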
Even a single successful log or a minor observation helps us bulletproof this architecture. Thank you! 🚀