📸 Test Feedback: Omni-Modal / Multimodal Models (Vision / Audio / Video)

Hi all,
I am collecting real-world test results and edge-case reports for our newly overhauled Multimodal architecture (covering Vision, Audio, Omni models, and future Video support).
Whether you are running multi-turn interactive chats or stateless single-turn inferences (like ComfyUI nodes), your feedback on KV cache alignment, memory shifts, and M-RoPE position folding is extremely valuable.
⚠️ CRITICAL REQUIREMENT for Logs
To help trace the underlying C++ chunk evaluations and temporal position (`n_tokens`, `n_pos`) math, you MUST set `verbose=True` in both your `Llama` and `ChatHandler` initializations. (Logs without verbose output lack the C++ backend traces needed to debug multimodal issues.)
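As a minimal sketch, this is what the verbose setup looks like. The model paths are placeholders, and the handler class shown is the stable `Llava15ChatHandler` from `llama_cpp.llama_chat_format` — the import path for the new MTMD handler may differ, so treat the commented constructor calls as an assumption, not the definitive API:

```python
# Constructor arguments for the multimodal chat handler and the Llama model.
# Both get verbose=True so the C++ backend traces show up in the runtime logs.
handler_kwargs = {
    "clip_model_path": "mmproj-BF16.gguf",  # placeholder projector path
    "verbose": True,                        # required: enables C++ traces
}
llama_kwargs = {
    "model_path": "Qwen3.5-VL-7B.gguf",     # placeholder main LLM path
    "n_ctx": 8192,
    "verbose": True,                        # required: enables C++ traces
}

# With real model files on disk, the initialization would look like:
# from llama_cpp import Llama
# from llama_cpp.llama_chat_format import Llava15ChatHandler
# handler = Llava15ChatHandler(**handler_kwargs)
# llm = Llama(chat_handler=handler, **llama_kwargs)
```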
📝 What to include in your reply:
If you are testing any multimodal models (e.g., Qwen-VL, LLaVA, MiniCPM-V), please reply with:
- Models & Repos: The main LLM path (e.g., `Qwen3.5-VL-7B.gguf`) and the multimodal projector path (e.g., `mmproj-BF16.gguf`).
- Environment: OS, hardware setup, VRAM capacity, and multi-GPU config (if any).
- Version: Your llama-cpp-python commit hash or branch.
- Media Details: What are you passing to the model? (e.g., a single 1080p image, 3-12 concurrent images, a 5-second MP3/WAV audio file, or video in the future.)
- Important Flags & Configs: Context size (`n_ctx`), batch size (`n_batch`), and checkpoint settings (`ctx_checkpoints`).
- Performance: Prompt eval time (crucial for heavy images), generation tokens/s, and VRAM usage.
- The Verbose Logs: Paste the full runtime logs (especially the lines starting with `Llama.generate:` and `MTMDChatHandler(__call__):`).
- Feedback: Any crashes (like negative token errors), infinite loops, unexpected OOMs, quality degradation, or surprisingly good results!
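For the performance numbers, a hedged sketch of how to derive generation tokens/s: the field names follow the OpenAI-style response dict that llama-cpp-python's chat completion returns, but the response values below are stand-ins for illustration, not real model output:

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generation speed to report alongside your logs."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

# Stand-in values; in practice, time the create_chat_completion call and
# read the token counts from the response's "usage" field.
fake_response = {"usage": {"prompt_tokens": 1234, "completion_tokens": 256}}
elapsed = 8.0  # wall-clock seconds measured around the call
speed = tokens_per_second(fake_response["usage"]["completion_tokens"], elapsed)
print(f"{speed:.1f} tokens/s")  # 32.0 tokens/s
```

Prompt eval time (the heavy part for large images) is printed separately in the verbose C++ timing logs, so paste those lines verbatim rather than recomputing them.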
Even a single successful log or a minor observation helps us bulletproof this architecture. Thank you! 🚀