[v0.3.31] Release Note: Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging #80
JamePeng
announced in
Announcements
Omni-Modal Media Pipeline, Hybrid 1-Token Rollback and Enhanced Logging
Release v0.3.31 introduces structural updates to the multi-modal processing pipeline, addresses a specific caching behavior in hybrid models, and improves how underlying C++ backend errors are surfaced to the Python layer.
Here is a detailed breakdown of the changes in this version.
1. Omni-Modal Media Pipeline
The media parsing and loading pipeline in `MTMDChatHandler` has been rewritten to handle both vision and audio inputs within a unified architecture.

- The `_init_mtmd_context` method now actively probes the C++ backend for `ctx_v` (vision) and `ctx_a` (audio) encoders. This provides proactive validation of the model's capabilities before media processing begins.
- Replaced `get_image_urls` and `split_text_on_image_urls` with `_get_media_items`. This parses `image_url`, `input_audio`, and `audio_url` while strictly maintaining the chronological order of user prompts and enforcing OpenAI format specifications.
- A unified `load_media` dispatcher has been introduced. It includes a new `detect_audio_format` method that mimics `llama.cpp`'s C++ magic-byte sniffing (RIFF/WAVE, ID3/MPEG, fLaC) to prevent backend crashes caused by unsupported or corrupted audio formats.
- The `ThreadPoolExecutor` in `_process_mtmd_prompt` has been updated to concurrently fetch and decode both image and audio payloads into unified `mtmd_bitmap` structures.
2. Hybrid Model 1-Token Rollbacks (N-1 Checkpointing)

This release addresses an issue where generating responses with hybrid or recurrent models (such as RNN-based architectures) could result in empty outputs or state desyncs when the prompt cache matched 100%.
When a prompt matches the cache entirely (e.g., when a user regenerates a response with the same prompt but a different seed), the engine attempts a "1-token rollback" to refresh the sampling logits. Because hybrid models cannot arbitrarily truncate their internal states like standard Transformers, rolling back one token without a dedicated snapshot caused the state machine to fail.
The engine now forces an N-1 state snapshot during the prompt prefilling phase for hybrid models. This ensures the engine can safely perform a 1-token rollback to refresh logits upon 100% cache matches, preventing desyncs without requiring a full re-evaluation of the prompt.
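The checkpoint-and-rollback flow can be illustrated with a toy model. All names here (`HybridEngine`, `prefill`, `rollback_one`) are hypothetical, and the token list stands in for the opaque recurrent state that the real engine snapshots inside the C++ backend.

```python
# Toy sketch of N-1 checkpointing for recurrent state (names are illustrative).
from copy import deepcopy

class HybridEngine:
    def __init__(self) -> None:
        self.state: list[int] = []        # stand-in for opaque recurrent state
        self._checkpoint: list[int] | None = None

    def prefill(self, tokens: list[int]) -> None:
        for i, tok in enumerate(tokens):
            if i == len(tokens) - 1:
                # Snapshot *before* consuming the final token (the N-1 point):
                # recurrent states cannot be truncated after the fact.
                self._checkpoint = deepcopy(self.state)
            self.state.append(tok)

    def rollback_one(self, last_token: int) -> None:
        # On a 100% prompt-cache hit, restore the N-1 checkpoint and re-feed
        # the final token to refresh sampling logits without a full re-eval.
        assert self._checkpoint is not None, "prefill() must run first"
        self.state = deepcopy(self._checkpoint)
        self.state.append(last_token)
```

The key design point is that the snapshot is taken eagerly during prefill, so the rollback path never needs to truncate a state that cannot be truncated.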
3. Exposing Critical C++ Errors
We have removed the OS-level log suppression (`suppress_stdout_stderr`) around critical C++ backend calls, specifically within `_init_mtmd_context`, `_create_bitmap_from_bytes`, and `close`.

Previously, when `verbose=False`, this file descriptor redirection was inadvertently swallowing fatal C++ backend errors, such as `stb_image` decoding failures, corrupted `.mmproj` model weights, or CUDA out-of-memory aborts. This resulted in silent crashes that were difficult to debug.

The framework now relies entirely on the native C-API `llama_log_callback` to route logs to Python. This ensures that critical decoding and hardware exceptions remain visible in the console, while standard processing logs can still be filtered by the Python logging module.
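The Python side of such log routing can be sketched as below. The level numbering in `_LEVELS` and the function name are assumptions for illustration; the real binding registers a C callback with the llama.cpp log API rather than calling a plain Python function directly.

```python
# Rough sketch of routing backend log lines into Python's logging module.
import logging

logger = logging.getLogger("llama-cpp")

# Assumed mapping from backend log levels to Python logging levels.
_LEVELS = {0: logging.DEBUG, 1: logging.INFO, 2: logging.WARNING, 3: logging.ERROR}

def on_backend_log(level: int, text: str) -> None:
    # Unknown levels are escalated to ERROR so fatal messages are never
    # dropped; routine output can still be filtered by the logging config.
    logger.log(_LEVELS.get(level, logging.ERROR), text.rstrip("\n"))
```

Because filtering happens in the Python `logging` layer instead of at the file-descriptor level, fatal backend messages stay visible even when verbose output is disabled.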
4. Upstream Synchronization

Updated the `llama.cpp` backend to `ggml-org/llama.cpp` commit [f5ddcd1696eca5069dc7915f4d4c03c9a709afea](ggml-org/llama.cpp@f5ddcd1).