🚀 Release v0.3.32: Omni-Modal Single-Turn Optimizations & Deterministic Sampling #86
JamePeng announced in Announcements
Comments:
- windows 😭
  - Thanks for your work on this! 👏 It's been saving my butt working on projects with qwen3-reranker / qwen35 and the upstream llama-cpp-python seemingly dead for months now 🔥
🚀 Release v0.3.32: Omni-Modal Single-Turn Optimizations & Deterministic Sampling
Hi everyone,
I am excited to share the release of v0.3.32. In this update, my primary focus was resolving several deep-rooted architectural bottlenecks specifically affecting single-turn workflows (such as ComfyUI nodes or stateless API endpoints) when using Hybrid and Multimodal models (like Qwen3.5). I also completely overhauled the sampling seed management to guarantee thread safety and strict determinism.

Here is a detailed breakdown of what is new and how you can leverage these optimizations.
⚡ 1. Zero-Latency Single-Turn Optimizations for Hybrid/Multimodal Models
Previously, running Hybrid models (which maintain internal hidden states) in stateless or single-turn environments introduced massive inefficiencies. Even if you didn't need multi-turn chat history, the engine was still attempting to extract and save ~150MB+ of state data from VRAM to RAM over the PCIe bus at the end of every generation, causing an unavoidable ~3-second blocking delay.
Worse, if you tried to disable this cache, the N-1 prefix matching logic would trigger a catastrophic KV cache clear, leading to `Invalid input batch` crashes when encountering multimodal pseudo-tokens. I have fundamentally refactored this pipeline:
- `Llama.generate`: if caching is disabled, the engine now skips the N-1 truncation entirely. It reuses the freshly computed logits from the Multimodal Handler and instantly starts generating text.
- The expensive token-buffer copy (`self._input_ids[:self.n_tokens].tolist()`) is avoided. This ensures absolutely zero memory allocation and zero overhead for single-turn hybrid workflows.

💡 How to use this optimization:
If you are building ComfyUI nodes or running single-turn API wrappers where you do not need multi-turn state rollbacks, simply initialize your `Llama` instance with `ctx_checkpoints=0`.
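A minimal sketch of this setup (the model path, prompt, and `n_ctx` value are placeholders; `ctx_checkpoints=0` is the option described above):

```python
from llama_cpp import Llama

# Stateless / single-turn setup: with ctx_checkpoints=0 the engine
# skips hybrid-state extraction and N-1 prefix matching entirely.
llm = Llama(
    model_path="./qwen3.5-hybrid.gguf",  # placeholder path
    n_ctx=4096,
    ctx_checkpoints=0,
)

# Every call is an independent request; no multi-turn rollback
# state is saved back to RAM after generation.
out = llm.create_completion("Summarize KV caching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Since no checkpoints are kept, each completion pays no state-save cost at the end of generation, which is exactly what a ComfyUI node or stateless endpoint wants.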
Note: `MTMDChatHandler` has also been updated to suppress cache-related anchoring logs when `max_checkpoints <= 0`.

🎲 2. Fixed Sampling Seed Determinism & Thread Safety
I noticed reports of the `seed` parameter seemingly having no effect during generation. After tracing the parameter chain, I found that explicit seeds were not being passed down to the C++ `llama_sampling_context`.

Furthermore, `_create_completion` was using an anti-pattern: it mutated the global `self._seed` of the Llama instance during generation. In a concurrent environment (like a web server handling multiple requests), this caused thread-unsafe state pollution. I have completely removed this global mutation:
- Added an explicit `seed` parameter to both the `generate` and `sample` method signatures.
- Explicit seeds now flow directly into `LlamaSamplingParams` instances instead of mutating shared instance state.
🛠️ 3. Modernized Bug Report Template & Upstream Sync
The bug report template now documents the `verbose=True` logging requirements, with code examples, to help me debug your C++ backend logs faster.

A huge thank you to everyone in the discussions and issues who provided the logs and insights that made tracing these deep state-machine bugs possible.
Happy coding!
— JamePeng