🚀 Release v0.3.32: Omni-Modal Single-Turn Optimizations & Deterministic Sampling #86
JamePeng announced in Announcements
Comments:
- windows 😭
  - Thanks for your work on this! 👏 It's been saving my butt working on projects with qwen3-reranker / qwen35 and the upstream llama-cpp-python seemingly dead for months now 🔥
🚀 Release v0.3.32: Omni-Modal Single-Turn Optimizations & Deterministic Sampling
Hi everyone,
I am excited to share the release of v0.3.32. In this update, my primary focus was resolving several deep-rooted architectural bottlenecks specifically affecting single-turn workflows (such as ComfyUI nodes or stateless API endpoints) when using Hybrid and Multimodal models (like Qwen3.5). I also completely overhauled the sampling seed management to guarantee thread safety and strict determinism.

Here is a detailed breakdown of what is new and how you can leverage these optimizations.
⚡ 1. Zero-Latency Single-Turn Optimizations for Hybrid/Multimodal Models
Previously, running Hybrid models (which maintain internal hidden states) in stateless or single-turn environments introduced massive inefficiencies. Even if you didn't need multi-turn chat history, the engine was still attempting to extract and save ~150MB+ of state data from VRAM to RAM over the PCIe bus at the end of every generation, causing an unavoidable ~3-second blocking delay.
Worse, if you tried to disable this cache, the N-1 prefix matching logic would trigger a catastrophic KV cache clear, leading to `Invalid input batch` crashes when encountering multimodal pseudo-tokens. I have fundamentally refactored this pipeline:
- `Llama.generate`: if caching is disabled, the engine now skips the N-1 truncation entirely. It reuses the freshly computed logits from the Multimodal Handler and instantly starts generating text.
- The expensive token-buffer copy (`self._input_ids[:self.n_tokens].tolist()`) is avoided. This ensures absolutely zero memory allocation and zero overhead for single-turn hybrid workflows.

💡 How to use this optimization:
If you are building ComfyUI nodes or running single-turn API wrappers where you do not need multi-turn state rollbacks, simply initialize your `Llama` instance with `ctx_checkpoints=0`.
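A minimal sketch of this setup (the model path, prompt, and `n_ctx` value are placeholders; `ctx_checkpoints=0` is the option described above):

```python
from llama_cpp import Llama

# Stateless / single-turn setup: with ctx_checkpoints=0 the engine
# skips hybrid-state extraction and N-1 prefix matching entirely.
llm = Llama(
    model_path="./qwen3.5-hybrid.gguf",  # placeholder path
    n_ctx=4096,
    ctx_checkpoints=0,
)

# Every call is an independent request; no multi-turn rollback
# state is saved back to RAM after generation.
out = llm.create_completion("Summarize KV caching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Since no checkpoints are kept, each completion pays no state-save cost at the end of generation, which is exactly what a ComfyUI node or stateless endpoint wants.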
Note: `MTMDChatHandler` has also been updated to suppress cache-related anchoring logs when `max_checkpoints <= 0`.

🎲 2. Fixed Sampling Seed Determinism & Thread Safety
I noticed reports of the `seed` parameter seemingly having no effect during generation. After tracing the parameter chain, I found that explicit seeds were not being passed down to the C++ `llama_sampling_context`.

Furthermore, `_create_completion` was using an anti-pattern: it mutated the global `self._seed` of the Llama instance during generation. In a concurrent environment (like a web server handling multiple requests), this caused thread-unsafe state pollution. I have completely removed this global mutation:
- Added an explicit `seed` parameter to both the `generate` and `sample` method signatures.
- Explicit seeds now flow directly into `LlamaSamplingParams` instances instead of mutating shared instance state.
🛠️ 3. Modernized Bug Report Template & Upstream Sync
The bug report template now documents the `verbose=True` logging requirements, with code examples, to help me debug your C++ backend logs faster.

A huge thank you to everyone in the discussions and issues who provided the logs and insights that made tracing these deep state-machine bugs possible.
Happy coding!
— JamePeng