Avoid large merge backlog to block the indexer#25
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent indexers from being assigned ingest v1 pipelines when QW_DISABLE_INGEST_V1 is enabled, avoiding expensive metastore work that can create a large merge backlog. It also adds instrumentation to help identify slow IndexingService actor handlers.
Changes:
- Skip scheduling ingest v1 (
SourceParams::IngestApi) sources whendisable_ingest_v1()is true. - Add a
SlowHandlerGuardand wire it into severalIndexingServicehandlers to emit periodic “slow handler” warnings. - Improve indexing task map formatting by omitting entries with empty task lists.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| quickwit/quickwit-indexing/src/actors/indexing_service.rs | Adds SlowHandlerGuard and instantiates it in multiple actor handlers to detect slow operations. |
| quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs | Skips scheduling ingest v1 sources when disabled; tweaks debug-formatting to hide empty task lists. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub struct SlowHandlerGuard { | ||
| _cancel_tx: oneshot::Sender<()>, | ||
| } | ||
|
|
||
| impl SlowHandlerGuard { | ||
| pub fn new(handler_name: &'static str) -> Self { | ||
| let (cancel_tx, mut cancel_rx) = oneshot::channel::<()>(); | ||
| let start = Instant::now(); |
There was a problem hiding this comment.
SlowHandlerGuard uses oneshot::Sender / oneshot::channel, but oneshot is not imported anywhere in this file. This will not compile as-is; add an appropriate import (e.g. from tokio::sync) or fully-qualify the path at the usage sites.
| pub fn new(handler_name: &'static str) -> Self { | ||
| let (cancel_tx, mut cancel_rx) = oneshot::channel::<()>(); | ||
| let start = Instant::now(); | ||
| tokio::spawn(async move { | ||
| loop { | ||
| tokio::select! { | ||
| _ = tokio::time::sleep(Duration::from_secs(5)) => { | ||
| warn!(handler=handler_name, elapsed_secs=start.elapsed().as_secs(), "slow indexing service handler"); | ||
| } | ||
| _ = &mut cancel_rx => { break; } | ||
| } | ||
| } | ||
| }); |
There was a problem hiding this comment.
SlowHandlerGuard::new spawns a new Tokio task on every handler invocation (including periodic supervise_loop). Even when the handler is fast, this creates avoidable task-scheduling overhead and can amplify load under high message volume. Consider implementing the slow-handler warning without spawning (e.g., wrap the handler future in a tokio::select! loop with an interval/sleep tick and log while the future is still pending).
| SourceParams::IngestApi => { | ||
| // TODO ingest v1 is scheduled differently | ||
| if disable_ingest_v1() { | ||
| // Existing indexes might still have the _ingest-api-source | ||
| continue; | ||
| } | ||
| // Note: ingest v1 is scheduled differently |
There was a problem hiding this comment.
This new branch makes get_sources_to_schedule behavior depend on QW_DISABLE_INGEST_V1. There is an existing unit test (test_get_sources_to_schedule) that includes an IngestApi source and asserts a fixed number of scheduled sources; with this change it will need to account for the disabled-ingest-v1 case (otherwise it can fail in test runs where QW_DISABLE_INGEST_V1=true). Consider updating/adding assertions so the test covers both outcomes (scheduled when enabled, skipped when disabled).
Description
Currently, when the indexer starts it is assigned all the ingest v1 pipelines even though DISABLE_INGEST_V1 is set to true. This triggers a huge scan for merge oportunities that happened to be very slow on the indexer node that accumulated a huge backlog of merges. This PR de-activates completely the assignation of ingest v1 pipelines, which should avoid this very slow metastore call.
It also adds some logs to help troubleshoot further issues with the indexing service.