Summary
Implement PrefetchQueue to hide I/O latency by downloading the next job while processing the current one. This is a critical performance component from DISTRIBUTED_DESIGN.md.
Context
From DISTRIBUTED_DESIGN.md:
Prefetch Pipeline: Download next job while processing current (hides I/O latency)
The design calls for two prefetch slots per worker, targeting a 40% throughput improvement from I/O overlap.
Current State
- Workers process one job at a time sequentially
- Each job requires S3/OSS download before processing starts
- I/O wait time adds directly to total processing time
Target State
┌─────────────────────────────────────────────────────────────────┐
│ Worker Internal Architecture │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Prefetch Pipeline │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Slot 1 │ │ Slot 2 │ │ │
│ │ │ (downloading│ │ (queued) │ │ │
│ │ │ next job) │ │ │ │ │
│ │ └──────┬──────┘ └─────────────┘ │ │
│ └─────────┼───────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────▼───────────────────────────────────────────────┐ │
│ │ Active Job Processing │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Tasks
1. Define PrefetchQueue Structure
pub struct PrefetchQueue {
/// Slots for prefetching (default: 2)
slots: Vec<Option<PrefetchSlot>>,
/// Background downloader task
downloader: Arc<Downloader>,
/// Available jobs to prefetch
pending: Receiver<JobRecord>,
/// Storage for downloads
storage: Arc<Storage>,
}
struct PrefetchSlot {
job: JobRecord,
local_path: PathBuf,
status: PrefetchStatus,
}
enum PrefetchStatus {
Downloading,
Ready,
Failed(String),
}
2. Implement Parallel Range Downloader
pub struct ParallelDownloader {
/// Number of parallel connections (default: 16)
connections: usize,
/// Chunk size for range requests (default: 8MB)
chunk_size: usize,
}
impl ParallelDownloader {
/// Download file using parallel range requests
pub async fn download(&self, url: &str, dest: &Path) -> Result<()>;
/// Resume from existing partial download
pub async fn resume(&self, url: &str, dest: &Path) -> Result<()>;
}
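The range-splitting step behind `ParallelDownloader::download` can be sketched as a pure helper that turns a file size and chunk size into inclusive byte ranges for HTTP `Range: bytes=start-end` requests. `byte_ranges` is an illustrative name, not part of the design; the real downloader would fan these ranges out across `connections` concurrent requests.

```rust
/// Split a file of `total_size` bytes into inclusive byte ranges of at most
/// `chunk_size` bytes each, one per HTTP range request.
fn byte_ranges(total_size: u64, chunk_size: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut start = 0;
    while start < total_size {
        // Inclusive end offset, clamped to the last byte of the file.
        let end = (start + chunk_size).min(total_size) - 1;
        ranges.push((start, end));
        start = end + 1;
    }
    ranges
}

fn main() {
    // A 20 MB object split into 8 MB chunks yields three ranges,
    // the last one shorter than the others.
    let mb = 1024 * 1024;
    let ranges = byte_ranges(20 * mb, 8 * mb);
    assert_eq!(ranges.len(), 3);
    assert_eq!(ranges[0], (0, 8 * mb - 1));
    assert_eq!(ranges[2], (16 * mb, 20 * mb - 1));
    println!("{:?}", ranges);
}
```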
3. Implement PrefetchQueue Logic
- Initialization: Start with 2 slots, queue first 2 jobs for download
- Get Next Job:
  - Return ready slot if available
  - Wait for download to complete if downloading
  - Queue next job download in background
- Download Strategy:
  - Use parallel range requests (16 connections)
  - Memory-mapped file for large downloads
  - Retry on failure with exponential backoff
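The get-next-job and retry rules above can be sketched synchronously over the slot types from Task 1. `select_slot` and `backoff_ms` are illustrative helpers, not names from the design; the real implementation would be async and would await the in-flight download instead of returning its index.

```rust
use std::path::PathBuf;

// Slot types from Task 1, reproduced so this sketch stands alone.
#[derive(Debug, Clone, PartialEq)]
enum PrefetchStatus {
    Downloading,
    Ready,
    Failed(String),
}

struct PrefetchSlot {
    local_path: PathBuf,
    status: PrefetchStatus,
}

/// Pick the slot to hand to the worker: a Ready slot wins; otherwise
/// return the first Downloading slot for the caller to await.
fn select_slot(slots: &[Option<PrefetchSlot>]) -> Option<usize> {
    // Prefer a completed download...
    if let Some(i) = slots.iter().position(|s| {
        matches!(s, Some(slot) if slot.status == PrefetchStatus::Ready)
    }) {
        return Some(i);
    }
    // ...otherwise fall back to one still in flight.
    slots.iter().position(|s| {
        matches!(s, Some(slot) if slot.status == PrefetchStatus::Downloading)
    })
}

/// Exponential backoff delay in ms for retry attempt `attempt`
/// (500 ms base, doubling, capped at 30 s).
fn backoff_ms(attempt: u32) -> u64 {
    (500u64.saturating_mul(1 << attempt.min(16))).min(30_000)
}

fn main() {
    let slots = vec![
        Some(PrefetchSlot {
            local_path: PathBuf::from("/tmp/a"),
            status: PrefetchStatus::Downloading,
        }),
        Some(PrefetchSlot {
            local_path: PathBuf::from("/tmp/b"),
            status: PrefetchStatus::Ready,
        }),
    ];
    // The Ready slot at index 1 is preferred over the Downloading one.
    assert_eq!(select_slot(&slots), Some(1));
    assert_eq!(backoff_ms(0), 500);
    assert_eq!(backoff_ms(10), 30_000);
}
```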
4. Integrate with Worker
// In Worker::run()
let prefetch = PrefetchQueue::new(storage, config);
loop {
// Get next ready job (blocks until download complete)
let job = prefetch.next_ready().await?;
// Process job (I/O already done)
let result = self.process_job(&job).await?;
// Notify prefetch to queue another download
prefetch.notify_complete();
}
5. Add Configuration
pub struct PrefetchConfig {
/// Number of prefetch slots
pub slots: usize, // default: 2
/// Parallel download connections
pub download_connections: usize, // default: 16
/// Range request chunk size
pub chunk_size: usize, // default: 8MB
/// Local buffer directory
pub buffer_dir: PathBuf,
}
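A minimal `Default` impl for the config above, wiring in the defaults noted in the field comments. The `buffer_dir` value is an assumption for illustration; use whatever scratch directory the worker already owns.

```rust
use std::path::PathBuf;

pub struct PrefetchConfig {
    pub slots: usize,
    pub download_connections: usize,
    pub chunk_size: usize,
    pub buffer_dir: PathBuf,
}

impl Default for PrefetchConfig {
    fn default() -> Self {
        Self {
            slots: 2,
            download_connections: 16,
            chunk_size: 8 * 1024 * 1024, // 8 MB range chunks
            // Assumed scratch path, not specified by the design.
            buffer_dir: PathBuf::from("/tmp/roboflow-prefetch"),
        }
    }
}

fn main() {
    let cfg = PrefetchConfig::default();
    assert_eq!(cfg.slots, 2);
    assert_eq!(cfg.download_connections, 16);
    assert_eq!(cfg.chunk_size, 8 * 1024 * 1024);
    println!("buffer dir: {}", cfg.buffer_dir.display());
}
```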
6. Add Metrics
roboflow_prefetch_queue_depth (Gauge)
roboflow_prefetch_download_duration_seconds (Histogram)
roboflow_prefetch_bytes_downloaded_total (Counter)
roboflow_prefetch_errors_total (Counter)
Performance Target
| Metric     | Before           | After            | Improvement |
|------------|------------------|------------------|-------------|
| I/O Wait   | 3-8 sec          | Hidden (overlap) | ~100%       |
| Throughput | ~536 Mbps/worker | ~750 Mbps/worker | +40%        |
Dependencies
- Requires: Streaming S3 reader (exists)
- Enables: 10 Gbps throughput target
Files to Create
crates/roboflow-distributed/src/prefetch.rs
crates/roboflow-distributed/src/prefetch/queue.rs
crates/roboflow-distributed/src/prefetch/downloader.rs
Files to Modify
crates/roboflow-distributed/src/worker.rs
crates/roboflow-distributed/src/lib.rs
Acceptance Criteria