Problem
Workers in the distributed system currently treat config_hash as a local file path. In a distributed environment where workers run in separate pods/machines, they don't have access to the submit node's filesystem. This causes workers to fall back to empty configs, resulting in 0 frames written.
Solution
Store dataset configuration TOML content in TiKV using content-addressable storage (SHA-256 hash). Jobs reference configs by hash, and workers fetch the config content from TiKV.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ Submit Node │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Config File │ ──▶ │ Read & Hash │ ──▶ │ Store in TiKV│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ SHA-256 Hash /roboflow/v1/configs/{hash} │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ JobRecord { │ │
│ │ id: "job-abc", │ │
│ │ config_hash: "a3f5b...", // ← hash reference, NOT file path │ │
│ │ ... │ │
│ │ } │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
│
│ Job Queue (TiKV)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ Worker Node │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Claim Job │ ──▶ │ Get by Hash │ ──▶ │ Parse TOML │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ config_hash: get_config() LerobotConfig │
│ "a3f5b..." from TiKV from TOML │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Tasks
1. Submit Command (src/bin/commands/submit.rs)
- Add load_or_store_config() helper function
- Read config file content
- Compute SHA-256 hash
- Store ConfigRecord in TiKV if not already present
- Use hash as config_hash in JobRecord
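The store-if-absent step above could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual implementation: ConfigStore and MemoryStore are hypothetical stand-ins for TikvClient, the hash function is passed in so the sketch needs no external crate (real code would presumably use ConfigRecord::compute_hash(), whose signature is not shown here), and reading the file is left to the caller. The key layout is the one shown in the architecture diagram.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the TiKV client; the real TikvClient API may differ.
pub trait ConfigStore {
    fn get(&self, key: &str) -> Option<String>;
    fn put(&mut self, key: String, content: String);
}

/// In-memory store, present only so the sketch runs without a TiKV cluster.
#[derive(Default)]
pub struct MemoryStore {
    entries: HashMap<String, String>,
    /// Number of writes performed, useful for checking deduplication.
    pub writes: usize,
}

impl ConfigStore for MemoryStore {
    fn get(&self, key: &str) -> Option<String> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: String, content: String) {
        self.writes += 1;
        self.entries.insert(key, content);
    }
}

/// Key layout taken from the architecture diagram: /roboflow/v1/configs/{hash}.
pub fn config_key(hash: &str) -> String {
    format!("/roboflow/v1/configs/{hash}")
}

/// Stores `content` under its hash unless that key already exists, and
/// returns the hash so the caller can put it in JobRecord.config_hash.
/// `hash_fn` is an assumption standing in for the SHA-256 helper.
pub fn load_or_store_config<S: ConfigStore>(
    store: &mut S,
    content: &str,
    hash_fn: impl Fn(&str) -> String,
) -> String {
    let hash = hash_fn(content);
    let key = config_key(&hash);
    // Content-addressed entries are immutable, so an existing key never needs rewriting.
    if store.get(&key).is_none() {
        store.put(key, content.to_owned());
    }
    hash
}
```

Because the key is derived from the content, submitting the same config twice results in a single write, and an edited config lands under a new key instead of overwriting the old one.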
2. Worker (crates/roboflow-distributed/src/worker.rs)
- Change create_lerobot_config() to async
- Fetch config from TiKV using config_hash
- Parse TOML content to LerobotConfig
- Fail job if config not found (don't fall back to empty config)
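The "fail instead of falling back" rule can be made explicit in the return type. The sketch below is a minimal illustration, not the worker's actual code: ConfigFetchError and fetch_config_content are hypothetical names, the `lookup` closure stands in for TikvClient::get_config() (whose real signature is not shown in this issue), and the function is synchronous only to keep the example self-contained, whereas the task above calls for an async version.

```rust
use std::fmt;

/// Hypothetical error type for a config that cannot be resolved.
#[derive(Debug, PartialEq)]
pub enum ConfigFetchError {
    /// No config content is stored under the given hash.
    NotFound { config_hash: String },
}

impl fmt::Display for ConfigFetchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigFetchError::NotFound { config_hash } => {
                write!(f, "config {config_hash} not found in TiKV")
            }
        }
    }
}

impl std::error::Error for ConfigFetchError {}

/// Returns the stored TOML content for `config_hash`, or an error.
/// There is deliberately no default or empty config on the miss path:
/// the caller is expected to mark the job as failed.
/// `lookup` is an assumption standing in for the TiKV read.
pub fn fetch_config_content(
    config_hash: &str,
    lookup: impl Fn(&str) -> Option<String>,
) -> Result<String, ConfigFetchError> {
    lookup(config_hash).ok_or_else(|| ConfigFetchError::NotFound {
        config_hash: config_hash.to_owned(),
    })
}
```

Returning a Result forces the call site to handle the missing case, which removes the silent path that previously produced 0 frames.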
3. Config Parsing (crates/roboflow-dataset/src/lerobot/config.rs)
- Add from_toml(content: &str) method
- Keep existing from_file(path) for backward compatibility
Existing Infrastructure
The following components are already implemented and ready to use:
| Component | Location |
|---|---|
| ConfigRecord struct | crates/roboflow-distributed/src/tikv/schema.rs |
| ConfigKeys::config() | crates/roboflow-distributed/src/tikv/key.rs |
| TikvClient::put_config() | crates/roboflow-distributed/src/tikv/client.rs |
| TikvClient::get_config() | crates/roboflow-distributed/src/tikv/client.rs |
| SHA-256 hashing | ConfigRecord::compute_hash() |
Design Decisions
| Question | Decision |
|---|---|
| Config not found in TiKV? | Fail job immediately |
| Backward compatibility? | Detect hash vs path (64-char hex = hash) |
| Config validation? | Parse TOML on submit before storing |
| Config updates? | Immutable (new content = new hash) |
| Caching? | Optional: LRU cache in worker for same config |
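The hash-vs-path rule from the backward-compatibility decision can be sketched as follows. The names ConfigRef, is_config_hash, and classify_config_ref are hypothetical and chosen for illustration. The check relies on a SHA-256 digest being 64 hexadecimal characters, and this sketch accepts both upper- and lowercase digits, which is an assumption about how hashes are encoded.

```rust
use std::path::Path;

/// How a job's config_hash field should be interpreted (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ConfigRef<'a> {
    /// A content hash to look up in TiKV.
    Hash(&'a str),
    /// A legacy local file path.
    Path(&'a Path),
}

/// True when `value` has the shape of a hex-encoded SHA-256 digest:
/// exactly 64 ASCII hexadecimal characters.
pub fn is_config_hash(value: &str) -> bool {
    value.len() == 64 && value.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Classifies a config_hash value as either a hash or a legacy path.
/// A relative file whose name is exactly 64 hex characters would be
/// treated as a hash; that edge case is accepted in this sketch.
pub fn classify_config_ref(value: &str) -> ConfigRef<'_> {
    if is_config_hash(value) {
        ConfigRef::Hash(value)
    } else {
        ConfigRef::Path(Path::new(value))
    }
}
```

Anything containing a path separator or a file extension such as .toml fails the hex check, so existing path-based jobs keep resolving as paths.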
Files to Modify
src/bin/commands/submit.rs
crates/roboflow-distributed/src/worker.rs
crates/roboflow-dataset/src/lerobot/config.rs
Related
- Existing TiKV infrastructure in roboflow-distributed crate