┌────────────────────────────────────────────────────────────────────────┐
│                      Alibaba Cloud Infrastructure                      │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌──────────────┐    ┌──────────────────────────────────────────────┐  │
│  │ Web UI       │───▶│               ACK/EKS Cluster                │  │
│  │ (Monitor)    │    │                                              │  │
│  └──────────────┘    │  ┌────────────────────────────────────────┐  │  │
│                      │  │ Scanner Actor (Leader-elected)         │  │  │
│  ┌──────────────┐    │  │ - List OSS for new files               │  │  │
│  │ CLI          │───▶│  │ - Insert jobs into TiKV                │  │  │
│  │ (Submit/Mgmt)│    │  └────────────────────────────────────────┘  │  │
│  └──────────────┘    │                                              │  │
│                      │  ┌────────────────────────────────────────┐  │  │
│  ┌──────────────┐    │  │ Worker Pods (N identical peers)        │  │  │
│  │ OSS (Input)  │◀──▶│  │ - Claim jobs via TiKV CAS              │  │  │
│  │ /raw-data/   │    │  │ - Stream from OSS (range requests)     │  │  │
│  └──────────────┘    │  │ - Checkpoint progress to TiKV          │  │  │
│                      │  │ - Multipart upload to OSS              │  │  │
│  ┌──────────────┐    │  └────────────────────────────────────────┘  │  │
│  │ OSS (Output) │◀──▶│                                              │  │
│  │ /lerobot/    │    │  ┌────────────────────────────────────────┐  │  │
│  └──────────────┘    │  │ API Server                             │  │  │
│                      │  │ - REST API for job management          │  │  │
│  ┌──────────────┐    │  │ - Web UI for monitoring                │  │  │
│  │ TiKV         │◀──▶│  └────────────────────────────────────────┘  │  │
│  │ Cluster      │    │                                              │  │
│  └──────────────┘    └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
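The "Claim jobs via TiKV CAS" step in the Worker Pods box can be sketched as below. This is a minimal illustration, not the project's implementation: `InMemoryKV`, `lock_key`, and `try_claim` are hypothetical names, the store is an in-process stand-in rather than the TiKV client API, and storing the pod ID under a `/locks/{hash}` key is an assumption based on the key prefixes in the TiKV data model.

```python
import threading
from typing import Dict, Optional


class InMemoryKV:
    """Hypothetical stand-in for a store with atomic compare-and-swap.

    This is NOT the TiKV client API; it only mimics the CAS semantics.
    """

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self._mutex = threading.Lock()

    def get(self, key: str) -> Optional[bytes]:
        with self._mutex:
            return self._data.get(key)

    def compare_and_swap(
        self, key: str, expected: Optional[bytes], new: bytes
    ) -> bool:
        """Write `new` only if the current value equals `expected`.

        `expected=None` means "the key must not exist yet".
        """
        with self._mutex:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def lock_key(job_hash: str) -> str:
    # Assumed layout, following the /locks/{hash} prefix in the data model.
    return f"/locks/{job_hash}"


def try_claim(kv: InMemoryKV, job_hash: str, pod_id: str) -> bool:
    """Try to claim a job by creating its lock key.

    Because the swap only succeeds when the key is absent, at most one
    worker can win the claim for a given job hash.
    """
    return kv.compare_and_swap(lock_key(job_hash), None, pod_id.encode())
```

With this shape, N identical workers can all call `try_claim` for the same job and only the first succeeds; the others move on to the next job. Lock expiry for crashed workers (for example via the heartbeat keys) is deliberately left out of this sketch.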
Phase 1-6: ✅ COMPLETE
Phase 7: Pipeline Integration (CURRENT PRIORITY)
#72 (LerobotWriter) ─► #73 (Checkpoint Save) ─► #47 (Pipeline Hooks)
└─► #48 (Graceful Shutdown)
Phase 8: GPU (parallel, optional)
#47 ─► #49 (NVENC)
Phase 9: Kubernetes
#47 + #48 ─► #18 ─► #20
Phase 10: CLI & Web UI
#40 ─► #50 (CLI) ─► #51 (Web UI)
Phase 11: Observability
#47 ─► #21 (Metrics)
#18 ─► #22 (Logging)
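One way to sanity-check the graph above is to encode its arrows and ask for a valid work order. The sketch below uses Python's standard `graphlib` module; the `deps` mapping is transcribed from the arrows as written. The upstream edge of #48 is left out as an assumption-free choice, because the graph as shown does not make clear which issue it branches from.

```python
from graphlib import TopologicalSorter

# Map each issue number to the set of issues it depends on,
# transcribed from the "A ─► B" arrows (B depends on A).
deps = {
    73: {72},      # LerobotWriter -> Checkpoint Save
    47: {73},      # Checkpoint Save -> Pipeline Hooks
    49: {47},      # Pipeline Hooks -> NVENC
    18: {47, 48},  # Pipeline Hooks + Graceful Shutdown -> #18
    20: {18},
    50: {40},      # #40 -> CLI
    51: {50},      # CLI -> Web UI
    21: {47},      # Pipeline Hooks -> Metrics
    22: {18},      # #18 -> Logging
}

# static_order() yields every node after all of its dependencies
# and raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(deps).static_order())
print(" -> ".join(f"#{n}" for n in order))
```

Any order it prints is only one of several valid schedules; issues with no path between them (for example #49 and #18) can proceed in parallel.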
Overview
Transform roboflow into a distributed, fault-tolerant system that uses TiKV for coordination and a shared-nothing compute architecture.
Design Documents
Key Characteristics
Architecture
TiKV Data Model
/jobs/{hash}
/locks/{hash}
/state/{hash}
/heartbeat/{pod_id}
/system/scanner_lock
Implementation Phases
Phase 1-3: Storage & LeRobot ✅ COMPLETE
Phase 4: TiKV Coordination Layer ✅ COMPLETE
Phase 5: Checkpointing System ✅ COMPLETE
Phase 6: Storage Enhancements ✅ COMPLETE
Phase 7: Pipeline Integration 🚧 IN PROGRESS
Phase 8: GPU Acceleration (Optional)
Phase 9: Kubernetes Deployment
Phase 10: CLI & Web UI
Phase 11: Observability
New 5-Phase Roadmap (from DISTRIBUTED_DESIGN.md)
Dependency Graph
User Interaction
CLI Commands
Web UI Features
Success Criteria
Architecture Benefits