Summary
The _verify_commit(self, ref: str, path: Path) -> InstallStatusResult method is currently duplicated in both src/cloudai/systems/kubernetes/kubernetes_installer.py and src/cloudai/systems/slurm/slurm_installer.py.
Proposed Action
Extract the shared implementation into a common location — for example, a protected method on BaseInstaller or a dedicated InstallerVerifyMixin — and update both KubernetesInstaller and SlurmInstaller to rely on the shared implementation.
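A minimal sketch of the mixin option, under stated assumptions: the body of `_verify_commit` shown here (comparing `git rev-parse HEAD` against `ref`, treated as a full or abbreviated commit hash) is an illustrative guess and not the existing implementation, `InstallStatusResult` is a stand-in dataclass with assumed fields, and `_git_head` is a hypothetical helper introduced only to keep the git call in one overridable place.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path


@dataclass
class InstallStatusResult:
    """Stand-in for the project's result type; the fields are assumed."""

    success: bool
    message: str = ""


class InstallerVerifyMixin:
    """Hypothetical mixin holding the single shared copy of _verify_commit."""

    def _git_head(self, path: Path) -> subprocess.CompletedProcess:
        # Assumed check: resolve the commit currently checked out at `path`.
        return subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=str(path),
            capture_output=True,
            text=True,
        )

    def _verify_commit(self, ref: str, path: Path) -> InstallStatusResult:
        result = self._git_head(path)
        if result.returncode != 0:
            return InstallStatusResult(
                False, f"Failed to read commit in {path}: {result.stderr.strip()}"
            )
        head = result.stdout.strip()
        # Accept either the full hash or an abbreviated prefix of it.
        if not ref or not head.startswith(ref):
            return InstallStatusResult(
                False, f"Expected commit {ref}, found {head} in {path}"
            )
        return InstallStatusResult(True)


class KubernetesInstaller(InstallerVerifyMixin):
    """Placeholder: the real class would drop its own copy of the method."""


class SlurmInstaller(InstallerVerifyMixin):
    """Placeholder: the real class would drop its own copy of the method."""
```

Placing the method directly on BaseInstaller would look the same minus the extra class; the mixin variant mainly helps if other installers should be able to opt out.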
Additional Scope: Workload-level installation hooks
Some workloads require custom installation steps beyond cloning a git repository. For example, Megatron-Bridge currently installs numpy and wandb via pip install at submit time or runtime, inside the launcher wrapper script, rather than at install time. This is fragile: parallel submissions can race on the same site-packages directory, and a transient network failure can break an otherwise valid launch.
The installable object model should be extended to support custom installation hooks, allowing workloads to declare additional steps (e.g. pip install numpy wandb) that are executed once at install time, within the managed installation phase, rather than on every job submission. Megatron-Bridge is the concrete motivating example.
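One possible shape for such hooks, sketched under stated assumptions: `InstallHook`, `HookedInstallable`, `pip_install_hook`, and `run_install_hooks` are hypothetical names that do not exist in the codebase, `InstallStatusResult` is again a stand-in with assumed fields, and the choice of Python interpreter (and therefore of target environment) for pip is left as a parameter because the issue does not specify it.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List


@dataclass
class InstallStatusResult:
    """Stand-in for the project's result type; the fields are assumed."""

    success: bool
    message: str = ""


# Hypothetical hook signature: receives the install directory, reports status.
InstallHook = Callable[[Path], InstallStatusResult]


def pip_install_hook(*packages: str, python: str = sys.executable) -> InstallHook:
    """Build a hook that pip-installs `packages` using the given interpreter."""

    def hook(install_path: Path) -> InstallStatusResult:
        cmd = [python, "-m", "pip", "install", *packages]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return InstallStatusResult(False, f"pip install failed: {result.stderr.strip()}")
        return InstallStatusResult(True)

    return hook


@dataclass
class HookedInstallable:
    """Hypothetical installable that declares extra install-time steps."""

    name: str
    post_install_hooks: List[InstallHook] = field(default_factory=list)


def run_install_hooks(item: HookedInstallable, install_path: Path) -> InstallStatusResult:
    """Run hooks in order during the install phase, stopping at the first failure."""
    for hook in item.post_install_hooks:
        result = hook(install_path)
        if not result.success:
            return InstallStatusResult(False, f"{item.name}: {result.message}")
    return InstallStatusResult(True)
```

Under these assumptions, Megatron-Bridge could declare `post_install_hooks=[pip_install_hook("numpy", "wandb")]`, and the installer would call `run_install_hooks` once after the repository is cloned, so a network failure surfaces at install time instead of inside a submitted job.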
References
Requested by @podkidyshev.