Background
PR #830 introduced the exclude_nodes parameter to SlurmSystem.get_nodes_by_spec() and wired it through the Megatron-Bridge command generation strategy. However, several other workloads and infrastructure components call get_nodes_by_spec() directly without forwarding test_run.exclude_nodes, so the exclusion has no effect for those paths.
Affected call sites
The following production code locations call get_nodes_by_spec() without passing exclude_nodes:
| File | Line | Caller context |
| --- | --- | --- |
| src/cloudai/workloads/nemo_run/slurm_command_gen_strategy.py | 126 | NeMo-Run command gen |
| src/cloudai/workloads/nemo_launcher/slurm_command_gen_strategy.py | 40 | NeMo-Launcher command gen |
| src/cloudai/workloads/triton_inference/slurm_command_gen_strategy.py | 86 | Triton Inference (_get_server_client_split) |
| src/cloudai/workloads/deepep/slurm_command_gen_strategy.py | 90 | DeepEP command gen |
| src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py | 206 | AI Dynamo command gen |
| src/cloudai/systems/slurm/slurm_command_gen_strategy.py | 325 | Base Slurm strategy (_enable_vboost_cmd) |
| src/cloudai/systems/slurm/single_sbatch_runner.py | 95, 118 | Single-sbatch runner (node list aggregation and per-test-run node resolution) |
Required changes
For each call site above:
- Parse test_run.exclude_nodes (a comma-separated string) into a set[str] using parse_node_list (already available in slurm_system.py), consistent with how SlurmCommandGenStrategy does it in this PR.
- Pass the resulting set as exclude_nodes=... to get_nodes_by_spec().
- For single_sbatch_runner.py, collect and union all exclude_nodes values across the test runs being batched.
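
The steps above could look roughly like the following sketch. It is illustrative only and does not import CloudAI: `FakeTestRun`, `split_exclude_nodes`, `union_exclude_nodes`, and `fake_get_nodes_by_spec` are hypothetical stand-ins, and the stub's signature and return value are assumptions rather than the real `SlurmSystem.get_nodes_by_spec()` API. In the actual call sites, `parse_node_list` from slurm_system.py should be used instead of the simplified comma splitter shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeTestRun:
    """Hypothetical stand-in for a test run; only the fields used here."""

    num_nodes: int
    nodes: list[str]
    exclude_nodes: Optional[str] = None  # comma-separated string, per the issue


def split_exclude_nodes(raw: Optional[str]) -> set[str]:
    """Simplified stand-in for parse_node_list: splits on commas only.

    It does not expand bracketed ranges; real code should call parse_node_list.
    """
    if not raw:
        return set()
    return {name.strip() for name in raw.split(",") if name.strip()}


def union_exclude_nodes(test_runs: Iterable[FakeTestRun]) -> set[str]:
    """Union the exclusions of every test run in a batch (single-sbatch case)."""
    excluded: set[str] = set()
    for tr in test_runs:
        excluded |= split_exclude_nodes(tr.exclude_nodes)
    return excluded


def fake_get_nodes_by_spec(
    num_nodes: int, nodes: list[str], exclude_nodes: Optional[set[str]] = None
) -> list[str]:
    """Hypothetical stub: drops excluded names from an explicit node list."""
    excluded = exclude_nodes or set()
    return [n for n in nodes if n not in excluded]


tr_a = FakeTestRun(2, ["node01", "node02", "node03"], exclude_nodes="node02, node03")
tr_b = FakeTestRun(1, ["node04"], exclude_nodes="node03,node05")

# Per-call-site pattern: parse, then forward as exclude_nodes=...
per_run_nodes = fake_get_nodes_by_spec(
    tr_a.num_nodes, tr_a.nodes, exclude_nodes=split_exclude_nodes(tr_a.exclude_nodes)
)

# Single-sbatch pattern: one exclusion set for the whole batch.
batched_excludes = union_exclude_nodes([tr_a, tr_b])
```

A test run with no exclude_nodes value yields an empty set in this sketch, so forwarding it should leave the current behaviour unchanged for runs that do not request exclusions.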
Priority
Low — this is a follow-up to PR #830; existing behaviour (no exclusion) is preserved until these sites are updated.
References