Harden CI for self-hosted SLURM runners #1135
sbryngelson wants to merge 32 commits into MFlowCode:master from
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
read -t 0.1 (sub-second timeout) in a loop with process substitution file descriptors triggers a bash internal error (unwind_frame_run: read_builtin: frame not found) leading to a segfault. Use integer timeout (read -t 1) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
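For context, a minimal sketch of the streaming loop this fixes, assuming the monitor reads the tailed job output through file descriptor 3 (the file name and heartbeat text are illustrative, not the script's exact code):

```bash
output_file="job.out"                       # illustrative path
exec 3< <(tail -f "$output_file" 2>&1)      # fd 3 backed by a process substitution

while true; do
    # Integer timeout: wait up to 1 second for a new line, then fall through.
    # A fractional timeout here (read -t 0.1) is what triggered the
    # "unwind_frame_run: read_builtin: frame not found" crash described above.
    if IFS= read -r -t 1 line <&3; then
        printf '%s\n' "$line"
    else
        echo "[heartbeat] no new output yet"
        # A real monitor would also poll the job state here and break
        # out of the loop once the job reaches a terminal state.
    fi
done
```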
📝 Walkthrough
Refactors SLURM monitoring to use state-driven polling, improves exit-code extraction and tailing behavior, and adds cancellation on abnormal exit; updates GPU SBATCH requests to target H200 GPUs.
Sequence Diagram(s)
sequenceDiagram
participant Monitor as monitor_slurm_job.sh
participant SlurmCtl as squeue/sacct/scontrol
participant JobFS as Job output file (filesystem)
participant Tail as tail process
Monitor->>SlurmCtl: get_job_state(job_id) (squeue → sacct fallback)
SlurmCtl-->>Monitor: STATE (PENDING/RUNNING/COMPLETED/UNKNOWN)
alt STATE PENDING/CONFIGURING
Monitor->>Monitor: sleep (~10s), continue polling
else STATE RUNNING/COMPLETING
Monitor->>Tail: start non-blocking tail (1s timeout, burst cap)
Tail->>JobFS: read new lines
JobFS-->>Tail: new output
Tail-->>Monitor: heartbeat + output
else STATE TERMINAL
Monitor->>Tail: stop tail, drain remaining lines (1s timeout, cap)
Monitor->>SlurmCtl: scontrol show job -> ExitCode (fallback sacct)
SlurmCtl-->>Monitor: ExitCode or UNKNOWN
alt ExitCode == 0
Monitor-->>Monitor: set monitor_success, exit 0
else
Monitor->>SlurmCtl: scancel job_id (if needed)
Monitor-->>Monitor: exit non-zero
end
else STATE UNKNOWN
Monitor->>Monitor: periodic warning, longer sleep, retry
end
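The diagram corresponds to a fairly compact polling loop. A simplified bash sketch of the idea (not the repository's exact monitor_slurm_job.sh; function names, sleep intervals, and option handling are illustrative):

```bash
job_id="$1"

get_job_state() {
    local state
    # squeue covers queued/running jobs; fall back to sacct once the job has
    # left the queue. "|| true" keeps transient SLURM outages from aborting
    # the monitor under set -euo pipefail.
    state=$(squeue -h -j "$job_id" -o '%T' 2>/dev/null || true)
    if [ -z "$state" ]; then
        state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n 1 || true)
    fi
    echo "${state:-UNKNOWN}"
}

while true; do
    state=$(get_job_state)
    case "$state" in
        PENDING|CONFIGURING)  sleep 10 ;;
        RUNNING|COMPLETING)   sleep 5 ;;   # the real script drains tail output here
        COMPLETED)            exit 0 ;;
        FAILED|CANCELLED*|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY)
                              exit 1 ;;    # the real script inspects scontrol/sacct ExitCode
        *)                    echo "WARNING: state of job $job_id unknown; retrying"
                              sleep 30 ;;
    esac
done
```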
Pull request overview
Updates the Phoenix (GT) GitHub Actions SLURM submission scripts to target H200 GPU resources for improved scheduling, and adjusts the SLURM job monitor script to avoid a Bash segfault caused by fractional read -t timeouts.
Changes:
- Update Phoenix GPU sbatch directives to request H200 GPUs and adjust task count per node.
- Replace fractional read -t timeouts with integer timeouts in the SLURM job monitor script.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| .github/workflows/phoenix/submit.sh | Switch GPU submission options to request H200 GPUs and increase --ntasks-per-node. |
| .github/workflows/phoenix/submit-bench.sh | Same as above for benchmark submissions. |
| .github/scripts/monitor_slurm_job.sh | Use integer read -t timeouts to prevent Bash segfaults during output streaming. |
PR MFlowCode#1124 changed bench.yml to use workflow_run (triggered after Test Suite completes), which broke the approve-to-run flow for fork PRs. Revert to the original pull_request + pull_request_review triggers while keeping improvements (frontier_amd matrix, concurrency group, timeout, run_parallel_benchmarks.sh). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #1135 +/- ##
=======================================
Coverage 44.07% 44.07%
=======================================
Files 70 70
Lines 20431 20431
Branches 1974 1974
=======================================
Hits 9004 9004
Misses 10291 10291
Partials 1136 1136
☔ View full report in Codecov by Sentry.
Write failed test UUIDs to tests/failed_uuids.txt after a test run. In CI, if 1-5 tests fail, automatically re-run just those tests. If 6+ fail, treat it as a real issue and fail immediately. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/workflows/test.yml">
<violation number="1" location=".github/workflows/test.yml:138">
P2: The initial test run is forced to succeed with `|| true`, so real failures can be masked when `tests/failed_uuids.txt` is missing. Preserve the exit code and fail the job if no retry is performed.</violation>
</file>
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/test.yml:
- Around line 137-153: The test step currently uses "|| true" after invoking
mfc.sh which hides crashes; remove that and capture the test command's exit
status (run mfc.sh test -v ... and set TEST_EXIT=$? or use if ! ...; then
TEST_EXIT=$? else TEST_EXIT=0) so you can decide later; keep the existing retry
logic that checks tests/failed_uuids.txt (NUM_FAILED, FAILED) and, after that
block, if tests/failed_uuids.txt does not exist and TEST_EXIT is non‑zero then
exit with TEST_EXIT (or otherwise propagate the original non‑zero exit code) to
avoid reporting success on a crash. Ensure you reference the same mfc.sh
invocation and tests/failed_uuids.txt and use TEST_EXIT when making the final
exit decision.
🧹 Nitpick comments (2)
toolchain/mfc/test/test.py (1)
209-216: Race condition: failed_tests is mutated from worker threads without synchronization.
failed_tests is appended to in handle_case (line 505), which runs in worker threads. While CPython's GIL makes list.append atomic, iterating over failed_tests here (line 213) is safe only because sched.sched (line 179) has already joined all workers by this point. This is fine but fragile; a future refactor that moves this block could introduce a bug. A brief comment would help.
.github/workflows/test.yml (1)
144-144: Minor: useless use of cat.
tr '\n' ' ' < tests/failed_uuids.txt is slightly cleaner, though this is purely cosmetic.
Don't mask non-zero exit codes when tests crash before writing failed_uuids.txt. Only suppress the exit code when the file exists (meaning the test framework ran to completion and we can retry). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
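Put together, the retry step described by this and the earlier failed_uuids commit looks roughly like the sketch below (illustrative: the exact mfc.sh flags, and in particular the flag used to re-run tests by UUID, are assumptions rather than the workflow's literal contents):

```bash
# Run the suite once, keeping its exit code instead of discarding it with "|| true".
TEST_EXIT=0
./mfc.sh test -v || TEST_EXIT=$?

if [ -f tests/failed_uuids.txt ] && [ -s tests/failed_uuids.txt ]; then
    NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
    if [ "$NUM_FAILED" -le 5 ]; then
        # A handful of failures: re-run only those tests.
        FAILED=$(tr '\n' ' ' < tests/failed_uuids.txt)
        ./mfc.sh test -v --only $FAILED   # "--only" is assumed; the real flag may differ
    else
        echo "Too many failures ($NUM_FAILED); treating as a real issue."
        exit 1
    fi
else
    # Either everything passed, or the framework crashed before writing the
    # file; in both cases propagate the original exit code.
    exit "$TEST_EXIT"
fi
```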
Replace squeue exit-code polling with get_job_state() that parses the actual state string (squeue + sacct fallback). Never give up on UNKNOWN state — CI timeout is the backstop. Cancel orphaned SLURM jobs on abnormal monitor exit. Include job state in heartbeats. Incorporates changes from PR MFlowCode#1140. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
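The orphan-cleanup part of this commit can be sketched with an EXIT trap along these lines (variable names are illustrative; monitor_success would be set by the polling loop on a clean COMPLETED exit):

```bash
monitor_success=0

cleanup() {
    # If the monitor dies abnormally (runner killed, unexpected error under
    # set -e), cancel the SLURM job so it doesn't keep running unattended.
    if [ "$monitor_success" -ne 1 ] && [ -n "${job_id:-}" ]; then
        echo "Monitor exiting abnormally; cancelling SLURM job $job_id"
        scancel "$job_id" 2>/dev/null || true
    fi
}
trap cleanup EXIT

# ... state-driven polling loop as sketched above ...
# On normal completion:
#   monitor_success=1
#   exit 0
```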
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name=".github/scripts/monitor_slurm_job.sh">
<violation number="1" location=".github/scripts/monitor_slurm_job.sh:40">
P2: Guard the `squeue` pipeline so transient command failures don't abort the script under `set -euo pipefail`; otherwise a temporary SLURM outage exits the monitor instead of returning "UNKNOWN".</violation>
</file>
Split Frontier GPU test configs into 2 shards (~75 min each) so they fit within the batch partition's 2h wall time limit. This allows all Frontier SLURM jobs to run concurrently instead of serially on the extended partition (which has a 1-job-per-user limit), reducing total CI wall clock from ~4.5h to ~2h.

Changes:
- Add --shard CLI argument (e.g., --shard 1/2) with modulo-based round-robin distribution across shards
- Switch Frontier submit scripts from extended to batch/hackathon (CFD154 account, 1h59m wall time)
- Shard the 3 Frontier GPU matrix entries into 6 (2 shards each)
- CPU entries remain unsharded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
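With that in place, a GPU config's two shards could be launched as, for example (invocation is illustrative; the matrix entries pass additional flags):

```bash
./mfc.sh test --shard 1/2   # first half of the round-robin split
./mfc.sh test --shard 2/2   # second half
```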
1 issue found across 7 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="toolchain/mfc/test/test.py">
<violation number="1" location="toolchain/mfc/test/test.py:103">
P2: Validate the `--shard` argument before using it; as written, invalid or `0` shard counts will raise exceptions (including ZeroDivisionError) during modulo operations.</violation>
</file>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move build retry logic from shell scripts to GHA using nick-fields/retry with 60s backoff between attempts. This gives better visibility into retries and lets login node memory pressure subside between attempts. Also reduce build parallelism from -j 8 to -j 4 to lower peak memory on shared Frontier login nodes, and remove the outdated Node 16 version overrides from self-hosted runner env. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without set -e, the benchmark build loop could silently ignore failures of earlier benchmarks if a later one succeeded, since only the last command's exit code would propagate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
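A minimal way to make such a loop fail fast, assuming the benchmarks are iterated over as case files (paths and mfc.sh flags are placeholders):

```bash
# Stop at the first failing benchmark instead of letting a later success
# mask an earlier failure (the script does not run under "set -e").
for case_file in benchmarks/*/case.py; do
    if ! ./mfc.sh run "$case_file"; then
        echo "Benchmark $case_file failed" >&2
        exit 1
    fi
done
```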
3 issues found across 4 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name=".github/workflows/frontier/build.sh">
<violation number="1" location=".github/workflows/frontier/build.sh:23">
P2: Benchmark loop no longer checks for errors, so a failing benchmark can be masked by later successful runs (script has no `set -e`). Add explicit failure handling to stop on the first failing benchmark.</violation>
</file>
<file name=".github/workflows/frontier_amd/build.sh">
<violation number="1" location=".github/workflows/frontier_amd/build.sh:23">
P2: Failures in benchmark runs can be silently ignored because the loop doesn't check each `./mfc.sh run` exit code (the script doesn't use `set -e`). A failing benchmark earlier in the loop can be masked by a later successful run, letting CI pass incorrectly. Consider exiting non-zero as soon as a benchmark run fails.</violation>
</file>
<file name=".github/workflows/bench.yml">
<violation number="1" location=".github/workflows/bench.yml:106">
P2: `nick-fields/retry@v3` requires `timeout_minutes` or `timeout_seconds`. Without one, the step will fail before running the build command. Add a timeout input to keep the retry step valid.</violation>
</file>
nick-fields/retry@v3 requires either timeout_minutes or timeout_seconds. Set to 480 minutes to match the GHA job timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a build is killed mid-way (e.g., SIGTERM), the build directory is left in a corrupted state. Subsequent retry attempts fail immediately because set -e exits on the first error from the dirty build tree. Running ./mfc.sh clean between retries restores a fresh build state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
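Sketched as a plain shell loop (the actual CI uses nick-fields/retry for this; the commands and attempt count here are illustrative):

```bash
ok=0
for attempt in 1 2 3; do
    if ./mfc.sh build; then ok=1; break; fi    # build flags omitted
    echo "Build attempt $attempt failed; cleaning corrupted build tree" >&2
    ./mfc.sh clean                             # restore a fresh build state
    sleep 60                                   # let login-node memory pressure subside
done
[ "$ok" -eq 1 ] || exit 1
```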
@@ -57,14 +111,13 @@
exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
tail_pid=$!
Suggestion: Use coproc to reliably capture the tail process PID. The current use of $! with process substitution is not guaranteed to work, which can lead to leaked processes. [possible issue, importance: 8]
New proposed code:
# Start tail and redirect its output to file descriptor 3 for multiplexing
# This allows us to stream tail output while also printing heartbeat messages
-exec 3< <(stdbuf -oL -eL tail -f "$output_file" 2>&1)
-tail_pid=$!
+coproc TAIL_PROC { stdbuf -oL -eL tail -f "$output_file" 2>&1; }
+exec 3<&"${TAIL_PROC[0]}"
+tail_pid="${TAIL_PROC_PID}"

if ARG("shard") is not None:
    parts = ARG("shard").split("/")
    if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]):
        raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').")
    shard_idx, shard_count = int(parts[0]), int(parts[1])
    skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1]
    cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
Suggestion: Sort test cases by UUID before sharding them. This ensures a deterministic distribution of tests across shards, preventing flaky CI behavior caused by inconsistent test ordering. [possible issue, importance: 9]
| if ARG("shard") is not None: | |
| parts = ARG("shard").split("/") | |
| if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]): | |
| raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').") | |
| shard_idx, shard_count = int(parts[0]), int(parts[1]) | |
| skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1] | |
| cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1] | |
| if ARG("shard") is not None: | |
| parts = ARG("shard").split("/") | |
| if len(parts) != 2 or not all(p.isdigit() for p in parts) or int(parts[1]) < 1 or not 1 <= int(parts[0]) <= int(parts[1]): | |
| raise MFCException(f"Invalid --shard '{ARG('shard')}': expected 'i/n' with 1 <= i <= n (e.g., '1/2').") | |
| shard_idx, shard_count = int(parts[0]), int(parts[1]) | |
| cases = sorted(cases, key=lambda c: c.get_uuid()) | |
| skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1] | |
| cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1] |
- name: Print Logs
  if: always()
Suggestion: Add the shard identifier to log filenames and artifact names in the CI workflow. This prevents sharded jobs from overwriting each other's logs, which is crucial for debugging. [possible issue, importance: 10]
New proposed code:
- name: Print Logs
if: always()
- run: cat test-${{ matrix.device }}-${{ matrix.interface }}.out
+ run: cat test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out
- name: Archive Logs
uses: actions/upload-artifact@v4
if: matrix.cluster != 'phoenix'
with:
- name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}
- path: test-${{ matrix.device }}-${{ matrix.interface }}.out
+ name: logs-${{ strategy.job-index }}-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}
+ path: test-${{ matrix.device }}-${{ matrix.interface }}${{ matrix.shard != '' && format('-{0}', matrix.shard) || '' }}.out

wait %1 and %2 require job control (set -m), which is disabled in non-interactive shells. nick-fields/retry runs commands in a non-interactive shell, causing 'no such job' (exit 127) even when both builds succeed. Use $! PIDs instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
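The fix amounts to capturing PIDs with $! and waiting on those instead of job specs, roughly (build commands are placeholders):

```bash
# %1 / %2 job specs need job control (set -m), which the non-interactive
# shell used by nick-fields/retry does not enable.
./mfc.sh build --gpu &        # placeholder for the GPU build
gpu_pid=$!
./mfc.sh build &              # placeholder for the CPU build
cpu_pid=$!

rc=0
wait "$gpu_pid" || rc=$?      # each wait returns that process's exit status
wait "$cpu_pid" || rc=$?
exit "$rc"
```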
User description
Summary
Systematic hardening of CI infrastructure for self-hosted runners (Phoenix, Frontier) on Lustre filesystems:
- SLURM monitor: get_job_state() (squeue primary, sacct fallback); adds orphan job cleanup via EXIT trap, terminal state detection, and informative heartbeat messages
- Fractional read -t 0.1 changed to integer read -t 1 to avoid process substitution FD corruption
- rm -rf before checkout with clean: false to avoid ESTALE/ENOTEMPTY errors from git clean on Lustre
- -X -P flags for parsable, job-level-only output
- || true added to squeue/sacct command substitutions
- shutil.rmtree retried up to 5 times with backoff in the Python toolchain
- pull_request_review doesn't cancel pull_request runs
- bench.yml reverted from workflow_run back to direct pull_request + pull_request_review triggers
- @page/@ref patterns

Test plan
🤖 Generated with Claude Code
CodeAnt-AI Description
Harden CI for self-hosted SLURM runners and reduce CI flakiness
What Changed
Impact
✅ Fewer orphaned SLURM jobs
✅ Fewer sporadic CI test failures due to automatic targeted retries
✅ Fewer Lustre cleanup/check-out failures