Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819) #7859
Open
tohtana wants to merge 2 commits into deepspeedai:master
Make Adam_Optimizer::IncrementStep idempotent for repeated calls at the same logical step.

ZeRO-3/SuperOffload can invoke multiple subgroup updates for one step on a shared native optimizer object; the previous logic mixed multiply and recompute paths, producing non-bit-identical bias-correction metadata between subgroup calls.

This updates both CPU and XPU headers with aligned step-transition logic and clarifies first-step/non-sequential-step behavior. It also adds CPUAdam regression tests for subgroup-style repeated same-step updates via both step_subgroup() and step() param swapping.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
PKUWZP (Collaborator) approved these changes on Feb 18, 2026 and left a comment:
@tohtana very well-written fix. A few optional follow-ups are suggested here:
- Adding a test with non-sequential steps (e.g., jump from step 2 to step 5) to validate the pow fallback path after the refactor
- Adding a test with beta changes mid-training to exercise the if-branch + fallthrough path
Fixes #7819
The root cause was non-idempotent CPUAdam step-state handling under ZeRO-3 subgroup updates: repeated calls at the same logical step could take different internal paths and produce slightly different bias-correction metadata.
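To make the root cause concrete, here is a minimal Python sketch of one way a step tracker can end up taking different paths for repeated calls at the same step. It is a hypothetical model written for illustration, not the actual C++ header code: the class name `LegacyStepState`, its fields, and the `paths` trace are all assumptions. The model increments an internal counter on every call, so the first subgroup call of a step takes the multiply path and the next one, finding the counter ahead of the requested step, falls back to recomputing the power.

```python
class LegacyStepState:
    """Hypothetical model of a non-idempotent step tracker (illustration only)."""

    def __init__(self, beta1=0.9, beta2=0.999):
        self.step = 0
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_t = 1.0  # running value of beta1 ** step
        self.beta2_t = 1.0  # running value of beta2 ** step
        self.paths = []     # records which branch each call took

    def increment_step(self, step, beta1, beta2):
        if beta1 != self.beta1 or beta2 != self.beta2:
            # Betas changed: recompute everything from scratch.
            self.step, self.beta1, self.beta2 = step, beta1, beta2
            self.beta1_t, self.beta2_t = beta1 ** step, beta2 ** step
            self.paths.append("recompute")
            return
        self.step += 1
        if self.step != step:
            # The counter ran ahead of the requested step: fall back to pow.
            self.beta1_t, self.beta2_t = beta1 ** step, beta2 ** step
            self.step = step
            self.paths.append("recompute")
        else:
            # Sequential step: cheap running multiply.
            self.beta1_t *= beta1
            self.beta2_t *= beta2
            self.paths.append("multiply")


# Two subgroup calls per logical step, as a ZeRO-3-style caller might issue.
state = LegacyStepState()
for step in (1, 2, 3):
    for _subgroup in range(2):
        state.increment_step(step, 0.9, 0.999)
print(state.paths)
```

In this model the two subgroups of every step alternate between the multiply and recompute branches. Whether the resulting values actually differ in their last bits depends on the floating-point type and the betas involved; the sketch only shows that the paths are not the same.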
The fix makes same-step calls a no-op while preserving fast sequential updates, and adds regression tests covering both the step_subgroup() and step() subgroup-style paths.

Validated with focused CPUAdam tests.
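The intended behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the code in the CPU or XPU headers: `StepState` and `increment_step` are hypothetical names, Python floats stand in for the native float type, and the handling of a beta change (recompute and return) is an assumption about one reasonable design.

```python
class StepState:
    """Hypothetical model of idempotent bias-correction bookkeeping."""

    def __init__(self, beta1=0.9, beta2=0.999):
        self.step = 0
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_t = 1.0  # beta1 ** step
        self.beta2_t = 1.0  # beta2 ** step

    def increment_step(self, step, beta1, beta2):
        if beta1 != self.beta1 or beta2 != self.beta2:
            # Betas changed (assumed handling): recompute from scratch.
            self.beta1, self.beta2 = beta1, beta2
            self.beta1_t, self.beta2_t = beta1 ** step, beta2 ** step
            self.step = step
            return
        if step == self.step:
            # Same logical step (e.g. a second subgroup): no-op.
            return
        if step == self.step + 1:
            # Fast sequential path: one multiply per beta.
            self.beta1_t *= beta1
            self.beta2_t *= beta2
        else:
            # Non-sequential step: fall back to recomputing the power.
            self.beta1_t, self.beta2_t = beta1 ** step, beta2 ** step
        self.step = step


state = StepState()
state.increment_step(1, 0.9, 0.999)
first = (state.beta1_t, state.beta2_t)
state.increment_step(1, 0.9, 0.999)  # second subgroup, same step
assert (state.beta1_t, state.beta2_t) == first  # bit-identical metadata
```

Because the same-step check comes before any mutation, every subgroup call within one step observes exactly the values the first call produced, while consecutive steps still use the cheap multiply.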