Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819) #7859

Open
tohtana wants to merge 2 commits into deepspeedai:master from tohtana:tohtana/fix-issue7819-zero3-cpuadam-bias-correction

@tohtana tohtana commented Feb 18, 2026

Fixes #7819

The root cause was non-idempotent CPUAdam step-state handling under ZeRO-3 subgroup updates: repeated calls at the same logical step could take different internal paths and produce slightly different bias-correction metadata.

The fix makes same-step calls a no-op while preserving fast sequential updates, and adds regression tests covering both step_subgroup() and step() subgroup-style paths.
Validated with focused CPUAdam tests.

Make Adam_Optimizer::IncrementStep idempotent for repeated calls at the
same logical step. ZeRO-3/SuperOffload can invoke multiple subgroup updates
for one step on a shared native optimizer object; the previous logic mixed
multiply and recompute paths, producing non-bit-identical bias-correction
metadata between subgroup calls.

This updates both CPU and XPU headers with aligned step-transition logic and
clarifies first-step/non-sequential-step behavior. It also adds CPUAdam
regression tests for subgroup-style repeated same-step updates via both
step_subgroup() and step() param swapping.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
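
The step-transition behaviour described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class `StepState`, its fields, and the method names are hypothetical, and it is not the code in the CPU or XPU headers. It shows the three cases the description mentions: a repeated call at the same logical step is a no-op, a sequential step uses the cheap running product, and anything else recomputes the powers.

```python
import math


class StepState:
    """Toy model of cached bias-correction powers (hypothetical names, not DeepSpeed code)."""

    def __init__(self, beta1=0.9, beta2=0.999):
        self.step = 0
        self.beta1, self.beta2 = beta1, beta2
        self.beta1_t, self.beta2_t = 1.0, 1.0  # beta ** 0

    def _recompute(self, step):
        # Recompute path: powers are derived directly from the step number.
        self.beta1_t = math.pow(self.beta1, step)
        self.beta2_t = math.pow(self.beta2, step)

    def increment_step(self, step, beta1, beta2):
        if beta1 != self.beta1 or beta2 != self.beta2:
            # Assumption: changed betas are adopted and the powers recomputed.
            self.beta1, self.beta2 = beta1, beta2
            self._recompute(step)
        elif step == self.step:
            # Same logical step (e.g. another subgroup): leave the cache untouched.
            return
        elif step == self.step + 1:
            # Sequential step: fast multiply path.
            self.beta1_t *= self.beta1
            self.beta2_t *= self.beta2
        else:
            # Non-sequential step: fall back to recomputing.
            self._recompute(step)
        self.step = step

    def snapshot(self):
        return (self.step, self.beta1_t, self.beta2_t)


state = StepState()
for step in (1, 2, 3):
    state.increment_step(step, 0.9, 0.999)
first = state.snapshot()

state.increment_step(3, 0.9, 0.999)  # second subgroup update at the same logical step
second = state.snapshot()

assert first == second  # bit-identical metadata for both subgroup calls
```

Because the same-step branch returns before touching any cached value, every subgroup that runs at step 3 in this model sees exactly the same powers, regardless of how many times the shared object is called.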

@PKUWZP PKUWZP left a comment

@tohtana Very well-written fix. A few optional follow-ups to consider:

  • Adding a test with non-sequential steps (e.g., jump from step 2 to step 5) to validate the pow fallback path after the refactor
  • Adding a test with beta changes mid-training to exercise the if-branch + fallthrough path
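
As a rough illustration of what these two suggested tests might check, here is a self-contained Python sketch. It runs against a hypothetical stand-in (`increment_step` operating on a plain dict), not against the native optimizer or the existing CPUAdam test suite, and it assumes that a beta change simply triggers a recompute of the cached powers.

```python
import math


def fresh(beta1=0.9, beta2=0.999):
    # Hypothetical initial state: step 0, cached powers equal to beta ** 0.
    return {"step": 0, "beta1": beta1, "beta2": beta2, "beta1_t": 1.0, "beta2_t": 1.0}


def increment_step(state, step, beta1, beta2):
    """Hypothetical stand-in for the native step transition; not DeepSpeed code."""
    betas_changed = (beta1, beta2) != (state["beta1"], state["beta2"])
    if not betas_changed and step == state["step"]:
        return state  # same-step call: no-op
    if not betas_changed and step == state["step"] + 1:
        b1_t, b2_t = state["beta1_t"] * beta1, state["beta2_t"] * beta2  # multiply path
    else:
        b1_t, b2_t = math.pow(beta1, step), math.pow(beta2, step)  # pow fallback
    return {"step": step, "beta1": beta1, "beta2": beta2, "beta1_t": b1_t, "beta2_t": b2_t}


def test_step_jump_uses_pow_fallback():
    s = fresh()
    for step in (1, 2):
        s = increment_step(s, step, 0.9, 0.999)
    s = increment_step(s, 5, 0.9, 0.999)  # jump from step 2 to step 5
    assert s["step"] == 5
    assert s["beta1_t"] == math.pow(0.9, 5)
    assert s["beta2_t"] == math.pow(0.999, 5)


def test_beta_change_mid_training():
    s = fresh()
    for step in (1, 2, 3):
        s = increment_step(s, step, 0.9, 0.999)
    s = increment_step(s, 4, 0.8, 0.99)  # betas change at step 4
    assert (s["beta1"], s["beta2"]) == (0.8, 0.99)
    assert s["beta1_t"] == math.pow(0.8, 4)
    # A repeated call at the same step with the new betas must change nothing.
    assert increment_step(s, 4, 0.8, 0.99) == s


test_step_jump_uses_pow_fallback()
test_beta_change_mid_training()
```

A real regression test would drive the optimizer itself and compare the resulting parameters or state tensors; the dict-based model here only captures the intended invariants.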

Development

Successfully merging this pull request may close these issues.

[BUG] DeepSpeed ZeRO Stage-3 + CPU offloaded optimizer (CPUAdam) inconsistency metadata between subgroup
