Fix ZeRO legacy grad-hook crash when next_functions is missing #7857

Draft
harshang03 wants to merge 1 commit into deepspeedai:master from harshang03:fix/7830-grad-hook-next-functions

@harshang03

Summary

  • harden legacy (torch<2.1) grad-hook registration when checkpointed graphs do not expose param.expand_as(...).grad_fn.next_functions
  • fall back to param.register_hook safely instead of crashing with "'NoneType' object has no attribute 'next_functions'"
  • wire hook-handle registration defensively in ZeRO-1/2, ZeRO-3, and BF16 optimizer paths and add focused regression tests
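The fallback described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `register_grad_hook` and its `(handle, grad_acc)` return shape are assumptions made for the example. It relies only on `Tensor.expand_as`, `Tensor.register_hook`, and the `grad_fn.next_functions` attribute of autograd nodes.

```python
def register_grad_hook(param, hook):
    """Attach `hook` to `param` for the backward pass (hypothetical helper).

    Prefers the legacy gradient-accumulator node reached through
    param.expand_as(param).grad_fn.next_functions[0][0]. If any link in
    that chain is missing, falls back to param.register_hook.

    Returns (handle, grad_acc). grad_acc is None when the fallback is used;
    otherwise the caller should keep a reference to it so the node is not
    garbage collected.
    """
    # Each step may be absent on checkpointed graphs, so probe defensively.
    grad_fn = getattr(param.expand_as(param), "grad_fn", None)
    next_functions = getattr(grad_fn, "next_functions", None)

    if next_functions:  # non-empty sequence of (node, index) pairs
        grad_acc = next_functions[0][0]
        if grad_acc is not None:
            return grad_acc.register_hook(hook), grad_acc

    # Fallback: a plain tensor hook on the parameter itself.
    return param.register_hook(hook), None
```

One design note: the two hook kinds are called with different arguments (a node hook receives gradient inputs and outputs, a tensor hook receives the gradient), so a hook passed to a helper like this should accept `*args` and return `None` so that the tensor hook does not replace the gradient.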

Testing

  • python3 -m pytest tests/unit/runtime/test_register_grad_hook.py (fails in this environment: missing dependency torch)

Fixes #7830

Fix ZeRO/BF16 hook setup for checkpointed graphs where expand_as no longer exposes next_functions by falling back to parameter hooks, and add regression tests for both legacy accumulator and fallback paths.

Development

Successfully merging this pull request may close these issues.

[BUG] AttributeError in ZeRO-3 with gradient_checkpointing: 'NoneType' object has no attribute 'next_functions' (DeepSpeed 0.18.5)