feat: add training stability guards and fix transformers 5.2.0 compat… #964

Merged
research4pan merged 1 commit into OptimalScale:main from Jingyuan-zhu:fix/training-safeguards
Mar 23, 2026

@Jingyuan-zhu
Contributor

Description

This PR introduces several hardware-aware safeguards to prevent common crashes during training and improves compatibility with newer library versions.

Improvements:

  • BF16 Auto-Fallback: Automatically falls back to fp16 if bf16 is requested on unsupported hardware (pre-Ampere GPUs), preventing cryptic CUDA errors.

  • Gradient Checkpointing Guard: Automatically disables use_cache when gradient checkpointing is active, since the key/value cache is incompatible with gradient checkpointing during training.

  • Vocabulary Mismatch Guard: Automatically resizes the model's embeddings if the tokenizer's vocabulary size exceeds the number of rows in the model's embedding matrix (essential when using custom chat templates).

  • Library Compatibility:
    Fixed the AttributeError raised for overwrite_output_dir with transformers >= 5.2.0.
    Added pytest.importorskip for sglang to prevent test collection crashes in environments without sglang.
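
For illustration, the guards listed above could look roughly like the sketch below. This is not the code merged in this PR: the names `PrecisionArgs`, `resolve_precision`, `guard_use_cache`, `guard_vocab_size`, and `get_overwrite_output_dir` are hypothetical, and the GPU's compute capability is passed in as a plain integer so the logic stays independent of any CUDA runtime (a caller would typically take the major version from `torch.cuda.get_device_capability()`). The model and tokenizer are only assumed to expose the usual transformers methods `get_input_embeddings()`, `resize_token_embeddings()`, and `len(tokenizer)`.

```python
from dataclasses import dataclass, replace

# Ampere GPUs report compute capability 8.0; earlier generations lack native bf16.
AMPERE_MAJOR = 8


@dataclass(frozen=True)
class PrecisionArgs:
    """Hypothetical stand-in for the bf16/fp16 flags of a training config."""
    bf16: bool = False
    fp16: bool = False


def resolve_precision(args: PrecisionArgs, cc_major: int) -> PrecisionArgs:
    """Fall back from bf16 to fp16 when the GPU predates Ampere."""
    if args.bf16 and cc_major < AMPERE_MAJOR:
        return replace(args, bf16=False, fp16=True)
    return args


def guard_use_cache(model_config, gradient_checkpointing: bool) -> None:
    """Turn off the key/value cache when gradient checkpointing is enabled."""
    if gradient_checkpointing and getattr(model_config, "use_cache", False):
        model_config.use_cache = False


def guard_vocab_size(model, tokenizer) -> bool:
    """Grow the embedding matrix if the tokenizer has more tokens than rows.

    Returns True when a resize was performed.
    """
    embedding_rows = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_rows:
        model.resize_token_embeddings(len(tokenizer))
        return True
    return False


def get_overwrite_output_dir(training_args) -> bool:
    """Read overwrite_output_dir without assuming the attribute exists."""
    return getattr(training_args, "overwrite_output_dir", False)
```

Each guard only acts when its condition holds, so calling them unconditionally before training starts leaves already-valid configurations untouched.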

Contributor

@research4pan left a comment

LGTM

@research4pan merged commit c2febf1 into OptimalScale:main on Mar 23, 2026
0 of 2 checks passed

2 participants