feat: add training stability guards and fix transformers 5.2.0 compat… by Jingyuan-zhu · Pull Request #964 · OptimalScale/LMFlow

Jingyuan-zhu · 2026-02-26T04:55:49Z

Description

This PR adds safeguards against common training crashes, including a hardware-aware precision fallback, and improves compatibility with newer library versions.

Improvements:

- BF16 Auto-Fallback: Falls back to fp16 when bf16 is requested on hardware that does not support it (pre-Ampere GPUs), preventing cryptic CUDA errors.
- Gradient Checkpointing Guard: Disables use_cache when gradient checkpointing is active, since the two are known to conflict during training.
- Vocabulary Mismatch Guard: Resizes the model embeddings when the tokenizer's vocabulary is larger than the model's embedding table (essential when using custom chat templates).
- Library Compatibility:
  - Fixed an AttributeError for overwrite_output_dir that appears with transformers >= 5.2.0.
  - Added pytest.importorskip for sglang so that test collection does not crash in environments without sglang.
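
The decision logic behind the three training guards above can be sketched in plain Python. This is an illustration under assumptions, not the code merged in this PR: `TrainFlags`, `apply_guards`, and `needs_embedding_resize` are hypothetical names, and hardware support is passed in as a boolean so the sketch runs without a GPU.

```python
from dataclasses import dataclass, replace


@dataclass
class TrainFlags:
    """Hypothetical stand-in for the relevant training arguments."""
    bf16: bool = False
    fp16: bool = False
    gradient_checkpointing: bool = False
    use_cache: bool = True


def apply_guards(flags: TrainFlags, bf16_supported: bool) -> TrainFlags:
    """Return a copy of `flags` with the precision and cache guards applied."""
    out = replace(flags)  # shallow copy; the caller's object is left untouched

    # BF16 fallback: bf16 was requested but the hardware cannot run it.
    if out.bf16 and not bf16_supported:
        out.bf16 = False
        out.fp16 = True

    # Gradient checkpointing guard: the KV cache is turned off while
    # gradient checkpointing is active.
    if out.gradient_checkpointing and out.use_cache:
        out.use_cache = False

    return out


def needs_embedding_resize(tokenizer_len: int, embedding_rows: int) -> bool:
    """True when the tokenizer has more entries than the embedding table."""
    return tokenizer_len > embedding_rows
```

In a real training script, `bf16_supported` would presumably come from `torch.cuda.is_bf16_supported()`, and a positive `needs_embedding_resize` result would be followed by `model.resize_token_embeddings(len(tokenizer))`. Where exactly LMFlow performs these checks is not stated in the description.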

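One common way to avoid an AttributeError of the kind described under Library Compatibility is to read the attribute defensively. The sketch below assumes the attribute may simply be absent on the arguments object; the helper name `get_overwrite_output_dir` and the default of `False` are hypothetical, and the actual fix in this PR may differ.

```python
from types import SimpleNamespace


def get_overwrite_output_dir(training_args, default=False):
    """Read overwrite_output_dir if present, otherwise return `default`."""
    # getattr with a default never raises AttributeError for a missing field.
    return getattr(training_args, "overwrite_output_dir", default)


# Stand-ins for argument objects with and without the attribute.
with_attr = SimpleNamespace(output_dir="out", overwrite_output_dir=True)
without_attr = SimpleNamespace(output_dir="out")

print(get_overwrite_output_dir(with_attr))
print(get_overwrite_output_dir(without_attr))
```

The sglang guard mentioned in the same list is usually a single module-level call, `pytest.importorskip("sglang")`, which skips the test module when the import fails.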

research4pan

LGTM

feat: add training stability guards and fix transformers 5.2.0 compat…

d68f2e8

research4pan approved these changes Mar 23, 2026

research4pan merged commit c2febf1 into OptimalScale:main Mar 23, 2026
0 of 2 checks passed

research4pan merged 1 commit into OptimalScale:main from Jingyuan-zhu:fix/training-safeguards