Clamp logit_scale to prevent numerical instability#530

Open
Mr-Neutr0n wants to merge 1 commit into openai:main from Mr-Neutr0n:fix/clamp-logit-scale

Conversation

@Mr-Neutr0n

Summary

  • Adds torch.clamp(logit_scale, max=100) after exponentiation in CLIP.forward() to prevent the learned temperature parameter from causing numerical overflow during training/fine-tuning.

Problem

The logit_scale parameter is learned during training with no upper bound enforced. As training progresses, self.logit_scale can grow large enough that self.logit_scale.exp() overflows, producing Inf/NaN in the cosine similarity logits. This causes the contrastive loss to become NaN and training to diverge entirely.

This is a known failure mode when fine-tuning CLIP, and the fix is consistent with:

  • The original CLIP training procedure described in the paper (Section 2.4: "clipped to prevent scaling the logits by more than 100")
  • OpenCLIP's implementation (open_clip/model.py)
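The overflow described above can be reproduced directly: `exp()` overflows float32 once its argument exceeds roughly 88.7, so a runaway `logit_scale` turns every logit into `inf`. A minimal demonstration (the value 90.0 is a hypothetical runaway parameter, not one taken from a real training run):

```python
import torch

# exp() overflows float32 once the raw parameter exceeds ~88.7,
# turning every downstream logit into inf/nan.
logit_scale = torch.nn.Parameter(torch.tensor(90.0))  # hypothetical runaway value
scale = logit_scale.exp()
print(torch.isinf(scale).item())  # True

# Clamping after exponentiation keeps the scale finite:
safe_scale = torch.clamp(logit_scale.exp(), max=100)
print(safe_scale.item())  # 100.0
```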

Fix

One-line addition in clip/model.py:

```python
logit_scale = self.logit_scale.exp()
logit_scale = torch.clamp(logit_scale, max=100)  # added
```

The max value of 100 corresponds to a minimum temperature of 0.01, which produces an extremely sharp softmax distribution and is well beyond any practical operating point.
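For context, a minimal sketch of how the clamp sits in the similarity computation. This is an illustrative stand-alone function, not the actual `CLIP.forward()` body; `image_features`, `text_features`, and `logit_scale_param` stand in for the encoder outputs and the learned parameter:

```python
import torch
import torch.nn.functional as F

def similarity_logits(image_features, text_features, logit_scale_param):
    # L2-normalize so the matrix product below yields cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    logit_scale = logit_scale_param.exp()
    logit_scale = torch.clamp(logit_scale, max=100)  # the proposed fix

    # Scaled cosine-similarity logits; stays finite even if the raw
    # parameter has grown large enough that exp() alone would overflow.
    logits_per_image = logit_scale * image_features @ text_features.t()
    return logits_per_image, logits_per_image.t()
```

Even with a runaway parameter (e.g. 90.0), the clamp caps the scale at 100 and the returned logits remain finite.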

Test plan

  • Verified the change is a single-line addition with no side effects on inference
  • Consistent with the CLIP paper's described training procedure
  • Matches the approach used in OpenCLIP and other widely-used reimplementations
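The "no side effects on inference" claim can be checked numerically: at the initialization value ln(1/0.07), the exponentiated scale is about 14.29, far below 100, so the clamp is a no-op for any normally-trained checkpoint:

```python
import math

import torch

# At the initialization ln(1/0.07), exp() recovers 1/0.07 ≈ 14.29 < 100,
# so clamping at 100 leaves inference behavior unchanged.
init = torch.tensor(math.log(1 / 0.07))
scale = init.exp()
clamped = torch.clamp(scale, max=100)
print(torch.equal(scale, clamped))  # True
```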

The logit_scale parameter (initialized to ln(1/0.07) ≈ 2.66) can grow
unbounded during training since there is no upper bound enforced on it.
When logit_scale becomes too large, the exponentiated value overflows
and produces NaN/Inf in the similarity logits, causing training to
diverge.

This adds torch.clamp(logit_scale, max=100) after exponentiation,
consistent with the original CLIP training procedure and other
reference implementations (e.g., OpenCLIP). The cap of 100 corresponds
to a temperature of 0.01, which is already an extremely sharp
distribution and well beyond any practical operating point.
