
Fix ROCm BF16 conversion intrinsics in inference v2 (#7843) #7846

Open

tohtana wants to merge 1 commit into deepspeedai:master from tohtana:tohtana/fix-issue7843-mi300a-bf16-intrinsics

Conversation

Collaborator

@tohtana tohtana commented Feb 12, 2026

Fixes #7843

On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in the code are not provided, e.g.:

  • __ll2bfloat16_rn
  • __int2bfloat16_rn
  • __short2bfloat16_rn
  • __bfloat162uint_rn

This causes compilation errors on HIP platforms.

This PR introduces fallback paths that use functions available on the HIP platform, mirroring the conversion utilities in csrc (see the sketch after this list). The conversion paths are:

  • int/uint -> bf16: convert to float (or to double for 64-bit values), then to bf16.
  • bf16 -> int/uint: convert the bf16 to float, then to the integer type.
  • float -> bf16: build the bf16 via supported HIP helpers (__float2bfloat16).
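
A minimal sketch of what these fallbacks can look like. The helper names to_bf16/to_uint32 are hypothetical stand-ins for the PR's actual inference v2 conversion helpers, and DS_D_INLINE is assumed to expand to __device__ __forceinline__ as elsewhere in DeepSpeed:

#include <cstdint>
#ifdef __HIP_PLATFORM_AMD__
#include <hip/hip_bf16.h>
using bf16_t = __hip_bfloat16;
#else
#include <cuda_bf16.h>
using bf16_t = __nv_bfloat16;
#endif

#define DS_D_INLINE __device__ __forceinline__

// int64 -> bf16: widen through double so more integer bits survive
// until the narrowing conversions (hypothetical helper name).
DS_D_INLINE bf16_t to_bf16(int64_t val)
{
#ifdef __HIP_PLATFORM_AMD__
    return __float2bfloat16(static_cast<float>(static_cast<double>(val)));
#else
    return __ll2bfloat16_rn(val);
#endif
}

// bf16 -> uint32: go through float using helpers HIP does provide
// (hypothetical helper name).
DS_D_INLINE uint32_t to_uint32(bf16_t val)
{
#ifdef __HIP_PLATFORM_AMD__
    return __float2uint_rn(__bfloat162float(val));
#else
    return __bfloat162uint_rn(val);
#endif
}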

Add __HIP_PLATFORM_AMD__ fallbacks for BF16 conversions in
inference v2 conversion helpers to avoid unsupported HIP intrinsics
on ROCm toolchains (e.g. __ll2bfloat16_rn, __int2bfloat16_rn,
__bfloat162uint_rn). The fallback logic mirrors csrc conversion
helpers to keep CUDA behavior unchanged while enabling ROCm builds.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana requested a review from hwchen2017 as a code owner February 12, 2026 02:49

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8a8f2be62

DS_D_INLINE __nv_bfloat16 to(int32_t val)
{
#ifdef __HIP_PLATFORM_AMD__
    return __float2bfloat16(__int2float_rn(val));
#else
    return __int2bfloat16_rn(val);
#endif
}

P2: Use exact intermediate for int32/uint32 BF16 casts on HIP

On AMD HIP, this new path converts 32-bit integers through float (__int2float_rn/__uint2float_rn) before __float2bfloat16, which can double-round for |val| > 2^24 and pick a different bf16 than a direct integer-to-bf16 rounding (the CUDA branch uses __int2bfloat16_rn/__uint2bfloat16_rn). Large integer inputs can therefore produce numerically different bf16 values on ROCm; converting via double for 32-bit ints would keep the conversion exact before the final bf16 rounding.
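
For concreteness, a hedged sketch of the variant this suggestion describes, using only conversions the PR already relies on (int32 -> double is exact, and float's 24-bit significand is wide enough relative to bf16's 8 bits that the remaining two-step narrowing agrees with a single rounding in round-to-nearest):

DS_D_INLINE __nv_bfloat16 to(int32_t val)
{
#ifdef __HIP_PLATFORM_AMD__
    // Exact widening to double, then narrow; only the final
    // conversions round, matching a direct int -> bf16 rounding.
    return __float2bfloat16(static_cast<float>(static_cast<double>(val)));
#else
    return __int2bfloat16_rn(val);
#endif
}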

@tohtana (Collaborator, Author) replied:

This conversion is aligned with the existing code. If we want to change the behavior, that should be a separate PR.

Collaborator

@PKUWZP PKUWZP left a comment

LGTM.

Development

Successfully merging this pull request may close these issues:

Problems while Deepspeed building from scratch on AMD MI300a APU (#7843)