Fix ROCm BF16 conversion intrinsics in inference v2 (#7843)#7846
tohtana wants to merge 1 commit into deepspeedai:master
Conversation
Add `__HIP_PLATFORM_AMD__` fallbacks for BF16 conversions in the inference v2 conversion helpers to avoid HIP intrinsics that are unsupported on ROCm toolchains (e.g. `__ll2bfloat16_rn`, `__int2bfloat16_rn`, `__bfloat162uint_rn`). The fallback logic mirrors the csrc conversion helpers, keeping CUDA behavior unchanged while enabling ROCm builds.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8a8f2be62
```cpp
DS_D_INLINE __nv_bfloat16 to(int32_t val)
{
#ifdef __HIP_PLATFORM_AMD__
    return __float2bfloat16(__int2float_rn(val));
```
Use exact intermediate for int32/uint32 BF16 casts on HIP
On AMD HIP, this new path converts 32-bit integers through float (__int2float_rn/__uint2float_rn) before __float2bfloat16, which can double-round for |val| > 2^24 and pick a different bf16 than a direct integer→bf16 round (the CUDA branch uses __int2bfloat16_rn/__uint2bfloat16_rn). This means large integer inputs can now produce numerically different bf16 values on ROCm; converting via double for 32-bit ints would keep the conversion exact before the final bf16 rounding.
This conversion is aligned with the existing code. If we want to change the behavior, that should be a separate PR.
Fixes #7843
On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in the code are not provided, e.g.:
- `__ll2bfloat16_rn`
- `__int2bfloat16_rn`
- `__short2bfloat16_rn`
- `__bfloat162uint_rn`

This causes compilation errors on HIP platforms.
This PR introduces fallback paths using functions available on the HIP platform, mirroring the conversion utilities in csrc. The conversion paths are: