[clamp] Fix float16 scalar overflow check inconsistency between CPU and GPU #185756
brijrajk wants to merge 1 commit into
@pytorchbot label "ciflow/trunk"
@pytorchbot label "topic: not user facing"
…in meta function

Scalar-bound validation in torch.clamp/torch.clip on float16 tensors was inconsistent between CPU and GPU. CPU kernels convert the scalar bound via .to<scalar_t>(), which calls c10::check_overflow and raises RuntimeError for out-of-range values. CUDA kernels first promote the scalar to opmath_t (float for float16), so a value like 65507 fits in float without overflow, silently bypassing the check and producing incorrect results (the bound saturates to inf when stored back as float16).

The fix adds an explicit overflow check in TORCH_META_FUNC(clamp) for isReducedFloatingType dtypes (float16, bfloat16, float8 variants). The meta function runs before kernel dispatch on all devices, so the check fires consistently whether the tensor is on CPU, CUDA, or ROCm. The existing c10::Scalar::to<T>() mechanism is reused so the error message matches the one already produced by CPU kernels.

Fixes pytorch#171356

Test Plan:
```
cd /tmp
source /path/to/.venv-src/bin/activate
python3 test/test_shape_ops.py \
  TestShapeOpsCPU.test_clamp_float16_scalar_overflow_cpu \
  TestShapeOpsCUDA.test_clamp_float16_scalar_overflow_cuda \
  -v
```
All tests pass. Verified on AMD Radeon AI PRO R9700 (gfx1201, ROCm 7.0). bfloat16 with max=65507 correctly does not raise (65507 is within the finite range of bf16).

Authored by Claude.
Rebased on latest main. @jbschlosser, you've recently touched
8d40819 to 15d8663
Summary
torch.clamp / torch.clip on float16 tensors had inconsistent validation between CPU and GPU. When a scalar bound exceeds the float16 range (~65504), CPU correctly raises RuntimeError but GPU silently succeeded and returned incorrect results: the out-of-range bound saturates to inf when stored back as float16, giving wrong clamp behavior with no warning.
Root Cause
CPU kernels (cpu/TensorCompareKernel.cpp) convert the scalar bound via .to<scalar_t>() where scalar_t = at::Half, which calls c10::check_overflow and raises. CUDA kernels (cuda/TensorCompare.cu) first promote the scalar to opmath_t (float for float16) before using it, so a value like 65507.0 fits in float without triggering any overflow check, and the error is silently bypassed.
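The difference between the two paths can be sketched in plain Python. This is a simplified model under stated assumptions, not PyTorch source: `checked_convert`, `cpu_style_bound`, and `cuda_style_bound` are hypothetical names, the error text is illustrative, and infinities and NaN are assumed to pass the range check.

```python
import math

# Simplified model of the two scalar-conversion paths; not PyTorch source.
FLOAT16_MAX = (2 - 2**-10) * 2.0**15     # 65504.0, largest finite float16
FLOAT32_MAX = (2 - 2**-23) * 2.0**127    # largest finite float32


def checked_convert(value: float, max_finite: float, type_name: str) -> float:
    """Stand-in for a range-checked scalar conversion (hypothetical helper)."""
    # Assumption: only finite values are range-checked; inf and NaN pass.
    if math.isfinite(value) and abs(value) > max_finite:
        raise RuntimeError(
            f"value cannot be converted to type {type_name} without overflow"
        )
    return value


def cpu_style_bound(value: float) -> float:
    # CPU-style path: convert straight to the tensor dtype,
    # so the float16 limit is the one that gets checked.
    return checked_convert(value, FLOAT16_MAX, "float16")


def cuda_style_bound(value: float) -> float:
    # CUDA-style path: convert to the wider compute type first.
    # Only the float32 limit is checked; the later narrowing to
    # float16 is not checked at all in this model.
    return checked_convert(value, FLOAT32_MAX, "float32")
```

In this model, `cpu_style_bound(65507.0)` raises `RuntimeError`, while `cuda_style_bound(65507.0)` returns the value unchanged, which mirrors the silent bypass described above.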
Fix
Add the overflow check in TORCH_META_FUNC(clamp) in TensorCompare.cpp. The meta function runs before kernel dispatch on all devices, making the check device-agnostic in a single change. The existing c10::Scalar::to<T>() mechanism is reused so the error message is identical to what CPU already produced; no new error strings are introduced.
isReducedFloatingType + AT_DISPATCH_REDUCED_FLOATING_TYPES ensures the check covers float16, bfloat16, and all float8 variants. bfloat16 with max=65507 correctly does not raise since 65507 is within the finite range of bfloat16 (same exponent range as float32).
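The idea of validating once, before any device-specific kernel runs, might look roughly like the following Python model. It is a sketch only: `REDUCED_FLOAT_MAX`, `check_clamp_bound`, and `clamp_meta` are hypothetical names, dtypes and devices are plain strings, the float8 entries cover just two variants as an illustration, and the real change lives in C++ inside TORCH_META_FUNC(clamp).

```python
import math
from typing import Optional

# Largest finite value for each reduced floating dtype in this model.
REDUCED_FLOAT_MAX = {
    "float16": (2 - 2**-10) * 2.0**15,     # 65504.0
    "bfloat16": (2 - 2**-7) * 2.0**127,    # about 3.39e38
    "float8_e5m2": (2 - 2**-2) * 2.0**15,  # 57344.0
    "float8_e4m3fn": 448.0,
}


def check_clamp_bound(dtype: str, bound: Optional[float]) -> None:
    """Raise if a finite scalar bound is outside the dtype's finite range."""
    limit = REDUCED_FLOAT_MAX.get(dtype)
    if limit is None or bound is None:
        return  # not a reduced floating dtype, or no bound was given
    # Assumption: inf and NaN bounds are not treated as overflow.
    if math.isfinite(bound) and abs(bound) > limit:
        raise RuntimeError(
            f"value cannot be converted to type {dtype} without overflow"
        )


def clamp_meta(dtype: str, device: str,
               min: Optional[float] = None,
               max: Optional[float] = None) -> str:
    """Hypothetical validation step that runs before device dispatch."""
    check_clamp_bound(dtype, min)
    check_clamp_bound(dtype, max)
    # Only after validation would a device-specific kernel be chosen.
    return f"dispatch:{device}"
```

Because the check sits ahead of dispatch, `clamp_meta("float16", device, max=65507.0)` raises for every `device` string, while `clamp_meta("bfloat16", device, max=65507.0)` passes, matching the behavior described for bfloat16.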
Prior Attempts and Related Issues
custom c10::overflows check. Left XPU, ROCm, and any future backends unpatched, and produced mangled C++ type names in the error message.
backend, confirming that a CUDA-only fix is insufficient. Our meta-function approach fixes XPU automatically.
Fixes #171356
Checklist
lintrunner aten/src/ATen/native/TensorCompare.cpp test/test_shape_ops.py)
BC-breaking?
No. This turns a silent wrong result into a RuntimeError consistent with what CPU already raised. No previously correct code is broken.