ENH: Use AVX512-FP16 SVML content for float16 umath functions by r-devulap · Pull Request #23351 · numpy/numpy

r-devulap · 2023-03-06T20:28:34Z

Leverage AVX512-FP16 SVML content. These functions are up to 4-5x faster than the FP32 SVML functions, which were already added in #21955. Max ULP errors are listed below; I am still working on getting exact benchmark numbers.

Requires gcc >= 12.x for build.

ufunc	Max ULP err
acos	2.535
acosh	2.086
asin	3.055
asinh	1.506
atan	2.605
atanh	1.875
cbrt	1.57
cos	1.428
cosh	1.326
exp2	1.331
exp	1.273
expm1	0.5294
ln	1.8
log10	1.27
log1p	1.882
log2	1.797
sin	1.875
sinh	2.049
tan	2.264
tanh	3
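
For anyone who wants to reproduce numbers like these, here is a minimal sketch of one way to estimate the max ULP error of a float16 ufunc against a float64 reference. The helper name `max_ulp_error_f16` and the input range in the usage line are assumptions made for illustration; this is not necessarily how the table above was generated.

```python
import numpy as np

def max_ulp_error_f16(ufunc, x):
    """Estimate the max error of ufunc on float16 input, in float16 ULPs."""
    x = np.asarray(x, dtype=np.float16)
    # Result as computed by the float16 loop, widened for comparison.
    got = ufunc(x).astype(np.float64)
    # Reference result computed in double precision on the same inputs.
    ref = ufunc(x.astype(np.float64))
    # Size of one float16 ULP at the magnitude of the reference value.
    ulp = np.spacing(np.abs(ref).astype(np.float16)).astype(np.float64)
    # Skip inf/nan results; assumes at least one finite result remains.
    ok = np.isfinite(got) & np.isfinite(ref) & np.isfinite(ulp) & (ulp > 0)
    return float(np.max(np.abs(got - ref)[ok] / ulp[ok]))

# Example: error of exp over a small positive range.
err = max_ulp_error_f16(np.exp, np.linspace(0.1, 4.0, 1000))
```

On a build that still upcasts to float32 the result should sit near 0.5; on a build using the FP16 SVML path it would be expected to approach the values listed in the table.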

Benchmark results on Intel Sapphire Rapids show up to a 6x speedup for float16 functions. I am not sure why they show regressions on some of the other ufuncs, which are unrelated to this patch.


       before           after         ratio
     [094416f7]       [839b9b77]
     <main>           <avx512fp16>
+        208±10μs        395±0.8μs     1.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'f')
+        271±10μs         460±10μs     1.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'd')
+         263±5μs         372±80μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'd')
+         207±7μs         280±70μs     1.36  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 1, 'f')
+        296±80μs          394±2μs     1.33  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'expm1'>, 1, 1, 'f')
+        554±20μs        707±100μs     1.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'd')
+        320±60μs          397±2μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cbrt'>, 1, 1, 'f')
+        685±90μs          803±1μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'd')
+       402±0.4μs         462±10μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 1, 'd')
+        1.09±0ms         1.24±0ms     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')
+        344±40μs        387±0.7μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'f')
+     1.34±0.01ms      1.48±0.02ms     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'f')
+       365±0.5μs          402±2μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp2'>, 1, 1, 'f')
+       402±0.1μs         440±20μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'd')
+         411±7μs         446±20μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'd')
+         406±2μs         440±10μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'd')
+       750±0.8μs          807±2μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp'>, 1, 1, 'd')
+        857±70μs          919±2μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'd')
+         411±6μs         439±10μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'd')
+         373±1μs          397±2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cbrt'>, 1, 1, 'f')
+         666±7μs         704±10μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 1, 'f')
-     5.29±0.01ms       5.02±0.2ms     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'e')
-     1.71±0.01ms      1.61±0.03ms     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'f')
-     5.06±0.01ms       4.74±0.1ms     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'e')
-     5.13±0.07ms      4.81±0.06ms     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'e')
-     1.06±0.02ms         983±20μs     0.93  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 1, 'e')
-      5.17±0.1ms       4.81±0.1ms     0.93  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 1, 1, 'e')
-        1.28±0ms         1.18±0ms     0.92  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')
-        1.28±0ms         1.17±0ms     0.92  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     1.28±0.03ms      1.17±0.04ms     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     5.48±0.03ms       5.01±0.3ms     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'e')
-        1.28±0ms      1.14±0.07ms     0.90  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')
-        1.28±0ms      1.14±0.07ms     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-        1.28±0ms      1.13±0.08ms     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-         894±1μs         780±30μs     0.87  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 1, 'd')
-       768±0.7μs          663±8μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 1, 1, 'd')
-         810±3μs          687±3μs     0.85  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'f')
-        1.27±0ms      1.03±0.01ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
-        1.27±0ms         1.03±0ms     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-     1.27±0.01ms         1.03±0ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
-        1.27±0ms      1.02±0.01ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-        1.27±0ms         1.02±0ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'f')
-     1.27±0.02ms         1.03±0ms     0.80  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
-         238±3μs        188±0.4μs     0.79  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'expm1'>, 1, 1, 'e')
-        369±10μs        200±0.6μs     0.54  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'e')
-       347±0.5μs        160±0.5μs     0.46  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 1, 'e')
-        191±10μs       70.7±0.4μs     0.37  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp'>, 1, 1, 'e')
-         274±2μs       96.0±0.4μs     0.35  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
-       202±0.5μs       70.1±0.2μs     0.35  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 1, 1, 'e')
-         195±2μs         66.3±4μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'e')
-       184±0.6μs       62.2±0.5μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
-       200±0.7μs       67.3±0.5μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 1, 'e')
-         290±1μs         88.2±1μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'e')
-         263±6μs       78.4±0.6μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'e')
-       435±0.9μs        115±0.2μs     0.26  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log1p'>, 1, 1, 'e')
-         338±2μs         79.1±5μs     0.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'e')
-       371±0.8μs         83.8±1μs     0.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 1, 'e')
-        339±20μs       73.5±0.7μs     0.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cbrt'>, 1, 1, 'e')
-       351±0.9μs       73.0±0.3μs     0.21  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cbrt'>, 1, 1, 'e')
-         382±5μs       78.3±0.9μs     0.20  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'e')
-        937±50μs        189±0.3μs     0.20  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'expm1'>, 1, 1, 'e')
-         596±1μs          119±4μs     0.20  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccosh'>, 1, 1, 'e')
-        389±10μs       76.6±0.4μs     0.20  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arctan'>, 1, 1, 'e')
-         334±3μs         64.5±1μs     0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'e')
-         514±3μs       91.6±0.3μs     0.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctanh'>, 1, 1, 'e')
-         735±2μs        115±0.5μs     0.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsinh'>, 1, 1, 'e')
-     1.40±0.05ms          143±1μs     0.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 1, 'e')
-     1.28±0.03ms        121±0.3μs     0.09  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'e')
-        923±20μs         82.7±1μs     0.09  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsin'>, 1, 1, 'e')
-     1.18±0.02ms       94.7±0.1μs     0.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
-         967±5μs       69.9±0.7μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 1, 1, 'e')
-     1.68±0.02ms        114±0.6μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsinh'>, 1, 1, 'e')
-     1.24±0.02ms         84.3±1μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 1, 1, 'e')
-         958±3μs         64.5±3μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
-     1.67±0.01ms          110±7μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log1p'>, 1, 1, 'e')
-     3.59±0.08ms        233±0.7μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'e')
-        3.67±0ms          224±9μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'e')
-     3.22±0.06ms          182±2μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'e')
-     1.43±0.03ms         64.8±1μs     0.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'e')
-     1.55±0.03ms         66.8±1μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log10'>, 1, 1, 'e')
-     1.62±0.02ms       67.6±0.9μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'e')
-     1.67±0.01ms         68.2±4μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'e')
-     2.77±0.03ms       91.8±0.5μs     0.03  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arctanh'>, 1, 1, 'e')
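
A rough way to spot-check one of the float16 rows outside of asv is to time a ufunc on a contiguous array directly. This is a minimal sketch; the helper name `time_unary_once`, the array size, and the input range are assumptions and do not mirror the `bench_ufunc_strides` setup exactly.

```python
import timeit
import numpy as np

def time_unary_once(ufunc, dtype, n=10_000, repeat=5, number=100):
    """Return the best per-call time (seconds) of ufunc on a contiguous array."""
    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 1.0, n).astype(dtype)
    out = np.empty_like(x)
    # Write into a preallocated output so allocation is not timed.
    timings = timeit.repeat(lambda: ufunc(x, out=out), repeat=repeat, number=number)
    return min(timings) / number

# Compare the float16 ('e') and float32 ('f') loops for one function.
t_half = time_unary_once(np.exp, np.float16)
t_single = time_unary_once(np.exp, np.float32)
```

Running this on both branches and dividing the float16 timings gives a number comparable to the `ratio` column (after / before), so values below 1 mean the patch is faster.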

seberg · 2023-03-09T08:43:52Z

In general this seems fine to me, although @seiko2plus, it would be nice to have your opinion.

Is there any reason to worry that 3ULP in half precision is a much bigger relative error compared to float32 precision? Right now, we effectively get 0.5ULP reliably (since we use float32), bumping the ULPs on these so much seems like it is bound to be noticed.

I do not disagree that whoever does notice it probably shouldn't be using float16 to begin with (doing math in float16 seems very specialized for certain applications).
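
To put the ULP question in concrete terms, here is a small worked example of what a given ULP error means as a relative error in each precision. It relies only on `np.finfo`; the helper name `relative_error_for_ulps` is hypothetical, and the figure is the worst case (a value at the bottom of a binade, where one ULP equals machine epsilon in relative terms).

```python
import numpy as np

def relative_error_for_ulps(dtype, ulps):
    """Worst-case relative error corresponding to `ulps` units in the last place."""
    # eps is the spacing between 1.0 and the next representable value.
    return ulps * float(np.finfo(dtype).eps)

# float16 has eps = 2**-10, so 3 ULP is roughly a 0.3% relative error.
half_3ulp = relative_error_for_ulps(np.float16, 3)
# float32 has eps = 2**-23, so 3 ULP is roughly 3.6e-7 in relative terms.
single_3ulp = relative_error_for_ulps(np.float32, 3)
```

The same ULP count is therefore about 8192 times (2**13) larger as a relative error in float16 than in float32.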

r-devulap · 2023-03-14T20:37:48Z

2020b31 is a hack, I am open to better alternatives. SVML FP16 requires the latest assembler and we shouldn't add them to extra_objs before adding_extension('_multiarray_umath'). Not sure how else to do that.

r-devulap · 2023-03-21T19:15:05Z

Looks like sin and cos do not report an invalid exception for np.inf. Working on fixing it.
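
A minimal sketch of how this can be checked from Python, assuming only the standard `np.errstate` mechanism: turn the invalid flag into an exception and see whether the ufunc trips it. The helper name `raises_invalid` is hypothetical.

```python
import numpy as np

def raises_invalid(ufunc, value, dtype=np.float16):
    """Return True if ufunc(value) reports a floating-point 'invalid' error."""
    x = np.array([value], dtype=dtype)
    with np.errstate(invalid="raise"):
        try:
            ufunc(x)
        except FloatingPointError:
            return True
    return False

# sin(inf) and cos(inf) are expected to report invalid; a False here
# would indicate the missing-exception behaviour described above.
sin_ok = raises_invalid(np.sin, np.inf)
cos_ok = raises_invalid(np.cos, np.inf)
```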

r-devulap · 2023-03-22T16:37:20Z

Added benchmark results above.

mattip · 2023-03-24T04:43:02Z

> Requires gcc >= 12.x for build.

So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

r-devulap · 2023-03-28T19:20:21Z

> Requires gcc >= 12.x for build.

> So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

Yeah, is there a path to using gcc >= 12.x?

r-devulap · 2023-03-28T19:21:16Z

> So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

Also true for the other PR #23435

mattip · 2023-03-29T04:26:24Z

> Yeah, is there a path to using gcc >= 12.x?

Maybe, but it would involve:

- adding gcc 12 to all the relevant CI and cibuildwheel jobs, and making conda also depend on gcc 12 (is that possible?)
- clarifying what happens with MSVC
- adding a note to the documentation

Then we would have to add at least one CI job with an older compiler.

r-devulap · 2023-04-05T16:30:55Z

rebased.

r-devulap · 2023-05-23T20:42:02Z

numpy/SVML#3 takes care of raising invalid for sin/cos. This patch, when built with gcc-12, passes all the umath tests locally on my SKX:

sde -spr -- python -m pytest numpy/core/tests/test_umath* numpy/core/tests/test_ufunc.py numpy/linalg/tests/test_*

It does, however, fail two linalg tests which I don't think are related to this patch (they fail with the main branch too) and I think those are SDE bugs as well:

numpy/linalg/linalg.py:115: LinAlgError
========================================================================== short test summary info ===========================================================================
FAILED numpy/linalg/tests/test_linalg.py::TestCholesky::test_basic_property[float32-shape3] - numpy.linalg.LinAlgError: Matrix is not positive definite
FAILED numpy/linalg/tests/test_linalg.py::TestCholesky::test_basic_property[float64-shape3] - numpy.linalg.LinAlgError: Matrix is not positive definite

r-devulap · 2023-07-06T03:41:55Z

ping.

seiko2plus

LGTM, Thanks.

seiko2plus · 2023-07-06T19:32:47Z

> Maybe, but it would involve:
> adding gcc 12 to all the relevant CI and cibuildwheel jobs, and make conda also depend on gcc12 (is that possible?)

intel_spr_sde_test (pull_request) passed but shows only the build log; it seems SDE didn't enable SPR.

@r-devulap, would it be better to use:

python runtests.py -n  -- -k 'test_umath|linalg|test_ufunc'

instead of calling the pytest module directly, so we can get the runtime features log during testing?

> clarify what happens with MSVC

Neither SVML nor our infrastructure supports AVX512-FP16 on MSVC.
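
On the runtime features point, a quick way to see from Python whether the running NumPy detected a given CPU feature is sketched below. It assumes the private `__cpu_features__` dict exposed by `_multiarray_umath` (its module path differs between NumPy 1.x and 2.x) and assumes the feature names `AVX512_SPR` and `AVX512FP16`; the helper name `has_cpu_feature` is hypothetical.

```python
def has_cpu_feature(name):
    """Return True if NumPy's runtime CPU detection reports feature `name`."""
    try:
        # NumPy 2.x location of the private module.
        from numpy._core._multiarray_umath import __cpu_features__ as feats
    except ImportError:
        # NumPy 1.x location.
        from numpy.core._multiarray_umath import __cpu_features__ as feats
    # Unknown names are treated as unsupported.
    return bool(feats.get(name, False))

# Both are expected to be True only when running on (or emulating) SPR.
spr = has_cpu_feature("AVX512_SPR")
fp16 = has_cpu_feature("AVX512FP16")
```

Since this is a private attribute, it is only suitable for debugging; `np.show_runtime()` prints similar information in recent NumPy versions.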

r-devulap · 2023-07-06T19:36:28Z

The SDE has a bug which corrupts the x87 stack and leads to a lot of failures in the test suite. This is why we only have a build test. It should be fixed in the next release of SDE, and we can then enable the tests again.

r-devulap · 2023-07-31T22:43:42Z

ping :)

seiko2plus · 2024-01-17T10:19:34Z

@r-devulap needs a rebase

r-devulap · 2024-01-18T17:32:42Z

rebased with main, the CI run on SPR should test this patch.

r-devulap · 2024-01-18T17:44:16Z

hah, we moved to meson since and that needs updating.

seiko2plus · 2024-01-22T20:29:42Z

Thanks @r-devulap

github-actions Bot added the 01 - Enhancement label Mar 6, 2023

r-devulap changed the title ~~ENH: Use AVX512-FP16 SVML content for FP16 umath functions~~ ENH: Use AVX512-FP16 SVML content for float16 umath functions Mar 6, 2023

r-devulap added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Mar 6, 2023

r-devulap mentioned this pull request Mar 8, 2023

Add SVML AVX512 FP16 content numpy/SVML#2

Merged

r-devulap force-pushed the avx512fp16 branch from a4a2074 to 53cb336 Compare March 14, 2023 20:31

r-devulap mentioned this pull request Mar 30, 2023

CI: Add CI to test using gcc-12 on Intel Sapphire Rapids #23502

Merged

r-devulap force-pushed the avx512fp16 branch from 839b9b7 to b10bcb4 Compare April 5, 2023 16:29

r-devulap mentioned this pull request Apr 28, 2023

CI: Enable CI on gcc 12.x on Intel SPR #23655

Merged

r-devulap force-pushed the avx512fp16 branch from b10bcb4 to e114179 Compare May 23, 2023 20:36

r-devulap force-pushed the avx512fp16 branch from e114179 to bf523b9 Compare July 6, 2023 03:41

seiko2plus reviewed Jul 6, 2023

Comment thread numpy/core/tests/test_umath.py

seiko2plus approved these changes Jul 6, 2023

r-devulap force-pushed the avx512fp16 branch from bf523b9 to efc6af3 Compare January 18, 2024 17:31

Raghuveer Devulapalli added 2 commits January 18, 2024 09:32

ENH: Vectorize FP16 math functions on Intel Sapphire Rapids

55fee61

Fix linter errors

c8c0a18

Raghuveer Devulapalli added 2 commits January 18, 2024 09:32

TST: Adjust for ULP errors in FP16 math functions

8d2af7f

MAINT: Correct svml function signature

efc6af3

Add half precision svml files to meson

f95896d

seiko2plus reviewed Jan 19, 2024

Comment thread numpy/_core/src/common/npy_svml.h Outdated

BUG: Fix function declaration for half precision svml funcs

0a0ac43

r-devulap force-pushed the avx512fp16 branch from 99ee0c1 to 34ec3fb Compare January 19, 2024 23:11

BLD: Conditionally add SVML FP16 objects

34ec3fb

r-devulap commented Jan 19, 2024

Comment thread numpy/_core/meson.build Outdated

BLD: combine variable names into one

bc27299

seiko2plus merged commit 1d0edc1 into numpy:main Jan 22, 2024

seiko2plus added this to the 2.0.0 release milestone Jan 22, 2024

r-devulap mentioned this pull request Feb 20, 2026

BUG: Highly degraded precision for tanh for arrays of float16 on Sapphire Rapids - Regression in 2.3.0 #30821

Closed

seberg added a commit to seberg/numpy that referenced this pull request Apr 7, 2026

REV: Manual revert of float16 svml use

d53ac26

This is a manual revert of numpygh-23351 since things were moved around quite a lot since then.

seberg added a commit to seberg/numpy that referenced this pull request Apr 7, 2026

REV: Manual revert of float16 svml use

03b0a6e

This is a manual revert of numpygh-23351 since things were moved around quite a lot since then.

seberg mentioned this pull request Apr 7, 2026

REV: Manual revert of float16 svml use #31178

Merged

charris mentioned this pull request Apr 11, 2026

REV: Manual revert of float16 svml use (#31178) #31212

Merged
