
ENH: Use AVX512-FP16 SVML content for float16 umath functions#23351

Merged
seiko2plus merged 8 commits into numpy:main from r-devulap:avx512fp16 on Jan 22, 2024

@r-devulap r-devulap commented Mar 6, 2023

Member

Leverage AVX512-FP16 SVML content. These are up to 4-5x faster than the FP32 SVML functions that were already in use (added in #21955). Max ULP errors are listed below; I am still working on getting exact benchmark numbers.

Requires gcc >= 12.x for build.

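| ufunc | Max ULP err |
|-------|-------------|
| acos  | 2.535 |
| acosh | 2.086 |
| asin  | 3.055 |
| asinh | 1.506 |
| atan  | 2.605 |
| atanh | 1.875 |
| cbrt  | 1.57 |
| cos   | 1.428 |
| cosh  | 1.326 |
| exp2  | 1.331 |
| exp   | 1.273 |
| expm1 | 0.5294 |
| ln    | 1.8 |
| log10 | 1.27 |
| log1p | 1.882 |
| log2  | 1.797 |
| sin   | 1.875 |
| sinh  | 2.049 |
| tan   | 2.264 |
| tanh  | 3 |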
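
For readers who want to reproduce figures like these, here is a minimal sketch of one way to measure a maximum ULP error. It is not the procedure used for the table above (the PR does not say how those numbers were obtained). It assumes that a float64 evaluation is an adequate reference and takes one ULP to be the float16 spacing at the rounded reference value. The helper name `max_ulp_error` is made up for illustration.

```python
import numpy as np


def max_ulp_error(func, lo, hi):
    """Largest error of `func` on float16 inputs in [lo, hi], in float16 ULPs.

    Assumption: the float64 result is treated as the exact reference.
    """
    # Enumerate every float16 bit pattern, then keep finite values in range.
    bits = np.arange(2**16, dtype=np.uint16)
    x = bits.view(np.float16)
    x = x[np.isfinite(x) & (x >= lo) & (x <= hi)]

    with np.errstate(all="ignore"):
        got = func(x).astype(np.float64)      # float16 loop under test
        ref = func(x.astype(np.float64))      # float64 reference
        ref_half = ref.astype(np.float16)     # reference rounded to float16
        ok = np.isfinite(got) & np.isfinite(ref_half)
        # Size of one float16 ULP at each (rounded) reference value.
        ulp = np.spacing(np.abs(ref_half[ok])).astype(np.float64)
        err = np.abs(got[ok] - ref[ok]) / ulp
    return float(np.max(err))


print("sin :", max_ulp_error(np.sin, -4.0, 4.0))
print("tanh:", max_ulp_error(np.tanh, -4.0, 4.0))
```

On a build or CPU without AVX512-FP16 this exercises the float32-based path, so the printed values would be expected to sit near 0.5 rather than near the figures in the table.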

Benchmark results on Intel Sapphire Rapids show up to a 6x speedup for float16 functions. I am not sure why they show regressions on some of the other ufuncs, which are unrelated to this patch.


       before           after         ratio
     [094416f7]       [839b9b77]
     <main>           <avx512fp16>
+        208±10μs        395±0.8μs     1.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'f')
+        271±10μs         460±10μs     1.70  bench_ufunc_strides.UnaryFP.time_unary(<ufunc '_ones_like'>, 1, 1, 'd')
+         263±5μs         372±80μs     1.41  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc '_ones_like'>, 1, 1, 'd')
+         207±7μs         280±70μs     1.36  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'reciprocal'>, 1, 1, 'f')
+        296±80μs          394±2μs     1.33  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'expm1'>, 1, 1, 'f')
+        554±20μs        707±100μs     1.28  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'd')
+        320±60μs          397±2μs     1.24  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cbrt'>, 1, 1, 'f')
+        685±90μs          803±1μs     1.17  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'd')
+       402±0.4μs         462±10μs     1.15  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'positive'>, 1, 1, 'd')
+        1.09±0ms         1.24±0ms     1.14  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'd')
+        344±40μs        387±0.7μs     1.13  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'f')
+     1.34±0.01ms      1.48±0.02ms     1.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'f')
+       365±0.5μs          402±2μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp2'>, 1, 1, 'f')
+       402±0.1μs         440±20μs     1.10  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'd')
+         411±7μs         446±20μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'conjugate'> (1), 1, 1, 'd')
+         406±2μs         440±10μs     1.09  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'positive'>, 1, 1, 'd')
+       750±0.8μs          807±2μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp'>, 1, 1, 'd')
+        857±70μs          919±2μs     1.07  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'd')
+         411±6μs         439±10μs     1.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'conjugate'> (0), 1, 1, 'd')
+         373±1μs          397±2μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cbrt'>, 1, 1, 'f')
+         666±7μs         704±10μs     1.06  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'logical_not'>, 1, 1, 'f')
-     5.29±0.01ms       5.02±0.2ms     0.95  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'floor'>, 1, 1, 'e')
-     1.71±0.01ms      1.61±0.03ms     0.94  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'f')
-     5.06±0.01ms       4.74±0.1ms     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'ceil'>, 1, 1, 'e')
-     5.13±0.07ms      4.81±0.06ms     0.94  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'e')
-     1.06±0.02ms         983±20μs     0.93  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sign'>, 1, 1, 'e')
-      5.17±0.1ms       4.81±0.1ms     0.93  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sqrt'>, 1, 1, 'e')
-        1.28±0ms         1.18±0ms     0.92  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'd')
-        1.28±0ms         1.17±0ms     0.92  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'd')
-     1.28±0.03ms      1.17±0.04ms     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'd')
-     5.48±0.03ms       5.01±0.3ms     0.91  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'e')
-        1.28±0ms      1.14±0.07ms     0.90  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'd')
-        1.28±0ms      1.14±0.07ms     0.89  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-        1.28±0ms      1.13±0.08ms     0.88  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'd')
-         894±1μs         780±30μs     0.87  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 1, 'd')
-       768±0.7μs          663±8μs     0.86  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'logical_not'>, 1, 1, 'd')
-         810±3μs          687±3μs     0.85  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sign'>, 1, 1, 'f')
-        1.27±0ms      1.03±0.01ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'rad2deg'>, 1, 1, 'f')
-        1.27±0ms         1.03±0ms     0.81  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-     1.27±0.01ms         1.03±0ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'deg2rad'>, 1, 1, 'f')
-        1.27±0ms      1.02±0.01ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'fabs'>, 1, 1, 'f')
-        1.27±0ms         1.02±0ms     0.81  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'radians'>, 1, 1, 'f')
-     1.27±0.02ms         1.03±0ms     0.80  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'degrees'>, 1, 1, 'f')
-         238±3μs        188±0.4μs     0.79  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'expm1'>, 1, 1, 'e')
-        369±10μs        200±0.6μs     0.54  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tan'>, 1, 1, 'e')
-       347±0.5μs        160±0.5μs     0.46  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sinh'>, 1, 1, 'e')
-        191±10μs       70.7±0.4μs     0.37  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp'>, 1, 1, 'e')
-         274±2μs       96.0±0.4μs     0.35  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
-       202±0.5μs       70.1±0.2μs     0.35  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log'>, 1, 1, 'e')
-         195±2μs         66.3±4μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log2'>, 1, 1, 'e')
-       184±0.6μs       62.2±0.5μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
-       200±0.7μs       67.3±0.5μs     0.34  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log10'>, 1, 1, 'e')
-         290±1μs         88.2±1μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cos'>, 1, 1, 'e')
-         263±6μs       78.4±0.6μs     0.30  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'sin'>, 1, 1, 'e')
-       435±0.9μs        115±0.2μs     0.26  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'log1p'>, 1, 1, 'e')
-         338±2μs         79.1±5μs     0.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsin'>, 1, 1, 'e')
-       371±0.8μs         83.8±1μs     0.23  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccos'>, 1, 1, 'e')
-        339±20μs       73.5±0.7μs     0.22  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cbrt'>, 1, 1, 'e')
-       351±0.9μs       73.0±0.3μs     0.21  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'cbrt'>, 1, 1, 'e')
-         382±5μs       78.3±0.9μs     0.20  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctan'>, 1, 1, 'e')
-        937±50μs        189±0.3μs     0.20  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'expm1'>, 1, 1, 'e')
-         596±1μs          119±4μs     0.20  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arccosh'>, 1, 1, 'e')
-        389±10μs       76.6±0.4μs     0.20  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arctan'>, 1, 1, 'e')
-         334±3μs         64.5±1μs     0.19  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'tanh'>, 1, 1, 'e')
-         514±3μs       91.6±0.3μs     0.18  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arctanh'>, 1, 1, 'e')
-         735±2μs        115±0.5μs     0.16  bench_ufunc_strides.UnaryFP.time_unary(<ufunc 'arcsinh'>, 1, 1, 'e')
-     1.40±0.05ms          143±1μs     0.10  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sinh'>, 1, 1, 'e')
-     1.28±0.03ms        121±0.3μs     0.09  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccosh'>, 1, 1, 'e')
-        923±20μs         82.7±1μs     0.09  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsin'>, 1, 1, 'e')
-     1.18±0.02ms       94.7±0.1μs     0.08  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cosh'>, 1, 1, 'e')
-         967±5μs       69.9±0.7μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp'>, 1, 1, 'e')
-     1.68±0.02ms        114±0.6μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arcsinh'>, 1, 1, 'e')
-     1.24±0.02ms         84.3±1μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arccos'>, 1, 1, 'e')
-         958±3μs         64.5±3μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'exp2'>, 1, 1, 'e')
-     1.67±0.01ms          110±7μs     0.07  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log1p'>, 1, 1, 'e')
-     3.59±0.08ms        233±0.7μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'cos'>, 1, 1, 'e')
-        3.67±0ms          224±9μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'sin'>, 1, 1, 'e')
-     3.22±0.06ms          182±2μs     0.06  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tan'>, 1, 1, 'e')
-     1.43±0.03ms         64.8±1μs     0.05  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'tanh'>, 1, 1, 'e')
-     1.55±0.03ms         66.8±1μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log10'>, 1, 1, 'e')
-     1.62±0.02ms       67.6±0.9μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log'>, 1, 1, 'e')
-     1.67±0.01ms         68.2±4μs     0.04  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'log2'>, 1, 1, 'e')
-     2.77±0.03ms       91.8±0.5μs     0.03  bench_ufunc_strides.UnaryFPSpecial.time_unary(<ufunc 'arctanh'>, 1, 1, 'e')
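
As a reading aid for the table above: the last argument of each benchmark is a NumPy type character (`'e'` is float16, `'f'` is float32, `'d'` is float64), and the ratio column is after/before, so values below 1 are speedups. The sketch below is a rough stand-in for the asv benchmark and not the benchmark itself; the function name `time_unary`, the array size, and the input range are assumptions chosen for illustration.

```python
import timeit

import numpy as np


def time_unary(ufunc, dtype_char, size=10_000, repeat=5, number=20):
    """Best-of-`repeat` time in seconds for one `ufunc` call on a contiguous array."""
    rng = np.random.default_rng(0)
    # Inputs in (0.1, 0.9) keep exp, log, sin, etc. away from special values.
    x = rng.uniform(0.1, 0.9, size).astype(dtype_char)
    out = np.empty_like(x)
    with np.errstate(all="ignore"):
        times = timeit.repeat(lambda: ufunc(x, out=out), repeat=repeat, number=number)
    return min(times) / number


for char in "efd":
    print(np.dtype(char).name, time_unary(np.exp, char))
```

Timings from a sketch like this are noisy; the asv numbers in the table remain the reference for this PR.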

@r-devulap r-devulap changed the title ENH: Use AVX512-FP16 SVML content for FP16 umath functions ENH: Use AVX512-FP16 SVML content for float16 umath functions Mar 6, 2023
@r-devulap r-devulap added the component: SIMD Issues in SIMD (fast instruction sets) code or machinery label Mar 6, 2023
@seberg

seberg commented Mar 9, 2023

Member

In general this seems fine to me, although @seiko2plus, it would be nice to have your opinion.

Is there any reason to worry that 3ULP in half precision is a much bigger relative error compared to float32 precision? Right now, we effectively get 0.5ULP reliably (since we use float32), bumping the ULPs on these so much seems like it is bound to be noticed.

I do not disagree that whoever does notice it probably shouldn't be using float16 to begin with (doing math in float16 seems very specialized for certain applications).
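
To put rough numbers on the concern above, here is a small sketch of what a given ULP count means as a relative error. It only uses `np.finfo`; the helper name `relative_error_near_one` is hypothetical, and the figure it returns is an upper bound for values in [1, 2), where one ULP equals the machine epsilon.

```python
import numpy as np


def relative_error_near_one(dtype, ulps):
    """Upper bound on the relative error that `ulps` ULPs represent for values in [1, 2)."""
    return ulps * float(np.finfo(dtype).eps)


print("float16, 0.5 ULP:", relative_error_near_one(np.float16, 0.5))
print("float16, 3 ULP  :", relative_error_near_one(np.float16, 3.0))
print("float32, 3 ULP  :", relative_error_near_one(np.float32, 3.0))
```

Since float16 has a 10-bit stored mantissa, 3 ULP corresponds to a relative error of about 3 × 2⁻¹⁰, several thousand times larger than 3 ULP in float32.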

@r-devulap

Member Author

2020b31 is a hack, I am open to better alternatives. SVML FP16 requires the latest assembler and we shouldn't add them to extra_objs before adding_extension('_multiarray_umath'). Not sure how else to do that.

@r-devulap

r-devulap commented Mar 21, 2023

Member Author

Looks like sin and cos do not report an invalid exception for np.inf. Working on fixing it.
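
A minimal way to check this behavior from Python, assuming only the public `np.errstate` context manager: turn the `invalid` floating-point status into an exception and see whether the ufunc triggers it. The helper name `raises_invalid` is made up for this sketch.

```python
import numpy as np


def raises_invalid(ufunc, value):
    """Return True if `ufunc(value)` signals the floating-point 'invalid' exception."""
    try:
        with np.errstate(invalid="raise"):
            ufunc(value)
    except FloatingPointError:
        return True
    return False


# Report whether sin/cos of a float16 infinity signal 'invalid' on this build.
for ufunc in (np.sin, np.cos):
    print(ufunc.__name__, raises_invalid(ufunc, np.float16(np.inf)))
```

The result depends on which loop is dispatched on the machine running it, so the output is deliberately not stated here.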

@r-devulap

Member Author

Added benchmark results above.

@mattip

mattip commented Mar 24, 2023

Member

> Requires gcc >= 12.x for build.

So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

@r-devulap

Member Author

> > Requires gcc >= 12.x for build.
>
> So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

Yeah, is there a path to using gcc >= 12.x?

@r-devulap

Member Author

> So this would be unused in our wheels, which use a manylinux_214 docker container to build. If I recall correctly, that uses gcc 9.3.

Also true for the other PR #23435

@mattip

mattip commented Mar 29, 2023

Member

> Yeah, is there a path to using gcc >= 12.x?

Maybe, but it would involve:

  • add gcc 12 to all the relevant CI and cibuildwheel jobs, and make conda also depend on gcc 12 (is that possible?)
  • clarify what happens with MSVC
  • add a note to documentation

Then we would have to add at least one CI job with an older compiler.

@r-devulap

Member Author

rebased.

@r-devulap

Member Author

numpy/SVML#3 takes care of raising invalid for sin/cos. This patch, when built with gcc-12, passes all the umath tests locally on my SKX:

sde -spr -- python -m pytest numpy/core/tests/test_umath* numpy/core/tests/test_ufunc.py numpy/linalg/tests/test_*

It does, however, fail two linalg tests which I don't think are related to this patch (they fail with the main branch too) and I think those are SDE bugs as well:

numpy/linalg/linalg.py:115: LinAlgError
========================================================================== short test summary info ===========================================================================
FAILED numpy/linalg/tests/test_linalg.py::TestCholesky::test_basic_property[float32-shape3] - numpy.linalg.LinAlgError: Matrix is not positive definite
FAILED numpy/linalg/tests/test_linalg.py::TestCholesky::test_basic_property[float64-shape3] - numpy.linalg.LinAlgError: Matrix is not positive definite

@r-devulap

Member Author

ping.

Comment thread numpy/core/tests/test_umath.py

@seiko2plus seiko2plus left a comment

Member


LGTM, Thanks.

@seiko2plus

Member

> Maybe, but it would involve:
> adding gcc 12 to all the relevant CI and cibuildwheel jobs, and make conda also depend on gcc12 (is that possible?)

intel_spr_sde_test (pull_request) passed only the build log; it seems SDE didn't enable SPR.

@r-devulap, would it be better to use:

python runtests.py -n  -- -k 'test_umath|linalg|test_ufunc'

instead of calling the pytest module directly, so we can get the runtime features log during testing?

> clarify what happens with MSVC

Neither SVML nor our infrastructure supports AVX512-FP16 on MSVC.
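
On the runtime-features point above, a quick way to see whether a given interpreter detects AVX512-FP16 is sketched below. It relies on the private `__cpu_features__` dictionary of NumPy's `_multiarray_umath` module; both that attribute and the `"AVX512_FP16"` key are assumptions about NumPy internals that may differ between versions, which is why every lookup falls back to a default. The helper names are hypothetical.

```python
import importlib


def cpu_features():
    """Best-effort copy of NumPy's detected CPU feature flags (private API, may change)."""
    # The module lives under numpy._core in NumPy 2.x and numpy.core in 1.x.
    for name in ("numpy._core._multiarray_umath", "numpy.core._multiarray_umath"):
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        return dict(getattr(mod, "__cpu_features__", {}))
    return {}


def has_feature(flag):
    """True only if NumPy reports `flag` as available at runtime."""
    return bool(cpu_features().get(flag, False))


print("AVX512_FP16 detected:", has_feature("AVX512_FP16"))
```

A `False` here means the float16 SVML loops cannot be selected at runtime, regardless of which compiler built the wheel.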

@r-devulap

Member Author

The SDE has a bug that corrupts the x87 stack, which leads to a lot of failures in the test suite. This is why we only have a build test. It should be fixed in the next release of SDE, and we can then enable the tests again.

@r-devulap

Member Author

ping :)

@seiko2plus

Member

@r-devulap needs a rebase

@r-devulap

Member Author

Rebased with main; the CI run on SPR should test this patch.

@r-devulap

Member Author

hah, we moved to meson since and that needs updating.

Comment thread numpy/_core/src/common/npy_svml.h Outdated
Comment thread numpy/_core/meson.build Outdated
@seiko2plus seiko2plus merged commit 1d0edc1 into numpy:main Jan 22, 2024
@seiko2plus

Member

Thanks @r-devulap

@seiko2plus seiko2plus added this to the 2.0.0 release milestone Jan 22, 2024
seberg added a commit to seberg/numpy that referenced this pull request Apr 7, 2026
This is a manual revert of numpygh-23351 since things were moved around
quite a lot since then.

Labels

01 - Enhancement component: SIMD Issues in SIMD (fast instruction sets) code or machinery
