
Optimize int4 vector computations by avoiding conversions#15742

Merged
kaivalnp merged 4 commits into apache:main from kaivalnp:optimize-vector-comp on Feb 26, 2026

Conversation

@kaivalnp
Contributor

Spinoff from #15736, where @mccullocht identified vector conversions as a potentially slow area (thanks!).
This PR loads and operates on SIMD registers of the preferred bit size to avoid intermediate conversions.

This was sparked from #15697 where we observed a performance drop in JMH benchmarks of some 4-bit vector computations after initial warmup. It's possible that this issue only affects ARM machines.
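For context, here is a hypothetical scalar model of the packed int4 ("half byte") layout and dot product these kernels compute. The layout (first half of the vector in the lower nibbles, second half in the upper nibbles) is inferred from the review snippet in this thread; names and structure are illustrative, not Lucene's actual implementation:

```java
// Hypothetical scalar model of the packed int4 layout and dot product;
// layout inferred from this thread, not taken from Lucene's code.
public class Int4Scalar {

  // Pack 4-bit values (0..15) two per byte: element i goes in the lower
  // nibble, element i + packed.length in the upper nibble.
  public static byte[] pack(byte[] unpacked) {
    byte[] packed = new byte[unpacked.length / 2];
    for (int i = 0; i < packed.length; i++) {
      packed[i] =
          (byte) ((unpacked[i] & 0x0F) | ((unpacked[i + packed.length] & 0x0F) << 4));
    }
    return packed;
  }

  // Dot product of an unpacked query against a packed document vector
  // (the "SinglePacked" case in the benchmarks below).
  public static int dotProductSinglePacked(byte[] q, byte[] dPacked) {
    int total = 0;
    for (int i = 0; i < dPacked.length; i++) {
      total += q[i] * (dPacked[i] & 0x0F);                         // lower nibble
      total += q[i + dPacked.length] * ((dPacked[i] >> 4) & 0x0F); // upper nibble
    }
    return total;
  }
}
```

The SIMD versions do the same arithmetic per lane; the question this PR addresses is how much lane-widening conversion work surrounds it.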

Ran JMH benchmarks on an AWS Graviton3 host using:

java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte.*Vector" -p size=1024

Baseline:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  11.846 ± 0.034  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   2.618 ± 0.009  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.733 ± 0.063  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  12.599 ± 0.022  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   2.603 ± 0.008  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.492 ± 0.033  ops/us

This PR:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  17.356 ± 0.052  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  19.157 ± 0.055  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.575 ± 0.049  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  16.030 ± 0.077  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.247 ± 0.120  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.952 ± 0.113  ops/us

Comment on lines 1019 to 1023
// upper
ByteVector va8 = unpacked.load(Int4Constants.BYTE_SPECIES, i + j + packed.length());
ByteVector diff8 = vb8.and((byte) 0x0F).sub(va8);
Vector<Short> diff16 = diff8.convertShape(B2S, Int4Constants.SHORT_SPECIES, 0);
acc0 = acc0.add(diff16.mul(diff16));

// lower
ByteVector vc8 = unpacked.load(Int4Constants.BYTE_SPECIES, i + j);
Contributor Author

Something I found interesting: we see a performance drop after a few warmup iterations if we operate on the upper nibbles (multiply with self and add to the accumulator) before loading the lower ones:

# Warmup Iteration   1: 12.736 ops/us
# Warmup Iteration   2: 16.919 ops/us
# Warmup Iteration   3: 4.171 ops/us
# Warmup Iteration   4: 4.174 ops/us
Iteration   1: 4.179 ops/us
Iteration   2: 4.188 ops/us
Iteration   3: 4.193 ops/us
Iteration   4: 4.201 ops/us
Iteration   5: 4.198 ops/us

@mccullocht
Contributor

It looks like there are some losses on x86 -- binaryHalfByteDotProductVector sees a small loss, binaryHalfByteSquareVector sees a fairly large loss. I'll investigate a bit.

AMD Ryzen AI 395 (AVX 512)
Baseline:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  23.318 ± 0.107  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  11.839 ± 0.075  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  66.883 ± 0.965  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  29.886 ± 0.167  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  12.464 ± 0.374  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  71.097 ± 0.476  ops/us

Experiment:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  35.444 ± 0.184  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  42.581 ± 0.464  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  62.781 ± 0.686  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  33.665 ± 0.254  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  41.314 ± 0.367  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  58.312 ± 0.703  ops/us

Mac M2 (128 bit/NEON):
Baseline:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  15.866 ± 0.206  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   2.746 ± 0.029  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  13.612 ± 0.127  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  15.815 ± 0.068  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   2.758 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  13.440 ± 0.088  ops/us

Experiment:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  23.285 ± 0.371  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  25.559 ± 0.601  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  17.269 ± 1.498  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  21.115 ± 0.188  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  24.063 ± 0.477  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  17.184 ± 0.077  ops/us

@mccullocht
Contributor

Re: x86 regression:

  • For binaryHalfByteDotProductVector I was able to mitigate the regression by using two ShortVector accumulators and an outer IntVector accumulator. Profiles showed 30-40% of time in reduceLanes.
  • For binaryHalfByteSquareVector I repeated the same steps as for dot product and got a 10% bump, but it's still about 10% below the baseline. I'm getting a large drop-off from warmup iterations to test iterations in JMH, but otherwise the profiles are very similar to dot product. I don't have a good explanation for this. The old implementation widens before multiplying (not really necessary), but it also widens 256 -> 512 bits. 🤷
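A scalar analogue of the two-tier accumulation described above (a hypothetical sketch — the actual fix keeps two ShortVector accumulators in the loop and folds them into an IntVector, so reduceLanes runs only once at the end):

```java
// Hypothetical scalar analogue of tiered accumulation: a 16-bit inner
// accumulator is flushed into a 32-bit total before it can overflow,
// mirroring ShortVector accumulators folded into an IntVector.
public class TieredAccum {
  public static int squareDistance(byte[] a, byte[] b) {
    int total = 0;
    short inner = 0; // stands in for one ShortVector lane
    int terms = 0;
    for (int i = 0; i < a.length; i++) {
      int d = a[i] - b[i];
      inner += d * d;
      // int4 values are 0..15, so each squared difference is at most 225;
      // flushing every 128 terms keeps the 16-bit sum under 2^15.
      if (++terms == 128) {
        total += inner;
        inner = 0;
        terms = 0;
      }
    }
    return total + inner; // final flush ~ the single reduceLanes at the end
  }
}
```

The payoff in the SIMD version is that the hot loop stays in cheap short-lane adds, and the expensive cross-lane reduction happens once instead of per iteration.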

@kaivalnp
Contributor Author

Latest JMH benchmarks on AWS Graviton3:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  18.930 ± 0.037  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  21.819 ± 0.085  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  26.335 ± 0.196  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  16.055 ± 0.030  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.465 ± 0.063  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.935 ± 0.146  ops/us

…which is a slight improvement over the earlier results.

knnPerfTest.py results on Cohere v3 vectors, 1024d, dot_product similarity:

Baseline with -reindex

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.918        2.806   2.803        0.999  500000   100     100       64        250     4 bits     8223     93.42       5351.94          141.31             1         2276.79            null                N/A       1.000      2204.895      251.770       false       HNSW

Candidate with -reindex

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.918        2.655   2.654        1.000  500000   100     100       64        250     4 bits     8220     96.46       5183.39          131.73             1         2276.76            null                N/A       1.000      2204.895      251.770       false       HNSW

Baseline search-only, use existing index

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.918        2.666   2.665        0.999  500000   100     100       64        250     4 bits     8220     96.46       5183.39          131.73             1         2276.76            null                N/A       1.000      2204.895      251.770       false       HNSW

Candidate search-only, use existing index

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.918        2.540   2.539        0.999  500000   100     100       64        250     4 bits     8220     96.46       5183.39          131.73             1         2276.76            null                N/A       1.000      2204.895      251.770       false       HNSW

Seems like there's a small improvement with this PR.

@kaivalnp
Contributor Author

Thanks for the benchmarks + fixes for x86 @mccullocht -- please feel free to push to this branch directly, or I'll try to recreate them from your description in a day or so.

@mccullocht
Contributor

@kaivalnp the changes in 7ce3c00 look like what I did.

@kaivalnp
Contributor Author

I ran this JMH benchmark on branch_10x with Java 24:

java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte.*Vector" -p size=1024
openjdk 24.0.2 2025-07-15
OpenJDK Runtime Environment (build 24.0.2+12-54)
OpenJDK 64-Bit Server VM (build 24.0.2+12-54, mixed mode, sharing)

Baseline:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   0.471 ± 0.004  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  16.144 ± 0.039  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.829 ± 0.101  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  14.145 ± 0.030  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.309 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.687 ± 0.113  ops/us

There is a drop in performance of binaryHalfByteDotProductBothPackedVector after some warmup iterations:

# Warmup Iteration   1: 10.793 ops/us
# Warmup Iteration   2: 13.988 ops/us
# Warmup Iteration   3: 0.468 ops/us
# Warmup Iteration   4: 0.464 ops/us
Iteration   1: 0.463 ops/us
Iteration   2: 0.466 ops/us
Iteration   3: 0.470 ops/us
Iteration   4: 0.470 ops/us
Iteration   5: 0.469 ops/us

Candidate (cherry-pick this PR):

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  19.708 ± 0.095  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  21.649 ± 0.125  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  26.482 ± 0.145  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  15.783 ± 0.043  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.710 ± 0.056  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  19.008 ± 0.151  ops/us

The cherry-pick was successful without conflicts, and performance seems to improve in general -- should we target this for 10.5?

@kaivalnp kaivalnp marked this pull request as ready for review February 24, 2026 20:03
@github-actions github-actions bot added this to the 10.5.0 milestone Feb 24, 2026
@kaivalnp
Contributor Author

Thanks @mccullocht! Could you help with a sanity-check JMH benchmark and knnPerfTest run on your x86 machine to verify that we're not regressing anything there?

@mccullocht
Contributor

x86 benchmarks:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  23.675 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  11.855 ± 0.059  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  68.691 ± 0.816  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  30.269 ± 0.148  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  12.770 ± 0.139  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  72.246 ± 0.406  ops/us

Experiment:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  35.692 ± 0.152  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  42.843 ± 0.523  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  65.359 ± 0.618  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  34.310 ± 0.105  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  42.061 ± 0.312  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  57.315 ± 0.760  ops/us

VectorUtilBenchmark.binaryHalfByteSquareVector is quite a bit slower, but I also don't know under what conditions we'd actually run this. I suspect we don't do a ton of distance comparisons using two unpacked int4 vectors.

@mccullocht
Contributor

1M cohere 1024d vectors, dot_product.

Baseline Results:
recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.914        2.317   2.309        0.997  1000000   100     100       64        250     4 bits     8707    119.32       8380.54          213.79             1         4559.57            null                N/A       1.000      4409.790      503.540       false       HNSW

Experiment Results:
recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.914        2.194   2.187        0.996  1000000   100     100       64        250     4 bits     8707    119.32       8380.54          213.79             1         4559.57            null                N/A       1.000      4409.790      503.540       false       HNSW

Looks like about 5%, roughly the same as on Graviton 3.

Targeting 10.5 SGTM.

@kaivalnp
Contributor Author

VectorUtilBenchmark.binaryHalfByteSquareVector is quite a bit slower

This worried me, so I attempted to change it back to operating on shorts (while still avoiding convert, using reinterpret casts + bit manipulation).

JMH benchmarks on AWS Graviton3 somehow improved further:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  19.555 ± 0.032  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  20.531 ± 0.079  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  26.234 ± 0.118  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  16.225 ± 0.040  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  17.840 ± 0.040  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  20.381 ± 0.083  ops/us
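The reinterpret + mask idea can be illustrated in scalar (SWAR) form: treat two packed bytes as one 16-bit word and strip both lower (or upper) nibbles with a single mask, instead of widening each byte first. A hypothetical sketch, not the actual Lucene kernel:

```java
// Hypothetical SWAR sketch of reinterpret casting + bit manipulation:
// two packed int4 bytes viewed as one short; one AND isolates both
// lower nibbles, one shift + AND both upper nibbles -- no byte-to-short
// conversion instructions needed.
public class NibbleSwar {
  // Returns {lowNibble(byte0), lowNibble(byte1), highNibble(byte0), highNibble(byte1)}
  // for a little-endian short holding byte0 in its low 8 bits.
  public static int[] nibbles(short word) {
    int lo = word & 0x0F0F;        // lower nibble of each byte at once
    int hi = (word >> 4) & 0x0F0F; // upper nibble of each byte at once
    return new int[] {lo & 0xFF, lo >>> 8, hi & 0xFF, hi >>> 8};
  }
}
```

In the vectorized code the same trick applies per register: a ByteVector is reinterpreted at a wider lane width and masked, so the nibble extraction stays inside the loaded register instead of going through convertShape.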

@kaivalnp
Contributor Author

I got my hands on a temporary x86 EC2 machine. lscpu says:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
Stepping:            4
CPU MHz:             3089.456
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            33792K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke

Baseline:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   6.856 ± 0.036  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   8.771 ± 0.047  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  14.856 ± 0.300  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15   6.741 ± 0.069  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   8.134 ± 0.069  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  13.244 ± 0.240  ops/us

This PR (latest changes):

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   6.781 ± 0.086  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15   8.806 ± 0.093  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  14.904 ± 0.242  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15   6.685 ± 0.019  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15   8.755 ± 0.092  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  16.753 ± 0.193  ops/us

@mccullocht sorry for another iteration, but could you verify performance on your x86 machine with the latest changes?

@mccullocht
Contributor

As of ac60bae:

Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15  35.753 ± 0.190  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  42.972 ± 0.165  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  65.319 ± 0.773  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  34.473 ± 0.121  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  42.166 ± 0.359  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  72.583 ± 0.999  ops/us

lscpu output for completeness:

Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             48 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      32
  On-line CPU(s) list:       0-31
Vendor ID:                   AuthenticAMD
  Model name:                AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
    CPU family:              26
    Model:                   112
    Thread(s) per core:      2
    Core(s) per socket:      16
    Socket(s):               1
    Stepping:                0
    Frequency boost:         enabled
    CPU(s) scaling MHz:      41%
    CPU max MHz:             5187.5000
    CPU min MHz:             625.0000
    BogoMIPS:                5988.47
    Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze

@kaivalnp
Contributor Author

Thanks @mccullocht! Looks like performance is equivalent or better across all functions (except a small dip in binaryHalfByteDotProductVector, from ~69 ops/us to ~65 ops/us).

I'll merge + backport this to 10.x in a day or so!

@kaivalnp kaivalnp merged commit 98e1c07 into apache:main Feb 26, 2026
13 checks passed
@kaivalnp kaivalnp deleted the optimize-vector-comp branch February 26, 2026 16:39
kaivalnp added a commit that referenced this pull request Feb 26, 2026
Optimize int4 vector computations by avoiding conversions (#15742)

Replace intermediate vector conversions with reinterpret casting + bit manipulation.

(cherry picked from commit 98e1c07)
