Optimize int4 vector computations by avoiding conversions #15742
kaivalnp merged 4 commits into apache:main
Conversation
```java
// upper
ByteVector va8 = unpacked.load(Int4Constants.BYTE_SPECIES, i + j + packed.length());
ByteVector diff8 = vb8.and((byte) 0x0F).sub(va8);
Vector<Short> diff16 = diff8.convertShape(B2S, Int4Constants.SHORT_SPECIES, 0);
acc0 = acc0.add(diff16.mul(diff16));

// lower
ByteVector vc8 = unpacked.load(Int4Constants.BYTE_SPECIES, i + j);
```
Something I found interesting: we see a performance drop after a few warmup iterations if we operate on upper (multiply with self and add to accumulator) before loading lower:
```
# Warmup Iteration 1: 12.736 ops/us
# Warmup Iteration 2: 16.919 ops/us
# Warmup Iteration 3: 4.171 ops/us
# Warmup Iteration 4: 4.174 ops/us
Iteration 1: 4.179 ops/us
Iteration 2: 4.188 ops/us
Iteration 3: 4.193 ops/us
Iteration 4: 4.201 ops/us
Iteration 5: 4.198 ops/us
```
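As a reference point, here is a scalar analogue of what the vectorized snippet above computes. This is a sketch only: the nibble-to-half pairing is an assumption inferred from the `0x0F` mask and the `i + j + packed.length()` offset in the diff, not the actual Lucene kernel.

```java
// Hypothetical scalar sketch of the int4 square-distance kernel.
// Assumption (for illustration): each packed byte holds two 4-bit values; the
// low nibble pairs with the second half of the unpacked vector ("upper" in the
// diff) and the high nibble with the first half ("lower").
public class Int4SquareDistanceSketch {
  public static int squareDistance(byte[] packed, byte[] unpacked) {
    int acc = 0;
    for (int i = 0; i < packed.length; i++) {
      // extract both nibbles before multiplying; per the observation above, the
      // vectorized code avoided the post-warmup drop by loading the lower half
      // before operating on the upper half
      int lo = packed[i] & 0x0F;
      int hi = (packed[i] >> 4) & 0x0F;
      int dUpper = lo - unpacked[i + packed.length]; // low nibble vs. second half
      int dLower = hi - unpacked[i];                 // high nibble vs. first half
      acc += dUpper * dUpper + dLower * dLower;
    }
    return acc;
  }
}
```

Since 4-bit values lie in 0..15, each squared difference is at most 225, so an int accumulator has ample headroom here.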

---

It looks like there are some losses on x86 -- `binaryHalfByteDotProductVector` sees a small loss, `binaryHalfByteSquareVector` sees a fairly large loss. I'll investigate a bit.

AMD Ryzen AI 395 (AVX-512) experiment:

Mac M2 (128-bit / NEON) experiment:

---

Re: x86 regression:

---

Latest JMH benchmarks on AWS Graviton3:

..which is a slight improvement from earlier

Baseline with

Candidate with

Baseline search-only, use existing index

Candidate search-only, use existing index

Seems like there's a small improvement with this PR

---

Thanks for the benchmarks + fixes for x86 @mccullocht -- please feel free to push to this branch directly, or I'll try to recreate from your description in a day or so..

---

I ran this JMH benchmark on

```
java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte.*Vector" -p size=1024
```

Baseline:

There is a drop in performance of

Candidate (cherry-pick this PR):

The cherry-pick was successful without conflicts, and performance seems to improve in general -- should we target this for 10.5?

---

Thanks @mccullocht, could you help with a sanity-check JMH benchmark and

---

x86 benchmarks: `VectorUtilBenchmark.binaryHalfByteSquareVector` is quite a bit slower, but I also don't know under what conditions we'd actually run this. I suspect we don't do a ton of distance comparisons using two unpacked int4 vectors.

---

1M cohere 1024d vectors, dot_product. Looks like about 5%, roughly the same as on Graviton 3. Targeting 10.5 SGTM.
This worried me, so I attempted to change it back to operating on shorts (but still avoiding conversions).

JMH benchmarks on AWS Graviton3 somehow improved further:

---

I got my hands on a temporary

Baseline:

This PR (latest changes):

@mccullocht sorry for another iteration, but could you verify performance on your

---

As of `ac60bae`:

`lscpu` output for completeness:

---

Thanks @mccullocht! Looks like performance is equivalent or better in all functions (except small dip in

I'll merge + backport this to 10.x in a day or so!
Spinoff from #15736, where @mccullocht identified vector conversions as a potentially slow area (thanks!).
This PR loads and operates on SIMD registers of the preferred bit size to avoid intermediate conversions.
This was sparked from #15697 where we observed a performance drop in JMH benchmarks of some 4-bit vector computations after initial warmup. It's possible that this issue only affects ARM machines.
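As a rough scalar illustration of the conversions being avoided (a hypothetical sketch, not the Lucene implementation): since int4 values fit in 0..15, both nibbles of each packed byte can be extracted with a mask and a shift and multiplied directly, widening only at the accumulator rather than converting whole vectors to a wider element type first.

```java
// Hypothetical scalar sketch of an int4 dot product over two nibble-packed
// vectors: values stay packed until extraction and are widened only when
// accumulated, analogous to avoiding intermediate vector element conversions.
public class Int4DotProductSketch {
  public static int dotProduct(byte[] a, byte[] b) {
    int acc = 0;
    for (int i = 0; i < a.length; i++) {
      int aLo = a[i] & 0x0F, aHi = (a[i] >> 4) & 0x0F;
      int bLo = b[i] & 0x0F, bHi = (b[i] >> 4) & 0x0F;
      acc += aLo * bLo + aHi * bHi; // widen at the accumulator only
    }
    return acc;
  }
}
```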
Ran JMH benchmarks on an AWS Graviton3 host using:
```
java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte.*Vector" -p size=1024
```

Baseline:
This PR: