Neon MVP by hildebrandmw · Pull Request #777 · microsoft/DiskANN

hildebrandmw · 2026-02-16T05:49:26Z

Adds a (mostly) complete AArch64 Neon backend to diskann-wide and wires it through diskann-vector, diskann-quantization, and diskann-benchmark-simd.

This PR has existed in a largely completed state for quite a while now - but as usual the last 10% takes a considerable amount of work. So here it is.

`diskann-wide` — Neon backend

Neon implementations for all SIMD types matching the existing x86_64 (V3/V4) backends:

16 register types across 64-bit and 128-bit widths: u8x8, i8x8, f32x2, u8x16, i8x16, u16x8,
i16x8, u32x4, i32x4, f32x4, u64x2, i64x2, f16x4, f16x8.
Doubled types (f32x8, f32x16, u8x32, i8x32, i32x8, etc.) via the existing Doubled machinery.
Masks: move_mask, from_mask, and optimized keep_first for all 8 mask widths.
Arithmetic: Add, Sub, Mul, FMA, Abs, MinMax.
Comparisons: Full SIMDPartialEq and SIMDPartialOrd.
Bit operations: Not, And, Or, Xor, Shr, Shl (with Miri fallbacks for variable shifts).
Dot products: i16×i16→i32, u8×i8→i32, i8×u8→i32 using vdotq_s32 (requires +dotprod).
Reductions: sum_tree via pairwise addition (vpaddq).
Conversions: f16↔f32 (lossless and cast), u8→i16, i8→i16, i32→f32, split/join for all appropriate types.
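
As a reading aid for the dot-product entry above, here is a scalar reference model of a 4-way widening dot product of the kind `vdotq_s32` performs: each `i32` lane accumulates the products of four adjacent 8-bit pairs. The function name `dot4_accumulate` is hypothetical and does not come from the PR; this is only a sketch of the lane semantics, not the backend's implementation.

```rust
/// Scalar model of a 4-way widening dot product over 16 signed bytes.
/// Lane `i` of the result is `acc[i]` plus the sum of the four products
/// `a[4*i + j] * b[4*i + j]` for `j` in `0..4`.
/// `dot4_accumulate` is a hypothetical name used for illustration only.
fn dot4_accumulate(acc: [i32; 4], a: [i8; 16], b: [i8; 16]) -> [i32; 4] {
    let mut out = acc;
    for lane in 0..4 {
        for j in 0..4 {
            let idx = 4 * lane + j;
            // Widen to i32 before multiplying so the 8-bit product cannot overflow.
            let product = i32::from(a[idx]) * i32::from(b[idx]);
            // Accumulate with wrapping arithmetic, as a hardware accumulator would.
            out[lane] = out[lane].wrapping_add(product);
        }
    }
    out
}
```

Because every output lane consumes four input bytes, one 128-bit instruction handles 16 multiply-accumulates, which is why the kernels depend on `+dotprod`.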

Optimized load_simd_first (algorithms/load_first.rs):

Rather than falling back to scalar Emulated element-by-element loads, partial loads use Neon-native primitives:

≤8 bytes: GPR-only overlapping reads — no SIMD instructions needed.
8–16 bytes: Two overlapping vld1_u8 loads combined with vqtbl1q_u8 (TBL shuffle). Includes a Miri shim since Miri does not support vqtbl1q_u8.
32-bit / 64-bit element types: Simple if-else chains using vld1_lane / vcombine.
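
The "overlapping reads" idea for the ≤8 byte case can be sketched in safe scalar Rust. The sketch below loads the first `n` bytes of a slice into a `u64` (little-endian, zero-filled) using at most two fixed-width reads that may overlap, instead of a byte-by-byte loop. The name `load_first_bytes` and the exact case split are assumptions made for illustration; the PR's actual primitives live in `algorithms/load_first.rs` and are not reproduced here.

```rust
/// Load the first `n` bytes (`n <= 8`) of `src` into the low bytes of a `u64`,
/// little-endian, with the remaining bytes zero. Never reads past `src[n - 1]`.
/// `load_first_bytes` is a hypothetical helper, not the PR's function.
fn load_first_bytes(src: &[u8], n: usize) -> u64 {
    assert!(n <= 8 && n <= src.len());
    let s = &src[..n];
    match n {
        0 => 0,
        1 => u64::from(s[0]),
        2..=3 => {
            // Two 16-bit reads: one at the start, one ending at the last byte.
            let lo = u64::from(u16::from_le_bytes([s[0], s[1]]));
            let hi = u64::from(u16::from_le_bytes([s[n - 2], s[n - 1]]));
            // Overlapping bytes carry identical values, so OR-ing is safe.
            lo | (hi << (8 * (n - 2)))
        }
        4..=7 => {
            // Two 32-bit reads that overlap in the middle.
            let lo = u64::from(u32::from_le_bytes([s[0], s[1], s[2], s[3]]));
            let hi = u64::from(u32::from_le_bytes([s[n - 4], s[n - 3], s[n - 2], s[n - 1]]));
            lo | (hi << (8 * (n - 4)))
        }
        _ => u64::from_le_bytes([s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7]]),
    }
}
```

The second read is shifted so that its last byte lands at position `n - 1`; the region where the two reads overlap holds the same data in both, so no masking is needed.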

The aarch64_define_loadstore! macro accepts a $load_first function, and f16x4/f16x8 delegate to the u16x4/u16x8 primitives respectively.

Doubled types implement load_simd_first / store_simd_first branchlessly by passing the full count to the first half and first.saturating_sub(HALF) to the second.
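
The composition described above can be modelled with plain arrays. In this sketch, `HALF`, `load_first_half`, and `load_first_doubled` are hypothetical stand-ins for the real vector types and methods, and slices replace the raw pointers the real code presumably uses; the point is only that neither call site branches on which half the count falls into.

```rust
const HALF: usize = 4;

/// Hypothetical half-width partial load: copies the first `min(first, HALF)`
/// elements of `src` and zero-fills the remaining lanes.
fn load_first_half(src: &[f32], first: usize) -> [f32; HALF] {
    let mut out = [0.0f32; HALF];
    let n = first.min(HALF);
    out[..n].copy_from_slice(&src[..n]);
    out
}

/// Doubled partial load built from two half-width loads.
/// The low half receives the full count (it clamps internally); the high half
/// receives `first.saturating_sub(HALF)`, which is zero until the low half is full.
fn load_first_doubled(src: &[f32], first: usize) -> ([f32; HALF], [f32; HALF]) {
    let lo = load_first_half(src, first);
    // With a zero count the high half touches no memory, so an empty slice
    // stands in for the offset pointer when `src` is shorter than HALF.
    let hi_src = src.get(HALF..).unwrap_or(&[]);
    let hi = load_first_half(hi_src, first.saturating_sub(HALF));
    (lo, hi)
}
```

For `first = 6` the low half loads 4 elements and the high half loads 2; for `first = 3` the low half loads 3 and the high half receives a count of 0.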

Test infrastructure:

test_neon() helper with WIDE_TEST_MIN_ARCH env-var support, matching the x86_64 test_arch_number() pattern. Supports "all" / "neon" (panics if unavailable) and "scalar" (skips).
All tests use if let Some(arch) = test_neon() { ... } — graceful skip when Neon is unavailable, hard failure when explicitly requested.
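
A rough model of the selection policy described above, written as a pure function so it can be exercised without touching the process environment. The `Neon` unit struct, `select_neon`, and `test_neon_model` are hypothetical names, and the handling of unrecognized values (treated like an unset variable) is an assumption; only the "all" / "neon" / "scalar" behaviour is taken from the description.

```rust
/// Hypothetical stand-in for the Neon architecture token.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Neon;

/// Decide whether a test should run on Neon.
/// `setting` models the value of `WIDE_TEST_MIN_ARCH`, if any.
fn select_neon(setting: Option<&str>, neon_available: bool) -> Option<Neon> {
    match setting {
        // "scalar": skip Neon tests entirely.
        Some("scalar") => None,
        // "all" / "neon": Neon was explicitly requested, so its absence is a hard failure.
        Some("all") | Some("neon") => {
            assert!(neon_available, "Neon was requested but is not available");
            Some(Neon)
        }
        // Unset (and, as an assumption of this sketch, any other value):
        // use Neon when present, otherwise skip gracefully.
        _ => neon_available.then_some(Neon),
    }
}

/// Model of the helper itself: read the environment variable and apply the policy.
fn test_neon_model() -> Option<Neon> {
    let setting = std::env::var("WIDE_TEST_MIN_ARCH").ok();
    let available = cfg!(all(target_arch = "aarch64", target_feature = "neon"));
    select_neon(setting.as_deref(), available)
}
```

A test body then takes the form `if let Some(arch) = test_neon_model() { /* run with arch */ }`, which silently does nothing when `None` is returned.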

`diskann-vector` — Neon distance kernels

14 SIMDSchema implementations covering:

L2, InnerProduct, Cosine for f32, f16, u8, i8.
L1Norm for f32 and f16.
All use scalar epilogues (SIMD epilogues deferred pending Arm64 benchmarking).
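
To illustrate what a "scalar epilogue" means here, the sketch below computes a squared L2 distance with a 4-lane main loop and a scalar loop over the leftover elements. A plain `[f32; 4]` accumulator stands in for an `f32x4` register, and the function name `l2_squared` is hypothetical; the real kernels are expressed through `SIMDSchema` and are not reproduced here.

```rust
/// Squared L2 distance with a 4-wide main loop and a scalar epilogue.
/// The `[f32; LANES]` accumulator models a SIMD register.
fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    const LANES: usize = 4;

    let mut acc = [0.0f32; LANES];
    let mut chunks_a = a.chunks_exact(LANES);
    let mut chunks_b = b.chunks_exact(LANES);

    // Main loop: one full "vector" of LANES elements per iteration.
    for (xa, xb) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        for i in 0..LANES {
            let d = xa[i] - xb[i];
            acc[i] += d * d;
        }
    }

    // Horizontal reduction of the accumulator, pairwise.
    let mut sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    // Scalar epilogue: the remaining `len % LANES` elements, one at a time.
    for (x, y) in chunks_a.remainder().iter().zip(chunks_b.remainder()) {
        let d = x - y;
        sum += d * d;
    }
    sum
}
```

A SIMD epilogue would replace the final loop with a single partial load (for example via `load_simd_first`) followed by one more vector iteration.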

`diskann-quantization`

Neon Hadamard transform impl (delegates to scalar via retarget()).
Bit distances almost universally target the scalar architecture as well via retarget().
Neon test paths for bit-slice distances (1–8 bit), bit-transpose distances, and full distances.

`diskann-benchmark-simd`

Neon kernel registrations for f32, f16, u8, and i8.
Refactored per-architecture DispatchRule impls into a match_arch! macro.
Improved dispatch scoring for better mismatch diagnostics.
Added test-aarch64.json and architecture-aware integration test selection.

Other changes

.cargo/config.toml: Enables +neon,+dotprod for aarch64 targets.
.github/workflows/ci.yml: Added aarch64-unknown-linux-gnu to cross-compilation targets.
diskann-providers: Relaxed a PQ distance test tolerance (6e-7 → 6.3e-7) to account for the different floating-point association used by the Neon implementations.
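
The tolerance change follows from floating-point addition not being associative: summing the same lanes in a different order can round differently. The deliberately extreme example below contrasts a left-to-right sum with a pairwise sum over four `f32` values. Both function names are hypothetical, and the inputs are chosen to exaggerate the effect; the discrepancy in the PQ test is far smaller.

```rust
/// Left-to-right summation: ((x0 + x1) + x2) + x3.
fn sum_sequential(x: &[f32; 4]) -> f32 {
    ((x[0] + x[1]) + x[2]) + x[3]
}

/// Pairwise (tree) summation: (x0 + x1) + (x2 + x3).
fn sum_pairwise(x: &[f32; 4]) -> f32 {
    (x[0] + x[1]) + (x[2] + x[3])
}

/// Returns both results for an input where the orders disagree.
/// Near 1e8 adjacent `f32` values are 8 apart, so adding 1.0 is lost to rounding.
fn association_demo() -> (f32, f32) {
    let x = [1.0e8f32, 1.0, -1.0e8, 1.0];
    (sum_sequential(&x), sum_pairwise(&x))
}
```

With these inputs the sequential order keeps the trailing `1.0` while the pairwise order loses both small terms, so the two results differ even though the inputs are identical.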

Design decisions

Compile-time architecture gating. The Neon backend uses a compile-time token rather than runtime feature detection. Neon is mandatory on AArch64. Runtime dispatch can be added later if needed.
+dotprod required. Needed for vdotq in dot-product kernels. This excludes pre-2018 cores but should cover mainstream server and desktop targets (Graviton 2+, Apple M1+, Ampere Altra). ARMv8.4+ mandates it.
Scalar epilogues in diskann-vector. The SIMD epilogues could use load_simd_first for a potential win on i8/u8 cosine where the masked load cost is amortized across multiple operations, but real Arm64 benchmarking is needed first.
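
A minimal sketch of what compile-time gating can look like, assuming a zero-sized token whose constructor is decided entirely by `cfg!`. The `NeonToken` type and `new_checked` are hypothetical and are not the PR's `Neon` type; the sketch only shows that no runtime CPU query is involved.

```rust
/// Hypothetical zero-sized proof that Neon and dotprod code may be executed.
/// The private field prevents construction outside `new_checked`.
#[derive(Debug, Clone, Copy)]
pub struct NeonToken(());

impl NeonToken {
    /// Returns a token only when the crate was compiled for AArch64 with
    /// `+neon,+dotprod`. `cfg!` expands to a constant, so the unused branch
    /// is removed at compile time and no runtime detection takes place.
    pub const fn new_checked() -> Option<Self> {
        if cfg!(all(
            target_arch = "aarch64",
            target_feature = "neon",
            target_feature = "dotprod"
        )) {
            Some(NeonToken(()))
        } else {
            None
        }
    }
}
```

With the rustflags from `.cargo/config.toml` in effect, the constructor always succeeds on AArch64 builds; on other targets it always returns `None`.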

Suggested reviewing order

diskann-wide/src/arch/aarch64/mod.rs — Architecture definition, Neon token, dispatch, test_neon().
diskann-wide/src/arch/aarch64/macros.rs — The macro infrastructure that all type files build on.
diskann-wide/src/arch/aarch64/masks.rs — Mask representations and operations (move_mask, from_mask, keep_first).
diskann-wide/src/arch/aarch64/algorithms/load_first.rs — Optimized partial load primitives. Read bottom-up: impl functions first, then wrappers.
One representative type file (e.g., f32x4_.rs for 128-bit float, or i32x4_.rs for dot products) — the rest are structurally identical.
diskann-wide/src/arch/aarch64/double.rs and diskann-wide/src/doubled.rs — Doubled types and branchless partial load/store.
diskann-vector/src/distance/simd.rs — Neon distance kernels.
diskann-benchmark-simd/src/lib.rs — match_arch! refactor and Neon registration.
diskann-quantization/ — Neon test paths (mechanical).

codecov-commenter · 2026-02-16T06:37:50Z

Codecov Report

❌ Patch coverage is 85.37736% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.98%. Comparing base (7cd231a) to head (890bce4).

Files with missing lines	Patch %	Lines
diskann-benchmark-simd/src/lib.rs	47.05%	18 Missing ⚠️
diskann-vector/src/distance/simd.rs	83.33%	4 Missing ⚠️
diskann-wide/src/test_utils/dot_product.rs	94.93%	4 Missing ⚠️
diskann-benchmark-simd/src/bin.rs	50.00%	3 Missing ⚠️
diskann-vector/src/conversion.rs	66.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #777      +/-   ##
==========================================
- Coverage   89.00%   88.98%   -0.02%     
==========================================
  Files         428      428              
  Lines       78417    78563     +146     
==========================================
+ Hits        69795    69913     +118     
- Misses       8622     8650      +28

Flag	Coverage Δ
miri	`88.98% <85.37%> (-0.02%)`	⬇️
unittests	`88.98% <85.37%> (-0.02%)`	⬇️


Files with missing lines	Coverage Δ
diskann-providers/src/model/pq/distance/dynamic.rs	`86.40% <ø> (ø)`
diskann-quantization/src/algorithms/hadamard.rs	`97.94% <ø> (ø)`
diskann-quantization/src/bits/distances.rs	`91.49% <100.00%> (+0.05%)`	⬆️
diskann-quantization/src/spherical/iface.rs	`92.90% <ø> (+0.32%)`	⬆️
diskann-vector/src/distance/distance_provider.rs	`100.00% <ø> (ø)`
diskann-vector/src/distance/implementations.rs	`95.93% <ø> (-0.46%)`	⬇️
diskann-wide/src/arch/emulated/mod.rs	`100.00% <100.00%> (ø)`
diskann-wide/src/arch/mod.rs	`83.79% <ø> (ø)`
diskann-wide/src/doubled.rs	`86.72% <100.00%> (+0.02%)`	⬆️
diskann-wide/src/emulated.rs	`95.20% <100.00%> (+0.59%)`	⬆️
... and 8 more

... and 3 files with indirect coverage changes


hildebrandmw · 2026-02-16T21:05:31Z

Codecov Report

❌ Patch coverage is 84.11215% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.99%. Comparing base (7cd231a) to head (c18da52).

Files with missing lines	Patch %	Lines
diskann-benchmark-simd/src/lib.rs	47.05%	18 Missing ⚠️
diskann-vector/src/distance/simd.rs	83.33%	4 Missing ⚠️
diskann-wide/src/test_utils/dot_product.rs	94.93%	4 Missing ⚠️
diskann-benchmark-simd/src/bin.rs	50.00%	3 Missing ⚠️
diskann-vector/src/distance/implementations.rs	0.00%	3 Missing ⚠️
diskann-vector/src/conversion.rs	66.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #777      +/-   ##
==========================================
- Coverage   89.00%   88.99%   -0.02%     
==========================================
  Files         428      428              
  Lines       78417    78565     +148     
==========================================
+ Hits        69795    69917     +122     
- Misses       8622     8648      +26

Flag	Coverage Δ
miri	`88.99% <84.11%> (-0.02%)`	⬇️
unittests	`88.99% <84.11%> (-0.02%)`	⬇️

Files with missing lines	Coverage Δ
diskann-providers/src/model/pq/distance/dynamic.rs	`86.40% <ø> (ø)`
diskann-quantization/src/algorithms/hadamard.rs	`97.94% <ø> (ø)`
diskann-quantization/src/bits/distances.rs	`91.49% <100.00%> (+0.05%)`	⬆️
diskann-quantization/src/spherical/iface.rs	`92.90% <ø> (+0.32%)`	⬆️
diskann-vector/src/distance/distance_provider.rs	`100.00% <ø> (ø)`
diskann-wide/src/arch/mod.rs	`83.79% <ø> (ø)`
diskann-wide/src/doubled.rs	`86.72% <100.00%> (+0.02%)`	⬆️
diskann-wide/src/emulated.rs	`95.20% <100.00%> (+0.59%)`	⬆️
diskann-wide/src/helpers.rs	`100.00% <ø> (ø)`
diskann-wide/src/lib.rs	`86.66% <ø> (ø)`
... and 7 more

... and 1 file with indirect coverage changes


I can see that coverage on non-x86-64 architectures is going to be fun to deal with ...


Copilot

Pull request overview

This PR adds an AArch64 Neon SIMD backend to diskann-wide and wires it into higher-level crates (diskann-vector, diskann-quantization, diskann-benchmark-simd) so Arm64 builds can use the same wide-SIMD abstractions and dispatch patterns as existing x86_64 backends.

Changes:

Add diskann-wide::arch::aarch64 with Neon architecture token, masks, load/store (incl. optimized partial loads), and Neon implementations for core SIMD register types + doubled types.
Integrate Neon into distance dispatch/specialization (diskann-vector), conversions (diskann-vector), and quantization distance dispatch/retargeting (diskann-quantization).
Register Neon kernels and improve architecture dispatch diagnostics in diskann-benchmark-simd, plus enable Arm64 CI and default +neon,+dotprod rustflags for AArch64.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.


File	Description
diskann-wide/tests/dispatch.rs	Extends dispatch test coverage to AArch64 Neon and refactors inner product loop into a shared helper.
diskann-wide/src/test_utils/ops.rs	Removes x86-only gating so SplitJoin test helpers/macros can be reused by Neon tests.
diskann-wide/src/test_utils/dot_product.rs	Adds additional expected-dot implementations + expands test coverage for new dot-product combinations.
diskann-wide/src/lib.rs	Broadens test-arch env var support to include AArch64 and adjusts internal module gating.
diskann-wide/src/helpers.rs	Extends conversion macro support and tightens cfg gating for x86-only shift helpers.
diskann-wide/src/emulated.rs	Adds missing emulated dot-product impls and corresponding tests.
diskann-wide/src/doubled.rs	Adds `load_simd_first`/`store_simd_first` for doubled vectors and inlines some doubled-mask ops.
diskann-wide/src/arch/mod.rs	Adds AArch64 `aarch64` module and dispatch plumbing; adjusts x86 module cfg gating.
diskann-wide/src/arch/emulated/mod.rs	Adds a Level sanity check for Scalar in tests.
diskann-wide/src/arch/aarch64/mod.rs	Defines Neon architecture token, dispatch helpers, Current selection, and test gating (`test_neon`).
diskann-wide/src/arch/aarch64/macros.rs	Macro infrastructure for defining Neon SIMDVector types, bitops, comparisons, splat, and split/join.
diskann-wide/src/arch/aarch64/masks.rs	Implements Neon mask representations + `move_mask`/`from_mask`/`keep_first` for multiple lane widths.
diskann-wide/src/arch/aarch64/algorithms/mod.rs	AArch64 algorithm module root (partial loads).
diskann-wide/src/arch/aarch64/algorithms/load_first.rs	Optimized Neon partial-load primitives used by `load_simd_first`.
diskann-wide/src/arch/aarch64/u8x8_.rs	Neon `u8x8` implementation + tests.
diskann-wide/src/arch/aarch64/u8x16_.rs	Neon `u8x16` implementation + tests + split/join.
diskann-wide/src/arch/aarch64/u16x8_.rs	Neon `u16x8` implementation + tests.
diskann-wide/src/arch/aarch64/u32x4_.rs	Neon `u32x4` implementation + `udot` dot-product + reductions/select + tests.
diskann-wide/src/arch/aarch64/u64x2_.rs	Neon `u64x2` implementation with emulated ops where intrinsics are missing + tests.
diskann-wide/src/arch/aarch64/i8x8_.rs	Neon `i8x8` implementation + tests.
diskann-wide/src/arch/aarch64/i8x16_.rs	Neon `i8x16` implementation + tests + split/join.
diskann-wide/src/arch/aarch64/i16x8_.rs	Neon `i16x8` implementation + widening conversions + tests.
diskann-wide/src/arch/aarch64/i32x4_.rs	Neon `i32x4` implementation + dot products (incl. `sdot`) + conversions + tests.
diskann-wide/src/arch/aarch64/i64x2_.rs	Neon `i64x2` implementation with emulated ops + tests.
diskann-wide/src/arch/aarch64/f32x2_.rs	Neon `f32x2` implementation + tests.
diskann-wide/src/arch/aarch64/f32x4_.rs	Neon `f32x4` implementation incl. f16<->f32 via asm + reductions/select/minmax + tests.
diskann-wide/src/arch/aarch64/f16x4_.rs	Neon `f16x4` representation + load/store + tests.
diskann-wide/src/arch/aarch64/f16x8_.rs	Neon `f16x8` representation + load/store + split/join + tests.
diskann-wide/src/arch/aarch64/double.rs	Defines doubled and double-doubled Neon vector types and conversions + tests.
diskann-wide/compile-aarch64-on-x86.sh	Helper script to cross-compile tests for aarch64 with required target features.
diskann-vector/src/distance/implementations.rs	Makes fixed-dimension specialization available beyond x86_64 (for Neon too).
diskann-vector/src/distance/distance_provider.rs	Adds Neon specialization lists and makes specialization machinery available on AArch64.
diskann-vector/src/conversion.rs	Adds Neon SIMD slice conversion paths for f16<->f32 and broadens SIMD convert helpers.
diskann-quantization/src/spherical/iface.rs	Adds Neon dispatch mapping for spherical quantization compute paths.
diskann-quantization/src/bits/distances.rs	Adds Neon retargeting + expands tests to exercise Neon paths where available.
diskann-quantization/src/algorithms/hadamard.rs	Adds Neon implementation that retargets to scalar, plus Neon test inclusion.
diskann-providers/src/model/pq/distance/dynamic.rs	Adjusts PQ distance test tolerance for floating-point association differences.
diskann-benchmark-simd/src/lib.rs	Registers Neon kernels, refactors dispatch rules into `match_arch!`, improves mismatch diagnostics/scoring.
diskann-benchmark-simd/src/bin.rs	Selects architecture-specific integration test input (x86_64 vs aarch64).
.github/workflows/ci.yml	Adds Arm64 runner (`ubuntu-24.04-arm`) to CI matrices.
.cargo/config.toml	Enables `-C target-feature=+neon,+dotprod` by default on AArch64 targets.


Copilot · 2026-02-18T17:16:27Z

diskann-vector/src/distance/implementations.rs

@@ -70,11 +70,9 @@ macro_rules! architecture_hook {
 }

 /// A utility for specializing distance computatiosn for fixed-length slices.


Spelling: “computatiosn” should be “computations”.

Suggested change

/// A utility for specializing distance computatiosn for fixed-length slices.

/// A utility for specializing distance computations for fixed-length slices.

Copilot · 2026-02-18T17:16:27Z

diskann-wide/src/arch/aarch64/mod.rs

+use crate::{
+    Architecture, SIMDVector,
+    arch::{
+        self, AddLifetime, Dispatched1, Dispatched2, Dispatched3, FTarget1, FTarget2, FTarget3,
+        Hidden, Scalar, Target, Target1, Target2, Target3,
+    },
+};


Architecture and SIMDVector are imported here but never used in this module, which will fail CI due to -Dwarnings. Please remove them or use them where intended.

Copilot · 2026-02-18T17:16:28Z

diskann-wide/src/arch/aarch64/macros.rs

+            fn to_array(self) -> [$scalar; $lanes] {
+                // SAFETY: Provided the scalar type is an integer or floating point,
+                // then all bit patterns are valid between source and destination types.
+                // (provided an x86 intrinsic is one of the transmuted types).
+                //


The safety comment in to_array/from_array mentions “x86 intrinsic”, but this is the AArch64/Neon backend. Please update the wording to be architecture-agnostic (or refer to AArch64) to avoid misleading documentation.

Mark Hildebrand added 4 commits February 15, 2026 21:47

Add a Neon backend.

c636a90

Address Clippy.

e4cd335

Enable neon and dotprod

5e8b87a

Fix benchmark-simd

ff0c5a4

Mark Hildebrand added 5 commits February 16, 2026 10:36

Checkpoint.

6a26bd6

Wrapping up!.

8f3aedc

Fix typo.

076a501

Here we gooo!

b762f79

Disable inclusion of x86_64 module when building rust-doc.

c18da52

hildebrandmw changed the title ~~Neon MVP.~~ Neon MVP Feb 16, 2026

backurs approved these changes Feb 16, 2026


Mark Hildebrand added 8 commits February 16, 2026 15:13

Fix an oops.

9ce00a7

Test something.

ac3d251

Try one thing.

53f9e94

Hmmm.

c47c5aa

Wrap up tests.

50752f2

Revert "Test something."

417ee07

This reverts commit ac3d251.

Revert "Hmmm."

4f251da

This reverts commit c47c5aa.

Fix docs.

890bce4

hildebrandmw marked this pull request as ready for review February 18, 2026 17:10

hildebrandmw requested review from a team and Copilot February 18, 2026 17:10

Copilot started reviewing on behalf of hildebrandmw February 18, 2026 17:11

Copilot AI reviewed Feb 18, 2026

hildebrandmw wants to merge 17 commits into main from mhildebr/neon

3 participants
