Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #777 +/- ##
==========================================
- Coverage 89.00% 88.98% -0.02%
==========================================
Files 428 428
Lines 78417 78563 +146
==========================================
+ Hits 69795 69913 +118
- Misses 8622 8650 +28
I can see that coverage on non-x86-64 architectures is going to be fun to deal with ...
Pull request overview
This PR adds an AArch64 Neon SIMD backend to diskann-wide and wires it into higher-level crates (diskann-vector, diskann-quantization, diskann-benchmark-simd) so Arm64 builds can use the same wide-SIMD abstractions and dispatch patterns as existing x86_64 backends.
Changes:
- Add diskann-wide::arch::aarch64 with Neon architecture token, masks, load/store (incl. optimized partial loads), and Neon implementations for core SIMD register types + doubled types.
- Integrate Neon into distance dispatch/specialization (diskann-vector), conversions (diskann-vector), and quantization distance dispatch/retargeting (diskann-quantization).
- Register Neon kernels and improve architecture dispatch diagnostics in diskann-benchmark-simd, plus enable Arm64 CI and default +neon,+dotprod rustflags for AArch64.
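A minimal sketch of what a compile-time architecture token could look like, assuming the crate is built with the `+neon,+dotprod` rustflags mentioned above. The constructor name `new_checked` and the private-field layout are assumptions made for illustration and are not taken from the PR.

```rust
/// Hypothetical stand-in for the PR's `Neon` architecture token.
/// The private unit field means the token can only be built through
/// `new_checked`, so holding a `Neon` value acts as proof that the
/// required target features were enabled at compile time.
#[derive(Clone, Copy, Debug)]
pub struct Neon(());

impl Neon {
    /// Returns the token only when this crate was compiled for AArch64
    /// with both `neon` and `dotprod` enabled. No runtime CPU detection
    /// is performed; the check is resolved entirely at compile time.
    pub fn new_checked() -> Option<Self> {
        const ENABLED: bool = cfg!(all(
            target_arch = "aarch64",
            target_feature = "neon",
            target_feature = "dotprod"
        ));
        if ENABLED {
            Some(Neon(()))
        } else {
            None
        }
    }
}
```

On targets other than AArch64, or when the features are not enabled, `Neon::new_checked()` in this sketch simply returns `None`, so callers can fall back to another backend.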
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| diskann-wide/tests/dispatch.rs | Extends dispatch test coverage to AArch64 Neon and refactors inner product loop into a shared helper. |
| diskann-wide/src/test_utils/ops.rs | Removes x86-only gating so SplitJoin test helpers/macros can be reused by Neon tests. |
| diskann-wide/src/test_utils/dot_product.rs | Adds additional expected-dot implementations + expands test coverage for new dot-product combinations. |
| diskann-wide/src/lib.rs | Broadens test-arch env var support to include AArch64 and adjusts internal module gating. |
| diskann-wide/src/helpers.rs | Extends conversion macro support and tightens cfg gating for x86-only shift helpers. |
| diskann-wide/src/emulated.rs | Adds missing emulated dot-product impls and corresponding tests. |
| diskann-wide/src/doubled.rs | Adds load_simd_first/store_simd_first for doubled vectors and inlines some doubled-mask ops. |
| diskann-wide/src/arch/mod.rs | Adds AArch64 aarch64 module and dispatch plumbing; adjusts x86 module cfg gating. |
| diskann-wide/src/arch/emulated/mod.rs | Adds a Level sanity check for Scalar in tests. |
| diskann-wide/src/arch/aarch64/mod.rs | Defines Neon architecture token, dispatch helpers, Current selection, and test gating (test_neon). |
| diskann-wide/src/arch/aarch64/macros.rs | Macro infrastructure for defining Neon SIMDVector types, bitops, comparisons, splat, and split/join. |
| diskann-wide/src/arch/aarch64/masks.rs | Implements Neon mask representations + move_mask/from_mask/keep_first for multiple lane widths. |
| diskann-wide/src/arch/aarch64/algorithms/mod.rs | AArch64 algorithm module root (partial loads). |
| diskann-wide/src/arch/aarch64/algorithms/load_first.rs | Optimized Neon partial-load primitives used by load_simd_first. |
| diskann-wide/src/arch/aarch64/u8x8_.rs | Neon u8x8 implementation + tests. |
| diskann-wide/src/arch/aarch64/u8x16_.rs | Neon u8x16 implementation + tests + split/join. |
| diskann-wide/src/arch/aarch64/u16x8_.rs | Neon u16x8 implementation + tests. |
| diskann-wide/src/arch/aarch64/u32x4_.rs | Neon u32x4 implementation + udot dot-product + reductions/select + tests. |
| diskann-wide/src/arch/aarch64/u64x2_.rs | Neon u64x2 implementation with emulated ops where intrinsics are missing + tests. |
| diskann-wide/src/arch/aarch64/i8x8_.rs | Neon i8x8 implementation + tests. |
| diskann-wide/src/arch/aarch64/i8x16_.rs | Neon i8x16 implementation + tests + split/join. |
| diskann-wide/src/arch/aarch64/i16x8_.rs | Neon i16x8 implementation + widening conversions + tests. |
| diskann-wide/src/arch/aarch64/i32x4_.rs | Neon i32x4 implementation + dot products (incl. sdot) + conversions + tests. |
| diskann-wide/src/arch/aarch64/i64x2_.rs | Neon i64x2 implementation with emulated ops + tests. |
| diskann-wide/src/arch/aarch64/f32x2_.rs | Neon f32x2 implementation + tests. |
| diskann-wide/src/arch/aarch64/f32x4_.rs | Neon f32x4 implementation incl. f16<->f32 via asm + reductions/select/minmax + tests. |
| diskann-wide/src/arch/aarch64/f16x4_.rs | Neon f16x4 representation + load/store + tests. |
| diskann-wide/src/arch/aarch64/f16x8_.rs | Neon f16x8 representation + load/store + split/join + tests. |
| diskann-wide/src/arch/aarch64/double.rs | Defines doubled and double-doubled Neon vector types and conversions + tests. |
| diskann-wide/compile-aarch64-on-x86.sh | Helper script to cross-compile tests for aarch64 with required target features. |
| diskann-vector/src/distance/implementations.rs | Makes fixed-dimension specialization available beyond x86_64 (for Neon too). |
| diskann-vector/src/distance/distance_provider.rs | Adds Neon specialization lists and makes specialization machinery available on AArch64. |
| diskann-vector/src/conversion.rs | Adds Neon SIMD slice conversion paths for f16<->f32 and broadens SIMD convert helpers. |
| diskann-quantization/src/spherical/iface.rs | Adds Neon dispatch mapping for spherical quantization compute paths. |
| diskann-quantization/src/bits/distances.rs | Adds Neon retargeting + expands tests to exercise Neon paths where available. |
| diskann-quantization/src/algorithms/hadamard.rs | Adds Neon implementation that retargets to scalar, plus Neon test inclusion. |
| diskann-providers/src/model/pq/distance/dynamic.rs | Adjusts PQ distance test tolerance for floating-point association differences. |
| diskann-benchmark-simd/src/lib.rs | Registers Neon kernels, refactors dispatch rules into match_arch!, improves mismatch diagnostics/scoring. |
| diskann-benchmark-simd/src/bin.rs | Selects architecture-specific integration test input (x86_64 vs aarch64). |
| .github/workflows/ci.yml | Adds Arm64 runner (ubuntu-24.04-arm) to CI matrices. |
| .cargo/config.toml | Enables -C target-feature=+neon,+dotprod by default on AArch64 targets. |
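
The test gating listed for diskann-wide/src/arch/aarch64/mod.rs (test_neon) can be sketched roughly as below. The split into a pure helper `test_neon_with`, the unit-struct `Neon`, and the behavior for unset or unrecognized values are assumptions of this sketch; the PR description only states that "all"/"neon" panic when Neon is unavailable and "scalar" skips.

```rust
/// Hypothetical zero-sized stand-in for the PR's `Neon` token.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Neon;

/// Pure decision logic, separated from the environment for testability.
/// `min_arch` is the value of `WIDE_TEST_MIN_ARCH` (if set) and
/// `available` says whether Neon can be used in this build.
pub fn test_neon_with(min_arch: Option<&str>, available: bool) -> Option<Neon> {
    match min_arch {
        // Explicit request: a missing backend is a hard failure.
        Some("all") | Some("neon") => {
            assert!(available, "Neon was requested but is not available");
            Some(Neon)
        }
        // Explicitly restricted to scalar: skip Neon tests.
        Some("scalar") => None,
        // Assumption: anything else runs Neon tests only when available.
        _ => available.then_some(Neon),
    }
}

/// Reads the environment and compile-time configuration.
pub fn test_neon() -> Option<Neon> {
    let var = std::env::var("WIDE_TEST_MIN_ARCH").ok();
    let available = cfg!(all(target_arch = "aarch64", target_feature = "neon"));
    test_neon_with(var.as_deref(), available)
}
```

A test body would then be wrapped as `if let Some(arch) = test_neon() { ... }`, which skips quietly when Neon is unavailable and fails loudly only when it was explicitly requested.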
@@ -70,11 +70,9 @@ macro_rules! architecture_hook {
}

/// A utility for specializing distance computatiosn for fixed-length slices.
Spelling: “computatiosn” should be “computations”.
- /// A utility for specializing distance computatiosn for fixed-length slices.
+ /// A utility for specializing distance computations for fixed-length slices.
use crate::{
    Architecture, SIMDVector,
    arch::{
        self, AddLifetime, Dispatched1, Dispatched2, Dispatched3, FTarget1, FTarget2, FTarget3,
        Hidden, Scalar, Target, Target1, Target2, Target3,
    },
};
Architecture and SIMDVector are imported here but never used in this module, which will fail CI due to -Dwarnings. Please remove them or use them where intended.
fn to_array(self) -> [$scalar; $lanes] {
    // SAFETY: Provided the scalar type is an integer or floating point,
    // then all bit patterns are valid between source and destination types.
    // (provided an x86 intrinsic is one of the transmuted types).
    //
The safety comment in to_array/from_array mentions “x86 intrinsic”, but this is the AArch64/Neon backend. Please update the wording to be architecture-agnostic (or refer to AArch64) to avoid misleading documentation.
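
One possible architecture-agnostic wording, shown on a small portable sketch. `Reg128` is a hypothetical stand-in for a 128-bit vendor register type (on AArch64 this role would be played by a Neon vector type), and `U32x4` is an illustrative wrapper rather than the PR's macro-generated type.

```rust
use core::mem::transmute;

/// Hypothetical stand-in for a 128-bit vendor SIMD register type.
/// It is 16 bytes with 16-byte alignment and has no padding.
#[derive(Clone, Copy)]
#[repr(C, align(16))]
pub struct Reg128([u8; 16]);

/// Illustrative 4-lane wrapper around the register stand-in.
#[derive(Clone, Copy)]
pub struct U32x4(Reg128);

impl U32x4 {
    pub fn from_array(x: [u32; 4]) -> Self {
        // SAFETY: The source and destination have the same size, and
        // because the scalar type is an integer or floating point type,
        // every bit pattern is valid for both. One side of the transmute
        // is the target architecture's SIMD register type.
        Self(unsafe { transmute::<[u32; 4], Reg128>(x) })
    }

    pub fn to_array(self) -> [u32; 4] {
        // SAFETY: Same reasoning as `from_array`, in the other direction.
        unsafe { transmute::<Reg128, [u32; 4]>(self.0) }
    }
}
```

The comment refers to "the target architecture's SIMD register type" instead of naming x86, so the same text can be shared by the x86_64 and AArch64 macros.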
Adds a (mostly) complete AArch64 Neon backend to diskann-wide and wires it through diskann-vector, diskann-quantization, and diskann-benchmark-simd.

This PR has existed in a largely completed state for quite a while now, but as usual the last 10% takes a considerable amount of work. So here it is.

diskann-wide: Neon backend

Neon implementations for all SIMD types matching the existing x86_64 (V3/V4) backends:

- u8x8, i8x8, f32x2, u8x16, i8x16, u16x8, i16x8, u32x4, i32x4, f32x4, u64x2, i64x2, f16x4, f16x8.
- Doubled types (f32x8, f32x16, u8x32, i8x32, i32x8, etc.) via the existing Doubled machinery.
- move_mask, from_mask, and optimized keep_first for all 8 mask widths.
- Add, Sub, Mul, FMA, Abs, MinMax.
- SIMDPartialEq and SIMDPartialOrd.
- Not, And, Or, Xor, Shr, Shl (with Miri fallbacks for variable shifts).
- i16×i16→i32, u8×i8→i32, i8×u8→i32 using vdotq_s32 (requires +dotprod).
- sum_tree via pairwise addition (vpaddq).
- f16↔f32 (lossless and cast), u8→i16, i8→i16, i32→f32, split/join for all appropriate types.

Optimized load_simd_first (algorithms/load_first.rs):

Rather than falling back to scalar Emulated element-by-element loads, partial loads use Neon-native primitives:

- vld1_u8 loads combined with vqtbl1q_u8 (TBL shuffle). Includes a Miri shim since Miri does not support vqtbl1q_u8.
- vld1_lane/vcombine.

The aarch64_define_loadstore! macro accepts a $load_first function, and f16x4/f16x8 delegate to the u16x4/u16x8 primitives respectively.

Doubled types implement load_simd_first/store_simd_first branchlessly by passing the full count to the first half and first.saturating_sub(HALF) to the second.

Test infrastructure:

- test_neon() helper with WIDE_TEST_MIN_ARCH env-var support, matching the x86_64 test_arch_number() pattern. Supports "all"/"neon" (panics if unavailable) and "scalar" (skips).
- if let Some(arch) = test_neon() { ... }: graceful skip when Neon is unavailable, hard failure when explicitly requested.

diskann-vector: Neon distance kernels

14 SIMDSchema implementations covering:

- f32, f16, u8, i8.
- f32 and f16.

diskann-quantization

- retarget()).
- retarget().

diskann-benchmark-simd

- DispatchRule impls into a match_arch! macro.
- test-aarch64.json and architecture-aware integration test selection.

Other changes

- .cargo/config.toml: Enables +neon,+dotprod for aarch64 targets.
- .github/workflows/ci.yml: Added aarch64-unknown-linux-gnu to cross-compilation targets.
- diskann-providers: Relaxed a PQ distance test tolerance (6e-7 → 6.3e-7) for the different floating point association used by the Neon implementations.

Design decisions

- The Neon backend uses a compile-time token rather than runtime feature detection. Neon is mandatory on AArch64. Runtime dispatch can be added later if needed.
- +dotprod required. Needed for vdotq in dot-product kernels. This excludes pre-2018 cores but should cover mainstream server and desktop targets (Graviton 2+, Apple M1+, Ampere Altra). ARMv8.4+ mandates it.
- diskann-vector. The SIMD epilogues could use load_simd_first for a potential win on i8/u8 cosine where the masked load cost is amortized across multiple operations, but real Arm64 benchmarking is needed first.

Suggested reviewing order

1. diskann-wide/src/arch/aarch64/mod.rs: Architecture definition, Neon token, dispatch, test_neon().
2. diskann-wide/src/arch/aarch64/macros.rs: The macro infrastructure that all type files build on.
3. diskann-wide/src/arch/aarch64/masks.rs: Mask representations and operations (move_mask, from_mask, keep_first).
4. diskann-wide/src/arch/aarch64/algorithms/load_first.rs: Optimized partial load primitives. Read bottom-up: impl functions first, then wrappers.
5. One type file (f32x4_.rs for 128-bit float, or i32x4_.rs for dot products); the rest are structurally identical.
6. diskann-wide/src/arch/aarch64/double.rs and diskann-wide/src/doubled.rs: Doubled types and branchless partial load/store.
7. diskann-vector/src/distance/simd.rs: Neon distance kernels.
8. diskann-benchmark-simd/src/lib.rs: match_arch! refactor and Neon registration.
9. diskann-quantization/: Neon test paths (mechanical).
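
The count-splitting idea behind the branchless partial load for Doubled types can be illustrated with a scalar sketch. Plain arrays stand in for the half-width registers, and the function names, the zero-fill behavior, and the slice-based signature are assumptions of this sketch rather than the PR's actual API.

```rust
/// Lane count of the hypothetical half-width register.
const HALF: usize = 4;

/// Scalar stand-in for a half-width partial load: copies at most `first`
/// elements (clamped to `HALF` and to the slice length) and zero-fills
/// the remaining lanes. Zero-fill is an assumption of this sketch.
fn load_first_half(src: &[f32], first: usize) -> [f32; HALF] {
    let mut out = [0.0; HALF];
    let n = first.min(HALF).min(src.len());
    out[..n].copy_from_slice(&src[..n]);
    out
}

/// Doubled partial load: the low half receives the full count and the
/// high half receives `first.saturating_sub(HALF)`, so no branch on
/// `first <= HALF` is needed at this level.
fn load_first_doubled(src: &[f32], first: usize) -> ([f32; HALF], [f32; HALF]) {
    let lo = load_first_half(src, first);
    // Elements past the first HALF, or an empty slice if there are none.
    let hi_src = src.get(HALF..).unwrap_or(&[]);
    let hi = load_first_half(hi_src, first.saturating_sub(HALF));
    (lo, hi)
}
```

With `first = 6`, the low half loads four elements and the high half loads two; with `first = 3`, the high half receives a count of zero and comes back all zeros.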