Optimize `take_fixed_size_binary` For Predefined Value Lengths by tobixdev · Pull Request #9535 · apache/arrow-rs

tobixdev · 2026-03-11T12:03:53Z

Which issue does this PR close?

Related to Performance improvements for take #279

Rationale for this change

The take kernel is very important for many operations (e.g., HashJoin in DataFusion IIRC). Currently, there is a gap between the performance of the take kernel for primitive arrays (e.g., DataType::UInt32) and fixed size binary arrays of the same length (e.g., FixedSizeBinary<4>).

In our case this led to a performance reduction when moving from an integer-based id column to a fixed-size-binary-based id column. This PR aims to address parts of this gap.

The 16-bytes case would especially benefit operations on UUID columns.

What changes are included in this PR?

Add take_fixed_size, which can be called for a set of predefined FSB lengths that we want to support. This is a "flat buffer" version of the take_native kernel.
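To illustrate what a "flat buffer" take with a compile-time value length can look like, here is a minimal sketch using only the standard library. It is not the code from this PR: the name `take_fixed_size_sketch`, the `u32` index type, and the `Vec<u8>` output are assumptions made for illustration, and null handling is left out entirely.

```rust
/// Hypothetical sketch: gathers `N`-byte values out of a flat byte buffer.
/// `values` holds the entries back to back, so entry `i` occupies
/// bytes `i * N .. (i + 1) * N`.
fn take_fixed_size_sketch<const N: usize>(values: &[u8], indices: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() * N);
    for &idx in indices {
        let start = idx as usize * N;
        // Because N is a compile-time constant, the compiler knows the exact
        // size of every copy. Slicing panics if the index is out of bounds.
        let chunk: &[u8; N] = values[start..start + N].try_into().unwrap();
        out.extend_from_slice(chunk);
    }
    out
}
```

Calling `take_fixed_size_sketch::<4>(&bytes, &[2, 0])` on a buffer of 4-byte entries returns entry 2 followed by entry 0. The point of the const parameter is that each copy has a fixed, known width, which is what a primitive kernel gets for free from its element type.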

Are these changes tested?

I've added another test that still exercises the non-optimized code path.

Are there any user-facing changes?

No

tobixdev · 2026-03-11T12:08:46Z

run benchmark take_kernels

alamb-ghbot · 2026-03-11T12:08:56Z

🤖 Hi @tobixdev, thanks for the request (#9535 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

tobixdev · 2026-03-11T12:10:31Z

I assumed that it would not work but it was worth a shot. Would appreciate someone running the benchmark. I saw approximately -80% on my machine.

alamb · 2026-03-11T18:44:22Z

run benchmark take_kernels

alamb-ghbot · 2026-03-11T18:44:31Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

alamb-ghbot · 2026-03-11T18:55:07Z

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1328.7±5.30ns        ? ?/sec    1.00   1328.3±2.79ns        ? ?/sec
take bool 512                                                             1.00    728.2±8.58ns        ? ?/sec    1.00    727.1±2.58ns        ? ?/sec
take bool null indices 1024                                               1.00  1113.3±13.46ns        ? ?/sec    1.11  1232.8±24.14ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.1±0.07µs        ? ?/sec    1.42      2.9±0.04µs        ? ?/sec
take check bounds i32 1024                                                1.10   930.9±68.94ns        ? ?/sec    1.00    844.0±4.14ns        ? ?/sec
take check bounds i32 512                                                 1.00    515.4±4.77ns        ? ?/sec    1.14    588.5±3.92ns        ? ?/sec
take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec
take i32 512                                                              1.00    441.8±1.44ns        ? ?/sec    1.00    442.8±1.33ns        ? ?/sec
take i32 null indices 1024                                                1.00    995.6±2.39ns        ? ?/sec    1.02  1015.2±126.82ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.02µs        ? ?/sec    1.00      2.0±0.03µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.9±0.21µs        ? ?/sec    1.00     20.8±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.2±0.06µs        ? ?/sec    1.01     11.3±0.16µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.02      5.5±0.06µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.04µs        ? ?/sec    1.00      7.9±0.05µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.01µs        ? ?/sec    1.00      3.8±0.03µs        ? ?/sec
take str null values 1024                                                 1.00      8.7±0.03µs        ? ?/sec    1.00      8.6±0.07µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.03µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
take stringview 1024                                                      1.00    809.3±8.35ns        ? ?/sec    1.10    893.3±7.40ns        ? ?/sec
take stringview 512                                                       1.00    579.9±4.53ns        ? ?/sec    1.01    587.5±8.20ns        ? ?/sec
take stringview null indices 1024                                         1.00  1440.7±21.77ns        ? ?/sec    1.01   1451.7±4.98ns        ? ?/sec
take stringview null indices 512                                          1.00   718.1±19.57ns        ? ?/sec    1.11    800.5±0.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.01µs        ? ?/sec    1.06      2.4±0.04µs        ? ?/sec

tobixdev · 2026-03-11T19:13:38Z

take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec

It's not quite 80% but still significant. Maybe this difference is due to Aarch64 vs. x86 (my PC).
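For reference, the ratio column of the benchmark table can be converted into a runtime reduction. With a slowdown ratio $r$ of main relative to this branch, the reduction is $1 - 1/r$:

$$
1 - \frac{1}{3.33} \approx 0.70, \qquad 1 - \frac{1}{2.01} \approx 0.50
$$

So the bot run shows roughly a 70% reduction without nulls and roughly 50% with null values. An 80% reduction would correspond to a ratio of 5.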

take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec

We're also close to the primitive kernel! Parts of the gap could be explained by the entry size (4 vs. 12 bytes).

alamb · 2026-03-11T20:29:39Z

run benchmark take_kernels

alamb · 2026-03-11T20:29:50Z

Will run once more to ensure we can reproduce the results. They look good.

alamb-ghbot · 2026-03-11T20:29:51Z

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

alamb · 2026-03-11T20:30:48Z

arrow-select/src/take.rs

-        }
-    }
+    let result_buffer = match size_usize {
+        1 => take_fixed_size::<IndexType, 1>(values.values(), indices),


as I read this it results in 7 copies of the code which is probably ok here but we do have to be careful in general to avoid too much bloat

Yes, that's definitely a drawback (for compile time and binary size).
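The trade-off discussed here can be sketched as follows. Each `match` arm instantiates its own copy of the const-generic function, which is where the extra code comes from. This is a standard-library-only illustration, not the PR's code: the names `take_const`, `take_dynamic`, and `take_dispatch` are hypothetical, and the set of sizes in the `match` (1, 4, 16) is an assumption chosen for the example rather than the set used in the PR.

```rust
/// Specialized path: `N` is known at compile time, so every instantiation
/// is compiled separately (one copy of this function per value of `N`).
fn take_const<const N: usize>(values: &[u8], indices: &[u32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() * N);
    for &i in indices {
        let start = i as usize * N;
        out.extend_from_slice(&values[start..start + N]);
    }
    out
}

/// Generic fallback: the value length is only known at runtime.
fn take_dynamic(values: &[u8], indices: &[u32], size: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(indices.len() * size);
    for &i in indices {
        let start = i as usize * size;
        out.extend_from_slice(&values[start..start + size]);
    }
    out
}

/// Maps a runtime size onto a compile-time constant where one exists.
/// The arms below are illustrative only.
fn take_dispatch(values: &[u8], indices: &[u32], size: usize) -> Vec<u8> {
    match size {
        1 => take_const::<1>(values, indices),
        4 => take_const::<4>(values, indices),
        16 => take_const::<16>(values, indices),
        // Any other length still works, just without specialization.
        _ => take_dynamic(values, indices, size),
    }
}
```

Every additional arm adds one more monomorphized copy to the binary, so the list of supported lengths is a balance between covering common cases and keeping compile time and binary size in check.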

alamb-ghbot · 2026-03-11T20:40:09Z

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1327.8±4.95ns        ? ?/sec    1.00  1329.2±12.95ns        ? ?/sec
take bool 512                                                             1.00    726.9±3.22ns        ? ?/sec    1.00   727.1±14.49ns        ? ?/sec
take bool null indices 1024                                               1.00   1097.8±5.99ns        ? ?/sec    1.11  1215.4±31.46ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00  1977.1±21.64ns        ? ?/sec    1.48      2.9±0.05µs        ? ?/sec
take check bounds i32 1024                                                1.08    917.4±7.58ns        ? ?/sec    1.00   846.9±12.66ns        ? ?/sec
take check bounds i32 512                                                 1.00   516.9±21.67ns        ? ?/sec    1.14    587.1±1.14ns        ? ?/sec
take i32 1024                                                             1.00    712.5±1.52ns        ? ?/sec    1.01    717.9±2.41ns        ? ?/sec
take i32 512                                                              1.00    441.9±0.76ns        ? ?/sec    1.01    444.9±9.14ns        ? ?/sec
take i32 null indices 1024                                                1.00    993.5±2.18ns        ? ?/sec    1.00    993.2±2.50ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     21.0±0.24µs        ? ?/sec    1.00     21.0±0.10µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.09µs        ? ?/sec    1.00     11.2±0.11µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.02      5.5±0.09µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.20µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
take str null values 1024                                                 1.00      8.6±0.13µs        ? ?/sec    1.00      8.6±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.9±0.06µs        ? ?/sec    1.02      7.0±0.09µs        ? ?/sec
take stringview 1024                                                      1.00    811.9±1.46ns        ? ?/sec    1.09   887.2±12.82ns        ? ?/sec
take stringview 512                                                       1.00    582.0±7.36ns        ? ?/sec    1.01    588.3±2.38ns        ? ?/sec
take stringview null indices 1024                                         1.00  1436.1±22.29ns        ? ?/sec    1.01  1449.0±24.56ns        ? ?/sec
take stringview null indices 512                                          1.00    727.5±2.41ns        ? ?/sec    1.10    801.0±4.78ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.05µs        ? ?/sec    1.00      2.1±0.02µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.02µs        ? ?/sec    1.05      2.4±0.03µs        ? ?/sec

alamb · 2026-03-11T21:59:38Z

take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec

🚀

tobixdev added 2 commits March 11, 2026 12:49

Optimize take_fixed_size_binary for a set of predefined value lengths

601c63a

Cleanup

7aee3d9

github-actions bot added the arrow Changes to the arrow crate label Mar 11, 2026

alamb reviewed Mar 11, 2026


3 participants
