Skip to content

Optimize take_fixed_size_binary For Predefined Value Lengths#9535

Open
tobixdev wants to merge 2 commits intoapache:mainfrom
tobixdev:improve-take-fsb
Open

Optimize take_fixed_size_binary For Predefined Value Lengths#9535
tobixdev wants to merge 2 commits intoapache:mainfrom
tobixdev:improve-take-fsb

Conversation

@tobixdev
Copy link
Contributor

@tobixdev tobixdev commented Mar 11, 2026

Which issue does this PR close?

Rationale for this change

The take kernel is very important for many operations (e.g., HashJoin in DataFusion IIRC). Currently, there is a gap between the performance of the take kernel for primitive arrays (e.g., DataType::UInt32) and fixed size binary arrays of the same length (e.g., FixedSizeBinary<4>).

In our case this lead to a performance reduction when moving from an integer-based id column to a fixed-size-binary-based id column. This PR aims to address parts of this gap.

The 16-bytes case would especially benefit operations on UUID columns.

What changes are included in this PR?

  • Add take_fixed_size that can be called for set of predefined fsb-lengths that we want to support. This is a "flat buffer" version of the take_native kernel.

Are these changes tested?

I've added another test that still exercises the non-optimized code path.

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 11, 2026
@tobixdev
Copy link
Contributor Author

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 Hi @tobixdev, thanks for the request (#9535 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

@tobixdev
Copy link
Contributor Author

I assumed that it would not work but it was worth a shot. Would appreciate someone running the benchmark. I saw approximately -80% on my machine.

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

run benchmark take_kernels

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1328.7±5.30ns        ? ?/sec    1.00   1328.3±2.79ns        ? ?/sec
take bool 512                                                             1.00    728.2±8.58ns        ? ?/sec    1.00    727.1±2.58ns        ? ?/sec
take bool null indices 1024                                               1.00  1113.3±13.46ns        ? ?/sec    1.11  1232.8±24.14ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00      2.1±0.07µs        ? ?/sec    1.42      2.9±0.04µs        ? ?/sec
take check bounds i32 1024                                                1.10   930.9±68.94ns        ? ?/sec    1.00    844.0±4.14ns        ? ?/sec
take check bounds i32 512                                                 1.00    515.4±4.77ns        ? ?/sec    1.14    588.5±3.92ns        ? ?/sec
take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec
take i32 512                                                              1.00    441.8±1.44ns        ? ?/sec    1.00    442.8±1.33ns        ? ?/sec
take i32 null indices 1024                                                1.00    995.6±2.39ns        ? ?/sec    1.02  1015.2±126.82ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.02µs        ? ?/sec    1.00      2.0±0.03µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.08µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     20.9±0.21µs        ? ?/sec    1.00     20.8±0.09µs        ? ?/sec
take str 1024                                                             1.00     11.2±0.06µs        ? ?/sec    1.01     11.3±0.16µs        ? ?/sec
take str 512                                                              1.00      5.4±0.03µs        ? ?/sec    1.02      5.5±0.06µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.04µs        ? ?/sec    1.00      7.9±0.05µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.01µs        ? ?/sec    1.00      3.8±0.03µs        ? ?/sec
take str null values 1024                                                 1.00      8.7±0.03µs        ? ?/sec    1.00      8.6±0.07µs        ? ?/sec
take str null values null indices 1024                                    1.00      7.0±0.03µs        ? ?/sec    1.00      7.0±0.08µs        ? ?/sec
take stringview 1024                                                      1.00    809.3±8.35ns        ? ?/sec    1.10    893.3±7.40ns        ? ?/sec
take stringview 512                                                       1.00    579.9±4.53ns        ? ?/sec    1.01    587.5±8.20ns        ? ?/sec
take stringview null indices 1024                                         1.00  1440.7±21.77ns        ? ?/sec    1.01   1451.7±4.98ns        ? ?/sec
take stringview null indices 512                                          1.00   718.1±19.57ns        ? ?/sec    1.11    800.5±0.89ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.00µs        ? ?/sec    1.00      2.1±0.00µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.01µs        ? ?/sec    1.06      2.4±0.04µs        ? ?/sec

@tobixdev
Copy link
Contributor Author

tobixdev commented Mar 11, 2026

take primitive fsb value len: 12, indices: 1024                           1.00  1057.2±10.66ns        ? ?/sec    3.33      3.5±0.17µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.03µs        ? ?/sec    2.01      4.9±0.08µs        ? ?/sec

It's not quite 80% but still significant. Maybe this difference is due to Aarch64 vs. x86 (my PC).

take i32 1024                                                             1.00    712.3±2.60ns        ? ?/sec    1.01    716.9±6.95ns        ? ?/sec

We're also close to the primitive kernel! Parts of the gap could be explained by the entry size (4 vs. 12 bytes).

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

run benchmark take_kernels

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

Will run once more to ensure we can reproduce the results. They look good

@alamb-ghbot
Copy link

🤖 ./gh_compare_arrow.sh gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing improve-take-fsb (7aee3d9) to 33aed33 diff
BENCH_NAME=take_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench take_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=improve-take-fsb
Results will be posted here when complete

}
}
let result_buffer = match size_usize {
1 => take_fixed_size::<IndexType, 1>(values.values(), indices),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I read this it results in 7 copies of the code which is probably ok here but we do have to be careful in general to avoid too much bloat

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's definetely a drawback (for compile time and binary size).

@alamb-ghbot
Copy link

🤖: Benchmark completed

Details

group                                                                     improve-take-fsb                       main
-----                                                                     ----------------                       ----
take bool 1024                                                            1.00   1327.8±4.95ns        ? ?/sec    1.00  1329.2±12.95ns        ? ?/sec
take bool 512                                                             1.00    726.9±3.22ns        ? ?/sec    1.00   727.1±14.49ns        ? ?/sec
take bool null indices 1024                                               1.00   1097.8±5.99ns        ? ?/sec    1.11  1215.4±31.46ns        ? ?/sec
take bool null values 1024                                                1.00      2.6±0.02µs        ? ?/sec    1.00      2.6±0.01µs        ? ?/sec
take bool null values null indices 1024                                   1.00  1977.1±21.64ns        ? ?/sec    1.48      2.9±0.05µs        ? ?/sec
take check bounds i32 1024                                                1.08    917.4±7.58ns        ? ?/sec    1.00   846.9±12.66ns        ? ?/sec
take check bounds i32 512                                                 1.00   516.9±21.67ns        ? ?/sec    1.14    587.1±1.14ns        ? ?/sec
take i32 1024                                                             1.00    712.5±1.52ns        ? ?/sec    1.01    717.9±2.41ns        ? ?/sec
take i32 512                                                              1.00    441.9±0.76ns        ? ?/sec    1.01    444.9±9.14ns        ? ?/sec
take i32 null indices 1024                                                1.00    993.5±2.18ns        ? ?/sec    1.00    993.2±2.50ns        ? ?/sec
take i32 null values 1024                                                 1.01      2.0±0.00µs        ? ?/sec    1.00      2.0±0.00µs        ? ?/sec
take i32 null values null indices 1024                                    1.00      2.1±0.02µs        ? ?/sec    1.05      2.2±0.02µs        ? ?/sec
take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec
take primitive run logical len: 1024, physical len: 512, indices: 1024    1.00     21.0±0.24µs        ? ?/sec    1.00     21.0±0.10µs        ? ?/sec
take str 1024                                                             1.01     11.3±0.09µs        ? ?/sec    1.00     11.2±0.11µs        ? ?/sec
take str 512                                                              1.00      5.4±0.02µs        ? ?/sec    1.02      5.5±0.09µs        ? ?/sec
take str null indices 1024                                                1.00      7.9±0.20µs        ? ?/sec    1.00      7.8±0.02µs        ? ?/sec
take str null indices 512                                                 1.00      3.8±0.02µs        ? ?/sec    1.00      3.8±0.01µs        ? ?/sec
take str null values 1024                                                 1.00      8.6±0.13µs        ? ?/sec    1.00      8.6±0.03µs        ? ?/sec
take str null values null indices 1024                                    1.00      6.9±0.06µs        ? ?/sec    1.02      7.0±0.09µs        ? ?/sec
take stringview 1024                                                      1.00    811.9±1.46ns        ? ?/sec    1.09   887.2±12.82ns        ? ?/sec
take stringview 512                                                       1.00    582.0±7.36ns        ? ?/sec    1.01    588.3±2.38ns        ? ?/sec
take stringview null indices 1024                                         1.00  1436.1±22.29ns        ? ?/sec    1.01  1449.0±24.56ns        ? ?/sec
take stringview null indices 512                                          1.00    727.5±2.41ns        ? ?/sec    1.10    801.0±4.78ns        ? ?/sec
take stringview null values 1024                                          1.00      2.1±0.05µs        ? ?/sec    1.00      2.1±0.02µs        ? ?/sec
take stringview null values null indices 1024                             1.00      2.3±0.02µs        ? ?/sec    1.05      2.4±0.03µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Mar 11, 2026

take primitive fsb value len: 12, indices: 1024                           1.00   1056.9±6.85ns        ? ?/sec    3.28      3.5±0.02µs        ? ?/sec
take primitive fsb value len: 12, null values, indices: 1024              1.00      2.4±0.04µs        ? ?/sec    1.99      4.9±0.01µs        ? ?/sec

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants