perf: Reuse page buffers across data pages in column writer #9521

Open
asuresh8 wants to merge 4 commits into apache:main from asuresh8:reuse-page-buffers

Conversation


@asuresh8 asuresh8 commented Mar 6, 2026

Which issue does this PR close?

N/A - Performance optimization

Rationale

GenericColumnWriter::add_data_page() allocates fresh Vec<u8> buffers on every page flush: one for page assembly (buffer) and one for compression output (compressed_buf). For workloads that produce many pages (e.g., TPC-H SF1 lineitem: 17 columns x 49 row groups x ~70 pages/column = ~58K pages), this creates significant allocator overhead and cache pollution from touching fresh memory on every page.

Additionally, the V1 compression path calls compressed_buf.shrink_to_fit() immediately before consuming the buffer, which triggers an unnecessary realloc+copy that is immediately discarded when the buffer is converted to Bytes.

What changes are included in this PR?

Add two persistent Vec<u8> fields (page_buf and compressed_buf) to GenericColumnWriter that are cleared and reused across data pages instead of allocating fresh buffers per page.

Changes:

  1. Add page_buf: Vec<u8> and compressed_buf: Vec<u8> fields to GenericColumnWriter
  2. In add_data_page(), replace let mut buffer = vec![] with self.page_buf.clear() and reuse
  3. Replace let mut compressed_buf = Vec::with_capacity(...) with self.compressed_buf.clear() and reuse
  4. Use Bytes::copy_from_slice() to create the final Bytes from the reused buffer (since Bytes::from(Vec<u8>) would consume the allocation)
  5. Remove shrink_to_fit() after compression (counterproductive with buffer reuse)
  6. Update memory_size() to include the capacity of the reusable buffers (partial improvement to an already-approximate metric)
  7. Add 3 multi-page roundtrip tests with compression (V1, V2, nullable) that validate the .clear() invariant across page boundaries

Not changed:

  • Dictionary page code (only ~17 allocations total per workload, not worth the diff)
  • LevelEncoder buffer allocation (small buffers, would require API changes)
  • Any public API

Trade-off: Bytes::copy_from_slice adds one memcpy per page vs the previous zero-copy Bytes::from(Vec<u8>). However, the eliminated allocations (2-3 per page), removed shrink_to_fit realloc, and cache warming from reused buffers provide a net benefit. The copy operates on the smaller compressed data when compression is enabled.

Are these changes tested?

Yes. All 88 existing tests pass unchanged. 3 new tests added:

  • test_multi_page_roundtrip_with_compression - V1 path, 5 pages, Snappy
  • test_multi_page_roundtrip_with_compression_v2 - V2 path, 5 pages, Snappy
  • test_multi_page_roundtrip_nullable_with_compression - V1 path with nullable columns, 5 pages, Snappy

These tests write enough data to trigger multiple data pages (via set_data_page_row_count_limit(1000) with 5000 values), then read back and verify all values match. This validates that .clear() properly prevents stale data leakage between pages.

Are there any user-facing changes?

No. All changes are to private fields and internal methods.

Benchmark Results

cargo bench -p parquet --bench arrow_writer on Apple M-series:

Category    Improved (>2%)    Regressed (>2%)    Neutral
Count       13                0                  42

Aggregate: -0.76% (excluding float_with_nans noise, confirmed by re-run)

Notable improvements:

  • primitive_non_null/zstd_parquet_2: -4.8%
  • primitive_non_null/bloom_filter: -4.0%
  • bool_non_null/zstd_parquet_2: -3.4%
  • string/zstd_parquet_2: -3.4%
  • string/parquet_2: -3.0%
  • string_dictionary/default: -2.7%

No confirmed regressions (initial float_with_nans regression was measurement noise, confirmed by subsequent run with p > 0.05).

Add persistent page_buf and compressed_buf fields to GenericColumnWriter.
Instead of allocating fresh Vec<u8> per page (~3 allocations per page),
clear and reuse the existing buffers. This eliminates allocator overhead
and improves cache locality for the compression path.

The shrink_to_fit() call after compression is also removed since the
buffer is now reused (shrinking would be counterproductive).

Changes:
- Add page_buf and compressed_buf fields to GenericColumnWriter
- Reuse buffers via .clear() + extend in both V1 and V2 data page paths
- Use Bytes::copy_from_slice() to create final page Bytes from reused buffer
- Remove shrink_to_fit() after compression (subsumes ARS-4)
- Update memory_size() to include reusable buffer capacity
- Add 3 multi-page roundtrip tests with compression (V1, V2, nullable)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the parquet (Changes to the parquet crate) label Mar 6, 2026
Validates that buffer reuse does not regress on the memory bloat
scenario from issue apache#8526 (highly compressible data producing ~1MB
uncompressed / ~20KB compressed pages). Memory stays bounded at
O(1) buffers regardless of page count.

Three new tests:
- test_memory_bounded_with_highly_compressible_data: 100 pages of
  1024-byte repeated strings with Snappy compression (V1 path)
- test_memory_bounded_with_highly_compressible_data_v2: Same for V2
- test_dictionary_fallback_memory_bounded: High cardinality data
  triggers dict fallback, then compressible data continues writing

The shrink_to_fit() removal (ARS-4) is safe because buffers are
reused (cleared each page) rather than accumulated. Each CompressedPage
gets a right-sized Bytes::copy_from_slice, so queued pages hold only
the compressed data, not the full uncompressed buffer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Dandandan
Contributor

run benchmark arrow_writer

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reuse-page-buffers (ad766a4) to 5ba4515 diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=reuse-page-buffers
Results will be posted here when complete

@apache apache deleted a comment from alamb-ghbot Mar 7, 2026
@alamb-ghbot

🤖: Benchmark completed

Details

group                                     main                                   reuse-page-buffers
-----                                     ----                                   ------------------
bool/bloom_filter                         1.00    121.3±2.40µs     8.7 MB/sec    1.01    122.4±1.70µs     8.7 MB/sec
bool/default                              1.07     55.1±6.73µs    19.3 MB/sec    1.00     51.7±0.16µs    20.5 MB/sec
bool/parquet_2                            1.00     67.7±1.23µs    15.7 MB/sec    1.00     67.6±1.06µs    15.7 MB/sec
bool/zstd                                 1.04     65.9±1.16µs    16.1 MB/sec    1.00     63.4±0.98µs    16.7 MB/sec
bool/zstd_parquet_2                       1.01     79.2±2.10µs    13.4 MB/sec    1.00     78.3±1.00µs    13.5 MB/sec
bool_non_null/bloom_filter                1.00     98.4±1.10µs     5.8 MB/sec    1.01     99.4±0.78µs     5.8 MB/sec
bool_non_null/default                     1.12     21.9±0.43µs    26.1 MB/sec    1.00     19.6±0.24µs    29.2 MB/sec
bool_non_null/parquet_2                   1.13     39.4±0.19µs    14.5 MB/sec    1.00     34.8±0.17µs    16.4 MB/sec
bool_non_null/zstd                        1.09     31.6±0.34µs    18.1 MB/sec    1.00     29.1±0.16µs    19.7 MB/sec
bool_non_null/zstd_parquet_2              1.11     49.9±0.43µs    11.5 MB/sec    1.00     45.1±0.18µs    12.7 MB/sec
float_with_nans/bloom_filter              1.00   858.1±19.72µs    64.1 MB/sec    1.00    858.8±6.65µs    64.0 MB/sec
float_with_nans/default                   1.00    502.8±5.79µs   109.3 MB/sec    1.00    504.8±1.57µs   108.9 MB/sec
float_with_nans/parquet_2                 1.00   740.7±13.53µs    74.2 MB/sec    1.00    744.2±5.90µs    73.9 MB/sec
float_with_nans/zstd                      1.00    668.7±2.16µs    82.2 MB/sec    1.00    669.8±4.60µs    82.1 MB/sec
float_with_nans/zstd_parquet_2            1.00    911.2±7.90µs    60.3 MB/sec    1.00    915.2±4.40µs    60.1 MB/sec
list_primitive/bloom_filter               1.02      2.4±0.05ms   877.7 MB/sec    1.00      2.4±0.02ms   892.2 MB/sec
list_primitive/default                    1.00  1697.0±10.40µs  1256.3 MB/sec    1.01  1709.7±30.12µs  1247.0 MB/sec
list_primitive/parquet_2                  1.01  1783.6±14.89µs  1195.3 MB/sec    1.00  1762.7±18.00µs  1209.5 MB/sec
list_primitive/zstd                       1.01      2.9±0.02ms   735.6 MB/sec    1.00      2.9±0.01ms   740.7 MB/sec
list_primitive/zstd_parquet_2             1.02      3.0±0.04ms   722.1 MB/sec    1.00      2.9±0.01ms   737.6 MB/sec
list_primitive_non_null/bloom_filter      1.00      2.8±0.03ms   753.7 MB/sec    1.75      5.0±0.09ms   429.5 MB/sec
list_primitive_non_null/default           1.00  1803.8±26.57µs  1179.4 MB/sec    1.79      3.2±0.04ms   659.9 MB/sec
list_primitive_non_null/parquet_2         1.00  1995.1±20.93µs  1066.3 MB/sec    1.79      3.6±0.04ms   596.2 MB/sec
list_primitive_non_null/zstd              1.00      3.9±0.06ms   543.8 MB/sec    1.29      5.0±0.04ms   423.1 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      4.0±0.06ms   537.3 MB/sec    1.36      5.4±0.06ms   395.4 MB/sec
primitive/bloom_filter                    1.00      4.3±0.09ms    40.9 MB/sec    1.01      4.4±0.80ms    40.3 MB/sec
primitive/default                         1.00    756.7±4.08µs   232.5 MB/sec    1.01    760.9±5.80µs   231.2 MB/sec
primitive/parquet_2                       1.08   821.2±74.02µs   214.2 MB/sec    1.00    763.7±6.80µs   230.4 MB/sec
primitive/zstd                            1.00   1065.1±8.02µs   165.2 MB/sec    1.00   1065.0±8.86µs   165.2 MB/sec
primitive/zstd_parquet_2                  1.01   1022.6±2.70µs   172.0 MB/sec    1.00   1011.0±6.76µs   174.0 MB/sec
primitive_non_null/bloom_filter           1.00  1669.5±34.60µs   103.3 MB/sec    1.02  1695.6±51.53µs   101.7 MB/sec
primitive_non_null/default                1.01    610.8±2.19µs   282.4 MB/sec    1.00    605.8±7.46µs   284.8 MB/sec
primitive_non_null/parquet_2              1.01   616.4±10.27µs   279.9 MB/sec    1.00    612.4±6.95µs   281.7 MB/sec
primitive_non_null/zstd                   1.00    892.3±4.99µs   193.3 MB/sec    1.00   895.7±11.30µs   192.6 MB/sec
primitive_non_null/zstd_parquet_2         1.00   897.8±12.48µs   192.2 MB/sec    1.00    897.9±5.05µs   192.1 MB/sec
string/bloom_filter                       1.06  1351.3±21.17µs  1515.6 MB/sec    1.00  1276.0±13.53µs  1605.0 MB/sec
string/default                            1.07    839.1±6.52µs     2.4 GB/sec    1.00    781.0±4.01µs     2.6 GB/sec
string/parquet_2                          1.08    846.2±8.56µs     2.4 GB/sec    1.00    783.5±6.34µs     2.6 GB/sec
string/zstd                               1.02      2.3±0.02ms   907.1 MB/sec    1.00      2.2±0.04ms   922.3 MB/sec
string/zstd_parquet_2                     1.04      2.3±0.05ms   889.0 MB/sec    1.00      2.2±0.07ms   924.9 MB/sec
string_and_binary_view/bloom_filter       1.00    600.1±4.44µs   210.3 MB/sec    1.01    603.9±8.56µs   208.9 MB/sec
string_and_binary_view/default            1.00    363.2±1.92µs   347.4 MB/sec    1.00    362.3±5.44µs   348.3 MB/sec
string_and_binary_view/parquet_2          1.01    366.3±4.06µs   344.5 MB/sec    1.00    364.2±3.58µs   346.5 MB/sec
string_and_binary_view/zstd               1.00   599.3±18.95µs   210.6 MB/sec    1.00    599.8±6.77µs   210.4 MB/sec
string_and_binary_view/zstd_parquet_2     1.00   582.5±13.96µs   216.7 MB/sec    1.01    589.8±7.96µs   213.9 MB/sec
string_dictionary/bloom_filter            1.04    642.3±5.42µs  1606.7 MB/sec    1.00    615.0±6.73µs  1678.2 MB/sec
string_dictionary/default                 1.07    417.5±3.95µs     2.4 GB/sec    1.00    391.9±6.45µs     2.6 GB/sec
string_dictionary/parquet_2               1.08    421.0±3.88µs     2.4 GB/sec    1.00    391.0±3.16µs     2.6 GB/sec
string_dictionary/zstd                    1.03   1135.2±6.37µs   909.1 MB/sec    1.00   1103.0±7.58µs   935.6 MB/sec
string_dictionary/zstd_parquet_2          1.03  1128.4±11.55µs   914.6 MB/sec    1.00  1094.3±15.79µs   943.1 MB/sec
string_non_null/bloom_filter              1.02  1823.1±20.33µs  1122.9 MB/sec    1.00  1787.5±26.75µs  1145.3 MB/sec
string_non_null/default                   1.04  1181.5±15.27µs  1732.6 MB/sec    1.00   1133.6±9.11µs  1805.9 MB/sec
string_non_null/parquet_2                 1.05  1197.9±13.42µs  1708.9 MB/sec    1.00  1142.0±30.77µs  1792.5 MB/sec
string_non_null/zstd                      1.02      3.1±0.02ms   658.2 MB/sec    1.00      3.1±0.03ms   670.1 MB/sec
string_non_null/zstd_parquet_2            1.03      3.2±0.02ms   649.8 MB/sec    1.00      3.0±0.02ms   671.5 MB/sec

@asuresh8
Author

asuresh8 commented Mar 7, 2026

oh no! It looks like the list_primitive_non_null cases have regressed quite a bit. I'll take a further look at what's going on there.

…ssion

The Bytes::copy_from_slice approach for buffer reuse caused a regression
on large uncompressed pages (e.g., ~627KB list pages after dictionary
fallback) due to O(page_size) memcpy.

Fix: Use Bytes::from(std::mem::take(&mut self.page_buf)) for uncompressed
pages (zero-cost ownership transfer) and keep Bytes::copy_from_slice for
compressed pages (smaller data, buffer reuse saves allocation).

For V1: uncompressed path uses mem::take, compressed path copies from
compressed_buf (already a copy, buffer reuse still helps).

For V2: split on is_compressed flag — compressed keeps copy_from_slice
for buffer reuse, uncompressed uses mem::take.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@asuresh8
Author

asuresh8 commented Mar 7, 2026

run benchmark arrow_writer

@alamb-ghbot

🤖 Hi @asuresh8, thanks for the request (#9521 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Jefffrey, Omega359, adriangb, alamb, comphead, etseidl, gabotechs, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

@Dandandan
Contributor

run benchmark arrow_writer

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reuse-page-buffers (0f778f6) to 5ba4515 diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=reuse-page-buffers
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                     main                                   reuse-page-buffers
-----                                     ----                                   ------------------
bool/bloom_filter                         1.02    122.2±0.73µs     8.7 MB/sec    1.00    120.0±1.08µs     8.8 MB/sec
bool/default                              1.02     53.8±0.31µs    19.7 MB/sec    1.00     52.8±0.60µs    20.1 MB/sec
bool/parquet_2                            1.00     68.6±0.28µs    15.5 MB/sec    1.00     68.3±0.78µs    15.5 MB/sec
bool/zstd                                 1.03     66.2±0.56µs    16.0 MB/sec    1.00     64.2±0.59µs    16.5 MB/sec
bool/zstd_parquet_2                       1.02     80.4±4.19µs    13.2 MB/sec    1.00     78.6±0.66µs    13.5 MB/sec
bool_non_null/bloom_filter                1.04    101.4±2.81µs     5.6 MB/sec    1.00     97.4±2.12µs     5.9 MB/sec
bool_non_null/default                     1.15     22.5±0.08µs    25.4 MB/sec    1.00     19.5±0.05µs    29.4 MB/sec
bool_non_null/parquet_2                   1.17     41.0±0.47µs    13.9 MB/sec    1.00     35.2±0.96µs    16.3 MB/sec
bool_non_null/zstd                        1.10     32.0±0.27µs    17.9 MB/sec    1.00     29.1±0.15µs    19.7 MB/sec
bool_non_null/zstd_parquet_2              1.12     50.9±0.29µs    11.3 MB/sec    1.00     45.5±1.79µs    12.6 MB/sec
float_with_nans/bloom_filter              1.00    850.8±6.78µs    64.6 MB/sec    1.01    857.9±9.28µs    64.1 MB/sec
float_with_nans/default                   1.00   496.3±10.55µs   110.7 MB/sec    1.02    504.8±9.64µs   108.9 MB/sec
float_with_nans/parquet_2                 1.00    730.9±2.40µs    75.2 MB/sec    1.02    743.2±6.38µs    74.0 MB/sec
float_with_nans/zstd                      1.00    663.2±4.63µs    82.9 MB/sec    1.01   670.7±14.14µs    81.9 MB/sec
float_with_nans/zstd_parquet_2            1.00   901.4±10.41µs    61.0 MB/sec    1.01    914.9±6.90µs    60.1 MB/sec
list_primitive/bloom_filter               1.01      2.4±0.03ms   889.8 MB/sec    1.00      2.4±0.04ms   898.7 MB/sec
list_primitive/default                    1.00   1689.6±8.84µs  1261.8 MB/sec    1.01  1701.6±34.38µs  1252.9 MB/sec
list_primitive/parquet_2                  1.02  1780.9±13.93µs  1197.1 MB/sec    1.00  1737.7±17.49µs  1226.9 MB/sec
list_primitive/zstd                       1.00      2.9±0.02ms   737.3 MB/sec    1.00      2.9±0.11ms   733.9 MB/sec
list_primitive/zstd_parquet_2             1.01      2.9±0.02ms   724.0 MB/sec    1.00      2.9±0.06ms   730.0 MB/sec
list_primitive_non_null/bloom_filter      1.00      2.8±0.04ms   765.0 MB/sec    1.02      2.8±0.08ms   749.3 MB/sec
list_primitive_non_null/default           1.00  1759.3±13.95µs  1209.2 MB/sec    1.02  1786.5±21.72µs  1190.8 MB/sec
list_primitive_non_null/parquet_2         1.02  1993.9±78.17µs  1066.9 MB/sec    1.00  1945.8±17.81µs  1093.3 MB/sec
list_primitive_non_null/zstd              1.00      3.8±0.03ms   555.0 MB/sec    1.06      4.1±0.06ms   522.8 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      3.9±0.03ms   538.9 MB/sec    1.02      4.0±0.05ms   528.3 MB/sec
primitive/bloom_filter                    2.78      4.6±0.27ms    38.4 MB/sec    1.00  1648.1±20.24µs   106.7 MB/sec
primitive/default                         1.00    757.9±5.60µs   232.1 MB/sec    1.01    766.6±7.74µs   229.5 MB/sec
primitive/parquet_2                       1.07   820.2±73.65µs   214.5 MB/sec    1.00   768.2±14.38µs   229.0 MB/sec
primitive/zstd                            1.01  1064.1±10.31µs   165.3 MB/sec    1.00   1056.0±7.66µs   166.6 MB/sec
primitive/zstd_parquet_2                  1.01   1013.9±7.21µs   173.5 MB/sec    1.00   999.5±15.15µs   176.0 MB/sec
primitive_non_null/bloom_filter           1.02  1643.2±22.97µs   105.0 MB/sec    1.00  1603.3±29.74µs   107.6 MB/sec
primitive_non_null/default                1.01    612.8±4.54µs   281.5 MB/sec    1.00    609.5±4.59µs   283.0 MB/sec
primitive_non_null/parquet_2              1.01   617.8±18.43µs   279.3 MB/sec    1.00    612.3±6.18µs   281.8 MB/sec
primitive_non_null/zstd                   1.00    890.7±7.03µs   193.7 MB/sec    1.00    887.7±9.44µs   194.3 MB/sec
primitive_non_null/zstd_parquet_2         1.02    903.8±7.50µs   190.9 MB/sec    1.00    882.2±3.39µs   195.5 MB/sec
string/bloom_filter                       1.05  1336.7±16.52µs  1532.2 MB/sec    1.00  1271.9±25.88µs  1610.3 MB/sec
string/default                            1.09    836.7±9.17µs     2.4 GB/sec    1.00   769.3±11.90µs     2.6 GB/sec
string/parquet_2                          1.08   836.7±10.03µs     2.4 GB/sec    1.00   776.1±10.74µs     2.6 GB/sec
string/zstd                               1.01      2.2±0.01ms   915.6 MB/sec    1.00      2.2±0.03ms   925.2 MB/sec
string/zstd_parquet_2                     1.04      2.3±0.03ms   892.2 MB/sec    1.00      2.2±0.02ms   928.8 MB/sec
string_and_binary_view/bloom_filter       1.00    597.6±2.63µs   211.2 MB/sec    1.01    601.7±6.01µs   209.7 MB/sec
string_and_binary_view/default            1.00    362.0±1.77µs   348.6 MB/sec    1.00    363.3±3.96µs   347.4 MB/sec
string_and_binary_view/parquet_2          1.01   368.4±10.35µs   342.6 MB/sec    1.00    364.8±1.58µs   345.9 MB/sec
string_and_binary_view/zstd               1.02   601.7±17.57µs   209.7 MB/sec    1.00    590.3±1.65µs   213.8 MB/sec
string_and_binary_view/zstd_parquet_2     1.01    581.2±3.55µs   217.1 MB/sec    1.00    576.8±3.31µs   218.8 MB/sec
string_dictionary/bloom_filter            1.05    642.5±5.12µs  1606.3 MB/sec    1.00    609.3±6.55µs  1693.8 MB/sec
string_dictionary/default                 1.09    418.4±8.67µs     2.4 GB/sec    1.00    382.9±2.46µs     2.6 GB/sec
string_dictionary/parquet_2               1.09    421.2±4.67µs     2.4 GB/sec    1.00    387.2±7.93µs     2.6 GB/sec
string_dictionary/zstd                    1.01   1130.9±7.27µs   912.5 MB/sec    1.00  1123.1±79.58µs   918.9 MB/sec
string_dictionary/zstd_parquet_2          1.02   1125.3±7.22µs   917.1 MB/sec    1.00  1099.6±23.23µs   938.5 MB/sec
string_non_null/bloom_filter              1.04  1830.2±25.99µs  1118.5 MB/sec    1.00  1765.3±22.61µs  1159.7 MB/sec
string_non_null/default                   1.03   1166.2±5.59µs  1755.4 MB/sec    1.00  1132.6±21.91µs  1807.4 MB/sec
string_non_null/parquet_2                 1.07   1202.6±9.53µs  1702.2 MB/sec    1.00  1128.0±10.60µs  1814.8 MB/sec
string_non_null/zstd                      1.02      3.1±0.03ms   653.7 MB/sec    1.00      3.1±0.01ms   668.8 MB/sec
string_non_null/zstd_parquet_2            1.02      3.1±0.02ms   650.4 MB/sec    1.00      3.1±0.04ms   665.0 MB/sec

@Dandandan
Contributor

run benchmark arrow_writer

@alamb-ghbot

🤖 ./gh_compare_arrow.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing reuse-page-buffers (0f778f6) to 5ba4515 diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
BENCH_BRANCH_NAME=reuse-page-buffers
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

group                                     main                                    reuse-page-buffers
-----                                     ----                                    ------------------
bool/bloom_filter                         1.01    121.3±1.16µs     8.7 MB/sec     1.00    120.2±0.62µs     8.8 MB/sec
bool/default                              1.03     54.3±0.75µs    19.5 MB/sec     1.00     52.5±0.24µs    20.2 MB/sec
bool/parquet_2                            1.01     68.9±0.44µs    15.4 MB/sec     1.00     68.0±0.25µs    15.6 MB/sec
bool/zstd                                 1.07     67.7±3.38µs    15.7 MB/sec     1.00     63.6±0.16µs    16.7 MB/sec
bool/zstd_parquet_2                       1.01     79.9±0.37µs    13.3 MB/sec     1.00     79.0±2.63µs    13.4 MB/sec
bool_non_null/bloom_filter                1.04    100.1±0.85µs     5.7 MB/sec     1.00     96.5±0.69µs     5.9 MB/sec
bool_non_null/default                     1.16     22.7±0.22µs    25.2 MB/sec     1.00     19.5±0.06µs    29.3 MB/sec
bool_non_null/parquet_2                   1.18     41.2±0.25µs    13.9 MB/sec     1.00     35.0±0.99µs    16.3 MB/sec
bool_non_null/zstd                        1.09     31.9±0.15µs    18.0 MB/sec     1.00     29.1±0.40µs    19.6 MB/sec
bool_non_null/zstd_parquet_2              1.14     51.2±2.04µs    11.2 MB/sec     1.00     45.0±0.42µs    12.7 MB/sec
float_with_nans/bloom_filter              1.00    858.2±7.08µs    64.0 MB/sec     1.00    859.5±8.48µs    63.9 MB/sec
float_with_nans/default                   1.00    499.1±2.07µs   110.1 MB/sec     1.01    504.4±2.83µs   109.0 MB/sec
float_with_nans/parquet_2                 1.00    732.0±5.50µs    75.1 MB/sec     1.02    744.8±5.76µs    73.8 MB/sec
float_with_nans/zstd                      1.00    663.5±5.49µs    82.8 MB/sec     1.01    669.1±9.49µs    82.1 MB/sec
float_with_nans/zstd_parquet_2            1.00    905.5±8.22µs    60.7 MB/sec     1.01   917.2±11.54µs    59.9 MB/sec
list_primitive/bloom_filter               1.00      2.3±0.02ms   917.6 MB/sec     1.03      2.4±0.01ms   893.1 MB/sec
list_primitive/default                    1.00  1684.2±19.83µs  1265.8 MB/sec     1.01   1694.3±8.81µs  1258.3 MB/sec
list_primitive/parquet_2                  1.00  1721.5±10.64µs  1238.4 MB/sec     1.03   1767.4±7.47µs  1206.3 MB/sec
list_primitive/zstd                       1.00      2.9±0.01ms   744.0 MB/sec     1.03      2.9±0.01ms   723.2 MB/sec
list_primitive/zstd_parquet_2             1.00      2.9±0.01ms   738.0 MB/sec     1.01      2.9±0.01ms   729.2 MB/sec
list_primitive_non_null/bloom_filter      1.00      2.7±0.04ms   781.4 MB/sec     1.00      2.7±0.02ms   780.9 MB/sec
list_primitive_non_null/default           1.00  1792.5±11.30µs  1186.8 MB/sec     1.02  1823.7±19.53µs  1166.5 MB/sec
list_primitive_non_null/parquet_2         1.00  1943.3±19.31µs  1094.7 MB/sec     1.00  1935.2±18.15µs  1099.3 MB/sec
list_primitive_non_null/zstd              1.00      3.8±0.02ms   559.0 MB/sec     1.03      3.9±0.04ms   542.8 MB/sec
list_primitive_non_null/zstd_parquet_2    1.00      3.9±0.04ms   549.3 MB/sec     1.02      3.9±0.02ms   540.5 MB/sec
primitive/bloom_filter                    1.09  1754.5±580.76µs   100.3 MB/sec    1.00  1613.5±18.99µs   109.0 MB/sec
primitive/default                         1.00    755.5±5.36µs   232.9 MB/sec     1.01    762.8±5.01µs   230.6 MB/sec
primitive/parquet_2                       1.00    761.2±3.05µs   231.1 MB/sec     1.01   765.8±13.17µs   229.7 MB/sec
primitive/zstd                            1.00   1057.3±4.82µs   166.4 MB/sec     1.00   1055.6±7.49µs   166.7 MB/sec
primitive/zstd_parquet_2                  1.00   1002.1±5.85µs   175.6 MB/sec     1.01  1011.1±49.11µs   174.0 MB/sec
primitive_non_null/bloom_filter           1.00  1565.8±12.11µs   110.2 MB/sec     1.02  1601.3±34.73µs   107.7 MB/sec
primitive_non_null/default                1.00    603.5±2.73µs   285.8 MB/sec     1.01    610.2±4.92µs   282.7 MB/sec
primitive_non_null/parquet_2              1.00    607.4±6.18µs   284.0 MB/sec     1.01    612.6±3.23µs   281.6 MB/sec
primitive_non_null/zstd                   1.00    885.9±5.22µs   194.7 MB/sec     1.00    888.7±6.27µs   194.1 MB/sec
primitive_non_null/zstd_parquet_2         1.00    887.5±8.79µs   194.4 MB/sec     1.00    885.2±4.98µs   194.9 MB/sec
string/bloom_filter                       1.00  1267.7±12.51µs  1615.7 MB/sec     1.00  1263.9±21.85µs  1620.5 MB/sec
string/default                            1.00    775.6±7.81µs     2.6 GB/sec     1.00    771.8±6.09µs     2.6 GB/sec
string/parquet_2                          1.00    778.8±4.37µs     2.6 GB/sec     1.00    777.1±8.05µs     2.6 GB/sec
string/zstd                               1.00      2.2±0.01ms   932.5 MB/sec     1.01      2.2±0.02ms   921.2 MB/sec
string/zstd_parquet_2                     1.00      2.2±0.01ms   940.1 MB/sec     1.02      2.2±0.02ms   924.1 MB/sec
string_and_binary_view/bloom_filter       1.01    604.1±6.21µs   208.9 MB/sec     1.00    600.8±4.91µs   210.0 MB/sec
string_and_binary_view/default            1.00    362.3±0.88µs   348.3 MB/sec     1.00    362.7±2.68µs   348.0 MB/sec
string_and_binary_view/parquet_2          1.00    364.5±3.33µs   346.2 MB/sec     1.01    366.6±3.93µs   344.3 MB/sec
string_and_binary_view/zstd               1.01    597.0±6.35µs   211.4 MB/sec     1.00   593.2±15.89µs   212.7 MB/sec
string_and_binary_view/zstd_parquet_2     1.02    586.8±3.65µs   215.1 MB/sec     1.00    576.7±4.37µs   218.8 MB/sec
string_dictionary/bloom_filter            1.00   611.3±11.01µs  1688.2 MB/sec     1.01   615.7±18.11µs  1676.1 MB/sec
string_dictionary/default                 1.02    393.6±4.67µs     2.6 GB/sec     1.00    385.7±4.38µs     2.6 GB/sec
string_dictionary/parquet_2               1.01    391.6±2.99µs     2.6 GB/sec     1.00    389.2±2.39µs     2.6 GB/sec
string_dictionary/zstd                    1.00   1112.3±5.30µs   927.8 MB/sec     1.00  1116.0±23.96µs   924.7 MB/sec
string_dictionary/zstd_parquet_2          1.01   1103.4±5.71µs   935.3 MB/sec     1.00  1097.6±12.23µs   940.2 MB/sec
string_non_null/bloom_filter              1.00  1772.7±10.59µs  1154.8 MB/sec     1.00  1776.0±31.77µs  1152.7 MB/sec
string_non_null/default                   1.00  1141.0±20.54µs  1794.2 MB/sec     1.00  1139.0±15.68µs  1797.4 MB/sec
string_non_null/parquet_2                 1.01  1144.0±22.25µs  1789.5 MB/sec     1.00  1133.2±11.91µs  1806.5 MB/sec
string_non_null/zstd                      1.00      3.0±0.04ms   671.4 MB/sec     1.00      3.0±0.02ms   671.3 MB/sec
string_non_null/zstd_parquet_2            1.00      3.1±0.01ms   671.0 MB/sec     1.00      3.1±0.03ms   668.4 MB/sec


@etseidl etseidl left a comment


Thanks for the effort, @asuresh8, but I'm not sure I see a compelling case for change here. It seems your original concept was to trade an allocation for a memcpy, but now you simply take the tmp buffer and reallocate a new one in the uncompressed case. The V2 cases similarly move where the allocations occur, but don't reduce the number. It's only in the case of compressed V1 pages that there's a clear potential reduction, but I wonder if we could instead just allow the overallocation of space and drop the shrink_to_fit call. Perhaps replacing the initial buffer allocation with let mut buffer = Vec::with_capacity(values_data.buf.len()); would help.

If you can produce some harder evidence that this change is beneficial I'd of course be willing to change my mind.

} else {
// Zero-cost ownership transfer instead of memcpy.
// page_buf will regrow on next page (one alloc, same as pre-reuse code).
Bytes::from(std::mem::take(&mut self.page_buf))

So for N uncompressed pages, this will actually do N+1 empty vec allocations, and the same for V2 pages.

)[..],
);
}
// Encode levels into locals first to avoid borrow conflict

Here we need to allocate up to 2 temp buffers to store the level data. In the original there are still 2 allocations, but there's a chance that the second allocation will reuse memory just freed up from the first.

let uncompressed_size = self.page_buf.len();

let page_bytes = if let Some(ref mut cmpr) = self.compressor {
self.compressed_buf.clear();

This is the only clear win I see here. We're replacing up to 2 allocations with 1. For poorly compressed data it's possible the shrink_to_fit will do nothing.

}
};

let page_bytes = if is_compressed {

For V2 pages, there's really no win here. Uncompressed still does one extra alloc, and compressed trades one allocation of the tmp buffer for one for the copy_from_slice.


let page_bytes = if is_compressed {
// Compressed: copy smaller data, reuse buffer for next page
Bytes::copy_from_slice(&self.page_buf)

If we're doing an alloc+copy, why not just do Bytes::from(std::mem::take(&mut self.page_buf)) for both cases (or copy for both)?

Also, the compression buffer is never used for V2 pages.
