perf: Reuse page buffers across data pages in column writer #9521
asuresh8 wants to merge 4 commits into apache:main
Conversation
Add persistent page_buf and compressed_buf fields to GenericColumnWriter. Instead of allocating fresh Vec<u8> per page (~3 allocations per page), clear and reuse the existing buffers. This eliminates allocator overhead and improves cache locality for the compression path. The shrink_to_fit() call after compression is also removed since the buffer is now reused (shrinking would be counterproductive).

Changes:
- Add page_buf and compressed_buf fields to GenericColumnWriter
- Reuse buffers via .clear() + extend in both V1 and V2 data page paths
- Use Bytes::copy_from_slice() to create the final page Bytes from the reused buffer
- Remove shrink_to_fit() after compression (subsumes ARS-4)
- Update memory_size() to include reusable buffer capacity
- Add 3 multi-page roundtrip tests with compression (V1, V2, nullable)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Validates that buffer reuse does not regress on the memory bloat scenario from issue apache#8526 (highly compressible data producing ~1MB uncompressed / ~20KB compressed pages). Memory stays bounded at O(1) buffers regardless of page count.

Three new tests:
- test_memory_bounded_with_highly_compressible_data: 100 pages of 1024-byte repeated strings with Snappy compression (V1 path)
- test_memory_bounded_with_highly_compressible_data_v2: Same for V2
- test_dictionary_fallback_memory_bounded: High-cardinality data triggers dictionary fallback, then compressible data continues writing

The shrink_to_fit() removal (ARS-4) is safe because buffers are reused (cleared each page) rather than accumulated. Each CompressedPage gets a right-sized Bytes::copy_from_slice, so queued pages hold only the compressed data, not the full uncompressed buffer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
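The clear-and-reuse pattern these commits describe can be sketched in a minimal, self-contained form. `PageWriter` and `write_page` are hypothetical stand-ins, not the crate's API; the real writer hands the right-sized copy off as `Bytes`, modeled here with a plain `Vec<u8>` clone:

```rust
// Minimal sketch of per-writer buffer reuse: one persistent buffer,
// cleared per page, instead of a fresh Vec<u8> allocation per page.
struct PageWriter {
    page_buf: Vec<u8>, // persistent; capacity survives across pages
}

impl PageWriter {
    fn new() -> Self {
        PageWriter { page_buf: Vec::new() }
    }

    // Assemble one page into the reused buffer and return a right-sized copy,
    // so a queued page holds only its own bytes, not the whole buffer.
    fn write_page(&mut self, header: &[u8], values: &[u8]) -> Vec<u8> {
        self.page_buf.clear(); // drops contents, keeps capacity
        self.page_buf.extend_from_slice(header);
        self.page_buf.extend_from_slice(values);
        self.page_buf.clone() // stand-in for Bytes::copy_from_slice
    }
}

fn main() {
    let mut w = PageWriter::new();
    let p1 = w.write_page(b"hdr1", b"aaaa");
    let p2 = w.write_page(b"hdr2", b"bb");
    assert_eq!(p1, b"hdr1aaaa");
    assert_eq!(p2, b"hdr2bb"); // clear() prevented stale bytes from page 1
    // The capacity grown for page 1 (8 bytes) was reused for page 2.
    assert!(w.page_buf.capacity() >= 8);
}
```

This is why memory stays O(1) in buffers: the writer owns a bounded amount of reusable capacity, while each emitted page carries only its own (compressed) length.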
run benchmark arrow_writer

🤖: Benchmark completed
oh no! It looks like the list primitive no-null cases have regressed quite a bit. I'll take a further look at what's going on there.
…ssion

The Bytes::copy_from_slice approach for buffer reuse caused a regression on large uncompressed pages (e.g., ~627KB list pages after dictionary fallback) due to the O(page_size) memcpy.

Fix: Use Bytes::from(std::mem::take(&mut self.page_buf)) for uncompressed pages (zero-cost ownership transfer) and keep Bytes::copy_from_slice for compressed pages (smaller data, so buffer reuse still saves an allocation).

For V1: the uncompressed path uses mem::take, the compressed path copies from compressed_buf (already a copy, buffer reuse still helps). For V2: split on the is_compressed flag: compressed keeps copy_from_slice for buffer reuse, uncompressed uses mem::take.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
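The two hand-off strategies this commit chooses between can be shown with a small std-only sketch. `take_uncompressed` and `copy_compressed` are hypothetical names for the two code paths; the real code produces `Bytes`, modeled here with `Vec<u8>`:

```rust
use std::mem;

// Uncompressed path: the page can be large, so ownership is transferred.
// Zero-cost, but the source buffer is left with no capacity and must
// regrow on the next page.
fn take_uncompressed(page_buf: &mut Vec<u8>) -> Vec<u8> {
    mem::take(page_buf)
}

// Compressed path: the data is small, so an alloc+memcpy is cheap and
// the buffer keeps its capacity for the next page.
fn copy_compressed(compressed_buf: &[u8]) -> Vec<u8> {
    compressed_buf.to_vec()
}

fn main() {
    let mut page_buf = Vec::with_capacity(1024);
    page_buf.extend_from_slice(b"large uncompressed page");
    let page = take_uncompressed(&mut page_buf);
    assert_eq!(page, b"large uncompressed page");
    assert_eq!(page_buf.capacity(), 0); // ownership moved; buffer must regrow

    let mut compressed_buf = Vec::with_capacity(1024);
    compressed_buf.extend_from_slice(b"tiny");
    let cpage = copy_compressed(&compressed_buf);
    assert_eq!(cpage, b"tiny");
    assert!(compressed_buf.capacity() >= 1024); // capacity kept for reuse
}
```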
run benchmark arrow_writer

🤖 Hi @asuresh8, thanks for the request (#9521 (comment)).

run benchmark arrow_writer

🤖: Benchmark completed

run benchmark arrow_writer

🤖: Benchmark completed
etseidl left a comment:
Thanks for the effort, @asuresh8, but I'm not sure I see a compelling case for change here. It seems your original concept was to trade an allocation for a memcpy, but now you simply take the tmp buffer and reallocate a new one in the uncompressed case. The V2 cases similarly move where the allocations occur, but don't reduce the number. It's only in the case of compressed V1 pages that there's a clear potential reduction, but I wonder if we could instead just allow the overallocation of space and drop the shrink_to_fit call. Perhaps replacing the initial buffer allocation with let mut buffer = Vec::with_capacity(values_data.buf.len()); would help.
If you can produce some harder evidence that this change is beneficial I'd of course be willing to change my mind.
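The reviewer's alternative could look roughly like this minimal sketch, assuming a stand-in compressor (`fake_compress` and `compress_page` are hypothetical names, not the crate's API): pre-size a fresh per-page buffer from the uncompressed input and skip shrink_to_fit, tolerating the slack.

```rust
// Stand-in "compressor": real code would call the codec's compress here.
// For illustration it just pretends compression halves the input.
fn fake_compress(input: &[u8], output: &mut Vec<u8>) {
    output.extend_from_slice(&input[..input.len() / 2]);
}

fn compress_page(values_data: &[u8]) -> Vec<u8> {
    // One up-front allocation sized to the uncompressed input; since
    // compressed output is normally smaller, no reallocation follows.
    let mut buffer = Vec::with_capacity(values_data.len());
    fake_compress(values_data, &mut buffer);
    // Deliberately no shrink_to_fit(): tolerate the overallocation instead
    // of paying a realloc+copy right before the buffer is consumed.
    buffer
}

fn main() {
    let page = compress_page(&[0u8; 1000]);
    assert_eq!(page.len(), 500);
    assert!(page.capacity() >= 1000); // slack kept, by design
}
```

The design trade here is one allocation per page (no buffer state carried on the writer) against some wasted capacity in each queued page when the data compresses well.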
} else {
    // Zero-cost ownership transfer instead of memcpy.
    // page_buf will regrow on next page (one alloc, same as pre-reuse code).
    Bytes::from(std::mem::take(&mut self.page_buf))
So for N uncompressed pages, this will actually do N+1 empty vec allocations, and the same for V2 pages.
    )[..],
);
}
// Encode levels into locals first to avoid borrow conflict
Here we need to allocate up to 2 temp buffers to store the level data. In the original there are still 2 allocations, but there's a chance that the second allocation will reuse memory just freed up from the first.
let uncompressed_size = self.page_buf.len();

let page_bytes = if let Some(ref mut cmpr) = self.compressor {
    self.compressed_buf.clear();
This is the only clear win I see here. We're replacing up to 2 allocations with 1. For poorly compressed data it's possible the shrink_to_fit will do nothing.
    }
};

let page_bytes = if is_compressed {
For V2 pages, there's really no win here. Uncompressed still does one extra alloc, and compressed trades one allocation of the tmp buffer for one for the copy_from_slice.
let page_bytes = if is_compressed {
    // Compressed: copy smaller data, reuse buffer for next page
    Bytes::copy_from_slice(&self.page_buf)
If we're doing an alloc+copy, why not just do Bytes::from(std::mem::take(&mut self.page_buf)) for both cases (or copy for both)?
Also, the compression buffer is never used for V2 pages.
Which issue does this PR close?
N/A - Performance optimization
Rationale
`GenericColumnWriter::add_data_page()` allocates fresh `Vec<u8>` buffers on every page flush: one for page assembly (`buffer`) and one for compression output (`compressed_buf`). For workloads that produce many pages (e.g., TPC-H SF1 lineitem: 17 columns x 49 row groups x ~70 pages/column = ~58K pages), this creates significant allocator overhead and cache pollution from touching fresh memory on every page.

Additionally, the V1 compression path calls `compressed_buf.shrink_to_fit()` immediately before consuming the buffer, which triggers an unnecessary realloc+copy that is immediately discarded when the buffer is converted to `Bytes`.

What changes are included in this PR?
Add two persistent `Vec<u8>` fields (`page_buf` and `compressed_buf`) to `GenericColumnWriter` that are cleared and reused across data pages instead of allocating fresh buffers per page.

Changes:
- Add `page_buf: Vec<u8>` and `compressed_buf: Vec<u8>` fields to `GenericColumnWriter`
- In `add_data_page()`, replace `let mut buffer = vec![]` with `self.page_buf.clear()` and reuse
- Replace `let mut compressed_buf = Vec::with_capacity(...)` with `self.compressed_buf.clear()` and reuse
- Use `Bytes::copy_from_slice()` to create the final `Bytes` from the reused buffer (since `Bytes::from(Vec<u8>)` would consume the allocation)
- Remove `shrink_to_fit()` after compression (counterproductive with buffer reuse)
- Update `memory_size()` to include the capacity of the reusable buffers (partial improvement to an already-approximate metric)
- Add tests validating the `.clear()` invariant across page boundaries
Not changed:
- `LevelEncoder` buffer allocation (small buffers, would require API changes)

Trade-off:
`Bytes::copy_from_slice` adds one memcpy per page vs the previous zero-copy `Bytes::from(Vec<u8>)`. However, the eliminated allocations (2-3 per page), the removed `shrink_to_fit` realloc, and cache warming from reused buffers provide a net benefit. The copy operates on the smaller compressed data when compression is enabled.

Are these changes tested?
Yes. All 88 existing tests pass unchanged. 3 new tests added:
- `test_multi_page_roundtrip_with_compression` - V1 path, 5 pages, Snappy
- `test_multi_page_roundtrip_with_compression_v2` - V2 path, 5 pages, Snappy
- `test_multi_page_roundtrip_nullable_with_compression` - V1 path with nullable columns, 5 pages, Snappy

These tests write enough data to trigger multiple data pages (via `set_data_page_row_count_limit(1000)` with 5000 values), then read back and verify all values match. This validates that `.clear()` properly prevents stale data leakage between pages.

Are there any user-facing changes?
No. All changes are to private fields and internal methods.
Benchmark Results
`cargo bench -p parquet --bench arrow_writer` on Apple M-series:

Aggregate: -0.76% (excluding float_with_nans noise, confirmed by re-run)
Notable improvements:
- `primitive_non_null/zstd_parquet_2`: -4.8%
- `primitive_non_null/bloom_filter`: -4.0%
- `bool_non_null/zstd_parquet_2`: -3.4%
- `string/zstd_parquet_2`: -3.4%
- `string/parquet_2`: -3.0%
- `string_dictionary/default`: -2.7%

No confirmed regressions (the initial `float_with_nans` regression was measurement noise, confirmed by a subsequent run with p > 0.05).