Add batch_size parameter to _task_to_record_batches, _record_batches_from_scan_tasks_and_deletes, ArrowScan.to_record_batches, and DataScan.to_arrow_batch_reader so users can control the number of rows per RecordBatch returned by PyArrow's Scanner. Partially closes apache#3036. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When streaming=True, batches are yielded as they are produced by PyArrow without materializing entire files into memory. Files are still processed sequentially, preserving file ordering. The inner method handles the global limit correctly when called with all tasks, avoiding double-counting. This addresses the OOM issue in apache#3036 for single-file-at-a-time streaming. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
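The sequential streaming path can be pictured as a generator that pulls batches lazily from one file at a time and decrements a single global limit, truncating the final batch rather than counting rows twice. A simplified sketch under those assumptions (plain lists stand in for pyarrow.RecordBatch; stream_batches and read_file are hypothetical names, not the PyIceberg API):

```python
from typing import Callable, Iterable, Iterator, List, Optional

def stream_batches(
    files: List[str],
    read_file: Callable[[str], Iterable[list]],
    limit: Optional[int] = None,
) -> Iterator[list]:
    """Yield batches file by file, lazily, enforcing one global row limit.

    Files are processed sequentially, so ordering is preserved and only one
    file's batches are ever in flight at a time.
    """
    remaining = limit
    for path in files:
        for batch in read_file(path):  # lazily pulls batches from one file
            if remaining is not None:
                if remaining <= 0:
                    return
                if len(batch) > remaining:
                    batch = batch[:remaining]  # truncate the final batch
                remaining -= len(batch)
            yield batch

# Two "files" of one three-row batch each; a global limit of 4 rows.
fake_files = {"a": [[1, 2, 3]], "b": [[4, 5, 6]]}
out = list(stream_batches(["a", "b"], lambda p: iter(fake_files[p]), limit=4))
# out is [[1, 2, 3], [4]]: the second file's batch is truncated to one row.
```

Because the limit is tracked in one place as batches stream through, there is no double-counting between per-file and global accounting.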
Add _bounded_concurrent_batches() with proper lock discipline:
- Queue backpressure caps memory (scan.max-buffered-batches, default 16)
- Semaphore limits concurrent file reads (concurrent_files param)
- Cancel event with timeouts on all blocking ops (no lock over IO)
- Error propagation and early termination support

When streaming=True and concurrent_files > 1, batches are yielded as they arrive from parallel file reads. File ordering is not guaranteed (documented).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
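The mechanism described above can be sketched with stdlib primitives: a bounded queue.Queue provides the backpressure, a semaphore caps in-flight file reads, and one sentinel per file signals completion. This is an illustrative reconstruction under those assumptions, not the actual implementation; the names mirror the commit message, and the cancel-event/timeout machinery is omitted for brevity:

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, List

def bounded_concurrent_batches(
    files: List[str],
    read_file: Callable[[str], Iterable[object]],
    concurrent_files: int = 4,
    max_buffered_batches: int = 16,
) -> Iterator[object]:
    """Yield batches from parallel file reads with bounded buffering.

    Batch ordering across files is NOT guaranteed.
    """
    buf: "queue.Queue" = queue.Queue(maxsize=max_buffered_batches)
    sem = threading.Semaphore(concurrent_files)
    errors: List[BaseException] = []
    _DONE = object()

    def worker(path: str) -> None:
        with sem:  # caps the number of files being read at once
            try:
                for batch in read_file(path):
                    buf.put(batch)  # blocks when the buffer is full: backpressure
            except BaseException as exc:
                errors.append(exc)
            finally:
                buf.put(_DONE)  # one completion marker per file

    threads = [threading.Thread(target=worker, args=(f,), daemon=True) for f in files]
    for t in threads:
        t.start()

    done = 0
    while done < len(files):
        item = buf.get()
        if item is _DONE:
            done += 1
        else:
            yield item
    for t in threads:
        t.join()
    if errors:
        raise errors[0]  # propagate the first reader error

# Five "files", each producing a 3-row and a 2-row batch (lists stand in
# for pyarrow.RecordBatch).
data = {f"f{i}": [[i] * 3, [i] * 2] for i in range(5)}
batches = list(bounded_concurrent_batches(
    list(data), lambda p: iter(data[p]),
    concurrent_files=2, max_buffered_batches=4,
))
total = sum(len(b) for b in batches)  # all 25 rows arrive, in some order
```

The queue's maxsize is what turns buffering into a hard memory cap: a fast reader simply blocks in put() until the consumer drains a batch.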
Measures records/sec and peak memory across streaming, concurrent_files, and batch_size configurations to validate performance characteristics of the new scan modes introduced for apache#3036. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rationale for this change
Closes #3036
Summary
Adds a read-throughput micro-benchmark to measure records/sec and peak Arrow memory across streaming and concurrent_files configurations introduced in PRs 0-2.

Synthetic Data

- Schema: id (int64), value (float64), label (string), flag (bool), ts (timestamp[us, tz=UTC])
- Peak memory is measured via pa.total_allocated_bytes() (the PyArrow C++ memory pool, not the Python heap via tracemalloc)

Configurations (6 parameterized tests)
All tests use PyArrow's default batch_size=131,072. The variable under test is the concurrency model:

- streaming=False
- streaming=True, concurrent_files=1
- streaming=True, concurrent_files=2
- streaming=True, concurrent_files=4
- streaming=True, concurrent_files=8
- streaming=True, concurrent_files=16

Benchmark Results (local SSD, macOS, 16-core, Python 3.13)
Configurations measured:

- default (executor.map, all files parallel)
- streaming, concurrent_files=1
- streaming, concurrent_files=2
- streaming, concurrent_files=4
- streaming, concurrent_files=8
- streaming, concurrent_files=16

Key observations
- concurrent_files=1 reduces peak memory 63x (637 MB → 10 MB): processes one file at a time, ideal for memory-constrained environments
- concurrent_files=4 matches default throughput (178M vs 196M rows/s) at 82% less memory (114 MB vs 637 MB)
- concurrent_files=8 beats default by 15% (225M vs 196M rows/s) at 58% less memory (269 MB vs 637 MB): the sweet spot on this hardware
- concurrent_files=16 plateaus at the concurrent_files=8 level: on local SSD, GIL contention and memory bandwidth become the bottleneck rather than IO. On network storage (S3/GCS), where IO latency dominates, higher concurrency values are expected to scale further
- concurrent_files gives users a predictable knob to trade memory for throughput

How to run
Are these changes tested?
Yes: this PR is a benchmark test itself (6 parameterized test cases).
Are there any user-facing changes?
No: benchmark infrastructure only.