
PR0: forward batch_size parameter to PyArrow Scanner #3044

Closed
sumedhsakdeo wants to merge 1 commit into apache:main from sumedhsakdeo:fix/arrow-scan-batch-size-3036

Conversation


@sumedhsakdeo sumedhsakdeo commented Feb 14, 2026

Rationale for this change

Partially closes #3036

Summary

  • Forward the batch_size parameter to PyArrow's ds.Scanner.from_fragment() to control the number of rows per RecordBatch
  • Propagated through _task_to_record_batches → _record_batches_from_scan_tasks_and_deletes → ArrowScan.to_record_batches → DataScan.to_arrow_batch_reader

PR Stack

This is PR 0 in a stack of 3 for #3036:

  1. PR 0 (this): batch_size forwarding
  2. PR 1: streaming flag — stop materializing entire files
  3. PR 2: concurrent_files — bounded concurrent streaming

Are these changes tested?

Yes — unit tests for batch_size=100 and batch_size=None in test_pyarrow.py.

Are there any user-facing changes?

Yes — new batch_size param on to_arrow_batch_reader().

Add batch_size parameter to _task_to_record_batches,
_record_batches_from_scan_tasks_and_deletes, ArrowScan.to_record_batches,
and DataScan.to_arrow_batch_reader so users can control the number of
rows per RecordBatch returned by PyArrow's Scanner.

Partially closes apache#3036

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
