PR0: forward batch_size parameter to PyArrow Scanner #3044
Closed
sumedhsakdeo wants to merge 1 commit into apache:main
Conversation
Add a batch_size parameter to _task_to_record_batches, _record_batches_from_scan_tasks_and_deletes, ArrowScan.to_record_batches, and DataScan.to_arrow_batch_reader so users can control the number of rows per RecordBatch returned by PyArrow's Scanner. Partially closes apache#3036. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rationale for this change
Partially closes #3036
Summary
- Adds a batch_size parameter forwarded to PyArrow's ds.Scanner.from_fragment() to control rows per RecordBatch
- Threads it through _task_to_record_batches → _record_batches_from_scan_tasks_and_deletes → ArrowScan.to_record_batches → DataScan.to_arrow_batch_reader

PR Stack
This is PR 1 of 3 for #3036:
1. batch_size forwarding (this PR)
2. streaming flag — stop materializing entire files
3. concurrent_files — bounded concurrent streaming

Are these changes tested?
Yes — unit tests for batch_size=100 and batch_size=None in test_pyarrow.py.
Are there any user-facing changes?
Yes — new batch_size param on to_arrow_batch_reader().