
Add lazy read capabilities #43

Open
d33bs wants to merge 6 commits into WayScience:main from d33bs:lazy-reader

Conversation


@d33bs d33bs commented Feb 19, 2026

Description

This PR implements lazy reader capabilities, allowing one to load and lazily read portions of source data.

Closes #18

What kind of change(s) are included?

  • Documentation (changes docs or other related content)
  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (these changes would cause existing functionality to not work as expected).

Checklist

Please ensure that all boxes are checked before indicating that this pull request is ready for review.

  • I have read and followed the CONTRIBUTING.md guidelines.
  • I have searched for existing content to ensure this is not a duplicate.
  • I have performed a self-review of these additions (including spelling, grammar, and related).
  • These changes pass all pre-commit checks.
  • I have added comments to my code to help provide understanding.
  • I have added a test which covers the code changes found within this PR.
  • I have deleted all non-relevant text in this pull request template.

Summary by CodeRabbit

  • New Features

    • Lazy-loading API: scan(), slice_lazy(), and collect() for deferred reads and queued cropping.
    • LazyTensorView: build lazy tensor plans, chain selections/layouts, and convert on demand (NumPy/DLPack/torch/jax).
  • Documentation

    • Added lazy scan-style and slicing examples; documented chunk_policy and channel_policy options.
  • Tests

    • Added tests for lazy scan, chained slices, materialization semantics, and LazyTensorView behavior.
  • Chores

    • Updated pre-commit tooling hook version.


coderabbitai bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Walkthrough

Adds Polars-style lazy planning and execution: OMEArrow.scan(...) creates deferred source plans, slice_lazy(...) queues slices, tensor_view can return LazyTensorView, and collect() materializes queued work and executes the plan.
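
A minimal sketch of that flow, using only the API names above (the file path, crop bounds, and ROI values are illustrative, not taken from the PR):

from ome_arrow import OMEArrow

# Build a deferred plan; nothing is read from disk yet.
plan = OMEArrow.scan("data/example.ome.parquet")  # illustrative path

# Queue a crop. slice_lazy() returns a new lazy plan rather than mutating `plan`.
cropped = plan.slice_lazy(0, 256, 0, 256)

# Materialization happens here: the source is loaded and queued slices are applied.
cropped.collect()

# After collect(), tensor_view() yields a concrete TensorView ready for conversion.
array = cropped.tensor_view(roi=(0, 0, 128, 128)).to_numpy()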

Changes

  • Docs & Examples (README.md, docs/src/dlpack.md): Added “Lazy scan-style” examples showing OMEArrow.scan, slice_lazy, collect, lazy tensor_view, and advanced options (chunk_policy, channel_policy).
  • Public API re-exports (src/ome_arrow/__init__.py): Re-exported LazyTensorView alongside TensorView.
  • Core lazy planning & materialization (src/ome_arrow/core.py): Introduced _LazySourceSpec, _LazySliceSpec; added lazy parameter to OMEArrow.__init__, scan() classmethod, is_lazy/data properties, _load_from_string_source, _ensure_materialized, _tensor_source, collect(), slice_lazy(); materialize-before in export/info/view/slice/tensor_view/repr_html; tensor_view may return LazyTensorView.
  • Lazy tensor wrapper (src/ome_arrow/tensor.py): Added LazyTensorView class (stores loader + selection kwargs, supports collect(), select(), with_layout(), delegates properties/conversions/iterators); added _UNSET sentinel and typing updates.
  • Tests (tests/test_core.py, tests/test_tensor.py): Added tests for scan/collect, chained slice_lazy, eager fallback, lazy-mode rejection of stack patterns, and LazyTensorView behavior (deferred materialization, select/layout semantics, source validation).
  • Tooling (.pre-commit-config.yaml): Bumped ruff hook from v0.15.1 to v0.15.2.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant OMEArrow as OMEArrow (lazy)
    participant LazySpec as LazySpecs
    participant Materializer as Materializer
    participant DataSource as DataSource

    User->>OMEArrow: scan("path/to/data", ...)
    OMEArrow->>LazySpec: create _LazySourceSpec / store plan
    User->>OMEArrow: slice_lazy(x_min,...)
    OMEArrow->>LazySpec: append _LazySliceSpec
    User->>OMEArrow: tensor_view(...)
    OMEArrow-->>User: return LazyTensorView
    User->>OMEArrow: collect()
    OMEArrow->>Materializer: _ensure_materialized()
    Materializer->>DataSource: load underlying source (parquet/zarr/...)
    DataSource-->>Materializer: return StructScalar/Array
    Materializer->>Materializer: apply queued slices/transformations
    Materializer-->>OMEArrow: materialized data
    OMEArrow-->>User: non-lazy OMEArrow / TensorView available

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • MikeLippincott

Poem

🐰 I scanned a path and queued a slice,
I waited quiet—deferred and nice.
Collect pulls threads from lazy seams,
Tensors wake and dance in streams.
Hop, hop, arrow bright—data dreams!

🚥 Pre-merge checks (5 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add lazy read capabilities' directly and accurately summarizes the main change: implementing lazy reader capabilities for deferred loading.
  • Linked Issues check: ✅ Passed. The PR implements lazy scanning and slice_lazy methods enabling selective slice loading from Parquet without loading entire images, directly addressing issue #18's requirement to investigate partial slice-based reads.
  • Out of Scope Changes check: ✅ Passed. All changes relate to lazy loading: OMEArrow.scan, slice_lazy, LazyTensorView, and supporting infrastructure. Documentation updates demonstrate usage. The only unrelated change is the .pre-commit-config.yaml version bump.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 84.78%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@d33bs d33bs marked this pull request as ready for review February 19, 2026 18:47
Copilot AI review requested due to automatic review settings February 19, 2026 18:47

Copilot AI left a comment


Pull request overview

This PR adds a “scan/collect” lazy-loading workflow to OMEArrow, enabling deferred ingestion from string sources, deferred creation and execution of tensor views, and queued spatial/index slicing.

Changes:

  • Introduces OMEArrow.scan(...), OMEArrow.collect(), OMEArrow.is_lazy, and OMEArrow.slice_lazy(...) to support deferred planning and materialization.
  • Adds LazyTensorView to defer TensorView materialization until execution (e.g., to_numpy() / to_dlpack()).
  • Updates tests and documentation/README with the new lazy usage patterns.
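
A small sketch of the deferral these bullets describe, assuming a Parquet source on disk (the path and selection are illustrative):

from ome_arrow import OMEArrow

oa = OMEArrow.scan("cells.ome.parquet")  # illustrative path; no I/O yet
assert oa.is_lazy

view = oa.tensor_view(t=0)               # LazyTensorView; still deferred
array = view.to_numpy()                  # execution triggers collect() and loads the source

assert not oa.is_lazy                    # materialization is observable on the parent object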

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Summary per file:
  • tests/test_tensor.py: Adds coverage for lazy tensor view execution/materialization behavior.
  • tests/test_core.py: Adds coverage for scan(), collect(), and slice_lazy() chaining and fallback behavior.
  • src/ome_arrow/tensor.py: Introduces LazyTensorView, which defers TensorView creation until first execution.
  • src/ome_arrow/core.py: Implements lazy source/slice planning, materialization logic, and returns LazyTensorView from tensor_view() when applicable.
  • src/ome_arrow/__init__.py: Exposes LazyTensorView at the package top level.
  • docs/src/dlpack.md: Documents lazy scan-style slicing and lazy tensor selection usage.
  • README.md: Adds a quickstart example for the new Polars-like lazy workflow.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
src/ome_arrow/core.py (3)

805-874: slice_lazy semantics: self is unchanged; callers working with chained lazy plans share no state.

slice_lazy creates and returns a new OMEArrow scan object and leaves self entirely unmodified. The _lazy_slices on the returned plan is a fresh list (not a reference to self._lazy_slices). This is sound and the chaining tests verify it correctly.

The one latent footgun is the same double-load documented in the README/dlpack.md review: code that calls both oa.slice_lazy(...).collect() and then oa.tensor_view(...) on the same oa will load the source file twice. A short note in the docstring would help users avoid this:

 Returns:
-    OMEArrow: Lazy plan when source is lazy; eager slice result otherwise.
+    OMEArrow: A new lazy plan (does not mutate ``self``) when the source is lazy;
+        an eager slice result otherwise. Note: collecting multiple independent
+        plans derived from the same scan will load the source file once per plan.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/core.py` around lines 805 - 874, The docstring for slice_lazy
should explicitly state that slice_lazy returns a new OMEArrow plan and does not
modify self (the original OMEArrow), so chained lazy plans do not share mutable
state: reference the slice_lazy method, the returned OMEArrow from
OMEArrow.scan, and the _lazy_slices list being a fresh list. Also add a brief
warning about the "double-load" footgun: calling oa.slice_lazy(...).collect()
and then oa.tensor_view(...) (or otherwise materializing the same original oa)
will reload the source twice; advise users to operate on the returned plan (the
value from slice_lazy) if they want a single-load workflow.
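
A brief sketch of the single-load pattern this prompt recommends (the path, crop bounds, and t selection are illustrative):

from ome_arrow import OMEArrow

# Double load: `cropped` and `oa` are independent plans, so each materializes the file.
oa = OMEArrow.scan("image.ome.parquet")
cropped = oa.slice_lazy(0, 256, 0, 256).collect()
arr = oa.tensor_view(t=0).to_numpy()  # `oa` was never sliced; the file is read again

# Single load: keep working with the plan returned by slice_lazy().
plan = OMEArrow.scan("image.ome.parquet")
cropped = plan.slice_lazy(0, 256, 0, 256)
cropped.collect()
arr = cropped.tensor_view(t=0).to_numpy()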

315-344: _ensure_materialized is correct; note that lazy slicing drops _struct_array.

The sequential slice application, lazy-state clearing, and error-recovery behaviour (lazy state retained if _load_from_string_source raises) are all correct. One consequence worth documenting: when _lazy_slices is non-empty, self._struct_array is set to None (line 340), which means Arrow-backed zero-copy (mode='arrow' in tensor_view) is unavailable after a lazy slice materializes. This is observable but currently undocumented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/core.py` around lines 315 - 344, The function
_ensure_materialized currently clears _struct_array when _lazy_slices is
non-empty which prevents Arrow zero-copy access (e.g., tensor_view with
mode='arrow') after lazy slicing; update the project documentation/comment near
_ensure_materialized (or public docs for tensor_view) to explicitly note that
applying lazy slices via _lazy_slices -> slice_ome_arrow will materialize data
and set self._struct_array = None, thereby disabling Arrow-backed zero-copy
access, and mention that this behavior is intentional and retained on exceptions
from _load_from_string_source to preserve lazy state.

124-124: Ruff TRY003: long inline exception messages flagged by static analysis.

Lines 124, 226, 301–304, and 306–313 all carry multi-sentence messages directly in raise statements. The project's own Ruff setup (v0.15.1) flags these as TRY003. Consider extracting them into dedicated exception classes or condensing the messages, or suppressing the rule with # noqa: TRY003 if the verbosity is intentional.

Also applies to: 226-226, 301-313

tests/test_tensor.py (1)

90-109: LGTM on new tests; select() and with_layout() chaining have no coverage.

The two new tests correctly exercise lazy-to-eager transition and the side effect on oa.is_lazy. However, LazyTensorView.select() and with_layout() have no tests — particularly the behaviour of select() when called with a subset of parameters (discussed in the tensor.py review below). Adding at least one test that calls select(z=0) on a lazy view that already has t=2 set would catch the full-replacement bug.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tensor.py` around lines 90 - 109, Add tests covering
LazyTensorView.select() and with_layout() chaining: create a lazy OMEArrow via
OMEArrow.scan(), get a LazyTensorView with some dimensions already set (e.g.,
view = oa.tensor_view(t=2)), then call view.select(z=0) and assert the resulting
view has the correct combined dimensions (t remains 2, z becomes 0) and does not
accidentally replace other set dims; also add a test that calls with_layout() on
a lazy view and verifies layout is updated without forcing materialization
(oa.is_lazy remains True) until to_numpy()/execution. Use LazyTensorView.select,
with_layout, tensor_view, and OMEArrow in assertions to catch the
full-replacement bug.
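
One possible shape for such a test, with an illustrative fixture path and shape check (the real suite's fixtures and expected dimensions may differ):

from ome_arrow import OMEArrow

def test_lazy_select_merges_selections():
    oa = OMEArrow.scan("tests/data/example.ome.parquet")  # illustrative fixture
    view = oa.tensor_view(t=2)       # lazy view with t already selected
    narrowed = view.select(z=0)      # should add z=0 while keeping t=2

    arr = narrowed.to_numpy()
    # Observable check: with both t and z fixed, those axes are gone from the output.
    # The exact expected shape depends on the fixture's dimensions and layout.
    assert arr.ndim <= 3
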
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/dlpack.md`:
- Around line 44-57: The example incorrectly chains operations on obj causing a
second full load: replace the second block to operate on the already-sliced
result (use cropped.tensor_view(...) instead of obj.tensor_view(...)) or split
into two independent examples with a clear note; specifically update the snippet
to use the OMEArrow instance names lazy_crop/cropped when calling tensor_view
and to_numpy so tensor_view operates on the sliced object (methods: slice_lazy,
collect, tensor_view, to_numpy) rather than reusing obj which still represents
the unsliced scan.

In `@README.md`:
- Around line 100-113: The example currently implies that slice_lazy/collect
mutate oa and that later tensor_view uses the cropped data, but slice_lazy
returns a new independent OMEArrow plan and collect does not change oa, so the
file will be reloaded when calling oa.tensor_view(...).to_numpy(); update the
README to either (A) clearly state that slice_lazy returns a new plan and that
collect does not mutate oa and present the crop and tensor-view snippets as
independent examples (add an explicit note that each call loads the file
independently), or (B) show the correct chained usage by building the tensor
view from the new plan returned by slice_lazy (e.g., use the returned variable
from slice_lazy/collect when calling tensor_view/to_numpy) so readers see how to
reuse the cropped plan; reference slice_lazy, collect, tensor_view, to_numpy and
OMEArrow in the revised text.

In `@src/ome_arrow/core.py`:
- Around line 122-131: When lazy=True the constructor stores the string in
_LazySourceSpec without rejecting Bio-Formats stack patterns, causing
_ensure_materialized to call _load_from_string_source which lacks pattern
handling; update the lazy path in __init__ to detect Bio-Formats pattern strings
(strings containing '<', '>' or '*') and raise TypeError or route them to
from_stack_pattern_path instead of accepting them into _LazySourceSpec, so
_ensure_materialized never forwards a pattern string into
_load_from_string_source; reference _LazySourceSpec, __init__,
_ensure_materialized, _load_from_string_source and from_stack_pattern_path when
making the change.

In `@src/ome_arrow/tensor.py`:
- Around line 107-118: select() currently passes all selection parameters
(including explicit None) to _spawn, which overwrites existing entries in
self._kwargs; change select() so it uses a unique sentinel (e.g., _UNSET) for
its default parameter values or otherwise builds a kwargs dict that only
includes keys actually provided by the caller, then call self._spawn(**kwargs)
so existing selections in _kwargs are preserved when arguments are omitted
(retain None as a valid "select all" value); see select, _spawn, with_layout and
_kwargs to implement this behavior.


@d33bs d33bs changed the title from "add lazy read capabilities" to "Add lazy read capabilities" on Feb 19, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
src/ome_arrow/core.py (1)

361-363: _tensor_source calls _ensure_materialized twice.

The explicit self._ensure_materialized() call already handles materialization; self.data (property getter) then calls it again as a no-op. Access self._data directly after the guard to skip the redundant property call.

♻️ Proposed refactor
 def _tensor_source(self) -> pa.StructScalar | pa.StructArray:
     self._ensure_materialized()
-    return self._struct_array if self._struct_array is not None else self.data
+    return self._struct_array if self._struct_array is not None else self._data  # type: ignore[return-value]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/core.py` around lines 361 - 363, The _tensor_source method
redundantly calls self._ensure_materialized() and then accesses the data
property (self.data) which triggers _ensure_materialized() again; after the
guard call in _tensor_source, return self._struct_array if present else access
the underlying storage directly via self._data (not self.data) to avoid the
duplicate materialization call and extra property overhead.
tests/test_tensor.py (2)

143-143: Test probes private _kwargs attribute directly.

view_hw._kwargs["layout"] inspects internal plan state. Consider asserting observable behavior instead: after collect(), view_hw.collect()._layout_override == "HW" — or simply validate the output shape, which is the behavior the test ultimately cares about.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tensor.py` at line 143, The test is inspecting private internal
state via view_hw._kwargs["layout"]; instead, change the assertion to check
observable behavior — call view_hw.collect() and assert either that the returned
object's _layout_override equals "HW" (i.e., view_hw.collect()._layout_override
== "HW") or validate the output shape that reflects the HW layout; replace the
direct _kwargs access with one of these public/observable assertions on view_hw
or its collect() result.

90-102: New lazy-collection tests look correct overall.

The trace from view.to_numpy() → LazyTensorView.collect() → OMEArrow._tensor_source() → _ensure_materialized() correctly clears _lazy_source, so assert not oa.is_lazy after to_numpy() is a valid postcondition check.

One note: pytest.warns(UserWarning, ...) requires the warning to be emitted — if the fixture file is updated to carry an "ome_arrow" column, the test will fail. Wrapping with recwarn or pytest.warns(..., match=..., match_required=False) (depending on pytest version) would make the test tolerant of that future change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_tensor.py` around lines 90 - 102, The test
test_lazy_tensor_view_collects_on_execution currently uses
pytest.warns(UserWarning, match="Requested column 'ome_arrow'") which will fail
if the warning is not emitted in future; update the assertion to be tolerant by
either using pytest.warns(UserWarning, match="Requested column 'ome_arrow'",
match_required=False) if your pytest supports it, or replace the warns context
with the recwarn fixture (inspect warnings via recwarn after calling
view.to_numpy) so the call chain view.to_numpy -> LazyTensorView.collect ->
OMEArrow._tensor_source -> _ensure_materialized can still be validated and the
final assert not oa.is_lazy remains intact.
src/ome_arrow/tensor.py (1)

134-157: Properties silently trigger full materialization — undocumented side effect.

dtype, device, layout, shape, and strides each call self.collect(), which invokes the loader and materializes the source (file I/O). Callers typically expect property access to be cheap, but these can trigger a full Parquet/TIFF read. Adding a one-liner to each docstring would set correct expectations.

📝 Example docstring addition
 `@property`
 def dtype(self) -> np.dtype:
-    """Return the tensor dtype."""
+    """Return the tensor dtype. Triggers materialization on first access."""
     return self.collect().dtype
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/tensor.py` around lines 134 - 157, The properties dtype,
device, layout, shape, and strides call self.collect() and therefore materialize
the underlying data (triggering loader/file I/O); update each property's
docstring (for methods dtype, device, layout, shape, strides) to include a
one-line note warning that accessing the property will materialize the tensor
via collect() and may be expensive (i.e., may perform a full Parquet/TIFF read),
so callers should avoid using these properties in hot paths or should call
collect() explicitly when they expect heavy I/O.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/src/dlpack.md`:
- Around line 51-56: The README example is invalid because slice_lazy(0, 512, 0,
512).slice_lazy(64, 256, 64, 256) produces a 192×192 image, yet roi=(128, 128,
256, 256) requests pixels outside that 192×192 bounds causing ValueError; also
collect() materializes the slice so the variable name lazy_view is misleading.
Fix by either adjusting the slices to produce at least 256×256 before calling
tensor_view (e.g., change slice_lazy ranges) or shrink the roi to fit within the
actual 192×192 extent (ensure x+w and y+h ≤ 192), and rename lazy_view to a
concrete name like tensor_view or concrete_view to reflect that
cropped.tensor_view(...) returns a concrete TensorView after collect(); update
the example accordingly (references: slice_lazy, collect, tensor_view, roi).

In `@README.md`:
- Around line 107-113: The example uses slice_lazy -> collect -> tensor_view
with an out-of-bounds roi and a misleading variable name: after calling
slice_lazy(0, 512, 0, 512).slice_lazy(64, 256, 64, 256) the image is 192×192 so
passing roi=(0, 0, 512, 512) to tensor_view(...) will raise ValueError; change
the ROI to the correct dimensions (e.g., roi=(0, 0, 192, 192) or compute
width/height from the sliced object's shape) and rename lazy_view to something
like view or tensor_view_result since collect() returns self and tensor_view
returns a concrete TensorView (not a LazyTensorView). Ensure references to
slice_lazy, collect, and tensor_view are updated accordingly.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/ome_arrow/tensor.py (1)

134-182: Consider returning planned values from _kwargs without materializing for layout.

Properties like layout could return self._kwargs["layout"] when it's set, avoiding an expensive collect() call just to read metadata the user already provided. The docstrings do warn about materialization cost, so this is not a bug — just a potential optimization.

♻️ Example for `layout`
     `@property`
     def layout(self) -> str:
-        """Return the effective tensor layout.
-
-        Note:
-            Accessing this property calls ``collect()`` and may materialize data
-            from source files (for example Parquet/TIFF), which can be expensive.
-        """
-        return self.collect().layout
+        """Return the effective tensor layout.
+
+        When a layout was explicitly set, this returns it without
+        materializing. Otherwise, delegates to the concrete TensorView.
+        """
+        if self._kwargs.get("layout") is not None:
+            return self._kwargs["layout"]
+        return self.collect().layout
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/tensor.py` around lines 134 - 182, The layout property
currently always calls self.collect().layout and may materialize data; modify
the layout `@property` in the Tensor class to first check for a pre-specified
value in self._kwargs (e.g., if "layout" in self._kwargs: return
self._kwargs["layout"]) and only call self.collect().layout as a fallback,
preserving existing behavior and docstring; reference the layout property and
the internal _kwargs storage when making the change.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
src/ome_arrow/tensor.py (2)

108-132: _UNSET sentinel correctly implemented; type annotations on select() parameters are loosened to Any.

The sentinel logic addresses the previously flagged issue — only provided values are merged into _kwargs. However, using Any for all six parameters sacrifices the type-checker enforcement that the __init__ signature provides (int | slice | Sequence[int] | None, etc.). Static analysis tools will silently accept view.select(t="oops").

♻️ Restore specific type annotations alongside the sentinel

Define a typed sentinel class to allow proper union annotations:

+class _UnsetType:
+    """Singleton sentinel for distinguishing 'not provided' from None."""
+    _instance: "_UnsetType | None" = None
+    def __new__(cls) -> "_UnsetType":
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
-_UNSET = object()
+_UNSET = _UnsetType()

Then in select():

 def select(
     self,
     *,
-    t: Any = _UNSET,
-    z: Any = _UNSET,
-    c: Any = _UNSET,
-    roi: Any = _UNSET,
-    roi3d: Any = _UNSET,
-    tile: Any = _UNSET,
+    t: int | slice | Sequence[int] | None | _UnsetType = _UNSET,
+    z: int | slice | Sequence[int] | None | _UnsetType = _UNSET,
+    c: int | slice | Sequence[int] | None | _UnsetType = _UNSET,
+    roi: tuple[int, int, int, int] | None | _UnsetType = _UNSET,
+    roi3d: tuple[int, int, int, int, int, int] | None | _UnsetType = _UNSET,
+    tile: tuple[int, int] | None | _UnsetType = _UNSET,
 ) -> "LazyTensorView":
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/tensor.py` around lines 108 - 132, The select method removed
precise parameter types by using Any for t,z,c,roi,roi3d,tile which lets callers
pass invalid types; restore the original typed annotations (e.g., int | slice |
Sequence[int] | None or whatever each parameter expects) and keep the sentinel
by defining a typed sentinel class (e.g., class _Unset: pass; _UNSET: _Unset =
_Unset()) so you can annotate parameters as Union[ExpectedType, _Unset]; update
the select signature to use these unions and keep the existing sentinel checks
and _spawn(**updates) logic in select and the sentinel definition near
LazyTensorView/__init__ so static type checkers enforce correct argument types
while preserving optionality.

98-102: collect() correctly caches _resolved; optional: guard against duplicate invocations under concurrency.

The lazy memoization pattern is correct for single-threaded usage. If collect() is ever called concurrently (e.g., in a thread-pool map), both threads could observe _resolved is None simultaneously, resulting in loader() being invoked twice. Given the current codebase shows no evidence of concurrent LazyTensorView usage this is low-risk, but worth documenting.

🔒 Optional: make collect() idempotent under basic concurrent access
+import threading
+
 class LazyTensorView:
     def __init__(self, ...) -> None:
         ...
         self._resolved: TensorView | None = None
+        self._lock = threading.Lock()

     def collect(self) -> "TensorView":
         """Materialize this lazy plan into a concrete TensorView."""
-        if self._resolved is None:
-            self._resolved = TensorView(self._loader(), **self._kwargs)
-        return self._resolved
+        if self._resolved is not None:
+            return self._resolved
+        with self._lock:
+            if self._resolved is None:
+                self._resolved = TensorView(self._loader(), **self._kwargs)
+        return self._resolved
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ome_arrow/tensor.py` around lines 98 - 102, The collect() method
currently lazily sets self._resolved via TensorView(self._loader(),
**self._kwargs) but can call _loader twice if invoked concurrently; fix by
adding a lightweight concurrency guard: introduce a per-instance lock (e.g.,
threading.Lock) used in collect() with a short double-checked pattern so threads
first check if self._resolved is None, acquire the lock, check again, and only
then call self._loader() and assign self._resolved; reference the collect
method, self._resolved, self._loader, self._kwargs, and TensorView when making
the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ome_arrow/tensor.py`:
- Around line 144-152: The device property currently always calls collect(),
causing unnecessary materialization; change its implementation to mirror the
layout short-circuit pattern: if the tensor is lazy (e.g. has a loader/_loader
or is a TensorView whose TensorView.device is hardcoded to "cpu") return "cpu"
immediately, otherwise call and return self.collect().device; update the device
property to check the lazy/loader condition first and only call collect() when
not lazy.
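
A sketch of that short-circuit, reusing the _resolved attribute shown earlier in this review (lazy plans are host-backed, matching TensorView.device being hardcoded to "cpu" per the comment above):

@property
def device(self) -> str:
    """Return the device string.

    Lazy plans are always backed by host memory, so "cpu" can be answered
    without materializing; once resolved, defer to the concrete TensorView.
    """
    if self._resolved is None:
        return "cpu"
    return self.collect().device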


@d33bs d33bs requested a review from MikeLippincott February 20, 2026 15:58


Development

Successfully merging this pull request may close these issues.

Explore how OME Arrow slices perform from files

1 participant