Skip to content

Fix/full znh5md support#16

Merged
PythonFZ merged 6 commits intomainfrom
fix/full-znh5md-support
Mar 10, 2026
Merged

Fix/full znh5md support#16
PythonFZ merged 6 commits intomainfrom
fix/full-znh5md-support

Conversation

@PythonFZ
Copy link
Copy Markdown
Member

@PythonFZ PythonFZ commented Mar 10, 2026

Summary by CodeRabbit

  • New Features

    • Better detection and support for H5MD content inside .h5 files for more reliable format selection.
    • Improved per-atom data detection for richer atom attribute handling.
  • Performance

    • Faster and more reliable path resolution via a new path-caching mechanism.
  • Tests

    • Added comprehensive round-trip interoperability tests to validate cross-format fidelity.

PythonFZ and others added 4 commits March 10, 2026 19:30
Parametrized tests over all s22_* fixtures verify:
- znh5md write -> asebytes read (catches H5MD path mapping bug with
  per-atom calc results like energies, stresses, magmoms)
- asebytes write -> znh5md read
- Full bidirectional round-trip
- *.h5 extension with H5MD content (catches registry routing bug)

9 tests fail RED, exposing two bugs:
1. H5MDStore path mapping loses actual H5MD paths on round-trip
2. Registry sends *.h5 H5MD files to RaggedColumnarBackend

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add _is_h5md_file() helper that checks for "h5md" group in HDF5 files
- Insert sniffing logic in resolve_backend to redirect *.h5 files
  containing H5MD metadata to the H5MD backend

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…roperties

Properties like energies, stresses, magmoms placed by znh5md in
/particles/ were lost on read because _column_to_h5() statically
mapped them to /observables/. Now _walk_elements() caches the actual
H5 path during discovery, and _get_ds() checks the cache first.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In _discover(), columns from list_arrays() are known to exist, so the
has_array() check was redundant and broke per-atom detection for columns
whose _column_to_h5() mapping disagreed with the actual H5 path (e.g.
per-atom energies placed in /particles/ by znh5md).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1d0871b-1b05-436d-b81d-f35bc0350c08

📥 Commits

Reviewing files that changed from the base of the PR and between 5d95065 and 20ce0a0.

📒 Files selected for processing (1)
  • tests/contract/test_znh5md_roundtrip.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/contract/test_znh5md_roundtrip.py

📝 Walkthrough

Walkthrough

Detect H5MD metadata inside .h5 files to select H5MD-specific backends, add a path cache mapping column names to HDF5 paths, change per-atom detection in the H5MD backend discovery to use HDF5 path/ndim, and add bidirectional znh5md ↔ asebytes round-trip tests.

Changes

Cohort / File(s) Summary
Registry: H5MD sniffing
src/asebytes/_registry.py
Added internal _is_h5md_file(path) that tries to open the file with h5py and checks for an h5md key. resolve_backend() now, for object-layer no-scheme URIs with *.h5 candidates, uses this sniff to switch candidates to *.h5md when H5MD content is detected.
H5MD backend discovery
src/asebytes/h5md/_backend.py
Adjusted _discover() per-atom detection: infer per-atom status from HDF5 path prefix (/particles/) and ndim >= 2 (excluding cell and pbc) instead of relying on has_array() and shape[1] checks.
H5MD store path caching & resolution
src/asebytes/h5md/_store.py
Added path_cache mapping column ↔ HDF5 path, introduced _resolve_h5_path(key) to consult the cache before _column_to_h5, updated _get_ds() and has_array() to use the resolver, and populate path_cache during _walk_elements() when discovering datasets.
Contract tests: znh5md interoperability
tests/contract/test_znh5md_roundtrip.py
New tests exercising znh5md ↔ asebytes round-trips across fixtures: read znh5md with ASEIO, write with ASEIO and verify via znh5md.IO, and full bidirectional roundtrip checks (positions, counts, numbers) across frames.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Registry as resolve_backend()
    participant Detector as _is_h5md_file()
    participant h5py
    participant Backend as BackendCandidates

    User->>Registry: resolve(path, layer, patterns)
    activate Registry
    note over Registry: If no URI scheme and layer == "object"\nand pattern candidates include *.h5
    Registry->>Detector: _is_h5md_file(path)
    activate Detector
    Detector->>h5py: open(path)
    alt HDF5 with h5md key
        h5py-->>Detector: returns file with 'h5md'
        Detector-->>Registry: True
    else Not HDF5 or no h5md / error
        h5py-->>Detector: exception or no key
        Detector-->>Registry: False
    end
    deactivate Detector

    alt H5MD detected
        Registry->>Backend: select *.h5md candidates
        Backend-->>Registry: H5MD-specific candidates
    else keep original
        Backend-->>Registry: original candidates
    end
    Registry-->>User: resolved backend
    deactivate Registry
Loading
sequenceDiagram
    participant Discovery as _walk_elements()
    participant PathCache as path_cache
    participant Store as _get_ds()
    participant Resolver as _column_to_h5()

    Discovery->>Discovery: traverse HDF5 groups
    loop for each discovered dataset group
        Discovery->>Store: compute h5_path
        Store->>PathCache: cache mapping (column ← h5_path)
    end

    Store->>PathCache: lookup(column)
    alt cache hit
        PathCache-->>Store: return cached h5_path
    else cache miss
        PathCache-->>Store: miss
        Store->>Resolver: _column_to_h5(column)
        Resolver-->>Store: h5_path
    end
    Store-->>Caller: dataset at h5_path
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I sniffed the H5 with eager nose,

found h5md where the metadata grows,
cached the paths and mapped each name,
round-trips hop back—numbers match the same,
a little rabbit cheers the interoperable game!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix/full znh5md support' accurately reflects the main objective of the pull request, which adds comprehensive H5MD support including H5MD detection, improved per-atom column detection, path caching, and interoperability tests with znh5md.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/full-znh5md-support

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/asebytes/h5md/_store.py`:
- Line 111: The path cache (_path_cache) is only used by _get_ds(), causing
has_array() and other probes using _column_to_h5() to miss previously discovered
foreign paths; centralize path resolution into a single helper (e.g.,
_resolve_column_path or update _column_to_h5) that looks up and populates
self._path_cache for a given column name and always returns the actual HDF5
path, then change _get_ds(), has_array(), and any other callers (references
around _get_ds, has_array, and the blocks at lines ~218-225 and ~452-456) to
call this resolved-path helper instead of directly using the old _column_to_h5()
logic so existence checks and reads use the same cached path. Ensure the helper
is thread-safe relative to _path_cache mutation if applicable.

In `@tests/contract/test_znh5md_roundtrip.py`:
- Around line 75-83: The test currently slices result to n before comparing,
which masks leaks of padded atoms; change the checks to first assert len(result)
== n (e.g., using assert len(result) == n or
np.testing.assert_equal(len(result), n)) and then compare the full arrays (call
np.testing.assert_allclose(result.positions, expected.positions, ...) and
np.testing.assert_array_equal(result.numbers, expected.numbers, ...)) instead of
result[:n]; make the same update for the analogous block later in the file where
positions/numbers are compared.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28b39aca-79f9-476b-8536-f9948807373d

📥 Commits

Reviewing files that changed from the base of the PR and between cd81592 and 229b6ba.

📒 Files selected for processing (4)
  • src/asebytes/_registry.py
  • src/asebytes/h5md/_backend.py
  • src/asebytes/h5md/_store.py
  • tests/contract/test_znh5md_roundtrip.py

PythonFZ and others added 2 commits March 10, 2026 21:02
- Use H5 path (/particles/ vs /observables/) for per-atom detection
  instead of shape heuristic that misclassified frame-level vectors
- Centralize path resolution with _resolve_h5_path helper used by
  both _get_ds and has_array to fix foreign column lookups
- Remove test slicing ([:n]) that masked potential padding leaks and
  add explicit atom count assertions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace weak positions/numbers-only assertions in TestAsebytesToZnH5MD
and TestBidirectionalRoundtrip with assert_atoms_equal to catch
regressions in cell, pbc, info, arrays, and calc properties. Merge
TestH5ExtensionWithH5MDContent into TestZnH5MDToAsebytes as a
parametrized extension (.h5md/.h5) to remove duplication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PythonFZ PythonFZ merged commit 79c1b63 into main Mar 10, 2026
5 checks passed
@PythonFZ PythonFZ deleted the fix/full-znh5md-support branch March 10, 2026 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant