Rework save system with ZarrV3 two-tier storage #6

Open

JoeyBF wants to merge 6 commits into master from zarrs-save-rework

Conversation

Owner

@JoeyBF commented Apr 7, 2026

Summary

Replaces the ad-hoc save format with a ZarrV3-backed layout via the zarrs crate. Clean break from the old format — no migration script.

The full design doc is in the commit message; this PR description is just the headline points.

Why

| Problem | Master | This PR |
| --- | --- | --- |
| File count | 720K files at stem 200; overflows the 40 TB allocation at stem 300 | Sharding consolidates the long tail; well below master |
| Compression | None at write time, external zstd | zstd in-format; CRC32C on every chunk |
| Integrity | Adler32 tail | CRC32C per chunk |
| Format | Hand-rolled to_bytes/from_bytes plumbing | Structured zarr arrays matching the actual data shape |

Architecture

Two tiers within one zarr store:

Shard tier (small kinds)

kernel, differential, augmentation_qi, nassau_differential, secondary_*, chain_map, chain_homotopy. One sharded vlen-bytes array per kind, shape (N_SPAN, S_SPAN[, IDX_SPAN]) = (4096, 1024[, 256]), shard (8, 8[, 8]), inner chunk [1, 1[, 1]]. CRC32C per shard, no zstd. Tens of thousands of zarr elements collapse into hundreds of shard files per kind.

Stream tier (large kinds, structured)

res_qi and nassau_qi get groups, not single arrays:

  • res_qi/: pivots/ (1D i64) + rows/ (2D u8 [image_dim, num_limbs * 8], chunked over rows). ResQiReader walks pivots and fetches matrix chunks on demand.
  • nassau_qi/: commands/ (1D vlen-bytes, one element per state-machine command — signature change, fix, or pivot operation with embedded lift+image). Header (target_dim, zero_mask_dim, subalgebra_profile, num_commands, finished) lives in group attributes. NassauQiReader yields a typed NassauCommand enum one at a time.
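The command stream can be pictured as an enum consumed one element at a time. This is an illustrative sketch: the variant and field names below are assumptions about the PR's `NassauCommand`, not its actual definition.

```rust
// Hypothetical shape of the streamed command enum (names are illustrative).
#[derive(Debug, PartialEq)]
enum NassauCommand {
    SignatureChange(u64),
    Fix { row: usize },
    Pivot { column: usize, lift: Vec<u8>, image: Vec<u8> },
}

// A reader yields one command at a time, so peak memory is bounded by the
// largest single command, never the whole multi-GB stream.
fn apply_all(commands: impl Iterator<Item = NassauCommand>) -> usize {
    let mut applied = 0;
    for cmd in commands {
        match cmd {
            NassauCommand::SignatureChange(_) => {} // real work elided
            NassauCommand::Fix { .. } => {}
            NassauCommand::Pivot { .. } => {}
        }
        applied += 1;
    }
    applied
}
```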

The finished group attribute is the source of truth for atomicity: writers dropped before finish() leave it false and the matching reader treats the QI as missing so callers recompute. Memory during read or write is bounded by chunk shape, regardless of total payload size — the multi-GB nassau_qi case never materialises.
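The finished-flag protocol can be sketched with a plain `HashMap` standing in for the zarr group attributes; the writer/reader names here are illustrative, not the PR's actual types.

```rust
use std::collections::HashMap;

// Writer creates the QI group unfinished; only finish() flips the flag.
struct QiWriter<'a> {
    attrs: &'a mut HashMap<String, bool>,
}

impl<'a> QiWriter<'a> {
    fn new(attrs: &'a mut HashMap<String, bool>) -> Self {
        attrs.insert("finished".into(), false); // created unfinished
        Self { attrs }
    }
    fn finish(self) {
        self.attrs.insert("finished".into(), true); // set only at the very end
    }
}

// Anything not explicitly finished is treated as missing, so a writer dropped
// mid-stream causes a recompute rather than a read of torn data.
fn qi_present(attrs: &HashMap<String, bool>) -> bool {
    attrs.get("finished").copied().unwrap_or(false)
}
```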

Coordinates

Shard arrays are indexed by (n, s) (= MultiDegree<2>::coords()), not (s, t). Tighter square bound, generalises to higher-N gradings. Negative n is supported via a hidden N_MIN = -1024 offset (zarr v3 has no native negative indices); the bound is generous enough that no caller needs to know about it.
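The negative-n shift is simple arithmetic. `N_MIN` matches the value quoted above; the helper itself is a sketch, not the PR's code.

```rust
// zarr v3 chunk indices are unsigned, so a fixed offset maps
// possibly-negative n into the array's coordinate space.
const N_MIN: i64 = -1024;

fn zarr_n(n: i64) -> u64 {
    assert!(n >= N_MIN, "n below the supported bound");
    (n - N_MIN) as u64 // hidden from callers; sparse arrays make the
                       // empty negative region essentially free
}
```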

Public coords API is the const-generic SaveCoords<const N: usize> trait, with Bidegree: SaveCoords<2> and BidegreeGenerator: SaveCoords<3> — methods that only make sense in one dimension reject the other type at compile time.
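A minimal sketch of how such a const-generic trait rejects the wrong dimension at compile time; the actual trait surface in the PR may differ, and signed coordinates are elided here for brevity.

```rust
// Illustrative version of the coords trait.
trait SaveCoords<const N: usize> {
    fn coords(&self) -> [u64; N];
}

struct Bidegree { n: u64, s: u64 }
struct BidegreeGenerator { n: u64, s: u64, idx: u64 }

impl SaveCoords<2> for Bidegree {
    fn coords(&self) -> [u64; 2] { [self.n, self.s] }
}
impl SaveCoords<3> for BidegreeGenerator {
    fn coords(&self) -> [u64; 3] { [self.n, self.s, self.idx] }
}

// Only 2D coordinates are accepted here; passing a BidegreeGenerator is a
// type error rather than a runtime panic.
fn shard_key(c: &impl SaveCoords<2>) -> [u64; 2] {
    c.coords()
}
```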

Subgroups for named homomorphisms

Named ResolutionHomomorphism saves into products/{name}/; named ChainHomotopy saves into homotopies/{left}__{right}/. Anonymous endpoints disable saving entirely (matching master and fixing a latent collision bug from examples/massey.rs). The four secondary kinds therefore appear in three places: at the root for the main resolution, under products/{name}/ for a named hom's secondary lift, and under homotopies/{l}__{r}/ for a chain homotopy's secondary lift. Subgroups share the underlying FilesystemStore via Arc clone — they're just a path prefix, not a separate store.
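The "path prefix, not a separate store" point can be sketched as follows; `Store` is a stand-in for the zarrs `FilesystemStore` and the struct/method names are assumptions.

```rust
use std::sync::Arc;

struct Store; // stand-in for the on-disk zarrs store

struct SaveStore {
    store: Arc<Store>,
    prefix: String,
}

impl SaveStore {
    // A subgroup shares the same underlying store; only the prefix grows.
    fn subgroup(&self, name: &str) -> SaveStore {
        SaveStore {
            store: Arc::clone(&self.store), // shared, not reopened
            prefix: format!("{}{}/", self.prefix, name),
        }
    }
}
```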

Concurrency

zarrs::Array::store_array_subset documents (since 0.14) that callers must serialise concurrent invocations on regions sharing chunks — the per-chunk locks were removed because the old default deadlocked. We honour the contract with a per-SaveKind Arc<Mutex<()>>. We also pass CodecOptions::with_concurrent_target(1) on sharded writes, because the sharding codec uses into_par_iter internally and a rayon worker holding our std::sync::Mutex could otherwise deadlock via work-stealing.
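The locking discipline can be sketched with the zarrs call stubbed out; the `SaveKind` variants and `Writer` type are illustrative. In the PR the guarded call is `store_array_subset_opt` with a concurrent target of 1, so no rayon worker ever blocks while holding the lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum SaveKind { Kernel, Differential } // illustrative subset

struct Writer {
    locks: HashMap<SaveKind, Arc<Mutex<()>>>,
}

impl Writer {
    fn write(&self, kind: SaveKind, payload: &[u8]) -> usize {
        let lock = &self.locks[&kind];
        // Held only across the store call: serialises writers whose regions
        // may share a chunk, per the documented zarrs contract.
        let _guard = lock.lock().unwrap();
        payload.len() // stand-in for store_array_subset_opt
    }
}
```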

Verified by reproduction: without these two fixes, secondary_massey --features nassau,concurrent panics on resume from a persistent save dir within a few iterations with Expected header with 1 elements, got 0 from the vlen-bytes codec. With them, 10 consecutive runs produce identical md5s.

Other notes

  • bitcode (with the serde feature) replaces hand-rolled to_bytes/from_bytes for everything in the shard tier; the relevant fp types get #[derive(Serialize, Deserialize)]. aligned-vec gains its serde feature.
  • FpVector::num_limbs becomes pub so structured row buffers can be sized without poking at private internals.
  • QuasiInverse::stream_quasi_inverse, MilnorSubalgebra::{to,from}_bytes, the Magic enum, the SaveDirectory::Split HPC workaround, and ZarrWriter/ZarrReader (the io::Write/io::Read blob shims) are all gone.

Test plan

  • All 6 tests/save_load_resolution.rs tests pass (debug + release)
  • Full cargo test is green across the workspace
  • Manual reproduction at S_2 stem 30/60 shows file count well below master
  • secondary_massey --features nassau,concurrent against a persistent save dir, 10 consecutive runs → identical md5s
  • Production run at stem 200+ to confirm the 40 TB scenario is fixed
  • Production-scale nassau_qi (multi-GB) read/write end-to-end

🤖 Generated with Claude Code

JoeyBF force-pushed the zarrs-save-rework branch 3 times, most recently from 53e6ce6 to 8cd9ff1 on April 7, 2026 at 20:38
JoeyBF and others added 2 commits April 7, 2026 23:08
Replaces the ad-hoc save format with a ZarrV3-backed layout via the
`zarrs` crate. Motivations:

- **Small file problem**: at stem 200 master produces ~720K files
  totalling ~780 GB, almost all <1 KB. Stem 300 overflows our 40 TB
  allocation. Sharding consolidates the long tail.
- **No write-time compression**: master writes uncompressed and relies
  on external zstd. We want compression in-format.
- **Weak integrity**: master uses an Adler32 tail. We want CRC32C on
  every chunk.
- **Ad-hoc framing**: hand-rolled `to_bytes`/`from_bytes` plumbing
  scattered across the codebase. The new layout is structured zarr
  arrays that match the actual shape of the data.

This is a clean break — no migration from the old format.

The save store is one zarr v3 store on a `FilesystemStore`. Inside it,
each kind lives in one of two tiers depending on its size profile.

**Shard tier.** Used for: `kernel`, `differential`, `augmentation_qi`,
`nassau_differential`, `secondary_composite`, `secondary_intermediate`,
`secondary_homotopy`, `chain_map`, `chain_homotopy`. These are all
under ~1 MB per element with a long tail of <1 KB entries.

One sharded vlen-bytes array per kind at the top of the store. Shape
is `(N_SPAN, S_SPAN[, IDX_SPAN])` = `(4096, 1024[, 256])`, shard shape
`(SHARD_N, SHARD_S[, SHARD_IDX])` = `(8, 8[, 8])`, inner chunk
`[1, 1[, 1]]`. CRC32C on every shard, no zstd (data is small enough
that compression doesn't help and the per-shard read-modify-write is
the dominant cost). Tens of thousands of zarr elements collapse into
hundreds of shard files per kind.

`SecondaryComposite` and `SecondaryIntermediate` are 3D — the third
coordinate is the intra-bidegree basis index. Everything else is 2D.

**Stream tier.** Used for: `res_qi`, `nassau_qi`. These can reach
multi-GB per bidegree. They live in *groups* — not single arrays — at
`qi/n{n}_s{s}/{kind}/`, with kind-specific sub-arrays:

- **`res_qi/`**: `pivots/` (1D `i64`, single chunk) + `rows/` (2D
  `u8` `[image_dim, num_limbs * 8]`, chunked over rows). The
  `ResQiReader` walks pivots and fetches matrix chunks on demand.
- **`nassau_qi/`**: `commands/` (1D vlen-bytes, one element per
  command of the underlying state machine: signature change, fix,
  or pivot operation with embedded lift+image). Header
  (`target_dim`, `zero_mask_dim`, `subalgebra_profile`,
  `num_commands`, `finished`) lives in the group's `zarr.json`
  attributes. The `NassauQiReader` yields `NassauCommand` enum
  values one at a time.

The `finished` group attribute is the source of truth for atomicity:
a writer dropped before `finish()` leaves `finished = false` and the
matching reader treats the QI as missing so callers recompute. Memory
during read or write is bounded by the chunk shape (up to ~10 MB per
chunk for nassau_qi commands, or `CHUNK_RES_QI_ROWS * row_bytes` for
res_qi rows), regardless of total payload size.
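The chunk-bounded read pattern can be sketched generically; `chunk_rows` mirrors the role of `CHUNK_RES_QI_ROWS`, and the fetch itself is stubbed out since the real code goes through zarrs partial reads.

```rust
// Visit every row while keeping only one chunk's worth resident at a time.
fn for_each_row(total_rows: usize, chunk_rows: usize, mut visit: impl FnMut(usize)) {
    let mut row = 0;
    while row < total_rows {
        let end = (row + chunk_rows).min(total_rows);
        // A real reader would fetch rows row..end here; memory stays
        // O(chunk_rows * row_bytes) regardless of total_rows.
        for r in row..end {
            visit(r);
        }
        row = end;
    }
}
```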

Shard arrays use `(n, s)` (= `MultiDegree<2>::coords()`), not
`(s, t)`. This matches the natural display convention, generalises
to `MultiDegree<N>` for higher-N gradings, and lets us use a tighter
square bound. n can be negative (RP_-k_inf, A-mod-…[-k]); zarr v3
has no native support for negative chunk indices, so internally we
shift by a fixed `N_MIN = -1024` offset before using n as a zarr
coordinate. The offset is hidden from callers and the bound is
extremely generous; sparse zarr arrays cost essentially nothing for
the empty negative regions.

The public coords API is the `SaveCoords<const N: usize>` trait:
`Bidegree: SaveCoords<2>` and `BidegreeGenerator: SaveCoords<3>`, so
methods that only make sense in one dimension can take
`impl SaveCoords<2>` and reject the other type at compile time.

A named `ResolutionHomomorphism` (and any
`SecondaryResolutionHomomorphism` derived from it) saves into a
per-name subgroup at
`products/{name}/`. The subgroup is created by
`ZarrSaveStore::subgroup(name)`, which shares the underlying
filesystem store via `Arc` clone and just prepends a path prefix.
Anonymous homs (`name.is_empty()`) get `SaveDirectory::None` and
skip saving entirely, matching master's behaviour. This fixes a
latent collision bug where anonymous homs in `examples/massey.rs`
would have shared the same on-disk slot.

`ChainHomotopy` between two named homs `left` and `right` gets its
own top-level subgroup at `homotopies/{left.name}__{right.name}/`,
not nested under either constituent map. Anonymous endpoints disable
saving here too.

The four secondary kinds therefore appear in three places:

- `secondary_composite/`, `secondary_intermediate/`,
  `secondary_homotopy/` at the root → main resolution's secondary lift
- `products/{name}/secondary_*` → secondary lift of a named hom
- `homotopies/{l}__{r}/secondary_*` → secondary lift of a chain
  homotopy

`zarrs::Array::store_array_subset` documents (since 0.14) that
callers must serialise concurrent invocations on regions sharing
chunks — the per-chunk locks were removed because the old default
implementation deadlocked. We honour that contract with a per-`SaveKind`
`Arc<Mutex<()>>` in `ZarrSaveStore`. The lock is held only across the
`store_array_subset_opt` call.

We also pass `CodecOptions::with_concurrent_target(1)` on these
sharded writes. The sharding codec uses `into_par_iter` internally,
and a rayon worker holding our `std::sync::Mutex` would otherwise
join on inner tasks and could be reassigned to another locked task
on the same kind, deadlocking. Sequential codec execution avoids
the join entirely. (Verified by reproducing — without these two
fixes, the `secondary_massey` example with `--features
nassau,concurrent` panics on resume from a persistent save dir
inside a few iterations with `Expected header with 1 elements,
got 0` from the vlen-bytes codec.)

Shard-tier arrays are created lazily on first write (per
`SaveKind` per group), so subgroups only contain the kinds they
actually use — no empty `kernel/` directory inside `products/foo/`.
The "already created" set and the per-kind write locks live in
`DashMap`/`DashSet` for lock-free reads on the hot path.
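The lazy-creation check can be sketched with std types; the PR uses `DashSet` for lock-free reads, and a `RwLock<HashSet>` stands in here so the example needs only the standard library.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

struct Arrays {
    created: RwLock<HashSet<(String, &'static str)>>,
}

impl Arrays {
    // Returns true only for the first writer of this (group, kind), which is
    // the one that actually creates the zarr array on disk.
    fn ensure_created(&self, group: &str, kind: &'static str) -> bool {
        let key = (group.to_string(), kind);
        if self.created.read().unwrap().contains(&key) {
            return false; // hot path: already created, read lock only
        }
        self.created.write().unwrap().insert(key)
    }
}
```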

`bitcode` (with the `serde` feature) replaces hand-rolled
`to_bytes`/`from_bytes` for everything in the shard tier. The fp
crate types (`Fp<P>`, `Matrix`, `Subspace`, `QuasiInverse`,
`FqVector`) get `#[derive(Serialize, Deserialize)]`; `aligned-vec`
gains its `serde` feature; `FpVector`'s manual impl is rewritten
to round-trip via `FqVector<Fp<ValidPrime>>` (the previous impl
panicked on deserialise).

For the stream tier, the structured zarr layout replaces the old
`to_bytes` framing entirely. `QuasiInverse::stream_quasi_inverse`,
`MilnorSubalgebra::{to,from}_bytes`, the `Magic` enum, and the
`SaveDirectory::Split` variant (an HPC workaround) are all gone.

`FpVector::num_limbs(p, len)` is now `pub` so external callers
can size raw row buffers without poking at the private
`field_internal::FieldInternal` trait.
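The buffer-sizing arithmetic this enables is just limb rounding. A sketch under assumptions: the real `FpVector::num_limbs` takes the prime and derives the per-entry bit width internally, while this illustrative version takes the bit width directly.

```rust
// How many 64-bit limbs a length-`len` vector needs when each entry
// occupies `bits_per_entry` bits (entries are packed, never split).
fn num_limbs(bits_per_entry: usize, len: usize) -> usize {
    let entries_per_limb = 64 / bits_per_entry;
    len.div_ceil(entries_per_limb)
}

// Row buffer size in bytes, matching the rows/ array's second dimension
// (num_limbs * 8 in the layout described above).
fn row_bytes(bits_per_entry: usize, len: usize) -> usize {
    num_limbs(bits_per_entry, len) * 8
}
```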

- All 6 `tests/save_load_resolution.rs` tests pass (debug + release)
- `cargo test` is green across the workspace
- Manual reproduction at S_2 stem 30/60 shows the file count is
  bounded and well below master's

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* `save.rs`: move the two-tier architecture comment into a proper
  module-level doc, and reorder the file so `ZarrSaveStore` and the
  reader/writer types come first, with the lower-level `SaveCoords` /
  `SaveKind` / `SaveDirectory` types at the end. Apply the 1-line
  summary + blank line + body doc convention and wrap at 100 cols.
* `nassau.rs`: extract the body of `apply_quasi_inverse` into a
  fallible `apply_quasi_inverse_fallible` helper that returns
  `anyhow::Result<bool>`, so `?` propagates errors from the structured
  zarr reader and `FpVector::update_from_bytes` cleanly. The
  `ChainComplex::apply_quasi_inverse` trait method is now a one-line
  delegate that panics with a descriptive message if the fallible
  variant fails.
JoeyBF force-pushed the zarrs-save-rework branch from 8cd9ff1 to 5f8d841 on April 8, 2026 at 03:21
JoeyBF added 4 commits April 8, 2026 00:27
Previously, sseq_gui's wasm build broke because ext pulls in zarrs →
zarrs_filesystem → positioned-io::RandomAccessFile, which is gated to
`cfg(any(windows, unix))` and doesn't compile for wasm32-unknown-unknown.

Fix by declaring `zarrs` with target-specific feature sets:

  * native: `filesystem` + sharding + crc32c + zstd (unchanged)
  * wasm32: just sharding + crc32c + zstd (no `filesystem`, no
    zarrs_filesystem in the dep graph)

and cfg-selecting the concrete store in `ZarrSaveStore::create`:

  * native: `FilesystemStore::new(&path)` — real on-disk persistence
  * wasm32: `MemoryStore::new()` — in-memory sink, dropped at session end

The wasm frontend has no filesystem to persist to anyway, so the memory
store is a no-op sink. All other code paths (shard tier, stream tier,
subgroups, readers, writers) are identical on both targets — no stubs,
no conditional compilation at call sites.
The previous wasm swap to `MemoryStore` was necessary but incomplete.
Three further issues showed up once `cargo clippy --target
wasm32-unknown-unknown` got past the build stage:

  1. `zarrs_filesystem` wasn't the only wasm-hostile dep: `zstd-sys`'s
     C build expects POSIX `qsort_r`, which wasm's libc shim doesn't
     provide. Drop the `zstd` feature from `zarrs` on wasm and route
     the three `ZstdCodec::new` call sites through a new
     `stream_tier_codecs()` helper that returns `[zstd, crc32c]` on
     native and `[crc32c]` alone on wasm. The WASM memory store is
     ephemeral so skipping compression doesn't matter.

  2. On wasm, zarrs's storage trait objects use `MaybeSend + MaybeSync`
     (no-ops on wasm), so `Arc<dyn ReadableWritableListableStorageTraits>`
     is not `Send`/`Sync`. `ChainComplex: Send + Sync` then rejected
     `MuResolution<_>` because it transitively contains that `Arc`. The
     principled fix (a `MaybeSend`/`MaybeSync` pattern matching
     zarrs/zarrs#242) would ripple through `SecondaryLift`,
     `ChainHomotopy`, and many `+ Sync` bounds in `ext`. Instead: force
     `Send + Sync` on `ZarrSaveStore` with an `unsafe impl` gated to
     `cfg(target_arch = "wasm32")`. Sound because wasm32 is
     single-threaded, so the cross-thread guarantees are vacuously
     satisfied.

  3. Several zarrs error types (`ArrayError`, `CodecError`) contain
     `Arc<dyn DataTypeTraits>` that similarly lack `Send + Sync` on
     wasm. `anyhow::Error` requires `Send + Sync`, so `?`-converting
     them via the blanket `From` impl fails. Route those errors through
     a new `zarr_err` helper that formats via `Display` — loses the
     source chain on wasm but preserves the message, and `.map_err(
     zarr_err)?` works identically on both targets.
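Items 2 and 3 can be sketched together; the stand-in type below is illustrative, and a boxed std error replaces the `anyhow::Error` the PR's `zarr_err` actually returns.

```rust
use std::fmt::Display;

// Item 3: flatten a non-Send error into a plain string so it can cross into
// an error type that requires Send + Sync. Loses the source chain, keeps
// the message.
fn zarr_err(e: impl Display) -> Box<dyn std::error::Error + Send + Sync> {
    e.to_string().into()
}

// Item 2: the cfg-gated Send/Sync assertion, shown on a stand-in type.
// Sound only because wasm32 is single-threaded, so the cross-thread
// guarantees are vacuously satisfied.
struct ZarrSaveStoreLike {
    _inner: *const (), // pretend: the non-Send storage Arc on wasm
}

#[cfg(target_arch = "wasm32")]
unsafe impl Send for ZarrSaveStoreLike {}
#[cfg(target_arch = "wasm32")]
unsafe impl Sync for ZarrSaveStoreLike {}
```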

Verified: `cargo clippy --lib --target wasm32-unknown-unknown` passes,
`cargo build --lib --target wasm32-unknown-unknown --release` builds,
native `nix run ./ext#test` stays green (6/6 save_load_resolution).
bitcode is only used by ext/src/resolution.rs (where it's already
declared in ext/Cargo.toml). It snuck into fp's manifest as part of an
earlier zarr-branch experiment and survived the rebase onto upstream
master only because the auto-merge couldn't tell it conflicted with
the upstream removal. fp doesn't reference bitcode anywhere, so dropping
the dep is a no-op.
The pre-zarr save system stamped each file's header with
`algebra.magic()` so a later resume couldn't accidentally load Adem
data with the Milnor algebra (or vice versa). The zarr rewrite dropped
that check entirely; this commit puts it back at the store level instead
of per-file.

* `ZarrSaveStore::bind_to_algebra(magic, prime, prefix)` writes
  `algebra_magic`, `prime`, and `algebra_prefix` to the root group's
  attributes on first use, and on every subsequent use compares the
  stored magic against the caller's. A mismatch returns an
  `anyhow::Error` mentioning both magics, both algebra prefixes, and
  the store path. The arguments are raw values (not `&dyn Algebra`) so
  `save.rs` stays decoupled from the `algebra` crate.
* `MuResolution::new_with_save` and `nassau::Resolution::new_with_save`
  call `bind_to_algebra` immediately after constructing the
  `SaveDirectory`. Subgroups (`products/{name}`, `homotopies/{l}__{r}`)
  share the same underlying store and root attributes, so they inherit
  the check without a second call.
* Resurrects the `wrong_algebra` regression test from the pre-zarr
  `save_load_resolution.rs` (Adem first, then Milnor over the same
  dir, asserts the mismatch panic via `should_panic(expected = "different algebra")`).
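The store-level check reduces to compare-or-record on first use. A sketch with a `HashMap` standing in for the root group's attributes and only the magic compared; the real `bind_to_algebra` also records the prime and algebra prefix and reports the store path in its error.

```rust
use std::collections::HashMap;

fn bind_to_algebra(attrs: &mut HashMap<String, u64>, magic: u64) -> Result<(), String> {
    match attrs.get("algebra_magic") {
        None => {
            attrs.insert("algebra_magic".into(), magic); // first use: record
            Ok(())
        }
        Some(&stored) if stored == magic => Ok(()),
        Some(&stored) => Err(format!(
            "different algebra: store has magic {stored}, caller has {magic}"
        )),
    }
}
```

Subgroups share the same root attributes, so they inherit the check for free.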
