Rework save system with ZarrV3 two-tier storage #6

Open

JoeyBF wants to merge 6 commits into master from zarrs-save-rework

Conversation

Owner

@JoeyBF commented Apr 7, 2026

Summary

Replaces the ad-hoc save format with a ZarrV3-backed layout via the zarrs crate. Clean break from the old format — no migration script.

The full design doc is in the commit message; this PR description is just the headline points.

Why

| Problem | Master | This PR |
| --- | --- | --- |
| File count | 720K files at stem 200; overflows the 40 TB allocation at stem 300 | Sharding consolidates the long tail; well below master |
| Compression | None at write time, external zstd | zstd in-format; CRC32C on every chunk |
| Integrity | Adler32 tail | CRC32C per chunk |
| Format | Hand-rolled to_bytes/from_bytes plumbing | Structured zarr arrays matching the actual data shape |

Architecture

Two tiers within one zarr store:

Shard tier (small kinds)

kernel, differential, augmentation_qi, nassau_differential, secondary_*, chain_map, chain_homotopy. One sharded vlen-bytes array per kind, shape (N_SPAN, S_SPAN[, IDX_SPAN]) = (4096, 1024[, 256]), shard (8, 8[, 8]), inner chunk [1, 1[, 1]]. CRC32C per shard, no zstd. Tens of thousands of zarr elements collapse into hundreds of shard files per kind.

Stream tier (large kinds, structured)

res_qi and nassau_qi get groups, not single arrays:

  • res_qi/: pivots/ (1D i64) + rows/ (2D u8 [image_dim, num_limbs * 8], chunked over rows). ResQiReader walks pivots and fetches matrix chunks on demand.
  • nassau_qi/: commands/ (1D vlen-bytes, one element per state-machine command — signature change, fix, or pivot operation with embedded lift+image). Header (target_dim, zero_mask_dim, subalgebra_profile, num_commands, finished) lives in group attributes. NassauQiReader yields a typed NassauCommand enum one at a time.
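The command stream can be pictured as an enum consumed one element at a time. This is an illustrative sketch: the variant and field names below are assumptions about the PR's `NassauCommand`, not its actual definition.

```rust
// Hypothetical shape of the streamed command enum (names are illustrative).
#[derive(Debug, PartialEq)]
enum NassauCommand {
    SignatureChange(u64),
    Fix { row: usize },
    Pivot { column: usize, lift: Vec<u8>, image: Vec<u8> },
}

// A reader yields one command at a time, so peak memory is bounded by the
// largest single command, never the whole multi-GB stream.
fn apply_all(commands: impl Iterator<Item = NassauCommand>) -> usize {
    let mut applied = 0;
    for cmd in commands {
        match cmd {
            NassauCommand::SignatureChange(_) => {} // real work elided
            NassauCommand::Fix { .. } => {}
            NassauCommand::Pivot { .. } => {}
        }
        applied += 1;
    }
    applied
}
```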

The finished group attribute is the source of truth for atomicity: writers dropped before finish() leave it false and the matching reader treats the QI as missing so callers recompute. Memory during read or write is bounded by chunk shape, regardless of total payload size — the multi-GB nassau_qi case never materialises.
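The finished-flag protocol can be sketched with a plain `HashMap` standing in for the zarr group attributes; the writer/reader names here are illustrative, not the PR's actual types.

```rust
use std::collections::HashMap;

// Writer creates the QI group unfinished; only finish() flips the flag.
struct QiWriter<'a> {
    attrs: &'a mut HashMap<String, bool>,
}

impl<'a> QiWriter<'a> {
    fn new(attrs: &'a mut HashMap<String, bool>) -> Self {
        attrs.insert("finished".into(), false); // created unfinished
        Self { attrs }
    }
    fn finish(self) {
        self.attrs.insert("finished".into(), true); // set only at the very end
    }
}

// Anything not explicitly finished is treated as missing, so a writer dropped
// mid-stream causes a recompute rather than a read of torn data.
fn qi_present(attrs: &HashMap<String, bool>) -> bool {
    attrs.get("finished").copied().unwrap_or(false)
}
```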

Coordinates

Shard arrays are indexed by (n, s) (= MultiDegree<2>::coords()), not (s, t). Tighter square bound, generalises to higher-N gradings. Negative n is supported via a hidden N_MIN = -1024 offset (zarr v3 has no native negative indices); the bound is generous enough that no caller needs to know about it.
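The negative-n shift is simple arithmetic. `N_MIN` matches the value quoted above; the helper itself is a sketch, not the PR's code.

```rust
// zarr v3 chunk indices are unsigned, so a fixed offset maps
// possibly-negative n into the array's coordinate space.
const N_MIN: i64 = -1024;

fn zarr_n(n: i64) -> u64 {
    assert!(n >= N_MIN, "n below the supported bound");
    (n - N_MIN) as u64 // hidden from callers; sparse arrays make the
                       // empty negative region essentially free
}
```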

Public coords API is the const-generic SaveCoords<const N: usize> trait, with Bidegree: SaveCoords<2> and BidegreeGenerator: SaveCoords<3> — methods that only make sense in one dimension reject the other type at compile time.
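A minimal sketch of how such a const-generic trait rejects the wrong dimension at compile time; the actual trait surface in the PR may differ, and signed coordinates are elided here for brevity.

```rust
// Illustrative version of the coords trait.
trait SaveCoords<const N: usize> {
    fn coords(&self) -> [u64; N];
}

struct Bidegree { n: u64, s: u64 }
struct BidegreeGenerator { n: u64, s: u64, idx: u64 }

impl SaveCoords<2> for Bidegree {
    fn coords(&self) -> [u64; 2] { [self.n, self.s] }
}
impl SaveCoords<3> for BidegreeGenerator {
    fn coords(&self) -> [u64; 3] { [self.n, self.s, self.idx] }
}

// Only 2D coordinates are accepted here; passing a BidegreeGenerator is a
// type error rather than a runtime panic.
fn shard_key(c: &impl SaveCoords<2>) -> [u64; 2] {
    c.coords()
}
```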

Subgroups for named homomorphisms

Named ResolutionHomomorphism saves into products/{name}/; named ChainHomotopy saves into homotopies/{left}__{right}/. Anonymous endpoints disable saving entirely (matching master and fixing a latent collision bug from examples/massey.rs). The four secondary kinds therefore appear in three places: at the root for the main resolution, under products/{name}/ for a named hom's secondary lift, and under homotopies/{l}__{r}/ for a chain homotopy's secondary lift. Subgroups share the underlying FilesystemStore via Arc clone — they're just a path prefix, not a separate store.
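The "path prefix, not a separate store" point can be sketched as follows; `Store` is a stand-in for the zarrs `FilesystemStore` and the struct/method names are assumptions.

```rust
use std::sync::Arc;

struct Store; // stand-in for the on-disk zarrs store

struct SaveStore {
    store: Arc<Store>,
    prefix: String,
}

impl SaveStore {
    // A subgroup shares the same underlying store; only the prefix grows.
    fn subgroup(&self, name: &str) -> SaveStore {
        SaveStore {
            store: Arc::clone(&self.store), // shared, not reopened
            prefix: format!("{}{}/", self.prefix, name),
        }
    }
}
```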

Concurrency

zarrs::Array::store_array_subset documents (since 0.14) that callers must serialise concurrent invocations on regions sharing chunks — the per-chunk locks were removed because the old default deadlocked. We honour the contract with a per-SaveKind Arc<Mutex<()>>. We also pass CodecOptions::with_concurrent_target(1) on sharded writes, because the sharding codec uses into_par_iter internally and a rayon worker holding our std::sync::Mutex could otherwise deadlock via work-stealing.
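The locking discipline can be sketched with the zarrs call stubbed out; the `SaveKind` variants and `Writer` type are illustrative. In the PR the guarded call is `store_array_subset_opt` with a concurrent target of 1, so no rayon worker ever blocks while holding the lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(PartialEq, Eq, Hash, Clone, Copy)]
enum SaveKind { Kernel, Differential } // illustrative subset

struct Writer {
    locks: HashMap<SaveKind, Arc<Mutex<()>>>,
}

impl Writer {
    fn write(&self, kind: SaveKind, payload: &[u8]) -> usize {
        let lock = &self.locks[&kind];
        // Held only across the store call: serialises writers whose regions
        // may share a chunk, per the documented zarrs contract.
        let _guard = lock.lock().unwrap();
        payload.len() // stand-in for store_array_subset_opt
    }
}
```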

Verified by reproduction: without these two fixes, secondary_massey --features nassau,concurrent panics on resume from a persistent save dir within a few iterations with Expected header with 1 elements, got 0 from the vlen-bytes codec. With them, 10 consecutive runs produce identical md5s.

Other notes

  • bitcode (with the serde feature) replaces hand-rolled to_bytes/from_bytes for everything in the shard tier; the relevant fp types get #[derive(Serialize, Deserialize)]. aligned-vec gains its serde feature.
  • FpVector::num_limbs becomes pub so structured row buffers can be sized without poking at private internals.
  • QuasiInverse::stream_quasi_inverse, MilnorSubalgebra::{to,from}_bytes, the Magic enum, the SaveDirectory::Split HPC workaround, and ZarrWriter/ZarrReader (the io::Write/io::Read blob shims) are all gone.

Test plan

  • All 6 tests/save_load_resolution.rs tests pass (debug + release)
  • Full cargo test is green across the workspace
  • Manual reproduction at S_2 stem 30/60 shows file count well below master
  • secondary_massey --features nassau,concurrent against a persistent save dir, 10 consecutive runs → identical md5s
  • Production run at stem 200+ to confirm the 40 TB scenario is fixed
  • Production-scale nassau_qi (multi-GB) read/write end-to-end

🤖 Generated with Claude Code

JoeyBF force-pushed the zarrs-save-rework branch 3 times, most recently from 53e6ce6 to 8cd9ff1 on April 7, 2026 at 20:38
JoeyBF and others added 2 commits April 7, 2026 23:08
Replaces the ad-hoc save format with a ZarrV3-backed layout via the
`zarrs` crate. Motivations:

- **Small file problem**: at stem 200 master produces ~720K files
  totalling ~780 GB, almost all <1 KB. Stem 300 overflows our 40 TB
  allocation. Sharding consolidates the long tail.
- **No write-time compression**: master writes uncompressed and relies
  on external zstd. We want compression in-format.
- **Weak integrity**: master uses an Adler32 tail. We want CRC32C on
  every chunk.
- **Ad-hoc framing**: hand-rolled `to_bytes`/`from_bytes` plumbing
  scattered across the codebase. The new layout is structured zarr
  arrays that match the actual shape of the data.

This is a clean break — no migration from the old format.

The save store is one zarr v3 store on a `FilesystemStore`. Inside it,
each kind lives in one of two tiers depending on its size profile.

**Shard tier.** Used for: `kernel`, `differential`, `augmentation_qi`,
`nassau_differential`, `secondary_composite`, `secondary_intermediate`,
`secondary_homotopy`, `chain_map`, `chain_homotopy`. These are all
under ~1 MB per element with a long tail of <1 KB entries.

One sharded vlen-bytes array per kind at the top of the store. Shape
is `(N_SPAN, S_SPAN[, IDX_SPAN])` = `(4096, 1024[, 256])`, shard shape
`(SHARD_N, SHARD_S[, SHARD_IDX])` = `(8, 8[, 8])`, inner chunk
`[1, 1[, 1]]`. CRC32C on every shard, no zstd (data is small enough
that compression doesn't help and the per-shard read-modify-write is
the dominant cost). Tens of thousands of zarr elements collapse into
hundreds of shard files per kind.

`SecondaryComposite` and `SecondaryIntermediate` are 3D — the third
coordinate is the intra-bidegree basis index. Everything else is 2D.

**Stream tier.** Used for: `res_qi`, `nassau_qi`. These can reach
multi-GB per bidegree. They live in *groups* — not single arrays — at
`qi/n{n}_s{s}/{kind}/`, with kind-specific sub-arrays:

- **`res_qi/`**: `pivots/` (1D `i64`, single chunk) + `rows/` (2D
  `u8` `[image_dim, num_limbs * 8]`, chunked over rows). The
  `ResQiReader` walks pivots and fetches matrix chunks on demand.
- **`nassau_qi/`**: `commands/` (1D vlen-bytes, one element per
  command of the underlying state machine: signature change, fix,
  or pivot operation with embedded lift+image). Header
  (`target_dim`, `zero_mask_dim`, `subalgebra_profile`,
  `num_commands`, `finished`) lives in the group's `zarr.json`
  attributes. The `NassauQiReader` yields `NassauCommand` enum
  values one at a time.

The `finished` group attribute is the source of truth for atomicity:
a writer dropped before `finish()` leaves `finished = false` and the
matching reader treats the QI as missing so callers recompute. Memory
during read or write is bounded by the chunk shape (up to ~10 MB per
chunk for nassau_qi commands, or `CHUNK_RES_QI_ROWS * row_bytes` for
res_qi rows), regardless of total payload size.
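The chunk-bounded read pattern can be sketched generically; `chunk_rows` mirrors the role of `CHUNK_RES_QI_ROWS`, and the fetch itself is stubbed out since the real code goes through zarrs partial reads.

```rust
// Visit every row while keeping only one chunk's worth resident at a time.
fn for_each_row(total_rows: usize, chunk_rows: usize, mut visit: impl FnMut(usize)) {
    let mut row = 0;
    while row < total_rows {
        let end = (row + chunk_rows).min(total_rows);
        // A real reader would fetch rows row..end here; memory stays
        // O(chunk_rows * row_bytes) regardless of total_rows.
        for r in row..end {
            visit(r);
        }
        row = end;
    }
}
```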

Shard arrays use `(n, s)` (= `MultiDegree<2>::coords()`), not
`(s, t)`. This matches the natural display convention, generalises
to `MultiDegree<N>` for higher-N gradings, and lets us use a tighter
square bound. n can be negative (RP_-k_inf, A-mod-…[-k]); zarr v3
has no native support for negative chunk indices, so internally we
shift by a fixed `N_MIN = -1024` offset before using n as a zarr
coordinate. The offset is hidden from callers and the bound is
extremely generous; sparse zarr arrays cost essentially nothing for
the empty negative regions.

The public coords API is the `SaveCoords<const N: usize>` trait:
`Bidegree: SaveCoords<2>` and `BidegreeGenerator: SaveCoords<3>`, so
methods that only make sense in one dimension can take
`impl SaveCoords<2>` and reject the other type at compile time.

A named `ResolutionHomomorphism` (and any
`SecondaryResolutionHomomorphism` derived from it) saves into a
per-name subgroup at
`products/{name}/`. The subgroup is created by
`ZarrSaveStore::subgroup(name)`, which shares the underlying
filesystem store via `Arc` clone and just prepends a path prefix.
Anonymous homs (`name.is_empty()`) get `SaveDirectory::None` and
skip saving entirely, matching master's behaviour. This fixes a
latent collision bug where anonymous homs in `examples/massey.rs`
would have shared the same on-disk slot.

`ChainHomotopy` between two named homs `left` and `right` gets its
own top-level subgroup at `homotopies/{left.name}__{right.name}/`,
not nested under either constituent map. Anonymous endpoints disable
saving here too.

The four secondary kinds therefore appear in three places:

- `secondary_composite/`, `secondary_intermediate/`,
  `secondary_homotopy/` at the root → main resolution's secondary lift
- `products/{name}/secondary_*` → secondary lift of a named hom
- `homotopies/{l}__{r}/secondary_*` → secondary lift of a chain
  homotopy

`zarrs::Array::store_array_subset` documents (since 0.14) that
callers must serialise concurrent invocations on regions sharing
chunks — the per-chunk locks were removed because the old default
implementation deadlocked. We honour that contract with a per-`SaveKind`
`Arc<Mutex<()>>` in `ZarrSaveStore`. The lock is held only across the
`store_array_subset_opt` call.

We also pass `CodecOptions::with_concurrent_target(1)` on these
sharded writes. The sharding codec uses `into_par_iter` internally,
and a rayon worker holding our `std::sync::Mutex` would otherwise
join on inner tasks and could be reassigned to another locked task
on the same kind, deadlocking. Sequential codec execution avoids
the join entirely. (Verified by reproducing — without these two
fixes, the `secondary_massey` example with `--features
nassau,concurrent` panics on resume from a persistent save dir
inside a few iterations with `Expected header with 1 elements,
got 0` from the vlen-bytes codec.)

Shard-tier arrays are created lazily on first write (per
`SaveKind` per group), so subgroups only contain the kinds they
actually use — no empty `kernel/` directory inside `products/foo/`.
The "already created" set and the per-kind write locks live in
`DashMap`/`DashSet` for lock-free reads on the hot path.
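The lazy-creation check can be sketched with std types; the PR uses `DashSet` for lock-free reads, and a `RwLock<HashSet>` stands in here so the example needs only the standard library.

```rust
use std::collections::HashSet;
use std::sync::RwLock;

struct Arrays {
    created: RwLock<HashSet<(String, &'static str)>>,
}

impl Arrays {
    // Returns true only for the first writer of this (group, kind), which is
    // the one that actually creates the zarr array on disk.
    fn ensure_created(&self, group: &str, kind: &'static str) -> bool {
        let key = (group.to_string(), kind);
        if self.created.read().unwrap().contains(&key) {
            return false; // hot path: already created, read lock only
        }
        self.created.write().unwrap().insert(key)
    }
}
```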

`bitcode` (with the `serde` feature) replaces hand-rolled
`to_bytes`/`from_bytes` for everything in the shard tier. The fp
crate types (`Fp<P>`, `Matrix`, `Subspace`, `QuasiInverse`,
`FqVector`) get `#[derive(Serialize, Deserialize)]`; `aligned-vec`
gains its `serde` feature; `FpVector`'s manual impl is rewritten
to round-trip via `FqVector<Fp<ValidPrime>>` (the previous impl
panicked on deserialise).

For the stream tier, the structured zarr layout replaces the old
`to_bytes` framing entirely. `QuasiInverse::stream_quasi_inverse`,
`MilnorSubalgebra::{to,from}_bytes`, the `Magic` enum, and the
`SaveDirectory::Split` variant (an HPC workaround) are all gone.

`FpVector::num_limbs(p, len)` is now `pub` so external callers
can size raw row buffers without poking at the private
`field_internal::FieldInternal` trait.
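The buffer-sizing arithmetic this enables is just limb rounding. A sketch under assumptions: the real `FpVector::num_limbs` takes the prime and derives the per-entry bit width internally, while this illustrative version takes the bit width directly.

```rust
// How many 64-bit limbs a length-`len` vector needs when each entry
// occupies `bits_per_entry` bits (entries are packed, never split).
fn num_limbs(bits_per_entry: usize, len: usize) -> usize {
    let entries_per_limb = 64 / bits_per_entry;
    len.div_ceil(entries_per_limb)
}

// Row buffer size in bytes, matching the rows/ array's second dimension
// (num_limbs * 8 in the layout described above).
fn row_bytes(bits_per_entry: usize, len: usize) -> usize {
    num_limbs(bits_per_entry, len) * 8
}
```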

- All 6 `tests/save_load_resolution.rs` tests pass (debug + release)
- `cargo test` is green across the workspace
- Manual reproduction at S_2 stem 30/60 shows the file count is
  bounded and well below master's

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* `save.rs`: move the two-tier architecture comment into a proper
  module-level doc, and reorder the file so `ZarrSaveStore` and the
  reader/writer types come first, with the lower-level `SaveCoords` /
  `SaveKind` / `SaveDirectory` types at the end. Apply the 1-line
  summary + blank line + body doc convention and wrap at 100 cols.
* `nassau.rs`: extract the body of `apply_quasi_inverse` into a
  fallible `apply_quasi_inverse_fallible` helper that returns
  `anyhow::Result<bool>`, so `?` propagates errors from the structured
  zarr reader and `FpVector::update_from_bytes` cleanly. The
  `ChainComplex::apply_quasi_inverse` trait method is now a one-line
  delegate that panics with a descriptive message if the fallible
  variant fails.
JoeyBF force-pushed the zarrs-save-rework branch from 8cd9ff1 to 5f8d841 on April 8, 2026 at 03:21
JoeyBF added 4 commits April 8, 2026 00:27
Previously, sseq_gui's wasm build broke because ext pulls in zarrs →
zarrs_filesystem → positioned-io::RandomAccessFile, which is gated to
`cfg(any(windows, unix))` and doesn't compile for wasm32-unknown-unknown.

Fix by declaring `zarrs` with target-specific feature sets:

  * native: `filesystem` + sharding + crc32c + zstd (unchanged)
  * wasm32: just sharding + crc32c + zstd (no `filesystem`, no
    zarrs_filesystem in the dep graph)

and cfg-selecting the concrete store in `ZarrSaveStore::create`:

  * native: `FilesystemStore::new(&path)` — real on-disk persistence
  * wasm32: `MemoryStore::new()` — in-memory sink, dropped at session end

The wasm frontend has no filesystem to persist to anyway, so the memory
store is a no-op sink. All other code paths (shard tier, stream tier,
subgroups, readers, writers) are identical on both targets — no stubs,
no conditional compilation at call sites.
The previous wasm swap to `MemoryStore` was necessary but incomplete.
Three further issues showed up once `cargo clippy --target
wasm32-unknown-unknown` got past the build stage:

  1. `zarrs_filesystem` wasn't the only wasm-hostile dep: `zstd-sys`'s
     C build expects POSIX `qsort_r`, which wasm's libc shim doesn't
     provide. Drop the `zstd` feature from `zarrs` on wasm and route
     the three `ZstdCodec::new` call sites through a new
     `stream_tier_codecs()` helper that returns `[zstd, crc32c]` on
     native and `[crc32c]` alone on wasm. The WASM memory store is
     ephemeral so skipping compression doesn't matter.

  2. On wasm, zarrs's storage trait objects use `MaybeSend + MaybeSync`
     (no-ops on wasm), so `Arc<dyn ReadableWritableListableStorageTraits>`
     is not `Send`/`Sync`. `ChainComplex: Send + Sync` then rejected
     `MuResolution<_>` because it transitively contains that `Arc`. The
     principled fix (a `MaybeSend`/`MaybeSync` pattern matching
     zarrs/zarrs#242) would ripple through `SecondaryLift`,
     `ChainHomotopy`, and many `+ Sync` bounds in `ext`. Instead: force
     `Send + Sync` on `ZarrSaveStore` with an `unsafe impl` gated to
     `cfg(target_arch = "wasm32")`. Sound because wasm32 is
     single-threaded, so the cross-thread guarantees are vacuously
     satisfied.

  3. Several zarrs error types (`ArrayError`, `CodecError`) contain
     `Arc<dyn DataTypeTraits>` that similarly lack `Send + Sync` on
     wasm. `anyhow::Error` requires `Send + Sync`, so `?`-converting
     them via the blanket `From` impl fails. Route those errors through
     a new `zarr_err` helper that formats via `Display` — loses the
     source chain on wasm but preserves the message, and `.map_err(
     zarr_err)?` works identically on both targets.
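Items 2 and 3 can be sketched together; the stand-in type below is illustrative, and a boxed std error replaces the `anyhow::Error` the PR's `zarr_err` actually returns.

```rust
use std::fmt::Display;

// Item 3: flatten a non-Send error into a plain string so it can cross into
// an error type that requires Send + Sync. Loses the source chain, keeps
// the message.
fn zarr_err(e: impl Display) -> Box<dyn std::error::Error + Send + Sync> {
    e.to_string().into()
}

// Item 2: the cfg-gated Send/Sync assertion, shown on a stand-in type.
// Sound only because wasm32 is single-threaded, so the cross-thread
// guarantees are vacuously satisfied.
struct ZarrSaveStoreLike {
    _inner: *const (), // pretend: the non-Send storage Arc on wasm
}

#[cfg(target_arch = "wasm32")]
unsafe impl Send for ZarrSaveStoreLike {}
#[cfg(target_arch = "wasm32")]
unsafe impl Sync for ZarrSaveStoreLike {}
```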

Verified: `cargo clippy --lib --target wasm32-unknown-unknown` passes,
`cargo build --lib --target wasm32-unknown-unknown --release` builds,
native `nix run ./ext#test` stays green (6/6 save_load_resolution).
bitcode is only used by ext/src/resolution.rs (where it's already
declared in ext/Cargo.toml). It snuck into fp's manifest as part of an
earlier zarr-branch experiment and survived the rebase onto upstream
master only because the auto-merge couldn't tell it conflicted with
the upstream removal. fp doesn't reference bitcode anywhere, so dropping
the dep is a no-op.
The pre-zarr save system stamped each file's header with
`algebra.magic()` so a later resume couldn't accidentally load Adem
data with the Milnor algebra (or vice versa). The zarr rewrite dropped
that check entirely; this commit puts it back at the store level instead
of per-file.

* `ZarrSaveStore::bind_to_algebra(magic, prime, prefix)` writes
  `algebra_magic`, `prime`, and `algebra_prefix` to the root group's
  attributes on first use, and on every subsequent use compares the
  stored magic against the caller's. A mismatch returns an
  `anyhow::Error` mentioning both magics, both algebra prefixes, and
  the store path. The arguments are raw values (not `&dyn Algebra`) so
  `save.rs` stays decoupled from the `algebra` crate.
* `MuResolution::new_with_save` and `nassau::Resolution::new_with_save`
  call `bind_to_algebra` immediately after constructing the
  `SaveDirectory`. Subgroups (`products/{name}`, `homotopies/{l}__{r}`)
  share the same underlying store and root attributes, so they inherit
  the check without a second call.
* Resurrects the `wrong_algebra` regression test from the pre-zarr
  `save_load_resolution.rs` (Adem first, then Milnor over the same
  dir, asserts the mismatch panic via `should_panic(expected = "different algebra")`).
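The store-level check reduces to compare-or-record on first use. A sketch with a `HashMap` standing in for the root group's attributes and only the magic compared; the real `bind_to_algebra` also records the prime and algebra prefix and reports the store path in its error.

```rust
use std::collections::HashMap;

fn bind_to_algebra(attrs: &mut HashMap<String, u64>, magic: u64) -> Result<(), String> {
    match attrs.get("algebra_magic") {
        None => {
            attrs.insert("algebra_magic".into(), magic); // first use: record
            Ok(())
        }
        Some(&stored) if stored == magic => Ok(()),
        Some(&stored) => Err(format!(
            "different algebra: store has magic {stored}, caller has {magic}"
        )),
    }
}
```

Subgroups share the same root attributes, so they inherit the check for free.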
