
shape is a structural comparability gate — deterministically answering whether two datasets can be compared at all by checking schema overlap, key uniqueness, granularity, and type shifts.

shape

License: MIT

Structural comparability gate — know whether two CSV datasets can be compared before you waste time trying.

No AI. No inference. Pure deterministic checks.

brew install cmdrvl/tap/shape

TL;DR

The Problem: Before you can compare two CSV exports, you need to know if comparison is even meaningful. Do the columns match? Is the key unique? Did the schema drift? Finding out mid-analysis wastes time and produces misleading results.

The Solution: One structural gate. shape checks schema overlap, key viability, row granularity, and type consistency — then gives a deterministic verdict before you run any analysis.

Why Use shape?

| Feature | What It Does |
|---|---|
| Four structural checks | Schema overlap, key viability, row granularity, and type consistency, all at once |
| Three clear outcomes | COMPATIBLE, INCOMPATIBLE, or REFUSAL; never ambiguous |
| Concrete reasons | When incompatible, tells you exactly what broke and why |
| Machine-readable | --json output for pipelines and CI gates |
| Pairs with rvl | Run shape first to validate structure, then rvl to explain numeric changes |
| Deterministic | Same inputs always produce the same output: no models, no heuristics |
| Ambient witness ledger | Every comparison is recorded for audit trails (opt out with --no-witness) |

Quick Example

$ shape nov.csv dec.csv --key loan_id
SHAPE

COMPATIBLE

Compared: nov.csv -> dec.csv
Key: loan_id (unique in both files)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none

Schema:    22 common / 22 total (100% overlap)
Key:       loan_id — unique in both, coverage=1.0
Rows:      3,214 old / 3,201 new (13 removed, 0 added, 3,201 overlap)
Types:     12 numeric columns, 0 type shifts

All four checks pass. These files are structurally compatible — safe to proceed with rvl, compare, or verify.

# Gate a pipeline (shape before rvl):
$ shape nov.csv dec.csv --key loan_id --json > shape.json \
    && rvl nov.csv dec.csv --key loan_id --json > rvl.json

# Exit code only (for scripts):
$ shape old.csv new.csv > /dev/null 2>&1
$ echo $?  # 0 = compatible, 1 = incompatible, 2 = refused

# Machine-readable:
$ shape old.csv new.csv --json | jq '.checks.schema_overlap'

The Three Outcomes

shape always produces exactly one of three outcomes. There are no partial results.

1. COMPATIBLE

All structural checks pass. These datasets can be meaningfully compared.

SHAPE

COMPATIBLE

Compared: nov.csv -> dec.csv
Key: loan_id (unique in both files)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none

Schema:    22 common / 22 total (100% overlap)
Key:       loan_id — unique in both, coverage=1.0
Rows:      3,214 old / 3,201 new (13 removed, 0 added, 3,201 overlap)
Types:     12 numeric columns, 0 type shifts

How to read this:

  • Schema — how many columns are shared between the two files.
  • Key — whether the key column is unique and non-null in both files.
  • Rows — row counts and key overlap (how many keys appear in both files).
  • Types — whether any columns changed from numeric to non-numeric or vice versa.

2. INCOMPATIBLE

One or more structural checks failed. The reasons field explains exactly what broke.

SHAPE

INCOMPATIBLE

Compared: nov.csv -> dec.csv
Key: loan_id (unique in both files)
Dialect(old): delimiter=, quote=" escape=none
Dialect(new): delimiter=, quote=" escape=none

Schema:    15 common / 17 total (88% overlap)
           old_only: [retired_field]
           new_only: [new_field]
Key:       loan_id — unique in both, coverage=1.0
Rows:      4,183 old / 4,201 new (33 removed, 51 added, 4,150 overlap)
Types:     12 numeric columns, 1 type shift
           balance: numeric -> non-numeric

Reasons:
  1. Type shift: balance changed from numeric to non-numeric

3. REFUSAL

shape could not read or parse the inputs. A refusal always includes a concrete next step.

SHAPE ERROR (E_EMPTY)

Compared: nov.csv -> dec.csv
Dialect(old): delimiter=, quote=" escape=none

One or both files empty (no data rows after header)
Next: provide non-empty datasets.

The Four Checks

shape runs four independent structural checks. All must pass for COMPATIBLE.

Schema Overlap

Measures how many columns are shared between the two files.

  • Pass condition: at least 1 common column (overlap_ratio > 0)
  • Reports: columns_common, columns_old_only, columns_new_only, overlap_ratio
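The reported fields can be reproduced from two header lists. The following is a minimal Python sketch, not shape's actual code. It assumes overlap_ratio is the number of common columns divided by the number of distinct columns across both files, which is inferred from the "15 common / 17 total (88% overlap)" example rather than confirmed. The function name is hypothetical.

```python
def schema_overlap(old_headers, new_headers):
    """Illustrative only: mirrors the reported fields for two header lists."""
    old_set, new_set = set(old_headers), set(new_headers)
    common = old_set & new_set
    total = old_set | new_set
    # Assumption: ratio = common / union, inferred from the README example.
    ratio = len(common) / len(total) if total else 0.0
    return {
        "status": "pass" if ratio > 0 else "fail",
        "columns_common": len(common),
        "columns_old_only": sorted(old_set - new_set),
        "columns_new_only": sorted(new_set - old_set),
        "overlap_ratio": round(ratio, 2),
    }
```

With 15 shared columns plus one column unique to each file, this returns an overlap_ratio of 0.88 and a status of "pass".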

Key Viability

Checks whether the key column is suitable for row alignment.

  • Pass condition: key is unique in both files with no nulls
  • Only checked when --key is provided
  • Reports: key_column, unique_old, unique_new, coverage
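The pass condition can be illustrated on two lists of key values. This is a hedged Python sketch with hypothetical function names. It assumes an empty string stands in for a null key, and it leaves out coverage because its exact definition is not given here.

```python
def describe_keys(keys):
    """Summarize one file's key column."""
    return {
        "unique": len(set(keys)) == len(keys),
        # Assumption: an empty string is treated as a null key value.
        "has_nulls": any(key == "" for key in keys),
    }


def key_viability(old_keys, new_keys):
    """Pass only when the key is unique and non-null in both files."""
    old, new = describe_keys(old_keys), describe_keys(new_keys)
    viable = (old["unique"] and new["unique"]
              and not old["has_nulls"] and not new["has_nulls"])
    return {
        "status": "pass" if viable else "fail",
        "unique_old": old["unique"],
        "unique_new": new["unique"],
    }
```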

Row Granularity

Reports row counts and key overlap. Does not gate — agents and policies interpret the counts.

  • Always passes — informational only
  • Reports: rows_old, rows_new, key_overlap, keys_old_only, keys_new_only
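The key counts are plain set arithmetic over the two key columns. A minimal Python sketch, assuming keys are already known to be unique; the function name is hypothetical and this is not shape's implementation.

```python
def row_granularity(old_keys, new_keys):
    """Illustrative only: row counts plus key overlap between two files."""
    old_set, new_set = set(old_keys), set(new_keys)
    return {
        "status": "pass",  # informational check, never fails
        "rows_old": len(old_keys),
        "rows_new": len(new_keys),
        "key_overlap": len(old_set & new_set),
        "keys_old_only": len(old_set - new_set),
        "keys_new_only": len(new_set - old_set),
    }
```

When keys are unique, rows_old equals key_overlap plus keys_old_only, which matches the sample output (4,183 = 4,150 + 33).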

Type Consistency

Checks whether any common columns changed type between files.

  • Pass condition: no columns changed from numeric to non-numeric or vice versa
  • Only checked on columns common to both files
  • Reports: numeric_columns, type_shifts
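A type shift can be sketched as a per-column classification compared across files. The Python below is an approximation with hypothetical names. It assumes a column is numeric when every non-empty cell parses as a float; shape's actual parsing rules may be stricter or looser.

```python
def is_numeric_column(values):
    """Assumption: numeric means every non-empty cell parses as a float."""
    cells = [value.strip() for value in values if value.strip()]
    if not cells:
        return False
    for cell in cells:
        try:
            float(cell)
        except ValueError:
            return False
    return True


def type_shifts(old_columns, new_columns):
    """Compare only columns present in both files; inputs map name -> cell list."""
    shifts = []
    for name in sorted(old_columns.keys() & new_columns.keys()):
        was = is_numeric_column(old_columns[name])
        now = is_numeric_column(new_columns[name])
        if was != now:
            label = {True: "numeric", False: "non-numeric"}
            shifts.append((name, label[was], label[now]))
    return shifts
```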

How shape Compares

Capability shape Manual inspection csvkit pandas profiling
Schema overlap check ✅ Automated ❌ Eyeball headers ⚠️ csvstat per-file ⚠️ You write it
Key uniqueness validation ✅ Both files ❌ Manual ⚠️ Separate step ⚠️ You write it
Type shift detection ✅ Cross-file ⚠️ Per-file only
Single deterministic verdict
Machine-readable output --json ⚠️ Text
Audit trail (witness ledger) ✅ Built-in
Setup time brew install N/A ⚠️ pip install ⚠️ pip install + script

When to use shape:

  • Before running rvl — validate structure first, then explain numeric changes
  • Monthly reconciliation pipelines — catch schema drift before it corrupts results
  • CI gate — fail fast if upstream changed the export format

When shape might not be ideal:

  • You need content comparison (use rvl for that)
  • You need data profiling (distributions, outliers) — use pandas or Great Expectations
  • You're comparing non-CSV formats

Installation

Homebrew (Recommended)

brew install cmdrvl/tap/shape

Shell Script

curl -fsSL https://raw.githubusercontent.com/cmdrvl/shape/main/scripts/install.sh | bash

Windows (PowerShell)

Set-ExecutionPolicy -ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://raw.githubusercontent.com/cmdrvl/shape/main/scripts/install.ps1'))

From Source

cargo build --release
./target/release/shape --help

Prebuilt binaries are available for Linux and macOS (x86_64 and ARM64) and for Windows (x86_64). Each release includes SHA256 checksums, cosign signatures, and an SBOM.


CLI Reference

shape <old.csv> <new.csv> [OPTIONS]

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --key <column> | string | (none) | Key column to check for alignment viability (uniqueness, coverage). |
| --delimiter <delim> | string | (auto-detect) | Force CSV delimiter for both files. See Delimiter. |
| --json | flag | false | Emit a single JSON object on stdout instead of human-readable output. |
| --no-witness | flag | false | Suppress ambient witness ledger recording for this compare run. |
| --capsule-dir <path> | path | (none) | Write deterministic repro capsule artifacts (manifest.json, copied inputs, rendered output) to this directory. |
| --describe | flag | false | Print the compiled-in operator.json to stdout and exit 0 without positional args. |

Reserved v0 flags (parsed for schema stability, not yet enforced at runtime)

| Flag | Type | Default | Description |
|---|---|---|---|
| --profile <path> | path | (none) | Profile for check scoping. |
| --profile-id <id> | string | (none) | Echoed as profile_id in JSON output. |
| --lock <lockfile> | path | (none) | Lock verification for inputs. |
| --max-rows <n> | integer | (unlimited) | Row-limit refusal. |
| --max-bytes <n> | integer | (unlimited) | Byte-limit refusal. |

Exit Codes

| Code | Meaning |
|---|---|
| 0 | COMPATIBLE |
| 1 | INCOMPATIBLE |
| 2 | REFUSAL or CLI error |

Output Routing

| Mode | COMPATIBLE | INCOMPATIBLE | REFUSAL |
|---|---|---|---|
| Human (default) | stdout | stdout | stderr |
| --json | stdout | stdout | stdout |

In --json mode, stderr is reserved for process-level failures only (CLI parse errors, panics).
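Because exit code 2 covers both a REFUSAL and a CLI error, a caller in --json mode can tell them apart by checking whether stdout holds a JSON report. This is a minimal Python sketch under that reading; the function name and the returned dictionary layout are hypothetical.

```python
import json


def interpret_json_run(returncode, stdout_text):
    """Classify one `shape ... --json` run from its exit code and stdout."""
    try:
        report = json.loads(stdout_text)
    except json.JSONDecodeError:
        report = None
    if not isinstance(report, dict):
        # No JSON object on stdout: a process-level failure such as a
        # CLI parse error; details would be on stderr.
        return {"kind": "process_failure", "exit_code": returncode}
    return {
        "kind": "report",
        "exit_code": returncode,
        "outcome": report.get("outcome"),
        "report": report,
    }
```

In practice the two arguments would come from something like subprocess.run(["shape", "old.csv", "new.csv", "--json"], capture_output=True, text=True), using its returncode and stdout attributes.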


Repro Capsules

Use --capsule-dir to emit deterministic replay artifacts for a run without changing standard output behavior.

shape old.csv new.csv --key loan_id --json --no-witness --capsule-dir capsules/run-001

Generated layout:

capsules/run-001/
  manifest.json
  inputs/old.csv
  inputs/new.csv
  outputs/report.txt

Replay from the capsule directory:

cd capsules/run-001
shape inputs/old.csv inputs/new.csv --key loan_id --json --no-witness

manifest.json also stores replay args and a shell command under replay.argv and replay.shell.
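A script can read the replay command back out of a capsule. The Python sketch below assumes replay.argv is a JSON array of strings and replay.shell is a single string; only the two key names come from this README, and the rest of the manifest layout is not assumed. The function name is hypothetical.

```python
import json
from pathlib import Path


def load_replay_command(capsule_dir):
    """Return (argv, shell) as stored under the manifest's "replay" key."""
    manifest_path = Path(capsule_dir) / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    replay = manifest["replay"]
    # Assumption: argv is a list of strings and shell is one string.
    return replay["argv"], replay["shell"]
```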


Delimiter

Auto-Detection (default)

Each file's delimiter is detected independently. Candidate delimiters are evaluated in order: comma (,), tab (\t), semicolon (;), pipe (|), caret (^).

If detection is ambiguous (or the winner yields a single-column parse), shape refuses with E_DIALECT and provides an actionable next_command.

sep= Directive

If the first line is exactly sep=<char>, that delimiter is used for that file and the sep= line is consumed (not treated as header data).

--delimiter still overrides sep= when both are present.
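The precedence between the sep= directive and --delimiter can be sketched as follows. This Python is illustrative, with hypothetical function names. It assumes the sep= line is consumed even when a forced delimiter wins, which the text above does not state explicitly.

```python
def split_sep_directive(text):
    """Return (delimiter_or_None, remaining_text) for one file's contents."""
    first_line, _, rest = text.partition("\n")
    line = first_line.rstrip("\r")
    # The directive must be exactly "sep=" followed by one character.
    if len(line) == 5 and line.startswith("sep="):
        return line[4], rest
    return None, text


def choose_delimiter(text, forced=None):
    """A forced delimiter beats sep=; None means fall back to auto-detection."""
    directive, body = split_sep_directive(text)
    return (forced if forced is not None else directive), body
```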

--delimiter (forced)

Accepted values:

| Format | Examples |
|---|---|
| Named | comma, tab, semicolon, pipe, caret (case-insensitive) |
| Hex | 0x2c (comma), 0x09 (tab) |
| Single ASCII char | , or ; (any byte allowed by the rules below) |

Rules:

  • Hex form must be exactly two digits after 0x.
  • Allowed bytes are ASCII, excluding " (0x22), \r, \n, NUL (0x00), and DEL (0x7f).
  • Invalid values fail as CLI argument errors (exit 2).
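The accepted forms and rules above can be expressed as a small validator. This Python sketch follows the listed rules only and is not shape's parser. The function name is hypothetical, and requiring a lowercase 0x prefix is an assumption.

```python
NAMED_DELIMITERS = {"comma": ",", "tab": "\t", "semicolon": ";",
                    "pipe": "|", "caret": "^"}
FORBIDDEN_BYTES = {0x22, 0x0D, 0x0A, 0x00, 0x7F}  # " \r \n NUL DEL
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_delimiter(value):
    """Return the delimiter character, or raise ValueError if invalid."""
    named = NAMED_DELIMITERS.get(value.lower())
    if named is not None:
        return named
    if value.startswith("0x"):  # assumption: prefix is lowercase
        digits = value[2:]
        if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
            raise ValueError("hex form needs exactly two digits after 0x")
        byte = int(digits, 16)
    elif len(value) == 1:
        byte = ord(value)
    else:
        raise ValueError("unrecognized delimiter: %r" % value)
    if byte > 0x7F or byte in FORBIDDEN_BYTES:
        raise ValueError("delimiter byte not allowed: 0x%02x" % byte)
    return chr(byte)
```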

Agent / CI Integration

Both shape and rvl are designed to be consumed by agents and pipelines, not just humans.

Self-describing contract

An agent can learn how to invoke shape without reading docs:

$ shape --describe | jq '.exit_codes'
{
  "0": { "meaning": "COMPATIBLE", "domain": "positive" },
  "1": { "meaning": "INCOMPATIBLE", "domain": "negative" },
  "2": { "meaning": "REFUSAL / CLI error", "domain": "error" }
}

$ shape --describe | jq '.pipeline'
{
  "upstream": [],
  "downstream": ["rvl", "compare", "verify", "assess"]
}

Agent workflow: shape → rvl

# 1. Structural gate
shape old.csv new.csv --key id --json > shape.json
if [ $? -ne 0 ]; then
  # INCOMPATIBLE or REFUSAL — read .reasons or .refusal for why
  jq '.reasons // .refusal' shape.json
  exit 1
fi

# 2. Numeric explanation (only if structurally compatible)
rvl old.csv new.csv --key id --json > rvl.json

# 3. Agent extracts the verdict
outcome=$(jq -r '.outcome' rvl.json)
if [ "$outcome" = "REAL_CHANGE" ]; then
  jq '.contributors[] | "\(.row_id).\(.column): \(.delta)"' rvl.json
fi

Everything an agent needs is in --json output: structured verdicts, exit codes for branching, and --describe for tool discovery.


Scripting Examples

Check if files are compatible (exit code only):

shape old.csv new.csv > /dev/null 2>&1
echo $?  # 0 = compatible, 1 = incompatible, 2 = refused

Extract schema overlap from JSON:

shape old.csv new.csv --json | jq '.checks.schema_overlap'

Get incompatibility reasons:

shape old.csv new.csv --json | jq '.reasons'

Gate a pipeline (shape before rvl):

shape nov.csv dec.csv --key loan_id --json > shape.json \
  && rvl nov.csv dec.csv --key loan_id --json > rvl.json

Refusal Codes

Every refusal includes the error code and a concrete next step.

| Code | Meaning | Next Step |
|---|---|---|
| E_IO | File read error | Check file path and permissions |
| E_ENCODING | Unsupported encoding (UTF-16/32 BOM or NUL bytes) | Convert/re-export as UTF-8 |
| E_CSV_PARSE | CSV parse failure | Re-export as standard RFC4180 CSV |
| E_EMPTY | One or both files empty | Provide non-empty datasets |
| E_HEADERS | Missing header or duplicate headers | Fix headers or re-export |
| E_DIALECT | Delimiter ambiguous or undetectable | Use --delimiter <delim> |

Reserved refusal codes (defined for schema stability, not emitted in v0)

| Code | Meaning | Next Step |
|---|---|---|
| E_AMBIGUOUS_PROFILE | Both --profile and --profile-id provided | Provide exactly one profile selector |
| E_INPUT_NOT_LOCKED | Input not in any provided lockfile | Re-run with correct --lock or lock inputs first |
| E_INPUT_DRIFT | Input hash doesn't match locked member | Use the locked file; regenerate lock if expected |
| E_TOO_LARGE | Input exceeds --max-rows or --max-bytes | Increase limit or split input |

Troubleshooting

"E_EMPTY" — one or both files empty

Your file has a header row but no data rows. Check that the export actually produced data:

wc -l old.csv new.csv

"E_DIALECT" — delimiter detection failed

Your file uses an uncommon delimiter or has inconsistent field counts. Force the delimiter:

shape old.csv new.csv --delimiter pipe      # for |
shape old.csv new.csv --delimiter 0x09      # for tab
shape old.csv new.csv --delimiter semicolon # for ;

"E_HEADERS" — duplicate column names

Two or more columns share the same header name. Fix at the source, or rename duplicates before running shape.

Key viability fails but the column looks unique

Check for trailing whitespace, invisible characters, or encoding issues in key values. shape trims ASCII whitespace, but non-ASCII whitespace (e.g., NBSP) is preserved.
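One way to hunt for such values is to compare an ASCII-only trim against Python's Unicode-aware strip. This is a diagnostic sketch with a hypothetical function name; the exact set of ASCII whitespace that shape trims is an assumption.

```python
# Assumption: the ASCII whitespace set shape trims (space, tab, newlines, VT, FF).
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def suspicious_keys(keys):
    """Return keys that still carry non-ASCII whitespace or invisible characters."""
    flagged = []
    for key in keys:
        trimmed = key.strip(ASCII_WHITESPACE)
        # str.strip() with no argument also removes Unicode whitespace such as NBSP,
        # so a difference here means edge whitespace survived the ASCII trim.
        has_edge_whitespace = trimmed != trimmed.strip()
        # isprintable() is False for NBSP, zero-width space, and control characters.
        has_invisible = any(not ch.isprintable() for ch in trimmed)
        if has_edge_whitespace or has_invisible:
            flagged.append(key)
    return flagged
```

Keys flagged this way look identical to their clean counterparts in most editors but compare as different values.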

INCOMPATIBLE due to type shift — but the column looks numeric

A cell in the new file has a value that can't be parsed as a number (e.g., #REF!, a stray string, or locale-specific formatting). The type_shifts field in JSON shows exactly which columns changed.
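To locate the offending cells before re-running shape, a short script can list every value in a column that does not parse as a number. This Python sketch uses float() as a stand-in for shape's numeric parsing, which is an assumption; the function name is hypothetical, and line numbers are approximate when quoted fields span multiple lines.

```python
import csv
import io


def non_numeric_cells(csv_text, column):
    """Return (line_number, value) for non-empty cells that are not numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    offenders = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        cell = (row[column] or "").strip()
        if not cell:
            continue
        try:
            float(cell)  # assumption: stand-in for shape's numeric check
        except ValueError:
            offenders.append((line_number, cell))
    return offenders
```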


Limitations

| Limitation | Detail |
|---|---|
| Structural only | shape checks whether comparison is possible, not what changed. Use rvl for content diffs. |
| Two files only | No multi-file or directory comparison. |
| In-memory | Both files are loaded fully into memory. No streaming mode yet. |
| No column filtering | All common columns are checked. You can't exclude specific columns in v0. |
| No content sampling | shape doesn't look at data distributions or outliers; it checks structure only. |
| Profile/lock not enforced | --profile, --lock, --max-rows, --max-bytes are parsed but have no runtime effect in v0. |

FAQ

Why "shape"?

It checks the shape of your data — schema, keys, row counts, types — before you compare content. If the shapes don't match, comparison is meaningless.

How does shape relate to rvl?

shape validates structure. rvl explains numeric changes. Run shape first to confirm the files are comparable, then rvl to see what actually changed. They share delimiter detection and refusal patterns.

What's the witness ledger?

Every shape comparison is appended to a local JSONL file (~/.epistemic/witness.jsonl, or $EPISTEMIC_WITNESS). This gives you an audit trail of every structural check. Suppress with --no-witness.

Can I query past comparisons?

Yes, using witness subcommands. See Witness Subcommands below.

Can I use this in CI/CD?

Yes. Exit codes (0/1/2) and --json output are designed for automation. Gate on exit code, or parse the JSON for richer assertions.

What about non-CSV formats (Parquet, Excel)?

Not supported. Convert to CSV first.


Witness Subcommands

shape records every comparison to an ambient witness ledger. You can query this ledger:

# Query by tool, date range, or outcome
shape witness query --tool shape --since 2026-01-01 --outcome COMPATIBLE --json

# Get the most recent comparison
shape witness last --json

# Count comparisons matching a filter
shape witness count --since 2026-02-01

Subcommand Reference

shape witness query [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
  [--outcome <COMPATIBLE|INCOMPATIBLE|REFUSAL>] [--input-hash <substring>] \
  [--limit <n>] [--json]

shape witness last [--json]

shape witness count [--tool <name>] [--since <iso8601>] [--until <iso8601>] \
  [--outcome <COMPATIBLE|INCOMPATIBLE|REFUSAL>] [--input-hash <substring>] [--json]

Exit Codes (witness subcommands)

| Code | Meaning |
|---|---|
| 0 | One or more matching records returned |
| 1 | No matches (or empty ledger for last) |
| 2 | CLI parse error or witness internal error |

Ledger Location

  • Default: ~/.epistemic/witness.jsonl
  • Override: set EPISTEMIC_WITNESS environment variable
  • Malformed ledger lines are skipped; valid lines continue to be processed.
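For ad hoc analysis outside the witness subcommands, the ledger can be read directly as JSONL. The Python sketch below resolves the path as described above and skips malformed lines. The function names are hypothetical, and no record fields are assumed because the record schema is not documented here.

```python
import json
import os
from pathlib import Path


def ledger_path(environ=os.environ):
    """EPISTEMIC_WITNESS wins; otherwise use the default location."""
    override = environ.get("EPISTEMIC_WITNESS")
    if override:
        return Path(override)
    return Path.home() / ".epistemic" / "witness.jsonl"


def read_witness_records(lines):
    """Parse JSONL lines, skipping blank and malformed ones."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skip and keep going
        if isinstance(record, dict):
            records.append(record)
    return records
```

A typical call would be read_witness_records(ledger_path().read_text(encoding="utf-8").splitlines()).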

JSON Output Reference

A single JSON object on stdout. If the process fails before domain evaluation (e.g., invalid CLI args), JSON may not be emitted.

{
  "version": "shape.v0",
  "outcome": "COMPATIBLE",                 // "COMPATIBLE" | "INCOMPATIBLE" | "REFUSAL"
  "profile_id": null,                      // echoes --profile-id when provided
  "profile_sha256": null,                  // reserved in v0 (currently null)
  "input_verification": null,              // reserved in v0 (currently null)
  "files": { "old": "nov.csv", "new": "dec.csv" },
  "checks": {
    "schema_overlap": {
      "status": "pass",                    // "pass" | "fail"
      "columns_common": 15,
      "columns_old_only": ["retired_field"],
      "columns_new_only": ["new_field"],
      "overlap_ratio": 0.88
    },
    "key_viability": {
      "status": "pass",
      "key_column": "u8:loan_id",
      "found_old": true,
      "found_new": true,
      "unique_old": true,
      "unique_new": true,
      "coverage": 1.0
    },
    "row_granularity": {
      "status": "pass",
      "rows_old": 4183,
      "rows_new": 4201,
      "key_overlap": 4150,
      "keys_old_only": 33,
      "keys_new_only": 51
    },
    "type_consistency": {
      "status": "pass",
      "numeric_columns": 12,
      "type_shifts": []
    }
  },
  "reasons": [],                           // non-empty when INCOMPATIBLE
  "refusal": null                          // non-null when REFUSAL
}

Nullable Field Rules

  • checks is null for REFUSAL.
  • reasons is [] for COMPATIBLE, non-empty for INCOMPATIBLE, and null for REFUSAL.
  • refusal is null unless outcome is REFUSAL.
  • profile_id echoes --profile-id when provided, otherwise null.
  • profile_sha256 and input_verification are reserved v0 contract fields and remain null in current runtime behavior.
  • key_viability is null when --key is not provided.
  • key_viability.unique_old / unique_new are null if the key column is missing in that file.
  • key_viability.coverage is null when key overlap is not computable.
  • row_granularity.key_overlap / keys_old_only / keys_new_only are null when key metrics are unavailable.
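A consumer can turn the top-level rules into a quick sanity check on a parsed report. This Python sketch covers only the checks, reasons, and refusal rules listed above; the function name and message strings are hypothetical.

```python
def check_nullable_rules(report):
    """Return a list of rule violations for one parsed shape JSON report."""
    problems = []
    outcome = report.get("outcome")
    if outcome == "REFUSAL":
        if report.get("checks") is not None:
            problems.append("checks must be null for REFUSAL")
        if report.get("reasons") is not None:
            problems.append("reasons must be null for REFUSAL")
        if report.get("refusal") is None:
            problems.append("refusal must be non-null for REFUSAL")
    else:
        if report.get("refusal") is not None:
            problems.append("refusal must be null unless outcome is REFUSAL")
        reasons = report.get("reasons")
        if outcome == "COMPATIBLE" and reasons != []:
            problems.append("reasons must be [] for COMPATIBLE")
        if outcome == "INCOMPATIBLE" and not reasons:
            problems.append("reasons must be non-empty for INCOMPATIBLE")
    return problems
```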

Identifier Encoding (JSON)

Column names in JSON use unambiguous encoding:

  • u8:<string> — valid UTF-8 with no ASCII control bytes (e.g., u8:loan_id)
  • hex:<hex-bytes> — anything else (e.g., hex:ff00ab)

Same convention as rvl.
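The two prefixes can be produced and consumed with a pair of small helpers. This Python sketch assumes "ASCII control bytes" means 0x00 through 0x1F plus 0x7F; the function names are hypothetical and this is not the encoder shape itself uses.

```python
def encode_identifier(raw):
    """Encode raw column-name bytes as u8:<string> or hex:<hex-bytes>."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "hex:" + raw.hex()
    # Assumption: control bytes are 0x00-0x1F and 0x7F.
    if any(byte < 0x20 or byte == 0x7F for byte in raw):
        return "hex:" + raw.hex()
    return "u8:" + text


def decode_identifier(encoded):
    """Recover the raw bytes from an encoded identifier."""
    if encoded.startswith("u8:"):
        return encoded[3:].encode("utf-8")
    if encoded.startswith("hex:"):
        return bytes.fromhex(encoded[4:])
    raise ValueError("unknown identifier prefix: %r" % encoded)
```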

NTM Auto-Proceed (for multi-agent sessions)

If you run multi-agent sessions and want periodic proceed nudges:

scripts/ntm_proceed_ctl.sh start --session codex53-high

This feature is off by default. When started with defaults, it:

  • Runs every 10m
  • Sends only during overnight hours (20:00 to 08:00, local time)
  • Sends only if there are open or in-progress beads

Check/stop it:

scripts/ntm_proceed_ctl.sh status
scripts/ntm_proceed_ctl.sh stop

Useful overrides:

# Enable during daytime too
scripts/ntm_proceed_ctl.sh start --session codex53-high --mode always

# Custom overnight window and interval
scripts/ntm_proceed_ctl.sh start --session codex53-high --overnight-start 21 --overnight-end 7 --interval 15m

Spec

The full specification is docs/PLAN.md. This README covers everything needed to use the tool; the spec adds implementation details, edge-case definitions, and testing requirements.

For canonical release/signoff docs, start at docs/README.md.

Development

cargo fmt --check
cargo clippy --all-targets -- -D warnings
cargo test
