Parquet input and output format support #98

@vmvarela

Part of #68
Depends on #95 (format plugin architecture)

Description

Add Apache Parquet as an input and output format. This requires finding a Zig-compatible Parquet library or C bindings (e.g., Apache Arrow C Data Interface or nanoarrow).

This is a size:l issue due to the library integration complexity. A spike may be needed first to confirm feasibility.

Acceptance Criteria

  • --input-format parquet reads a Parquet file from stdin or from the path given by the --input flag
  • --output-format parquet writes results as Parquet to stdout or to the path given by the --output flag
  • Column types are preserved (integers, floats, strings, timestamps)
  • The output Parquet schema is inferred from the query result column types
  • A clear error message is shown if Parquet support was not compiled in (optional feature flag)
  • Tested with files generated by pandas, DuckDB, and Apache Spark
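The type-preservation and schema-inference criteria could be explored with a small mapping table like the one below. This is only a sketch: `ColumnType`, `ParquetFieldType`, and `parquet_type_for` are hypothetical names invented for illustration, since the project's real result-column types are not shown in this issue. The Parquet side uses physical and logical type names from the Parquet format specification (INT64, DOUBLE, BYTE_ARRAY, and the STRING and TIMESTAMP annotations).

```c
#include <assert.h>

/* Hypothetical result-column types; the project's real type enum is an assumption. */
typedef enum { COL_INTEGER, COL_FLOAT, COL_STRING, COL_TIMESTAMP } ColumnType;

/* Subset of Parquet physical types, named as in the Parquet format spec. */
typedef enum { PQ_INT64, PQ_DOUBLE, PQ_BYTE_ARRAY } ParquetPhysical;

/* Subset of Parquet logical type annotations. */
typedef enum { PQ_LOGICAL_NONE, PQ_LOGICAL_STRING, PQ_LOGICAL_TIMESTAMP } ParquetLogical;

typedef struct {
    ParquetPhysical physical;
    ParquetLogical logical;
} ParquetFieldType;

/* Map one result column type to a Parquet physical/logical pair. */
ParquetFieldType parquet_type_for(ColumnType t) {
    switch (t) {
    case COL_INTEGER:
        return (ParquetFieldType){PQ_INT64, PQ_LOGICAL_NONE};
    case COL_FLOAT:
        return (ParquetFieldType){PQ_DOUBLE, PQ_LOGICAL_NONE};
    case COL_STRING:
        /* STRING annotates BYTE_ARRAY (UTF-8 encoded bytes). */
        return (ParquetFieldType){PQ_BYTE_ARRAY, PQ_LOGICAL_STRING};
    case COL_TIMESTAMP:
        /* TIMESTAMP annotates INT64; the unit (ms/us/ns) still has to be chosen. */
        return (ParquetFieldType){PQ_INT64, PQ_LOGICAL_TIMESTAMP};
    }
    assert(0 && "unhandled column type");
    return (ParquetFieldType){PQ_BYTE_ARRAY, PQ_LOGICAL_NONE};
}
```

Keeping the mapping in a single function would make it easy to check round-trips against files written by pandas, DuckDB, and Spark, which may choose different integer widths and timestamp units.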

Notes

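  • Investigate: nanoarrow (C library, small, permissive license; verify during the spike that it can actually decode Parquet, since it targets the Arrow C Data Interface), parquet-go (Go, so not relevant here), or building from scratch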
  • Parquet is columnar, so reading row-by-row may be inefficient; batch reads are preferred
  • May want to gate this behind a compile-time feature flag (-Dparquet=true) to avoid a mandatory C dependency
  • Consider de-scoping to just Parquet input first (output is harder)
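For the feasibility spike, and for the "building from scratch" option in particular, a first step might be validating the file framing. An unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the footer metadata length as a little-endian integer. The sketch below checks only that framing on an in-memory buffer; `parquet_footer_length` is a hypothetical helper name, and decoding the Thrift-encoded footer itself is out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PARQUET_MAGIC "PAR1"
#define PARQUET_MAGIC_LEN 4
#define PARQUET_TRAILER_LEN 8 /* 4-byte footer length + 4-byte magic */

/* Hypothetical helper: returns 0 and stores the footer metadata length
 * if buf is framed like an unencrypted Parquet file, otherwise -1. */
int parquet_footer_length(const uint8_t *buf, size_t len, uint32_t *footer_len) {
    assert(buf != NULL && footer_len != NULL);

    /* Smallest possible framing: leading magic plus trailer. */
    if (len < PARQUET_MAGIC_LEN + PARQUET_TRAILER_LEN) return -1;
    if (memcmp(buf, PARQUET_MAGIC, PARQUET_MAGIC_LEN) != 0) return -1;
    if (memcmp(buf + len - PARQUET_MAGIC_LEN, PARQUET_MAGIC, PARQUET_MAGIC_LEN) != 0) return -1;

    /* Footer length is stored little-endian just before the trailing magic. */
    const uint8_t *p = buf + len - PARQUET_TRAILER_LEN;
    uint32_t n = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                 ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);

    /* The footer must fit between the leading magic and the trailer. */
    if ((size_t)n > len - PARQUET_MAGIC_LEN - PARQUET_TRAILER_LEN) return -1;

    *footer_len = n;
    return 0;
}
```

Because the metadata sits at the end of the file, reading Parquet from stdin probably means buffering the whole input before any column data can be decoded, which is worth confirming in the spike.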

Metadata

Labels

  • priority:low (Nice to have, do when possible)
  • size:l (Large, 1 to 2 days)
  • status:ready (Refined and ready for sprint selection)
  • type:feature (New functionality)
