Spike: evaluate streaming approaches for large CSV datasets #90

@vmvarela

Part of #69

Description

Investigate two approaches for handling large CSV files that don't fit in memory:

  1. SQLite virtual table — streaming CSV input via a virtual table interface
  2. Disk-backed temp storage (PRAGMA temp_store = FILE) — configure SQLite to spill to disk automatically
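The two approaches could be sketched in a few lines, assuming Python's stdlib `sqlite3` driver (the extension path and file names below are illustrative, not taken from this issue):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Approach 1 (virtual table) would look roughly like this, assuming the
# SQLite csv.c loadable extension is compiled and available -- not run here:
#   conn.enable_load_extension(True)
#   conn.load_extension("./csv")  # illustrative path to the built extension
#   conn.execute("CREATE VIRTUAL TABLE temp.big USING csv(filename='big.csv')")

# Approach 2: tell SQLite to spill temporary tables and sort data to disk
# instead of holding them in RAM.
conn.execute("PRAGMA temp_store = FILE")

# PRAGMA temp_store reports 0 = DEFAULT, 1 = FILE, 2 = MEMORY.
mode = conn.execute("PRAGMA temp_store").fetchone()[0]
print(mode)
```

Approach 2 is a one-line configuration change, while approach 1 changes how the data enters SQLite in the first place.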

Produce a written recommendation (implementation notes, trade-offs, estimated effort) to guide sub-issues 2–5.

Acceptance Criteria

  • Both approaches prototyped or evaluated with a 1GB+ test file
  • Recommendation written as a comment on this issue: which approach to implement first and why
  • Memory usage measured for each approach
  • Known limitations documented (e.g. which SQL operations won't work)
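One way to satisfy the "memory usage measured" criterion is to track the process's peak RSS around the query under test. A sketch, assuming a Unix system (the `resource` module is Unix-only, and `ru_maxrss` is KiB on Linux but bytes on macOS); Python's `tracemalloc` is not suitable here because it cannot see SQLite's C-level allocations:

```python
import resource
import sqlite3

def peak_rss() -> int:
    # Peak resident set size of this process so far
    # (KiB on Linux, bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA temp_store = FILE")  # approach under test

baseline = peak_rss()
# Stand-in workload; replace with the 1GB+ CSV load and query under test.
conn.execute("CREATE TABLE t(x)")
conn.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(100_000)))
top = conn.execute("SELECT x FROM t ORDER BY x DESC LIMIT 1").fetchone()[0]
delta = peak_rss() - baseline

print(top, delta)  # query result and approximate extra peak memory
```

Running this once per approach against the same 1GB+ input gives comparable per-approach deltas for the recommendation comment.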

Notes

  • The disk-backed approach may deliver 80% of the value with 20% of the complexity
  • Start with PRAGMA temp_store = FILE since it requires no query-semantic changes
  • Timebox to 4 hours max

Metadata

Assignees

No one assigned

    Labels

    • priority:medium — Should be done soon
    • size:s — Small, 1 to 4 hours
    • status:ready — Refined and ready for sprint selection
    • type:spike — Research or investigation (timeboxed)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
