feat(transaction): add OverwriteAction with CoW delete support#2185
feat(transaction): add OverwriteAction with CoW delete support#2185glitchy wants to merge 1 commit intoapache:mainfrom
Conversation
e8860f6 to
0b3158b
Compare
xanderbailey
left a comment
There was a problem hiding this comment.
Hey! This is exactly the PR I was about to implement but I was going to add the wiring for DataFusion also so you can do INSERT OVERWRITE I think it's not a lot of extra code to modify IcebergCommitExec to take InsertOp. I'm happy to add this is a follow-up PR if you were wanting to focus on the core iceberg part.
|
Gave it a go here glitchy#1 |
Thanks @xanderbailey. Yep I wanted to keep this PR focused on the core Iceberg transaction layer. A follow-up PR wiring it into DataFusion with |
Sounds good to me! |
timsaucer
left a comment
There was a problem hiding this comment.
This is a nice PR!
I did run into a problem when I tried using it with my code. Specifically I was getting a panic when I would delete more files than I added during an overwrite. I wrote up a PR targeting your branch with a proposed solution: glitchy#2
Which issue does this PR close?
Part of #2186
This is the first in a series of PRs working toward full Copy-on-Write (CoW) and Merge-on-Read (MoR) support. CoW comes first because it provides the foundation that MoR eventually depends on. This PR delivers complete CoW overwrite semantics. Subsequent PRs will add
RowDeltaActionfor writing position/equality delete files (MoR write path), scan-side delete file reconciliation (MoR read path), and compaction.Adds
OverwriteAction, a newTransactionActionthat produces snapshots withOperation::Overwritesemantics. It adds new data files and optionally removes existing data files by rewriting affected manifests with entries marked asManifestStatus::Deleted.Supporting changes:
ManifestWriter::add_deleted_entry()--the existingadd_entry()unconditionally sets status toAdded; there was no way to write deleted entriesSnapshotProducer::snapshot_id()getter--needed to stamp the snapshot ID on deleted manifest entriesFastAppendAction::existing_manifest()now preserves delete-only manifests so that deleted entries produced byOverwriteActionsurvive subsequent appendsAre these changes tested?
5 new unit tests in
transaction::overwrite::tests:test_empty_data_overwrite_action--error on empty file listtest_overwrite_snapshot_properties--custom properties flow to snapshot summarytest_overwrite_incompatible_partition_value--rejects mismatched partition typestest_overwrite_basic--verifies updates, requirements, operation type, manifest structure, sequence numberstest_overwrite_with_deleted_files--end-to-end: append via catalog, overwrite with deletes via catalog, verify original file isDeletedand replacement isAdded