Skip to content

Spec: Clarify writer requirements to prevent orphan DVs#13042

Merged
aokolnychyi merged 3 commits into
apache:mainfrom
aokolnychyi:dv-clarification
May 19, 2025
Merged

Spec: Clarify writer requirements to prevent orphan DVs#13042
aokolnychyi merged 3 commits into
apache:mainfrom
aokolnychyi:dv-clarification

Conversation

@aokolnychyi

Copy link
Copy Markdown
Contributor

This PR clarifies writer requirements to prevent orphan DVs as discussed on the dev list.

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label May 13, 2025
@aokolnychyi

Copy link
Copy Markdown
Contributor Author

cc @stevenzwu

Comment thread format/spec.md Outdated
* Equality delete files identify deleted rows by the value of one or more columns

Deletion vectors are a binary representation of deletes for a single data file that is more efficient at execution time than position delete files. Unlike equality or position delete files, there can be at most one deletion vector for a given data file in a snapshot. Writers must ensure that there is at most one deletion vector per data file and must merge new deletes with existing vectors or position delete files.
Writers must also remove no longer valid deletion vectors from the metadata whenever adding new deletes or removing entire data files to maintain accurate statistics and prevent orphan deletion vectors. For instance, a compaction job that rewrites a set of data file must also remove all deletion vectors applicable to the original data files.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may want to state the requirement more simply. When removing a data file, writers must also remove any DV that applies to that data file from delete manifests. I think that is clear and avoids sending like cleanup needs to happen on all writes.

And also clarify that the DV is puffin can be left?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. DV in Puffin can be left.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about now?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with it, although I don't think we need to specify when the DVs themselves are garbage collected

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just chiming in — I also think it makes sense not to require specifying exactly when deletion vectors are garbage collected, since that may also be handled separately by something like compaction/maintenance operation for other reasons. Also, I’m not entirely sure about this (I tried looking into it), but could there be cases where a Puffin file contains other metadata, like an NDV sketch, that shouldn’t be collected even if the deletion vectors have been removed? That might be another reason to avoid being too specific in the spec about when the puffin files are cleaned up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated.

@rdblue

rdblue commented May 14, 2025

Copy link
Copy Markdown
Contributor

Thanks! Looks ready for a vote to me.

@RussellSpitzer

Copy link
Copy Markdown
Member

I think the verb "remove" here is a little confusing but sounds good to me.

@aokolnychyi aokolnychyi merged commit a44ec8d into apache:main May 19, 2025
2 checks passed
devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.