Spec: Clarify writer requirements to prevent orphan DVs #13042
cc @stevenzwu
* Equality delete files identify deleted rows by the value of one or more columns
Deletion vectors are a binary representation of deletes for a single data file that is more efficient at execution time than position delete files. Unlike equality or position delete files, there can be at most one deletion vector for a given data file in a snapshot. Writers must ensure that there is at most one deletion vector per data file and must merge new deletes with existing vectors or position delete files.
Writers must also remove no longer valid deletion vectors from the metadata whenever adding new deletes or removing entire data files to maintain accurate statistics and prevent orphan deletion vectors. For instance, a compaction job that rewrites a set of data files must also remove all deletion vectors applicable to the original data files.
I think we may want to state the requirement more simply. When removing a data file, writers must also remove any DV that applies to that data file from delete manifests. I think that is clear and avoids sounding like cleanup needs to happen on all writes.
And also clarify that the DV in Puffin can be left?
yeah. DV in Puffin can be left.
What about now?
I'm good with it, although I don't think we need to specify when the DVs themselves are garbage collected
Just chiming in: I also think it makes sense not to specify exactly when deletion vectors are garbage collected, since that may be handled separately by something like a compaction or maintenance operation for other reasons. Also, I'm not entirely sure about this (I tried looking into it), but could there be cases where a Puffin file contains other metadata, like an NDV sketch, that shouldn't be collected even if the deletion vectors have been removed? That might be another reason to avoid being too specific in the spec about when the Puffin files are cleaned up.
Thanks! Looks ready for a vote to me.
I think the verb "remove" here is a little confusing but sounds good to me.
This PR clarifies writer requirements to prevent orphan DVs as discussed on the dev list.
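For readers who want the writer requirements in concrete terms, here is a minimal sketch of the two rules discussed in this thread: merge new deletes so there is at most one DV per data file, and drop the DV entry from metadata when its data file is removed. This is a toy in-memory model, not Iceberg code. `Snapshot`, `DeletionVector`, `add_deletes`, `remove_data_files`, and `orphan_dvs` are all hypothetical names, and a set of row positions stands in for the real bitmap. Note that the sketch only drops the metadata entry and never touches the Puffin file itself, matching the point above that the DV in Puffin can be left.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeletionVector:
    """Hypothetical stand-in for a DV entry tracked in a delete manifest."""
    puffin_path: str           # Puffin file that physically stores the vector
    referenced_data_file: str  # data file the vector applies to
    positions: frozenset       # deleted row positions (a real DV is a bitmap)


@dataclass
class Snapshot:
    """Toy table state: live data files plus at most one DV per data file."""
    data_files: set = field(default_factory=set)
    dvs: dict = field(default_factory=dict)  # data file path -> DeletionVector


def add_deletes(snap, data_file, new_positions, puffin_path):
    """Merge new deletes with any existing DV so only one DV remains."""
    if data_file not in snap.data_files:
        raise ValueError(f"{data_file} is not a live data file")
    existing = snap.dvs.get(data_file)
    old_positions = existing.positions if existing else frozenset()
    merged = old_positions | frozenset(new_positions)
    dvs = dict(snap.dvs)
    # Replacing the dict entry drops the old DV from metadata; the old
    # Puffin file is left on storage for later maintenance to handle.
    dvs[data_file] = DeletionVector(puffin_path, data_file, merged)
    return Snapshot(set(snap.data_files), dvs)


def remove_data_files(snap, removed, added=()):
    """Remove data files and drop the DV entries that apply to them."""
    removed = set(removed)
    data_files = (snap.data_files - removed) | set(added)
    dvs = {f: dv for f, dv in snap.dvs.items() if f not in removed}
    return Snapshot(data_files, dvs)


def orphan_dvs(snap):
    """Return data file paths that have a DV entry but are no longer live."""
    return sorted(f for f in snap.dvs if f not in snap.data_files)


# Usage: two rounds of deletes against one file, then a compaction-style rewrite.
snap = Snapshot(data_files={"a.parquet", "b.parquet"})
snap = add_deletes(snap, "a.parquet", {0, 3}, "dv-1.puffin")
snap = add_deletes(snap, "a.parquet", {5}, "dv-2.puffin")
compacted = remove_data_files(snap, {"a.parquet"}, added={"c.parquet"})
```

A writer that skipped the filtering step in `remove_data_files` would leave an entry for `a.parquet` behind, which is the orphan DV case this PR is meant to rule out.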