Core: Keep track of data files to be removed for orphaned DV detection#13222
Conversation
a41fe6e to
13b0f91
Compare
5dbb131 to
f82cbdc
Compare
f82cbdc to
4dd4a3d
Compare
4dd4a3d to
2d8ea9a
Compare
2d8ea9a to
0338bfe
Compare
0338bfe to
4d8fd4c
Compare
| private boolean canTrustManifestReferences(List<ManifestFile> manifests) { | ||
| Set<String> manifestLocations = | ||
| manifests.stream().map(ManifestFile::path).collect(Collectors.toSet()); | ||
| return allDeletesReferenceManifests && manifestLocations.containsAll(manifestsWithDeletes); |
There was a problem hiding this comment.
@amogh-jahagirdar manifestLocations.containsAll(manifestsWithDeletes) was previously always true even though manifestsWithDeletes was empty, hence I've added a non-empty check here
There was a problem hiding this comment.
@nastra I think the additional check is fine but is there a situation where allDeletesReferenceManifests is true and the manifestsWithDeletes is empty? Every delete operation would either invalidate allDeletesReferenceManifests or if it's a delte(File f) where the file has defined a manifest location we'd add to the manifestLocations set.
There was a problem hiding this comment.
while debugging things locally I've actually ran into such a situation. I actually forgot which test it was exactly but that was one of the reason why I added this additional non-empty check
There was a problem hiding this comment.
manifestLocations.containsAll(manifestsWithDeletes) was previously always true even though manifestsWithDeletes was empty
Why is this a problem?
is there a situation where allDeletesReferenceManifests is true and the manifestsWithDeletes is empty?
To Amogh's question, this would be the case when there is no delete operation. In this case, should canTrustManifestReferences return true or false?
There was a problem hiding this comment.
The boolean trustManifestReferences is used by canContainDeletedFiles method.
When trustManifestReferences is true (old behavior), this code would return false since manifestsWithDeletes is empty.
if (trustManifestReferences) {
return manifestsWithDeletes.contains(manifest.path());
}
When trustManifestReferences is false (new behavior), the above return will be skipped. So it went on to execute this block.
return canContainDroppedFiles(manifest)
|| canContainExpressionDeletes(manifest)
|| canContainDroppedPartitions(manifest);
The above block would always be false when allDeletesReferenceManifests is true, as any of these conditions would render allDeletesReferenceManifests false.
It seems that both old and new behaviors are essentially the same when allDeletesReferenceManifests is true and the manifestsWithDeletes is empty.
There was a problem hiding this comment.
A good example that makes it clearer is when running TestDeleteFiles.removingDataFileByPathAlsoRemovesDV (or any other new tests that I added). When we hit canTrustManifestReferences() the allDeletesReferenceManifests flag is set to true, the manifestLocations contain one entry but manifestsWithDeletes is actually empty.
This results in canTrustManifestReferences being evaluated to true, which will then exit early in
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 357 to 360 in 3e6decc
which is because it also exited early in
iceberg/core/src/main/java/org/apache/iceberg/ManifestFilterManager.java
Lines 385 to 387 in 3e6decc
The issue there is that
manifestsWithDeletes in that line is still empty and so this condition there can never be true. IMO trustManifestReferences should be false in that case because manifestsWithDeletes was empty.
The above block would always be false when allDeletesReferenceManifests is true, as any of these conditions would render allDeletesReferenceManifests false.
Not necessarily. While this might be true for the DataFileFilterManager it's slightly different for the DeleteFileFilterManager. Just assume you're deleting a DataFile, which renders allDeletesReferenceManifests to false in DataFileFilterManager. In the DeleteFileFilterManager allDeletesReferenceManifests is true by default.
I haven't explored this, but maybe we would need to set allDeletesReferenceManifests = false as well when removeDanglingDeletesFor() is being called but I think adding this additional non-empty check here still makes sense to me.
I'll go ahead and merge the PR to unblock the release. If necessary we can follow up here afterwards.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
I may be missing something but I think we should double check path based/partition drops/removals by filters, I'm not sure we're capturing those data file entries for passing to the logic to remove orphan DVs for.
a07ae28 to
1233d04
Compare
There was a problem hiding this comment.
Overall looks good, just some nits, thank @nastra !
| F copyWithoutStats = file.copyWithoutStats(); | ||
| // add the file here in case it was deleted using an expression. The | ||
| // DeleteManifestFilterManager will then remove its matching DV | ||
| deleteFiles.add(copyWithoutStats); |
There was a problem hiding this comment.
This makes sense to me. If an entry is marked for delete or passes strict evaluation, then we're marking it for delete and we're adding to the tracking set of files for which we have to hunt down and cleanup orphans for.
1233d04 to
3e6decc
Compare
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM.
Just have a question on the canTrustManifestReferences condition change. please check if my understanding is correct.
|
thanks for the reviews @amogh-jahagirdar and @stevenzwu. I'll go ahead and merge this to unblock the release and we can follow up on #13222 (comment) if necessary |
This keeps track of all data files to be removed/rewritten in MergingSnapshotProducer and passes those to the ManifestFilterManager for deletes. Once ManifestFilterManager goes through delete manifests, it also checks whether a DV references any of the data files to be removed.
This is needed in addition to #13245 so that we can properly remove orphaned DVs when e.g. a metadata-only delete is performed.
The below benchmark shows that tracking data files to be removed and then detecting orphaned DVs when delete manifests are looked at is only adding a small fraction to the throughput.
without tracking data files to be removed
with tracking data files to be removed