API, Core: Add deleteFile to RowDelta API#12861
Conversation
sn2479
left a comment
There was a problem hiding this comment.
+1. Thanks @RussellSpitzer for making this improvement!
singhpk234
left a comment
There was a problem hiding this comment.
+1 for this, Thanks @RussellSpitzer !
| if (!deletedDataFiles.isEmpty()) { | ||
| validateNoNewDeletesForDataFiles( | ||
| base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, parent); | ||
| } |
There was a problem hiding this comment.
[doubt] Should we add the deletedDataFiles for existence checks implicitly ? as if these files which are deleted need to be in table or are we expecting the caller of this API to re-use validateDataFilesExist to catch this case ?
@TestTemplate
public void testValidateFileDeleteAndRowDeleteWithNonExistentFileRemoved() {
commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch);
long initialCommit = latestSnapshot(table, branch).snapshotId();
// compaction re-wrote this file.
commit(table, table.newRewrite().addFile(FILE_C).deleteFile(FILE_B), branch);
commit(
table,
table
.newRowDelta()
.addDeletes(fileADeletes())
.deleteFile(FILE_B)
.validateFromSnapshot(initialCommit)
// Note: we are not explicitly checking the FILE_B existence
// so this pass, so we should do implicit existence check then ?
.validateDataFilesExist(ImmutableList.of(FILE_A.location())),
branch);
}OR
@TestTemplate
public void testValidateFileDeleteAndRowDeleteWithNonExistentFileRemoved() {
commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch);
long initialCommit = latestSnapshot(table, branch).snapshotId();
// compaction re-wrote this file.
commit(table, table.newRewrite().addFile(FILE_C).deleteFile(FILE_B), branch);
assertThatThrownBy(
() -> {
commit(
table,
table
.newRowDelta()
.addDeletes(fileADeletes())
.deleteFile(FILE_B)
.validateFromSnapshot(initialCommit)
// Note: we are explicitly checking the FILE_B existence
// so this pass as it expected to throw.
.validateDataFilesExist(ImmutableList.of(FILE_A.location(), FILE_B.location())),
branch);
})
.isInstanceOf(ValidationException.class)
.hasMessageContaining("Cannot commit, missing data files: [" + FILE_B.location() + "]");
}There was a problem hiding this comment.
Hmmm I think it should be implicit here because the removal of files should be according to the conflict detection filter. I think it should look just like OverwriteFIles
There was a problem hiding this comment.
I changed this so now you can either use
.validateDeletedFiles() // Checks that paths exist for all deleted files
or
.validateDataFilesExist() // Explictly checking for
I also added a validation that you are not simultaneously deleting a file while also adding a DeleteFile for that file
singhpk234
left a comment
There was a problem hiding this comment.
LGTM, thanks @RussellSpitzer !
|
@aokolnychyi Please take a look |
| parent); | ||
| } | ||
|
|
||
| if (validateDeletes) { |
There was a problem hiding this comment.
If I remember correctly, validateDeletes controls whether we check concurrent delete snapshots to see if any of them removed data files that we include in referencedDataFiles. Then failMissingDeletePaths must ensure that whatever we are trying to remove is actually still in the table. In other words, this blocks means if we do UPDATE or MERGE, make sure that data files that we are trying to remove entirely are still there.
Correct?
There was a problem hiding this comment.
Yes, the doc is in RowDelta (although under validateDeletedFiles)
/**
* Enable validation that referenced data files passed to {@link
* #validateDataFilesExist(Iterable)} have not been removed by a delete operation.
*
* <p>If a data file has a row deleted using a position delete file, rewriting or overwriting the
* data file concurrently would un-delete the row. Deleting the data file is normally allowed, but
* a delete may be part of a transaction that reads and re-appends a row. This method is used to
* validate deletes for the transaction case.
*
* @return this for method chaining
*/
This usage is similar to adding a position delete for a data file but instead of removing a singular row we are attempting to remove every row in the file. The way I have been thinking of this is point deletes and file deletes. Everything we do for the point deletes we must also do for the file deletes.
|
|
||
| if (validateNewDeleteFiles) { | ||
| // validate that explicitly deleted files have not had added deletes | ||
| if (!deletedDataFiles.isEmpty()) { |
There was a problem hiding this comment.
This block simply ensures no concurrent transactions added new deletes for files we are trying to remove?
There was a problem hiding this comment.
Yes, our file delete is essentially again being treated as a "position delete" that effects the entire file
| } | ||
| } | ||
|
|
||
| /** Validates that the data files removed in this commit do not overlap with delete files added */ |
There was a problem hiding this comment.
I am not sure I understand here. What's the use case?
There was a problem hiding this comment.
This was really just a sanity check. Forbids a commit where you attempt to add a position delete/equality delete for a file which you are also completely removing. I can remove this but I am assuming that an implementation is doing something wrong if it attempts to delete FILEA and also add a DV for FILEA at the same time.
There was a problem hiding this comment.
I am OK keeping it. Maybe rename the method? I think it only covers concurrent position deletes.
There was a problem hiding this comment.
ah that's true, validating for equality deletes would obviously be much more of a hassle.
|
Merging, Thanks for the review @sn2479 , @singhpk234 and @aokolnychyi ! |
Adds a "deleteFile" api to RowDelta Operation. I believe this was originally not included because the target engine (Spark) was unable to determine if an entire file was being removed during MOR.
Fixes #12717