Skip to content

API, Core: Add deleteFile to RowDelta API#12861

Merged
RussellSpitzer merged 5 commits into
apache:mainfrom
RussellSpitzer:ExtendRowDelta
May 6, 2025
Merged

API, Core: Add deleteFile to RowDelta API#12861
RussellSpitzer merged 5 commits into
apache:mainfrom
RussellSpitzer:ExtendRowDelta

Conversation

@RussellSpitzer

@RussellSpitzer RussellSpitzer commented Apr 21, 2025

Copy link
Copy Markdown
Member

Adds a "deleteFile" api to RowDelta Operation. I believe this was originally not included because the target engine (Spark) was unable to determine if an entire file was being removed during MOR.

Fixes #12717

@sn2479 sn2479 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Thanks @RussellSpitzer for making this improvement!

@singhpk234 singhpk234 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this, Thanks @RussellSpitzer !

Comment on lines +152 to +155
if (!deletedDataFiles.isEmpty()) {
validateNoNewDeletesForDataFiles(
base, startingSnapshotId, conflictDetectionFilter, deletedDataFiles, parent);
}

@singhpk234 singhpk234 Apr 22, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[doubt] Should we add the deletedDataFiles for existence checks implicitly ? as if these files which are deleted need to be in table or are we expecting the caller of this API to re-use validateDataFilesExist to catch this case ?

  @TestTemplate
  public void testValidateFileDeleteAndRowDeleteWithNonExistentFileRemoved() {
    commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch);
    long initialCommit = latestSnapshot(table, branch).snapshotId();
    // compaction re-wrote this file.
    commit(table, table.newRewrite().addFile(FILE_C).deleteFile(FILE_B), branch);

    commit(
            table,
            table
                    .newRowDelta()
                    .addDeletes(fileADeletes())
                    .deleteFile(FILE_B)
                    .validateFromSnapshot(initialCommit)
                    // Note: we are not explicitly checking the FILE_B existence 
                    // so this pass, so we should do implicit existence check then ? 
                    .validateDataFilesExist(ImmutableList.of(FILE_A.location())),
            branch);
  }

OR

 @TestTemplate
public void testValidateFileDeleteAndRowDeleteWithNonExistentFileRemoved() {
  commit(table, table.newAppend().appendFile(FILE_A).appendFile(FILE_B), branch);
  long initialCommit = latestSnapshot(table, branch).snapshotId();
  // compaction re-wrote this file.
  commit(table, table.newRewrite().addFile(FILE_C).deleteFile(FILE_B), branch);

  assertThatThrownBy(
          () -> {
            commit(
                    table,
                    table
                            .newRowDelta()
                            .addDeletes(fileADeletes())
                            .deleteFile(FILE_B)
                            .validateFromSnapshot(initialCommit)
                            // Note: we are explicitly checking the FILE_B existence
                            // so this pass as it expected to throw.
                            .validateDataFilesExist(ImmutableList.of(FILE_A.location(), FILE_B.location())),
                    branch);
          })
          .isInstanceOf(ValidationException.class)
          .hasMessageContaining("Cannot commit, missing data files: [" + FILE_B.location() + "]");
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I think it should be implicit here because the removal of files should be according to the conflict detection filter. I think it should look just like OverwriteFIles

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this so now you can either use
.validateDeletedFiles() // Checks that paths exist for all deleted files
or
.validateDataFilesExist() // Explictly checking for

I also added a validation that you are not simultaneously deleting a file while also adding a DeleteFile for that file

@singhpk234 singhpk234 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @RussellSpitzer !

@RussellSpitzer

Copy link
Copy Markdown
Member Author

@aokolnychyi Please take a look

parent);
}

if (validateDeletes) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remember correctly, validateDeletes controls whether we check concurrent delete snapshots to see if any of them removed data files that we include in referencedDataFiles. Then failMissingDeletePaths must ensure that whatever we are trying to remove is actually still in the table. In other words, this blocks means if we do UPDATE or MERGE, make sure that data files that we are trying to remove entirely are still there.

Correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the doc is in RowDelta (although under validateDeletedFiles)

  /**
   * Enable validation that referenced data files passed to {@link
   * #validateDataFilesExist(Iterable)} have not been removed by a delete operation.
   *
   * <p>If a data file has a row deleted using a position delete file, rewriting or overwriting the
   * data file concurrently would un-delete the row. Deleting the data file is normally allowed, but
   * a delete may be part of a transaction that reads and re-appends a row. This method is used to
   * validate deletes for the transaction case.
   *
   * @return this for method chaining
   */

This usage is similar to adding a position delete for a data file but instead of removing a singular row we are attempting to remove every row in the file. The way I have been thinking of this is point deletes and file deletes. Everything we do for the point deletes we must also do for the file deletes.


if (validateNewDeleteFiles) {
// validate that explicitly deleted files have not had added deletes
if (!deletedDataFiles.isEmpty()) {

@aokolnychyi aokolnychyi May 5, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block simply ensures no concurrent transactions added new deletes for files we are trying to remove?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, our file delete is essentially again being treated as a "position delete" that effects the entire file

}
}

/** Validates that the data files removed in this commit do not overlap with delete files added */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand here. What's the use case?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was really just a sanity check. Forbids a commit where you attempt to add a position delete/equality delete for a file which you are also completely removing. I can remove this but I am assuming that an implementation is doing something wrong if it attempts to delete FILEA and also add a DV for FILEA at the same time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK keeping it. Maybe rename the method? I think it only covers concurrent position deletes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that's true, validating for equality deletes would obviously be much more of a hassle.

@RussellSpitzer RussellSpitzer merged commit 1e94465 into apache:main May 6, 2025
@RussellSpitzer RussellSpitzer deleted the ExtendRowDelta branch May 6, 2025 16:09
@RussellSpitzer

Copy link
Copy Markdown
Member Author

Merging, Thanks for the review @sn2479 , @singhpk234 and @aokolnychyi !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding deleteFile method to existing RowDelta operation

4 participants