Skip to content

Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath#13459

Merged
szehon-ho merged 6 commits into
apache:mainfrom
NikitaMatskevich:nmh/use-resolving-file-io-in-rewrite-table-path
Jul 9, 2025
Merged

Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath#13459
szehon-ho merged 6 commits into
apache:mainfrom
NikitaMatskevich:nmh/use-resolving-file-io-in-rewrite-table-path

Conversation

@NikitaMatskevich

@NikitaMatskevich NikitaMatskevich commented Jul 3, 2025

Copy link
Copy Markdown
Contributor

RewriteTablePath leverages iceberg native table IO for everything except for writing a "file list" file. For this task it currently leverages a Spark writer, which in turn uses Hadoop. This might become a limitation in some cloud environments: for instance, Hadoop does not support modern auth strategies of Azure, and thus enforces unwanted auth paths where it is not required by the cloud provider itself. Also it requires the user to manage 2 configurations for table IO and for spark IO. We suggest to use native table io instead.

See corresponding issue here:
#13458

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea sounds fine, had a question


private FileIO fetchResolvingFileIO() {
FileIO tableIO = table.io();
ResolvingFileIO resolvingIO = (ResolvingFileIO) CatalogUtil.loadFileIO(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? The rest of the code just gets io from table, and we can configure all that from the catalog/table?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for (rare) edge case when stagingDir location argument is specified and it points to another filesystem. For example, the table might be getting migrated from Azure to AWS, and stagingDir might be s3://... I know it is not the most straightforward usecase for this action, but the old hadoop code allowed it, so using table.io directly would have been a potential breaking change

@NikitaMatskevich NikitaMatskevich Jul 3, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the code again I realized that stagingDir couldn't have been used with another filesystem. I will simplify this part as you said. Thank you for the review!

@NikitaMatskevich NikitaMatskevich changed the title Spark: Use ResolvingFileIO to save file list in RewriteTablePath Spark: Use native table FileIO instead of Hadoop to save file list in RewriteTablePath Jul 3, 2025
.mode(SaveMode.Overwrite)
.format("csv")
.save(fileListPath);
OutputFile fileList = table.io().newOutputFile(fileListPath);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[doubt] staging location default value is table metadata path, but can be set to anything ?

if thats the case :

  1. what if table's fileIO didn't had the credentials to write to staging directory but spark did, would this cause failures ?
  2. when we are using the local disk to write this, but tables file IO was pointing to object store ? would now that workloads fail ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for the review!

  1. As I checked after the review, this case is impossible: in this case the action would fail earlier, because it uses table IO to create a copy of the metadata already, and the copies of the metadata files should reside in a staging dir as well
  2. Impossible for the same reason.

You can find examples in th method rewriteVersionFile or similar

@singhpk234 singhpk234 Jul 3, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking, it seems like then we might need to update the docs https://iceberg.apache.org/docs/nightly/spark-procedures/#rewrite_table_path on the restriction of the staging location, not part of this effort though !

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Generally LGTM

I added a few comments about testing and making the same changes to other version of spark

@@ -312,22 +313,25 @@ private String rebuildMetadata() {
}

private String saveFileList(Set<Pair<String, String>> filesToMove) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this exists in 3.4 and 4.0 as well, should we also make changes to those files?
https://github.com/search?q=repo%3Aapache%2Ficeberg%20private%20String%20saveFileList&type=code

@szehon-ho szehon-ho Jul 7, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, yea we typically do the latest version first (spark 4.0). then its optionally whether we want to do all the sparks in one pr or separate.

@@ -312,22 +313,25 @@ private String rebuildMetadata() {
}

private String saveFileList(Set<Pair<String, String>> filesToMove) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NikitaMatskevich can you take a look if its possible

@NikitaMatskevich NikitaMatskevich Jul 8, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share your thoughts on the nature of the test that you would like to see for this change @kevinjqliu @szehon-ho ?

If you want me to reproduce the real-life issue that we've had, then I will need to implement the setup in the iceberg-spark test module where native iceberg file io has enough permissions to the filesystem and hadoop doesn't. It should be something very close to the azure setup with managed identities that we've had, or maybe there are other complex edge-cases that I am not aware of. Anyway, do you think that such effort would be justified for this tiny PR ? It does not introduce any important changes. Usage of native table IO in this context cannot really be bug-prone by itself, because the modified class already uses exactly the same file io instance to run other steps of the action.

If you are concerned with the test coverage of this method, in TestRewriteTablePathsAction.java there are many good tests, and every one of them passes through the step of saving a file list somewhere and reading the files from this file list to copy them afterwards. I can see that the file list is created allright on my machine and if I compare it to the spark-generated one, it has exactly the same structure and file contents.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every one of them passes through the step of saving a file list somewhere and reading the files from this file list to copy them afterwards

great! that was my original motivation for the test, to make sure that the file list is properly materialized to storage. 👍

@kevinjqliu

Copy link
Copy Markdown
Contributor

The following files had format violations:

CI failed on formatting

@NikitaMatskevich NikitaMatskevich force-pushed the nmh/use-resolving-file-io-in-rewrite-table-path branch from 3b6d49a to 457c9be Compare July 8, 2025 15:22

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kevinjqliu

Copy link
Copy Markdown
Contributor

@szehon-ho since we made changes to spark 3.4, 3.5, and 4.0, are there any special deployment steps we need to go through?

@szehon-ho

Copy link
Copy Markdown
Member

@kevinjqliu not as far as I know, i guess it will get bundled correctly in the appropriate spark-iceberg jar.

@NikitaMatskevich thanks it looks good to me

@szehon-ho szehon-ho merged commit 3f22677 into apache:main Jul 9, 2025
27 checks passed
@szehon-ho

Copy link
Copy Markdown
Member

thanks @NikitaMatskevich , also @kevinjqliu , @singhpk234 for thoughtful reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants