[Core] Add max files rewrite option for RewriteAction by coderfender · Pull Request #12824 · apache/iceberg

coderfender · 2025-04-17T06:22:47Z

This PR adds a new option for rewriting data files (Spark Actions) that lets the user limit the number of files rewritten, which can reduce file operations. The option is named max-files-to-rewrite and takes a positive integer; the file tasks are truncated once that value is reached. If the table has fewer files than the parameter value, all files are processed for rewrite. A property check that rejects any value less than 0 has also been added so that invalid input fails early.
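A minimal sketch of the kind of early property check described above, in plain Java. The helper name `parseMaxFiles` and the error message are hypothetical and not taken from the PR. The thread quotes both "less than 0" and "less than 1" for the bound; this sketch assumes the stricter reading (reject anything below 1), since the option is described as taking a positive integer.

```java
import java.util.Map;

final class MaxFilesOptionSketch {
  // Option name as given in the PR description.
  static final String MAX_FILES_TO_REWRITE = "max-files-to-rewrite";

  // Hypothetical helper: returns null when the option is absent (no limit),
  // otherwise the parsed value. Fails fast on non-positive input (assumed bound).
  static Integer parseMaxFiles(Map<String, String> options) {
    String raw = options.get(MAX_FILES_TO_REWRITE);
    if (raw == null) {
      return null;
    }
    int value = Integer.parseInt(raw);
    if (value < 1) {
      throw new IllegalArgumentException(
          "Cannot set " + MAX_FILES_TO_REWRITE + " to " + value + ": the value must be a positive integer");
    }
    return value;
  }
}
```

Failing at option-parsing time means a bad value is reported before any planning work starts.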

Implementation :

The plan method in BinPackRewriteFilePlanner has been refactored to truncate the list of file scan tasks (and thereby the files to be processed).
An atomic integer called fileCountRunner (atomic to ensure consistency in streams) keeps a running count as the StructLikeMap<List<List<FileScanTask>>> plan is processed in parallel.
If the size of the entire fileScanTask list in a partition is > maxFilesToRewrite + fileCountRunner, the list is truncated so that files are added only until the maxFilesToRewrite value is reached.
selectedFileGroups holds the final file groups.
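The truncation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in BinPackRewriteFilePlanner: plain strings stand in for FileScanTask and for the StructLike partition keys, and the map is walked sequentially so the result is deterministic. The names fileCountRunner and selectedFileGroups are reused from the description; everything else is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

final class TruncatePlanSketch {
  // Strings stand in for FileScanTask; String keys stand in for partitions.
  static Map<String, List<List<String>>> truncate(
      Map<String, List<List<String>>> groupsByPartition, int maxFilesToRewrite) {
    // Running count of files selected so far, shared across partitions.
    AtomicInteger fileCountRunner = new AtomicInteger(0);
    Map<String, List<List<String>>> selectedFileGroups = new LinkedHashMap<>();

    groupsByPartition.forEach((partition, groups) -> {
      for (List<String> group : groups) {
        int remaining = maxFilesToRewrite - fileCountRunner.get();
        if (remaining <= 0) {
          return; // budget exhausted: skip the rest of this partition
        }
        // Keep the whole group if it fits, otherwise only the first `remaining` files.
        List<String> kept =
            group.size() > remaining ? new ArrayList<>(group.subList(0, remaining)) : group;
        fileCountRunner.addAndGet(kept.size());
        selectedFileGroups.computeIfAbsent(partition, k -> new ArrayList<>()).add(kept);
      }
    });
    return selectedFileGroups;
  }
}
```

With partitions p1 = [[a, b, c]] and p2 = [[d, e], [f]] and a limit of 4, this keeps all of p1 and only d from p2, so only the partition that crosses the limit is truncated.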

Testing :

TestBinPackRewriteFilePlanner::testRewriteMaxFilesOption covers the upper-bound case, where max-files-to-rewrite > the total number of files in the table.
TestBinPackRewriteFilePlanner::testRewriteMaxFilesOptionInequality covers the case where max-files-to-rewrite > the total number of files in the table and the number of data files after the rewrite is less than max-files-to-rewrite.
textMaxFilesRewriteToOnlyTruncateNeededPartitions ensures that only the needed partitions are truncated.
testInvalidMaxFilesRewriteParam ensures that all validations (along with their error messages) work as expected.

coderfender · 2025-04-17T14:16:13Z

@manuzhang, I resolved the conflicts and updated the branch; the PR is ready for review.

coderfender · 2025-04-17T14:26:42Z

Issue : #12832

yogevyuval · 2025-04-17T16:58:06Z

This is a new option when re-writing data files (Spark Actions) to provide user the ability to limit the number of files re-written to potentially reduce file OPS . This option is named as max-files-to-rewrite which takes a positive integer as an input, truncates the file tasks until the value is reached. In case the table has fewer files than the parameter value, all the files are processed for re-write option. A property check to ensure that no value less than 1 has also been put in place to ensure early failure.

Implementation :

toGroupStream method in RewriteDataFilesSparkAction has been refactored to truncate the list of file scan tasks (and there by files to be processed)

An atomic integer (to ensure consistency in parallel streams) called fileCountRunner is used to update counter as the groupsByPartition is processed in parallel

In case the size of entire fileScanTask list in a partition is > maxFilesToRewrite + fileCountRunner, the fileScanTask list is truncated to only add the files until the maxFilesToRewrite value is reached.

selectedFileGroups is leveraged to hold the final file groups.

Testing :

TestRewriteDataFilesAction::testRewriteMaxFilesOption is written to handle upper bound use case where the value max-files-to-rewrite > total number of files in the table

TestRewriteDataFilesAction::testRewriteMaxFilesOptionEquality is written to handle equality use case where the value max-files-to-rewrite < total number of files in the table and the resulting data files after rewrite are equal to max-files-to-rewrite

Could you elaborate on the use case this is trying to solve? Is it to limit the resources a single job needs, in order to reduce failures?

sririshindra

If the concern of the PR is to keep the rewrite data files procedure from taking a long time, the MAX_FILES_TO_REWRITE option doesn't necessarily need to be applied across multiple partitions. Users can enable partial progress, and maybe the MAX_FILES_TO_REWRITE option should be applied per partition rather than to the whole table.

Also, maybe the size of the files should be considered instead of the number of files as the criterion for the cutoff. That way the goal would be to compact as many small files into big files as possible without overwhelming the compute.

For instance, let's say MAX_FILES_TO_REWRITE is set to 1000, but all 1000 selected files are really small (say 100 KB each). Assume they sit within one partition and that there are 10000 small files that need to be compacted overall. Rewriting all 1000 files allowed by MAX_FILES_TO_REWRITE will still only result in a larger file of 100 MB.

But Iceberg recommends (I think it's the default) a target-file-size of 512 MB. So, basically, we are missing an opportunity to compact an additional 412 MB worth of small files into that larger file.
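To make the arithmetic concrete, here is a small sketch comparing a count cap with a size budget. It is only an illustration of the suggestion: `selectBySize` is a hypothetical helper, no such option exists in the PR, and decimal units (1 MB = 1,000,000 bytes) are assumed to match the round numbers above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SizeBudgetSketch {
  // Hypothetical size-based cutoff: keep adding files while the running
  // total stays within maxBytes, and stop at the first file that would exceed it.
  static List<Long> selectBySize(List<Long> fileSizes, long maxBytes) {
    List<Long> selected = new ArrayList<>();
    long total = 0;
    for (long size : fileSizes) {
      if (total + size > maxBytes) {
        break;
      }
      selected.add(size);
      total += size;
    }
    return selected;
  }

  public static void main(String[] args) {
    long kb = 1_000L;
    long mb = 1_000L * kb;
    // 10000 small files of 100 KB each, as in the example above.
    List<Long> files = Collections.nCopies(10_000, 100 * kb);

    // Count cap of 1000 files: 1000 * 100 KB = 100 MB rewritten.
    long bytesByCount = files.stream().limit(1000).mapToLong(Long::longValue).sum();
    System.out.println("count cap rewrites " + bytesByCount / mb + " MB");

    // Size budget of 512 MB: 5120 files of 100 KB fit.
    List<Long> bySize = selectBySize(files, 512 * mb);
    System.out.println("size budget selects " + bySize.size() + " files");
  }
}
```

Under these assumptions the count cap fills less than a fifth of one 512 MB target file, which is the missed opportunity described above.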

If I remember correctly, the rewrite_data_files procedure will return the number of data files it has rewritten as part of the output. Maybe we should also return the number of data files that are yet to be rewritten (but not rewritten due to MAX_FILES_TO_REWRITE option) as part of the output as well, so that the user can make an informed decision as to how to plan their next rewrite operation.

sririshindra · 2025-04-17T18:10:56Z

  Stream<RewriteFileGroup> toGroupStream(
      RewriteExecutionContext ctx, Map<StructLike, List<List<FileScanTask>>> groupsByPartition) {
-    return groupsByPartition.entrySet().stream()
+    if (maxFilesToRewrite == null) {


Actually, can we instead set this value to LONG.MAX_FILES_TO_REWRITE by default, so that you don't need two code blocks that do the same thing? If the user doesn't specify this option, MAX_FILES_TO_REWRITE is set to the default. So, as long as the number of files being processed doesn't exceed LONG.MAX_FILES_TO_REWRITE, your implementation for the case when MAX_FILES_TO_REWRITE is set to some value will suffice.

If you instead decide to make the criteria something like MAX_FILES_SIZE_TO_REWRITE then the same trick works there as well.

Good point! My motivation here is to keep the option as simple and straightforward as possible. Having an upper bound (I am assuming you meant setting the default value of the param MAX_FILES_TO_REWRITE to LONG.MAX_VALUE) would add a side effect to this functionality by limiting the number of files to 2^63-1. However unlikely that is, I don't believe optional parameters should interfere with the default behavior; they should be isolated for the sake of consistency and clarity.
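The two positions in this exchange can be sketched side by side. This is an illustration only, with strings standing in for files and hypothetical method names; neither method is the code under review.

```java
import java.util.List;
import java.util.stream.Collectors;

final class DefaultLimitSketch {
  // Reviewer's suggestion (as interpreted in the reply): treat an absent
  // option as Long.MAX_VALUE so a single code path handles both cases.
  static List<String> withSentinel(List<String> files, Long maxFilesToRewrite) {
    long limit = maxFilesToRewrite == null ? Long.MAX_VALUE : maxFilesToRewrite;
    return files.stream().limit(limit).collect(Collectors.toList());
  }

  // Author's preference: when the option is absent, leave the default
  // path untouched and apply the limit only when it was set explicitly.
  static List<String> withNullCheck(List<String> files, Long maxFilesToRewrite) {
    if (maxFilesToRewrite == null) {
      return files;
    }
    return files.stream().limit(maxFilesToRewrite).collect(Collectors.toList());
  }
}
```

Both variants return the same files for any realistic input; the difference is whether the unset case runs through the limiting logic at all.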

coderfender · 2025-04-17T22:42:11Z

@yogevyuval, the goal here is to give the user an option to limit the number of files to be rewritten (through compaction, data rewrite, etc.). In a use case like mine, where there are 1 billion plus files in a lakehouse, the user might want to run compaction iteratively to reduce the file count to an acceptable level rather than going all in the very first time. This option should help improve rewrite Spark jobs, and users can tune this param to optimize scale and reliability.
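The iterative pattern described here can be simulated with a few lines of arithmetic. This is a toy model, not Iceberg code: `runsUntil` and `filesPerOutput` (how many small files merge into one output file) are assumptions introduced purely for illustration.

```java
final class IterativeCompactionSketch {
  // Simulates repeated bounded rewrites and returns how many runs it takes
  // for the file count to drop to `acceptable` or below.
  static int runsUntil(long fileCount, long acceptable, int maxFilesToRewrite, int filesPerOutput) {
    int runs = 0;
    while (fileCount > acceptable) {
      long rewritten = Math.min(fileCount, maxFilesToRewrite);
      // Ceiling division: number of output files produced by this run.
      long produced = (rewritten + filesPerOutput - 1) / filesPerOutput;
      if (produced >= rewritten) {
        break; // the run would not reduce the file count, so stop
      }
      fileCount = fileCount - rewritten + produced;
      runs++;
    }
    return runs;
  }
}
```

For example, starting from 10000 files with a cap of 1000 per run and 100 inputs per output file, each run removes 990 files, so reaching 1000 files or fewer takes 10 bounded runs instead of one large job.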

coderfender · 2025-04-18T04:36:18Z

@anuragmantri could you take a look whenever you get a chance please?

coderfender · 2025-04-21T14:11:52Z

@manuzhang, could you please review this? I believe I have addressed all the above issues / questions.

anuragmantri · 2025-04-21T16:25:59Z

Also @RussellSpitzer - If you have a minute.

manuzhang

@coderfender You might want to post an update on the dev list or Slack for a wider audience.

coderfender · 2025-04-22T03:24:05Z

@manuzhang, I already pinged the dev channel to get some feedback: https://apache-iceberg.slack.com/archives/C03LG1D563F/p1744871503652659 . Let me bump the message again.
Thank you for reviewing, and please let me know if you see any issues with the code that I might have missed.

coderfender · 2025-05-15T02:29:05Z

Rebased the branch with main

coderfender · 2025-05-15T15:57:38Z

@pvary, please take a look whenever you get a chance; I would love to make any other changes you recommend.
Thank you

pvary

+1 from my side
One small change and mostly a question, maybe a method name change or comment

pvary · 2025-05-16T19:08:21Z

Merged to main.
Thanks for all the work @coderfender on the PR, and @RussellSpitzer for the review!

@coderfender: Could you please create the backport PRs for Spark and Flink?

Thanks, Peter

coderfender · 2025-05-16T19:08:41Z

Thank you @pvary @RussellSpitzer @anuragmantri . I will start working on the documentation changes and raise a PR soon

RussellSpitzer · 2025-05-16T19:11:52Z

Remember to "forward port" too now that we have a 4.0 Module :)

coderfender · 2025-05-16T19:18:43Z

Sure, I will create another PR to backport / forward port this functionality. @RussellSpitzer, @pvary
Spark v3.4 and v3.5 are already done, so I will only have to make changes to support v4.0. Re: Flink, I will have to make changes to support 1.19 and v2.0, and I am hoping all these changes can go in one PR?

RussellSpitzer · 2025-05-16T19:19:46Z

I have no problem with doing all the other changes at once, I just don't like having them all in the original PR because it's harder to track changes

coderfender · 2025-05-16T19:23:48Z

Sure I will create a new PR just for the porting changes

github-actions Bot added API spark labels Apr 17, 2025

coderfender force-pushed the add_option_to_write_max_files_overwrite branch from 6bbeb57 to e8eb111 Compare April 17, 2025 06:28

manuzhang reviewed Apr 17, 2025

Comment thread ...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java Outdated

coderfender marked this pull request as ready for review April 17, 2025 14:13

coderfender requested a review from manuzhang April 17, 2025 14:16

coderfender mentioned this pull request Apr 17, 2025

Provide a new option to limit number of files being processed during Spark based Rewrite #12832

Closed

3 tasks

coderfender force-pushed the add_option_to_write_max_files_overwrite branch from 5abd324 to f53b205 Compare April 17, 2025 15:05

sririshindra reviewed Apr 17, 2025

Comment thread spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java Outdated

sririshindra reviewed Apr 17, 2025

Comment thread ...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java Outdated

Comment thread ...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java Outdated

sririshindra reviewed Apr 17, 2025

manuzhang reviewed Apr 22, 2025

nastra requested review from RussellSpitzer and szehon-ho April 22, 2025 08:00

RussellSpitzer reviewed Apr 22, 2025

Comment thread api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java Outdated

RussellSpitzer reviewed Apr 22, 2025

Comment thread ...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java Outdated

coderfender requested review from RussellSpitzer and sririshindra April 24, 2025 19:02

coderfender force-pushed the add_option_to_write_max_files_overwrite branch from d3cce44 to 1c6ed65 Compare April 24, 2025 19:19

RussellSpitzer reviewed Apr 24, 2025

Comment thread spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java Outdated

RussellSpitzer reviewed Apr 24, 2025

Comment thread ...k/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/RewriteDataFilesSparkAction.java Outdated

coderfender requested a review from RussellSpitzer April 26, 2025 17:17

Bhargava Vadlamani added 5 commits May 14, 2025 19:21

address_review_comments_add_tests

fa1920c

address_review_comments_add_tests

9b4d203

address_review_comments_add_tests

e5896cc

address_review_comments_add_tests

3a4b6e1

address_review_comments_add_tests

984c8bc

coderfender force-pushed the add_option_to_write_max_files_overwrite branch from 650d730 to 984c8bc Compare May 15, 2025 02:21

Bhargava Vadlamani added 2 commits May 14, 2025 23:40

address_review_comments_add_tests

bc67c0a

address_review_comments_add_tests

d0f5062

pvary reviewed May 16, 2025

Comment thread flink/v1.20/flink/src/main/java/org/apache/iceberg/flink/maintenance/api/RewriteDataFiles.java Outdated

pvary reviewed May 16, 2025

Comment thread core/src/test/java/org/apache/iceberg/actions/TestBinPackRewriteFilePlanner.java Outdated

pvary reviewed May 16, 2025

Comment thread core/src/main/java/org/apache/iceberg/actions/BinPackRewriteFilePlanner.java

pvary approved these changes May 16, 2025

Bhargava Vadlamani added 2 commits May 16, 2025 02:55

address_review_comments_add_tests

49a2ca5

address_review_comments_add_tests

7f77c24

pvary merged commit c247896 into apache:main May 16, 2025
42 checks passed

coderfender mentioned this pull request May 16, 2025

add_docs_and_backport_max_files_rewrite_option #13082

Merged

pvary pushed a commit that referenced this pull request May 20, 2025

Spark, Flink: Backport add max files rewrite option for RewriteAction (…

09a5317

…#13082) backports #12824

cbb330 mentioned this pull request Sep 26, 2025

Support budgeted file rewrite, ordering by file sequence number linkedin/iceberg#189

Merged

alessandro-nori mentioned this pull request Sep 29, 2025

Docs: Document max-files-to-rewrite in Spark rewrite_data_files #14211

Merged

devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025

Core: Add max files rewrite option for RewriteAction (apache#12824)

e6e83db

devendra-nr pushed a commit to devendra-nr/iceberg that referenced this pull request Dec 8, 2025

Spark, Flink: Backport add max files rewrite option for RewriteAction (…

ef8e80b

…apache#13082) backports apache#12824

alessandro-nori mentioned this pull request Jun 9, 2026

feat(compaction): add MaxFilesToRewrite option apache/iceberg-go#1175

Draft
