Spark: enable stream-results option for remove orphan files by arifazmidd · Pull Request #14278 · apache/iceberg

arifazmidd · 2025-10-08T01:41:45Z

Description

This PR adds streaming support to the remove_orphan_files Spark procedure to prevent driver OOM issues when dealing with tables that have many orphan files.

This mimics the existing behavior for expire_snapshots with the stream_results parameter that was added in #4152
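
To illustrate why streaming helps, here is a minimal plain-Java sketch (no Spark dependency) that contrasts collecting every partition into one list with iterating partition by partition. The class `PartitionedPaths` and its methods are hypothetical stand-ins, not Iceberg or Spark APIs. The only Spark behavior it assumes is the documented property that `toLocalIterator()` needs roughly as much driver memory as the largest partition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical model of a partitioned dataset of file paths; not a Spark API. */
final class PartitionedPaths {
  private final List<List<String>> partitions;
  // Largest number of paths held by the "driver" at any one time.
  private int peakResident = 0;

  PartitionedPaths(List<List<String>> partitions) {
    this.partitions = partitions;
  }

  int peakResident() {
    return peakResident;
  }

  /** Models collectAsList(): every path is materialized on the driver at once. */
  List<String> collectAll() {
    List<String> all = new ArrayList<>();
    for (List<String> partition : partitions) {
      all.addAll(partition);
    }
    peakResident = Math.max(peakResident, all.size());
    return all;
  }

  /** Models toLocalIterator(): only the current partition is resident. */
  Iterator<String> localIterator() {
    return new Iterator<String>() {
      private int nextPartition = 0;
      private Iterator<String> current = Collections.emptyIterator();

      @Override
      public boolean hasNext() {
        // Advance to the next non-empty partition only when the current one is drained.
        while (!current.hasNext() && nextPartition < partitions.size()) {
          List<String> partition = partitions.get(nextPartition++);
          peakResident = Math.max(peakResident, partition.size());
          current = partition.iterator();
        }
        return current.hasNext();
      }

      @Override
      public String next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return current.next();
      }
    };
  }
}
```

In this model the peak for `collectAll()` grows with the total number of orphan files, while the peak for `localIterator()` is bounded by the largest partition.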

Changes

Added stream-results option to DeleteOrphanFilesSparkAction
Modified deleteFiles() method to take a dataset and process streaming deletion using toLocalIterator()
Returns sample of up to 20,000 file paths
Added STREAM_RESULTS_PARAM to RemoveOrphanFilesProcedure
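
The "sample plus total count" result described in the list above could be produced roughly as follows. This is a sketch under assumptions, not the PR's code: `SampledPaths` and `consume` are hypothetical names, and keeping the first paths seen is just one plausible sampling strategy.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

/** Hypothetical result holder: a bounded sample of paths plus the full count. */
final class SampledPaths {
  final List<String> sample;
  final long totalCount;

  private SampledPaths(List<String> sample, long totalCount) {
    this.sample = Collections.unmodifiableList(sample);
    this.totalCount = totalCount;
  }

  /**
   * Drains the iterator once, retaining at most maxSampleSize paths (assumption: the
   * first ones seen) while still counting every path.
   */
  static SampledPaths consume(Iterator<String> paths, int maxSampleSize) {
    List<String> sample = new ArrayList<>();
    long total = 0;
    while (paths.hasNext()) {
      String path = paths.next();
      if (sample.size() < maxSampleSize) {
        sample.add(path);
      }
      total++;
    }
    return new SampledPaths(sample, total);
  }
}
```

Memory for the returned result stays bounded by the sample size no matter how many paths flow through the iterator.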

Real-World Testing Results

Tested on AWS EMR with a production table containing ~3PB of orphaned data:

| Run Description | Learnings |
| --- | --- |
| Dry Run using Original Implementation | Driver crashes due to OOM. |
| Dry Run using Stream Results | Completed successfully. Returned a sample of 20k file paths that will be deleted and the total count of files that will be deleted as ~45M. |
| Full Run using Original Implementation | Driver crashes due to OOM. |
| Full Run using Stream Results | Terminated after 4 hours because that was the timeout set; however, it successfully iterated through and deleted ~31M files. |
| Full Run using Original Implementation (after streaming run) | Completed successfully. Deleted the remaining ~14M orphaned files. |
| Full Run using Stream Results on sps-eta (validation run) | Completed successfully. Deleted ~1200 new orphan files. |

Key Findings:

Original implementation consistently crashes with OOM on large-scale orphan file cleanup (~45M files)
Streaming implementation successfully handles massive workloads without memory issues
Successfully deleted ~31M files in a single streaming run (terminated by timeout, not failure)
Streaming approach enables incremental cleanup of large orphan file sets

arifazmidd · 2025-10-13T23:16:11Z

Hi @RussellSpitzer @pvary @liziyan-lzy @huaxingao, if any of you have time to help review this that would be greatly appreciated!

pvary · 2025-10-14T14:23:42Z

@arifazmidd: Could you fix the test please?

arifazmidd · 2025-10-14T17:32:45Z

Thanks for running the CI @pvary; I have fixed the formatting issues.

pvary · 2025-10-21T13:59:29Z

+      }
+    }
+
+    orphanFileDS.unpersist();


I am unfamiliar with Spark, so this is just a question.

Do we need a try/finally to unpersist the DS?

I have added the try/finally as you suggested, but it's not strictly necessary: Spark automatically monitors cache usage and drops old data in an LRU fashion. We are just explicitly unpersisting as soon as we no longer need the data.

pvary · 2025-10-21T14:01:13Z

+      Dataset<String> orphanFileDS, SetAccumulator<Pair<String, String>> conflicts) {
+    // Cache the dataset and force computation to populate the conflicts accumulator
+    // This allows us to validate conflicts before starting any deletions
+    orphanFileDS = orphanFileDS.cache();


I'm not familiar with Spark, so this is just a question:
What is the cost of this? Do we lose what we gain with not reading every file at once?

The data is being cached into the executors' memory so it's distributed and will spill to disk if needed. We also unpersist this data when it is no longer needed. This prevents us from having to double compute this dataset; once during the prefix validation on count() and then again on toLocalIterator or collectAsList(). Instead the dataset can be streamed or collected from cache.

pvary · 2025-10-23T13:22:50Z

+  }
+
+  @TestTemplate
+  public void testStreamResultsBackwardsCompatibility() throws IOException, InterruptedException {


There is no actual difference between the tests with streaming on and off.
The only noticeable difference could be the sample/full result in the list.
Could we test that?
Shall we create a method, or change the parametrization of this single test to avoid code duplication?

Yes there is no real difference between what the result is with streaming on or off other than the output containing a sample vs the full result.

I have removed testStreamResultsBackwardsCompatibility() as it is already covered in testDryRun. I also removed testStreamResultsWithDryRun() and added a check in testDryRun with streaming enabled.

Now we only have one new test method, testStreamResultsDeletion, which ensures files are correctly deleted with streaming enabled. It doesn't check that the output is capped at 20k rows, though. I'm not sure that's something we want to test, since it would require creating 20k+ files. The correctness of the streaming behavior is already validated by these tests.

The correctness of the streaming behavior is already validated by these tests.

If I accidentally remove the whole streaming feature but keep the config, the test will not fail. For me, this is concerning.

Hmm that's a good point. To address this I attempted to follow the ExpireSnapshotsAction.testUseLocalIterator() pattern by comparing Spark job counts, but it turns out both modes use the same number of jobs in our case because:

Both modes cache and count the dataset for validation

The difference (collectAsList vs toLocalIterator) happens on the cached data, which doesn't create different job patterns

Alternative approaches I can think of right now to verify streaming behavior:

Test with enough files to hit the 20k sampling limit. I guess someone could still remove most of the streaming feature but just keep this sampling portion at the end and the test would still pass though.

Test that verifies function calls (toLocalIterator vs collectAsList).

Do you think either of these suffice?

Maybe make the sample list file sizes configurable for tests?
That's still lame, because the feature is streaming and not the collection 😄
On second thought, maybe it's OK. At least we would know that the result is not stored in memory.

Could we extend, inject, or directly call the RemoveOrphanFilesProcedure? This is a very Sparky question, and I don't have a good answer.

Yes but the multiple batch deletion is not unique to streaming mode. We use an iterator and delete in batches for both

    if (streamResults()) {
      return deleteFiles(orphanFileDS.toLocalIterator(), orphanFileDS);
    } else {
      return deleteFiles(orphanFileDS.collectAsList().iterator(), orphanFileDS);
    }

and then in deleteFiles:

    Iterator<List<String>> fileGroups = Iterators.partition(orphanFiles, DELETE_GROUP_SIZE);
    while (fileGroups.hasNext()) {
      ... <call to deleteBulk or deleteNonBulk> ...
    }

Do I understand the Spark behavior correctly if I think that Iterators.partition(orphanFiles, DELETE_GROUP_SIZE) will create a single group for the non-streaming path and multiple groups for the streaming path?

If I am mistaken in that assumption, then I don't understand why we have two parameters in the deleteFiles method. We could have a single DS parameter, or even merge deleteFiles into doExecute.

I believe the group number will be equivalent for both streaming and non-streaming.
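
A small plain-Java sketch of why the group count does not depend on where the iterator comes from. `BatchedDelete.deleteInGroups` is a hypothetical, hand-written equivalent of looping over Guava's `Iterators.partition`, and the `deleter` callback stands in for the bulk or non-bulk delete call; neither is taken from the PR.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper that mirrors partitioning an iterator into fixed-size groups. */
final class BatchedDelete {
  private BatchedDelete() {}

  /**
   * Drains the iterator in groups of at most groupSize paths, hands each group to the
   * deleter, and returns how many groups were processed.
   */
  static int deleteInGroups(
      Iterator<String> paths, int groupSize, Consumer<List<String>> deleter) {
    if (groupSize <= 0) {
      throw new IllegalArgumentException("groupSize must be positive");
    }
    int groups = 0;
    while (paths.hasNext()) {
      List<String> group = new ArrayList<>();
      while (paths.hasNext() && group.size() < groupSize) {
        group.add(paths.next());
      }
      deleter.accept(group);
      groups++;
    }
    return groups;
  }
}
```

Whether the iterator wraps an in-memory list or is produced lazily, n paths yield ceil(n / groupSize) groups; only the amount of data resident behind the iterator differs.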

We had two parameters because ExpireSnapshotsAction's deleteFiles method takes an iterator, so to stay consistent we passed the iterator, and we passed the dataset as well so it could be unpersisted. But you are correct, this can be improved. There are two options I see.

Pass only the dataset and determine the iterator to use within the deleteFiles method.

Pass only the iterator and don't unpersist the dataset from cache manually, allow Spark to handle it on its own.

I will be out for a few days. Will come back next week. Sorry for the delay

No worries, thanks for the review so far. I've updated the test to have a configurable output sample size parameter as we previously discussed and simplified deleteFiles to only take the dataset as a parameter (option 1 from above).

pvary

Generally looks good to me, but I would like someone with more Spark knowledge to review it too.
CC: @szehon-ho, @huaxingao, @RussellSpitzer

pvary · 2025-11-04T09:22:05Z

+
+    // Cache and force computation to populate conflicts accumulator
+    orphanFileDS = orphanFileDS.cache();
+    orphanFileDS.count();


Is it possible to have an error here, and we don't unpersist the DS?

Good catch, yes it is possible.

We can either go with the alternative option I had suggested above:

    Dataset<String> orphanFileDS = null;
    try {
      orphanFileDS = findOrphanFiles(...);
      return deleteFiles(orphanFileDS);
    } finally {
      if (orphanFileDS != null) {
        orphanFileDS.unpersist();
      }
    }

or wrap the code block in findOrphanFiles with try-catch.

    orphanFileDS = orphanFileDS.cache();
    try {
      orphanFileDS.count();
      if (prefixMismatchMode == PrefixMismatchMode.ERROR && !conflicts.value().isEmpty()) {
        throw new ValidationException(...);
      }
      return orphanFileDS;
    } catch (Exception e) {
      // not sure if we want to be catching broad exception and re-throwing like this
      orphanFileDS.unpersist();
      throw e;
    }

I think the first option is better.

In case of an error, we might have already called orphanFileDS.cache() but not returned a value, only thrown an exception. In that case orphanFileDS will be null, and this code will not unpersist the cache:

    Dataset<String> orphanFileDS = null;
    try {
      orphanFileDS = findOrphanFiles(...);
      return deleteFiles(orphanFileDS);
    } finally {
      if (orphanFileDS != null) {
        orphanFileDS.unpersist();
      }
    }

So I think we need to do the wrapping in the findOrphanFiles.
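
The leak described above can be reproduced without Spark. In this sketch, `FakeDataset` is a hypothetical stand-in for a cached Dataset that only counts live caches, and `findLeaky`/`findSafe` are illustrative names for the two shapes of findOrphanFiles under discussion; none of this is the PR's actual code.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical demo of why cleanup must live where the cache is created. */
final class UnpersistLeakDemo {
  private UnpersistLeakDemo() {}

  /** Stand-in for a cached Spark Dataset; tracks how many caches are still live. */
  static final class FakeDataset {
    private final AtomicInteger liveCaches;

    FakeDataset(AtomicInteger liveCaches) {
      this.liveCaches = liveCaches;
    }

    FakeDataset cache() {
      liveCaches.incrementAndGet();
      return this;
    }

    void unpersist() {
      liveCaches.decrementAndGet();
    }

    /** Models count() plus conflict validation, which may throw. */
    void validate(boolean fail) {
      if (fail) {
        throw new IllegalStateException("validation failed");
      }
    }
  }

  /** Caches, then validates; on failure the exception escapes with the cache live. */
  static FakeDataset findLeaky(AtomicInteger liveCaches, boolean fail) {
    FakeDataset ds = new FakeDataset(liveCaches).cache();
    ds.validate(fail);
    return ds;
  }

  /** Same steps, but releases the cache before rethrowing. */
  static FakeDataset findSafe(AtomicInteger liveCaches, boolean fail) {
    FakeDataset ds = new FakeDataset(liveCaches).cache();
    try {
      ds.validate(fail);
      return ds;
    } catch (RuntimeException e) {
      ds.unpersist();
      throw e;
    }
  }

  /** Caller-side try/finally; ds is still null if the finder throws. */
  static void run(AtomicInteger liveCaches, boolean fail, boolean safeFinder) {
    FakeDataset ds = null;
    try {
      ds = safeFinder ? findSafe(liveCaches, fail) : findLeaky(liveCaches, fail);
      // Deletion would happen here.
    } finally {
      if (ds != null) {
        ds.unpersist();
      }
    }
  }
}
```

With the leaky finder a failed validation leaves one cache live, because the caller's finally block never sees a non-null reference; with the safe finder the count returns to zero.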

huaxingao · 2025-11-05T05:34:54Z

 *
+ * <p>Streaming mode can be enabled via the {@value #STREAM_RESULTS} option to avoid loading all
+ * orphan file paths into driver memory. When enabled, the result will contain only a sample of file
+ * paths (up to {@value #MAX_ORPHAN_FILE_PATHS_TO_RETURN_WHEN_STREAMING}). The total count of


Should this be @value #MAX_ORPHAN_FILE_SAMPLE_SIZE_DEFAULT?

Ah yes, sorry, the variable name was updated but I didn't update the documentation.

huaxingao · 2025-11-05T05:39:35Z

        PREFIX_MISMATCH_MODE_PARAM,
-        PREFIX_LISTING_PARAM
+        PREFIX_LISTING_PARAM,
+        STREAM_RESULTS_PARAM


shall we update docs/docs/spark-procedures.md for remove_orphan_files to add the new parameter?

Updated docs -- same content as expire_snapshots documentation with an added note about the output size max

huaxingao · 2025-11-05T05:51:39Z

@arifazmidd Thanks for the PR! LGTM overall.

pvary · 2025-11-06T11:21:54Z

Merged to main.
Thanks @arifazmidd for the PR and @huaxingao for the review!

@arifazmidd: Please port the changes to Spark 3.4, and 4.0. This command could help:

g diff HEAD^ spark/v3.5 |sed "s/v3.5/v4.0/g">/tmp/patch;g apply -3 -p1 /tmp/patch

Please tell us on the new PR whether the backport was clean (no manual changes required) or whether you needed to make changes manually. In the latter case, highlight the extra changes so they are easier to review.

Thanks,
Peter

arifazmidd · 2025-11-06T20:20:03Z

Thanks for the reviews @pvary and @huaxingao!

Backport of apache/iceberg#14278 to openhouse-1.5.2. Adds stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, files are streamed partition-by-partition using toLocalIterator() and deleted in batches of 100K. When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Backport of apache/iceberg#14278 to openhouse-1.5.2. Adds stream-results option to DeleteOrphanFilesSparkAction to prevent driver OOM when removing large numbers of orphan files. Instead of collecting all orphan file paths into driver memory, files are streamed partition-by-partition using toLocalIterator() and deleted in batches of 100K. When enabled, the result contains a sample of up to 20,000 file paths. The total count of deleted files is logged. Co-authored-by: Dushyant Kumar <dukumar@linkedin.biz> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

stream results for remove orphan files

c137ea9

github-actions Bot added the spark label Oct 8, 2025

spotlessApply

6fff5f3

RussellSpitzer reviewed Oct 8, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

remove comment to user and improve variable naming

561d18d

arifazmidd requested a review from RussellSpitzer October 9, 2025 19:06

spotlessApply

ff61c98

Fix validation timing for prefix mismatch detection

365fc9b

pvary reviewed Oct 16, 2025

Comment thread ...v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java Outdated

pvary reviewed Oct 16, 2025

Comment thread ...v3.5/spark/src/main/java/org/apache/iceberg/spark/procedures/RemoveOrphanFilesProcedure.java Outdated

pvary reviewed Oct 16, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java

pvary reviewed Oct 16, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java

pvary reviewed Oct 16, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary reviewed Oct 16, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary reviewed Oct 16, 2025

Comment thread ...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java Outdated

pvary reviewed Oct 17, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

arifazmidd added 4 commits October 17, 2025 15:19

revert string concatenation format changes

9275235

refactor file deletion into reusable methods

b7a5f45

refactor sample path collection into method

2076d85

Add Collection import

601bf6e

pvary reviewed Oct 20, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary reviewed Oct 20, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary reviewed Oct 20, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary reviewed Oct 20, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java

Validate conflicts before deletion and refactor sample path collection method

51369f0

pvary reviewed Oct 21, 2025

Refactor: rename to findOrphanFiles and encapsulate accumulator creation

88c90df

pvary reviewed Oct 23, 2025

Comment thread ...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java Outdated

pvary reviewed Oct 23, 2025

arifazmidd added 2 commits October 23, 2025 16:49

Refactor: testDryRun covers streaming and testStreamResultsDeletion is only for deletion

88823ed

Sample size paramater for testing and refactor deleteFiles

5a65ddf

pvary reviewed Nov 3, 2025

Comment thread ...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java Outdated

pvary reviewed Nov 3, 2025

Comment thread .../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java Outdated

pvary approved these changes Nov 3, 2025

pvary requested review from huaxingao and szehon-ho November 3, 2025 13:23

Refactor: unpersist DS with creation and use constants in test

33b6fb3

pvary reviewed Nov 4, 2025

Add try catch to prevent resource leak

e2e7219

huaxingao reviewed Nov 5, 2025

Add and update docs

2912c58

github-actions Bot added the docs label Nov 5, 2025

huaxingao approved these changes Nov 5, 2025

pvary merged commit 99ccfc4 into apache:main Nov 6, 2025
29 checks passed

arifazmidd mentioned this pull request Nov 6, 2025

Spark: Backport stream-results for remove orphan files to 3.4 and 4.0 #14522

Merged

cccs-nik pushed a commit to CybercentreCanada/iceberg that referenced this pull request Nov 18, 2025

Spark: enable stream-results option for remove orphan files (apache#14278)

6eb47c2

thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026

Spark: enable stream-results option for remove orphan files (apache#14278)

a39c2dc

dushyantk1509 mentioned this pull request Mar 17, 2026

Spark: backport stream-results option for remove orphan files linkedin/iceberg#234

Merged

talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026

Spark: enable stream-results option for remove orphan files (apache#14278)

1febae7
