Adding new rewrite manifest spark action to accept custom partition order by zachdisc · Pull Request #12840 · apache/iceberg

zachdisc · 2025-04-18T18:20:42Z

Note this is a fresh PR replacing #9731. That one had accumulated too many conflicts and changes, and I messed it up while rebasing. This is a clean start with all previous feedback incorporated.

What

This adds a simple sort method to the RewriteManifests Spark action which lets users specify the partition column order to consider when grouping manifests.

Illustration:

RewriteManifests.Result result =
        actions
            .rewriteManifests(table)
            .sort("c", "b", "a")  // <-- this is the new API piece
            .execute();

Closes #9615

Why

Iceberg's metadata is organized into a forest of manifest_files which point to data files sharing common partitions. By default, and during RewriteManifests, the partition grouping is determined by the default Spec partition order. If the primary query pattern is more aligned with the last partition in the table's spec, manifests are poorly suited to quickly plan and prune around those partitions.

E.g.

CREATE TABLE
...
PARTITIONED BY (region, storeId, bucket(100, ipAddress), days(event_time))

This will create manifests that first group by region, whose manifest_file contents may span a wide range of event_time values. For a primary query pattern that doesn't care about region, storeId, etc., this leads to inefficient queries.
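To make the pruning argument concrete, here is a minimal sketch in plain Java. It does not use the Iceberg API: the `Manifest` class, the day ranges, and the manifest counts are all made-up assumptions for illustration. It only models the idea that a manifest can be skipped when its min/max range for a partition field does not overlap the query range.

```java
import java.util.Arrays;
import java.util.List;

class ManifestPruningSketch {
  // Hypothetical stand-in for a manifest's lower/upper bound on days(event_time).
  static final class Manifest {
    final int minDay;
    final int maxDay;

    Manifest(int minDay, int maxDay) {
      this.minDay = minDay;
      this.maxDay = maxDay;
    }
  }

  // Counts manifests whose [minDay, maxDay] range overlaps the queried range.
  static int manifestsToOpen(List<Manifest> manifests, int fromDay, int toDay) {
    int count = 0;
    for (Manifest m : manifests) {
      if (m.maxDay >= fromDay && m.minDay <= toDay) {
        count++;
      }
    }
    return count;
  }

  // Assumed layout when grouping by region first: every manifest spans all days.
  static List<Manifest> groupedByRegion() {
    return Arrays.asList(
        new Manifest(1, 30), new Manifest(1, 30), new Manifest(1, 30), new Manifest(1, 30));
  }

  // Assumed layout when ordering by day first: manifests cover disjoint day ranges.
  static List<Manifest> sortedByDay() {
    return Arrays.asList(
        new Manifest(1, 8), new Manifest(9, 16), new Manifest(17, 24), new Manifest(25, 30));
  }

  public static void main(String[] args) {
    System.out.println("grouped by region: " + manifestsToOpen(groupedByRegion(), 10, 10));
    System.out.println("sorted by day:     " + manifestsToOpen(sortedByDay(), 10, 10));
  }
}
```

With these assumed ranges, a query for a single day has to open every manifest in the region-first layout but only one in the day-first layout.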


RussellSpitzer · 2025-04-22T20:31:04Z

   */
  RewriteManifests rewriteIf(Predicate<ManifestFile> predicate);

+  /**


I think there are still a few issues in this doc

This should only produce a single manifest list.

Unclear what this is supposed to produce, are we sorting or clustering here? Will manifests be separated based on values of B and D? Does the order matter?

Rewriting Manifests in this way will yield manifest_lists that point to manifest_files containing data files for common 'd' and 'b' partitions.

I think the core message here is
Rewrites the manifests and orders (clusters?) their entries based on the specified transforms. Manifests can be filtered based on their min and max partition values. Choosing a frequently queried partition field can reduce planning time by skipping unnecessary manifests.

Or something along those lines

This also shouldn't use the current spec but the one specified by specId(int specId)

Can update documentation. Thanks!

RussellSpitzer · 2025-04-22T20:31:21Z

+   *     bucket(N, data) partition * definition
+   * @return this for method chaining
+   */
+  default RewriteManifests clusterBy(List<String> partitionFields) {


why cluster and not sort?

In the original proposal we talked about updating the Spark api to match the Java API, where they have a clusterBy interface

iceberg/core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java

Line 116 in 90d1c90

public RewriteManifests clusterBy(Function<DataFile, Object> func) {


Educate me - I see clustering and sorting as synonyms. Clustering is just the word for the spark technique to sort and repartitionByRange data into "clusters". I'm not married to either.

Hmm that's a good question. In my eyes we are doing a hierarchical sort which feels different to me than a multi-dimensional clustering algo. So for example Cluster(a, b) might get me manifests with common tuples where A and B are correlated but we can't actually do that here.

So for example, I would expect cluster to make files like

{(1,1)(1,2)(2,1)(2,2)} {(1,3)(1,4)(2,3)(2,4)} {(3,1)(3,2)(4,1)(4,2)} {(3,3)(3,4)(4,3)(4,4)}

I would consider that clustered

But our current algo can't do that. It can only do a hierarchical sort, where each column is dependent on the one before it. In the above example, if I cluster (a,b) I would produce

{(1,1)(1,2)(1,3)(1,4)} {(2,1)(2,2)(2,3)(2,4)} {(3,1)(3,2)(3,3)(3,4)} {(4,1)(4,2)(4,3)(4,4)}
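A rough sketch of that hierarchical behavior, in plain Java rather than the actual action code: the `Tuple` class, the group size of 4, and the helper names are assumptions for illustration. Entries are sorted by a, then b, and cut into consecutive groups, which reproduces the four groups listed above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HierarchicalSortSketch {
  // Hypothetical stand-in for the partition tuple (a, b) of one manifest entry.
  static final class Tuple {
    final int a;
    final int b;

    Tuple(int a, int b) {
      this.a = a;
      this.b = b;
    }

    @Override
    public String toString() {
      return "(" + a + "," + b + ")";
    }
  }

  // Sorts by a, then by b, then splits the sorted run into groups of perManifest entries.
  static List<List<Tuple>> sortAndSplit(List<Tuple> entries, int perManifest) {
    List<Tuple> sorted = new ArrayList<>(entries);
    sorted.sort(Comparator.comparingInt((Tuple t) -> t.a).thenComparingInt(t -> t.b));

    List<List<Tuple>> manifests = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i += perManifest) {
      int end = Math.min(i + perManifest, sorted.size());
      manifests.add(new ArrayList<>(sorted.subList(i, end)));
    }
    return manifests;
  }

  // All 16 combinations of a and b in 1..4, deliberately generated out of order.
  static List<Tuple> demoEntries() {
    List<Tuple> entries = new ArrayList<>();
    for (int b = 4; b >= 1; b--) {
      for (int a = 4; a >= 1; a--) {
        entries.add(new Tuple(a, b));
      }
    }
    return entries;
  }

  public static void main(String[] args) {
    for (List<Tuple> manifest : sortAndSplit(demoEntries(), 4)) {
      System.out.println(manifest);
    }
  }
}
```

Each resulting group has a single value of a and the full range of b, so a filter on a can skip groups while a filter on b alone cannot.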

Good illustration. I think the question in your illustration is - are the tuples globally sorted? And does that matter? I don't think the manifest list assumes sorting, so each manifest's lower/uppers are checked. In which case I'd call it "clustering".

Really no preference on naming. Like I said I just picked this to match the Java API. At the end of the day either of your examples is better than

{(1,4)(1,4)(1,4)(1,4)} {(1,4)(1,4)(1,4)(1,4)} {(1,4)(1,4)(1,4)(1,4)} {(1,4)(1,4)(1,4)(1,4)}

:-)

The original API "clusters" and this API "sorts", so I'm not sure they are comparable. I really think the naming should be "sort".

RussellSpitzer · 2025-04-22T20:49:54Z

+    // Check if these partition fields are included in the spec
+    Preconditions.checkArgument(
+        missingFields.isEmpty(),
+        "Cannot set manifest clustering because specified field(s) %s were not found in current partition spec %s.",


This shouldn't be the current spec, but the spec chosen for rewrite target
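As a sketch of what that means for the check, here is a plain-Java version with the table's specs modeled as a map from spec id to partition field names. The class, method, and message wording are hypothetical and not the code in this PR. The point is only that the requested fields are validated against the spec selected for the rewrite, not whichever spec happens to be current.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SortFieldValidationSketch {
  // specs: hypothetical map of spec id -> partition field names in that spec.
  // targetSpecId: the spec chosen for the rewrite (not necessarily the current one).
  static void validate(
      Map<Integer, List<String>> specs, int targetSpecId, List<String> sortFields) {
    List<String> specFields = specs.get(targetSpecId);
    if (specFields == null) {
      throw new IllegalArgumentException("Unknown partition spec id " + targetSpecId);
    }

    // Collect every requested field that the target spec does not contain.
    List<String> missingFields = new ArrayList<>(sortFields);
    missingFields.removeAll(specFields);

    if (!missingFields.isEmpty()) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot set manifest sort order because field(s) %s were not found in partition spec %s",
              missingFields, targetSpecId));
    }
  }
}
```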

RussellSpitzer · 2025-04-22T21:18:34Z

+    List<ManifestFile> manifests = table.currentSnapshot().allManifests(table.io());
+    assertThat(manifests).as("Should have 1 manifests before rewrite").hasSize(1);
+
+    // Capture the c3 partition's lower and upper bounds - used for later test assertions


The checks here are pretty complicated, I think it probably would be a lot simpler to just check that the rows within the manifest are ordered rather than going through all the manifest metadata. Our readers should already read through entries in order from the file I think ...

Then we could just check "contents before = contents after" - "row before unsorted on c3 != rows after sorted on c3 transform"

I'll see what I can do. Contents can still be sorted locally, but not globally sorted without the clusterBy. I think you're suggesting to just say "sorted but different" is sufficient?

I'm just trying to think how we can avoid re-implementing a check-overlapping-bounds function (we have already done this once in the Spark rewrite data files test code, but for metrics) and avoid doing all the conversions from byte buffers. It's a lot of code to have in a single test case and I think it's a bit difficult to check.

Since you are using an Integer output here we probably could just use the MetadataTable - ManifestsTable, which reads the manifest list and serializes all the partition min/maxes into strings. So you could take those, flatten them into a list, and check that the list is sorted

for (lower, upper <- partitionSummaries.<field you want>) {
  bounds.add(Integer.valueOf(lower))
  bounds.add(Integer.valueOf(upper))
}
assert that bounds is sorted
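The same idea as a small runnable sketch in plain Java. The `String[]` pairs stand in for the lower/upper strings of one partition field per manifest; how they are read from the manifests metadata table is not shown, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SortedBoundsCheck {
  // Each String[] is a hypothetical {lower, upper} pair for one manifest, in manifest order.
  static List<Integer> flatten(List<String[]> summaries) {
    List<Integer> bounds = new ArrayList<>();
    for (String[] summary : summaries) {
      bounds.add(Integer.valueOf(summary[0]));
      bounds.add(Integer.valueOf(summary[1]));
    }
    return bounds;
  }

  // True when the values never decrease, i.e. the manifests' ranges are in order
  // and overlap at most at their endpoints.
  static boolean isSorted(List<Integer> values) {
    for (int i = 1; i < values.size(); i++) {
      if (values.get(i - 1) > values.get(i)) {
        return false;
      }
    }
    return true;
  }
}
```

In a test this would reduce to a single assertion that `isSorted(flatten(summaries))` holds after the rewrite.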

Mostly I'm looking here for a way to simplify this test so it's clearer what it's doing without so much boilerplate code. If you have other suggestions I'm open to those as well

I'd also like to apologize for taking so long to get back to this. Please feel free to ping me on Slack if I haven't noticed a GitHub comment. My email is a bit packed so sometimes I lose important notices.

Will simplify, thanks for staying plugged in!

…g column identifier for RewriteManifests with custom ordering

zachdisc · 2025-05-07T17:29:14Z

Made most of the changes asked for. Still undecided around clustering vs sorting naming, and I didn't change the test that verifies content ordering within manifests.

github-actions · 2025-06-07T00:17:25Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

zachdisc · 2025-06-09T14:35:52Z

@RussellSpitzer mind a review?

RussellSpitzer · 2025-06-09T14:54:00Z

+  /**
+   * Rewrite manifests in a given order, based on partition field names
+   *
+   * <p>Supply an optional set of partition field names to cluster the rewritten manifests by.


Here and below we should be using "sort" and not "cluster"

Updated documentation

zachdisc · 2025-06-18T16:03:40Z

Reverted signature back to sortBy at @RussellSpitzer's request

github-actions · 2025-07-19T00:18:19Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

zachdisc · 2025-07-21T14:32:09Z

@RussellSpitzer

zachdisc · 2025-08-13T15:28:39Z

@RussellSpitzer anything else you want here?

zachdisc · 2025-08-15T19:56:50Z

@RussellSpitzer thanks for the tip on the unit test. Spot on with your simplification idea. Hope this is what you're looking for!

RussellSpitzer · 2025-08-20T15:01:26Z

Merged! Thanks @zachdisc for the contribution. You should now be able to get the integration tests to run automatically. If you have a chance, please backport and forward-port your changes to the other Spark versions. I assume they will be clean merges so we can do a fast review on those and get them in as well.

Adding new rewrite manifest spark action to accept custom partition order

a0ff66e

github-actions Bot added API spark labels Apr 18, 2025