Skip to content

Spark 4.0: Pass format-version when creating a snapshot in table migration actions#14163

Merged
amogh-jahagirdar merged 3 commits into
apache:mainfrom
Fokko:fd-fix-snapshot
Sep 23, 2025
Merged

Spark 4.0: Pass format-version when creating a snapshot in table migration actions#14163
amogh-jahagirdar merged 3 commits into
apache:mainfrom
Fokko:fd-fix-snapshot

Conversation

@Fokko

@Fokko Fokko commented Sep 22, 2025

Copy link
Copy Markdown
Contributor

Closes #14162

@github-actions github-actions Bot added the spark label Sep 22, 2025
@kevinjqliu

kevinjqliu commented Sep 23, 2025

Copy link
Copy Markdown
Contributor

TestAddFilesProcedure > addFilteredPartitionsToPartitioned() > catalogName = testhive, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hive, default-namespace=default}, formatVersion = 1 FAILED
java.lang.IllegalArgumentException: Snapshot id must be assigned during commit
at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)

from

@TestTemplate
public void addFilteredPartitionsToPartitioned() {
createCompositePartitionedTable("parquet");
createIcebergTable(
"id Integer, name String, dept String, subdept String", "PARTITIONED BY (id, dept)");
List<Object[]> result =
sql(
"CALL %s.system.add_files('%s', '`parquet`.`%s`', map('id', 1))",
catalogName, tableName, fileTableDir.getAbsolutePath());
assertOutput(result, 2L, 1L);
assertEquals(
"Iceberg table contains correct data",
sql("SELECT id, name, dept, subdept FROM %s WHERE id = 1 ORDER BY id", sourceTableName),
sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));
}

,

and
Preconditions.checkArgument(
manifest.snapshotId() == null || manifest.snapshotId() == -1,
"Snapshot id must be assigned during commit");

looks like we can relax that constraint; snapshot_id could be null :)

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines +881 to +886
final Long snapshotId;
if (formatVersion == 1 && !snapshotIdInheritanceEnabled) {
snapshotId = ops.newSnapshotId();
} else {
snapshotId = null;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
from the docs, https://iceberg.apache.org/spec/#manifest-entry-fields

Snapshot id where the file was added, or deleted if status is 2. Inherited when null.

@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Pass format-version when creating a snapshot Spark: Pass format-version when creating a snapshot in table migration actions Sep 23, 2025

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Fokko for the fix, looks right to me! Is it possible to add a test to some either the snapshot action test or table migration test which asserts more state about the manifests that are produced via the procedure? I'd imagine in those tests we should be able to query the metadata tables and assert that some of these fields on the manfiest are what we expect?

@Fokko

Fokko commented Sep 23, 2025

Copy link
Copy Markdown
Contributor Author

@kevinjqliu It was late yesterday, but I wanted to get something out. It looks like the last snipper you shared shows the right approach rather than relaxing constraints:

Preconditions.checkArgument(
manifest.snapshotId() == null || manifest.snapshotId() == -1,
"Snapshot id must be assigned during commit");

Instead, we set it to -1, and the SnapshotProducer will set the SnapshotID:

// the manifest must be rewritten with this update's snapshot ID and null first_row_ids
ManifestFile copiedManifest = copyManifest(manifest);
rewrittenAppendManifests.add(copiedManifest);

@amogh-jahagirdar Thanks for the review, I've added some tests 👍. I have to probe the Avro files directly because Java does allow for V1 tables to have Snapshot Inheritance (which is not allowed by the spec because snapshot_id has to be required).

@nastra Do you have time to check out the JUnit conventions? I'm not doing Java often enough :D

sql("INSERT INTO TABLE %s (id, data) VALUES (1, 'a'), (2, 'b')", SOURCE_NAME);
List<Object[]> result =
sql(
"CALL %s.system.snapshot(source_table => '%s', table => '%s', parallelism => %d, properties => map('format-version', '1'))",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to just create two separate test methods without having any parallelism flag. Because what we're effectively testing is that the manifest has the correct format version. Having parallelism in the mix makes readers of the code wonder whether that flag is important or not to reproduce the issue

@Fokko Fokko Sep 23, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will add an aditional test, but that works for me 👍

@Fokko Fokko added this to the Iceberg 1.10.1 milestone Sep 23, 2025
@manuzhang manuzhang changed the title Spark: Pass format-version when creating a snapshot in table migration actions Spark 4.0: Pass format-version when creating a snapshot in table migration actions Sep 23, 2025

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Fokko this looks great to me! Thank you @kevinjqliu @huaxingao @nastra for reviewing

@amogh-jahagirdar amogh-jahagirdar merged commit a71f521 into apache:main Sep 23, 2025
27 checks passed
@kevinjqliu

Copy link
Copy Markdown
Contributor

do we want to backport this to Spark 3.5?

@amogh-jahagirdar

Copy link
Copy Markdown
Contributor

@kevinjqliu Yes I think we should considering 3.5 is still going to be supported in the next release.

@Fokko Fokko deleted the fd-fix-snapshot branch September 23, 2025 15:56
@Fokko

Fokko commented Sep 23, 2025

Copy link
Copy Markdown
Contributor Author

@kevinjqliu @amogh-jahagirdar I think it makes sense to backport it to 3.4 and 3.5 for the 1.10.1 release since that one still supports both versions. Let me raise a PR in a sec

thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Snapshot action creates a V2 table with V1 metadata

5 participants