Spark 4.0: Pass format-version when creating a snapshot in table migration actions#14163
Conversation
b5d0040 to
651accb
Compare
from , and iceberg/core/src/main/java/org/apache/iceberg/MergeAppend.java Lines 57 to 59 in 16fa673 looks like we can relax that constraint; |
| final Long snapshotId; | ||
| if (formatVersion == 1 && !snapshotIdInheritanceEnabled) { | ||
| snapshotId = ops.newSnapshotId(); | ||
| } else { | ||
| snapshotId = null; | ||
| } |
There was a problem hiding this comment.
👍
from the docs, https://iceberg.apache.org/spec/#manifest-entry-fields
Snapshot id where the file was added, or deleted if status is 2. Inherited when null.
format-version when creating a snapshotformat-version when creating a snapshot in table migration actions
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @Fokko for the fix, looks right to me! Is it possible to add a test to some either the snapshot action test or table migration test which asserts more state about the manifests that are produced via the procedure? I'd imagine in those tests we should be able to query the metadata tables and assert that some of these fields on the manfiest are what we expect?
|
@kevinjqliu It was late yesterday, but I wanted to get something out. It looks like the last snipper you shared shows the right approach rather than relaxing constraints: iceberg/core/src/main/java/org/apache/iceberg/MergeAppend.java Lines 57 to 59 in 16fa673 Instead, we set it to iceberg/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Lines 326 to 328 in 16fa673 @amogh-jahagirdar Thanks for the review, I've added some tests 👍. I have to probe the Avro files directly because Java does allow for V1 tables to have Snapshot Inheritance (which is not allowed by the spec because @nastra Do you have time to check out the JUnit conventions? I'm not doing Java often enough :D |
| sql("INSERT INTO TABLE %s (id, data) VALUES (1, 'a'), (2, 'b')", SOURCE_NAME); | ||
| List<Object[]> result = | ||
| sql( | ||
| "CALL %s.system.snapshot(source_table => '%s', table => '%s', parallelism => %d, properties => map('format-version', '1'))", |
There was a problem hiding this comment.
I think it makes sense to just create two separate test methods without having any parallelism flag. Because what we're effectively testing is that the manifest has the correct format version. Having parallelism in the mix makes readers of the code wonder whether that flag is important or not to reproduce the issue
There was a problem hiding this comment.
This will add an aditional test, but that works for me 👍
format-version when creating a snapshot in table migration actionsformat-version when creating a snapshot in table migration actions
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @Fokko this looks great to me! Thank you @kevinjqliu @huaxingao @nastra for reviewing
|
do we want to backport this to Spark 3.5? |
|
@kevinjqliu Yes I think we should considering 3.5 is still going to be supported in the next release. |
|
@kevinjqliu @amogh-jahagirdar I think it makes sense to backport it to 3.4 and 3.5 for the 1.10.1 release since that one still supports both versions. Let me raise a PR in a sec |
Closes #14162