Spark: Add Variant read support for Spark Iceberg tables by aihuaxu · Pull Request #13219 · apache/iceberg

aihuaxu · 2025-06-02T18:52:52Z

This PR adds support for reading Variant data from Iceberg tables in Spark. When reading Variant data (unshredded or shredded), the Spark VariantReader reads an Iceberg Variant and converts it to a Spark VariantVal. The Iceberg VariantReader handles both shredded and unshredded Iceberg Variants. Currently, VariantWriter writes unshredded Iceberg Variants only.

aihuaxu · 2025-06-02T18:59:12Z

@aokolnychyi, @szehon-ho Could you check whether this is the right direction? Thanks.

amogh-jahagirdar

I was chatting with @danielcweeks and he brought up a good point that one of the implications of only releasing writing without shredding is that in the future when that is added, older readers wouldn't be able to read those datasets.

It may be worth adding that support upfront so we avoid situations like "Spark4-iceberg 1.10 has the limitation that it is not able to read shredded datasets", for example. That would be a considerable amount of work, though, and we need to think through how to do it properly (the whole write path for shredded columns is a bit unclear, and probably requires some buffering/resetting based on the records to even figure out what schema to write with).

danielcweeks · 2025-06-04T00:27:15Z

@aihuaxu I think it's important that we have a path forward for writing shredded columns before we introduce this. If we release a version that doesn't support reading shredded columns, it will be incompatible with future writers that produce shredded data.

I also think we want to produce shredded values initially so that all readers accommodate shredding to begin with. I don't think it's safe to do this in isolation.

aihuaxu · 2025-06-04T17:22:32Z

Thanks @danielcweeks and @amogh-jahagirdar for the suggestion. That makes sense. Initially I thought I could split the changes up to get feedback earlier. Let me incorporate shredding as well.

rdblue · 2025-06-26T22:16:19Z

@aihuaxu, I caught up with @danielcweeks about this yesterday and I think his concern was that we need to support reading shredded values. It would be nice to be able to write them as well, but I think as long as this can read them (and we have a test to validate it) then we should be able to move forward with this. Thanks for your patience on this while I was out at conferences!

aihuaxu · 2025-07-14T20:07:48Z

I have added a test case that reads a shredded variant. I added just one rather than multiple test cases like those in TestVariantReaders. We use ParquetVariantReaders to read and convert to VariantVal, so the logic should already be covered by the existing TestVariantReaders.

The shredded writer is not included yet.

rdblue · 2025-07-26T00:01:27Z

+    Types.VariantType icebergVariantType = Types.VariantType.get();
+    DataType sparkVariantType = SparkSchemaUtil.convert(icebergVariantType);
+
+    assertThat(sparkVariantType).isEqualTo(VariantType$.MODULE$);


I think this should be instanceof VariantType right? Instances are equivalent so when we are accepting an object we typically use the instanceof check.

I think checking against the instance goes even further, since it makes sure the singleton of VariantType is returned. Would this be better?

If there can be instances of VariantType other than VariantType$.MODULE$ then I think it is better not to be overly restrictive.

From the implementation in TypeToSparkType.java, it can only be VariantType$.MODULE$, so it should be fine to check against the instance. It cannot be any other instance. Let me know if I have misunderstood.

@Override public DataType variant(Types.VariantType variant) { return VariantType$.MODULE$; }
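
To make the trade-off in this thread concrete, here is a minimal plain-Java sketch of the two kinds of check. It does not use Spark or Iceberg: `DataTypeLike`, `VariantTypeLike`, and `convertVariant` are hypothetical stand-ins invented for illustration, and the sketch deliberately allows a second instance to be created so that the two checks can disagree.

```java
public class SingletonCheckSketch {
  // Hypothetical stand-in for a data type hierarchy (not a Spark class).
  public interface DataTypeLike {}

  // Hypothetical stand-in for a type normally exposed as one shared instance.
  // The constructor is left open on purpose so the sketch can show a second instance.
  public static final class VariantTypeLike implements DataTypeLike {
    public static final VariantTypeLike INSTANCE = new VariantTypeLike();
  }

  // Hypothetical converter that always returns the shared instance,
  // mirroring the shape of the visitor method quoted above.
  public static DataTypeLike convertVariant() {
    return VariantTypeLike.INSTANCE;
  }

  // Looser check: accepts any instance of the class.
  public static boolean isVariantByClass(DataTypeLike type) {
    return type instanceof VariantTypeLike;
  }

  // Stricter check: accepts only the shared instance.
  public static boolean isVariantSingleton(DataTypeLike type) {
    return type == VariantTypeLike.INSTANCE;
  }

  public static void main(String[] args) {
    DataTypeLike converted = convertVariant();
    DataTypeLike other = new VariantTypeLike();

    // The converter output passes both checks.
    System.out.println(isVariantByClass(converted) + " " + isVariantSingleton(converted));
    // A separately created instance passes only the class check.
    System.out.println(isVariantByClass(other) + " " + isVariantSingleton(other));
  }
}
```

If the converter can only ever return the shared instance, both checks pass and the stricter one also pins down that behavior. If other instances could legitimately appear, the class-based check is the less restrictive choice.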

rdblue · 2025-07-28T18:13:47Z

@aihuaxu, it looks like the test failures are only checkstyle. Can you update this to fix them?

rdblue · 2025-07-28T18:28:08Z

This is missing an annotation and tests are failing, but this should be ready when those are fixed. Thanks, @aihuaxu!

aihuaxu · 2025-07-28T18:28:08Z

Sorry about that. Let me fix that.

amogh-jahagirdar

Stepped through this locally, all the changes look good to me @aihuaxu !

amogh-jahagirdar · 2025-07-29T02:29:40Z

Thanks @aihuaxu @rdblue !

nastra · 2025-07-29T10:34:55Z

    // verify that the dataframe matches
-    assertThat(rows).hasSameSizeAs(records);
-    Iterator<GenericData.Record> recordIter = records.iterator();
+    assertThat(rows.size()).isEqualTo(records.size());


It is generally better to use assertThat(rows).hasSameSizeAs(records); as that will show you the content of rows/records if the assertion ever fails. @aihuaxu was there a particular reason why this check was changed?

@nastra I checked the usage: assertThat(rows).hasSameSizeAs(records); does show the content when the number of records doesn't match, unlike isEqualTo(), which helps with debugging. Let me change that.
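
The difference discussed here can be sketched in plain Java without AssertJ. The helper names below (`sizeOnlyFailure`, `sameSizeFailure`) are hypothetical, and the message strings are made up for illustration; they are not AssertJ's actual output. The point is only that a check which receives both collections can report their contents, while a check which receives two integers cannot.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class SizeAssertionSketch {
  // Mimics comparing rows.size() to records.size(): only two numbers are
  // available, so the failure message cannot mention the contents.
  public static Optional<String> sizeOnlyFailure(List<?> actual, List<?> expected) {
    if (actual.size() == expected.size()) {
      return Optional.empty();
    }
    return Optional.of("expected size " + expected.size() + " but was " + actual.size());
  }

  // Mimics a same-size check that is handed both collections: the failure
  // message can include what each one contained.
  public static Optional<String> sameSizeFailure(List<?> actual, List<?> expected) {
    if (actual.size() == expected.size()) {
      return Optional.empty();
    }
    return Optional.of(
        "expected size " + expected.size() + " but was " + actual.size()
            + "; actual=" + actual + ", expected=" + expected);
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("a", "b");
    List<String> records = Arrays.asList("a", "b", "c");

    // Both checks fail here, but only the second message shows the contents.
    System.out.println(sizeOnlyFailure(rows, records).orElse("ok"));
    System.out.println(sameSizeFailure(rows, records).orElse("ok"));
  }
}
```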

github-actions Bot added spark parquet core hive labels Jun 2, 2025

aihuaxu force-pushed the aixu-spark-basic-variant branch from aa48ff0 to 51ee01d Compare June 2, 2025 18:57

amogh-jahagirdar self-requested a review June 2, 2025 19:20

amogh-jahagirdar reviewed Jun 3, 2025

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/TripleIterator.java Outdated

amogh-jahagirdar requested a review from danielcweeks June 3, 2025 23:42

rdblue reviewed Jun 4, 2025

Comment thread spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/SparkTypeVisitor.java Outdated

aihuaxu force-pushed the aixu-spark-basic-variant branch from 51ee01d to 9efae80 Compare June 5, 2025 18:23