Fix variant type filtering in ParquetMetricsRowGroupFilter by geruh · Pull Request #14081 · apache/iceberg

geruh · 2025-09-15T07:58:34Z

Variant types were not handled correctly in Parquet row group filtering. ParquetMetricsRowGroupFilter.notNull() needs to account for variant types, which require post-scan evaluation to access shredded statistics for nested field filtering, similar to how structs are handled.

This change ensures variant columns are properly handled during Parquet metrics evaluation, allowing row groups with variant data to be correctly included for post-scan filtering rather than being incorrectly filtered out at the row group level.
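To make the described behavior concrete, here is a minimal, self-contained sketch of the decision. It does not use Iceberg classes: `TypeKind`, `notNullShouldRead`, and the boxed count parameters are hypothetical stand-ins, and the handling of missing counts is an assumption of the sketch rather than a statement about the library. The actual check added in this PR is `type instanceof Type.NestedType || type.isVariantType()`.

```java
public class VariantNotNullSketch {
  // Hypothetical stand-in for the column's type category.
  enum TypeKind { PRIMITIVE, NESTED, VARIANT }

  // Returns true when a row group might contain rows matching "col IS NOT NULL".
  // valueCount and nullCount are hypothetical per-column metrics; null means "not recorded".
  static boolean notNullShouldRead(TypeKind kind, Long valueCount, Long nullCount) {
    // Nested and variant columns are never pruned here; the filter is left for post-scan evaluation.
    if (kind == TypeKind.NESTED || kind == TypeKind.VARIANT) {
      return true;
    }
    // Assumption of this sketch: no value count means the column is absent, so it is all nulls.
    if (valueCount == null) {
      return false;
    }
    // Every value is null, so no row can satisfy IS NOT NULL.
    if (nullCount != null && valueCount - nullCount == 0) {
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // A variant column without column-level metrics is still read.
    System.out.println(notNullShouldRead(TypeKind.VARIANT, null, null));
    // An all-null primitive column can be skipped.
    System.out.println(notNullShouldRead(TypeKind.PRIMITIVE, 10L, 10L));
  }
}
```

Under these assumptions, removing the early return would let a variant column fall through to the metrics checks, which is the kind of incorrect pruning the description refers to.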

Testing

testVariantFilterNotNull(): test with mixed variant/null data
testAllNullsVariantNotNull(): Edge case with all-null variant columns
TestInclusiveMetricsEvaluatorWithExtract.java already provides extensive coverage of variant extract expressions

amogh-jahagirdar

Thanks @geruh for taking this up! The fix looks right to me, but I think we need to clean up the tests in ParquetMetricsRowGroupFilter since I don't think there's a reason we need to actually write files (take a look at how the other tests in the class work).

Can we also add some end to end tests via Spark 4.0 which are more like what was reported in the original issue?

amogh-jahagirdar · 2025-09-15T13:38:55Z

+    OutputFile outFile = Files.localOutput(parquetFile);
+    try (FileAppender<GenericRecord> appender =
+        Parquet.write(outFile)
+            .schema(variantSchema)
+            .createWriterFunc(GenericParquetWriter::create)
+            .build()) {
+
+      for (int i = 0; i < 10; i++) {
+        GenericRecord record = GenericRecord.create(variantSchema);
+        record.setField("id", i);
+
+        if (i % 2 == 0) {
+          VariantMetadata metadata = Variants.metadata("field");
+          ShreddedObject obj = Variants.object(metadata);
+          obj.put("field", Variants.of("value" + i));
+          Variant variant = Variant.of(metadata, obj);
+          record.setField("variant_field", variant);
+        }
+
+        appender.add(record);
+      }
+    }


Why do we need to write records in these tests? At this level of abstraction, I think we should just create the row group filter with notNull("variant_field") and assert that shouldRead is true.

+1. Seems we don't need to write to the files.

Thanks for the feedback Amogh!

Good point about avoiding file writing at this level. I initially wrote the tests this way because this test class shares a schema for writing out data files in both ORC and Parquet. We're now in a situation where ORC doesn't have full support for variant types while Parquet does, so adding variant fields to the shared schema would break the existing ORC tests.

That said, it probably makes sense to use separate schemas for Parquet and ORC given the different levels of support, and write the tests to reflect that.

Talked with @geruh offline and I also poked around refactoring this class, and while I think we should (this mixture of orc/parquet both trying to test "row group filtering" is leading to weird tests), it's a big change especially for something going into a patch release.

Also while technically the implementation of filtering with variant doesn't depend on the actual contents of the file, after some more thought I concluded that it's better to write a more realistic test which does contain the records like @geruh was doing before.

amogh-jahagirdar · 2025-09-15T13:40:48Z

+    OutputFile outFile = Files.localOutput(parquetFile);
+    try (FileAppender<GenericRecord> appender =
+        Parquet.write(outFile)
+            .schema(variantSchema)
+            .createWriterFunc(GenericParquetWriter::create)
+            .build()) {
+
+      for (int i = 0; i < 10; i++) {
+        GenericRecord record = GenericRecord.create(variantSchema);
+        record.setField("id", i);
+        record.setField("variant_field", null);
+        appender.add(record);
+      }
+    }
+


Same as above, don't think we need to actually write parquet files in these tests

addressed above

amogh-jahagirdar · 2025-09-15T13:47:48Z

+      // evaluated post scan. Variant types also need to be evaluated post scan to access
+      // shredded statistics.


I'd remove the "variant types also need to be evaluated post scan to access shredded statistics". I'd just update to reflect the current state of things which is in that first sentence, "When filtering nested types or variant...." and the second sentence to be "Leave these type filters...".

For shredded stats pruning, the core library already contains BoundExtract; what we need is the translation/plumbing from engines to that extract, and then I think we can do pruning based on the shredded stats. For now though, I'd just leave it out of the comments since I think it's more confusing and a bit inaccurate.

+1 on this.

amogh-jahagirdar · 2025-09-15T13:54:48Z

I also think this is something that should go into a 1.10.1 patch release since it's unfortunately a correctness issue

amogh-jahagirdar · 2025-09-15T13:55:43Z

cc @aihuaxu

amogh-jahagirdar

Also can we double check some of the other cases like eq, in..? The spark tests would surface that as well

RussellSpitzer · 2025-09-15T14:38:21Z

+      // evaluated post scan. Variant types also need to be evaluated post scan to access
+      // shredded statistics.
+      Type type = schema.findType(id);
+      if (type instanceof Type.NestedType || type.isVariantType()) {


RussellSpitzer

The fix looks good to me, I agree with @amogh-jahagirdar that the tests should get changed to match up with the others in the file.

We will definitely have to revisit this once we push through variant shredded predicates.

aihuaxu

The change looks good to me. Thanks for fixing.

aihuaxu · 2025-09-15T19:22:07Z

+      // evaluated post scan. Variant types also need to be evaluated post scan to access
+      // shredded statistics.


+1 on this.

aihuaxu · 2025-09-15T20:03:19Z

+1. Seems we don't need to write to the files.

amogh-jahagirdar · 2025-09-18T21:34:08Z

Talked with @geruh offline and I also poked around refactoring this class, and while I think we should (this mixture of orc/parquet both trying to test "row group filtering" is leading to weird tests), it's a big change especially for something going into a patch release.

Also while technically the implementation of filtering with variant doesn't depend on the actual contents of the file, after some more thought I concluded that it's better to write a more realistic test which does contain the records like @geruh was doing before.

amogh-jahagirdar · 2025-09-18T21:39:23Z

      assertThat(planAsString)
          .as("Post scan filter should match")
-          .contains("Filter (" + sparkFilter + ")");
+          .containsAnyOf("Filter (" + sparkFilter + ")", "Filter " + sparkFilter);


Are the filters on variant missing the parens?

No, this is specific to the Spark filters. I added this so we can also match a simple Spark filter, e.g. when Spark doesn't add the implicit IS NOT NULL check. This is observed with IN predicates, where the resulting filter expression is simpler and has no surrounding parentheses.
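As an illustration of the two shapes the assertion accepts, here is a small dependency-free sketch. `planContainsFilter` is a hypothetical helper that mirrors the `containsAnyOf` check in plain Java, and the plan strings are made up for the example rather than copied from real Spark output.

```java
public class PlanFilterMatchSketch {
  // Returns true if the plan text contains the filter either wrapped in parentheses
  // ("Filter (<expr>)") or bare ("Filter <expr>").
  static boolean planContainsFilter(String plan, String sparkFilter) {
    return plan.contains("Filter (" + sparkFilter + ")")
        || plan.contains("Filter " + sparkFilter);
  }

  public static void main(String[] args) {
    // Hypothetical plan fragment where an implicit not-null check wraps the expression in parentheses.
    String wrapped = "+- Filter (isnotnull(id#1) AND (id#1 = 5))";
    // Hypothetical plan fragment for an IN predicate without the extra parentheses.
    String bare = "+- Filter id#1 IN (1,2)";

    System.out.println(planContainsFilter(wrapped, "isnotnull(id#1) AND (id#1 = 5)"));
    System.out.println(planContainsFilter(bare, "id#1 IN (1,2)"));
  }
}
```

Accepting either form keeps the assertion independent of whether Spark adds the extra not-null conjunct for a given predicate.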

amogh-jahagirdar

OK, from my side I think this looks good now with just a minor non-blocking nit. Thank you @geruh for the fix. I'll leave it up for another day or so for others before merging.

amogh-jahagirdar · 2025-09-19T16:30:57Z

+        assertThat((actualValue).toString())
+            .as("%s contents should match (VariantVal JSON)", context)
+            .isEqualTo((expectedValue).toString());


Discussed offline, I think it's reasonable to compare the JSON representation to determine if the logical variant is equivalent.

nastra

LGTM with a few minor comments

amogh-jahagirdar · 2025-09-20T18:08:43Z

Thanks @geruh! Thank you @aihuaxu @RussellSpitzer @nastra for reviewing.

Fix variant type filtering in ParquetMetricsRowGroupFilter

83ae077

github-actions Bot added parquet data labels Sep 15, 2025

amogh-jahagirdar requested changes Sep 15, 2025

amogh-jahagirdar requested review from Fokko, RussellSpitzer, nastra, rdblue and stevenzwu September 15, 2025 13:55

amogh-jahagirdar reviewed Sep 15, 2025

RussellSpitzer reviewed Sep 15, 2025

RussellSpitzer approved these changes Sep 15, 2025

aihuaxu approved these changes Sep 15, 2025

Address comments and add more tests

1d41313

github-actions Bot added the spark label Sep 16, 2025

geruh added 3 commits September 16, 2025 13:49

Refactor tests to be format specific

99dc168

Revert refactoring

ab202af

Revert the test refactoring and clean up original

c7c715e

geruh requested a review from amogh-jahagirdar September 18, 2025 06:36

amogh-jahagirdar reviewed Sep 18, 2025

geruh added 3 commits September 18, 2025 16:51

remove old tests

3f06a3d

clean up and add another test

3ba38bb

checkstyle..

a618461

amogh-jahagirdar approved these changes Sep 19, 2025

nastra reviewed Sep 19, 2025

Comment thread spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/SparkTestHelperBase.java Outdated

nastra reviewed Sep 19, 2025

Comment thread data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java Outdated

nastra reviewed Sep 19, 2025

Comment thread data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java Outdated

nastra approved these changes Sep 19, 2025

address comments

a48ec72

amogh-jahagirdar merged commit fb63af0 into apache:main Sep 20, 2025
42 checks passed

amogh-jahagirdar added this to the Iceberg 1.10.1 milestone Sep 22, 2025

huaxingao mentioned this pull request Oct 10, 2025

Parquet: Treat VARIANT like nested for eq/in in ParquetMetricsRowGroupFilter #14279

Merged

huaxingao pushed a commit to huaxingao/iceberg that referenced this pull request Nov 2, 2025

Parquet, Data, Spark: Fix variant type filtering in ParquetMetricsRow…

964f724

…GroupFilter (apache#14081) (cherry picked from commit fb63af0)

huaxingao added a commit that referenced this pull request Nov 2, 2025

Parquet, Data, Spark: Fix variant type filtering in ParquetMetricsRow…

4119a36

…GroupFilter (#14081) (#14467) (cherry picked from commit fb63af0) Co-authored-by: Drew Gallardo <imgeru@gmail.com>
