Parquet: Implement Variant metrics by rdblue · Pull Request #12496 · apache/iceberg

rdblue · 2025-03-10T22:39:38Z

This implements metrics for Variant types stored in Parquet files, using new visitors to produce the metrics.

This also refactors the existing metrics code to use a visitor. If I remember correctly, the metrics code predates the Parquet visitor. I think the visitor version is cleaner.

rdblue · 2025-03-10T22:41:16Z

  private final T lowerBound;
  private final T upperBound;

+  public FieldMetrics(int id, long valueCount, long nullValueCount) {


These are just convenience constructors.

rdblue · 2025-03-10T22:42:42Z


  private static <T> T visitVariant(
-      Types.VariantType variant, GroupType group, TypeWithSchemaVisitor<T> visitor) {
+      Types.VariantType variant, GroupType variantGroup, TypeWithSchemaVisitor<T> visitor) {


In order to call a visitor that has a different return type than the Parquet schema visitor (in this case, VariantMetrics instead of FieldMetrics), the Parquet type needs to be passed to the visit method. I think that not including this originally was an accident.

rdblue · 2025-03-10T22:43:13Z

    if (writer != null) {
-      return ParquetUtil.footerMetrics(writer.getFooter(), model.metrics(), metricsConfig);
+      return ParquetMetrics.metrics(
+          schema, parquetSchema, metricsConfig, writer.getFooter(), model.metrics());


Schema is currently required because there is no annotation for variants in the Parquet schema.

rdblue · 2025-03-10T22:46:56Z

+    Map<Integer, Long> columnSizes = Maps.newHashMap();
+    Multimap<ColumnPath, ColumnChunkMetaData> columns =
+        Multimaps.newMultimap(Maps.newHashMap(), Lists::newArrayList);
+    for (BlockMetaData block : metadata.getBlocks()) {


This organizes the metadata for each column so that it is available when processing that leaf column in the builder.
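As a rough illustration of the grouping described above, the sketch below indexes per-row-group chunk metadata by column path using plain Java collections. `Chunk` and `RowGroup` are hypothetical stand-ins for Parquet's `ColumnChunkMetaData` and `BlockMetaData`, and the path strings are made up for the example; this is not the code in the PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ColumnChunkIndex {
  /** Hypothetical stand-in for Parquet's ColumnChunkMetaData. */
  static final class Chunk {
    final String path;
    final long valueCount;
    final long totalSize;

    Chunk(String path, long valueCount, long totalSize) {
      this.path = path;
      this.valueCount = valueCount;
      this.totalSize = totalSize;
    }
  }

  /** Hypothetical stand-in for BlockMetaData (one row group). */
  static final class RowGroup {
    final List<Chunk> chunks;

    RowGroup(List<Chunk> chunks) {
      this.chunks = chunks;
    }
  }

  /** Collects every chunk of a column, across all row groups, under its path. */
  static Map<String, List<Chunk>> byPath(List<RowGroup> rowGroups) {
    Map<String, List<Chunk>> columns = new LinkedHashMap<>();
    for (RowGroup group : rowGroups) {
      for (Chunk chunk : group.chunks) {
        columns.computeIfAbsent(chunk.path, k -> new ArrayList<>()).add(chunk);
      }
    }
    return columns;
  }

  /** Sums value counts for one leaf column; 0 if the path is unknown. */
  static long totalValues(Map<String, List<Chunk>> columns, String path) {
    long total = 0;
    for (Chunk chunk : columns.getOrDefault(path, List.of())) {
      total += chunk.valueCount;
    }
    return total;
  }

  public static void main(String[] args) {
    RowGroup first = new RowGroup(List.of(new Chunk("id", 10, 100)));
    RowGroup second = new RowGroup(List.of(new Chunk("id", 5, 60)));
    Map<String, List<Chunk>> columns = byPath(List.of(first, second));
    System.out.println(totalValues(columns, "id"));
  }
}
```

Because the key is a path rather than a field ID, the same index can serve leaves that have no ID of their own.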

rdblue · 2025-03-10T22:47:42Z

+    Map<Integer, FieldMetrics<?>> metricsById =
+        fields.collect(Collectors.toMap(FieldMetrics::id, Function.identity()));
+
+    Iterable<FieldMetrics<ByteBuffer>> results =


Metrics are returned from the builder as an iterable of FieldMetrics, one for each field. The lower and upper bounds are already serialized since the field type is known in the visitor.

rdblue · 2025-03-10T22:48:19Z

+    Map<Integer, ByteBuffer> lowerBounds = Maps.newHashMap();
+    Map<Integer, ByteBuffer> upperBounds = Maps.newHashMap();
+
+    for (FieldMetrics<ByteBuffer> metrics : results) {


This translates the metrics for each field into the top-level maps. (When we modify how we track metrics/stats we would probably change this.)
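A minimal sketch of that translation step, assuming a simplified per-field metrics holder. `SimpleFieldMetrics` and `MetricsMaps` are hypothetical names introduced for illustration (the real `FieldMetrics` class carries more state, such as NaN counts), and the shorter constructor mirrors the count-only convenience constructor shown earlier in this review.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MetricsMaps {
  /** Simplified, hypothetical stand-in for FieldMetrics<ByteBuffer>. */
  static final class SimpleFieldMetrics {
    final int id;
    final long valueCount;
    final long nullValueCount;
    final ByteBuffer lowerBound; // null when no bound is tracked
    final ByteBuffer upperBound; // null when no bound is tracked

    /** Convenience constructor for fields that only carry counts. */
    SimpleFieldMetrics(int id, long valueCount, long nullValueCount) {
      this(id, valueCount, nullValueCount, null, null);
    }

    SimpleFieldMetrics(
        int id, long valueCount, long nullValueCount, ByteBuffer lowerBound, ByteBuffer upperBound) {
      this.id = id;
      this.valueCount = valueCount;
      this.nullValueCount = nullValueCount;
      this.lowerBound = lowerBound;
      this.upperBound = upperBound;
    }
  }

  final Map<Integer, Long> valueCounts = new HashMap<>();
  final Map<Integer, Long> nullValueCounts = new HashMap<>();
  final Map<Integer, ByteBuffer> lowerBounds = new HashMap<>();
  final Map<Integer, ByteBuffer> upperBounds = new HashMap<>();

  /** Flattens per-field metrics into maps keyed by field ID. */
  static MetricsMaps from(Iterable<SimpleFieldMetrics> results) {
    MetricsMaps maps = new MetricsMaps();
    for (SimpleFieldMetrics metrics : results) {
      maps.valueCounts.put(metrics.id, metrics.valueCount);
      maps.nullValueCounts.put(metrics.id, metrics.nullValueCount);
      // Bounds are optional, so only record the ones that exist.
      if (metrics.lowerBound != null) {
        maps.lowerBounds.put(metrics.id, metrics.lowerBound);
      }
      if (metrics.upperBound != null) {
        maps.upperBounds.put(metrics.id, metrics.upperBound);
      }
    }
    return maps;
  }

  public static void main(String[] args) {
    MetricsMaps maps = from(List.of(new SimpleFieldMetrics(1, 10, 2)));
    System.out.println(maps.valueCounts);
  }
}
```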

rdblue · 2025-03-10T22:49:33Z

+
+      int length = truncateLength(mode);
+
+      FieldMetrics<ByteBuffer> metrics = metricsFromFieldMetrics(fieldId, iPrimitive, length);


The main value of doing this in the visitor is to avoid processing the column metadata if there is already a FieldMetrics object available.
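The short-circuit described above can be sketched as follows, under the assumption that writer-collected metrics are available in a map keyed by field ID. `BoundsSource`, `preferWriterMetrics`, and the `Supplier`-based footer fallback are hypothetical names for illustration, not the PR's API.

```java
import java.util.Map;
import java.util.function.Supplier;

class BoundsSource {
  /**
   * Returns the writer-collected metrics for a field when present. The footer
   * supplier is only invoked when the writer produced nothing, so column chunk
   * metadata is not processed unnecessarily.
   */
  static <T> T preferWriterMetrics(
      Map<Integer, T> writerMetrics, int fieldId, Supplier<T> fromFooter) {
    T fromWriter = writerMetrics.get(fieldId);
    return fromWriter != null ? fromWriter : fromFooter.get();
  }

  public static void main(String[] args) {
    Map<Integer, String> writerMetrics = Map.of(1, "writer");
    System.out.println(preferWriterMetrics(writerMetrics, 1, () -> "footer"));
    System.out.println(preferWriterMetrics(writerMetrics, 2, () -> "footer"));
  }
}
```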

rdblue · 2025-03-10T22:51:20Z

+      T lowerBound = null;
+      T upperBound = null;
+
+      for (ColumnChunkMetaData column : columns.get(path)) {


This uses the column path so the same columns map can be used for fields with IDs (normal Iceberg fields) and for fields without IDs (shredded Variant fields).

rdblue · 2025-03-10T22:56:44Z

+      }
+
+      return ImmutableList.of(
+          new FieldMetrics<>(


Currently, all fields are kept and truncated to 16. This seems reasonable, but we could also apply the truncation length to all sub-fields, or use it for the number of fields to keep bounds for?

I feel like truncate(16) for all the shredded fields is the most intuitive of the options. That said, I was wondering: would it make more sense to pass the MetricsMode to MetricsVariantVisitor so that, if the mode is a truncate, we use the user-configured length instead of ignoring it? I don't think this is a hard blocker, though, since the default is already truncate(16) and it's probably a sane truncation for the shredded fields as well.

Right now, I just wanted to get the basics working so I went with the simplest implementation. We should definitely revisit this and discuss how to configure metrics collection.

That said, I think the mode isn't the right problem to solve. For the mode, we don't keep counts other than the top-level value and null count for the variant itself. That leaves only how to handle lower and upper bounds, where we know that truncate is the right config and 16 is a reasonable length default. At that point, the only question is whether we want to hard-code it, pass through the mode for the variant column to use a configurable length for all sub-fields, or if we want to have a truncate length for each field individually.

The bigger problem is which fields to collect metrics for. Restricting the fields to just the ones that are shredded is a good heuristic because we don't expect types to be uniform for other fields, and a value of another type will prevent the field's bounds from being stored. Even then, there could be quite a few fields and that will make the lower and upper bound payloads large. We may want to further restrict the number of fields, but for now I think the reasonable path forward is to use the shredded fields. Then we can see if we want to change it once we tackle the problem of how we determine the fields to shred.

aihuaxu · 2025-03-17T15:55:37Z

@@ -1172,6 +1172,11 @@ acceptedBreaks:
        \ java.util.function.Consumer<T>)"
      justification: "Removing deprecated code"


To help me understand the scope of this PR: we will collect the metrics for Variant subcolumns, but storing them as a Variant in Avro files and retrieving them from Avro files will be handled separately, since that depends on the Avro Variant read/write implementation. Pruning with such metrics will also be separate.

Is that correct?

Yes, this PR translates Parquet metrics for shredded fields to a Variant object that is accepted by InclusiveMetricsEvaluator. For Avro, we would need to come up with a strategy for producing metrics. I'm not sure that it makes sense to because there are no shredded fields.

Got it. I was referring to writing the metrics (as a Variant object) into the Iceberg manifest file (an Avro file), not to collecting metrics for Avro data files. We don't shred fields for Avro files, so it doesn't make sense to produce metrics for them.

rdblue · 2025-03-17T23:28:39Z

  private static final Map<Types.NestedField, Integer> FIELDS_WITH_NAN_COUNT_TO_ID =
-      ImmutableMap.of(
-          FLOAT_FIELD, 3, DOUBLE_FIELD, 4, FLOAT_LIST, 10, MAP_FIELD_1, 18, MAP_FIELD_2, 22);
+      ImmutableMap.of(FLOAT_FIELD, 3, DOUBLE_FIELD, 4);


Like ORC, Parquet will no longer produce metrics for repeated fields, like those in map keys or values or in list elements.
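To make the effect on the expected field IDs concrete, here is a small sketch that walks a toy schema tree and collects only the primitive fields that are not nested inside a list or map. `Node` and `Kind` are hypothetical types invented for this example, and the traversal is an illustration of the rule stated above rather than the visitor used in the PR.

```java
import java.util.ArrayList;
import java.util.List;

class RepeatedFieldFilter {
  enum Kind { STRUCT, LIST, MAP, PRIMITIVE }

  /** Hypothetical, minimal schema node: a field ID, a kind, and child nodes. */
  static final class Node {
    final int id;
    final Kind kind;
    final List<Node> children;

    Node(int id, Kind kind, Node... children) {
      this.id = id;
      this.kind = kind;
      this.children = List.of(children);
    }
  }

  /** Returns IDs of primitive fields that are not inside a repeated type. */
  static List<Integer> metricFieldIds(Node root) {
    List<Integer> ids = new ArrayList<>();
    collect(root, ids);
    return ids;
  }

  private static void collect(Node node, List<Integer> ids) {
    switch (node.kind) {
      case PRIMITIVE:
        ids.add(node.id);
        break;
      case STRUCT:
        for (Node child : node.children) {
          collect(child, ids);
        }
        break;
      case LIST:
      case MAP:
        // Repeated: elements, keys, and values get no metrics, so stop here.
        break;
    }
  }

  public static void main(String[] args) {
    Node root =
        new Node(0, Kind.STRUCT,
            new Node(3, Kind.PRIMITIVE),
            new Node(9, Kind.LIST, new Node(10, Kind.PRIMITIVE)));
    System.out.println(metricFieldIds(root));
  }
}
```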

rdblue · 2025-03-17T23:30:51Z

    }
  }
-
-  private Type toParquetSchema(VariantValue value) {


Moved into ParquetVariantUtil.

amogh-jahagirdar

Still going through the tests, but I took a look through the code changes and had some questions/comments.

amogh-jahagirdar

I've checked this out locally and stepped through the tests. This is looking great. Thanks @rdblue!

amogh-jahagirdar · 2025-03-21T15:53:00Z

+          String truncatedString =
+              UnicodeUtil.truncateStringMax((String) value.asPrimitive().get(), 16);
+          return truncatedString != null ? Variants.of(PhysicalType.STRING, truncatedString) : null;
+        case BINARY:
+          ByteBuffer truncatedBuffer =
+              BinaryUtil.truncateBinaryMin((ByteBuffer) value.asPrimitive().get(), 16);


Nit: Do we want to move 16 to an internal constant?
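For readers unfamiliar with bound truncation, the sketch below shows the general idea on raw bytes with the length pulled into a named constant, as the nit suggests. A truncated lower bound can simply be a prefix, while a truncated upper bound has to be rounded up so it still compares greater than or equal to the original value. `BoundTruncation`, `DEFAULT_TRUNCATE_LENGTH`, `truncateMin`, and `truncateMax` are hypothetical names written for this illustration; they are not the `BinaryUtil` or `UnicodeUtil` implementations.

```java
import java.util.Arrays;

class BoundTruncation {
  /** Hypothetical constant replacing the literal 16 in the snippet above. */
  static final int DEFAULT_TRUNCATE_LENGTH = 16;

  /** Lower bound: a prefix never sorts after the original value. */
  static byte[] truncateMin(byte[] value, int length) {
    return value.length <= length ? value.clone() : Arrays.copyOf(value, length);
  }

  /**
   * Upper bound: take the prefix, then increment the last byte that is not
   * 0xFF and drop everything after it. Returns null when every prefix byte is
   * 0xFF, because no shorter value can act as an upper bound.
   */
  static byte[] truncateMax(byte[] value, int length) {
    if (value.length <= length) {
      return value.clone();
    }
    byte[] prefix = Arrays.copyOf(value, length);
    for (int i = length - 1; i >= 0; i--) {
      if (prefix[i] != (byte) 0xFF) {
        prefix[i]++;
        return Arrays.copyOf(prefix, i + 1);
      }
    }
    return null;
  }

  public static void main(String[] args) {
    byte[] value = new byte[32];
    Arrays.fill(value, (byte) 7);
    byte[] lower = truncateMin(value, DEFAULT_TRUNCATE_LENGTH);
    byte[] upper = truncateMax(value, DEFAULT_TRUNCATE_LENGTH);
    // Unsigned comparison matches the usual ordering for binary values.
    System.out.println(Arrays.compareUnsigned(lower, value) <= 0);
    System.out.println(Arrays.compareUnsigned(upper, value) >= 0);
  }
}
```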

amogh-jahagirdar · 2025-03-21T15:56:32Z

+      }
+
+      return ImmutableList.of(
+          new FieldMetrics<>(


I feel like truncate(16) for all the shredded fields is the most intuitive of the options. That said, I was wondering: would it make more sense to pass the MetricsMode to MetricsVariantVisitor so that, if the mode is a truncate, we use the user-configured length instead of ignoring it? I don't think this is a hard blocker, though, since the default is already truncate(16) and it's probably a sane truncation for the shredded fields as well.

rdblue · 2025-03-21T16:33:01Z

Thanks for the reviews, @aihuaxu and @amogh-jahagirdar! I'll leave this open a bit longer because I think @danielcweeks also wanted to take a look. I'll follow up with him.

rdblue · 2025-03-21T20:03:52Z

Merging this. Thanks for the reviews, @aihuaxu, @amogh-jahagirdar, and @danielcweeks!

github-actions Bot added API parquet core labels Mar 10, 2025

rdblue force-pushed the parquet-variant-metrics branch from f9dc85f to 77021c7 Compare March 10, 2025 22:44

rdblue force-pushed the parquet-variant-metrics branch from 77021c7 to 77d811c Compare March 10, 2025 22:58

github-actions Bot added data ORC labels Mar 13, 2025

aihuaxu reviewed Mar 17, 2025

rdblue force-pushed the parquet-variant-metrics branch from a8043fa to a041643 Compare March 17, 2025 23:21

rdblue force-pushed the parquet-variant-metrics branch 2 times, most recently from 0618811 to b337972 Compare March 17, 2025 23:40

rdblue added 8 commits March 18, 2025 16:52

API: Add truncate methods for objects not wrapped in Literal.

66ec3fc

Parquet: Pass Parquet group to variant in schema visitor.

fa5f7b6

Parquet: Refactor metrics and add Variant metrics.

6813d04

Fix checkstyle.

8c6f354

Fix TestInputFormatReaderDeletes.

a331535

Fix TestMergingMetrics.

a71b004

Fix OrcMetrics complexity and convert timestamp(9).

d965389

Fix checkstyle in Parquet.

43c5d33

rdblue added 6 commits March 18, 2025 16:52

Fix revapi.

d661d34

Parquet: Fix TestParquetDataWriter.

490e3c0

Parquet: Add tests for variant metrics.

4a5ba00

Add tests.

c1c1706

Fix typo.

a04035b

Fix checkstyle.

396c4b7

rdblue force-pushed the parquet-variant-metrics branch from b004fb2 to 396c4b7 Compare March 18, 2025 23:52

amogh-jahagirdar reviewed Mar 19, 2025

Comment thread core/src/main/java/org/apache/iceberg/variants/Variants.java Outdated

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/ParquetVariantUtil.java

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/ParquetMetrics.java

Avoid an extra loop.

e8cd9dd

amogh-jahagirdar approved these changes Mar 21, 2025

danielcweeks reviewed Mar 21, 2025

Comment thread api/src/main/java/org/apache/iceberg/variants/VariantObject.java

rdblue merged commit ded0670 into apache:main Mar 21, 2025

rdblue deleted the parquet-variant-metrics branch March 21, 2025 20:32

lliangyu-lin pushed a commit to lliangyu-lin/iceberg that referenced this pull request Mar 23, 2025

Parquet: Implement Variant metrics (apache#12496)

7ba5cf6

greenlaw mentioned this pull request Aug 21, 2025

Does the add_files procedure add column lower and upper bounds statistics to manifest files? #13218

Closed

vishnuprakaz mentioned this pull request Jun 19, 2026

Parquet: Fix variant BINARY upper bound to truncate up #16880

Open

