Core: Add readable metrics columns to files metadata tables by szehon-ho · Pull Request #5376 · apache/iceberg

szehon-ho · 2022-07-28T21:56:02Z

This adds the following column to all files tables:

readable_metrics, which is a struct of:
column_sizes
value_counts
null_value_counts
nan_value_counts
lower_bounds
upper_bounds

Each of these is a map from column_name to value.
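To illustrate the "map of column_name to value" shape described above, here is a minimal sketch in plain Java. It assumes the raw metrics are keyed by field ID and that a field-ID-to-name lookup is available; the class and method names are hypothetical and are not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Iceberg code: re-keys one metric map by column name.
final class ReadableMetricsByNameSketch {

  // Field IDs with no known column name are skipped rather than reported.
  static <V> Map<String, V> byColumnName(
      Map<Integer, V> metricById, Map<Integer, String> idToName) {
    Map<String, V> result = new LinkedHashMap<>();
    for (Map.Entry<Integer, V> entry : metricById.entrySet()) {
      String name = idToName.get(entry.getKey());
      if (name != null) {
        result.put(name, entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<Integer, String> idToName = Map.of(1, "id", 2, "data");
    Map<Integer, Long> valueCounts = new LinkedHashMap<>();
    valueCounts.put(1, 100L);
    valueCounts.put(2, 100L);
    valueCounts.put(9, 7L); // unknown field ID, dropped from the output
    System.out.println(byColumnName(valueCounts, idToName)); // prints {id=100, data=100}
  }
}
```

The same re-keying would apply to each of the six metric maps listed above.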

szehon-ho · 2022-08-01T22:17:35Z

+            Transforms.identity(field.type())
+                .toHumanString(Conversions.fromByteBuffer(field.type(), value)));
+      } catch (Exception e) { // Ignore
+        return Optional.empty();


This happens in some cases. I found it when importing external files into an Iceberg table, i.e. in TestIcebergSourceHadoopTables.testFilesTableWithSnapshotIdInheritance, where I think the columns are out of order relative to the original schema and the metrics are corrupt (an underflow exception in this case).

I am not sure we should fail the files tables in that case; I was leaning towards just returning null. The user still has the original column to see why the error happened.
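The failure mode described above can be sketched without Iceberg on the classpath. The example below is an illustration only: it replaces Conversions.fromByteBuffer and Transforms.identity(...).toHumanString from the diff with hand-rolled decoding, and it assumes bounds are stored little-endian (ints as 4 bytes, longs as 8 bytes, strings as UTF-8). The class, enum, and method names are hypothetical.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical stand-in for the bound-to-string conversion; not the PR's code.
final class ReadableBoundSketch {
  enum Kind { INT, LONG, STRING }

  // Returns a readable bound, or Optional.empty() when the bytes do not fit the type.
  static Optional<String> readable(Kind kind, ByteBuffer value) {
    if (value == null) {
      return Optional.empty();
    }
    try {
      // duplicate() leaves the caller's buffer position untouched
      ByteBuffer buf = value.duplicate().order(ByteOrder.LITTLE_ENDIAN);
      switch (kind) {
        case INT:
          return Optional.of(String.valueOf(buf.getInt()));
        case LONG:
          return Optional.of(String.valueOf(buf.getLong()));
        case STRING:
          return Optional.of(StandardCharsets.UTF_8.decode(buf).toString());
        default:
          return Optional.empty();
      }
    } catch (BufferUnderflowException e) {
      // Metrics written for a different column type, e.g. 4 bytes read as a long
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    ByteBuffer fourBytes =
        ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(0, 42);
    System.out.println(readable(Kind.INT, fourBytes));  // prints Optional[42]
    System.out.println(readable(Kind.LONG, fourBytes)); // prints Optional.empty
  }
}
```

Reading a 4-byte bound as a long is one way mismatched column order could surface as an underflow; whether to swallow the exception or fail the scan is the open question in this thread.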

szehon-ho · 2022-08-02T18:44:48Z

All Spark tests are updated/fixed now.

FYI @RussellSpitzer @aokolnychyi @rdblue, if you have time to leave some feedback. New tests were added, but the file is a bit too big to show in 'Files Changed': TestMetadataTableMetricsColumns.java

aokolnychyi · 2022-08-10T04:15:26Z

+        return Optional.of(
+            Transforms.identity(field.type())
+                .toHumanString(Conversions.fromByteBuffer(field.type(), value)));
+      } catch (Exception e) { // Ignore


Do you have examples when this throws an exception?

Yea, I tried to put a comment, but unfortunately it got disassociated from this line after a rebase.

This happens in some cases. I found it when importing external files into an Iceberg table, i.e. in TestIcebergSourceHadoopTables.testFilesTableWithSnapshotIdInheritance, where I think the columns are out of order relative to the original schema and the metrics are corrupt (an underflow exception in this case).

I am not sure we should fail the files tables in that case; I was leaning towards just returning null. The user still has the original column to see why the error happened.

@aokolnychyi filed an issue: #5543

Following up on this: it is a non-issue, as the Spark procedures set the schema.name-mapping.default property; only this test does not. Fixed the test.

aokolnychyi · 2022-08-10T04:33:25Z

It seems like a great idea to add readable metrics. It is hard to make sense of them otherwise.

@szehon-ho, what do you think about adding a single map column, let's say called readable_metrics, that will hold a mapping from a column name into a struct that would represent metrics? The type will be Map<String, StructType> and we will have individual struct fields for each type of metric.

We can then easily access them via SQL.

SELECT readable_metrics['col1'].lower_bound FROM db.t.files

I am okay with individual columns too but it seems a bit cleaner to just have one.
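The difference between the two layouts can be sketched in plain Java: per-metric maps keyed by column name are pivoted into a single map from column name to a small struct of metrics. This is only an illustration of the proposal; ColumnMetrics and pivot are hypothetical names, and only three of the metrics are modelled.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of "one map column holding a struct per data column".
final class MetricsPerColumnSketch {

  // One struct per data column; a field is null when the metric is absent.
  static final class ColumnMetrics {
    final Long valueCount;
    final String lowerBound;
    final String upperBound;

    ColumnMetrics(Long valueCount, String lowerBound, String upperBound) {
      this.valueCount = valueCount;
      this.lowerBound = lowerBound;
      this.upperBound = upperBound;
    }
  }

  // Pivots per-metric maps into a single map keyed by column name.
  static Map<String, ColumnMetrics> pivot(
      Map<String, Long> valueCounts,
      Map<String, String> lowerBounds,
      Map<String, String> upperBounds) {
    TreeSet<String> columns = new TreeSet<>(valueCounts.keySet());
    columns.addAll(lowerBounds.keySet());
    columns.addAll(upperBounds.keySet());

    Map<String, ColumnMetrics> result = new TreeMap<>();
    for (String column : columns) {
      result.put(
          column,
          new ColumnMetrics(
              valueCounts.get(column), lowerBounds.get(column), upperBounds.get(column)));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, ColumnMetrics> readableMetrics =
        pivot(Map.of("col1", 3L), Map.of("col1", "a"), Map.of("col1", "z"));
    // Java analogue of: SELECT readable_metrics['col1'].lower_bound
    System.out.println(readableMetrics.get("col1").lowerBound); // prints a
  }
}
```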

aokolnychyi · 2022-08-12T15:50:10Z

Let me check in a bit.

aokolnychyi · 2022-08-22T23:12:34Z

Let me take a look today.

aokolnychyi

Looks really close to me.

aokolnychyi · 2022-08-26T18:20:43Z

Let me take a look now.

aokolnychyi · 2022-08-26T20:55:16Z

    return new Schema(joinedColumns);
  }

+  public static Schema joinCommon(Schema left, Schema right) {


What about simply adapting the existing join method? Are there any scenarios where we want to skip the validation and simply add all columns (old logic)?

aokolnychyi · 2022-08-26T21:54:16Z

+        return CloseableIterable.transform(files(projection), file -> (StructLike) file);
+      } else {
+        Schema fileProjection = TypeUtil.selectNot(projection, READABLE_METRICS_FIELD_IDS);
+        Schema minProjection =


I think this logic should be part of BaseFilesTableScan and BaseAllFilesTableScan.
Otherwise, our scans won't report the correct schema in Scan$schema().

I think putting it there will break the scan, right? It is not the projection the user requested.

Note that this is a bit subtle. Because we are doing the join (original projection + minimum metrics), the file's schema becomes:
{any_projected_field_on_file} : {readable_metrics, because it is also projected} : {un-projected but required metrics fields}

So ContentFileWithMetrics works because it discards the "un-projected but required metrics fields", given that they are outside the range it will read. For the remaining fields it uses the existing logic (delegate to the file for the first n-1 fields, then get the nth field from MetricsStruct).

We could add a select method to GenericDataFile that modifies its internal 'fromProjectionPos' map to conform back to the original projection (dropping the "un-projected but required metrics fields"). But that would mainly be for clarity and is not strictly needed.
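The position logic described above can be sketched as a small wrapper. This is not the PR's ContentFileWithMetrics; it is a hypothetical plain-Java stand-in that assumes readable_metrics is the last projected position and that the extra raw metrics sit after it in the underlying file row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a row wrapper; not Iceberg's ContentFileWithMetrics.
final class FileRowWithMetricsSketch {
  // Layout: projected file fields, a readable_metrics slot, then extra raw metrics
  private final List<Object> fileValues;
  // Number of fields the user actually projected (n)
  private final int expectedSize;
  // Value computed from the extra raw metrics
  private final Object readableMetrics;

  FileRowWithMetricsSketch(List<Object> fileValues, int expectedSize, Object readableMetrics) {
    this.fileValues = fileValues;
    this.expectedSize = expectedSize;
    this.readableMetrics = readableMetrics;
  }

  int size() {
    return expectedSize;
  }

  Object get(int pos) {
    if (pos < 0 || pos >= expectedSize) {
      // The un-projected metrics fields fall outside this range and are never exposed
      throw new IndexOutOfBoundsException("Position " + pos + " is outside the projection");
    }
    // The nth position holds readable_metrics; the first n-1 delegate to the file
    return pos == expectedSize - 1 ? readableMetrics : fileValues.get(pos);
  }

  public static void main(String[] args) {
    // Example values are made up for illustration
    List<Object> file = Arrays.asList("a.parquet", null, "raw-lower-bounds");
    FileRowWithMetricsSketch row =
        new FileRowWithMetricsSketch(file, 2, "{id={lower_bound=1}}");
    System.out.println(row.get(0) + " " + row.get(1));
  }
}
```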

szehon-ho · 2022-08-31T23:08:16Z

Added an additional test. It looks like it works even when the readable_metrics column is selected before other columns (Spark somehow reads the rows in their original order).

chenjunjiedada · 2022-11-02T11:58:03Z


  public static Schema join(Schema left, Schema right) {
-    List<Types.NestedField> joinedColumns = Lists.newArrayList();
-    joinedColumns.addAll(left.columns());


nit: This changes the original behavior, why not add a new function?

@chenjunjiedada Yea, that was my original version; I changed it after the comment from @aokolnychyi: #5376 (comment)

Technically this is changing a public API which previously would have allowed these combos. This is ... maybe ok since it's a utility method but we may end up breaking users of the function at runtime. That said I think Anton is right and any schema with multiple columns with the same ID would always be wrong.
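As a rough illustration of a join that validates field IDs, the sketch below keeps a field that appears identically on both sides once and rejects two different fields sharing an ID. This is one possible reading of the behavior discussed in this thread, not the actual TypeUtil.join implementation; Field and the method body are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an ID-validating join; not Iceberg's TypeUtil.join.
final class SchemaJoinSketch {

  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  static List<Field> join(List<Field> left, List<Field> right) {
    Map<Integer, Field> byId = new LinkedHashMap<>();
    for (Field field : left) {
      byId.put(field.id, field);
    }
    for (Field field : right) {
      Field existing = byId.get(field.id);
      if (existing == null) {
        byId.put(field.id, field);
      } else if (!existing.name.equals(field.name)) {
        // Same ID, different definition: such a joined schema would always be wrong
        throw new IllegalArgumentException(
            "Different columns with the same ID " + field.id + ": "
                + existing.name + ", " + field.name);
      }
      // An identical field on both sides is kept once
    }
    return new ArrayList<>(byId.values());
  }

  public static void main(String[] args) {
    List<Field> left = List.of(new Field(1, "file_path"), new Field(2, "lower_bounds"));
    List<Field> right = List.of(new Field(2, "lower_bounds"), new Field(3, "upper_bounds"));
    System.out.println(join(left, right).size()); // prints 3
  }
}
```

The old "add all columns" logic would have returned four fields here, including the duplicate; callers relying on that would see different results or a runtime error after the change.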

chenjunjiedada · 2022-11-02T12:20:46Z

Really nice PR, thanks @szehon-ho and @aokolnychyi for the effort! When can we merge this? I think it is ready, and it has been two months since the last review; leaving it open longer will lead to more conflicts.

szehon-ho · 2022-12-02T13:19:58Z

@RussellSpitzer addressed the comments, thanks!

szehon-ho · 2022-12-02T14:25:32Z

Actually, hold on a second. I am looking at a small refactor to make it more generic, so that a readable_metrics definition can be added more easily in the future.


szehon-ho · 2022-12-02T19:18:48Z

@RussellSpitzer should be good now for another look when you get a chance, thanks!

szehon-ho · 2022-12-02T19:31:33Z

    Dataset<Row> inputDF = spark.createDataFrame(records, SimpleRecord.class);
    inputDF.select("data", "id").write().mode("overwrite").insertInto("parquet_table");

+    NameMapping mapping = MappingUtil.create(table.schema());


Without these lines, the test used to write the wrong metrics to the imported table.

RussellSpitzer · 2022-12-02T19:44:50Z

+  public static Schema readableMetricsSchema(Schema dataTableSchema, Schema metadataTableSchema) {
+    List<Types.NestedField> fields = Lists.newArrayList();
+    Map<Integer, String> idToName = dataTableSchema.idToName();
+    AtomicInteger nextId =


I don't believe this has to be atomic, and metadataTableSchema should already have a method highestFieldId() which according to the doc includes nested fields

If you prefer incrementAndGet() to ++nextId though I think using the Atomic just for readability is probably fine

Ah, the AtomicInteger is there because there is a lambda function (the map) and the compiler complains:

Variable used in lambda expression should be final or effectively final

Re: highestFieldId(), good to know! Done.
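The compiler restriction mentioned above is easy to reproduce. In the sketch below, a plain local `int nextId` could not be incremented inside the lambda, so an AtomicInteger acts as a mutable holder seeded with the highest existing field ID. The method and its string output are hypothetical; only the AtomicInteger-in-a-lambda pattern mirrors the thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Hypothetical sketch of assigning fresh IDs inside a lambda.
final class LambdaIdAssignmentSketch {

  static List<String> assignIds(List<String> columnNames, int highestFieldId) {
    // "int nextId = highestFieldId;" followed by "++nextId" inside the lambda fails with:
    // Variable used in lambda expression should be final or effectively final
    AtomicInteger nextId = new AtomicInteger(highestFieldId);
    return columnNames.stream() // sequential stream, so IDs follow list order
        .map(name -> nextId.incrementAndGet() + ": " + name)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(assignIds(Arrays.asList("id", "data"), 100)); // prints [101: id, 102: data]
  }
}
```

Atomicity is not needed here; the AtomicInteger is used only because its reference is effectively final while its value can change.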

RussellSpitzer · 2022-12-02T20:05:29Z

+
+    Table filesTable = new FilesTable(table.ops(), table);
+    Types.StructType actual = filesTable.newScan().schema().select("readable_metrics").asStruct();
+


Just to make this a little easier in the future you may just want to do something like

firstAssigned = (schema.highestId - 15)
Then do
1001 = firstAssigned +1; ....

not sure this really helps that much though

RussellSpitzer

One remaining nit on the "highestId" call. I think overall we probably should do a refactoring of our tests for the files table in Spark, they have been really brittle to changes for a long time and I think we can do better. I think that can wait though or maybe be a task for a newcomer who wants to understand metadata tables better.

RussellSpitzer

Actually wasn't there a set of tests checking that the projection was working correctly? I'm not sure I see those tests anymore but maybe I've looked at this for too long?

szehon-ho · 2022-12-02T22:33:15Z

Yep, the tests should be here: https://github.com/apache/iceberg/blob/6681dba9bc7dc0d793aa8de739d2b9962260b0ff/spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/source/TestMetadataTableReadableMetrics.java

szehon-ho · 2022-12-02T22:34:18Z

I would love to find a good way to simplify it without breaking the checks. It currently compares every single field.

szehon-ho · 2022-12-05T18:09:07Z

Thanks @RussellSpitzer @aokolnychyi @chenjunjiedada for detailed reviews


atifiu · 2023-09-17T11:12:58Z

@szehon-ho @RussellSpitzer Is there any documentation about these readable metrics? Are all these metrics exposed through the files metadata table only?

atifiu · 2023-09-17T11:38:34Z

Closes #4362

This adds the following column to all files tables:

readable_metrics, which is a struct of:
column_sizes
value_counts
null_value_counts
nan_value_counts
lower_bounds
upper_bounds

Each of these is a map from column_name to value.

@szehon-ho The actual column names do not have the 's' at the end.

github-actions Bot added core spark labels Jul 28, 2022

szehon-ho mentioned this pull request Jul 28, 2022

Interpreting the upper/lower bounds column returned from querying the .files metadata #5370

Closed

szehon-ho force-pushed the readable_metrics branch from c06a868 to a6e3cbe Compare August 1, 2022 22:08

szehon-ho commented Aug 1, 2022


Comment thread core/src/main/java/org/apache/iceberg/MetricsUtil.java Outdated

szehon-ho force-pushed the readable_metrics branch 2 times, most recently from 223a3ad to 936e2ea Compare August 2, 2022 17:33

aokolnychyi reviewed Aug 10, 2022


szehon-ho force-pushed the readable_metrics branch from d1324e3 to 62eb36c Compare August 11, 2022 05:08

aokolnychyi reviewed Aug 12, 2022


szehon-ho mentioned this pull request Aug 16, 2022

Imported parquet tables may have wrong metrics #5543

Closed

aokolnychyi reviewed Aug 18, 2022


Comment thread core/src/main/java/org/apache/iceberg/BaseFilesTable.java Outdated

github-actions Bot added the API label Aug 19, 2022

szehon-ho commented Aug 19, 2022


Comment thread core/src/main/java/org/apache/iceberg/BaseFilesTable.java

aokolnychyi reviewed Aug 23, 2022


aokolnychyi reviewed Aug 26, 2022


szehon-ho mentioned this pull request Aug 28, 2022

Error projecting nested structs from manifests table #5649

Closed

szehon-ho mentioned this pull request Nov 1, 2022

The metadata table returns from Spark query should show human readable string for lower/upper bounds #6085

Closed

chenjunjiedada reviewed Nov 2, 2022


szehon-ho force-pushed the readable_metrics branch 2 times, most recently from 21c7205 to 0e68ae3 Compare November 3, 2022 23:01

szehon-ho added 3 commits December 1, 2022 12:16

Simplify id assignment logic for readable_metrics struct

7aec222

Fix tests

5a5afba

Fix whitespace

fc91a62

szehon-ho added 4 commits December 2, 2022 08:02

Make code more generic for future metrics, need to only add a definition in READABLE_METRIC_COLS static array

766da81

Fix tests

3dbfe95

Fix spotless

04ff627

Restore missing null check

ab2b186

szehon-ho added 2 commits December 2, 2022 11:26

Simplify old tests use to util methods

1436853

Same for 3.2 test

6681dba

szehon-ho commented Dec 2, 2022


RussellSpitzer reviewed Dec 2, 2022


RussellSpitzer approved these changes Dec 2, 2022


RussellSpitzer reviewed Dec 2, 2022


Review comments

01f5e9d

Fix minor typo of class name in UnsupportedOperationException

bb70c43

szehon-ho merged commit 9a00f74 into apache:master Dec 5, 2022

hililiwei mentioned this pull request Dec 6, 2022

Flink: Support inspecting table #6222

Merged

szehon-ho mentioned this pull request Apr 17, 2023

Add readable_metrics to entries metadata table #7364

Closed

dramaticlly mentioned this pull request May 6, 2023

Implement ReadableMetrics for Entries table #7539

Merged

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023

Core: Add readable_metrics columns to files metadata tables (apache#5376) (apache#861) Co-authored-by: Szehon Ho <szehon.apache@gmail.com>

2ea38ad

szehon-ho mentioned this pull request May 24, 2023

Iceberg metadata not stored properly #6244

Closed

