Core: Add total data size to Partitions table by hsiang-c · Pull Request #7920 · apache/iceberg

hsiang-c · 2023-06-27T03:53:16Z

This PR adds total_data_file_size_in_bytes to the Partitions table.

ajantha-bhat · 2023-06-27T13:26:42Z

cc: @szehon-ho, as you mostly work on and review this area.

szehon-ho

Thanks, yeah, I was chatting earlier with @hsiang-c about this :)

szehon-ho · 2023-06-27T17:13:34Z

                Types.LongType.get(),
-                "Id of snapshot that last updated this partition"));
+                "Id of snapshot that last updated this partition"),
+            Types.NestedField.required(


This topic always comes up, but what do you think of the position, @ajantha-bhat @dramaticlly? Maybe it's better after file_count? (So we have 3 columns for data, pos_delete, and eq_delete.)

Yeah, I think what Szehon said makes sense, given that the last 2 columns are optional and the new column is required.

Agree. I have already kept it beside file_count for partition stats.

Note: we should not modify the field ID while reordering, to maintain compatibility.
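To illustrate the note about field IDs, here is a minimal sketch in plain Java, not Iceberg code. `Field`, `appended`, `grouped`, `nameOf`, and `positionOf` are hypothetical names, and the column name shown for ID 5 is an assumption. IDs 3 and 11 come from this thread. The point is that a reader resolving columns by ID gets the same answer before and after the move, while the position changes.

```java
import java.util.Arrays;
import java.util.List;

class FieldOrderSketch {

  // Hypothetical, simplified stand-in for a schema field: a stable ID plus a name.
  static final class Field {
    final int id;
    final String name;

    Field(int id, String name) {
      this.id = id;
      this.name = name;
    }
  }

  // First proposal: the new field (ID 11) is appended at the end.
  static List<Field> appended() {
    return Arrays.asList(
        new Field(3, "file_count"),
        new Field(5, "position_delete_record_count"), // name for ID 5 is an assumption
        new Field(11, "total_data_file_size_in_bytes"));
  }

  // After reordering: ID 11 sits next to file_count, and no ID is changed.
  static List<Field> grouped() {
    return Arrays.asList(
        new Field(3, "file_count"),
        new Field(11, "total_data_file_size_in_bytes"),
        new Field(5, "position_delete_record_count")); // name for ID 5 is an assumption
  }

  // Resolves a column name by field ID, independent of its position.
  static String nameOf(List<Field> fields, int id) {
    for (Field field : fields) {
      if (field.id == id) {
        return field.name;
      }
    }
    return null;
  }

  // Returns the zero-based position of a field ID, or -1 if it is absent.
  static int positionOf(List<Field> fields, int id) {
    for (int i = 0; i < fields.size(); i++) {
      if (fields.get(i).id == id) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(nameOf(appended(), 11) + " at position " + positionOf(appended(), 11));
    System.out.println(nameOf(grouped(), 11) + " at position " + positionOf(grouped(), 11));
  }
}
```

Readers that reference columns by position would still see a difference, which is the concern raised about reordering.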

dramaticlly · 2023-06-27T22:38:36Z

+                "total_data_size_in_bytes",
+                StreamSupport.stream(
+                        table.currentSnapshot().addedDataFiles(table.io()).spliterator(), false)
+                    .mapToLong(DataFile::fileSizeInBytes)
+                    .sum())


Probably worth extracting a variable instead of inlining the computation.

Also, I saw you added coverage for the unpartitioned table only. Shall we also add a test for a partitioned table to make sure its data size in bytes matches for each partition?

@dramaticlly Thank you for your feedback.

Yes, we should add tests for partitioned table. I was able to do it for testPartitionsTable and testPartitionsTableDeleteStats but not testPartitionsTableLastUpdatedSnapshot.

Will dig into it more today.

@dramaticlly I think I fixed testPartitionsTableLastUpdatedSnapshot, please take a look, thanks!
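For reference, the suggestion to pull the inline sum out of the assertion could look roughly like the sketch below. It is plain Java and does not use Iceberg classes: `SizedFile` is a hypothetical stand-in for `DataFile` that models only `fileSizeInBytes()`, and the class and method names are assumptions rather than the code that was merged.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.StreamSupport;

class DataFileSizeSketch {

  // Hypothetical stand-in for Iceberg's DataFile; only the size accessor is modeled.
  static final class SizedFile {
    private final long sizeInBytes;

    SizedFile(long sizeInBytes) {
      this.sizeInBytes = sizeInBytes;
    }

    long fileSizeInBytes() {
      return sizeInBytes;
    }
  }

  // Sums the sizes of all files in the iterable; an empty iterable yields 0.
  static long dataFileSizeInBytes(Iterable<SizedFile> dataFiles) {
    return StreamSupport.stream(dataFiles.spliterator(), false)
        .mapToLong(SizedFile::fileSizeInBytes)
        .sum();
  }

  public static void main(String[] args) {
    List<SizedFile> files =
        Arrays.asList(new SizedFile(100L), new SizedFile(250L), new SizedFile(50L));

    // Extracted variable: the expected value is computed once and named,
    // instead of being inlined into the expected row.
    long expectedTotal = dataFileSizeInBytes(files);
    System.out.println("total_data_file_size_in_bytes = " + expectedTotal);
  }
}
```

Naming the total before building the expected record keeps the test row readable and lets the same helper be reused for the partitioned-table tests.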

szehon-ho

Looks good, just some style nits

szehon-ho · 2023-06-29T18:14:55Z

+                11,
+                "total_data_file_size_in_bytes",
+                Types.LongType.get(),
+                "Total bytes of data files in a partition"),


nit: 'total size in bytes'

szehon-ho · 2023-06-29T18:24:20Z

    return SparkSchemaUtil.convert(selectNonDerived(metadataTable).schema()).asStruct();
  }
+
+  private long getDataFileSizeInBytes(Iterable<DataFile> dataFiles) {


nit: we can remove 'get' (Iceberg code style guidelines are a bit different: https://iceberg.apache.org/contribute/#method-naming)

szehon-ho · 2023-06-29T18:25:33Z

    private int eqDeleteFileCount;
    private Long lastUpdatedMs;
    private Long lastUpdatedSnapshotId;
+    private long dataFileSizeInBytes;


nit: can we move it after dataFileCount? (as it's part of the 'dataFile' group)

Speaking of which, @szehon-ho, do you think we should do the same in the schema method and move this new field with id 11 to right after file_count (field id 3)? It seems to fit into the same dataFile group, but there might be some concern that references by position would get messed up?

Oh yeah, I think that was the consensus from the other comment: #7920 (comment). @hsiang-c, do you think we can move it?

@szehon-ho Sure thing!

szehon-ho · 2023-06-30T07:56:09Z

+                11,
+                "total_data_file_size_in_bytes",
+                Types.LongType.get(),
+                "Total size in bytes"),


Ah sorry, in my previous comment I meant just change 'total bytes' => 'total size in bytes', but the rest was ok.

So can we restore the original end of the sentence, where you mentioned data files?

'Total size in bytes of data files' (maybe 'in a partition' was redundant there)

szehon-ho · 2023-06-30T18:47:27Z


 Note:
-For unpartitioned tables, the partitions table will contain only the record_count and file_count columns.
+For unpartitioned tables, the partitions table will contain only the record_count, file_count, position_delete_record_count, position_delete_file_count, equality_delete_record_count, equality_delete_file_count, last_updated_ms, last_updated_snapshot_id and total_data_file_size_in_bytes columns.


Should we do this in another PR? I feel we need to edit the table above as well.

Also, I think we can just say 'For unpartitioned tables, the partitions table will not contain the partition and spec_id fields', as the list of fields we do support is becoming too big.

Agreed, we can follow up with a doc PR after this is merged.

szehon-ho · 2023-06-30T18:48:28Z

                Types.IntegerType.get(),
                "Count of equality delete files"),
+            Types.NestedField.required(
+                11, "total_data_file_size_in_bytes", Types.LongType.get(), "Total size in bytes"),


This is still not changed back? "Total size in bytes of data files". Sorry if it's still pending.

Also, let's move it up to between 3 and 5, since it belongs to the data file group.

szehon-ho · 2023-07-06T18:04:10Z

            AvroSchemaUtil.convert(
                partitionsTable.schema().findType("partition").asStructType(), "partition"));
+
+    List<DataFile> dataFilesFromFirstCommit = listDataFilesFromCommitId(table, firstCommitId);


Would it work to make a method List<DataFile> dataFiles(table) that gets all the data files, so we don't have to add data files from both commits?

I did this before here: https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java#L682

(maybe we can do it without column stats here, to be shorter).

If we do this, we can even extract to TestHelpers in a later PR.

@szehon-ho Thanks for pointing out! Adopted it.

If we do this, we can even extract to TestHelpers in a later PR.

+1, let's do the extraction in a later PR.

szehon-ho · 2023-07-07T06:26:33Z

+    return Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
+  }
+
+  private void assertDataFilePartitions(List<DataFile> dataFiles, int[] expectedPartitionIds) {


Nit: we can put back the size check.
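As a rough illustration of why the size check matters in a helper like this, here is a sketch in plain Java. `PartitionedFile` is a hypothetical stand-in that models only a partition ID, and the helper body is an assumption, since the thread shows only the method signature. Without the size comparison, a list shorter than the expected array would pass the element-by-element loop without any failure.

```java
import java.util.Arrays;
import java.util.List;

class PartitionAssertSketch {

  // Hypothetical stand-in for a data file; only a partition ID is modeled.
  static final class PartitionedFile {
    final int partitionId;

    PartitionedFile(int partitionId) {
      this.partitionId = partitionId;
    }
  }

  // Fails if the number of files differs from the number of expected IDs,
  // or if any file's partition ID differs from the expected ID at the same index.
  static void assertDataFilePartitions(
      List<PartitionedFile> dataFiles, int[] expectedPartitionIds) {
    if (dataFiles.size() != expectedPartitionIds.length) {
      throw new AssertionError(
          "Expected " + expectedPartitionIds.length + " data files but found " + dataFiles.size());
    }
    for (int i = 0; i < dataFiles.size(); i++) {
      int actual = dataFiles.get(i).partitionId;
      if (actual != expectedPartitionIds[i]) {
        throw new AssertionError(
            "Data file " + i + ": expected partition " + expectedPartitionIds[i]
                + " but found " + actual);
      }
    }
  }

  public static void main(String[] args) {
    List<PartitionedFile> files =
        Arrays.asList(new PartitionedFile(1), new PartitionedFile(2));
    assertDataFilePartitions(files, new int[] {1, 2});
    System.out.println("partition IDs match");
  }
}
```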

szehon-ho · 2023-07-07T17:35:27Z

Merged, thanks a lot @hsiang-c for the first contribution, and thanks @ajantha-bhat and @dramaticlly for the additional reviews!

szehon-ho · 2023-07-07T17:36:08Z

(Feel free to make follow-up PRs to update the docs.)

github-actions Bot added core docs spark labels Jun 27, 2023

ajantha-bhat reviewed Jun 27, 2023

Comment thread docs/flink-queries.md Outdated

Comment thread docs/spark-queries.md Outdated

Comment thread core/src/main/java/org/apache/iceberg/PartitionsTable.java Outdated

szehon-ho reviewed Jun 27, 2023

dramaticlly reviewed Jun 27, 2023

szehon-ho reviewed Jun 29, 2023

szehon-ho reviewed Jun 30, 2023

hsiang-c commented Jul 3, 2023

Comment thread spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java Outdated

dramaticlly reviewed Jul 5, 2023

Comment thread spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestIcebergSourceTablesBase.java Outdated

szehon-ho reviewed Jul 6, 2023

hsiang-c added 17 commits July 7, 2023 11:10

Add total_data_size_in_bytes to Partitions table

b818218

Return total_data_size_in_bytes for unpartitioned partitions table

155f6fb

Style fix

943b52d

Required columns first

de3948e

Align column name w/ partition stats spec

2328796

- https://github.com/apache/iceberg/pull/7105/files

Extract datafile size summing to a method

647b2a2

Test 'total_data_file_size_in_bytes' for partitioned table

630fc6c

Deleted files are excluded from size stat

7289c46

Renamed doc/method based on review comments

e559b2b

Sum data file size after rewriting manifest

77001d6

Fix style

28d9d58

Sync implementation and flink/spark docs

1cd14a9

Fixed field doc according to comments

7e35379

Group data file stats

fdf508d

Suppress MethodLength warnings

2f52907

Make long explicit

b340540

Revert doc change. Will fix it in later PR.

61de57d

hsiang-c added 6 commits July 7, 2023 11:55

Extract assertions to a helper method

92c66af

Parameterize helper method

e5fb59a

Fix statement

6067cfd

Rename methods based on feedbacks

4641684

Switch to Guava's Lists

32074d6

Check partition id for all data files

f29a865

hsiang-c force-pushed the partitions_data_size branch from 0373042 to f29a865 Compare July 7, 2023 05:22

Switch to array impl

e3dbd94

szehon-ho reviewed Jul 7, 2023

szehon-ho approved these changes Jul 7, 2023

hsiang-c added 3 commits July 7, 2023 14:37

Revert "Switch to array impl"

88b8504

This reverts commit e3dbd94.

Move size assertion into test helper

78a4c65

Removed redundant size

e0b7f87

szehon-ho merged commit 025cdf0 into apache:master Jul 7, 2023

hsiang-c deleted the partitions_data_size branch July 9, 2023 02:31

This was referenced Jul 9, 2023

Docs: Update Partitions table in Flink/Spark doc #8021

Merged

Spark: Consolidate duplicated test methods to TestHelpers #8024

Merged
