Skip to content

Core: Add total data size to Partitions table#7920

Merged
szehon-ho merged 27 commits into
apache:masterfrom
hsiang-c:partitions_data_size
Jul 7, 2023
Merged

Core: Add total data size to Partitions table#7920
szehon-ho merged 27 commits into
apache:masterfrom
hsiang-c:partitions_data_size

Conversation

@hsiang-c

@hsiang-c hsiang-c commented Jun 27, 2023

Copy link
Copy Markdown
Contributor

Closes #7896

This PR adds total_data_file_size_in_bytes to Partitions Table

Comment thread docs/flink-queries.md Outdated
Comment thread docs/spark-queries.md Outdated
Comment thread core/src/main/java/org/apache/iceberg/PartitionsTable.java Outdated
@ajantha-bhat

Copy link
Copy Markdown
Member

cc: @szehon-ho as you are mostly working and reviewing this area.

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks , yea I was chatting earlier with @hsiang-c for this :)

Comment thread core/src/main/java/org/apache/iceberg/PartitionsTable.java Outdated
Types.LongType.get(),
"Id of snapshot that last updated this partition"));
"Id of snapshot that last updated this partition"),
Types.NestedField.required(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this topic always comes up, but what do you think of the position? @ajantha-bhat @dramaticlly . Maybe its better after file_count? (so we have 3 columns for data, pos_delete, and eq_delete)

@dramaticlly dramaticlly Jun 27, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think what Szehon said make sense, given last 2 columns are optional and new column is required

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I have already kept it beside file_count for partition stats.

Note: Here we should not modify field id while reordering to maintain the compatibility.

Comment on lines +1299 to +1303
"total_data_size_in_bytes",
StreamSupport.stream(
table.currentSnapshot().addedDataFiles(table.io()).spliterator(), false)
.mapToLong(DataFile::fileSizeInBytes)
.sum())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth extract a variable instead of inline the computation.

also I saw you added coverage for unpartitioned table only, shall we also add one for partitioned table to make sure it s data size in bytes match for each partition?

@hsiang-c hsiang-c Jun 28, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dramaticlly Thank you for your feedback.

Yes, we should add tests for partitioned table. I was able to do it for testPartitionsTable and testPartitionsTableDeleteStats but not testPartitionsTableLastUpdatedSnapshot.

Will dig into it more today.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dramaticlly I think I fixed testPartitionsTableLastUpdatedSnapshot, please take a look, thanks!

@szehon-ho szehon-ho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some style nits

11,
"total_data_file_size_in_bytes",
Types.LongType.get(),
"Total bytes of data files in a partition"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'total size in bytes'

return SparkSchemaUtil.convert(selectNonDerived(metadataTable).schema()).asStruct();
}

private long getDataFileSizeInBytes(Iterable<DataFile> dataFiles) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can remove 'get' (Iceberg code style guideline are a bit different: https://iceberg.apache.org/contribute/#method-naming)

private int eqDeleteFileCount;
private Long lastUpdatedMs;
private Long lastUpdatedSnapshotId;
private long dataFileSizeInBytes;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we move after dataFileCount? (as its part of 'dataFile' group)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

speak of which, @szehon-ho do you feel we shall do the same in Schema method to move this new field with id 11 to be right after file_count (field id 3)? It seem to fit into same dataFile group by it might be some concern about reference by position to mess up?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yea , i think that was the consensus from the other comemnt: #7920 (comment) @hsiang-c do you think we can move it?

@hsiang-c hsiang-c Jul 3, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho Sure thing!

11,
"total_data_file_size_in_bytes",
Types.LongType.get(),
"Total size in bytes"),

@szehon-ho szehon-ho Jun 30, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, in my previous comment I meant just change "total bytes' => 'total size in bytes', but the rest was ok.

So can we revert back the original end of sentence where you talked about data files?

'Total size in bytes of data files' (maybe 'in a partition' was redundant there)

Comment thread docs/flink-queries.md Outdated

Note:
For unpartitioned tables, the partitions table will contain only the record_count and file_count columns.
For unpartitioned tables, the partitions table will contain only the record_count, file_count, position_delete_record_count, position_delete_file_count, equality_delete_record_count, equality_delete_file_count, last_updated_ms, last_updated_snapshot_id and total_data_file_size_in_bytes columns.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this in another pr? I feel we need to edit the table above as well.

Also, I think we can just say 'For unpartitioned tables, the partitions table will not contain the partition and spec_id field', as the list of fields we do support is becoming too big.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, we can follow up with doc PR after this is merged

Types.IntegerType.get(),
"Count of equality delete files"),
Types.NestedField.required(
11, "total_data_file_size_in_bytes", Types.LongType.get(), "Total size in bytes"),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not changed back? "Total size in bytes of data files" Sorry if its still pending

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also let's move it up to between 3 and 5 since it belong to data file group

AvroSchemaUtil.convert(
partitionsTable.schema().findType("partition").asStructType(), "partition"));

List<DataFile> dataFilesFromFirstCommit = listDataFilesFromCommitId(table, firstCommitId);

@szehon-ho szehon-ho Jul 6, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it work to make a method List dataFiles(table) to get all the data files, so we don't have to do add data files from both commits?

I did this before here: https://github.com/apache/iceberg/blob/master/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewritePositionDeleteFilesAction.java#L682

(maybe we can do it without column stats here, to be shorter).

If we do this, we can even extract to TestHelpers in a later PR.

@hsiang-c hsiang-c Jul 7, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho Thanks for pointing out! Adopted it.

If we do this, we can even extract to TestHelpers in a later PR.

+1, let's do the extraction in a later PR.

@hsiang-c hsiang-c force-pushed the partitions_data_size branch from 0373042 to f29a865 Compare July 7, 2023 05:22
return Lists.newArrayList(CloseableIterable.transform(tasks, FileScanTask::file));
}

private void assertDataFilePartitions(List<DataFile> dataFiles, int[] expectedPartitionIds) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can put back the size check.

@szehon-ho szehon-ho merged commit 025cdf0 into apache:master Jul 7, 2023
@szehon-ho

Copy link
Copy Markdown
Member

Merged , thanks a lot @hsiang-c for the first contribution, and thanks @ajantha-bhat and @dramaticlly for additional reviews!

@szehon-ho

Copy link
Copy Markdown
Member

(Feel free to make follow prs to update the docs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add total file size to Partitions table

4 participants