Skip to content

API, Core: Expose file and data sequence numbers through ContentFile#7555

Merged
amogh-jahagirdar merged 1 commit into
apache:masterfrom
gaborkaszab:expose_seq_nums
May 16, 2023
Merged

API, Core: Expose file and data sequence numbers through ContentFile#7555
amogh-jahagirdar merged 1 commit into
apache:masterfrom
gaborkaszab:expose_seq_nums

Conversation

@gaborkaszab

Copy link
Copy Markdown
Contributor

No description provided.

@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Resolves #7449

@aokolnychyi

Copy link
Copy Markdown
Contributor

Will take a look today. Thanks, @gaborkaszab!

}

/** Returns the data sequence number of the snapshot in which the file should be applied. */
default Long dataSequenceNumber() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the spec, we should call this sequenceNumber instead of dataSequenceNumber? Although I know dataSequenceNumber is probably more clear.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually deprecated and removed sequenceNumber in ManifestEntry cause it was not clear what it represents. I'd be inclined to match what we currently have in ManifestEntry.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I saw that ManifestEntry.sequenceNumber() was removed, hence I used the same naming in ContentFile to be consistent with the current state of ManifestEntry.

@jackye1995 jackye1995 added this to the Iceberg 1.3.0 milestone May 8, 2023
Comment thread api/src/main/java/org/apache/iceberg/ContentFile.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/ContentFile.java Outdated
private byte[] keyMetadata = null;
private Integer sortOrderId;

private Long dataSequenceNumber = null;

@aokolnychyi aokolnychyi May 9, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason to have these two as a separate block? Right now, BaseFile only have two blocks to keep optional fields separate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realise that pattern with the blocks because 'avroSchema' and 'fromProjectionPos/partitionType' also make their own block. Regardless, I merged the new ones with the optional block.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these two indeed optional? I'd say they belong more to the block with partitionSpecId. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, giving it a second thought, they indeed aren't optional. Added them to the same block as partitionSpecId.

Comment thread core/src/main/java/org/apache/iceberg/BaseFile.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/BaseFile.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/ContentFile.java
Comment thread core/src/test/java/org/apache/iceberg/TableTestBase.java
@aokolnychyi

aokolnychyi commented May 9, 2023

Copy link
Copy Markdown
Contributor

I think this approach is correct, I had mostly minor comments.

I also wonder whether we should prohibit passing new data files with a set explicit fileSequenceNumber, just like we don't allow new manifests with a set snapshot ID. The file sequence number must be assigned at commit.

Thoughts, @jackye1995 @RussellSpitzer @rdblue?

V1Assert.assertEquals(
"Data sequence number should default to 0", 0, entry.dataSequenceNumber().longValue());
V1Assert.assertEquals(
"Data sequence number should default to 0",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need assert on file sequence number too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for file sequence number.

However, I'm wondering if the checks for dataSequenceNumber are correct. If I'm not mistaken if there was a rewrite then the dataSequenceNumber on the ManifestEntry can differ from the one on the Snapshot. So these checks (even the existing ones) are correct only if there were no rewrites on the table. Is my understanding correct here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaborkaszab, you are correct that rewrites may preserve the initial data sequence number. This method would not work for such rewrites. I believe we have other tests that pass an iterator of sequence numbers that should be used in those cases.

@rdblue rdblue left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a reasonable approach to this.

@gaborkaszab gaborkaszab self-assigned this May 11, 2023
Comment thread core/src/test/java/org/apache/iceberg/TestSnapshot.java
Comment thread core/src/test/java/org/apache/iceberg/TableTestBase.java
Comment thread core/src/main/java/org/apache/iceberg/BaseFile.java
@aokolnychyi

Copy link
Copy Markdown
Contributor

Let me take another look in a bit.

@aokolnychyi aokolnychyi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me. I had one question about the place where to put new variables in BaseFile. I believe @amogh-jahagirdar also had some nits. Otherwise, should be good to go.

Co-authored-by: chenjunjiedada <jimmyjchen@tencent.com>

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me thanks @gaborkaszab !

@amogh-jahagirdar

Copy link
Copy Markdown
Contributor

Merging since we have a few approvals, and no blocking comments. thanks @gaborkaszab for the PR, and @rdblue @aokolnychyi @stevenzwu @jackye1995 for reviews.

@amogh-jahagirdar amogh-jahagirdar merged commit 9907a97 into apache:master May 16, 2023
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Thanks for the reviews, @amogh-jahagirdar @aokolnychyi @rdblue @stevenzwu @jackye1995 !

zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants