Skip to content

Core: Add TableUtil to provide access to a table's format version#11620

Merged
nastra merged 3 commits into
apache:mainfrom
nastra:table-util
Dec 16, 2024
Merged

Core: Add TableUtil to provide access to a table's format version#11620
nastra merged 3 commits into
apache:mainfrom
nastra:table-util

Conversation

@nastra

@nastra nastra commented Nov 21, 2024

Copy link
Copy Markdown
Contributor

This is an alternative impl to #11587

Comment thread core/src/main/java/org/apache/iceberg/SerializableTable.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TableUtil.java

// being able to read the format version from the PositionDeletesTable is mainly needed in
// SparkPositionDeletesRewrite when determining whether to rewrite V2 position deletes to DVs
if (table instanceof BaseMetadataTable) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we had scala or even kotlin

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the reasons I want scala is I really want to just enumerate cases here. I would recommend we just go through all of our cases narrower then broader if we have exceptions so

If (PositionDeleteTable)
  return format version 
else if (MetaTable) {
  Sorry Brah
}
else if (HasTableOperations) {
   return format version
}
else {
 // Sorry Brah
}

Although I am curious why position delete table needs a format version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I am curious why position delete table needs a format version?

@RussellSpitzer this is mainly for SparkPositionDeletesRewrite (which operates against the position deletes table). Basically when rewriting existing position deletes we need to know whether we need to rewrite them to V2 position deletes or to DVs by looking at the underlying format version of the table. A table that was upgraded to V3 can still have V2 position deletes, meaning that these would then have to be rewritten as DVs

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?

Ie

formatVersion(metadataTable.baseTable)

Rather than

formatVersion(metadataTable)

Implicitly calling metadataTable.baseTable but only sometimes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend we just go through all of our cases narrower then broader if we have exceptions so

I would prefer that too, but the fact that SerializableMetadataTable is a subclass of SerializableTable which in turn implements HasTableOperations makes this more difficult and you still need to differentiate there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just be looking at the underlying table then? Shouldn't the converter look at the base table rather than the metadata table?

I'm not fully sure I follow your comment. Do you mean the calling site should first check whether it's a metadata table before calling TableUtil.formatVersion(...)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But I also don't think the caller should need to check. Shouldn't the caller know what it's doing? Like if it is compacting DeleteVectors it knows it has a DeleteVectorMetadataTable and therefor it uses the parent table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would require changing a bunch of more places, because effectively we have a Broadcast<Table> in SparkPositionDeletesRewrite:

Broadcast<Table> tableBroadcast =
sparkContext.broadcast(SerializableTableWithSize.copyOf(table));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a an argument for ease of current implementation. But do you think it's the right decision going forward to have a format version for position deletes metadata table and none of the other metadata tables?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simplified the checks a bit. I do think it makes sense to have an easy way to fetch the format version from the PositionDeletesTable, since it basically translates 1:1 to the underlying table and so the format version of the PositionDeletesTable would be the same as the underlying table . The question is whether we add this logic to TableUtil or to the calling site(s) and would like to hear what others think here /cc @amogh-jahagirdar @aokolnychyi

@nastra nastra force-pushed the table-util branch 3 times, most recently from 4515c45 to 8c69d34 Compare November 22, 2024 10:03
return new BaseTable(ops, tableName);
}

public Table underlyingTable() {

@aokolnychyi aokolnychyi Nov 26, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: What about capturing the format version as a field in SerializableTable, similar to what we do for the metadata file location? The problem right now is that calling lazyTable() may actually require a request to load the metadata, which is something we would want to ideally avoid. Historically, we kept separate fields for what is considered important information and can be accessed frequently.

Comment thread core/src/main/java/org/apache/iceberg/TableUtil.java Outdated
@nastra nastra force-pushed the table-util branch 2 times, most recently from bdfc988 to 13a016c Compare November 27, 2024 07:57

// SerializableMetadataTable is a subclass of SerializableTable but does not support
// operations()
if (table instanceof HasTableOperations

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still have a problem that getting SerializableTable here would require loading the metadata location and sending a request to the storage. I would consider capturing format version as a field in SerializableTable and throwing UnsupportedOperationException in SerializableMetadataTable that implements it. Then the whole logic can be structured as follows:

if (table instanceof SerializableTable) {
  SerializableTable serializableTable = (SerializableTable) table;
  return serializableTable.formatVersion();
} else if (table instanceof HasTableOperations) {
  TableOperations ops = ((HasTableOperations) table).operations();
  return ops.current().formatVersion();
} else {
  throw new IllegalArgumentException(
      String.format("%s does not have a format version", table.getClass().getSimpleName()));
}

What do you think, @nastra?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still have a problem that getting SerializableTable here would require loading the metadata location and sending a request to the storage

Yes this is a trade-off here because I was trying to avoid having effectively a mix of a Util class and adding formatVersion() to the SerializableTable API. I don't have a strong opinion here so I've done this in 1d8ef05 to show its implications

private transient volatile Schema lazySchema = null;
private transient volatile Map<Integer, PartitionSpec> lazySpecs = null;
private transient volatile SortOrder lazySortOrder = null;
private final UUID uuid;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moving this up to the other final fields

@aokolnychyi aokolnychyi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some optional suggestions. Thanks for adding the tests too!

SerializableTable serializableTable = (SerializableTable) table;
return serializableTable.formatVersion();
} else if (table instanceof HasTableOperations) {
return ((HasTableOperations) table).operations().current().formatVersion();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: Same comment about an extra variable for TableOperations.


private int formatVersion(Table table) {
if (table instanceof HasTableOperations) {
return ((HasTableOperations) table).operations().current().formatVersion();

@aokolnychyi aokolnychyi Dec 13, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: What about an extra variable for TableOperations? We have an explicit casts and 3 method invocations on a single line. An extra variable may be easier to read and would be consistent with metadataFileLocation.

if (table instanceof HasTableOperations) {
return ((HasTableOperations) table).operations().current().formatVersion();
} else {
// formatVersion=-1 will never be used/returned, because

@aokolnychyi aokolnychyi Dec 13, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: We may actually avoid the override by doing something like below.

public int formatVersion() {
  if (formatVersion == UNKNOWN_FORMAT_VERSION) {
    throw new UnsupportedOperationException("Format version is unknown");
  }
  return formatVersion;
}

private int formatVersion(Table table) {
  if (table instanceof HasTableOperations) {
    ...
  } else {
    return UNKNOWN_FORMAT_VERSION;
  }
}

This would also cover cases for regular tables that didn't have TableOperations under the hood.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, I've updated this (although I've used a slightly different error msg to be consistent with the error msg in TableUtil)

@nastra nastra merged commit f40ec20 into apache:main Dec 16, 2024
@nastra nastra deleted the table-util branch December 16, 2024 10:10
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request Apr 22, 2026
…pat parser

Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop
   ApplyReadRestrictions previously generated a new SecureRandom salt
   inside the per-attribute match inside resolveOperators, so fixed-point
   re-entry could regenerate the salt mid-planning and break determinism
   within a query. Generate the salt once per apply() call and thread it
   through buildMaskExpression so all Sha256QueryLocal actions in the
   rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions
   Match the capability-marker naming used elsewhere in Iceberg
   (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize
   the sniff in TableUtil.readRestrictions(Table) following the
   TableUtil.formatVersion precedent (apache#11620), so engines have one decision
   locus and future Flink/Trino integrations don't re-invent the check.

3. Forward-compat for unknown action discriminators
   ActionParser used to throw IllegalArgumentException on an unrecognized
   action type, blocking older clients from interoperating with newer
   servers. Introduce Action.Unknown that preserves (fieldId, rawTypeString)
   so parsing is lossless; enforcement (Actions.bind, rule binding) fails
   closed when it encounters Unknown so silent bypass of unmasked data is
   impossible. Mirrors ImmutableUnknownViewRepresentation in the view spec.
   The existing 'unknownActionTypeRejected' test is renamed and updated to
   assert preservation instead of rejection.

Deferred to follow-up PRs:
  - Bind row filter at parse time (needs schema plumbing in LoadTableResponseParser)
  - TestApplyReadRestrictions + TestReadRestrictedQuery (needs SparkSession fixture)
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request May 27, 2026
…pat parser

Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop
   ApplyReadRestrictions previously generated a new SecureRandom salt
   inside the per-attribute match inside resolveOperators, so fixed-point
   re-entry could regenerate the salt mid-planning and break determinism
   within a query. Generate the salt once per apply() call and thread it
   through buildMaskExpression so all Sha256QueryLocal actions in the
   rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions
   Match the capability-marker naming used elsewhere in Iceberg
   (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize
   the sniff in TableUtil.readRestrictions(Table) following the
   TableUtil.formatVersion precedent (apache#11620), so engines have one decision
   locus and future Flink/Trino integrations don't re-invent the check.

3. Forward-compat for unknown action discriminators
   ActionParser used to throw IllegalArgumentException on an unrecognized
   action type, blocking older clients from interoperating with newer
   servers. Introduce Action.Unknown that preserves (fieldId, rawTypeString)
   so parsing is lossless; enforcement (Actions.bind, rule binding) fails
   closed when it encounters Unknown so silent bypass of unmasked data is
   impossible. Mirrors ImmutableUnknownViewRepresentation in the view spec.
   The existing 'unknownActionTypeRejected' test is renamed and updated to
   assert preservation instead of rejection.

Deferred to follow-up PRs:
  - Bind row filter at parse time (needs schema plumbing in LoadTableResponseParser)
  - TestApplyReadRestrictions + TestReadRestrictedQuery (needs SparkSession fixture)
sfc-gh-prsingh added a commit to singhpk234/iceberg that referenced this pull request May 27, 2026
…pat parser

Three of five reviewer-flagged items from the Anton/Ryan synthesis:

1. Salt out of the Catalyst resolution inner loop
   ApplyReadRestrictions previously generated a new SecureRandom salt
   inside the per-attribute match inside resolveOperators, so fixed-point
   re-entry could regenerate the salt mid-planning and break determinism
   within a query. Generate the salt once per apply() call and thread it
   through buildMaskExpression so all Sha256QueryLocal actions in the
   rewrite share a stable salt.

2. Rename ReadRestrictionsAware -> SupportsReadRestrictions
   Match the capability-marker naming used elsewhere in Iceberg
   (SupportsDistributedScanPlanning, SupportsReplaceView, etc.). Centralize
   the sniff in TableUtil.readRestrictions(Table) following the
   TableUtil.formatVersion precedent (apache#11620), so engines have one decision
   locus and future Flink/Trino integrations don't re-invent the check.

3. Forward-compat for unknown action discriminators
   ActionParser used to throw IllegalArgumentException on an unrecognized
   action type, blocking older clients from interoperating with newer
   servers. Introduce Action.Unknown that preserves (fieldId, rawTypeString)
   so parsing is lossless; enforcement (Actions.bind, rule binding) fails
   closed when it encounters Unknown so silent bypass of unmasked data is
   impossible. Mirrors ImmutableUnknownViewRepresentation in the view spec.
   The existing 'unknownActionTypeRejected' test is renamed and updated to
   assert preservation instead of rejection.

Deferred to follow-up PRs:
  - Bind row filter at parse time (needs schema plumbing in LoadTableResponseParser)
  - TestApplyReadRestrictions + TestReadRestrictedQuery (needs SparkSession fixture)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants