Skip to content

Hive: Optimize tableExists API in hive catalog#11597

Merged
danielcweeks merged 11 commits into
apache:mainfrom
dramaticlly:hiveTableExists
Dec 12, 2024
Merged

Hive: Optimize tableExists API in hive catalog#11597
danielcweeks merged 11 commits into
apache:mainfrom
dramaticlly:hiveTableExists

Conversation

@dramaticlly

Copy link
Copy Markdown
Contributor

Skip creation of hive table operation when check existence of iceberg table in hive catalog.

Today the table existence rely on load the table first and return true if table can be loaded

default boolean tableExists(TableIdentifier identifier) {
try {
loadTable(identifier);
return true;
} catch (NoSuchTableException e) {
return false;
}
}

I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.

This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.

Why this is safe? HiveOperationsBase.validateTableIsIceberg is also used in catalog listTables API to differentiate the iceberg table from hive table

Skip creation of hive table operation when check existence of iceberg table in hive catalog
@github-actions github-actions Bot added the hive label Nov 20, 2024
@dramaticlly

Copy link
Copy Markdown
Contributor Author

FYI @szehon-ho and @haizhou-zhao if you are interested

@haizhou-zhao haizhou-zhao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @dramaticlly , nice work

@pvary

pvary commented Nov 20, 2024

Copy link
Copy Markdown
Contributor

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

@dramaticlly

dramaticlly commented Nov 20, 2024

Copy link
Copy Markdown
Contributor Author

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted.

Personally I think we can focus on former only and here's my thought process

There are roughly 3 places where catalog.tableExists was used in iceberg code base

  1. Check before table can be registered in registerTable API
  • I believe behaviour change is allowed here as long as entry exist in catalog, register shall fail regardless files is corrupted
  1. Check before table stage creation, this is only used in REST catalog handler
  • I believe behaviour change is also allowed here as long as entry exist in catalog, stage creation shall fail regardless version files is corrupted
  1. REST API to check for table existence:
    head:
    tags:
    - Catalog API
    summary: Check if a table exists
    operationId: tableExists
  • I think this is what I originally hoped for to optimize on, to speed up on the existence check without rely on reading metadata first. The reason is that sometimes existence check is all we need without subsequent load table call

@pvary

pvary commented Nov 21, 2024

Copy link
Copy Markdown
Contributor

Thanks for the check @dramaticlly!
I agree that this behavioural change is small, but I would like to raise awareness around the community about this. Could you please write a letter to the dev list describing what is planned here? So if there is someone who is against the change, they could raise their voices.

Thanks,
Peter

@pvary

pvary commented Nov 21, 2024

Copy link
Copy Markdown
Contributor

Do we want to do the same opimization for the viewExists method too?

@karuppayya

Copy link
Copy Markdown
Contributor

Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass.
With the change, tableExists would pass with only access to HMS table?

Comment thread hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Outdated
Comment thread hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Outdated
Comment thread hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Outdated

/**
* Check whether table exists.
* Check whether table exists, including metadata table.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
* Check whether table exists, including metadata table.
* Check whether table or metadata table exists.

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! I just went through for educating myself and left some nits. Let me know if they doesn't make sense!

Comment thread hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Outdated
Comment thread hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java Outdated
Comment thread hive-metastore/src/test/java/org/apache/iceberg/hive/HiveTableTest.java Outdated

/**
* Check whether table exists.
* Check whether table or metadata table exists.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checking whether the metadata table exists a new behavior? Since we're changing the comment for the Catalog interface here, it might affect catalogs that have not specifically overridden this function.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's the existing (undocumented) behaviour

@kevinjqliu kevinjqliu Dec 4, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I didn't know about this behavior.

Changing this doc string in the Catalog interface will also change the doc string for functions that overrides tableExists. For example, the EcsCatalog overrides the tableExists and inherits the doc string in its javadocs https://iceberg.apache.org/javadoc/1.7.0/org/apache/iceberg/dell/ecs/EcsCatalog.html
but it doesn't look like this behavior is supported in the EcsCatalog.

There are a couple more places where this happens https://grep.app/search?q=boolean%20tableExists%28&filter[repo.pattern][0]=apache/iceberg

Perhaps it is better to move this into HiveCatalog or any other catalogs that support this behavior to avoid confusion in the future.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it did cross my mind to be a bit hesitant to change the top level javadoc, I am not sure if metadata tables is a catalog wide concept. Kevin's suggestion makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's otherwise, all other catalog reuse the default implementation where they tried to load table first as a way to determine whether table exists. The ECSCatalog is the only concrete implementation override this method which might not handle the metadata tables. Partly it was because such behaviour was implicit without unit test or documentation, until @szehon-ho mentioned it in #11597 (comment)

My vote is that we can override the javadoc in ECSCatalog instead, what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that we don't want to expose the metadata Table concept here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @danielcweeks , revert as suggested. My main concern is that if we dont document this explicitly, the future override of this method in catalog might not even take metadata table into consideration.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metadata table concept only applied to loadTable right now
https://github.com/search?q=repo%3Aapache%2Ficeberg%20isValidMetadataIdentifier&type=code

so its a side effect of tableExists using the underlying loadTable function

on a meta level, its a bit weird to check if a metadata table exists. if the underlying iceberg table exists, I would assume its metadata table also exists.

that said, I'm in favor of documenting this hidden behavior of tableExists

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think the main concern with putting on top level javadoc is that , there's a lot of Catalog in the wild that may not (and may not want to) support this. As @kevinjqliu points out it seems its a bit of an outlier that it happened here (that we should keep for backward compatbiility) and doc can probably be added in HiveCatalog specifically for this PR.

@github-actions github-actions Bot added the DELL label Dec 4, 2024
Comment thread dell/src/main/java/org/apache/iceberg/dell/ecs/EcsCatalog.java Outdated

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the optimization!

there seem to be implementations of HiveCatalog in other projects, maybe we can raise this optimization there as well
https://grep.app/search?q=HiveCatalog&filter[path.pattern][0]=HiveCatalog.java

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Just some really small nits, otherwise I'm fine with the patch.

Comment thread hive-metastore/src/test/java/org/apache/iceberg/hive/HiveTableTest.java Outdated
@pvary

pvary commented Dec 9, 2024

Copy link
Copy Markdown
Contributor

@danielcweeks: If you don't have more comments here, then I think we could merge this PR

@dramaticlly

Copy link
Copy Markdown
Contributor Author

@rdblue @danielcweeks can you help take anther look to see if we can move forward on this PR?

@danielcweeks

Copy link
Copy Markdown
Contributor

@dramaticlly minor comment, but other than that LGTM.

@danielcweeks danielcweeks merged commit a3dcfd1 into apache:main Dec 12, 2024
@danielcweeks

Copy link
Copy Markdown
Contributor

Thanks @dramaticlly !!

@dramaticlly

Copy link
Copy Markdown
Contributor Author

Thank you @danielcweeks @szehon-ho @pvary @kevinjqliu @gaborkaszab @haizhou-zhao for the review! I will look into similar change for hive view existence check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants