Hive: Optimize tableExists API in hive catalog#11597
Conversation
Skip creation of hive table operation when check existence of iceberg table in hive catalog
|
FYI @szehon-ho and @haizhou-zhao if you are interested |
haizhou-zhao
left a comment
There was a problem hiding this comment.
Thx @dramaticlly , nice work
|
Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed. How do we handle corrupt metadata in other catalog implementations? |
Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted. Personally I think we can focus on former only and here's my thought process There are roughly 3 places where
|
|
Thanks for the check @dramaticlly! Thanks, |
|
Do we want to do the same opimization for the |
|
Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass. |
|
|
||
| /** | ||
| * Check whether table exists. | ||
| * Check whether table exists, including metadata table. |
There was a problem hiding this comment.
nit:
| * Check whether table exists, including metadata table. | |
| * Check whether table or metadata table exists. |
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for the changes! I just went through for educating myself and left some nits. Let me know if they doesn't make sense!
|
|
||
| /** | ||
| * Check whether table exists. | ||
| * Check whether table or metadata table exists. |
There was a problem hiding this comment.
Is checking whether the metadata table exists a new behavior? Since we're changing the comment for the Catalog interface here, it might affect catalogs that have not specifically overridden this function.
There was a problem hiding this comment.
I believe it's the existing (undocumented) behaviour
There was a problem hiding this comment.
Thanks! I didn't know about this behavior.
Changing this doc string in the Catalog interface will also change the doc string for functions that overrides tableExists. For example, the EcsCatalog overrides the tableExists and inherits the doc string in its javadocs https://iceberg.apache.org/javadoc/1.7.0/org/apache/iceberg/dell/ecs/EcsCatalog.html
but it doesn't look like this behavior is supported in the EcsCatalog.
There are a couple more places where this happens https://grep.app/search?q=boolean%20tableExists%28&filter[repo.pattern][0]=apache/iceberg
Perhaps it is better to move this into HiveCatalog or any other catalogs that support this behavior to avoid confusion in the future.
There was a problem hiding this comment.
Yea it did cross my mind to be a bit hesitant to change the top level javadoc, I am not sure if metadata tables is a catalog wide concept. Kevin's suggestion makes sense.
There was a problem hiding this comment.
I think it's otherwise, all other catalog reuse the default implementation where they tried to load table first as a way to determine whether table exists. The ECSCatalog is the only concrete implementation override this method which might not handle the metadata tables. Partly it was because such behaviour was implicit without unit test or documentation, until @szehon-ho mentioned it in #11597 (comment)
My vote is that we can override the javadoc in ECSCatalog instead, what do you think?
There was a problem hiding this comment.
I would agree that we don't want to expose the metadata Table concept here.
There was a problem hiding this comment.
thank you @danielcweeks , revert as suggested. My main concern is that if we dont document this explicitly, the future override of this method in catalog might not even take metadata table into consideration.
There was a problem hiding this comment.
the metadata table concept only applied to loadTable right now
https://github.com/search?q=repo%3Aapache%2Ficeberg%20isValidMetadataIdentifier&type=code
so its a side effect of tableExists using the underlying loadTable function
on a meta level, its a bit weird to check if a metadata table exists. if the underlying iceberg table exists, I would assume its metadata table also exists.
that said, I'm in favor of documenting this hidden behavior of tableExists
There was a problem hiding this comment.
Yea I think the main concern with putting on top level javadoc is that , there's a lot of Catalog in the wild that may not (and may not want to) support this. As @kevinjqliu points out it seems its a bit of an outlier that it happened here (that we should keep for backward compatbiility) and doc can probably be added in HiveCatalog specifically for this PR.
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM! Thanks for the optimization!
there seem to be implementations of HiveCatalog in other projects, maybe we can raise this optimization there as well
https://grep.app/search?q=HiveCatalog&filter[path.pattern][0]=HiveCatalog.java
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for the changes! Just some really small nits, otherwise I'm fine with the patch.
|
@danielcweeks: If you don't have more comments here, then I think we could merge this PR |
|
@rdblue @danielcweeks can you help take anther look to see if we can move forward on this PR? |
|
@dramaticlly minor comment, but other than that LGTM. |
|
Thanks @dramaticlly !! |
|
Thank you @danielcweeks @szehon-ho @pvary @kevinjqliu @gaborkaszab @haizhou-zhao for the review! I will look into similar change for hive view existence check |
Skip creation of hive table operation when check existence of iceberg table in hive catalog.
Today the table existence rely on load the table first and return true if table can be loaded
iceberg/api/src/main/java/org/apache/iceberg/catalog/Catalog.java
Lines 279 to 286 in 3badfe0
I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.
This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.
Why this is safe?
HiveOperationsBase.validateTableIsIcebergis also used in catalog listTables API to differentiate the iceberg table from hive table