Support for Namespace properties in JDBC Catalog#3275
Conversation
|
CC: @rdblue @jackye1995 |
|
@jackye1995 . thanks for the review. Made the fixes based on the comments. |
| NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + | ||
| NAMESPACE_LOCATION + " VARCHAR(5500)," + | ||
| NAMESPACE_METADATA + " VARCHAR(65535)," + | ||
| NAMESPACE_PROPERTIES + " VARCHAR(65535)," + |
There was a problem hiding this comment.
what's the difference of metadata and properties?
There was a problem hiding this comment.
The createNamespace method had Map<String, String> metadata as a parameter. I kept this assuming there was separation of metadata from properties. Do we wish to have one or the other?
|
Thanks @jackye1995. Addressed the comments except the metadata properties comment. Let me know if we need to have two separate given how the method signature has metadata. |
|
@jackye1995 @rdblue , please have a look when you get a chance. Let me know if I can add/augment anything in the PR. |
| } | ||
|
|
||
| LOG.debug("Creating table {} to store iceberg catalog namespaces", | ||
| JdbcUtil.CATALOG_NAMESPACE_TABLE_NAME); |
There was a problem hiding this comment.
Style: This one is also off.
|
Previously, a namespace existed if there was a table in that namespace. This changes the definition so that a namespace exists if there is a row in the namespaces table. That is an incompatible definition. I think we need to decide what the behavior should be for the JDBC catalog, and then make sure that the implementation is consistent. If you should be able to implicitly create a namespace by creating a table (as is still allowed) then this needs to return that a namespace exists if a table is in the namespace. On the other hand, if that isn't allowed then Next, it looks like this creates a generic table with catalog name, namespace name, and string blobs. Is that the right way to go? Why not use a |
|
Thanks for the comments @rdblue . re: namespace and table re: table schema |
|
Also, I couldn't get clarity on whether we need both namespace |
I don't think we should introduce a breaking change like this. We could, in theory, fix it up when starting up the catalog by populating the table... but that's a bit odd. I think it's better to keep behavior and allow an option to be more strict.
I didn't follow this. Can you explain it a bit more for me?
The trade-off here is that removing the last namespace property would basically drop the namespace. In that case, the logic to check whether tables or properties exist is actually a good thing. I like the idea of the two table being loosely coupled. |
The More explanation:
Maybe there's a cleaner way to do this that I am missing. |
Agreed.
Agreed. I'll make the changes in upcoming commits. |
|
@rdblue added the changes we discussed.
Question:
|
| DatabaseMetaData dbMeta = conn.getMetaData(); | ||
| ResultSet tableExists = dbMeta.getTables(null, null, JdbcUtil.CATALOG_TABLE_NAME, null); | ||
| ResultSet tableExists = dbMeta.getTables(null, null, | ||
| JdbcUtil.CATALOG_TABLE_NAME, null); |
There was a problem hiding this comment.
Did this need to change? Looks like it is a whitespace-only change.
There was a problem hiding this comment.
Missed that. Will fix.
|
@nssalian, looks like there are two tables, |
| protected static final String CATALOG_NAMESPACE_TABLE_NAME = "iceberg_namespaces"; | ||
| protected static final String NAMESPACE_NAME = "namespace_name"; | ||
| protected static final String NAMESPACE_METADATA = "metadata"; | ||
| protected static final String NAMESPACE_PROPERTIES_TABLE_NAME = "namespace_properties"; |
There was a problem hiding this comment.
This should probably be iceberg_namespace_properties to fit with the other table names.
| NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + | ||
| NAMESPACE_PROPERTY_KEY + " VARCHAR(5500)," + | ||
| NAMESPACE_PROPERTY_VALUE + " VARCHAR(5500)," + | ||
| "PRIMARY KEY (" + CATALOG_NAME + ", " + NAMESPACE_NAME + ")" + |
There was a problem hiding this comment.
I think the primary key should be catalog, namespace, and key.
| "(" + | ||
| CATALOG_NAME + " VARCHAR(255) NOT NULL," + | ||
| NAMESPACE_NAME + " VARCHAR(255) NOT NULL," + | ||
| NAMESPACE_METADATA + " VARCHAR(65535)," + |
There was a problem hiding this comment.
I don't think namespace metadata is needed. This should be the same thing as properties, right?
There was a problem hiding this comment.
createNamespace(Namespace namespace, Map<String, String> metadata)
is the reason I added that. What does metadata refer to here? Based on our discussion, I think createNamespace needs to change. In the master branch it is Unsupported.
There was a problem hiding this comment.
The metadata here and the properties that can be set in the other method are the same thing.
There was a problem hiding this comment.
I'll update the PR with setProperties as well in a follow up commit.
There was a problem hiding this comment.
Curious, why are they metadata and properties named differently if they mean the same?
There was a problem hiding this comment.
Sometimes, an argument is used with a different name because it shadows a field in the class, which would break checkstyle.
So if the class has a field properties, then an argument calld properties would make references to properties ambiguous. So our checkstyle doesn't allow it (outside of constructors where it's not ambiguous and this. has to be used to reference the class's value).
This is often why they're different. Also, in some cases it's just whomever wrote the function signature and they were different over time because of authors or just word choice of the person when they were working on that code🤷 . But in my experience, it's usually that the arguments name would shadow a field on the class.
I think we can remove |
|
Trying to catch up with the conversation...
Yes that's actually the very original idea that we thought about in the initial PR. The schema in the current PR was from @miR172 's POC, and I overlooked the fact that it would break existing behavior, sorry for that. So we should switch to use 1 table for namespace properties and 1 table for just table information as you suggested, and we don't need another namespace table because everything can be modeled as key value pairs. |
|
Thanks for the comment @jackye1995 . |
|
Sorry for the long delay on this @nssalian! Now that we have the docker based JDBC catalog notebook env, I'm going to install this in it and run some actual tests (as it's been a while since I've reviewed this). If things work, then I'm ok with this as I'd like to see this supported and we've already addressed a lot of feedback. |
| ResultSet tableExists = dbMeta.getTables(null /* catalog name */, null /* schemaPattern */, | ||
| JdbcUtil.NAMESPACE_PROPERTIES_TABLE_NAME /* tableNamePattern */, null /* types */); |
There was a problem hiding this comment.
Probably using the comments to define the values that are null etc one time is fine.
Maybe making this a private function that takes in conn and then has the comments would enhance readability.
Like
/**
* Check if a table exists in the database via a pattern to match against.
* Searches all schemas & catalogs.
* @return true if a table matching the pattern exists in the database
*/
private static boolean doesTableExist(Connection conn, String tableNamePatternToCheck) {
DatabaseMetaData dbMeta = conn.getMetaData();
ResultSet tablesMatchingPattern = dbMeta.getTables(
null /* catalog name */, null /* schemaPattern */, tableNamePatternToCheck, null /* types */);
return tablesMatchingPattern.hasNext();
}That way, the null fields can be annotated once and then we could just call
connections.run(conn -> {
if (!doesTableExist(conn, JdbcUtil.NAMESPACE_PROPERTIES_TABLE_NAME)) {
// create table
}
return true;
});I'm also not sure if there's a reason to use .next() instead of .hasNext() on the ResultSet. I'm ok with either way, but hasNext seems like the more intuitive API if there's no semantic difference. I don't know enough about JDBC to be sure if there is a difference or not though so it might be safer to just leave it as is.
| sql.setString(rowIndex + 1, key); | ||
| rowIndex += 1; | ||
| } | ||
| LOG.info("Final log string {}", sql); |
There was a problem hiding this comment.
Nit: This log line can be removed - or at least downgraded to debug and changed to something like LOG.debug("Running query to update namespace properties: {}", sql)
| AssertHelpers.assertThrows("Cannot create a namespace with null or empty metadata", IllegalArgumentException.class, | ||
| () -> catalog.createNamespace(testNamespace, null)); |
There was a problem hiding this comment.
Can you make a separate test, testCreateNamespaceFailsWithoutNamespaceProperties()?
That way, there would be something in the code that self-documents that we are intentional about needing namespace properties.
You could then even do something like the following
AssertHelpers.assertThrows(
"JDBC catalog cannot create a namespace without adding namespace properties as the properties are used as an existence check for RDBMs that don't support an atomic compare and swap",
...The long / detailed comment probably isn't needed, but having it as a separate test (with both null and an empty map being tried in the test) would be nice.
| throw new UncheckedInterruptedException(e, "Interrupted in call to insertProperties(namespace, properties) " + | ||
| "Namespace: %s", namespace); |
There was a problem hiding this comment.
Can you update this error message to not refer to code, but human-readable ones?
If somebody said it was fine for here, then leave it. But generally speaking, we prefer to use plain English (vs English with code in it) as end users are unlikely to know the code.
I'd go with something like Interrupted while inserting properties to namespace %s in catalog %s. Insertion likely needs to be retried.
You can drop the insertion likely needs to be retried part if there's not a chance of a partial success (I don't think there is as you're using prepared statements).
PR Description:
Adding the table to hold namespace names and attributes like location, metadata, properties.
Splitting up the implementation in: #3274 into smaller single change PRs.
Changes: