SPEC: Add SQL UDF spec#14117
Conversation
| | *optional* | `doc` | `string` | Parameter documentation. | | ||
|
|
||
| Notes: | ||
| 1. The `name` and `type` of a `parameter` are immutable. To change them, a new overload must be created. Only the optional documentation field (`doc`) can be updated in-place. |
There was a problem hiding this comment.
should name be immutable? typically function signature (like Java) doesn't include parameter name
There was a problem hiding this comment.
The name itself doesn't have to be immutable for callers, as the order of parameter matters more. Names are mainly used by the versioned representations, they should be consistent across multiple versions. Otherwise, the rollback would be problematic. For example, we need to keep the name the same when add/rollback versions.
{ "name": "x", "type": "int", "doc": "Input integer" }
...
"overload-version-id": 1,
"deterministic": true,
"representations": [
{ "dialect": "trino", "body": "x + 2" }
],
...
"overload-version-id": 2,
"deterministic": true,
"representations": [
{ "dialect": "trino", "body": "x + 1" }
],
There was a problem hiding this comment.
not sure if I understand how the parameter renaming cause problem for rollback.
To change them, a new overload must be created.
Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?
There was a problem hiding this comment.
not sure if I understand how the parameter renaming cause problem for rollback.
Taking the example I put above. If we rename it to y at some point, then rollback to v1 or v2 will cause inconsistency between representation x+1 and parameter name y.
Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?
It shouldn't be allowed, as the signatures are the same.
There was a problem hiding this comment.
got it. basically, parameter renaming is not allowed. hence, we require the name and type are immutable.
There was a problem hiding this comment.
It is unclear to me what "immutable" means. Does it mean that you can't change these without updating the overload-id? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:
- Function definitions are identified by the tuple of types and there can be only one definition for a given tuple
- All parameter names must match the definition in all versions and representations
There was a problem hiding this comment.
After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition:
interface Definition {
int do_something(String foo);
}
class Impl implements Definition {
int do_something(String bar) {
return bar.length();
}
}There was a problem hiding this comment.
That sounds a good idea. To avoid duplication as most of representations may not need different names, we might still allow SQL representation to use the default parameters. So that only renaming triggers the copying of parameters to individual representations.
There was a problem hiding this comment.
Added an optional parameter list in the representation, also clarified that the tuple of types identify a definitioin.
There was a problem hiding this comment.
Per last discussion(https://lists.apache.org/thread/t30hfxydwd8qkfzk9mtscc2xpg3kf621), we keep parameters only at the definition layer.
|
Thanks @stevenzwu for the review. Resolved all comments. Please take another look! |
| | *required* | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. | | ||
|
|
||
| ## Function Call Convention and Resolution in Engines | ||
| Resolution rule is decided by engines, but engines SHOULD: |
There was a problem hiding this comment.
I think it would be more clear to say this:
Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:
- Prefer exact parameter matches over safe (widening) or unsafe casts
- Safely widen types as needed to avoid failing to find a matching definition
- Require explicit casts for unsafe or non-obvious conversions
- Use definitions with the same number of arguments as the input
- Pass positional arguments in the same position as the input
- Use definitions with the same set of field names as named input arguments
As for the last point of specifically not mixing positional and named arguments, I think that points 5 and 6 cover it. Don't reorder positional arguments and match the whole set of names. Also, implementers may ignore the "don't mix positional and named matching" but clearly stating how to match positional and named at least gives us some insurance that behavior won't be wacky if people do it anyway.
There was a problem hiding this comment.
Fixed per suggestion
| **self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security, | ||
| properties, and engine-specific representations. | ||
|
|
||
| * Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata. |
There was a problem hiding this comment.
How is the UDF metadata file referenced by table or view metadata? Does it need to be updated together with the swap? If only function-uuid is referenced, then this is not an issue.
There was a problem hiding this comment.
The udf name will be the identifier, just like table name, and view name. I think it's fine to go with that convention. For example, if users' sql refers a table by its identifier (ns1.t1), instead of its uuid. We may apply the similar logic there for udf.
| ### Parameter | ||
| | Requirement | Field | Type | Description | | ||
| |-------------|--------|----------|--------------------------------------------------------------| | ||
| | *required* | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). | |
There was a problem hiding this comment.
Do we allow nullable parameter? I just saw the expected behavior if any input is null. Do we need finer-grained control?
There was a problem hiding this comment.
We do allow nullable parameter. The on-null-input is a hint that engines can decide whether to optimize when one of parameters is null. Please check the section "Null Input Handling" in this doc for more details, https://docs.google.com/document/d/1GC896Z4gxYP0Vz-ENqZ3tZZBqXEUQf4qDJO11NRo8F4/edit?tab=t.0
|
Thank you all for the review. The PR is ready for another look. |
|
Fixed the spec related to |
| | *required* | `representations` | `list<representation>` | [Dialect-specific implementations](#representation). | | ||
| | *optional* | `deterministic` | `boolean` (default `false`) | Whether the function is deterministic. | | ||
| | *optional* | `on-null-input` | `string` (`"return-null"` or `"call"`, default `"call"`) | Defines how the UDF behaves when any input parameter is NULL. | | ||
| | *required* | `timestamp-ms` | `long` (unix epoch millis) | Creation timestamp of this version. | |
There was a problem hiding this comment.
Since we are discussing this the @stevenzwu for tables, do we want to state that this is monotonically increasing or just go with it as-is?
There was a problem hiding this comment.
Discussed with @stevenzwu a bit, some points to share:
- Do we support time-travel for UDF versions? We may never.
- Do we support rollback to a historical version? We might. Do we support time-based rollback, like rollback to a UDF version 3 days ago? Maybe not. It feels really weird to me, that a user just want to rollback to a version 3 days ago without checking the exact version he/she want to rollback. Version-id based rollback should just work.
- Most engines(Spark, Trino, Snowflake, Bigquery, Postgres) don't even support UDF versioning.
With that, I think it's fine to not enforce monotonically increasing.
There was a problem hiding this comment.
The version id and timestamp are update per overload. Since the monotonic timestamp at per overload scope, it is not meaningful for rollback (even if we want to). It makes sense to increment the version id at per overload scope. We won't want to increment version id at per overload scope and increment timestamp at global scope.
Hence, I am favor of not adding the monotonic requirement to the timestamp value here.
| Notes: | ||
| 1. Variadic (vararg) parameters are not supported. Each definition must declare a fixed number of parameters. | ||
|
|
||
| #### Types |
There was a problem hiding this comment.
Should we state that all types are considered nullable/optional?
There was a problem hiding this comment.
added it as the point 3 in the above section.
| ## Function Call Convention and Resolution in Engines | ||
| Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should: | ||
|
|
||
| 1. Prefer exact parameter matches over safe (widening) or unsafe casts. |
There was a problem hiding this comment.
Minor: Point 3 says engines should require explicit casts for unsafe conversions, which might make this less clear because an explicit cast changes the input being matched. For example, when matching foo(cast(string_col as int)), the cast makes int an exact match. I think this guidance still applies when the engine doesn't strictly follow point 3, though. I'm on the fence about whether to only mention "safe (widening) casts" or to use the current version.
There was a problem hiding this comment.
I might be missing some of the subtle conflicts between points 1 and 3. Overall, I slightly prefer the current version, since the three points are largely consistent with each other, and points 2 and 3 are symmetric in structure and intent.
rdblue
left a comment
There was a problem hiding this comment.
I just commented on the vote thread. There are some minor things to clarify and/or fix (and probably more once we work with this more) but I think this is reasonable and well thought through. Awesome work, everyone! Thanks for all the effort and discussion on this!
| |-------------|------------|----------|------------------------------------------------------| | ||
| | *required* | `type` | `string` | Must be `"sql"` | | ||
| | *required* | `dialect` | `string` | SQL dialect identifier (e.g., `"spark"`, `"trino"`). | | ||
| | *required* | `sql` | `string` | SQL expression text. | |
There was a problem hiding this comment.
minor : do we wanna say its client responsibility to make sure the each dialect produces same output before storing ?
There was a problem hiding this comment.
Discussed offline with @singhpk234. We chose to follow the view spec and intentionally avoid specifying this further. There is a fine balance between providing enough clarity in the specification and over specifying behavior that is better left to implementations.
Dev mailing thread: https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:versioned%20UDF
Design docs:
I have updated the metadata structure per last meeting. Here is the latest structure in a nutshell. Please use the PRs as the source of truth.
