SPEC: Add SQL UDF spec by flyrain · Pull Request #14117 · apache/iceberg

flyrain · 2025-09-19T07:01:12Z

Dev mailing thread: https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:versioned%20UDF
Design docs:

I have updated the metadata structure per the last meeting. Here is the latest structure in a nutshell. Please use the PRs as the source of truth.

stevenzwu · 2025-10-13T22:06:01Z

+| *optional*  | `doc`  | `string` | Parameter documentation. |
+
+Notes:
+1. The `name` and `type` of a `parameter` are immutable. To change them, a new overload must be created. Only the optional documentation field (`doc`) can be updated in-place.


Should name be immutable? Typically a function signature (as in Java) doesn't include parameter names.

The name itself doesn't have to be immutable for callers, as the order of parameters matters more. Names are mainly used by the versioned representations, so they should be consistent across versions. Otherwise, rollback would be problematic. For example, we need to keep the name the same when adding or rolling back versions.

{ "name": "x", "type": "int", "doc": "Input integer" }
...
"overload-version-id": 1,
"deterministic": true,
"representations": [ { "dialect": "trino", "body": "x + 2" } ],
...
"overload-version-id": 2,
"deterministic": true,
"representations": [ { "dialect": "trino", "body": "x + 1" } ],

Not sure I understand how parameter renaming causes a problem for rollback.

> To change them, a new overload must be created.

Is it OK to add an overload with only a parameter name change, while the parameter types and order are the same? How would a client/engine resolve to the correct overload?

> Not sure I understand how parameter renaming causes a problem for rollback.

Take the example I put above. If we rename the parameter to y at some point, then a rollback to v1 or v2 will leave the representation (x + 2 or x + 1) inconsistent with the parameter name y.
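The rollback concern can be made concrete with a small check. The sketch below uses a hypothetical helper, `undeclared_names`, which is not part of the spec. It flags identifiers in a representation body that do not match the declared parameter names, under the simplifying assumption that every identifier in the body is a parameter reference (a real engine would need a SQL parser to decide that).

```python
import re


def undeclared_names(body, param_names):
    """Return identifiers used in `body` that are not declared parameters.

    Assumption: every identifier in the body is a parameter reference.
    A real check would need a SQL parser to skip keywords and function names.
    """
    identifiers = set(re.findall(r"[A-Za-z_]\w*", body))
    return identifiers - set(param_names)


# Two versions written against parameter `x`, as in the example above.
versions = {1: "x + 2", 2: "x + 1"}

# With the original name, both versions are consistent.
consistent = all(not undeclared_names(b, ["x"]) for b in versions.values())

# After a hypothetical rename to `y`, rolling back to either version
# leaves a body that references the undeclared name `x`.
broken = {vid: undeclared_names(b, ["y"]) for vid, b in versions.items()}
```

Here `consistent` holds for the original name, while `broken` reports `x` as undeclared for both versions after the rename.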

> Is it OK to add an overload with only a parameter name change, while the parameter types and order are the same? How would a client/engine resolve to the correct overload?

It shouldn't be allowed, as the signatures are the same.

Got it. Basically, parameter renaming is not allowed; hence, we require that the name and type be immutable.

It is unclear to me what "immutable" means. Does it mean that you can't change these without updating the overload-id? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:

* Function definitions are identified by the tuple of types and there can be only one definition for a given tuple
* All parameter names must match the definition in all versions and representations
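As an illustration of the first point, here is a minimal sketch of a registry keyed by the tuple of parameter types. The class and method names (`DefinitionRegistry`, `add`) are hypothetical and only show the uniqueness rule; they are not taken from the spec or from any Iceberg API.

```python
class DuplicateDefinitionError(ValueError):
    """Raised when a definition with the same parameter types already exists."""


class DefinitionRegistry:
    def __init__(self):
        # Maps a tuple of parameter types to the stored definition.
        self._by_types = {}

    def add(self, params, return_type):
        # `params` is a list of (name, type) pairs; only the types form the key,
        # so a change of parameter name alone does not create a new definition.
        key = tuple(param_type for _, param_type in params)
        if key in self._by_types:
            raise DuplicateDefinitionError(f"definition for {key} already exists")
        self._by_types[key] = {"params": list(params), "return_type": return_type}
        return key


registry = DefinitionRegistry()
registry.add([("x", "int")], "int")    # accepted: first (int) definition
registry.add([("x", "long")], "long")  # accepted: different type tuple
try:
    registry.add([("y", "int")], "int")  # rejected: same types, only the name differs
    rejected = False
except DuplicateDefinitionError:
    rejected = True
```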

After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition:

interface Definition {
  int do_something(String foo);
}

class Impl implements Definition {
  // Interface methods are implicitly public, so the implementation must be public too.
  public int do_something(String bar) {
    return bar.length();
  }
}

That sounds like a good idea. To avoid duplication, since most representations may not need different names, we might still allow a SQL representation to use the default parameters, so that only renaming triggers copying the parameters into individual representations.

Added an optional parameter list in the representation, and also clarified that the tuple of types identifies a definition.

Per the last discussion (https://lists.apache.org/thread/t30hfxydwd8qkfzk9mtscc2xpg3kf621), we keep parameters only at the definition layer.

flyrain · 2025-10-16T00:44:28Z

Thanks @stevenzwu for the review. Resolved all comments. Please take another look!

rdblue · 2026-01-21T22:37:48Z

+| *required*  | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. |
+
+## Function Call Convention and Resolution in Engines
+Resolution rule is decided by engines, but engines SHOULD:


I think it would be more clear to say this:

Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:

1. Prefer exact parameter matches over safe (widening) or unsafe casts
2. Safely widen types as needed to avoid failing to find a matching definition
3. Require explicit casts for unsafe or non-obvious conversions
4. Use definitions with the same number of arguments as the input
5. Pass positional arguments in the same position as the input
6. Use definitions with the same set of field names as named input arguments

As for the last point of specifically not mixing positional and named arguments, I think that points 5 and 6 cover it. Don't reorder positional arguments and match the whole set of names. Also, implementers may ignore the "don't mix positional and named matching" but clearly stating how to match positional and named at least gives us some insurance that behavior won't be wacky if people do it anyway.
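To make the guidance concrete, the following is a minimal sketch of positional resolution that prefers exact matches, falls back to safe widening, and refuses unsafe conversions. The widening table and the function names (`SAFE_WIDENING`, `cast_cost`, `resolve`) are assumptions for illustration; the spec delegates the actual casting rules to each engine, and named-argument matching is left out here.

```python
# Hypothetical safe-widening pairs; each engine applies its own casting rules.
SAFE_WIDENING = {("int", "long"), ("float", "double")}


def cast_cost(arg_type, param_type):
    """Return 0 for an exact match, 1 for a safe widening, None otherwise."""
    if arg_type == param_type:
        return 0
    if (arg_type, param_type) in SAFE_WIDENING:
        return 1
    return None  # unsafe conversions need an explicit cast by the caller


def resolve(definitions, arg_types):
    """Pick the definition (a tuple of parameter types) needing the fewest widenings.

    Arguments are matched by position and only definitions with the same
    number of parameters are considered. Ties keep the first candidate.
    """
    best, best_cost = None, None
    for param_types in definitions:
        if len(param_types) != len(arg_types):
            continue
        costs = [cast_cost(a, p) for a, p in zip(arg_types, param_types)]
        if any(c is None for c in costs):
            continue
        total = sum(costs)
        if best_cost is None or total < best_cost:
            best, best_cost = param_types, total
    return best


definitions = [("long",), ("int",), ("int", "int")]
exact = resolve(definitions, ("int",))        # exact match wins over widening
widened = resolve([("long",)], ("int",))      # int is safely widened to long
no_match = resolve(definitions, ("string",))  # no implicit unsafe cast
```

With an explicit cast in the call, such as `foo(cast(string_col as int))`, the argument type seen by `resolve` is already `int`, so the first rule applies directly.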

Fixed per suggestion

wgtmac · 2026-01-22T02:42:32Z

+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.


How is the UDF metadata file referenced by table or view metadata? Does it need to be updated together with the swap? If only function-uuid is referenced, then this is not an issue.

The UDF name will be the identifier, just like a table name or a view name. I think it's fine to go with that convention. For example, users' SQL refers to a table by its identifier (ns1.t1) instead of its UUID, and we can apply similar logic to UDFs.
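One way to picture the atomic swap keyed by identifier is a compare-and-swap on the current metadata location. The sketch below is an in-memory stand-in, not a real catalog API: `InMemoryCatalog`, `commit`, and the metadata file names are all hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when the current metadata is not the one the writer started from."""


class InMemoryCatalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}  # UDF identifier -> current metadata file location

    def current(self, identifier):
        return self._current.get(identifier)

    def commit(self, identifier, expected_location, new_location):
        # Swap in the new metadata file only if nobody else committed first.
        with self._lock:
            if self._current.get(identifier) != expected_location:
                raise CommitConflict(identifier)
            self._current[identifier] = new_location


catalog = InMemoryCatalog()
catalog.commit("ns1.add_one", None, "v1.metadata.json")  # create
catalog.commit("ns1.add_one", "v1.metadata.json", "v2.metadata.json")  # update
try:
    # A writer that still believes v1 is current loses the race.
    catalog.commit("ns1.add_one", "v1.metadata.json", "v3.metadata.json")
    conflict = False
except CommitConflict:
    conflict = True
```

Callers that reference the function by name always see whichever metadata file is current after the swap.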

wgtmac · 2026-01-22T03:24:45Z

+### Parameter
+| Requirement | Field  | Type     | Description                                                  |
+|-------------|--------|----------|--------------------------------------------------------------|
+| *required*  | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). |


Do we allow nullable parameters? I just saw the expected behavior if any input is null. Do we need finer-grained control?

We do allow nullable parameters. `on-null-input` is a hint that lets engines decide whether to optimize when one of the parameters is null. Please check the section "Null Input Handling" in this doc for more details: https://docs.google.com/document/d/1GC896Z4gxYP0Vz-ENqZ3tZZBqXEUQf4qDJO11NRo8F4/edit?tab=t.0
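A rough sketch of how an engine might act on the hint, assuming the two values `"return-null"` and `"call"` from the spec table. The `invoke` helper is hypothetical, and Python's `None` stands in for SQL NULL.

```python
def invoke(fn, args, on_null_input="call"):
    """Call `fn` with `args`, honoring the on-null-input hint.

    "return-null": skip the call and return None if any argument is None.
    "call" (the default): always evaluate the function body.
    """
    if on_null_input not in ("call", "return-null"):
        raise ValueError(f"unknown on-null-input value: {on_null_input}")
    if on_null_input == "return-null" and any(a is None for a in args):
        return None
    return fn(*args)


def add_one(x):
    return x + 1


def zero_if_null(x):
    return 0 if x is None else x


skipped = invoke(add_one, [None], "return-null")  # body is never evaluated
called = invoke(zero_if_null, [None])             # default "call" runs the body
normal = invoke(add_one, [1], "return-null")      # no null input, so it runs
```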

Resolve comments

flyrain · 2026-01-24T20:24:58Z

Thank you all for the review. The PR is ready for another look.

stevenzwu

LGTM

flyrain · 2026-01-26T18:41:33Z

Fixed the spec related to secure and Types per today's community sync. Please take another look.

rdblue · 2026-02-03T22:58:24Z

+| *required*  | `representations` | `list<representation>`                                   | [Dialect-specific implementations](#representation).           |
+| *optional*  | `deterministic`   | `boolean` (default `false`)                              | Whether the function is deterministic.                         |
+| *optional*  | `on-null-input`   | `string` (`"return-null"` or `"call"`, default `"call"`) | Defines how the UDF behaves when any input parameter is NULL.  |
+| *required*  | `timestamp-ms`    | `long` (unix epoch millis)                               | Creation timestamp of this version.                            |


Since we are discussing this with @stevenzwu for tables, do we want to state that this is monotonically increasing or just go with it as-is?

Discussed with @stevenzwu a bit, some points to share:

Do we support time-travel for UDF versions? We may never.

Do we support rollback to a historical version? We might. Do we support time-based rollback, like rolling back to the UDF version from 3 days ago? Maybe not. It feels really weird to me that a user would want to roll back to a version from 3 days ago without checking the exact version they want to roll back to. Version-ID-based rollback should just work.

Most engines (Spark, Trino, Snowflake, BigQuery, Postgres) don't even support UDF versioning.

With that, I think it's fine to not enforce monotonically increasing.

The version ID and timestamp are updated per overload. Since a monotonic timestamp would only apply at per-overload scope, it is not meaningful for rollback (even if we wanted it). It makes sense to increment the version ID at per-overload scope. We wouldn't want to increment the version ID at per-overload scope and the timestamp at global scope.

Hence, I am in favor of not adding the monotonic requirement to the timestamp value here.
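The point that version-ID-based rollback does not depend on timestamp ordering can be sketched as follows. The classes and methods (`Version`, `Definition`, `add_version`, `rollback`) are hypothetical illustrations of the idea, not the spec's metadata layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    version_id: int
    timestamp_ms: int
    body: str


@dataclass
class Definition:
    versions: List[Version] = field(default_factory=list)
    current_version_id: Optional[int] = None

    def add_version(self, body, timestamp_ms):
        # Version IDs increase within this definition only.
        next_id = max((v.version_id for v in self.versions), default=0) + 1
        self.versions.append(Version(next_id, timestamp_ms, body))
        self.current_version_id = next_id
        return next_id

    def rollback(self, version_id):
        # Rollback selects by ID and never consults timestamps.
        if all(v.version_id != version_id for v in self.versions):
            raise KeyError(version_id)
        self.current_version_id = version_id

    def current(self):
        return next(v for v in self.versions if v.version_id == self.current_version_id)


definition = Definition()
definition.add_version("x + 2", timestamp_ms=2000)
definition.add_version("x + 1", timestamp_ms=1000)  # clock skew: older timestamp
definition.rollback(1)
rolled_back_body = definition.current().body
```

Even though the second version carries an earlier timestamp than the first, rolling back to version 1 is unambiguous.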

rdblue · 2026-02-03T23:00:20Z

+Notes:
+1. Variadic (vararg) parameters are not supported. Each definition must declare a fixed number of parameters.
+
+#### Types


Should we state that all types are considered nullable/optional?

Added it as point 3 in the section above.

rdblue · 2026-02-03T23:07:06Z

+## Function Call Convention and Resolution in Engines
+Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:
+
+1. Prefer exact parameter matches over safe (widening) or unsafe casts.


Minor: Point 3 says engines should require explicit casts for unsafe conversions, which might make this less clear because an explicit cast changes the input being matched. For example, when matching foo(cast(string_col as int)), the cast makes int an exact match. I think this guidance still applies when the engine doesn't strictly follow point 3, though. I'm on the fence about whether to only mention "safe (widening) casts" or to use the current version.

I might be missing some of the subtle conflicts between points 1 and 3. Overall, I slightly prefer the current version, since the three points are largely consistent with each other, and points 2 and 3 are symmetric in structure and intent.

rdblue

I just commented on the vote thread. There are some minor things to clarify and/or fix (and probably more once we work with this more) but I think this is reasonable and well thought through. Awesome work, everyone! Thanks for all the effort and discussion on this!

singhpk234

LGTM as well !

singhpk234 · 2026-02-05T02:05:00Z

+|-------------|------------|----------|------------------------------------------------------|
+| *required*  | `type`     | `string` | Must be `"sql"`                                      |
+| *required*  | `dialect`  | `string` | SQL dialect identifier (e.g., `"spark"`, `"trino"`). |
+| *required*  | `sql`      | `string` | SQL expression text.                                 |


Minor: do we want to say it's the client's responsibility to make sure each dialect produces the same output before storing?

Discussed offline with @singhpk234. We chose to follow the view spec and intentionally avoid specifying this further. There is a fine balance between providing enough clarity in the specification and over-specifying behavior that is better left to implementations.

Add SQL UDF spec

d118b64

github-actions Bot added the Specification label (issues that may introduce spec changes) Sep 19, 2025

RussellSpitzer reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

talatuyarer reviewed Sep 22, 2025

sfc-gh-ygu added 2 commits September 23, 2025 11:41

Resolve commemnts

33ad408

Resolve comments

6990aea

stevenzwu reviewed Sep 23, 2025

Resolve comments

35cda2a

danielcweeks reviewed Oct 1, 2025

danielcweeks reviewed Oct 1, 2025

danielcweeks reviewed Oct 1, 2025

danielcweeks reviewed Oct 6, 2025

talatuyarer reviewed Oct 6, 2025

sfc-gh-ygu added 5 commits October 6, 2025 15:26

Resolve comments

8a75909

Resolve comments

7432d52

Add field types

e779099

Resolve comments

786f82b

Resolve comments

96a5880

stevenzwu reviewed Oct 13, 2025

sfc-gh-ygu added 2 commits October 15, 2025 17:32

Resolve comments

071e97f

Resolve comments

78223c1

flyrain requested a review from rdblue October 16, 2025 00:44

rdblue reviewed Oct 20, 2025

rdblue reviewed Oct 20, 2025

rdblue reviewed Jan 21, 2026

rdblue reviewed Jan 21, 2026

rdblue reviewed Jan 21, 2026

wgtmac reviewed Jan 22, 2026

flyrain added 5 commits January 22, 2026 15:42

Resolve comments.

678621e

Resolve comments

Resolve comments.

e6e3bcb

SQL expression and call conventions

9b02a15

Use Iceberg Type Json representation

eb97175

Secure udf fix

af0e694

stevenzwu approved these changes Jan 25, 2026

Secure and types fixes

c1c0a96

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue reviewed Feb 3, 2026

rdblue approved these changes Feb 3, 2026

Resolve comments

c2a3c2e

singhpk234 approved these changes Feb 5, 2026

Resolve comments

b7efe21

flyrain merged commit b35624e into apache:main Feb 5, 2026
4 checks passed

talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026

Spec: Introduce SQL UDF specification (apache#14117)

5c62091
