API, Spark: Support StringLiteral to Fixed and StringLiteral to Binary Conversions by singhpk234 · Pull Request #14882 · apache/iceberg

singhpk234 · 2025-12-18T11:59:13Z

About the change

Currently, a serialized expression does not capture the literal's type, so even a binary value is deserialized as a string, which later fails. Parsers need to know the schema in order to deserialize values correctly. The SDK handles part of this during response deserialization via the parser context. The client can set that context because it is the one making the call, but the same cannot be assumed for the server, which has to perform the equivalent deserialization on the request.

There are three ways to solve this problem:

1. Deserialize only once the appropriate inputs are available. Until then, keep the serialized form in memory and deserialize on demand. For example, an expression requires a schema, so the payload object would store the JSON, expose filter(schema), and pass that schema through to ExpressionParser during deserialization.
2. Capture the literal's type during serialization (this changes the serialized form, so it is a spec change) and use that information during deserialization.
3. Send the schema as part of the request and wire it into the parser on the server side (also a spec change).

Considering where we are, I implemented Approach #1.
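To make the first approach concrete, here is a minimal sketch of deferred deserialization. It is not the PR's code: `LazyFilterSketch`, its `TypeId` enum, and the column-to-type map standing in for a schema are all hypothetical, and it assumes binary values arrive as hex text.

```java
import java.nio.ByteBuffer;
import java.util.Map;

// Hypothetical stand-in for a request that keeps its filter literal serialized.
final class LazyFilterSketch {
  enum TypeId { STRING, BINARY }

  private final String column;
  private final String rawLiteral; // kept exactly as it arrived on the wire

  LazyFilterSketch(String column, String rawLiteral) {
    this.column = column;
    this.rawLiteral = rawLiteral;
  }

  // Decodes the literal only once the caller supplies a schema (column -> type).
  Object literal(Map<String, TypeId> schema) {
    TypeId type = schema.get(column);
    if (type == null) {
      throw new IllegalArgumentException("Unknown column: " + column);
    }
    if (type == TypeId.STRING) {
      return rawLiteral;
    }
    // BINARY: assumed here to be serialized as hex text
    byte[] bytes = new byte[rawLiteral.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(rawLiteral.substring(2 * i, 2 * i + 2), 16);
    }
    return ByteBuffer.wrap(bytes);
  }
}
```

The same raw text "CAFE" yields a String for a string column and a two-byte buffer for a binary column, which is why the decision has to wait until the schema is known.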

Testing

New test and existing tests

singhpk234 · 2025-12-18T12:00:07Z

    Expression filter = null;
    if (jsonNode.has(RESIDUAL_FILTER)) {
-      filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL_FILTER));
+      filter = ExpressionParser.fromJson(jsonNode.get(RESIDUAL_FILTER), spec.schema());


This was caught during Spark's execution phase; the schema needs to be passed when parsing the residual filter.

nastra · 2025-12-19T13:38:45Z

    }
-    if (request.filter() != null) {
-      configuredScan = configuredScan.filter(request.filter());
+    Expression filter = request.filter(schema);


nit: this change is probably not needed

I removed the .filter() check because we are marking that method as deprecated. With this change we first try to deserialize (bailing out early if the result is null) and only feed the filter to the scan when it is non-null. Please let me know what you think, considering the above.

nastra · 2025-12-19T13:42:04Z

+   * @param schema the table schema to use for type-aware deserialization of filter values
+   * @return the filter expression, or null if no filter was specified
+   */
+  public Expression filter(Schema schema) {


let me think about this a bit more. I also think we have a few more cases across the codebase where we also ser/de Expression without a Schema and theoretically we would have the same issue in those places as well.
Whatever approach we pick, we'd want to follow up in those other places too

the other thing we might need to consider is how we would be lazily binding this in other client implementations. @Fokko does pyiceberg have examples of how it does a late-binding similar to this one?
The issue that we have here is that we deserialize an Expression where we can only correctly do so when we bind it to a Schema

amogh-jahagirdar · 2026-02-10T15:54:22Z

   */
  private static <T extends Scan<T, FileScanTask, ?>> T configureScan(
-      T scan, PlanTableScanRequest request) {
+      T scan, PlanTableScanRequest request, Schema schema) {


The schema to use would always be the configured schema for the scan no? In that case I think we can get rid of the 3rd argument and just do request.filter(configuredScan.schema()) below

You mean use scan.schema()? Using the configured scan's schema after select could cause an issue if the filter is on a column that is not being projected. I think we can do that, let me refactor.
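The concern in the reply above can be sketched as follows. This is only an illustration under the assumption that a scan's schema after a select contains just the projected columns; `ProjectionBindingSketch` and its map-based schema are hypothetical and do not use Iceberg's classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: a schema is a map from column name to type name.
final class ProjectionBindingSketch {
  // Returns a copy of the schema restricted to the selected columns.
  static Map<String, String> select(Map<String, String> schema, List<String> columns) {
    Map<String, String> projected = new LinkedHashMap<>();
    for (String column : columns) {
      if (schema.containsKey(column)) {
        projected.put(column, schema.get(column));
      }
    }
    return projected;
  }

  // Resolves the type of the filter column, failing if the schema lacks it.
  static String bindFilterColumn(Map<String, String> schema, String filterColumn) {
    String type = schema.get(filterColumn);
    if (type == null) {
      throw new IllegalArgumentException("Cannot find field '" + filterColumn + "'");
    }
    return type;
  }
}
```

With a full schema of `id` and `payload`, a filter on `payload` resolves against the full schema but fails against a projection that selects only `id`, which is the argument for reading the schema before the projection is applied.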

amogh-jahagirdar · 2026-02-10T21:39:38Z

+   * @deprecated since 1.11.0, will be removed in 1.12.0; use {@link #filter(Schema)} instead for
+   *     proper type-aware deserialization
+   */
+  @Deprecated


I'm not sure about deprecating this. In the long run we do expect the expression filter to be self-describing with the grammar that's being proposed by @rdblue ; in particular we'd expect a data type at that point in the literal and for ExpressionParser.fromJson(json) to just work for that case.

So in the long run, I'd actually expect filter(Schema) to be deprecated, we just need it for now due to limitations in the protocol.

I think we should probably just keep both (with the limitation that filter() won't work for the 3 specific data types) and then drop the filter(Schema) if/when that proposal materializes.

Unless there are other cases we can envision where we'd want to pass a schema here?

amogh-jahagirdar · 2026-02-10T21:43:21Z

+     * @deprecated since 1.11.0, will be removed in 1.12.0; this method serializes the expression to
+     *     JSON immediately, which may lose type information for BINARY, FIXED, and DECIMAL types
+     */
+    @Deprecated


Same rationale as above, I'm not sure we should deprecate this because I think in the long run we'd expect such an API to exist. We can avoid API/deprecation churn by just keeping both until any grammar/protocol changes are made.

singhpk234 · 2026-02-11T03:48:07Z

Update: I synced with Amogh offline on this. Thanks a ton for brainstorming with me and for suggesting this alternative in the first place. We bind anyway later during planning; all that was missing was a conversion case for these types, which I added.
No spec change and no messy handling are required for this particular problem. I also added an additional test for the fixed type.
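A rough sketch of what such a conversion case might look like, written without Guava so it is self-contained. `StringLiteralSketch`, its `TypeId` enum, and the `fixedLength` parameter are hypothetical; the real change lives in Iceberg's Literals class and uses BaseEncoding.base16(). The null-on-failure behavior follows the test cases described in this PR.

```java
import java.nio.ByteBuffer;
import java.util.Locale;

// Hypothetical stand-in for a string literal that can convert at bind time.
final class StringLiteralSketch {
  enum TypeId { STRING, BINARY, FIXED }

  private final String value;

  StringLiteralSketch(String value) {
    this.value = value;
  }

  // Converts to the target type; returns null when conversion is not possible.
  // fixedLength is only consulted for FIXED.
  Object to(TypeId type, int fixedLength) {
    switch (type) {
      case STRING:
        return value;
      case BINARY:
        byte[] bin = decodeHex(value);
        return bin == null ? null : ByteBuffer.wrap(bin);
      case FIXED:
        byte[] fixed = decodeHex(value);
        return fixed == null || fixed.length != fixedLength ? null : ByteBuffer.wrap(fixed);
      default:
        return null;
    }
  }

  // Decodes base16 text in either letter case; returns null on malformed input.
  static byte[] decodeHex(String hex) {
    String upper = hex.toUpperCase(Locale.ROOT);
    if (upper.length() % 2 != 0) {
      return null;
    }
    byte[] out = new byte[upper.length() / 2];
    for (int i = 0; i < out.length; i++) {
      int hi = Character.digit(upper.charAt(2 * i), 16);
      int lo = Character.digit(upper.charAt(2 * i + 1), 16);
      if (hi < 0 || lo < 0) {
        return null;
      }
      out[i] = (byte) ((hi << 4) | lo);
    }
    return out;
  }
}
```

Because the conversion happens when the string literal is bound to a column of a known type, the wire format and the parser can stay unchanged, which matches the "no spec change" point above.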

nastra · 2026-02-11T07:23:49Z

          BigDecimal decimal = new BigDecimal(value().toString());
          return (Literal<T>) new DecimalLiteral(decimal);

+        case FIXED:


this is great. I think it would be good to have a unit test in the api module for this change

nastra · 2026-02-11T07:24:33Z

-
-  @TestTemplate
-  @Disabled(
-      "binary filter that is used by Spark is not working because ExpressionParser.fromJSON doesn't have the Schema to properly parse the filter expression")


you might want to remove this for the other Spark versions as well in this PR, since the change is very small

nastra

LGTM and thanks for fixing this @singhpk234. It would be good to have a unit test in the api module for the change in Literals, so that we don't rely on integration tests

@disabled

…rk versions
- Add unit tests in api module for string-to-binary and string-to-fixed conversions
- Test valid hex string conversions (uppercase and lowercase)
- Test invalid hex strings return null
- Test fixed type with wrong length returns null
- Remove @disabled testBinaryInFilter from Spark v3.4, v3.5, and v4.0
- The fix in Literals.java now properly handles binary/fixed types
- Tests should work across all Spark versions

amogh-jahagirdar

Thanks @singhpk234! Overall looks good to me.

amogh-jahagirdar · 2026-02-12T15:34:40Z

+          try {
+            ByteBuffer buffer =
+                ByteBuffer.wrap(
+                    BaseEncoding.base16().decode(value().toString().toUpperCase(Locale.ROOT)));


minor: May just have a constant for BaseEncoding.base16()

singhpk234 requested review from amogh-jahagirdar and nastra December 18, 2025 11:59

github-actions Bot added spark core labels Dec 18, 2025

singhpk234 commented Dec 18, 2025

Comment thread core/src/main/java/org/apache/iceberg/rest/CatalogHandlers.java Outdated

singhpk234 commented Dec 18, 2025

Comment thread core/src/main/java/org/apache/iceberg/SingleValueParser.java Outdated

sfc-gh-prsingh force-pushed the binary-type-fix branch 4 times, most recently from 7d7dcaa to 51d4aab Compare December 19, 2025 00:01

singhpk234 marked this pull request as ready for review December 19, 2025 00:01

sfc-gh-prsingh force-pushed the binary-type-fix branch from 51d4aab to f071179 Compare December 19, 2025 02:56

nastra reviewed Dec 19, 2025

Comment thread core/src/main/java/org/apache/iceberg/SingleValueParser.java Outdated

singhpk234 requested a review from Fokko December 22, 2025 17:29

sfc-gh-prsingh force-pushed the binary-type-fix branch from f071179 to 35ae73a Compare December 22, 2025 17:36

singhpk234 added this to the Iceberg 1.11.0 milestone Jan 11, 2026

amogh-jahagirdar reviewed Feb 10, 2026

amogh-jahagirdar changed the title ~~Spark: Handle complex type in expression in RemoteScanPlanning~~ Spark: Handle binary, fixed, and decimal types in expression in RemoteScanPlanning Feb 10, 2026

singhpk234 changed the title ~~Spark: Handle binary, fixed, and decimal types in expression in RemoteScanPlanning~~ Spark: Handle binary, fixed types in expression in RemoteScanPlanning Feb 11, 2026

sfc-gh-prsingh force-pushed the binary-type-fix branch from f6c0ac7 to 7b6465f Compare February 11, 2026 03:44

github-actions Bot added the API label Feb 11, 2026

Address amoghs feedback

f336aca

sfc-gh-prsingh force-pushed the binary-type-fix branch from 7b6465f to f336aca Compare February 11, 2026 03:52

spotlessApply

8302bac

nastra reviewed Feb 11, 2026

nastra approved these changes Feb 11, 2026

nastra changed the title ~~Spark: Handle binary, fixed types in expression in RemoteScanPlanning~~ API, Spark: Handle binary, fixed types in expression in RemoteScanPlanning Feb 11, 2026

sfc-gh-prsingh force-pushed the binary-type-fix branch from ecfda34 to 1f1eb11 Compare February 11, 2026 18:33

sfc-gh-prsingh force-pushed the binary-type-fix branch from 1f1eb11 to efd6006 Compare February 11, 2026 19:16

nastra approved these changes Feb 12, 2026

amogh-jahagirdar approved these changes Feb 12, 2026

Address review feedbacks from Amogh

c87392a

sfc-gh-prsingh force-pushed the binary-type-fix branch from 4af9fc7 to c87392a Compare February 12, 2026 17:25

amogh-jahagirdar changed the title ~~API, Spark: Handle binary, fixed types in expression in RemoteScanPlanning~~ API, Spark: Support StringLiteral to Fixed and StringLiteral to Binary Conversions Feb 12, 2026

amogh-jahagirdar merged commit ed39cee into apache:main Feb 12, 2026
33 checks passed

talatuyarer pushed a commit to talatuyarer/iceberg that referenced this pull request Apr 1, 2026

API, Spark: Support StringLiteral to Fixed and StringLiteral to Binar…

2a87d07

…y Conversions (apache#14882)

4 participants
