Core: Add Variant implementation to read serialized objects by rdblue · Pull Request #11415 · apache/iceberg

rdblue · 2024-10-28T23:08:34Z

This PR adds an implementation of the Variant encoding spec that can read and construct serialized Variant buffers. This implementation was written using only the spec to validate that the spec is reasonably complete.

The public API interfaces are in core and consist of:

Variant: a wrapper interface of VariantMetadata and VariantValue
VariantMetadata: metadata dictionary for values
VariantValue: a generic interface for values that provides serialization to ByteBuffer
VariantPrimitive: a primitive value
VariantObject: a variant object value with get(String) to retrieve values by name
VariantArray: a variant array value with get(int) to retrieve values by position

The implementation uses ByteBuffer and avoids copying. Values are lazily loaded as they are accessed and are initialized using slices of the parent buffer. Reads do not modify the original buffer. All buffers must use little-endian.
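To illustrate the zero-copy approach described above, here is a minimal sketch of taking a little-endian slice of a parent buffer without changing the parent's position or limit. The class and method names (`SliceReadSketch`, `slice`) are hypothetical and are not taken from this PR; only standard `java.nio` calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not the PR's actual utility class.
class SliceReadSketch {
  // Returns a view of [offset, offset + length) relative to the parent's current position.
  // No bytes are copied and the parent's position and limit are left untouched.
  static ByteBuffer slice(ByteBuffer parent, int offset, int length) {
    ByteBuffer view = parent.duplicate(); // shares content, independent position/limit
    int start = parent.position() + offset;
    view.position(start);
    view.limit(start + length);
    // slice() always returns a big-endian buffer, so the order must be set again
    return view.slice().order(ByteOrder.LITTLE_ENDIAN);
  }
}
```

Values read from the returned slice with absolute getters, such as `getShort(0)`, do not move any position, which is one way to keep reads from modifying the original buffer.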

Most testing is done by constructing variant cases as byte array constants. Many of these values can be used to check other implementations and may be added to the spec. This also includes test methods to create metadata, arrays, and objects for more complex cases such as multi-byte field IDs and offsets.

rdblue · 2024-10-28T23:09:36Z

@aihuaxu I cleaned up my implementation, added tests, and fixed quite a few bugs. Please take a look to help validate that it implements the spec correctly. Thanks!

RussellSpitzer · 2024-10-29T21:49:32Z

My main question is whether this implementation belongs in the Iceberg project or in the Parquet project. I'm a little worried about a proliferation of implementations, especially if we might end up using two different implementations within the same code path (one for Spark and one for core).

I'll do a real review later

rdblue · 2024-12-13T18:25:43Z

The Spark failures are a port conflict, which I think is unrelated to these changes. We'll see the next time CI runs (I'm sure we'll have more changes to trigger it).

aihuaxu · 2024-12-14T17:09:23Z

+  }
+
+  @Override
+  public VariantValue get(String field) {


I see what we are implementing in ShreddedObject: basically we are providing the same interface get(String field) as regular VariantObject.

Given the following example, assume event.location.latitude is shredded while event.location.longitude is not. How do we model the shredded event object, and where do we place the location field? Of course, from the read side, it doesn't matter whether we place location in shredded or unshredded; we check both.

event { event_id; location { latitude; longitude; } }

If latitude is shredded and longitude is not, then location must be a partially shredded object. That object would contain longitude in the value field and would have a typed_value group that contains a latitude group that has a value and typed_value pair.

And because location is a partially shredded object, event is also a partially shredded object that has a shredded location group.
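As a rough illustration of the nesting described above, the sketch below models a partially shredded object as an unshredded field map plus a shredded field map, with reads checking the shredded side first. `PartiallyShreddedSketch` is a hypothetical stand-in that uses plain `Map`s and made-up coordinate values; it is not the `ShreddedObject` class from this PR and it ignores serialization entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the PR's ShreddedObject.
class PartiallyShreddedSketch {
  private final Map<String, Object> unshredded; // fields left in the serialized `value`
  private final Map<String, Object> shredded = new HashMap<>(); // fields from `typed_value`

  PartiallyShreddedSketch(Map<String, Object> unshredded) {
    this.unshredded = unshredded;
  }

  // Places a field on the shredded side.
  void put(String field, Object value) {
    shredded.put(field, value);
  }

  // Reads check the shredded fields first, then fall back to the unshredded ones.
  Object get(String field) {
    if (shredded.containsKey(field)) {
      return shredded.get(field);
    }
    return unshredded.get(field);
  }

  // Builds event { event_id; location { latitude; longitude } } with only latitude shredded.
  // The field values are invented for the example.
  static PartiallyShreddedSketch exampleEvent() {
    Map<String, Object> locationValue = new HashMap<>();
    locationValue.put("longitude", -122.4); // stays unshredded
    PartiallyShreddedSketch location = new PartiallyShreddedSketch(locationValue);
    location.put("latitude", 37.8); // shredded

    Map<String, Object> eventValue = new HashMap<>();
    eventValue.put("event_id", 1L); // stays unshredded
    PartiallyShreddedSketch event = new PartiallyShreddedSketch(eventValue);
    event.put("location", location); // location is itself partially shredded
    return event;
  }
}
```

In this sketch a caller reaches both `latitude` and `longitude` through the same `get` call on `location`, without needing to know which side each field lives on.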

aihuaxu · 2024-12-14T17:13:11Z

+          "b",
+          Variants.of("iceberg"),
+          "c",
+          Variants.of(new BigDecimal("12.21")));


In our ShreddedObject model, the field values can be a VariantObject as well, which could itself be a ShreddedObject, right?

Can we add such coverage?

What cases do you have in mind?

The tests here aren't intended to be exhaustive for the types that could be in the shredded fields. They just demonstrate the behavior between shredded and unshredded fields. For an object within a ShreddedObject, there are two cases. If it is not shredded, then it will be handled by the wrapped SerializedObject. And if it is shredded then it will be added to a ShreddedObject using put to place it in the shredded map. In both cases, we're just relying on the behavior of other classes to hold another SerializedObject or ShreddedObject, so I'm not sure what we would test that isn't already tested by the primitives here.

aihuaxu · 2024-12-16T17:34:46Z

I have a question about PrimitiveWrapper.sizeInBytes() for binary. Otherwise, looks good to me.

danielcweeks · 2024-12-18T21:28:09Z

+  }
+
+  static short readLittleEndianInt16(ByteBuffer buffer, int offset) {
+    return buffer.getShort(buffer.position() + offset);


Do we need to validate endianness on the buffer?

I'm avoiding endianness checks in these methods because it would happen many times for the same buffer. I think it's better to do the checks at a coarse level, like when new SerializedVariant instances are created from a buffer.

If we're expecting a specific endianness, then we should state that in the class-level docs at a minimum. I feel that's necessary here since it's not the default.
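A minimal sketch of the coarse-grained check discussed here: validate the byte order once when a wrapper is constructed, then read without re-checking. `EndianCheckedReader` is a hypothetical class for illustration; the PR's actual validation point and error message may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical reader; all buffers passed in must be little-endian. */
class EndianCheckedReader {
  private final ByteBuffer buffer;

  EndianCheckedReader(ByteBuffer buffer) {
    // Checked once per buffer instead of once per read
    if (buffer.order() != ByteOrder.LITTLE_ENDIAN) {
      throw new IllegalArgumentException("Unsupported byte order: big-endian");
    }
    this.buffer = buffer;
  }

  // Absolute read relative to the buffer's position; does not move the position
  short readInt16(int offset) {
    return buffer.getShort(buffer.position() + offset);
  }
}
```

Because `ByteBuffer.wrap` and `ByteBuffer.allocate` return big-endian buffers by default, callers of a class like this would need to call `order(ByteOrder.LITTLE_ENDIAN)` first.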

aihuaxu

LGTM.

aihuaxu · 2024-12-19T03:29:54Z

+            new byte[] {primitiveHeader(8), 0x04, (byte) 0xD2, 0x02, (byte) 0x96, 0x49});
+
+    assertThat(value.type()).isEqualTo(PhysicalType.DECIMAL4);
+    assertThat(value.get()).isEqualTo(new BigDecimal("123456.7890"));


Based on the spec, DECIMAL4 should hold numbers with precision between 1 and 9. 123456.7890 has precision 10, so it should be stored as DECIMAL8.
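To make the precision concern concrete, this sketch decodes the bytes that follow the header in the test above, assuming the layout the test implies: one scale byte followed by a 4-byte little-endian unscaled value. `Decimal4Sketch` is a hypothetical helper, not code from the PR.

```java
import java.math.BigDecimal;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical decoder for illustration; the layout is assumed from the test bytes.
class Decimal4Sketch {
  // Input excludes the header byte: [scale, unscaled value as 4 little-endian bytes]
  static BigDecimal decode(byte[] scaleAndUnscaled) {
    ByteBuffer buf = ByteBuffer.wrap(scaleAndUnscaled).order(ByteOrder.LITTLE_ENDIAN);
    int scale = buf.get(0) & 0xFF;
    int unscaled = buf.getInt(1);
    return BigDecimal.valueOf(unscaled, scale);
  }
}
```

For the bytes `0x04, 0xD2, 0x02, 0x96, 0x49`, the unscaled value is 0x499602D2 = 1234567890 with scale 4, which gives 123456.7890. The unscaled value fits in a signed 32-bit integer, but `BigDecimal.precision()` reports 10 digits, which exceeds the 9-digit limit quoted above.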

aihuaxu · 2024-12-19T05:44:35Z

+package org.apache.iceberg.variants;
+
+/** An variant array value. */
+public interface VariantArray extends VariantValue {


Seems we should add numElements() in the interface in order to call get(index).

I'm debating whether to do this or to implement more Java interfaces, like List or Map. That's why I haven't exposed them yet. Another option is to simply make these Iterable for consumption.

I think deferring the choice for now is a good option. We can fill in some of these when we get to testing the read path.
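For reference, one of the options mentioned in this thread (exposing a size method and making the array `Iterable`) could look roughly like the sketch below. `ArrayLikeSketch` and `ListBackedArray` are hypothetical names used only to show the shape of that option; the thread deferred the decision, so this is not the interface the PR settled on.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical interface for illustration; not the PR's VariantArray.
interface ArrayLikeSketch<T> extends Iterable<T> {
  int numElements();

  T get(int index);

  // Iteration is derived from numElements() and get(int), so implementations only supply those two
  @Override
  default Iterator<T> iterator() {
    return new Iterator<T>() {
      private int next = 0;

      @Override
      public boolean hasNext() {
        return next < numElements();
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        return get(next++);
      }
    };
  }
}

// Trivial list-backed implementation used to exercise the interface
class ListBackedArray<T> implements ArrayLikeSketch<T> {
  private final List<T> values;

  ListBackedArray(List<T> values) {
    this.values = values;
  }

  @Override
  public int numElements() {
    return values.size();
  }

  @Override
  public T get(int index) {
    return values.get(index);
  }
}
```

With this shape, callers can use either an index loop bounded by `numElements()` or a for-each loop, without the interface committing to the full `List` contract.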

rdblue · 2024-12-20T21:58:38Z

Thanks for reviewing, @aihuaxu and @danielcweeks!

github-actions Bot added API core labels Oct 28, 2024

rdblue requested a review from RussellSpitzer October 28, 2024 23:08

aihuaxu reviewed Oct 30, 2024

Comment thread core/src/main/java/org/apache/iceberg/VariantPrimitive.java Outdated

Comment thread core/src/main/java/org/apache/iceberg/Variants.java Outdated

aihuaxu reviewed Dec 3, 2024

aihuaxu reviewed Dec 7, 2024

Comment thread core/src/main/java/org/apache/iceberg/variants/SerializedObject.java Outdated

aihuaxu reviewed Dec 14, 2024

danielcweeks self-requested a review December 18, 2024 17:26

danielcweeks reviewed Dec 18, 2024

Comment thread api/src/test/java/org/apache/iceberg/util/RandomUtil.java Outdated

danielcweeks reviewed Dec 18, 2024

Comment thread core/src/main/java/org/apache/iceberg/variants/VariantMetadata.java Outdated

danielcweeks reviewed Dec 18, 2024

Comment thread core/src/main/java/org/apache/iceberg/variants/VariantUtil.java

danielcweeks reviewed Dec 18, 2024

aihuaxu approved these changes Dec 19, 2024

aihuaxu reviewed Dec 19, 2024

rdblue added 11 commits December 19, 2024 18:39

Core: Add Variant implementation to read serialized objects.

a7c3ecd

Move into variants module and clean up API interfaces.

6f171b5

Fix checkstyle warning

70f942b

Apply spotless

782aef8

Fix checkstyle

017923c

Fix more checkstyle

18e34cd

Add more tests for large objects, id, and offset sizes.

84b6045

Spotless.

588fb17

Add another test.

3e67722

Fix checkstyle in tests.

9dd7cd0

Fix checkstyle for main.

a3755db

rdblue added 2 commits December 19, 2024 18:39

Fix Java classes in PrimitiveType.

2a906ec

Fix test nits.

7312f19

rdblue force-pushed the variant-add-serialized-impl branch from 68a535f to 7312f19 Compare December 20, 2024 03:13

aihuaxu reviewed Dec 20, 2024

Comment thread core/src/main/java/org/apache/iceberg/variants/SerializedObject.java Outdated

danielcweeks approved these changes Dec 20, 2024

Fix unsorted values.

cfb7304

rdblue force-pushed the variant-add-serialized-impl branch from ac10fb3 to cfb7304 Compare December 20, 2024 18:23

rdblue merged commit dea2fd1 into apache:main Dec 20, 2024

4 participants
