Skip to content

Avro: Add variant readers and writers#12457

Merged
rdblue merged 7 commits into
apache:mainfrom
rdblue:avro-support-variant
Mar 17, 2025
Merged

Avro: Add variant readers and writers#12457
rdblue merged 7 commits into
apache:mainfrom
rdblue:avro-support-variant

Conversation

@rdblue

@rdblue rdblue commented Mar 5, 2025

Copy link
Copy Markdown
Contributor

This adds readers and writers for Variants in Avro. Avro generics, Iceberg generics, and the internal data model are supported.

Variant values are tested using RandomVariants to extend random data generation.

@rdblue rdblue force-pushed the avro-support-variant branch from 9aad1b4 to b18a880 Compare March 5, 2025 23:14
Comment thread core/src/main/java/org/apache/iceberg/data/avro/DataWriter.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/avro/ValueReaders.java Outdated
case ARRAY:
// for now, generate an object instead of an array
case OBJECT:
ShreddedObject object = Variants.object(metadata);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we only have fully shredded objects here. Should we add partial shredded objects?

@rdblue rdblue Mar 6, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This produces a variant in memory, not a serialized variant. The writer is free to serialize however it chooses, although this is currently not intended to exercise shredding paths that need records to have a more stable structure in order to make sense.

@aihuaxu aihuaxu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Override
public void write(T datum, Encoder encoder) throws IOException {
if (datum instanceof Serialized) {
encoder.writeBytes(((Serialized) datum).buffer());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is why we're making Serialized public? I was looking to see if there were some other way, but this looks as good as anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's correct. It seems reasonable to me to avoid the copy.


@Override
public Variant fromRecord(IndexedRecord record, Schema schema, LogicalType type) {
int metadataPos = schema.getField("metadata").pos();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since we're using the field names in multiple places, it would be nice to pull them into a static member and not use a literal (e.g. VARIENT_METADATA_FIELD and VARIANT_VALUE_FIELD)

public class RandomVariants {
private RandomVariants() {}

public static Variant randomVariant(Random random) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little bit of docs around the expected structure and what this produces would be nice for future testing. It takes a bit of visual inspection to understand the shape of what is returned by this method.

At some point it might be nice to enhance this so you can provide a size so that it will keep generating values until you reach a certain size. More importantly, it might be nice to take an Iceberg schema since that will be important for shredding.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think when we add fuzzing tests for shredding, we'll want to pass in an Iceberg schema like you said.

@danielcweeks danielcweeks left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments/suggestions, but +1

@rdblue rdblue merged commit 3cc4b04 into apache:main Mar 17, 2025
@rdblue

rdblue commented Mar 17, 2025

Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @danielcweeks and @aihuaxu!

I'll follow up on some of Dan's comments in later PRs. I'd like to get this one in so I can open PRs for other readers that will use the RandomVariants code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants