Skip to content

Build: Bump Parquet-Java to 1.16.0#13941

Merged
stevenzwu merged 17 commits into
apache:mainfrom
Fokko:fd-test-parquet-1-16
Sep 3, 2025
Merged

Build: Bump Parquet-Java to 1.16.0#13941
stevenzwu merged 17 commits into
apache:mainfrom
Fokko:fd-test-parquet-1-16

Conversation

@Fokko

@Fokko Fokko commented Aug 28, 2025

Copy link
Copy Markdown
Contributor

@Fokko

Fokko commented Sep 2, 2025

Copy link
Copy Markdown
Contributor Author

Looked into teh failed test, and it looks like it splits a single partition into two files:

image

Turns out there is a small difference in the size due to:

➜  parquet-java git:(d5f86d7c) ✗ git bisect bad                                                                              
d5f86d7c0e9894510e8af6dfd37444843e6d1bc4 is the first bad commit
commit d5f86d7c0e9894510e8af6dfd37444843e6d1bc4
Author: Gang Wu <[ustcwg@gmail.com](mailto:ustcwg@gmail.com)>
Date:   Tue Jan 21 16:18:19 2025 +0800

    GH-3133: Fix SizeStatistics to handle omitted histogram (#3134)

 .../apache/parquet/column/statistics/SizeStatistics.java |  6 ++++--
 .../parquet/column/statistics/TestSizeStatistics.java    | 16 ++++++++++++++++
 .../format/converter/ParquetMetadataConverter.java       | 10 ++++++++--

@Fokko Fokko force-pushed the fd-test-parquet-1-16 branch from 66bd095 to 6309f60 Compare September 2, 2025 09:15
Comment thread build.gradle Outdated
@Fokko Fokko marked this pull request as ready for review September 2, 2025 16:01
List<FileScanTask> files =
StreamSupport.stream(table.newScan().planFiles().spliterator(), false)
.collect(Collectors.toList());
assertThat(files.size()).as("Did not have the expected number of files").isEqualTo(numExpected);

@stevenzwu stevenzwu Sep 2, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I thought below is the preferred assertion style with better error msg

assertThat(files).as("Did not have the expected number of files").hasSize(numExpected);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to easily set a breakpoint and inspect files. I left it like that since it might be helpful in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I will leave it up to you.

changing the assertion doesn't prevent the capability of setting a breakpoint and inspecting files, since the files are collected before the assertion, as new lines 2171-2173 stay as they are in this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    List<String> list = Arrays.asList("a", "b", "c");
    assertThat(list).hasSize(4);

Above code will fail with the following helpful error msg

Expected size: 4 but was: 3 in:
["a", "b", "c"]

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/TypeToMessageType.java Outdated
@kevinjqliu

kevinjqliu commented Sep 2, 2025

Copy link
Copy Markdown
Contributor

Could not find org.apache.parquet:parquet-avro:1.16.0.
Searched in the following locations:
- https://repo.maven.apache.org/maven2/org/apache/parquet/parquet-avro/1.16.0/parquet-avro-1.16.0.pom
- file:/home/runner/.m2/repository/org/apache/parquet/parquet-avro/1.16.0/parquet-avro-1.16.0.pom

1.16.0 not here yet https://repo.maven.apache.org/maven2/org/apache/parquet/parquet-avro/

1.16.0RC2 already has 3 binding votes, just not officially released yet
https://lists.apache.org/thread/rb0gorvx1lysch6yxks72h94kqhsp719

@github-actions github-actions Bot added the API label Sep 2, 2025
@Fokko Fokko changed the title Parquet: Test out the Parquet-Java 1.16.0 release Build: Bump Parquet-Java to 1.16.0 Sep 2, 2025
// added to Parquet
// Preconditions.checkArgument(
// sType instanceof VariantType, "Invalid variant: %s is not a VariantType", sType);
} else if (sType instanceof VariantType

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to allow both?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we don't need both, but if there is any data that has been written without the annotation, then we can fallback to the Iceberg schema. The important part is that we set the annotations in TypeToMessageType.java.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if there is any data that has been written without the annotation, then we can fallback to the Iceberg schema.

Do we want to support this scenario?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally not :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is for us supporting any Variants written by Spark in 4.0, otherwise we couldn't possibly import them right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't think of that use-case. Spark 4 was released before the annotation, so I think you're right there 👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in the Iceberg Schema / Spark Type first and if the annotation is missing, I think we should still just read and be happy :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko can we also add some comment to explain why we are checking both? other non-primitive types (like list and map) only checks the annotation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we need to handle both cases (missing annotation but sType is variant and the logical type is variant) to support the existing data.

Also should we switch to fallback the old way as

LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).equals(annotation) || sType instanceof VariantType.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and added the comments 👍

@Fokko Fokko force-pushed the fd-test-parquet-1-16 branch from f960d54 to 7a2bf7d Compare September 2, 2025 19:18

@stevenzwu stevenzwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. let' see how CI goes once Parquet binaries are released.

@kevinjqliu

Copy link
Copy Markdown
Contributor

@kevinjqliu

Copy link
Copy Markdown
Contributor

push an empty commit to trigger ci, it takes a while :)

// added to Parquet
// Preconditions.checkArgument(
// sType instanceof VariantType, "Invalid variant: %s is not a VariantType", sType);
} else if (sType instanceof VariantType

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we need to handle both cases (missing annotation but sType is variant and the logical type is variant) to support the existing data.

Also should we switch to fallback the old way as

LogicalTypeAnnotation.variantType(Variant.VARIANT_SPEC_VERSION).equals(annotation) || sType instanceof VariantType.

Comment thread api/src/main/java/org/apache/iceberg/variants/Variant.java

@kevinjqliu kevinjqliu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stevenzwu stevenzwu merged commit 12ab7fc into apache:main Sep 3, 2025
43 checks passed
@stevenzwu

Copy link
Copy Markdown
Contributor

thanks @Fokko for the contribution, and @aihuaxu @RussellSpitzer @nastra @kevinjqliu for the reviews

@RussellSpitzer

Copy link
Copy Markdown
Member

push an empty commit to trigger ci, it takes a while :)

Note that you can "re-run" failed tests from the github workloads ui too (if you are on the PMC)

@Fokko

Fokko commented Sep 3, 2025

Copy link
Copy Markdown
Contributor Author

(if you are on the PMC)

I think it should also be available for committers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants