Further refactor Parquet readers for v2 support by eric-maynard · Pull Request #13290 · apache/iceberg

eric-maynard · 2025-06-10T17:30:35Z

In issues like #7162 and #11371, it's reported that newer Parquet encodings like DELTA_BINARY_PACKED don't work with the current Parquet readers. #11661 recently refactored the Parquet readers to improve code reuse, but there are a few more changes needed to prepare us for Parquet v2 support.

This refactor introduces a new interface VectorizedValuesReader and changes readers like TimestampMillisReader to work with this new type. After this change, new implementations of VectorizedValuesReader can be added to support encodings like DELTA_BINARY_PACKED.
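To illustrate the idea, here is a minimal sketch of how a batched reader interface lets new encodings plug in as separate implementations. Every name in it (`SketchReaders`, `SketchEncoding`, `SketchBatchReader`, and the two reader classes) is hypothetical and not taken from the PR, and the delta scheme is a toy (first value followed by successive differences), not the real DELTA_BINARY_PACKED layout.

```java
import java.util.Arrays;

/** Picks a reader per encoding. All types in this file are hypothetical stand-ins. */
final class SketchReaders {
  static SketchBatchReader forEncoding(SketchEncoding encoding, int[] page) {
    switch (encoding) {
      case PLAIN:
        return new SketchPlainReader(page);
      case DELTA:
        return new SketchDeltaReader(page);
      default:
        throw new UnsupportedOperationException("Unsupported encoding: " + encoding);
    }
  }

  /** Decodes a whole page in one batched call. */
  static int[] decode(SketchEncoding encoding, int[] page) {
    int[] out = new int[page.length];
    forEncoding(encoding, page).readIntegers(page.length, out, 0);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(decode(SketchEncoding.DELTA, new int[] {10, 1, 2, 3})));
  }
}

/** Hypothetical encodings for this sketch; not Parquet's Encoding enum. */
enum SketchEncoding {
  PLAIN,
  DELTA
}

/** Hypothetical, trimmed-down stand-in for a batched values reader. */
interface SketchBatchReader {
  /** Decodes {@code total} ints into {@code out}, starting at {@code offset}. */
  void readIntegers(int total, int[] out, int offset);
}

/** Values are stored as-is, so a batch read is a plain copy. */
final class SketchPlainReader implements SketchBatchReader {
  private final int[] page;
  private int position = 0;

  SketchPlainReader(int[] page) {
    this.page = page;
  }

  @Override
  public void readIntegers(int total, int[] out, int offset) {
    System.arraycopy(page, position, out, offset, total);
    position += total;
  }
}

/** Toy delta scheme: the first entry is the value itself, the rest are differences. */
final class SketchDeltaReader implements SketchBatchReader {
  private final int[] page;
  private int position = 0;
  private int previous = 0; // starts at 0 so the first entry decodes to itself

  SketchDeltaReader(int[] page) {
    this.page = page;
  }

  @Override
  public void readIntegers(int total, int[] out, int offset) {
    for (int i = 0; i < total; i++) {
      previous += page[position++];
      out[offset + i] = previous;
    }
  }
}
```

In this sketch, callers depend only on the interface, so adding another encoding means adding one class and one `case`.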

This PR is a revival of @wgtmac's #9772. Based on our conversation, he will not be able to continue working on it. Thanks for the great work, @wgtmac.

RussellSpitzer · 2025-06-17T21:45:58Z

  @Override
  protected void initDataReader(Encoding dataEncoding, ByteBufferInputStream in, int valueCount) {
-    ValuesReader previousReader = plainValuesReader;
+    ValuesReader previousReader = (ValuesReader) valuesReader;


Explicit cast?

Yeah, currently we need the cast because of a bit of a Java-based pickle. Happy to change this if you have any ideas on how to improve it.

VectorizedValuesReader is currently not bound to ValuesReader. We can't just make VectorizedValuesReader extend ValuesReader because the former is an interface and the latter is a class.

We can't just make VectorizedValuesReader an ABC because then classes like VectorizedPlainValuesReader can't extend both it and PlainValuesReader.

Since ValuesReader is an ABC in Parquet but VectorizedValuesReader is an interface, we can't easily mark valuesReader as being both types.

We could create a generic method and bind the type to <T extends ValuesReader & VectorizedValuesReader>, but that feels clunky.

I see a couple of ways out of this but they mostly involve copying a lot of code and I thought this cast wasn't too high of a price to pay. It does mean that we need to more or less keep VectorizedValuesReader in sync with ValuesReader if it evolves.

(This cast is inherited from #9772)
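For readers following along, the two options discussed above can be sketched as below. This is only an illustration under stated assumptions: `SketchValuesReader`, `SketchVectorizedReader`, and `SketchPlainValuesReader` are hypothetical stand-ins for Parquet's abstract ValuesReader class, the VectorizedValuesReader interface, and a concrete reader, and their methods are simplified.

```java
/** Contrasts the explicit cast with an intersection-type bound. All types here are stand-ins. */
final class CastSketch {
  // Option 1: hold the reader as the interface type and cast when the
  // abstract class's API is needed. The cast is only checked at runtime.
  static int viaCast(SketchVectorizedReader reader) {
    SketchValuesReader asClass = (SketchValuesReader) reader;
    return asClass.readInteger();
  }

  // Option 2: a generic method whose type parameter is bound to both types.
  // The compiler then allows calls from either API without a cast.
  static <T extends SketchValuesReader & SketchVectorizedReader> int viaIntersection(T reader) {
    int[] batch = new int[2];
    reader.readIntegers(2, batch, 0); // interface method
    return batch[1] + reader.readInteger(); // abstract-class method
  }

  public static void main(String[] args) {
    System.out.println(viaCast(new SketchPlainValuesReader()));
    System.out.println(viaIntersection(new SketchPlainValuesReader()));
  }
}

/** Stand-in for an abstract class that value-at-a-time readers extend. */
abstract class SketchValuesReader {
  abstract int readInteger();
}

/** Stand-in for the batched interface. */
interface SketchVectorizedReader {
  void readIntegers(int total, int[] out, int offset);
}

/** A reader that is both: it extends the class and implements the interface. */
final class SketchPlainValuesReader extends SketchValuesReader
    implements SketchVectorizedReader {
  private int next = 0;

  @Override
  int readInteger() {
    return next++; // emits 0, 1, 2, ... purely for demonstration
  }

  @Override
  public void readIntegers(int total, int[] out, int offset) {
    for (int i = 0; i < total; i++) {
      out[offset + i] = readInteger();
    }
  }
}
```

The trade-off in this sketch is that the cast keeps signatures simple but fails only at runtime if an implementation does not extend the class, while the intersection bound moves that check to compile time at the cost of generic signatures on every method that touches the reader.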

RussellSpitzer · 2025-06-17T22:08:45Z

+import org.apache.parquet.io.api.Binary;
+
+/** Interface for value decoding that supports vectorized (aka batched) decoding. */
+interface VectorizedValuesReader {


In the corresponding Spark class there are some other methods available. Do we need those as well? Are they part of a follow-up, or unneeded?

https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java#L31-L72

A Spark expert can correct me, but I think the read methods with a WritableColumnVector parameter are for copying the data into a Spark WritableColumnVector. In Iceberg we use Arrow vectors instead of WritableColumnVectors. The skip methods are for Parquet page skipping. For the v2 read support, they are not needed. I'd like to resurrect my work on Parquet page skipping, so I may look into leveraging those skip methods myself, but for now, we don't need them.

Yeah that's exactly correct. I played with the idea of implementing something like WritableColumnVector within Iceberg (but for building something like a VectorizedColumnIterator), but I think we won't need all of that code after all.

Thanks @wypoon and @eric-maynard, that all sounds good to me

RussellSpitzer · 2025-06-17T22:09:43Z

I have some small questions about the roadmap for where we go from here, but this makes sense to me as a first step. As long as we are more or less copying the Spark approach, I think we are probably safe here. @huaxingao Could you do a quick check as well?

RussellSpitzer · 2025-06-18T16:31:21Z

Some weird rebase happened here, git history looks scary now :)

huaxingao · 2025-06-19T00:51:43Z

@eric-maynard Thanks for the PR! The approach looks good to me and seems like a reasonable first step.

eric-maynard · 2025-06-23T17:15:36Z

Thanks @huaxingao! I've added Javadocs

About the scary diff, @RussellSpitzer: it should be fixed, but unfortunately I can't remove the tags which got auto-added when the diff was artificially massive.

RussellSpitzer · 2025-06-24T17:26:27Z

@huaxingao and @wypoon do y'all have any other comments on this PR?

huaxingao

LGTM

wypoon

LGTM too.

RussellSpitzer · 2025-06-25T19:20:46Z

Merged. Thanks @eric-maynard for the PR, and @huaxingao and @wypoon for reviewing.

RussellSpitzer · 2025-06-25T19:21:13Z

Also thanks @wgtmac for starting this!

nastra · 2025-06-26T06:24:15Z

+
+class VectorizedPlainValuesReader extends ValuesAsBytesReader implements VectorizedValuesReader {
+
+  public static final int INT_SIZE = 4;


do these need to be public? They don't seem to be used anywhere outside of this class

They’re used in the follow-up PR (though currently they could be protected), but they could have been private in this PR. Good catch.

rebase

a69ec52

github-actions Bot added the arrow label Jun 10, 2025

lint

0bba5ef

RussellSpitzer reviewed Jun 17, 2025

Comment thread ...w/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedPlainValuesReader.java Outdated

RussellSpitzer reviewed Jun 17, 2025

github-actions Bot added spark core flink docs AWS Specification GCP OPENAPI labels Jun 18, 2025

eric-maynard force-pushed the parquet-v2-refactor branch from a63ee5d to 9ecc2be Compare June 18, 2025 16:49

some changes per comments

9ecc2be

github-actions Bot removed AWS GCP OPENAPI labels Jun 18, 2025

Merge branch 'main' of ssh://github.com-oss/apache/iceberg into parqu…

3cd2819

…et-v2-refactor

huaxingao reviewed Jun 19, 2025

Comment thread arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedValuesReader.java Outdated

javadoc

8d186fe

eric-maynard requested a review from huaxingao June 23, 2025 17:15

eric-maynard added 2 commits June 23, 2025 10:18

lint

5ce8913

remove clash

6cecf96

eric-maynard requested a review from RussellSpitzer June 24, 2025 16:46

RussellSpitzer approved these changes Jun 24, 2025

huaxingao approved these changes Jun 24, 2025

wypoon approved these changes Jun 25, 2025

RussellSpitzer merged commit 4213a50 into apache:main Jun 25, 2025
39 checks passed

nastra reviewed Jun 26, 2025

jbewing mentioned this pull request Dec 8, 2025

Spark, Arrow, Parquet: Add vectorized read support for parquet v2 encodings #14800

Open

jbewing mentioned this pull request Feb 19, 2026

Spark, Arrow, Parquet: Add vectorized parquet read support for DELTA_LENGTH_BYTE_ARRAY & DELTA_BYTE_ARRAY encodings #15362

Merged

