Arrow: Fix vectorized reads for Parquet TIMESTAMP_MILLIS types by shubham19may · Pull Request #14499 · apache/iceberg

shubham19may · 2025-11-04T09:41:46Z

Description

This PR fixes: Reading Parquet files with TIMESTAMP_MILLIS

Error:

java.lang.ClassCastException: class org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector cannot be cast to class org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector (org.apache.iceberg.shaded.org.apache.arrow.vector.TimeStampMicroTZVector and org.apache.iceberg.shaded.org.apache.arrow.vector.BigIntVector are in unnamed module of loader 'app')

Cause: When reading Parquet files with the TIMESTAMP_MILLIS logical type annotation, VectorizedArrowReader incorrectly allocated a TimeStampMicroTZVector based on the Iceberg schema (which expects microsecond precision), but the actual reader (TimestampMillisReader) writes raw long values that require a BigIntVector.

Fix:

Explicitly create a Field with ArrowType.Int(Long.SIZE, true) type
Allocate a BigIntVector that matches what TimestampMillisReader expects and return ReadType.TIMESTAMP_MILLIS to trigger millisecond-to-microsecond conversion
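
The millisecond-to-microsecond conversion mentioned above can be illustrated in plain Java. This is only a minimal sketch, not the Iceberg reader code: the class and method names (MillisToMicros, toMicros, convertInPlace) are hypothetical, and a long[] stands in for the raw long values held in the BigIntVector.

```java
// Hypothetical stand-in for the conversion step; not part of Iceberg or Arrow.
final class MillisToMicros {
  private static final long MICROS_PER_MILLI = 1_000L;

  private MillisToMicros() {}

  /** Converts epoch milliseconds to epoch microseconds, throwing on overflow. */
  static long toMicros(long epochMillis) {
    return Math.multiplyExact(epochMillis, MICROS_PER_MILLI);
  }

  /** Converts a batch of raw millisecond values in place. */
  static void convertInPlace(long[] rawValues) {
    for (int i = 0; i < rawValues.length; i++) {
      rawValues[i] = toMicros(rawValues[i]);
    }
  }
}
```

Math.multiplyExact is used here so that an out-of-range value fails loudly instead of silently wrapping; whether the real reader guards against overflow is not stated in this PR.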

Testing

Creating a Parquet file with spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS
Adding the Parquet file to the Iceberg table as a data file by its path (via an AddFile() API)
Querying with spark.sql.iceberg.vectorization.enabled=true
Confirming successful vectorized reads with correct timestamp values

pvary · 2025-11-04T15:14:37Z

+    if (System.getProperty(ALLOCATION_MANAGER_TYPE_PROPERTY) == null) {
+      System.setProperty(ALLOCATION_MANAGER_TYPE_PROPERTY, "Netty");
+    }


Why is this change?

Well, the issue is that Arrow's auto-detection of the allocation manager type breaks when classes are shaded under org.apache.iceberg.shaded.org.apache.arrow.*, because CheckAllocator.check() inspects JAR paths via ProtectionDomain.getCodeSource().getLocation() and doesn't recognize shaded package structures. Setting this property bypasses the broken auto-detection, which is essential since our BigIntVector allocation for TIMESTAMP_MILLIS requires a working RootAllocator in shaded Spark runtime environments.

Could we add a brief comment explaining why we set arrow.memory.allocation.manager.type to Netty, and consider logging at debug when we default it to Netty so it’s visible at runtime?
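
A minimal sketch of what the commented, logged default could look like, using only the JDK. The class name ArrowAllocatorDefaults and the method applyDefaultIfUnset are hypothetical, and java.util.logging at FINE level stands in for whatever debug logger the module actually uses; only the property name and the "Netty" value come from the discussion above.

```java
// Hypothetical helper; illustrates the "default only if unset, and log it" pattern.
final class ArrowAllocatorDefaults {
  private static final java.util.logging.Logger LOG =
      java.util.logging.Logger.getLogger(ArrowAllocatorDefaults.class.getName());

  // Property name as quoted in the review comment.
  static final String ALLOCATION_MANAGER_TYPE_PROPERTY = "arrow.memory.allocation.manager.type";

  private ArrowAllocatorDefaults() {}

  /**
   * Defaults the allocation manager type to Netty only when the user has not set one,
   * because auto-detection is reported to fail for shaded Arrow classes.
   *
   * @return true if the default was applied, false if an existing value was kept
   */
  static boolean applyDefaultIfUnset() {
    if (System.getProperty(ALLOCATION_MANAGER_TYPE_PROPERTY) != null) {
      return false;
    }
    System.setProperty(ALLOCATION_MANAGER_TYPE_PROPERTY, "Netty");
    // Debug-level message so the implicit default is visible at runtime.
    LOG.fine("Defaulted " + ALLOCATION_MANAGER_TYPE_PROPERTY + " to Netty");
    return true;
  }
}
```

Returning a boolean is a choice made for this sketch so that callers and tests can tell whether an explicit user setting was respected.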

nandorKollar · 2025-11-04T15:18:55Z

Is there any way to cover this with tests? One option I can think of is to add a test case to Spark where we write the file with spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS and try to read it back via the Arrow batch reader. Another idea that came to mind is to restructure the code to handle the special case where the Iceberg type is timestamp but the Parquet type is TIMESTAMP_MILLIS; then we can write a simple test case for that in ArrowSchemaUtil. Not sure if this is a good idea, just a thought.

huaxingao · 2025-11-06T04:07:47Z

+
+        org.apache.arrow.vector.FieldVector eventTimeVector = root.getVector("event_time");
+        assertThat(eventTimeVector).isNotNull();
+        assertThat(eventTimeVector).isInstanceOf(org.apache.arrow.vector.BigIntVector.class);


can we also assert the value?

done.

Also, in VectorizedParquetDefinitionLevelReader.java, the Arrow validity buffer was not being set properly for non-null values (the bit should be 1 when the value is not null). Fixed that.
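
For readers unfamiliar with the validity buffer: Arrow keeps one bit per value, with 1 meaning non-null, and bits are numbered from the least significant bit within each byte. The sketch below mimics that layout on a plain byte[]; the class ValidityBitmap and its methods are hypothetical stand-ins, not Arrow's BitVectorHelper.

```java
// Hypothetical stand-in for an Arrow validity buffer backed by a byte array.
final class ValidityBitmap {
  private ValidityBitmap() {}

  /** Marks the value at the given index as non-null by setting its bit to 1. */
  static void setValid(byte[] validity, int index) {
    // index >> 3 selects the byte, index & 7 selects the bit within it (LSB first).
    validity[index >> 3] |= (byte) (1 << (index & 7));
  }

  /** Returns true when the value at the given index is marked non-null. */
  static boolean isValid(byte[] validity, int index) {
    return (validity[index >> 3] & (1 << (index & 7))) != 0;
  }
}
```

If the bit is never set for a non-null value, consumers that check validity treat the slot as null even though the data buffer holds a value, which matches the kind of bug described above.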

pvary · 2025-11-06T11:04:13Z

+    try (ParquetWriter<InternalRow> writer =
+        new NativeSparkWriterBuilder(outputFile)
+            .set("org.apache.spark.sql.parquet.row.attributes", sparkSchema.json())
+            .set("spark.sql.parquet.writeLegacyFormat", "false")
+            .set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
+            .set("spark.sql.parquet.fieldId.write.enabled", "true")
+            .build()) {
+      for (InternalRow row : rows) {
+        writer.write(row);
+      }
+    }


Do we have a native parquet writer in the base too?
Maybe we can create a test without spark

Well, the issue is, iceberg’s native parquet writers explicitly reject TIMESTAMP_MILLIS (check here), as by design iceberg standardizes on microsecond precision.

TIMESTAMP_MILLIS support exists only for reading externally produced files (like those from Spark).

Writing a test without Spark would add a lot of extra code, and it would have to use the Parquet API directly. IMO, it’s better to keep the current end-to-end test with Spark.

Could we use ExampleParquetWriter or something similar?

Or maybe use AvroParquetWriter to create the Parquet file? Just an idea, maybe it is easier to write the file that way, and it supports millisec precision.

@nandorKollar well, for

(a) ExampleParquetWriter: I fear we would have to manually construct Parquet group objects and manually define the schema with the TIMESTAMP_MILLIS logical type, which is much more complex than our current approach. Moreover, Iceberg has never used ExampleParquetWriter; using it would require writing low-level Parquet code.

(b) AvroParquetWriter: Iceberg’s AvroSchemaUtil always uses timestampMicros, never timestampMillis (check here). While Avro supports timestamp-millis, you cannot use it through AvroSchemaUtil. We would have to manually create the Avro schema for it and also handle the data conversion. Again, it would bring in more unnecessary code.

IMO, we should stick to our current approach.

If we plan to support reading Parquet files with TIMESTAMP_MILLIS logical type, we need to support it in all of the readers.

Arrow

Spark

Flink

For this we need to find a way to test it.

@shubham19may it would be better not to rely on the Spark module to cover this case with a test, but rather to keep the arrow module 'self-contained' and cover its functionality with tests there. TestVariantReaders already creates a custom Parquet writer with the Avro object model, and it doesn't look too complicated. I'll try to put together an example with ExampleParquetWriter; it might be a bit more complicated. As Iceberg's Parquet writer doesn't write timestamp millis, unfortunately we need to implement our own test writer for these types. Actually, this is not the only case that is not covered. For example, Parquet files with unsigned integer types are not covered either, and there we probably can't even use the trick of testing it in Spark, as unsigned types are unknown in Spark too.

Sure @pvary and @nandorKollar.

I have updated the test, moved it away from Spark to the arrow module, inside TestArrowReader.java. Please give it a review whenever you are free, and do tell me if any further changes are required.

Thanks @shubham19may!

pvary · 2025-11-17T09:04:56Z

+          if (setArrowValidityVector) {
+            BitVectorHelper.setBit(vector.getValidityBuffer(), bufferIdx);
+          }
+


nit: newline here is not needed.

pvary

LGTM.
Thanks @shubham19may!

One small formal ask only.

Please @nandorKollar and @huaxingao review.

Thanks,
Peter

nandorKollar · 2025-11-17T18:43:49Z

@shubham19may thanks for adding the test case, looks good!

nandorKollar · 2025-11-17T18:45:34Z

+
+    Table table = tables.create(schema, tableLocation);
+
+    File testFile = new File(tempDir, "timestamp-millis-test.parquet");


nit: would it be possible to use InMemoryOutputFile instead of temp files? We can maybe rewrite the Arrow tests in a subsequent followup PR.

nandorKollar · 2025-11-17T22:47:52Z

Oh, one more thing: can we please make the title of this change more precise, we no longer fix any 'shaded JAR initialization' problem, that's misleading.

shubham19may · 2025-11-18T05:49:21Z

done

pvary · 2025-11-19T08:39:12Z

Merged to main.
Thanks for the fix @shubham19may, and thanks @huaxingao and @nandorKollar for the reviews!

shubham19may · 2025-11-19T08:49:35Z

Thanks @pvary @huaxingao @nandorKollar for the reviews and help.

fix: arrow timestamp vectorization issue

26729fa

github-actions Bot added the arrow label Nov 4, 2025

pvary reviewed Nov 4, 2025

Comment thread arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

nandorKollar reviewed Nov 4, 2025

Comment thread arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java

fix: add test case

ed98c18

github-actions Bot added the spark label Nov 5, 2025

fix: minor spotlessApply check

61a910a

huaxingao reviewed Nov 6, 2025

shubham19may added 2 commits November 6, 2025 13:00

add: assertion of the values in the tests

fb537ea

add: debug log

1d72282