Core, Parquet, ORC: Fix missing data when writing unknown#12581
Conversation
| new Schema( | ||
| required(1, "id", LongType.get()), | ||
| optional(2, "test_type", type), | ||
| required(3, "trailing_data", Types.StringType.get()))); |
There was a problem hiding this comment.
Without correct alignment, this test will fail for unknown when the null value for field 2 is passed to the writer for field 3.
| } | ||
|
|
||
| public static <T extends StructLike> ParquetValueWriter<T> create( | ||
| public static <T extends StructLike> ParquetValueWriter<T> createWriter( |
There was a problem hiding this comment.
This was needed because InternalWriter::create could refer to either create(MessageType) or create(Schema, MessageType). Adding the ability to pass BiFunction<Schema, MessageType, ParquetValueWriter<?>> to Parquet.createWriterFunc caused compile failures because the calls were ambiguous.
To solve the problem, I renamed this method so it is unambiguous. This has not been in a release so it is safe. The create(MessageType) method has been in a release so it is deprecated for removal.
There was a problem hiding this comment.
I don't love this because we now have a confusing combination of create, createWriter and buildWriter floating around, but I also couldn't find a good alternative.
There was a problem hiding this comment.
I think we should just be more aggressive about fixing these. We should deprecate the old ones and support better names.
| public void write(int repetitionLevel, S value) { | ||
| for (int i = 0; i < writers.length; i += 1) { | ||
| Object fieldValue = get(value, i); | ||
| Object fieldValue = get(value, fieldIndexes[i]); |
There was a problem hiding this comment.
The fix is to map from writer index to field index. The index is created below by skipping unknown fields in the data struct.
|
You might want to run |
| RecordWriter(Types.StructType struct, List<OrcValueWriter<?>> writers) { | ||
| super(struct, writers); |
There was a problem hiding this comment.
Do we need a similar change in Flink/Spark ORC writers?
There was a problem hiding this comment.
We will. Right now, Spark writers and Flink ORC writers don't support the new types so they don't need to be updated in this PR.
|
Thanks for the reviews, @pvary and @danielcweeks! |
This fixes ORC and Parquet writers with unknown columns.
Previously, writers assumed that the Iceberg schema of the incoming data and the outgoing Parquet schema matched. However, unknown is not represented in file schemas, which led to an index mismatch. The value for an unknown column was being used by index and passed to the next writer. In some cases, this is not caught because the data is valid when the fields did not align (for example if there is only one more field and it is optional).
The fix is to pass the Iceberg schema into the write builder and account for columns that are present in data records but are not passed to a writer.
This updates all paths that called Parquet's
GenericParquetWriterorInternalWriterto pass the data schema. This was not necessary in ORC because the write builder already passes the data schema.Avro does not need to be fixed because unknown is converted to a NULL schema and a null writer is used.