Core: Improve GenericReader performance#4218
Conversation
|
CC: @rbalamohan |
|
@aokolnychyi: Could you please review when you have some time? I see that you created the other jmh tests, so you might be the best person to take a look. |
|
|
||
| List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o); | ||
| Record result = GenericRecord.create(type); | ||
| Record result = template.copy(); |
There was a problem hiding this comment.
Can this result in NPE?
There was a problem hiding this comment.
Then there were an NPE with the original code as well. The template is only null, when the type is null.
private GenericRecord(StructType struct) {
this.struct = struct;
this.size = struct.fields().size(); <-- NPE here when the struct is null
this.values = new Object[size];
this.nameToPos = NAME_MAP_CACHE.get(struct);
}
There was a problem hiding this comment.
Indeed, the type should never be null here because the SchemaWithPartnerVistor guarantee it. See:
I think we can just add a Precondition.checkNotNull in the line128 instead of checking the null case in a if-else block.
There was a problem hiding this comment.
So here we use GenericRecord(GenericRecord toCopy) constructor rather than GenericRecord(StructType struct) to eliminate the access to NAME_MAP_CACHE ? I'm afraid the further PR will easily fallback to use the other because this tiny different is hard for the next developer and reviewer to notice. Is possible to add few doc ?
There was a problem hiding this comment.
There were an issue with positional deletes where the row was not present in the file. The code below still created reader for the empty struct.
Types.StructType struct = iType != null ? iType.asStructType() : null;
return visitor.struct(struct, group, visitFields(struct, group, visitor));
Changed it to this, let's see if it breaks something:
return iType != null ?
visitor.struct(iType.asStructType(), group, visitFields(iType.asStructType(), group, visitor)) : null;
There was a problem hiding this comment.
Had to revert the changes.
In the Deserializer we do not have null struct so I added the Preconditions.checkNotNull() check, but I had to revert to the null checks at the readers because the Spark reader expects the visitors/readers to handle null structs when the readers are created.
I had failures with TestSparkReadProjection.testListOfStructsProjection() and TestGenericAppenderFactory.testPosDeleteWriterWithRowSchema()
|
Seems like @aokolnychyi is occupied otherwise. @RussellSpitzer, or @shardulm94, any chance of taking a look? |
|
|
||
| List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o); | ||
| Record result = GenericRecord.create(type); | ||
| Record result = template.copy(); |
There was a problem hiding this comment.
Indeed, the type should never be null here because the SchemaWithPartnerVistor guarantee it. See:
I think we can just add a Precondition.checkNotNull in the line128 instead of checking the null case in a if-else block.
|
|
||
| List<Object> data = ((StructObjectInspector) pair.sourceInspector()).getStructFieldsDataAsList(o); | ||
| Record result = GenericRecord.create(type); | ||
| Record result = template.copy(); |
There was a problem hiding this comment.
So here we use GenericRecord(GenericRecord toCopy) constructor rather than GenericRecord(StructType struct) to eliminate the access to NAME_MAP_CACHE ? I'm afraid the further PR will easily fallback to use the other because this tiny different is hard for the next developer and reviewer to notice. Is possible to add few doc ?
|
@openinx: Fixed the null check where I was able to, added the comments, and the tests are green. If you have some time, I think the PR is ready for another review. Thanks again for taking the time to check this out! |
|
iceberg/core/src/main/java/org/apache/iceberg/data/GenericRecord.java Lines 36 to 46 in c45b8b4 Note that this uses identity ( Otherwise if should perform similarly to a ConcurrentHashMap from the caller's perspective. |
| StructType struct) { | ||
| super(types, readers); | ||
| this.structType = struct; | ||
| this.template = struct != null ? GenericRecord.create(struct) : null; |
There was a problem hiding this comment.
Should these also have preconditions that validate the struct is not null?
There was a problem hiding this comment.
My original change did not contain the null check, and there were failing tests (org.apache.iceberg.TestGenericAppenderFactory > testPosDeleteWriterWithRowSchema[FileFormat=parquet, Partitioned=true] FAILED) with the exception below.
In nutshell, when we are reading the positional delete files we either have row in the data or we don't. I have tried to fix this by returning a null reader instead of creating a reader for a null struct. But this caused issues with Spark projection where we create the reader for the full schema, and having a null or missing reader caused issues when we tried to match the readers to the projected schema.
Fixing this is not trivial so I decided to move forward with the null checks to keep the backward compatibility:
- It should possible to create a reader with a
nullschema - Fail only when we try to actually read from this reader
java.lang.NullPointerException
at org.apache.iceberg.data.GenericRecord.<init>(GenericRecord.java:63)
at org.apache.iceberg.data.GenericRecord.create(GenericRecord.java:53)
at org.apache.iceberg.data.parquet.GenericParquetReaders$RecordReader.<init>(GenericParquetReaders.java:62)
at org.apache.iceberg.data.parquet.GenericParquetReaders.createStructReader(GenericParquetReaders.java:52)
at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:176)
at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:114)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:141)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitField(TypeWithSchemaVisitor.java:169)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitFields(TypeWithSchemaVisitor.java:183)
at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:47)
at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:67)
at org.apache.iceberg.data.parquet.BaseParquetReaders.createReader(BaseParquetReaders.java:58)
at org.apache.iceberg.data.parquet.GenericParquetReaders.buildReader(GenericParquetReaders.java:41)
at org.apache.iceberg.data.DeleteFilter.lambda$openDeletes$6(DeleteFilter.java:255)
at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:118)
at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:66)
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:77)
at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:38)
at org.apache.iceberg.util.Filter.lambda$filter$0(Filter.java:35)
at org.apache.iceberg.io.CloseableIterable$2.iterator(CloseableIterable.java:73)
at org.apache.iceberg.io.CloseableIterable$4$1.<init>(CloseableIterable.java:99)
at org.apache.iceberg.io.CloseableIterable$4.iterator(CloseableIterable.java:98)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:152)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:129)
at java.lang.Iterable.forEach(Iterable.java:74)
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:97)
at org.apache.iceberg.deletes.Deletes.toPositionIndex(Deletes.java:91)
at org.apache.iceberg.data.DeleteFilter.applyPosDeletes(DeleteFilter.java:231)
at org.apache.iceberg.data.DeleteFilter.filter(DeleteFilter.java:126)
at org.apache.iceberg.data.GenericReader.open(GenericReader.java:77)
at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:826)
at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:52)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:151)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.<init>(CloseableIterable.java:143)
at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable.iterator(CloseableIterable.java:138)
at org.apache.iceberg.data.GenericReader.open(GenericReader.java:65)
at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:41)
at org.apache.iceberg.data.TableScanIterable.iterator(TableScanIterable.java:29)
at java.lang.Iterable.forEach(Iterable.java:74)
We are reusing the
|
|
Thanks everyone for the review and the merge! |
(cherry picked from commit e75d732)
@rbalamohan identified that:
This causes running
GenericRecord.create()for every record expensive.Sometimes we can not reuse the containers in readers, but we still need better performance. If we create a template record in readers and copy this template record then we can avoid the cache retrieval of the
nameToPosmap.Created some basic jmh performance tests, and here are the results:
We can see that the change gained ~10-20% performance for the generic readers.