Arrow: Avoid extra dictionary buffer copy #5137
Merged
kbendick
reviewed
Jun 27, 2022
public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
  if (byteBuffer.hasArray()) {
    return UTF8String.fromBytes(
            byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());
Contributor
Nit: over-indented (should be 4 spaces from the start of return on the line above).
Contributor
Author
Thanks, I fixed these. Checkstyle didn't seem to mind...
Contributor
Thanks for letting me know. I’ll see if I can add a checkstyle rule for that or update one to catch it!
.toArray(genericArray(stringFactory.getGenericClass()));
this.dictionary = dictionary;
this.stringFactory = stringFactory;
this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);
Contributor
Question: why are you adding 1 here?
Contributor
Author
Dictionary IDs are 0-based, so getMaxId() is the largest valid index. To index the cache by ID, the array needs a size that is one bigger, i.e. getMaxId() + 1.
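The sizing argument above can be sketched in isolation. This is a minimal illustration, not the code from this PR: `LazyDictionaryCache`, its `decoder` parameter, and `capacity()` are hypothetical names, and the lazy fill-on-first-access behavior is an assumption suggested by the field name `cache`.

```java
import java.util.function.IntFunction;

// Hypothetical sketch: an array-backed cache indexed directly by dictionary ID.
final class LazyDictionaryCache {
  private final String[] cache;
  private final IntFunction<String> decoder;

  // maxId is the largest valid ID (what getMaxId() is described as returning above).
  LazyDictionaryCache(int maxId, IntFunction<String> decoder) {
    // IDs run from 0 to maxId inclusive, so maxId + 1 slots are required.
    this.cache = new String[maxId + 1];
    this.decoder = decoder;
  }

  // Decodes a value on first access and reuses the cached instance afterwards.
  String get(int id) {
    String value = cache[id];
    if (value == null) {
      value = decoder.apply(id);
      cache[id] = value;
    }
    return value;
  }

  int capacity() {
    return cache.length;
  }
}
```

With an array of only `maxId` slots, looking up the last ID would throw `ArrayIndexOutOfBoundsException`, which is the off-by-one the `+ 1` guards against.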
public String ofByteBuffer(ByteBuffer byteBuffer) {
  if (byteBuffer.hasArray()) {
    return new String(byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(),
            byteBuffer.remaining(), StandardCharsets.UTF_8);
Contributor
Nit: over-indented (should be 4 spaces from the start of return on the line above).
rdblue
approved these changes
Jun 27, 2022
Contributor
Looks great. Thanks for fixing this, @bryanck!
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request on Jul 10, 2022
This PR changes the dictionary value accessors in the vectorized Parquet reader to read values directly from the underlying dictionary instead of copying them into a new buffer, where relevant (the dictionary decimal accessor classes already did this). The underlying Parquet dictionary classes already load the values into a buffer, so copying them into a second buffer appears redundant in some cases.
This PR also makes a couple of changes to avoid copying binary buffers when building string values, where possible.
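As a rough sketch of the string-building change, the snippet below decodes a heap-backed `ByteBuffer` straight from its backing array, mirroring the `hasArray()` branch quoted in the review. The class name `ByteBufferStrings` and the fallback branch for buffers without an accessible array are assumptions added for illustration; the fallback in the actual PR is not shown in this conversation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; only the hasArray() branch mirrors the reviewed code.
final class ByteBufferStrings {
  private ByteBufferStrings() {
  }

  static String ofByteBuffer(ByteBuffer byteBuffer) {
    if (byteBuffer.hasArray()) {
      // Heap buffer: decode from the backing array without an intermediate byte[] copy.
      // arrayOffset() + position() is the index of the first readable byte.
      return new String(
          byteBuffer.array(),
          byteBuffer.arrayOffset() + byteBuffer.position(),
          byteBuffer.remaining(),
          StandardCharsets.UTF_8);
    }
    // Assumed fallback: direct or read-only buffers expose no array, so copy the bytes out.
    // duplicate() leaves the caller's position and limit untouched.
    byte[] bytes = new byte[byteBuffer.remaining()];
    byteBuffer.duplicate().get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

Including `arrayOffset()` matters for sliced buffers, whose readable region does not start at index 0 of the shared backing array.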
In very limited testing, this improved vectorized read performance by over 20% in some scenarios, though more testing is needed to get accurate metrics.