Skip to content

Arrow: Avoid extra dictionary buffer copy#5137

Merged
rdblue merged 4 commits into
apache:masterfrom
bryanck:dict-value-decode
Jun 27, 2022
Merged

Arrow: Avoid extra dictionary buffer copy#5137
rdblue merged 4 commits into
apache:masterfrom
bryanck:dict-value-decode

Conversation

@bryanck

@bryanck bryanck commented Jun 27, 2022

Copy link
Copy Markdown
Contributor

This PR changes the dictionary value accessors in the vectorized parquet reader so that the dictionary values are read from the underlying dictionary directly, rather than copying the values into a new buffer where relevant (this was already being done in the dictionary decimal accessor classes). The underlying parquet dictionary classes already load the values into a buffer, so copying them to a new buffer appears redundant in some cases.

This PR also makes a couple of changes to avoid binary buffer copies when building string values when possible.

In very limited testing, this shows a performance gain of over 20% in vectorized read performance in some scenarios, though more testing would be required to get more accurate metrics.

@github-actions github-actions Bot added the arrow label Jun 27, 2022
@bryanck bryanck force-pushed the dict-value-decode branch from 156d23d to 5d497cc Compare June 27, 2022 11:53
@github-actions github-actions Bot added the spark label Jun 27, 2022
@bryanck bryanck force-pushed the dict-value-decode branch from 2c0c7dd to 4d7e9ca Compare June 27, 2022 13:09
@bryanck bryanck force-pushed the dict-value-decode branch from 4d7e9ca to 3686ec3 Compare June 27, 2022 13:13

@kbendick kbendick left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanck for working on this!

Mostly some style nits, plus a question for my own clarification. Thank you!

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: over-indented (should be 4 spaces from the start of return on the line above).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I fixed these. Checkstyle didn't seem to mind...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting me know. I’ll see if I can add a checkstyle rule for that or update one to catch it!

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

.toArray(genericArray(stringFactory.getGenericClass()));
this.dictionary = dictionary;
this.stringFactory = stringFactory;
this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why are you adding 1 here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support a 0-based index of getMaxId(), you need a size that is one bigger

public String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return new String(byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(),
byteBuffer.remaining(), StandardCharsets.UTF_8);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: over-indented (should be 4 spaces from the start of return on the line above).

@rdblue

rdblue commented Jun 27, 2022

Copy link
Copy Markdown
Contributor

Looks great. Thanks for fixing this, @bryanck!

@rdblue rdblue merged commit eff6556 into apache:master Jun 27, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants