Skip to content

Flink: Pre-create fieldGetters to avoid constructing them for each row#10565

Merged
pvary merged 4 commits into
apache:mainfrom
fengjiajie:field_getters
Jul 10, 2024
Merged

Flink: Pre-create fieldGetters to avoid constructing them for each row#10565
pvary merged 4 commits into
apache:mainfrom
fengjiajie:field_getters

Conversation

@fengjiajie

@fengjiajie fengjiajie commented Jun 25, 2024

Copy link
Copy Markdown
Contributor

I found that RowDataUtil.clone consumes a significant amount of CPU time by analyzing the flame graph. It constructs a FieldGetter object for each field of every row being processed. This commit optimizes this by constructing all FieldGetter objects upfront, avoiding repeated object creation.

@github-actions github-actions Bot added the flink label Jun 25, 2024
@fengjiajie

Copy link
Copy Markdown
Contributor Author

In my case, the average task execution time decreased from 344 seconds to 328 seconds.

reader

@fengjiajie

Copy link
Copy Markdown
Contributor Author

Hi @nastra , the failing test cases seem unrelated to this change. Could you please re-trigger the tests? Thanks!

@Fokko

Fokko commented Jul 3, 2024

Copy link
Copy Markdown
Contributor

@fengjiajie Rerunning the failed tests 👍

*/
public static RowData clone(
RowData from, RowData reuse, RowType rowType, TypeSerializer[] fieldSerializers) {
RowData from,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a deprecated version of the original public method to keep compatibility.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reminder! I've added the original function below. The reason I didn't directly use the new function is that it requires FieldGetter for all fields, while the old function only needs to construct FieldGetter for fields with values. Constructing FieldGetter for all fields might negatively impact performance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that performance degradation is acceptable for the deprecated methods if it is justified otherwise. The number of the duplicated lines seems high enough for me to try to avoid them.

The users could always do the caching of the getters themselves, or find another solution soon, since we remove the deprecated methods regularly.

WDYT?

Also CC-ed, @stevenzwu, who might be also interested in this PR

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@pvary

pvary commented Jul 3, 2024

Copy link
Copy Markdown
Contributor

Thanks for the detailed explanation and the PR @fengjiajie!

Left a few small comments.

As a general rule we create a PR for a single version. Usually the latest. So if there are any review comments, they need to be fixed only once.
Once the first PR is merged, we create a backport PR for the other supported versions.

@fengjiajie

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed explanation and the PR @fengjiajie!

Left a few small comments.

As a general rule we create a PR for a single version. Usually the latest. So if there are any review comments, they need to be fixed only once. Once the first PR is merged, we create a backport PR for the other supported versions.

@pvary Thanks for taking the time to review this! Please let me know if you have any suggestions.

@fengjiajie fengjiajie requested a review from pvary July 4, 2024 04:12
Comment thread flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/data/RowDataUtil.java Outdated

@pvary pvary left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me. Let's wait a bit, to see if someone else is interested in reviewing. If there're no more comments, I will merge next Tuesday. (Ping me if I forgot 😄)

@fengjiajie

Copy link
Copy Markdown
Contributor Author

+1 from me. Let's wait a bit, to see if someone else is interested in reviewing. If there're no more comments, I will merge next Tuesday. (Ping me if I forgot 😄)

Thanks

@pvary pvary merged commit 05923d6 into apache:main Jul 10, 2024
@pvary

pvary commented Jul 10, 2024

Copy link
Copy Markdown
Contributor

Thanks @fengjiajie for the improvement.
Could you please create a backport PR to the other Flink versions?

fengjiajie added a commit to fengjiajie/iceberg that referenced this pull request Jul 10, 2024
pvary pushed a commit that referenced this pull request Jul 10, 2024
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
apache#10565)

* Flink: Pre-create fieldGetters to avoid constructing them for each row

* modified as suggested

* modified as suggested

* modified as suggested
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
apache#10565)

* Flink: Pre-create fieldGetters to avoid constructing them for each row

* modified as suggested

* modified as suggested

* modified as suggested
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants