Flink: RewriteDataFiles support filter in plan #13669

Merged
pvary merged 3 commits into apache:main from Guosmilesmile:rewritedatafiles_filter
Jul 29, 2025

@Guosmilesmile

Contributor

Background

When an existing table is migrated to the Flink implementation of RewriteDataFiles and its historical data has not been compacted thoroughly, or when the targetFileSize is modified, the Flink maintenance job must first rewrite the entire table before it can follow the user’s intended schedule. This initial compaction can run for a very long time, and while it runs, newly written data cannot trigger compaction, which degrades query performance.

Purpose

This PR introduces an optional filter parameter to Flink’s RewriteDataFiles. Users can define predicates (e.g. data whose time is greater than a given timestamp) so that only the necessary subset of data is compacted, allowing historical data to be skipped.
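
To illustrate the idea, here is a minimal, self-contained sketch of how a filter narrows the set of files considered during planning. It is plain Java and does not use the Iceberg API: the class `FilteredRewritePlanSketch`, the `FileStats` type, and the `plan` method are hypothetical names made up for this example, and the real planner evaluates an Iceberg `Expression` rather than a `Predicate`.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model of "filter in plan"; not the Iceberg implementation.
public class FilteredRewritePlanSketch {

  // Assumed per-file statistics: path, size, and the largest event time in the file.
  public static final class FileStats {
    public final String path;
    public final long sizeBytes;
    public final long maxEventTime;

    public FileStats(String path, long sizeBytes, long maxEventTime) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.maxEventTime = maxEventTime;
    }
  }

  // Returns the paths of files that pass the user filter AND are small enough
  // to be worth compacting. Files rejected by the filter are never considered.
  public static List<String> plan(
      List<FileStats> files, Predicate<FileStats> filter, long minFileSizeBytes) {
    return files.stream()
        .filter(filter)
        .filter(f -> f.sizeBytes < minFileSizeBytes)
        .map(f -> f.path)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<FileStats> files = List.of(
        new FileStats("old-a", 100_000L, 10L),
        new FileStats("old-b", 200_000L, 20L),
        new FileStats("new-a", 100_000L, 110L),
        new FileStats("new-b", 300_000L, 120L),
        new FileStats("new-big", 900_000L, 130L));

    // Only files holding data newer than the cutoff are candidates.
    long cutoff = 100L;
    List<String> planned = plan(files, f -> f.maxEventTime > cutoff, 500_000L);
    System.out.println(planned); // prints [new-a, new-b]
  }
}
```

In this sketch the small historical files (`old-a`, `old-b`) are skipped even though they would otherwise qualify for compaction, which is the behavior the filter is meant to enable.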

@github-actions github-actions Bot added the flink label Jul 25, 2025
@pvary

pvary commented Jul 27, 2025

Contributor

@mxm: Your thoughts?

@mxm mxm left a comment

Contributor

This is handy. Thanks!

.maxFileSizeBytes(2_000_000L)
.minFileSizeBytes(500_000L)
.minInputFiles(2)
.filter(Expressions.in("id", 1, 2))

Contributor

It's always nice to add a comment next to the test-relevant parameters (here: the filter).

Contributor Author

OK, I have added it.

}

@Test
void testRewriteUnPartitionedWithFilter() throws Exception {

Contributor

Is "Unpartitioned" relevant for this test? If not, we could rename the test to testRewriteWithFilter.

Contributor Author

It is not relevant. I have renamed it to testRewriteWithFilter.

@pvary pvary merged commit d6f22d5 into apache:main Jul 29, 2025
18 checks passed
@pvary

pvary commented Jul 29, 2025

Contributor

Merged to main.
Thanks @Guosmilesmile for the feature and @mxm for the review!

3 participants