Spark: Write DVs for V3 MoR tables#11561
Conversation
7df6e83 to
0b63df8
Compare
|
Still some failing tests, and figuring out a good pattern to extend the existing Delete/Merge/Update tests to run against DVs. Something also worth thinking about for V3 is preventing partition granularity from being set since it's at odds with DVs. |
0b63df8 to
b116637
Compare
b116637 to
0fdfd0c
Compare
0fdfd0c to
b698904
Compare
5e8e727 to
0f00dfc
Compare
0f00dfc to
546f6b8
Compare
546f6b8 to
e6806d0
Compare
e6806d0 to
4292b0f
Compare
4292b0f to
fe7b053
Compare
|
|
||
| @Override | ||
| public void close() throws IOException { | ||
| fileWriter.close(); |
There was a problem hiding this comment.
I know we usually check if result is null to see if the writer was closed and we are not re-assigning the result object. It probably works in this case (knowing the underlying implementation), but I wonder whether making this check explicit will make the behavior of this method clear even without checking how the delegate writer works.
if (result == null) {
fileWriter.close();
this.result = fileWriter.result();
}
| FileScanTask fileScanTask = task.asFileScanTask(); | ||
| for (DeleteFile deleteFile : fileScanTask.deletes()) { | ||
| if (ContentFileUtil.isFileScoped(deleteFile)) { | ||
| if (ContentFileUtil.isFileScoped(deleteFile) || ContentFileUtil.isDV(deleteFile)) { |
There was a problem hiding this comment.
Aren't DVs considered file-scoped? I think isFileScoped will always be true for DVs (please check).
There was a problem hiding this comment.
Actually, if DVs are enabled, we have to include all position deletes, not just file-scoped. The first ever produced DV must include all previous deletes. The only difference is that we can drop file-scoped from the table state (decided in the writer) and still have to keep partition-scoped deletes as they may apply to other data files.
Can we please test migration to DVs for tables that have a mix of file-scoped and partition-scoped position deletes?
There was a problem hiding this comment.
Great catch, yes to both points. the file scoped Util correctly handles the DV case by checking the referenced data file, and we for DVs we do need to merge all existing position deletes, regardless of scope. I've fixed that and added a test case for update/delete/merge which will create a mix of file scoped and partition scoped deletes, and then finally produce a DV which has the positions merged from these existing deletes.
0be3306 to
a619d51
Compare
aokolnychyi
left a comment
There was a problem hiding this comment.
Minor comments. Feel free to merge whenever you are ready, @amogh-jahagirdar. Great work!
| for (DeleteFile deleteFile : fileScanTask.deletes()) { | ||
| if (ContentFileUtil.isFileScoped(deleteFile)) { | ||
| // Both file-scoped and partition-scoped position deletes must be rewritten for DVs | ||
| if (forDVs || ContentFileUtil.isFileScoped(deleteFile)) { |
There was a problem hiding this comment.
What about equality deletes? Those should be ignored. If so, what about shouldRewrite method?
for (DeleteFile deleteFile : fileScanTask.deletes()) {
if (shouldRewrite(deleteFile, forDVs)) {
...
}
}
private boolean shouldRewrite(DeleteFile deleteFile, boolean dvsEnabled) {
...
}
There was a problem hiding this comment.
It would be nice to also cover this with tests.
There was a problem hiding this comment.
Ah I fixed to avoid broadcasting eq. deletes unnecessarily but need to figure out how to add a test which would inspect the broadcast contents, will a publish a separate PR for that
a619d51 to
1c82bfa
Compare
This change adds support for writing DVs in SparkPositionDeltaWrite for V3 tables.