Core: Avoid generating a large ManifestFile when committing#6335
Conversation
|
Hi @szehon-ho @rdblue @pvary @stevenzwu @chenjunjiedada pls help to review this when you are free. Thanks a lot. |
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @ConeyLiu! just had some questions/comments on this, but this behavior is something I observed as well.
|
It is mentioned in the docs that |
|
@nastra |
|
This seems like a reasonable thing to add to me. I'm actually more concerned about the use case that caused this. @ConeyLiu, what was the case where you were creating huge manifests? How many data files were committed? |
3307397 to
6b21694
Compare
|
@rdblue thanks for the reviewing. It happens for the larger table (has several PBs data) and with many columns(thousand columns which is very common for log data or feature data). And the size of the DataFile metrics' is increased with the number of columns. So the manifest file could be really large when the user adds into/overwrites a larger number of data (eg, sync hourly data). |
6b21694 to
85c05cc
Compare
RussellSpitzer
left a comment
There was a problem hiding this comment.
I see two sets of changes here, one around "caching" and one around rolling over manifest files. Can we separate out the changes?
| private final List<ManifestFile> appendManifests = Lists.newArrayList(); | ||
| private final List<ManifestFile> rewrittenAppendManifests = Lists.newArrayList(); | ||
| private ManifestFile newManifest = null; | ||
| private List<ManifestFile> cachedNewManifests = null; |
There was a problem hiding this comment.
@RussellSpitzer Do you mean the caching part here? Here just rename the variable because it is plural now.
85c05cc to
7ac0847
Compare
|
@rdblue @RussellSpitzer @nastra @amogh-jahagirdar Would you mind taking another look at this when you are free? Thanks a lot. |
|
@rdblue @RussellSpitzer @szehon-ho can we have another look at this review and merge if possible ? thanks. large manifests have huge performance penalties for PetaByte tables. |
d1e1f1d to
617647e
Compare
|
Yea sorry let me try to review this tomorrow. |
9074d09 to
e27ba9b
Compare
| import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
|
|
||
| /** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */ | ||
| abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> { |
There was a problem hiding this comment.
@szehon-ho pls take a look if this is the right direction. Here implements FileAppender instead of FileWriter that's because the ManifestWriter implements FileAppender.
There was a problem hiding this comment.
Thanks, this definitely looks cleaner.
Im just not sure what to do with existing hierarchy (ManifestWriter), as there are many duplications.
Did we think about modifying ManifestWriter, to take extra variable 'targetFileSizeBytes', then trigger tryRollover only in this case.
The problem may be the method manifestFile() in writer? It should have been manifestFiles(). Maybe we can deprecate existing method?
CC @aokolnychyi .
There was a problem hiding this comment.
Actually on second read, this may be ok, as its a composition, and not duplicating that much code. Lets see if Anton has any comments
| // // Operation succeeds and calls cleanUncommitted' | ||
| // txn.newFastAppend().appendFile(...).commit(); | ||
| // // Some other operations ... | ||
| // // Commit fails and needs to clean up newManifests |
There was a problem hiding this comment.
Nit: should this comment be after next one? (seems comments refer to the line above)
There was a problem hiding this comment.
I add the blank line between the comments to separate them.
| import org.apache.iceberg.relocated.com.google.common.collect.Lists; | ||
|
|
||
| /** As opposed to {@link ManifestWriter}, a rolling writer could produce multiple manifest file. */ | ||
| abstract class RollingManifestWriter<T extends ContentFile<T>> implements FileAppender<T> { |
There was a problem hiding this comment.
Thanks, this definitely looks cleaner.
Im just not sure what to do with existing hierarchy (ManifestWriter), as there are many duplications.
Did we think about modifying ManifestWriter, to take extra variable 'targetFileSizeBytes', then trigger tryRollover only in this case.
The problem may be the method manifestFile() in writer? It should have been manifestFiles(). Maybe we can deprecate existing method?
CC @aokolnychyi .
|
I will take a look with fresh eyes by the end of tomorrow. |
| private void tryRollingToNewFile() { | ||
| currentFileRows++; | ||
|
|
||
| if (currentWriter.length() > targetFileSizeInBytes) { |
There was a problem hiding this comment.
Will it be expensive to call for every file? In rolling file writers, we check only every 1000 records. I understand that value may be too big but I still don't think it is a good idea to check the length for every entry.
There was a problem hiding this comment.
How about checking every 16? Just simply calculated as: 1000 / (512 / 8).
There was a problem hiding this comment.
Could you check the average cost of an entry in a manifest in some of your real tables? My assumption was something around 250 but let's check the average size of an entry.
There was a problem hiding this comment.
I have checked some tables. The average size has a lot to do with the schema of the table, the more columns take up more storage space. That's because of the column metrics. I noticed the biggest one (with thousand columns) needs about 5KB for one entry. The small tables (about 10 columns) need about several hundred bytes. So 250 should be an acceptable number.
|
I feel like this would be super valuable for CTAS kind of use cases where a large chunk of data is being added at the same time. I had a few suggestions on how to simplify the rolling writer. Nice work, @ConeyLiu! |
966a7f8 to
d83ad26
Compare
| private boolean closed = false; | ||
|
|
||
| public RollingManifestWriter( | ||
| FileIO fileIO, |
There was a problem hiding this comment.
Used to delete the empty file.
There was a problem hiding this comment.
Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file. It can only happen if there is a pending entry to write. If I am not missing anything, we can drop FileIO here and simplify closeCurrentWriter.
There was a problem hiding this comment.
Sorry my bad, the empty file checking is not necessary.
aokolnychyi
left a comment
There was a problem hiding this comment.
This change looks good to me. I had one question. @ConeyLiu, could you check if I missed something?
| deleteFile(cachedNewDataManifest.path()); | ||
| this.cachedNewDataManifest = null; | ||
| if (cachedNewDataManifests != null) { | ||
| List<ManifestFile> committedNewManifests = Lists.newArrayList(); |
There was a problem hiding this comment.
Optional: committedNewManifests -> committedNewDataManifests?
| private boolean closed = false; | ||
|
|
||
| public RollingManifestWriter( | ||
| FileIO fileIO, |
There was a problem hiding this comment.
Do you think it is still needed? We switched to a lazy writer that is initialized only when there is a new entry to write. The same also applies to cases when we roll to a new file. It can only happen if there is a pending entry to write. If I am not missing anything, we can drop FileIO here and simplify closeCurrentWriter.
|
@aokolnychyi thanks for the details reviewing. The empty file checking is not necessary. Reverted related code changes. |
|
Thanks, @ConeyLiu! Thanks everyone for reviewing! |
|
Thanks @aokolnychyi for mering this. Thanks @rdblue @RussellSpitzer @szehon-ho @amogh-jahagirdar for reviewing. |
In our production env, we noticed the manifest files have a large random size, ranging from several KB to larger than 100 MB. It seems the
MANIFEST_TARGET_SIZE_BYTEShas not worked during the commit phase.In this PR, we avoid generating a manifest file larger than
MANIFEST_TARGET_SIZE_BYTESfor newly added content files and will generate multiple manifest files when the size is covered.